chinese hubert base
TencentGameMate
Introduction
The chinese-hubert-base model by TencentGameMate is a pretrained model for extracting features from speech audio. It was pretrained on 10,000 hours of Mandarin speech from the WenetSpeech L subset. Because it was pretrained on audio only, it has no tokenizer: using it for speech recognition requires creating a tokenizer and fine-tuning the model on labeled text data.
Architecture
The model is based on the HuBERT architecture and is implemented in PyTorch. It is compatible with the Transformers library and supports feature extraction and inference endpoints.
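To relate audio length to the number of feature vectors the model returns, the following sketch computes the output length of HuBERT's convolutional feature encoder. It assumes the model uses the default `conv_kernel` and `conv_stride` values of `HubertConfig` in Transformers (the model's `config.json` is the place to confirm this); the function name `num_output_frames` is hypothetical.

```python
# Sketch: number of feature frames produced by HuBERT's convolutional encoder.
# ASSUMPTION: chinese-hubert-base uses the default HubertConfig values below.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Sequence length after the stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Output length of a 1-D convolution without padding.
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    # One second of 16 kHz audio.
    print(num_output_frames(16_000))  # 49
```

Under these assumptions the encoder's total stride is 5 * 2^6 = 320 samples, so at 16 kHz each frame covers 20 ms (roughly 50 frames per second), and `last_hidden_state` holds one vector per frame.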
Training
The model was pretrained on a large audio dataset, the WenetSpeech L subset, without a tokenizer. Applying it to speech recognition therefore requires creating a tokenizer and fine-tuning the model on audio labeled with text transcripts.
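The card does not say how the tokenizer should be built. One common approach for HuBERT-style models, assumed here rather than taken from the authors, is CTC fine-tuning (for example with `HubertForCTC`) using a character-level vocabulary. The sketch below builds such a vocabulary from transcripts. The special tokens `<unk>`, `<pad>` and the word delimiter `|` follow the defaults of `Wav2Vec2CTCTokenizer` in Transformers; the helper name `build_char_vocab` is hypothetical.

```python
from typing import Dict, Iterable


def build_char_vocab(transcripts: Iterable[str]) -> Dict[str, int]:
    """Map every character seen in the transcripts to an integer id.

    Hypothetical helper: spaces become "|" (Wav2Vec2CTCTokenizer's default
    word delimiter), and "<unk>"/"<pad>" are appended as special tokens.
    """
    chars = set()
    for text in transcripts:
        chars.update(text.replace(" ", "|"))
    # Sort so the id assignment is deterministic.
    vocab = {ch: idx for idx, ch in enumerate(sorted(chars))}
    vocab["<unk>"] = len(vocab)
    # In the usual CTC fine-tuning setup the pad token doubles as the CTC blank.
    vocab["<pad>"] = len(vocab)
    return vocab
```

The resulting dict can be saved with `json.dump(vocab, f, ensure_ascii=False)` as `vocab.json` and loaded with `Wav2Vec2CTCTokenizer("vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|")`. A character vocabulary suits Mandarin because transcripts are not space-delimited into words.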
Guide: Running Locally
To run the chinese-hubert-base model locally, follow these steps:
- Install Requirements:
  - Python package: transformers==4.16.2
  - Other Python libraries: torch, soundfile
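- Set Up Environment:
  - Define model_path and wav_path with appropriate paths. model_path can be a local directory containing the model files or the Hugging Face Hub ID TencentGameMate/chinese-hubert-base.
  - Load the feature extractor and model using:

        import torch
        import soundfile as sf
        from transformers import Wav2Vec2FeatureExtractor, HubertModel

        feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
        model = HubertModel.from_pretrained(model_path)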
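- Prepare Model and Audio:
  - Move the model to the desired device (e.g., GPU), use half precision on GPU, and set it to evaluation mode:

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = model.to(device)
        if device == "cuda":
            model = model.half()  # fp16 saves GPU memory; fp16 support on CPU is limited
        model.eval()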
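- Process Audio:
  - Read the audio file and convert it to model inputs:

        wav, sr = sf.read(wav_path)
        # passing sampling_rate lets the extractor check it against the rate it expects
        input_values = feature_extractor(wav, sampling_rate=sr, return_tensors="pt").input_values
        # match the model's device and precision (float16 if model.half() was applied)
        input_values = input_values.to(device, dtype=model.dtype)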
- Inference:
  - Perform inference without gradient computation:

        with torch.no_grad():
            outputs = model(input_values)
            last_hidden_state = outputs.last_hidden_state  # shape: (batch, frames, hidden_size)
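The audio step above passes the file straight to the feature extractor, which works when the file is already single-channel at the rate the model expects. Assuming that rate is 16 kHz (the rate of WenetSpeech and of HuBERT-style models generally), the sketch below down-mixes multi-channel audio and resamples it. It uses plain linear interpolation with no anti-aliasing filter, so it is only a rough stand-in; a dedicated resampler such as `torchaudio.functional.resample` is preferable for real data. The name `to_mono_16k` is hypothetical.

```python
import numpy as np

TARGET_SR = 16_000  # ASSUMPTION: the sampling rate the feature extractor expects


def to_mono_16k(wav: np.ndarray, sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Down-mix to mono and resample by linear interpolation (no anti-aliasing)."""
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # soundfile returns (frames, channels) for multi-channel files.
        wav = wav.mean(axis=1)
    if sr == target_sr or wav.size == 0:
        return wav
    num_out = int(round(len(wav) / sr * target_sr))
    # Time stamps (in seconds) of the input and output samples.
    t_in = np.arange(len(wav)) / sr
    t_out = np.arange(num_out) / target_sr
    return np.interp(t_out, t_in, wav).astype(np.float32)
```

In the guide's flow it would sit between reading and feature extraction: `wav = to_mono_16k(wav, sr)`, then call the feature extractor with `sampling_rate=16000`.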
Cloud GPUs: Consider using cloud GPU services like AWS, Google Cloud, or Azure for better performance when processing large audio datasets.
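Whether on a local or a cloud GPU, memory use during inference grows with input length, and self-attention cost grows quadratically with the number of frames, so long recordings are often processed in fixed-length windows. The sketch below splits a 1-D waveform into non-overlapping chunks; the 30-second default and the name `split_into_chunks` are arbitrary, hypothetical choices rather than values from the model card.

```python
from typing import List

import numpy as np


def split_into_chunks(
    wav: np.ndarray, sr: int = 16_000, chunk_seconds: float = 30.0
) -> List[np.ndarray]:
    """Split a 1-D waveform into consecutive, non-overlapping chunks.

    The last chunk may be shorter than chunk_seconds.
    """
    chunk_len = int(chunk_seconds * sr)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sr must be positive")
    return [wav[start:start + chunk_len] for start in range(0, len(wav), chunk_len)]
```

Each chunk can then go through the feature extractor and model as in the guide, and the per-chunk `last_hidden_state` tensors can be concatenated along the time axis with `torch.cat(..., dim=1)`. Non-overlapping chunks lose context at their boundaries; overlapping windows are a common refinement.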
License
The chinese-hubert-base model is licensed under the MIT License.