chinese-hubert-base

TencentGameMate

Introduction

The chinese-hubert-base model by TencentGameMate is a pretrained model for audio feature extraction from speech. It was pretrained on the 10,000-hour WenetSpeech L subset. The checkpoint does not include a tokenizer, so using it for speech recognition requires creating a tokenizer and fine-tuning the model on labeled data.

Architecture

The model follows the HuBERT architecture and is implemented in PyTorch. It is compatible with the Transformers library and supports feature extraction and inference endpoints.

Training

The model was pretrained on audio only, using the WenetSpeech L subset, so no tokenizer was involved. For speech recognition applications, you need to create a tokenizer and fine-tune the model on labeled text data.
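
As a rough illustration of the tokenizer step, one common choice when fine-tuning for Chinese speech recognition with CTC is a character-level vocabulary. The sketch below builds such a vocabulary from transcripts. The example transcripts, the special-token names, and the `build_char_vocab` helper are all hypothetical and not part of the released model.

```python
import json


def build_char_vocab(transcripts, pad_token="[PAD]", unk_token="[UNK]", word_delimiter="|"):
    """Map every non-whitespace character seen in the transcripts to an integer id."""
    # Collect the unique characters and sort them so ids are reproducible.
    chars = sorted({ch for text in transcripts for ch in text if not ch.isspace()})
    # Special tokens come first; their names here are illustrative choices.
    tokens = [pad_token, unk_token, word_delimiter] + chars
    return {tok: idx for idx, tok in enumerate(tokens)}


# Made-up transcripts standing in for a labeled fine-tuning set.
transcripts = ["今天天气很好", "我喜欢听音乐"]
vocab = build_char_vocab(transcripts)

# Serialize as JSON, keeping the Chinese characters readable.
vocab_json = json.dumps(vocab, ensure_ascii=False, indent=2)
```

A JSON file in this shape can be saved to disk and passed as the vocabulary file to a CTC tokenizer such as Wav2Vec2CTCTokenizer in Transformers, which is one possible way to pair a tokenizer with this model for fine-tuning.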

Guide: Running Locally

To run the chinese-hubert-base model locally, follow these steps:

  1. Install Requirements:

    • Python package: transformers==4.16.2
    • Other Python libraries: torch, soundfile
  2. Set Up Environment:

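    • Define model_path and wav_path with appropriate file paths.
    • Import the required libraries, then load the feature extractor and model:
      import torch
      import soundfile as sf
      from transformers import Wav2Vec2FeatureExtractor, HubertModel

      feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
      model = HubertModel.from_pretrained(model_path)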
      
  3. Prepare Model and Audio:

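    • Choose a device, move the model to it, convert it to half precision, and set it to evaluation mode. Half precision is meant for GPU inference; on CPU, omit half() and keep the inputs in float32:
      device = torch.device("cuda")  # half precision below assumes a GPU
      model = model.to(device)
      model = model.half()
      model.eval()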
      
  4. Process Audio:

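    • Read the audio file and convert it to model inputs. The audio is assumed to be 16 kHz mono, as in the standard HuBERT setup; resample first if yours differs:
      wav, sr = sf.read(wav_path)
      # Passing sampling_rate lets the feature extractor reject a mismatched rate
      input_values = feature_extractor(wav, sampling_rate=sr, return_tensors="pt").input_values
      # Match the model's device and precision (float16 after model.half())
      input_values = input_values.to(device=device, dtype=model.dtype)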
      
  5. Inference:

    • Perform inference without gradient computation:
      with torch.no_grad():
          outputs = model(input_values)
          last_hidden_state = outputs.last_hidden_state
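
To make the shape of last_hidden_state more concrete: HuBERT's convolutional front end downsamples the waveform before the Transformer layers, so the output holds one vector per frame rather than one per sample. The sketch below estimates the frame count. It assumes this checkpoint uses the default kernel sizes and strides from the Transformers HubertConfig and 16 kHz input, so check the checkpoint's config.json before relying on it. The `num_output_frames` helper is illustrative only.

```python
# Default HuBERT feature-encoder settings (assumed to apply to this checkpoint).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Length of the time axis after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Standard output-length formula for a convolution without padding.
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio.
frames_per_second = num_output_frames(16000)
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, roughly one every 20 ms, and last_hidden_state has shape (batch, frames, hidden_size).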
      

Cloud GPUs: Consider using cloud GPU services like AWS, Google Cloud, or Azure for better performance when processing large audio datasets.

License

The chinese-hubert-base model is licensed under the MIT License.
