chinese-wav2vec2-base

TencentGameMate

Introduction

The chinese-wav2vec2-base model, developed by TencentGameMate, is a pretrained model for processing Chinese speech. It is based on the Wav2Vec2 architecture and uses the Transformers library for implementation. The model was pretrained on 10k hours of the WenetSpeech L subset.
Architecture
The model is based on the Wav2Vec2 architecture, which is designed for speech processing tasks. It does not include a tokenizer, since it was pretrained exclusively on audio data. To use it for speech recognition, you need to create a tokenizer and fine-tune the model on labeled transcript data, as sketched below.
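As a rough illustration of what that fine-tuning setup might look like, the sketch below attaches a CTC head and a character-level tokenizer to the pretrained encoder. The vocabulary file and configuration values are placeholders, not part of the official release; only the checkpoint name comes from the model card.

```python
# Sketch: preparing chinese-wav2vec2-base for speech recognition fine-tuning.
# "vocab.json" is a hypothetical character-level vocabulary built from your
# own labeled transcripts; the kwargs below are illustrative choices.
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",                # hypothetical vocabulary file
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)

# Load the pretrained encoder and add a randomly initialized CTC head
# sized to the tokenizer's vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "TencentGameMate/chinese-wav2vec2-base",
    vocab_size=len(tokenizer),
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
)

# The model is then fine-tuned on paired (audio, transcript) data,
# for example with the Hugging Face Trainer, before it can transcribe speech.
```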
Training
The model was pretrained on the WenetSpeech L subset, which consists of approximately 10k hours of Chinese speech data. Pretraining learns audio representations without text labels, so further fine-tuning is required for specific tasks such as speech recognition.
Guide: Running Locally
To run the chinese-wav2vec2-base model locally, follow these steps (a complete code sketch combining them follows the list):

- Install Dependencies: Ensure you have Python and the following packages installed: transformers==4.16.2, torch, soundfile, and fairseq.
- Load the Model: Import the necessary libraries and modules, then load the model and feature extractor using the from_pretrained method with your specified model path.
- Prepare Audio Input: Use the soundfile library to read your audio file and extract features with the Wav2Vec2FeatureExtractor.
- Run Inference: Pass the feature-extracted audio through the model to obtain the last hidden state.
- Device Setup: Ensure that the model and input values are transferred to the appropriate device (CPU or GPU) for efficient computation.
- Suggested Cloud GPUs: For computational efficiency and speed, consider using cloud services such as AWS, Google Cloud, or Azure that provide access to powerful GPUs.
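The following is a minimal end-to-end sketch of the steps above. The model path matches the published checkpoint; the audio file path is a placeholder for your own 16 kHz mono Chinese speech clip.

```python
# Dependencies (see step 1): pip install transformers==4.16.2 torch soundfile
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_path = "TencentGameMate/chinese-wav2vec2-base"
wav_path = "test.wav"  # placeholder: your own audio file

# Device setup: prefer a GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the feature extractor and the pretrained encoder.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_path)
model = Wav2Vec2Model.from_pretrained(model_path)
model = model.to(device)
model.eval()

# Read the audio and convert it to normalized model inputs.
speech, sampling_rate = sf.read(wav_path)
inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt")
input_values = inputs.input_values.to(device)

# Run inference and take the frame-level representations.
with torch.no_grad():
    outputs = model(input_values)
    last_hidden_state = outputs.last_hidden_state

print(last_hidden_state.shape)  # (batch, time_frames, hidden_size)
```

Note that this produces hidden-state features only; transcription additionally requires the tokenizer and CTC fine-tuning described in the Architecture section.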
License
The chinese-wav2vec2-base model is released under the MIT License, allowing for wide usage and modification with minimal restrictions.