Wav2Vec2-XLS-R-2B-21-to-EN
Introduction
Wav2Vec2-XLS-R-2B-21-to-EN is a speech-translation model fine-tuned by Facebook AI. It translates speech in 21 source languages into English using a SpeechEncoderDecoder architecture and is fine-tuned on the CoVoST 2 dataset.
Architecture
This model uses the SpeechEncoderDecoderModel class: the encoder is initialized from facebook/wav2vec2-xls-r-2b and the decoder from facebook/mbart-large-50. The combined model is then fine-tuned on 21 source-language-to-English pairs from the CoVoST 2 dataset.
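To illustrate how these two pieces fit together, the sketch below builds the same encoder-decoder wiring from tiny random configs. These stand in for the real facebook/wav2vec2-xls-r-2b and facebook/mbart-large-50 checkpoints, which are far too large to instantiate casually; the dimensions here are arbitrary toy values.

```python
import torch
from transformers import (
    MBartConfig,
    SpeechEncoderDecoderConfig,
    SpeechEncoderDecoderModel,
    Wav2Vec2Config,
)

# Tiny stand-ins for the 2B-parameter encoder and the mBART-50 decoder.
encoder_cfg = Wav2Vec2Config(hidden_size=32, num_hidden_layers=2,
                             num_attention_heads=2, intermediate_size=64)
decoder_cfg = MBartConfig(d_model=32, encoder_layers=1, decoder_layers=2,
                          encoder_attention_heads=2, decoder_attention_heads=2,
                          encoder_ffn_dim=64, decoder_ffn_dim=64, vocab_size=128)

# from_encoder_decoder_configs marks the decoder as a decoder and adds cross-attention.
config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(encoder_cfg, decoder_cfg)
model = SpeechEncoderDecoderModel(config=config)

# One second of fake 16 kHz audio in, next-token logits over the decoder vocab out.
audio = torch.randn(1, 16000)
decoder_input_ids = torch.tensor([[2, 0]])  # eos as a stand-in start token
logits = model(input_values=audio, decoder_input_ids=decoder_input_ids).logits
print(logits.shape)  # torch.Size([1, 2, 128])
```

The same pattern, with the real checkpoint names passed to `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`, is how such a model is assembled before fine-tuning.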
Training
The model was fine-tuned to translate audio in each of the 21 source languages into English text using the CoVoST 2 dataset. It targets automatic speech translation (and related speech recognition) tasks, building on a transformer encoder-decoder structure.
Guide: Running Locally
Basic Steps
1. Install Required Libraries: ensure the transformers and datasets libraries are installed.

   ```bash
   pip install transformers datasets
   ```
2. Load the Model and Processor:

   ```python
   from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

   model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
   processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
   ```
3. Load a Sample Dataset:

   ```python
   from datasets import load_dataset

   ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
   ```
4. Process the Audio and Generate a Translation:

   ```python
   # The sampling rate lives on the "audio" dict, not on the raw array.
   inputs = processor(
       ds[0]["audio"]["array"],
       sampling_rate=ds[0]["audio"]["sampling_rate"],
       return_tensors="pt",
   )
   # The wav2vec2 feature extractor returns "input_values" (raw waveform features).
   generated_ids = model.generate(
       inputs["input_values"], attention_mask=inputs["attention_mask"]
   )
   transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
   ```
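XLS-R checkpoints expect 16 kHz mono input. The dummy LibriSpeech clips are already 16 kHz, but arbitrary audio may need resampling first. Below is a minimal numpy sketch using linear interpolation; in practice, higher-quality resamplers such as torchaudio.functional.resample or librosa.resample are preferable.

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16000):
    """Naive linear-interpolation resampler (sketch only; prefer torchaudio/librosa)."""
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio)

# One second of 440 Hz tone at 44.1 kHz, down to the 16 kHz the model expects.
tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100).astype(np.float32)
resampled = resample_linear(tone, 44100)
print(len(resampled))  # 16000
```

The resampled array can then be passed to the processor with `sampling_rate=16000`.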
Suggested Cloud GPUs
For optimal performance, particularly with large models and datasets, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
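When a GPU is available, a common pattern is to load the checkpoint in half precision to cut memory use; at roughly 2 bytes per parameter, the ~2B-parameter weights alone take about 4 GB in fp16. The commented-out load call below is a sketch of that pattern, not a required step.

```python
import torch

# Pick the best available device; fall back to CPU (in full precision) otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Sketch: load the checkpoint in the chosen precision and move it to the device.
# model = SpeechEncoderDecoderModel.from_pretrained(
#     "facebook/wav2vec2-xls-r-2b-21-to-en", torch_dtype=dtype
# ).to(device)
print(device, dtype)
```

Inputs produced by the processor must be moved to the same device (e.g. `inputs["input_values"].to(device)`) before calling `model.generate`.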
License
The model is licensed under the Apache-2.0 license, allowing for both personal and commercial use.