wav2vec2-xls-r-2b-21-to-en

facebook

Introduction

Wav2Vec2-XLS-R-2B-21-EN is a speech translation model from Facebook. It takes speech in any of 21 source languages and produces English text, using a SpeechEncoderDecoder architecture fine-tuned on the Covost2 dataset.

Architecture

This model is a SpeechEncoderDecoderModel whose encoder was initialized from the facebook/wav2vec2-xls-r-2b checkpoint and whose decoder was initialized from facebook/mbart-large-50. The combined model was then fine-tuned on 21 language pairs, each translating into English, from the Covost2 dataset.

Training

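The model was fine-tuned on the Covost2 dataset to translate speech in each of the 21 source languages directly into English text. Although it is categorized under automatic speech recognition, its output is an English translation rather than a transcript in the source language. Both the encoder and the decoder are transformer models.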

Guide: Running Locally

Basic Steps

  1. Install Required Libraries: Ensure the transformers and datasets libraries are installed, along with PyTorch, which the steps below use for tensors.

    pip install transformers datasets torch
    
  2. Load the Model and Processor:

    from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
    model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
    processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
    
  3. Load Dataset:

    from datasets import load_dataset
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    
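  4. Process the Audio and Generate the Translation:

    # The sampling rate is stored next to the array in the "audio" dict.
    audio = ds[0]["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
    # Assuming the checkpoint ships a Wav2Vec2 feature extractor, the key is "input_values".
    generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)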
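
To run the steps above on your own recording instead of the dummy dataset, the audio first has to be turned into an array of floats with a known sampling rate. The following is a minimal sketch using only the Python standard library. The helper names `read_wav_as_floats` and `require_sample_rate` are hypothetical, and the 16 kHz target is an assumption based on the rate Wav2Vec2-style encoders are normally trained on. It handles only 16-bit mono PCM WAV files.

```python
import io
import struct
import wave

# Assumption: Wav2Vec2-style encoders expect 16 kHz input.
TARGET_SAMPLE_RATE = 16_000


def read_wav_as_floats(source):
    """Read a 16-bit mono PCM WAV (path or file object).

    Returns (samples, sample_rate) with samples scaled to [-1.0, 1.0).
    """
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        if wav.getnchannels() != 1:
            raise ValueError("this sketch only handles mono audio")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2  # two bytes per 16-bit sample
    ints = struct.unpack(f"<{count}h", raw)  # little-endian signed shorts
    return [value / 32768.0 for value in ints], sample_rate


def require_sample_rate(sample_rate, expected=TARGET_SAMPLE_RATE):
    """Raise early if the audio would need resampling before inference."""
    if sample_rate != expected:
        raise ValueError(
            f"audio is {sample_rate} Hz, expected {expected} Hz; resample first"
        )
```

The returned list and rate could then be passed to the processor in step 4 in place of the dataset's array and sampling rate. Resampling itself is left out of this sketch.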
    

Suggested Cloud GPUs

This is a large model: the encoder alone comes from the 2-billion-parameter XLS-R checkpoint, so a GPU is recommended for inference. If no suitable local GPU is available, consider a cloud GPU instance from AWS EC2, Google Cloud Platform, or Azure.
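
When choosing an instance, a rough lower bound on the memory needed just to hold the weights can be worked out from the parameter count. The sketch below is back-of-the-envelope arithmetic only. The figure of 2 billion parameters is an assumption taken from the "2b" in the encoder's name, the helper `weight_memory_gib` is hypothetical, and the estimate ignores the decoder, activations, and framework overhead.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed to store the weights alone, in GiB (2**30 bytes)."""
    return num_parameters * bytes_per_parameter / 2**30


# Assumption: roughly 2 billion encoder parameters, per the "2b" in the name.
ENCODER_PARAMETERS = 2_000_000_000

for label, width in (("float32", 4), ("float16", 2)):
    gib = weight_memory_gib(ENCODER_PARAMETERS, width)
    print(f"{label}: about {gib:.1f} GiB for the encoder weights")
```

Under these assumptions the encoder weights take about 7.5 GiB in float32 and about half that in float16, so the GPU needs more memory than that figure to leave room for the decoder and activations.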

License

The model is licensed under the Apache-2.0 license, allowing for both personal and commercial use.
