wav2vec2 large xlsr basque

cahya

Introduction

WAV2VEC2-LARGE-XLSR-BASQUE is Facebook's Wav2Vec2-Large-XLSR-53 model fine-tuned for Basque on the Common Voice dataset. It is intended for automatic speech recognition, and its speech input must be sampled at 16 kHz.

Architecture

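The model uses the Wav2Vec 2.0 architecture: a convolutional feature encoder turns the raw waveform into frame-level features, and a Transformer network contextualizes those frames. It starts from XLSR (cross-lingual speech representations), a checkpoint pretrained on speech in 53 languages so that it can be fine-tuned for an individual language, in this case Basque.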
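
To make the 16 kHz requirement concrete, the sketch below estimates how many frames the convolutional front end produces for a given number of audio samples. The kernel and stride values are an assumption: they are the standard Wav2Vec 2.0 feature-encoder layout, not something stated in this model card, and `num_output_frames` is a hypothetical helper name.

```python
# Assumed values: the standard Wav2Vec 2.0 feature-encoder layout
# (7 unpadded 1-D convolutions). They are not taken from this model card.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return how many frames remain after the stacked convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


# One second of audio at 16 kHz.
print(num_output_frames(16_000))  # 49
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, roughly one every 20 ms. Audio at a different sampling rate would map to a different time span per frame than the model saw during training, which is why resampling comes first.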

Training

The model was trained on the Basque subset of the Common Voice dataset, using the train, validation, and test splits. Detailed training methodologies and scripts can be found here.

Guide: Running Locally

  1. Environment Setup:

    • Install PyTorch and the Transformers library.
    • Install torchaudio and datasets (the Hugging Face Datasets library).
  2. Load Dataset:

    from datasets import load_dataset
    test_dataset = load_dataset("common_voice", "eu", split="test[:2%]")
    
  3. Model and Processor Initialization:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    processor = Wav2Vec2Processor.from_pretrained("cahya-wirawan/wav2vec2-large-xlsr-basque")
    model = Wav2Vec2ForCTC.from_pretrained("cahya-wirawan/wav2vec2-large-xlsr-basque")
    
  4. Preprocess Audio Files:

    import torchaudio

    def speech_file_to_array_fn(batch):
        # Load the clip, then resample from its native rate to the 16 kHz the model expects.
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch

    # Add a "speech" column holding the resampled waveform of each clip.
    test_dataset = test_dataset.map(speech_file_to_array_fn)
    
  5. Inference:

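    import torch

    # Pad the first two clips to a common length and build the attention mask.
    inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    # Run the forward pass without gradient tracking, since this is inference only.
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    # Greedy CTC decoding: pick the most likely token per frame; batch_decode collapses repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    print("Prediction:", processor.batch_decode(predicted_ids))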
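
The inference step takes the argmax per frame and hands the result to `batch_decode`. A minimal pure-Python sketch of the collapsing that greedy CTC decoding performs is shown below. The vocabulary, the frame sequence, and `blank_id=0` are all hypothetical; the real token ids come from the model's tokenizer, and `ctc_greedy_collapse` is an illustrative helper, not a Transformers function.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then drop the CTC blank."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Hypothetical vocabulary and per-frame argmax output, for illustration only.
vocab = {0: "", 1: "k", 2: "a", 3: "i", 4: "x", 5: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 4, 4, 0, 5]

ids = ctc_greedy_collapse(frames)
print("".join(vocab[i] for i in ids))  # kaixo
```

The blank matters for doubled letters: a blank between two runs of the same id keeps them as two separate tokens, whereas an uninterrupted run collapses to one.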
    

Cloud GPUs: For faster inference with CUDA, consider running on a GPU instance from a cloud service such as AWS EC2, Google Cloud, or Azure.
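
On a GPU instance, the model and its inputs have to be placed on the CUDA device explicitly. The sketch below shows one way to choose the device and fall back to the CPU when no GPU (or no PyTorch install) is present; `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

With the objects from the guide, this would be followed by `model.to(device)` and by moving `inputs.input_values` and `inputs.attention_mask` to the same device with `.to(device)` before the forward pass.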

License

This model is licensed under the Apache-2.0 License.
