Introduction

The XLSR-1B Swiss German model is a fine-tuned wav2vec2 model for automatic speech recognition (ASR) of Swiss German. It was trained on the Swiss parliament dataset and is part of Hugging Face's Robust Speech Event and ASR leaderboard.

Architecture

The model is based on the wav2vec2 architecture and starts from XLS-R, a wav2vec2 checkpoint pretrained on multilingual speech. It is a transformer model implemented in PyTorch and takes raw Swiss German speech audio as input.

Training

The XLSR-1B Swiss German model was fine-tuned on a 70-hour dataset from the Swiss parliament provided by FHNW. The model achieved a Word Error Rate (WER) of 34.6% on the Swiss parliament test set and 40% on a private test set of Swiss German dialects. The training and test data come from the FHNW datasets and from a private Hugging Face dataset.
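
To make the WER figures above concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example sentences are made up for illustration, and the evaluation script actually used for this model may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the model's test sets):
# one deleted word and one substituted word out of five reference words.
wer = word_error_rate("das isch es guets bispiel", "das isch guets bispil")
print(f"WER: {wer:.1%}")
```

Under this definition, a WER of 34.6% corresponds to roughly one word error for every three reference words.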

Guide: Running Locally

To run the XLSR-1B Swiss German model locally, follow these steps:

  1. Install Dependencies: Ensure you have Python installed. Use pip to install Hugging Face Transformers, PyTorch, and the Datasets library used in step 3:

    pip install transformers torch datasets
    
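  2. Load the Model: Use the Transformers library to load the model and its processor.

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    model_name = "manifoldix/xlsr-sg-lm"
    # The processor bundles the feature extractor (audio -> tensors)
    # and the tokenizer (token ids -> text).
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    
  3. Perform Inference: Transcribe one audio sample from a speech dataset.

    from datasets import Audio, load_dataset
    import torch
    
    dataset = load_dataset("common_voice", "gsw", split="test")
    # wav2vec2 models expect 16 kHz audio, so resample on load.
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
    audio_input = dataset[0]["audio"]["array"]
    
    inputs = processor(audio_input, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    transcription = processor.batch_decode(predicted_ids)
    print(transcription)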
    
  4. Cloud GPUs: For better performance, consider using cloud-based GPU services like AWS, Google Cloud, or Azure to handle large audio datasets or real-time transcription needs.
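
The decoding in step 3 (argmax over the logits, then batch_decode) is a greedy CTC decode: consecutive repeated token ids are merged, blank tokens are dropped, and the word delimiter is turned into a space. The pure-Python sketch below illustrates that idea only. The function name, the tiny vocabulary, and the frame ids are hypothetical, and the sketch assumes the blank token has id 0 and that "|" marks word boundaries; the model's real vocabulary may differ.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text using the greedy CTC rule."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blank tokens, which only separate repeated characters.
    chars = [id_to_char[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Replace the word delimiter with a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical vocabulary: id 0 is the CTC blank, "|" separates words.
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "o", 4: "i", 5: "d", 6: "u"}
# Hypothetical per-frame argmax ids, as produced by torch.argmax(logits, dim=-1).
frames = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 0, 6, 6]
print(ctc_greedy_decode(frames, vocab))
```

In practice the tokenizer's batch_decode performs these steps, so this helper is only meant to show what happens between the logits and the final string.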

License

The model and its associated datasets are shared under licenses specified on their respective Hugging Face pages. Users should review these licenses to ensure compliance with any restrictions or usage guidelines.
