xlsr-sg-lm (manifoldix)
Introduction
The XLSR-1B Swiss German model is a fine-tuned version of the wav2vec2 XLS-R model for automatic speech recognition (ASR) in Swiss German. It was trained specifically on a Swiss parliament dataset and is part of Hugging Face's Robust Speech Event and its ASR leaderboard.
Architecture
The model is based on the wav2vec2 architecture and builds on the XLS-R framework, which is known for handling multilingual speech recognition tasks. It is a Transformer model implemented in PyTorch and is fine-tuned with a CTC output layer to transcribe Swiss German speech.
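As a quick check of what the checkpoint contains, the configuration can be inspected with the Transformers library before downloading any weights. This is a minimal sketch, assuming the model id manifoldix/xlsr-sg-lm used in the loading step further below:
    from transformers import AutoConfig

    # Fetch only the model configuration (no weights are downloaded).
    config = AutoConfig.from_pretrained("manifoldix/xlsr-sg-lm")

    print(config.model_type)         # expected: "wav2vec2"
    print(config.hidden_size)        # Transformer hidden dimension
    print(config.num_hidden_layers)  # number of Transformer encoder layers
    print(config.vocab_size)         # size of the CTC output vocabulary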
Training
The XLSR-1B Swiss German model was fine-tuned on a 70-hour dataset from the Swiss parliament provided by FHNW. The model achieves a Word Error Rate (WER) of 34.6% on the Swiss parliament test set and 40% on a private test set of Swiss German dialects. The training and test data are accessible through the FHNW datasets and a private Hugging Face dataset, respectively.
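The WER figures above count word-level substitutions, deletions, and insertions against the reference transcript, divided by the number of reference words. A score of this kind could be reproduced with the jiwer package; this is a minimal sketch with made-up example sentences (both the package choice and the texts are assumptions, not part of the original evaluation):
    # pip install jiwer
    import jiwer

    # Illustrative reference transcripts and model outputs (not real data).
    references = ["der nationalrat hat die vorlage angenommen",
                  "die debatte wird morgen fortgesetzt"]
    hypotheses = ["der nationalrat hat die vorlage angenomme",
                  "die debatte wird morgen fortgesetzt"]

    # WER = (substitutions + deletions + insertions) / reference word count
    print(f"WER: {jiwer.wer(references, hypotheses):.1%}")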
Guide: Running Locally
To run the XLSR-1B Swiss German model locally, follow these steps:
- Install Dependencies: Ensure you have Python and PyTorch installed, then use pip to install Hugging Face Transformers along with the soundfile library used in the inference example below:
    pip install transformers torch soundfile
- Load the Model: Use the Transformers library to load the processor and the model.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "manifoldix/xlsr-sg-lm"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
- Perform Inference: Process a local audio file to transcribe speech.
    import torch
    import soundfile as sf

    # Read a 16 kHz mono recording of Swiss German speech
    # ("sample.wav" is a placeholder; replace it with your own file).
    speech, sampling_rate = sf.read("sample.wav")

    inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    print(transcription)
- Cloud GPUs: For better performance, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure when transcribing large audio datasets or handling real-time transcription; a GPU inference sketch follows this list.
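Running the 1-billion-parameter model on a GPU mainly amounts to moving the model and the input tensors onto the same device. This is a minimal sketch that reuses the processor, model, and speech variables from the steps above and assumes a CUDA-capable machine and 16 kHz audio:
    import torch

    # Fall back to CPU if no CUDA device is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids)[0])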
License
The model and its associated datasets are shared under licenses specified on their respective Hugging Face pages. Users should review these licenses to ensure compliance with any restrictions or usage guidelines.