viwav2vec2-base-3k
Introduction
The viwav2vec2-base-3k model, published by dragonSwing, is a Wav2Vec2 base model pre-trained on a large Vietnamese speech corpus. It is designed for Automatic Speech Recognition tasks and is built using the PyTorch framework. The model is pre-trained on 3,000 hours of Vietnamese audio data, including spontaneous, reading, and broadcasting speech.
Architecture
The model is based on the Wav2Vec2 architecture, which learns speech representations directly from raw audio. It was pre-trained on 16 kHz sampled audio, so input speech must also be sampled at 16 kHz, and it is intended to be fine-tuned for downstream tasks such as Vietnamese Automatic Speech Recognition.
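As a minimal preprocessing sketch, the snippet below resamples arbitrary audio to 16 kHz and normalizes it before a forward pass. It assumes the checkpoint may not ship a preprocessor config, so a Wav2Vec2FeatureExtractor is constructed by hand with Wav2Vec2 base defaults; sample.wav is a placeholder file name.

  import torch
  import torchaudio
  from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

  # Load any mono waveform; "sample.wav" is a placeholder.
  waveform, sample_rate = torchaudio.load("sample.wav")

  # Resample to the 16 kHz rate the model was pre-trained on.
  if sample_rate != 16_000:
      waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

  # Constructed by hand with Wav2Vec2 base defaults, in case the
  # checkpoint does not include a preprocessor config.
  feature_extractor = Wav2Vec2FeatureExtractor(
      feature_size=1, sampling_rate=16_000, padding_value=0.0,
      do_normalize=True, return_attention_mask=False,
  )

  inputs = feature_extractor(
      waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
  )
  model = Wav2Vec2Model.from_pretrained("dragonSwing/viwav2vec2-base-3k")
  with torch.no_grad():
      hidden = model(inputs.input_values).last_hidden_state  # (batch, frames, 768)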
Training
The model is pre-trained on 3,000 hours of Vietnamese speech data. Because the checkpoint does not include a tokenizer, it must be fine-tuned on labeled text data before it can be used for speech recognition. In practice this means building a character-level tokenizer from your transcripts and training a task-specific head (typically a CTC head) on top of the pre-trained encoder.
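As a concrete illustration, the sketch below builds a character-level CTC tokenizer from a hypothetical vocab.json and attaches a freshly initialized CTC head to the pre-trained encoder. The vocabulary shown is a truncated placeholder, not the full Vietnamese alphabet.

  import json
  from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

  # Hypothetical character vocabulary built from your labeled transcripts.
  vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "a": 3, "b": 4, "c": 5}  # truncated
  with open("vocab.json", "w", encoding="utf-8") as f:
      json.dump(vocab, f, ensure_ascii=False)

  tokenizer = Wav2Vec2CTCTokenizer(
      "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
  )

  # Attach a randomly initialized CTC head sized to the new vocabulary;
  # the encoder weights come from the pre-trained checkpoint.
  model = Wav2Vec2ForCTC.from_pretrained(
      "dragonSwing/viwav2vec2-base-3k",
      vocab_size=len(tokenizer),
      pad_token_id=tokenizer.pad_token_id,
      ctc_loss_reduction="mean",
  )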
Guide: Running Locally
To run the viwav2vec2-base-3k model locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch and Transformers installed:
  pip install torch transformers
- Load the Model:
  import torch
  from transformers import Wav2Vec2Model

  model = Wav2Vec2Model.from_pretrained("dragonSwing/viwav2vec2-base-3k")
- Sanity Check: Run one second of dummy 16 kHz audio (16,000 samples) through the model:
  inputs = torch.rand([1, 16000])  # batch of one, one second at 16 kHz
  outputs = model(inputs)
- Fine-Tuning: Refer to this notebook for detailed steps on fine-tuning the model; a minimal inference sketch for an already fine-tuned checkpoint follows this list.
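Once fine-tuning has produced a checkpoint with a tokenizer, transcription reduces to a forward pass plus greedy CTC decoding. The sketch below is illustrative only: path/to/finetuned-checkpoint is a placeholder, and a random waveform stands in for real 16 kHz speech.

  import torch
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

  # Placeholder path to your own fine-tuned checkpoint.
  checkpoint = "path/to/finetuned-checkpoint"
  processor = Wav2Vec2Processor.from_pretrained(checkpoint)
  model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
  model.eval()

  # Dummy one-second 16 kHz waveform standing in for real speech.
  speech = torch.rand(16_000).numpy()
  inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
  with torch.no_grad():
      logits = model(inputs.input_values).logits

  # Greedy CTC decoding: best token per frame, then collapse repeats and blanks.
  predicted_ids = torch.argmax(logits, dim=-1)
  print(processor.batch_decode(predicted_ids)[0])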
Recommendation: For efficient training and fine-tuning, consider using cloud GPUs such as those offered by Google Cloud, AWS, or Azure.
License
The viwav2vec2-base-3k model is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0).