farsi_commonvoice_blstm
espnet
Introduction
The farsi_commonvoice_blstm model is an Automatic Speech Recognition (ASR) model built with the ESPnet framework. It is designed for Persian (Farsi) speech and was trained on the CommonVoice dataset.
Architecture
The model uses a bidirectional Long Short-Term Memory (BLSTM) architecture implemented in the ESPnet toolkit. The encoder uses a VGG-style RNN configuration with 4 LSTM layers of 1024 units each. The decoder comprises 2 LSTM layers with a similar configuration.
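To illustrate what "bidirectional" means in the paragraph above, here is a minimal Python sketch. It is not ESPnet code: the helper names are hypothetical, and a toy running-sum cell stands in for a real LSTM cell. Each frame's output pairs a left-to-right pass with a right-to-left pass, so every position sees context from both sides.

```python
def run_direction(frames, step, initial_state):
    """Feed frames to `step` in order and collect the state after each frame."""
    state, outputs = initial_state, []
    for frame in frames:
        state = step(state, frame)
        outputs.append(state)
    return outputs


def bidirectional(frames, forward_step, backward_step, initial_state):
    """Pair each frame's forward-pass output with its backward-pass output."""
    forward = run_direction(frames, forward_step, initial_state)
    # The backward pass reads the sequence reversed; its outputs are then
    # reversed again so that index t lines up with frame t.
    backward = run_direction(frames[::-1], backward_step, initial_state)[::-1]
    return list(zip(forward, backward))


def running_sum(state, frame):
    """Toy stand-in for an LSTM cell: the state is the sum of inputs so far."""
    return state + frame


print(bidirectional([1, 2, 3], running_sum, running_sum, 0))
# prints [(1, 6), (3, 5), (6, 3)]
```

In a real BLSTM the two directions are separate LSTM cells with their own weights, which is why the sketch takes two step functions even though the toy example passes the same one twice.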
Training
The model was trained on the CommonVoice dataset following the ESPnet framework's ASR recipe. The recipe applies SpecAugment (time and frequency masking), global mean-variance normalization, and BPE tokenization. Training uses the Adadelta optimizer together with the recipe's other hyperparameter settings.
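The two feature-level steps named above can be sketched in plain Python. This is an illustrative approximation, not the ESPnet implementation: the function names, the zero fill value, and the single-mask-per-call behavior are assumptions. Features are a list of frames, each a list of per-frequency-bin values.

```python
import math
import random


def apply_mask(features, axis, max_width, rng):
    """Zero one random band along time (axis=0) or frequency (axis=1).

    The band width is drawn from 0..max_width; zero fill is an assumption.
    """
    num_frames, num_bins = len(features), len(features[0])
    size = num_frames if axis == 0 else num_bins
    width = rng.randint(0, min(max_width, size))
    start = rng.randint(0, size - width)
    masked = [row[:] for row in features]  # copy so the input stays untouched
    for t in range(num_frames):
        for f in range(num_bins):
            index = t if axis == 0 else f
            if start <= index < start + width:
                masked[t][f] = 0.0
    return masked


def global_mvn_stats(utterances):
    """Per-bin mean and standard deviation over all frames of all utterances."""
    frames = [frame for utt in utterances for frame in utt]
    num_bins = len(frames[0])
    means = [sum(fr[f] for fr in frames) / len(frames) for f in range(num_bins)]
    stds = [
        math.sqrt(sum((fr[f] - means[f]) ** 2 for fr in frames) / len(frames))
        for f in range(num_bins)
    ]
    return means, stds


def normalize(features, means, stds, eps=1e-20):
    """Shift and scale each bin with the precomputed global statistics."""
    return [
        [(x - m) / max(s, eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]
```

A possible usage: compute `means, stds = global_mvn_stats(training_utterances)` once over the training set, then call `normalize` on every utterance, while `apply_mask` is applied only during training with a seeded `random.Random` for reproducibility.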
Guide: Running Locally
- Clone the ESPnet repository:
  git clone https://github.com/espnet/espnet.git
  cd espnet
- Check out the specific commit for compatibility:
  git checkout 716eb8f92e19708acfd08ba3bd39d40890d3a84b
- Install dependencies:
  pip install -e .
- Navigate to the example directory and run the script:
  cd egs2/commonvoice/asr1
  ./run.sh --skip_data_prep false --skip_train true --download_model espnet/farsi_commonvoice_blstm
- Consider running on a cloud GPU instance (for example, on Google Cloud or AWS) to speed up the compute-intensive steps.
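For anyone who wants to script the steps above, the following Python sketch assembles the same commands as data and prints them for review without executing anything. The helper names and the `checkout_dir` parameter are hypothetical; the URL, commit hash, recipe path, and flags are copied from the steps above.

```python
import shlex

# Values copied from the steps above.
ESPNET_REPO = "https://github.com/espnet/espnet.git"
ESPNET_COMMIT = "716eb8f92e19708acfd08ba3bd39d40890d3a84b"
MODEL_TAG = "espnet/farsi_commonvoice_blstm"
RECIPE_DIR = "egs2/commonvoice/asr1"


def build_steps(checkout_dir="espnet"):
    """Return (working_directory, argv) pairs for the steps listed above.

    `checkout_dir` is a hypothetical local directory name; a working
    directory of None means "run in the current directory".
    """
    recipe = f"{checkout_dir}/{RECIPE_DIR}"
    return [
        (None, ["git", "clone", ESPNET_REPO, checkout_dir]),
        (checkout_dir, ["git", "checkout", ESPNET_COMMIT]),
        (checkout_dir, ["pip", "install", "-e", "."]),
        (recipe, ["./run.sh", "--skip_data_prep", "false",
                  "--skip_train", "true", "--download_model", MODEL_TAG]),
    ]


def format_steps(steps):
    """Render the steps as shell lines for review (nothing is executed)."""
    lines = []
    for cwd, argv in steps:
        command = shlex.join(argv)
        if cwd is None:
            lines.append(command)
        else:
            lines.append(f"(cd {shlex.quote(cwd)} && {command})")
    return lines


if __name__ == "__main__":
    for line in format_steps(build_steps()):
        print(line)
```

Keeping each step as an argument list rather than one shell string makes it straightforward to pass the same data to `subprocess.run(argv, cwd=cwd, check=True)` later, if actual execution is wanted.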
License
This model is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits sharing and adaptation provided appropriate credit is given.