brianyan918_iwslt22_dialect_train_asr_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
Introduction
The ESPnet2 ASR model is an automatic speech recognition (ASR) system developed by Brian Yan using the iwslt22_dialect recipe in ESPnet. It is designed for speech processing tasks, leveraging a Conformer architecture with CTC (Connectionist Temporal Classification) for improved performance.
Architecture
The model utilizes a Conformer architecture, which combines convolutional neural networks (CNNs) and transformers to capture both local and global dependencies in audio data. Key components include:
- Encoder: 12 blocks with 256 output size, using 4 attention heads, convolutional layers, and self-attention mechanisms.
- Decoder: 6 blocks with 2048 linear units and similar attention mechanisms.
- CTC Weight: 0.3, used to balance the CTC loss with the main sequence-to-sequence loss.
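The CTC weight interpolates the CTC loss with the attention decoder loss during training. A minimal sketch of that interpolation, using the 0.3 weight from this model's name (the loss values below are placeholders, not real training numbers):

```python
def hybrid_loss(loss_ctc: float, loss_att: float, ctc_weight: float = 0.3) -> float:
    """Hybrid CTC/attention objective: weighted interpolation of the two losses."""
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att

# Example with dummy loss values: 0.3 * 2.0 + 0.7 * 1.0, i.e. roughly 1.3
print(hybrid_loss(2.0, 1.0))
```

Setting `ctc_weight` to 0 recovers pure attention-based training; setting it to 1 recovers pure CTC.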
Training
Training was conducted on the iwslt22_dialect dataset with a learning rate of 0.002 and a warm-up learning rate scheduler with 15,000 warm-up steps. Data augmentation with SpecAugment was applied to improve generalization.
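The warm-up schedule can be sketched as below. This assumes the standard Noam-style (inverse-square-root) schedule that ESPnet's warm-up scheduler follows, plugged with this model's 0.002 peak rate and 15,000 warm-up steps; it is an illustration, not the exact training code:

```python
def warmup_lr(step: int, base_lr: float = 0.002, warmup_steps: int = 15000) -> float:
    """Noam-style warm-up: linear ramp for warmup_steps, then inverse-sqrt decay.

    The rate peaks at base_lr exactly when step == warmup_steps.
    """
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)

# Ramp-up, peak, and decay (values shown approximately):
print(warmup_lr(1000))    # well below the peak
print(warmup_lr(15000))   # the peak, equal to base_lr
print(warmup_lr(60000))   # decayed past the peak
```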
Guide: Running Locally
To run the model locally, follow these steps:
- Environment Setup:
  - Clone the ESPnet repository and enter it:
    git clone https://github.com/espnet/espnet.git
    cd espnet
  - Check out the specific commit:
    git checkout 77fce65312877a132bbae01917ad26b74f6e2e14
  - Install dependencies:
    pip install -e .
- Run the ASR Recipe:
  - Navigate to the example directory:
    cd egs2/iwslt22_dialect/asr1
  - Execute the run script:
    ./run.sh --skip_data_prep false --skip_train true --download_model espnet/brianyan918_iwslt22_dialect_train_asr_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
- Optional: Use cloud GPUs from providers like AWS, GCP, or Azure for faster processing.
License
The model is released under the CC-BY-4.0 license, allowing for sharing and adaptation with appropriate credit.