brianyan918_iwslt22_dialect_train_st_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
Introduction
The brianyan918_iwslt22_dialect_train_st_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug model is developed with ESPnet, an end-to-end speech processing toolkit. It is designed for speech translation and pairs a Conformer architecture with an auxiliary CTC loss.
Architecture
The model pairs a Conformer encoder with a Transformer decoder. The encoder consists of 12 blocks, each with an output size of 256, 4 attention heads, and 1024 linear units, and uses layer normalization and macaron-style feed-forward modules. The decoder has 6 blocks, each with 2048 linear units and matching attention settings. Training also applies SpecAugment for data augmentation.
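These settings map onto the encoder/decoder section of an ESPnet2 training config. The YAML below is a sketch reconstructed from the figures quoted above, not the verbatim config shipped with the model; the fields marked as assumed (positional encoding type, self-attention type, convolution kernel size) are common ESPnet Conformer recipe defaults rather than values stated in this card.

encoder: conformer
encoder_conf:
    output_size: 256                         # per-block output size
    attention_heads: 4
    linear_units: 1024
    num_blocks: 12
    normalize_before: true                   # layer normalization around each sub-module
    macaron_style: true                      # macaron-style feed-forward pairs
    pos_enc_layer_type: rel_pos              # assumed: usual Conformer recipe default
    selfattention_layer_type: rel_selfattn   # assumed: pairs with rel_pos
    cnn_module_kernel: 31                    # assumed convolution kernel size

decoder: transformer
decoder_conf:
    attention_heads: 4                       # mirrors the encoder's attention settings
    linear_units: 2048
    num_blocks: 6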
Training
The model was trained on the iwslt22_dialect dataset with a peak learning rate of 2e-3, 15,000 warmup steps, and a CTC weight of 0.3. Optimization uses Adam with a warmup learning-rate scheduler, and training runs for at most 80 epochs.
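In an ESPnet2 recipe these hyperparameters live in the same training config. The sketch below simply restates the values quoted above in config form; the exact mask settings behind the "newspecaug" tag are not given in this card, so only the SpecAugment toggle is shown.

optim: adam
optim_conf:
    lr: 0.002                 # peak learning rate (2e-3)
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 15000
model_conf:
    ctc_weight: 0.3           # weight of the auxiliary CTC loss
max_epoch: 80
specaug: specaug              # SpecAugment enabled; "newspecaug" mask parameters not stated here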
Guide: Running Locally
- Clone ESPnet Repository:
git clone https://github.com/espnet/espnet.git
cd espnet
- Checkout Specific Commit:
git checkout 77fce65312877a132bbae01917ad26b74f6e2e14
- Install Dependencies:
pip install -e .
- Run the Model: Navigate to the iwslt22_dialect example and execute the provided script:
cd egs2/iwslt22_dialect/st1
./run.sh --skip_data_prep false --skip_train true --download_model espnet/brianyan918_iwslt22_dialect_train_st_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
With --skip_train true, training is skipped and the pretrained model is downloaded for decoding.
Cloud GPUs: For efficient training and inference, consider using cloud GPUs from providers like AWS, GCP, or Azure.
License
The model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). This allows for sharing and adaptation, provided appropriate credit is given to the original creators.