ssl_en_nest_xlarge_v1.0
Introduction
The NVIDIA NEST XLARGE EN model is designed for speech self-supervised learning and can be used either as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The model contains approximately 600 million parameters and was trained on roughly 100,000 hours of English audio.
Architecture
The NEST framework employs a FastConformer encoder with 24 layers and a linear classifier decoder. Input features are masked with random block masking, and speaker/noise augmentation is applied during training. The model is optimized with a cross-entropy loss computed only at the masked positions. The input is 16 kHz mono-channel audio in WAV format, and the output is a sequence of audio features.
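The random block masking described above can be illustrated with a small sketch: contiguous blocks of frames are marked as masked until a target fraction of the sequence is covered, and the loss is then computed only at those positions. The block size and mask probability below are illustrative values, not the model's actual hyperparameters.

```python
import random

def random_block_mask(num_frames, block_size=40, mask_prob=0.2, seed=0):
    """Return a boolean mask over feature frames.

    Contiguous blocks of `block_size` frames are masked until roughly
    `mask_prob` of all frames are covered. The values here are
    illustrative, not the hyperparameters used to train NEST.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    target = int(mask_prob * num_frames)
    while sum(mask) < target:
        start = rng.randrange(0, max(1, num_frames - block_size))
        for i in range(start, min(start + block_size, num_frames)):
            mask[i] = True
    return mask

mask = random_block_mask(1000)
# During training, the cross-entropy loss is applied only at positions
# where mask[i] is True (the masked frames).
```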
Training
The model is trained with NVIDIA's NeMo Framework on datasets such as LibriLight and VoxPopuli. Data collection and labeling follow a hybrid method combining automated and human processes.
Guide: Running Locally
To run the model locally, follow these steps:
1. Install NVIDIA NeMo Framework:
   - Clone the NeMo repository from GitHub.
   - Follow the installation guidelines to set up the environment.
2. Instantiate the Model:

   ```python
   from nemo.collections.asr.models import EncDecDenoiseMaskedTokenPredModel

   nest_model = EncDecDenoiseMaskedTokenPredModel.from_pretrained(model_name="nvidia/ssl_en_nest_xlarge_v1.0")
   ```
3. Use as Weight Initialization:
   - Use the provided script to initialize weights for downstream tasks such as ASR.
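Conceptually, weight initialization copies the pretrained encoder's parameters into a downstream model wherever names and shapes match, leaving the rest (such as a freshly added decoder) randomly initialized. The sketch below uses plain dictionaries to stand in for checkpoint state dicts; the parameter names are hypothetical, not actual NeMo keys.

```python
def init_from_pretrained(downstream, pretrained):
    """Copy pretrained weights into a downstream state dict.

    Only entries whose key and shape match are transferred; everything
    else keeps its existing (random) initialization. The keys below are
    hypothetical, not actual NeMo parameter names.
    """
    loaded = []
    for key, value in pretrained.items():
        if key in downstream and len(downstream[key]) == len(value):
            downstream[key] = value
            loaded.append(key)
    return loaded

# Toy "state dicts": parameter name -> weight vector.
pretrained = {"encoder.layer0.weight": [0.1, 0.2], "decoder.weight": [0.5]}
downstream = {"encoder.layer0.weight": [0.0, 0.0], "decoder.weight": [0.0, 0.0, 0.0]}
loaded = init_from_pretrained(downstream, pretrained)
```

The downstream decoder is skipped because its shape differs, which is exactly what happens when a pretrained SSL encoder is paired with a new task-specific head.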
4. Use as Feature Extractor:
   - Extract audio features using the provided script for speaker verification or similar tasks.
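Once frame-level features have been extracted, a common speaker-verification recipe is to mean-pool them into one embedding per utterance and compare utterances by cosine similarity. The sketch below assumes the features are already available as plain lists; the toy values merely stand in for NEST outputs.

```python
import math

def mean_pool(frames):
    """Average frame-level features into a single utterance embedding."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy frame-level features standing in for NEST outputs (frames x dim).
utt_a = [[1.0, 0.0], [1.0, 0.2]]
utt_b = [[0.9, 0.1], [1.0, 0.0]]
score = cosine_similarity(mean_pool(utt_a), mean_pool(utt_b))
# A threshold on `score` then decides whether the two utterances
# come from the same speaker.
```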
5. Hardware Recommendations:
   - For optimal performance, use cloud GPUs such as the NVIDIA A6000 or A100.
License
The model is licensed under CC-BY-4.0, which permits both commercial and non-commercial use under the terms of the license.