SqueezeBERT Uncased
Introduction
SqueezeBERT is a pretrained English language model trained with the masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives. It gains efficiency by replacing the pointwise fully-connected layers used throughout BERT-base with grouped convolutions, which makes it significantly faster than BERT-base at inference time. The model is uncased (case-insensitive) and is intended as a general-purpose encoder for English NLP tasks.
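A quick way to sanity-check the pretrained model is the Hugging Face transformers fill-mask pipeline. The sketch below assumes the checkpoint is published on the Hub as `squeezebert/squeezebert-uncased` and that its pretrained MLM head is included; it is an illustration, not part of the original training setup.

```python
# Minimal sketch: query the pretrained MLM head through the fill-mask pipeline.
# Assumes the checkpoint id "squeezebert/squeezebert-uncased" and a local
# installation of transformers with PyTorch.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="squeezebert/squeezebert-uncased")

# The tokenizer uses BERT-style [MASK] tokens; print the top predictions.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```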
Architecture
The architecture of SqueezeBERT mirrors BERT-base, with one key modification: the pointwise fully-connected layers are replaced with grouped convolutions. According to the SqueezeBERT paper, this substitution makes the encoder about 4.3x faster than bert-base-uncased on a Google Pixel 3 smartphone.
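To make the substitution concrete, the sketch below contrasts a pointwise fully-connected layer with a grouped 1D convolution in PyTorch. The layer width, group count, and tensor shapes are illustrative assumptions, not the exact configuration used in SqueezeBERT.

```python
# Illustrative sketch (not the SqueezeBERT implementation): a pointwise
# fully-connected layer is equivalent to a Conv1d with kernel_size=1, and
# adding groups splits it into independent per-group projections, cutting
# parameters and FLOPs by the group count.
import torch
import torch.nn as nn

hidden, groups, seq_len, batch = 768, 4, 128, 2

dense     = nn.Linear(hidden, hidden)                           # pointwise FC
pointwise = nn.Conv1d(hidden, hidden, kernel_size=1)            # same op, conv form
grouped   = nn.Conv1d(hidden, hidden, kernel_size=1, groups=groups)

x = torch.randn(batch, hidden, seq_len)  # Conv1d expects (batch, channels, positions)

print(sum(p.numel() for p in dense.parameters()),      # 590,592 (pointwise FC)
      sum(p.numel() for p in pointwise.parameters()),  # 590,592 (same op as a conv)
      sum(p.numel() for p in grouped.parameters()))    # 148,224 (4x fewer weights)
print(grouped(x).shape)                                # torch.Size([2, 768, 128])
```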
Training
Pretraining Data
SqueezeBERT was pretrained using two primary datasets:
- BookCorpus: A collection of thousands of unpublished books.
- English Wikipedia: Text from English-language Wikipedia articles.
Pretraining Procedure
Pretraining uses the masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives. The model is trained with the LAMB optimizer and the following hyperparameters (a minimal optimizer setup is sketched after this list):
- Global batch size: 8192
- Learning rate: 2.5e-3
- Warmup proportion: 0.28
- Training schedule: 56,000 steps at a maximum sequence length of 128, followed by 6,000 steps at a maximum sequence length of 512
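The snippet below sketches how such an optimizer configuration might be wired up in PyTorch. The LAMB implementation from the third-party torch_optimizer package and the linear warmup schedule are assumptions for illustration; this is not the authors' training code, and the SOP head and data pipeline are omitted.

```python
# Hypothetical sketch of the pretraining optimizer setup. The LAMB optimizer
# comes from the third-party torch_optimizer package (an assumption), and the
# warmup is modeled with a linear schedule from transformers.
import torch_optimizer
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("squeezebert/squeezebert-uncased")

total_steps = 56_000                      # phase 1: max sequence length 128
warmup_steps = int(0.28 * total_steps)    # warmup proportion 0.28

optimizer = torch_optimizer.Lamb(model.parameters(), lr=2.5e-3)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# The global batch size of 8192 would be reached through data parallelism
# and/or gradient accumulation across devices.
```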
Finetuning
SqueezeBERT offers two finetuning approaches:
- Without bells and whistles: Finetuning directly on each GLUE task.
- With bells and whistles: Finetuning with distillation from a teacher model, first on MNLI and then using that checkpoint as the starting point for the other GLUE tasks.
Although the finetuning implementation with distillation is not yet available in the repository, community interest could prompt its addition.
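For readers curious what distillation-based finetuning typically involves, the sketch below shows a standard soft-label distillation loss. It is a generic illustration of the technique, not the authors' unreleased implementation; the temperature and weighting values are assumptions.

```python
# Generic knowledge-distillation loss sketch (not SqueezeBERT's released code):
# the student matches the teacher's softened logits via KL divergence and the
# ground-truth labels via cross-entropy. temperature and alpha are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.9):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradients keep a comparable magnitude.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```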
Guide: Running Locally
Basic Steps
To finetune SqueezeBERT on the MRPC task, use the following command sequence:
- Download the GLUE data:

```bash
./utils/download_glue_data.py
```

- Run the finetuning script:

```bash
python examples/text-classification/run_glue.py \
  --model_name_or_path squeezebert-base-headless \
  --task_name mrpc \
  --data_dir ./glue_data/MRPC \
  --output_dir ./models/squeezebert_mrpc \
  --overwrite_output_dir \
  --do_train \
  --do_eval \
  --num_train_epochs 10 \
  --learning_rate 3e-05 \
  --per_device_train_batch_size 16 \
  --save_steps 20000
```
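Once training finishes, the finetuned checkpoint written to ./models/squeezebert_mrpc can be loaded for inference. The snippet below is a minimal sketch assuming that output directory and the standard transformers auto classes.

```python
# Minimal inference sketch against the finetuned checkpoint; the directory
# matches --output_dir above. MRPC is a binary paraphrase task, so the two
# output classes indicate whether the sentence pair is a paraphrase.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "./models/squeezebert_mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

inputs = tokenizer(
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    return_tensors="pt",
)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # probabilities over the two MRPC classes
```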
Cloud GPUs
For optimal performance, especially during training and finetuning, consider using cloud GPU services such as AWS EC2, Google Cloud's Compute Engine, or Azure's Virtual Machines.
License
SqueezeBERT is released under the BSD license, a permissive license that allows broad use, modification, and redistribution.