SqueezeBERT MNLI
Introduction
SqueezeBERT is a pretrained model for natural language processing; this checkpoint is finetuned for Multi-Genre Natural Language Inference (MNLI). The model is pretrained with masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives. It is based on the BERT architecture but employs grouped convolutions for greater efficiency, and is reported to run noticeably faster than bert-base-uncased on mobile devices such as a Pixel 3 smartphone.
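As a quick illustration (not part of the original card), the finetuned checkpoint can be queried through the Hugging Face transformers API. This is a minimal sketch: the model identifier squeezebert/squeezebert-mnli and the label mapping read from its config are assumed from the Hub listing.

```python
# Minimal sketch: scoring an MNLI-style premise/hypothesis pair with the published
# checkpoint. Assumes the `transformers` library is installed and the
# "squeezebert/squeezebert-mnli" checkpoint is available on the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli")
model = AutoModelForSequenceClassification.from_pretrained("squeezebert/squeezebert-mnli")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Encode the sentence pair and score the three NLI classes.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1).squeeze().tolist()

# Print labels from the model's own config rather than hard-coding an order.
for idx, p in enumerate(probs):
    print(model.config.id2label[idx], round(p, 3))
```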
Architecture
SqueezeBERT retains the core architecture of BERT-base while replacing the position-wise fully-connected layers with grouped convolutions. This adjustment reduces computation and latency, which is particularly beneficial on devices with limited computational resources.
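To make the idea concrete, the sketch below (an illustration under assumed shapes, not SqueezeBERT's actual implementation) expresses a position-wise fully-connected layer as a pointwise Conv1d and shows how setting groups > 1 shrinks the parameter count; the group count of 4 is chosen only for illustration.

```python
import torch
import torch.nn as nn

hidden, seq_len = 768, 128           # BERT-base hidden size, example sequence length
x = torch.randn(1, hidden, seq_len)  # Conv1d expects (batch, channels, sequence)

# A BERT-style position-wise fully-connected layer, written as a pointwise convolution.
dense_as_conv = nn.Conv1d(hidden, hidden, kernel_size=1, groups=1)

# Grouped-convolution variant in the spirit of SqueezeBERT (groups=4 is illustrative).
grouped_conv = nn.Conv1d(hidden, hidden, kernel_size=1, groups=4)

print(dense_as_conv(x).shape, grouped_conv(x).shape)       # identical output shapes
print(sum(p.numel() for p in dense_as_conv.parameters()),  # ~590k parameters
      sum(p.numel() for p in grouped_conv.parameters()))   # roughly a quarter of that
```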
Training
SqueezeBERT is pretrained on two datasets, BookCorpus and English Wikipedia, using the MLM and SOP objectives without distillation. Key hyperparameters include a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Pretraining runs for 56k steps at a sequence length of 128, followed by 6k steps at a sequence length of 512. Two finetuning approaches are described: a straightforward approach without distillation, and another that distills from a teacher model; the latter is not implemented in the repository.
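For readability, the two-phase pretraining schedule above can be summarized as a small configuration sketch; the key names here are illustrative only and are not arguments of any particular training script.

```python
# Sketch of the two-phase pretraining schedule described above.
pretraining_common = {
    "objectives": ["mlm", "sop"],   # no distillation during pretraining
    "global_batch_size": 8192,
    "learning_rate": 2.5e-3,
    "warmup_proportion": 0.28,
}

pretraining_phases = [
    {"steps": 56_000, "max_sequence_length": 128},
    {"steps": 6_000,  "max_sequence_length": 512},
]

# Rough count of sequences seen in each phase (global batch size x steps).
for phase in pretraining_phases:
    print(phase["max_sequence_length"],
          phase["steps"] * pretraining_common["global_batch_size"])
```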
Guide: Running Locally
To finetune SqueezeBERT for tasks like MRPC text classification, follow these steps:
- Download the required GLUE data using the script:
./utils/download_glue_data.py
- Run the finetuning script:
python examples/text-classification/run_glue.py \
  --model_name_or_path squeezebert-base-headless \
  --task_name mrpc \
  --data_dir ./glue_data/MRPC \
  --output_dir ./models/squeezebert_mrpc \
  --overwrite_output_dir \
  --do_train \
  --do_eval \
  --num_train_epochs 10 \
  --learning_rate 3e-05 \
  --per_device_train_batch_size 16 \
  --save_steps 20000
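Once finetuning completes, the checkpoint written to --output_dir can be loaded back for inference. The snippet below is a minimal sketch that assumes the run above saved a model and tokenizer to ./models/squeezebert_mrpc; the example sentence pair is made up.

```python
# Minimal sketch: loading the finetuned MRPC checkpoint and scoring a sentence pair.
# Assumes the finetuning run saved both the model and its tokenizer to the directory below.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "./models/squeezebert_mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

inputs = tokenizer(
    "The company reported strong quarterly earnings.",
    "Quarterly profits at the company were strong.",
    return_tensors="pt",
)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze().tolist()
print({model.config.id2label[i]: round(p, 3) for i, p in enumerate(probs)})
```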
For faster training, consider using cloud GPUs such as AWS EC2 P3 instances or NVIDIA Tesla T4 GPUs on Google Cloud.
License
SqueezeBERT is released under the BSD license, which permits redistribution and use in source and binary forms, with or without modification, under certain conditions.