* configuration_squeezebert.py thin wrapper around bert tokenizer fix typos wip sb model code wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working set up squeezebert to use BertModelOutput when returning results. squeezebert documentation formatting allow head mask that is an array of [None, ..., None] docs docs cont'd path to vocab docs and pointers to cloud files (WIP) line length and indentation squeezebert model cards formatting of model cards untrack modeling_squeezebert_scratchpad.py update aws paths to vocab and config files get rid of stub of NSP code, and advise users to pretrain with mlm only fix rebase issues redo rebase of modeling_auto.py fix issues with code formatting more code format auto-fixes move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert tests for squeezebert modeling and tokenization fix typo move squeezebert before bert in modeling_auto.py to fix inheritance problem disable test_head_masking, since squeezebert doesn't yet implement head masking fix issues exposed by the test_modeling_squeezebert.py fix an issue exposed by test_tokenization_squeezebert.py fix issue exposed by test_modeling_squeezebert.py auto generated code style improvement issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head() update copyright resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask docs add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli autogenerated formatting tweaks integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings * tiny change to order of imports
language: en license: bsd datasets:
- bookcorpus
- wikipedia
SqueezeBERT pretrained model
This model, squeezebert-mnli, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the Multi-Genre Natural Language Inference (MNLI) dataset.
SqueezeBERT was introduced in this paper. This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with grouped convolutions.
The authors found that SqueezeBERT is 4.3x faster than bert-base-uncased on a Google Pixel 3 smartphone.
Pretraining
Pretraining data
- BookCorpus, a dataset consisting of thousands of unpublished books
- English Wikipedia
Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks. (Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
We pretrain SqueezeBERT from scratch (without distillation) using the LAMB optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on a MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the SqueezeBERT paper. Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
This model, squeezebert/squeezebert-mnli, is the "trained with bells and whistles" MNLI-finetuned SqueezeBERT model.
How to finetune
To try finetuning SqueezeBERT on the MRPC text classification task, you can run the following command:
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert-base-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
BibTeX entry and citation info
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}