Files

Forrest Iandola 02ef825be2 SqueezeBERT architecture (#7083 )

* configuration_squeezebert.py

thin wrapper around bert tokenizer

fix typos

wip sb model code

wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working

set up squeezebert to use BertModelOutput when returning results.

squeezebert documentation

formatting

allow head mask that is an array of [None, ..., None]

docs

docs cont'd

path to vocab

docs and pointers to cloud files (WIP)

line length and indentation

squeezebert model cards

formatting of model cards

untrack modeling_squeezebert_scratchpad.py

update aws paths to vocab and config files

get rid of stub of NSP code, and advise users to pretrain with mlm only

fix rebase issues

redo rebase of modeling_auto.py

fix issues with code formatting

more code format auto-fixes

move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert

tests for squeezebert modeling and tokenization

fix typo

move squeezebert before bert in modeling_auto.py to fix inheritance problem

disable test_head_masking, since squeezebert doesn't yet implement head masking

fix issues exposed by the test_modeling_squeezebert.py

fix an issue exposed by test_tokenization_squeezebert.py

fix issue exposed by test_modeling_squeezebert.py

auto generated code style improvement

issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head()

update copyright

resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask

docs

add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli

autogenerated formatting tweaks

integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings

* tiny change to order of imports

2020-10-05 04:25:43 -04:00

README.md

SqueezeBERT architecture (#7083 )

2020-10-05 04:25:43 -04:00

README.md

language: en license: bsd datasets:

bookcorpus
wikipedia

SqueezeBERT pretrained model

This model, squeezebert-mnli, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the Multi-Genre Natural Language Inference (MNLI) dataset. SqueezeBERT was introduced in this paper. This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with grouped convolutions. The authors found that SqueezeBERT is 4.3x faster than bert-base-uncased on a Google Pixel 3 smartphone.

Pretraining

Pretraining data

BookCorpus, a dataset consisting of thousands of unpublished books
English Wikipedia

Pretraining procedure

The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks. (Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)

From the SqueezeBERT paper:

We pretrain SqueezeBERT from scratch (without distillation) using the LAMB optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.

Finetuning

The SqueezeBERT paper presents 2 approaches to finetuning the model:

"finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
"finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on a MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.

A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the SqueezeBERT paper. Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.

This model, squeezebert/squeezebert-mnli, is the "trained with bells and whistles" MNLI-finetuned SqueezeBERT model.

How to finetune

To try finetuning SqueezeBERT on the MRPC text classification task, you can run the following command:

./utils/download_glue_data.py

python examples/text-classification/run_glue.py \
    --model_name_or_path squeezebert-base-headless \
    --task_name mrpc \
    --data_dir ./glue_data/MRPC \
    --output_dir ./models/squeezebert_mrpc \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --learning_rate 3e-05 \
    --per_device_train_batch_size 16 \
    --save_steps 20000

BibTeX entry and citation info

@article{2020_SqueezeBERT,
     author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
     title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
     journal = {arXiv:2006.11316},
     year = {2020}
}