From 866a8ccabb74231fb3f5ea547f213b1d5cffaf8d Mon Sep 17 00:00:00 2001
From: Kevin Canwen Xu
Date: Mon, 22 Jun 2020 21:48:14 +0800
Subject: [PATCH] Add model cards for Microsoft's MiniLM (#5178)

* Add model cards for Microsoft's MiniLM

* XLMRobertaTokenizer

* format

* Add thumbnail

* finishing up
---
 .../MiniLM-L12-H384-uncased/README.md         | 43 +++++++++
 .../Multilingual-MiniLM-L12-H384/README.md    | 89 +++++++++++++++++++
 2 files changed, 132 insertions(+)
 create mode 100644 model_cards/microsoft/MiniLM-L12-H384-uncased/README.md
 create mode 100644 model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md

diff --git a/model_cards/microsoft/MiniLM-L12-H384-uncased/README.md b/model_cards/microsoft/MiniLM-L12-H384-uncased/README.md
new file mode 100644
index 0000000000..f15f7d6de9
--- /dev/null
+++ b/model_cards/microsoft/MiniLM-L12-H384-uncased/README.md
@@ -0,0 +1,43 @@
+---
+thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
+tags:
+- text-classification
+license: mit
+---
+
+## MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
+
+MiniLM is a distilled model from the paper "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)".
+
+For preprocessing, training, and full details of MiniLM, see the [original MiniLM repository](https://github.com/microsoft/unilm/blob/master/minilm/).
+
+Please note: This checkpoint can be used as a drop-in replacement for BERT, and it needs to be fine-tuned before use!
+
+### English Pre-trained Models
+We release the **uncased** **12**-layer model with **384** hidden size distilled from an in-house pre-trained [UniLM v2](/unilm) model in BERT-Base size.
+
+- MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base
+
+#### Fine-tuning on NLU tasks
+
+We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.
+
+| Model | #Param | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE | MRPC | QQP |
+|---------------------------------------------------|--------|-----------|--------|-------|------|------|------|------|------|
+| [BERT-Base](https://arxiv.org/pdf/1810.04805.pdf) | 109M | 76.8 | 84.5 | 93.2 | 91.7 | 58.9 | 68.6 | 87.3 | 91.3 |
+| **MiniLM-L12xH384** | 33M | 81.7 | 85.7 | 93.0 | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 |
+
+### Citation
+
+If you find MiniLM useful in your research, please cite the following paper:
+
+``` latex
+@misc{wang2020minilm,
+    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
+    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
+    year={2020},
+    eprint={2002.10957},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
diff --git a/model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md b/model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md
new file mode 100644
index 0000000000..f1a520217c
--- /dev/null
+++ b/model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md
@@ -0,0 +1,89 @@
+---
+thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
+tags:
+- text-classification
+license: mit
+---
+
+## MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
+
+MiniLM is a distilled model from the paper "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)".
+
+For preprocessing, training, and full details of MiniLM, see the [original MiniLM repository](https://github.com/microsoft/unilm/blob/master/minilm/).
+
+Please note: This checkpoint uses `BertModel` with `XLMRobertaTokenizer`, so `AutoTokenizer` won't work with this checkpoint!
+
+### Multilingual Pretrained Model
+- Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters
+
+Multilingual MiniLM uses the same tokenizer as XLM-R, but its Transformer architecture is the same as BERT's. We provide the fine-tuning code on XNLI based on [huggingface/transformers](https://github.com/huggingface/transformers). Please replace `run_xnli.py` in transformers with [ours](https://github.com/microsoft/unilm/blob/master/minilm/examples/run_xnli.py) to fine-tune multilingual MiniLM.
+
+We evaluate the multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).
+
+#### Cross-Lingual Natural Language Inference - [XNLI](https://arxiv.org/abs/1809.05053)
+
+We evaluate our model on cross-lingual transfer from English to other languages. Following [Conneau et al. (2019)](https://arxiv.org/abs/1911.02116), we select the best single model on the joint dev set of all the languages.
+
+| Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
+|---------------------------------------------------------------------------------------------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
+| [mBERT](https://github.com/google-research/bert) | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
+| [XLM-100](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
+| [XLM-R Base](https://arxiv.org/abs/1911.02116) | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
+| **mMiniLM-L12xH384** | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
+
+This example code fine-tunes **12**-layer multilingual MiniLM on XNLI.
+
+```bash
+# run fine-tuning on XNLI
+DATA_DIR=/{path_of_data}/
+OUTPUT_DIR=/{path_of_fine-tuned_model}/
+MODEL_PATH=/{path_of_pre-trained_model}/
+
+python ./examples/run_xnli.py --model_type minilm \
+    --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
+    --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
+    --tokenizer_name xlm-roberta-base \
+    --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
+    --do_train \
+    --do_eval \
+    --max_seq_length 128 \
+    --per_gpu_train_batch_size 128 \
+    --learning_rate 5e-5 \
+    --num_train_epochs 5 \
+    --per_gpu_eval_batch_size 32 \
+    --weight_decay 0.001 \
+    --warmup_steps 500 \
+    --save_steps 1500 \
+    --logging_steps 1500 \
+    --eval_all_checkpoints \
+    --language en \
+    --fp16 \
+    --fp16_opt_level O2
+```
+
+#### Cross-Lingual Question Answering - [MLQA](https://arxiv.org/abs/1910.07475)
+
+Following [Lewis et al. (2019b)](https://arxiv.org/abs/1910.07475), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.
+
+| Model (F1 score) | #Layers | #Hidden | #Transformer Parameters | Average | en | es | de | ar | hi | vi | zh |
+|--------------------------------------------------------------------------------------------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|
+| [mBERT](https://github.com/google-research/bert) | 12 | 768 | 85M | 57.7 | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
+| [XLM-15](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 12 | 1024 | 151M | 61.6 | 74.9 | 68.0 | 62.2 | 54.8 | 48.8 | 61.4 | 61.1 |
+| [XLM-R Base](https://arxiv.org/abs/1911.02116) (Reported) | 12 | 768 | 85M | 62.9 | 77.8 | 67.2 | 60.8 | 53.0 | 57.9 | 63.1 | 60.2 |
+| [XLM-R Base](https://arxiv.org/abs/1911.02116) (Our fine-tuned) | 12 | 768 | 85M | 64.9 | 80.3 | 67.0 | 62.7 | 55.0 | 60.4 | 66.5 | 62.3 |
+| **mMiniLM-L12xH384** | 12 | 384 | 21M | 63.2 | 79.4 | 66.1 | 61.2 | 54.9 | 58.5 | 63.1 | 59.0 |
+
+### Citation
+
+If you find MiniLM useful in your research, please cite the following paper:
+
+``` latex
+@misc{wang2020minilm,
+    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
+    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
+    year={2020},
+    eprint={2002.10957},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
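
The parameter figures quoted in the multilingual model card (21M Transformer parameters, 96M embedding parameters, and 85M for the BERT-Base-sized baselines in the tables) can be roughly reproduced with simple arithmetic. The sketch below is not taken from the MiniLM repository. It assumes a standard BERT-style encoder layer, a feed-forward size of four times the hidden size, and a vocabulary of about 250,000 tokens for the XLM-R tokenizer; the function names are hypothetical and exist only for illustration.

```python
def transformer_layer_params(hidden: int, intermediate: int) -> int:
    """Count weights and biases in one BERT-style encoder layer."""
    # Query, key, value, and attention-output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections (hidden -> intermediate -> hidden), with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a bias vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


def encoder_params(layers: int, hidden: int, intermediate: int) -> int:
    """Count parameters in the stacked encoder layers, excluding embeddings."""
    return layers * transformer_layer_params(hidden, intermediate)


def token_embedding_params(vocab_size: int, hidden: int) -> int:
    """Count parameters in the token embedding matrix only."""
    return vocab_size * hidden


if __name__ == "__main__":
    # Assumption: intermediate size is 4x hidden, the usual BERT ratio.
    minilm = encoder_params(layers=12, hidden=384, intermediate=4 * 384)
    bert_base = encoder_params(layers=12, hidden=768, intermediate=4 * 768)
    # Assumption: the XLM-R vocabulary holds roughly 250,000 tokens.
    embeddings = token_embedding_params(vocab_size=250_000, hidden=384)

    print(f"12x384 encoder:   {minilm / 1e6:.1f}M parameters")
    print(f"12x768 encoder:   {bert_base / 1e6:.1f}M parameters")
    print(f"token embeddings: {embeddings / 1e6:.1f}M parameters")
```

Under these assumptions, most of the multilingual checkpoint's size sits in the token embedding matrix rather than in the Transformer layers, which is consistent with the card reporting the two figures separately.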