update readmes
This commit is contained in:
@@ -1,23 +1,25 @@
|
||||
# DilBERT
|
||||
|
||||
This section contains examples showcasing how to use DilBERT and the original code to train DilBERT.
|
||||
This folder contains the original code used to train DilBERT as well as examples showcasing how to use DilBERT.
|
||||
|
||||
## What is DilBERT?
|
||||
## What is DilBERT
|
||||
|
||||
DilBERT stands for DistiLlation-BERT. DilBERT is a small, fast, cheap and light Transformer model: it has 40% less parameters than `bert-base-uncased`, runs 40% faster while preserving 96% on the language understanding capabilties (as shown on the GLUE benchmark). DilBERT is trained by distillation: a technique to compress a large model called the teacher into a smaller model called the student. By applying this compression technique, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model, while being lighter, smaller and faster. Thus, DilBERT can be an interesting solution to put large Transformer model into production.
|
||||
DilBERT stands for Distillated-BERT. DilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on the GLUE language understanding benchmark. DilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
|
||||
|
||||
For more information on DilBERT, we refer to [our blog post](TODO(Link)).
|
||||
For more information on DilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
|
||||
).
|
||||
|
||||
## How to use DilBERT?
|
||||
## How to use DilBERT
|
||||
|
||||
PyTorch-Transformers includes two pre-trained models:
|
||||
- `dilbert-base-uncased`: The language model pretrained by distillation under the supervision of `bert-base-uncased`. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
|
||||
- `dilbert-base-uncased-distilled-squad`: The `dilbert-base-uncased` finetune by distillation on SQuAD. It reaches a F1 score of 86.2 on the dev set, while `bert-base-uncased` reaches a 88.5 F1 score.
|
||||
PyTorch-Transformers includes two pre-trained DilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DilBERT):
|
||||
|
||||
Using DilBERT is really similar to using BERT. DilBERT uses the same tokenizer as BERT and more specifically `bert-base-uncased`. You should only use this tookenizer as the only pre-trained weights available for now are supervised by `bert-base-uncased`.
|
||||
- `dilbert-base-uncased`: DilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
|
||||
- `dilbert-base-uncased-distilled-squad`: A finetuned version of `dilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.2 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
|
||||
|
||||
Using DilBERT is very similar to using BERT. DilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DilBertTokenizer` name to have a consistent naming between the library models.
|
||||
|
||||
```python
|
||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||
tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
|
||||
model = DilBertModel.from_pretrained('dilbert-base-uncased')
|
||||
|
||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
|
||||
@@ -25,17 +27,17 @@ outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
```
|
||||
|
||||
## How to train DilBERT?
|
||||
## How to train DilBERT
|
||||
|
||||
In the following, we will explain how you can train your own compressed model.
|
||||
|
||||
### A. Preparing the data
|
||||
|
||||
The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as BERT).
|
||||
The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as the English version of BERT).
|
||||
|
||||
To avoid processing the data several time, we do it once and for all before the training. From now on, will suppose that you have a text file `dump.txt` which contains one sequence per line (a sequence being composed of one of several coherent sentences).
|
||||
|
||||
First, we will binarize the data: we tokenize the data and associate each token to an id.
|
||||
First, we will binarize the data, i.e. tokenize the data and convert each token in an index in our model's vocabulary.
|
||||
|
||||
```bash
|
||||
python scripts/binarized_data.py \
|
||||
@@ -44,7 +46,7 @@ python scripts/binarized_data.py \
|
||||
--dump_file data/binarized_text
|
||||
```
|
||||
|
||||
In the masked language modeling loss, we follow [XLM](https://github.com/facebookresearch/XLM) and smooth the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
|
||||
Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
|
||||
|
||||
```bash
|
||||
python scripts/token_counts.py \
|
||||
@@ -54,19 +56,20 @@ python scripts/token_counts.py \
|
||||
|
||||
### B. Training
|
||||
|
||||
Launching a distillation is really simple once you have setup the data:
|
||||
Training with distillation is really simple once you have pre-processed the data:
|
||||
|
||||
```bash
|
||||
python train.py \
|
||||
--dump_path serialization_dir/my_first_training \
|
||||
--data_file data/binarized_text.bert-base-uncased.pickle \
|
||||
--token_counts data/token_counts.bert-base-uncased.pickle \
|
||||
--force # It overwrites the `dump_path` if it already exists.
|
||||
```
|
||||
--force # overwrites the `dump_path` if it already exists.
|
||||
```
|
||||
|
||||
By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please refer to `train.py`.
|
||||
By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.
|
||||
|
||||
We highly encourage you to distributed training for training DilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
|
||||
|
||||
We also highly encourage using distributed training. Here's an example that launchs a distributed traininng on a single node with 4 GPUs:
|
||||
```bash
|
||||
export NODE_RANK=0
|
||||
export N_NODES=1
|
||||
@@ -92,6 +95,6 @@ python -m torch.distributed.launch \
|
||||
--dump_path serialization_dir/with_transform/last_word
|
||||
```
|
||||
|
||||
**Tips** Start the distillation from some sort of structure initialization is crucial to reach a good final performance. In our experiments, we use initialization from some of the layers of the teacher itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and add `from_pretrained_weights` and `from_pretrained_config` when launching your distillation!
|
||||
**Tips** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and use `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!
|
||||
|
||||
Happy distillation!
|
||||
|
||||
Reference in New Issue
Block a user