dilbert -> distilbert

2019-08-28 13:59:42 +02:00
parent c9bce1811c
commit 912a377e90
15 changed files with 144 additions and 144 deletions
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -1,33 +1,33 @@
-# DilBERT
+# DistilBERT

-This folder contains the original code used to train DilBERT as well as examples showcasing how to use DilBERT.
+This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.

-## What is DilBERT
+## What is DistilBERT

-DilBERT stands for Distillated-BERT. DilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on the GLUE language understanding benchmark. DilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale pretrained Transformer models into production.

-For more information on DilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
+For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
 ).

-## How to use DilBERT
+## How to use DistilBERT

-PyTorch-Transformers includes two pre-trained DilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DilBERT):
+PyTorch-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):

- `dilbert-base-uncased`: DilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
- `dilbert-base-uncased-distilled-squad`: A finetuned version of `dilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.2 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
+- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, a hidden dimension of 768 and 12 heads, totaling 66M parameters.
+- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.2 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an F1 score of 88.5).

-Using DilBERT is very similar to using BERT. DilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DilBertTokenizer` name to have a consistent naming between the library models.
+Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, although we provide a link to this tokenizer under the `DistilBertTokenizer` name to keep naming consistent across the library's models.

 ```python
-tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
-model = DilBertModel.from_pretrained('dilbert-base-uncased')
+tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+model = DistilBertModel.from_pretrained('distilbert-base-uncased')

 input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
 outputs = model(input_ids)
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```

-## How to train DilBERT
+## How to train DistilBERT

 In the following, we will explain how you can train your own compressed model.

@@ -68,7 +68,7 @@ python train.py \

 By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.

-We highly encourage you to distributed training for training DilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
+We highly encourage you to use distributed training to train DistilBert, as the training corpus is quite large. Here's an example that runs distributed training on a single node with 4 GPUs:

 ```bash
 export NODE_RANK=0
--- a/examples/distillation/dataset.py
+++ b/examples/distillation/dataset.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Dataloaders to train DilBERT.
+Dataloaders to train DistilBERT.
 """
 from typing import List
 import math
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-The distiller to distil DilBERT.
+The distiller to distil DistilBERT.
 """
 import os
 import math
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DilBERT.
+Preprocessing script before training DistilBERT.
 """
 import argparse
 import pickle
--- a/examples/distillation/scripts/extract_for_distil.py
+++ b/examples/distillation/scripts/extract_for_distil.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DilBERT.
+Preprocessing script before training DistilBERT.
 """
 from pytorch_transformers import BertForPreTraining
 import torch
@@ -33,32 +33,32 @@ if __name__ == '__main__':
    compressed_sd = {}

    for w in ['word_embeddings', 'position_embeddings']:
-        compressed_sd[f'dilbert.embeddings.{w}.weight'] = \
+        compressed_sd[f'distilbert.embeddings.{w}.weight'] = \
            state_dict[f'bert.embeddings.{w}.weight']
    for w in ['weight', 'bias']:
-        compressed_sd[f'dilbert.embeddings.LayerNorm.{w}'] = \
+        compressed_sd[f'distilbert.embeddings.LayerNorm.{w}'] = \
            state_dict[f'bert.embeddings.LayerNorm.{w}']

    std_idx = 0
    for teacher_idx in [0, 2, 4, 7, 9, 11]:
        for w in ['weight', 'bias']:
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.query.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.key.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.value.{w}']

-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.dense.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}']

-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.intermediate.dense.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.output.dense.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
                state_dict[f'bert.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
        std_idx += 1

--- a/examples/distillation/scripts/token_counts.py
+++ b/examples/distillation/scripts/token_counts.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DilBERT.
+Preprocessing script before training DistilBERT.
 """
 from collections import Counter
 import argparse
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Training DilBERT.
+Training DistilBERT.
 """
 import os
 import argparse
@@ -24,7 +24,7 @@ import numpy as np
 import torch

 from pytorch_transformers import BertTokenizer, BertForMaskedLM
-from pytorch_transformers import DilBertForMaskedLM, DilBertConfig
+from pytorch_transformers import DistilBertForMaskedLM, DistilBertConfig

 from distiller import Distiller
 from utils import git_log, logger, init_gpu_params, set_seed
@@ -201,13 +201,13 @@ def main():
        assert os.path.isfile(os.path.join(args.from_pretrained_config))
        logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
        logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
-        stu_architecture_config = DilBertConfig.from_json_file(args.from_pretrained_config)
-        student = DilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
+        stu_architecture_config = DistilBertConfig.from_json_file(args.from_pretrained_config)
+        student = DistilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
                                                     config=stu_architecture_config)
    else:
        args.vocab_size_or_config_json_file = args.vocab_size
-        stu_architecture_config = DilBertConfig(**vars(args))
-        student = DilBertForMaskedLM(stu_architecture_config)
+        stu_architecture_config = DistilBertConfig(**vars(args))
+        student = DistilBertForMaskedLM(stu_architecture_config)


    if args.n_gpu > 0:
--- a/examples/distillation/utils.py
+++ b/examples/distillation/utils.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Utils to train DilBERT.
+Utils to train DistilBERT.
 """
 import git
 import json
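
The teacher-to-student weight copy in `scripts/extract_for_distil.py` (renamed in this commit from the `dilbert.` to the `distilbert.` key prefix) can be read as a pure key mapping. The following is a minimal sketch, not the script itself: it covers only the keys visible in the hunk above, and the names `TEACHER_LAYERS`, `LAYER_PARAMS` and `build_key_map` are hypothetical helpers introduced here for illustration. It builds the student-key to teacher-key table without loading any weights.

```python
from typing import Dict, List, Tuple

# Teacher (BERT) layers kept for the 6-layer student, as listed in the hunk.
TEACHER_LAYERS: List[int] = [0, 2, 4, 7, 9, 11]

# (student sub-module, teacher sub-module) pairs, as listed in the hunk.
LAYER_PARAMS: List[Tuple[str, str]] = [
    ("attention.q_lin", "attention.self.query"),
    ("attention.k_lin", "attention.self.key"),
    ("attention.v_lin", "attention.self.value"),
    ("attention.out_lin", "attention.output.dense"),
    ("sa_layer_norm", "attention.output.LayerNorm"),
    ("ffn.lin1", "intermediate.dense"),
    ("ffn.lin2", "output.dense"),
    ("output_layer_norm", "output.LayerNorm"),
]


def build_key_map(prefix: str = "distilbert") -> Dict[str, str]:
    """Map each student state-dict key to the teacher key it is copied from."""
    key_map: Dict[str, str] = {}

    # Embedding matrices and the embedding LayerNorm are copied one-to-one.
    for w in ["word_embeddings", "position_embeddings"]:
        key_map[f"{prefix}.embeddings.{w}.weight"] = f"bert.embeddings.{w}.weight"
    for w in ["weight", "bias"]:
        key_map[f"{prefix}.embeddings.LayerNorm.{w}"] = f"bert.embeddings.LayerNorm.{w}"

    # Student layer i is initialised from the i-th selected teacher layer.
    for std_idx, teacher_idx in enumerate(TEACHER_LAYERS):
        for student_name, teacher_name in LAYER_PARAMS:
            for w in ["weight", "bias"]:
                key_map[f"{prefix}.transformer.layer.{std_idx}.{student_name}.{w}"] = (
                    f"bert.encoder.layer.{teacher_idx}.{teacher_name}.{w}"
                )
    return key_map


if __name__ == "__main__":
    for student_key, teacher_key in list(build_key_map().items())[:4]:
        print(f"{student_key} <- {teacher_key}")
```

Given a teacher `state_dict`, the compressed dictionary would then be `{s: state_dict[t] for s, t in build_key_map().items()}`. Passing `prefix="dilbert"` reproduces the pre-rename keys, which makes the effect of this commit on saved checkpoints easy to see.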