Model templates encoder only (#8509)

* Model templates
* TensorFlow
* Remove pooler
* CI
* Tokenizer + Refactoring
* Encoder-Decoder
* Let's go testing
* Encoder-Decoder in TF
* Let's go testing in TF
* Documentation
* README
* Fixes
* Better names
* Style
* Update docs
* Choose to skip either TF or PT
* Code quality fixes
* Add to testing suite
* Update file path
* Cookiecutter path
* Update `transformers` path
* Handle rebasing
* Remove seq2seq from model templates
* Remove s2s config
* Apply Sylvain and Patrick comments
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Last fixes from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-13 11:59:30 -05:00
parent 42e2d02e44
commit 826f04576f
29 changed files with 3315 additions and 1977 deletions
--- a/templates/adding_a_new_model/README.md
+++ b/templates/adding_a_new_model/README.md
@@ -1,89 +1,104 @@
-# How to add a new model in 🤗 Transformers
+# Using `cookiecutter` to generate models

-This folder describes the process to add a new model in 🤗 Transformers and provide templates for the required files.
+This folder contains templates to generate new models that fit the current API and pass all tests. It generates
+models in both PyTorch and TensorFlow, completes the `__init__.py` and auto-modeling files, and creates the
+documentation.

-The library is designed to incorporate a variety of models and code bases. As such the process for adding a new model
-usually mostly consists in copy-pasting to relevant original code in the various sections of the templates included in
-the present repository.
+## Usage

-One important point though is that the library has the following goals impacting the way models are incorporated:
+Using the `cookiecutter` utility requires all the `dev` dependencies to be installed. First, clone the
+repository and install it in your environment:

- One specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus
-  often have to be slightly adapted to allow for running in the python interpreter.
- the package is also designed to be as self-consistent and with a small and reliable set of packages dependencies. In
-  consequence, additional dependencies are usually not allowed when adding a model but can be allowed for the
-  inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities include
-  `sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before adding a
-  new one.
+```shell script
+git clone https://github.com/huggingface/transformers
+cd transformers
+pip install -e ".[dev]"
+```

-For a quick overview of the general philosphy of the library and its organization, please check the
-[QuickStart section of the documentation](https://huggingface.co/transformers/philosophy.html).
+Once the installation is done, you can use the CLI command `add-new-model` to generate your models:

-# Typical workflow for including a model
+```shell script
+transformers-cli add-new-model
+```

-Here an overview of the general workflow:
+This should launch `cookiecutter`, which prompts you to fill in the configuration.

- [ ] Add model/configuration/tokenization classes.
- [ ] Add conversion scripts.
- [ ] Add tests and a @slow integration test.
- [ ] Document your model.
- [ ] Finalize.
+The `modelname` should follow the model's plain-text casing, e.g., BERT, RoBERTa, DeBERTa.
+```
+modelname [<ModelNAME>]:
+uppercase_modelname [<MODEL_NAME>]: 
+lowercase_modelname [<model_name>]: 
+camelcase_modelname [<ModelName>]: 
+```
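The relationship between the four casing prompts above can be illustrated with a minimal Python sketch. The helper `casing_variants` is hypothetical and not part of the templates: `cookiecutter` asks for each value separately, and the conventions assumed here (underscores between words, only the first letter of each word capitalised in the camel-case form) are assumptions rather than rules taken from this README.

```python
import re


def casing_variants(modelname: str) -> dict:
    """Suggest values for the casing prompts from a plain-text model name.

    Hypothetical helper for illustration only; the template prompts
    for each of these values explicitly.
    """
    # Split on whitespace, hyphens and underscores, dropping empty pieces.
    words = [w for w in re.split(r"[\s\-_]+", modelname.strip()) if w]
    return {
        "modelname": modelname,
        "uppercase_modelname": "_".join(w.upper() for w in words),
        "lowercase_modelname": "_".join(w.lower() for w in words),
        # str.capitalize() lowercases everything after the first letter.
        "camelcase_modelname": "".join(w.capitalize() for w in words),
    }


if __name__ == "__main__":
    for key, value in casing_variants("Big Bird").items():
        print(f"{key}: {value}")
```

Treat the returned values as suggestions to review before typing them into the prompts.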

-Let's detail what should be done at each step.
+Fill in the `authors` with your team members:
+```
+authors [The HuggingFace Team]: 
+```

-## Adding model/configuration/tokenization classes
+The checkpoint identifier is the checkpoint that will be used in the examples across the files. Use the name
+as it will appear on the model hub, and do not forget to include the organisation.
+```
+checkpoint_identifier [organisation/<model_name>-base-cased]: 
+```

-Here is the workflow for adding model/configuration/tokenization classes:
+The tokenizer should either be based on BERT, if it behaves exactly like the BERT tokenizer, or standalone otherwise.
+```
+Select tokenizer_type:
+1 - Based on BERT
+2 - Standalone
+Choose from 1, 2 [1]: 
+```
+<!---
+Choose if your model is an encoder-decoder, or an encoder-only architecture.

- [ ] Copy the python files from the present folder to the main folder and rename them, replacing `xxx` with your model
-  name.
- [ ] Edit the files to replace `XXX` (with various casing) with your model name.
- [ ] Copy-paste or create a simple configuration class for your model in the `configuration_...` file.
- [ ] Copy-paste or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0).
- [ ] Copy-paste or create a tokenizer class for your model in the `tokenization_...` file.
+If your model is an encoder-only architecture, the generated architecture will be based on the BERT model. 
+If your model is an encoder-decoder architecture, the generated architecture will be based on the BART model. You can,
+of course, edit the files once the generation is complete.
+```
+Select is_encoder_decoder_model:
+1 - True
+2 - False
+Choose from 1, 2 [1]: 
+```
+-->

-## Adding conversion scripts
+Once the command has finished, you should have a total of 7 new files spread across the repository:
+```
+docs/source/model_doc/<model_name>.rst
+src/transformers/configuration_<model_name>.py
+src/transformers/modeling_<model_name>.py
+src/transformers/modeling_tf_<model_name>.py
+src/transformers/tokenization_<model_name>.py
+tests/test_modeling_<model_name>.py
+tests/test_modeling_tf_<model_name>.py
+```
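To confirm that all seven files listed above were generated, a small Python check along these lines may help. It is a sketch under stated assumptions: the path patterns are copied from the list above, while the function names `expected_files` and `missing_files` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Path patterns copied from the list above; {name} stands for <model_name>.
GENERATED_FILE_PATTERNS = [
    "docs/source/model_doc/{name}.rst",
    "src/transformers/configuration_{name}.py",
    "src/transformers/modeling_{name}.py",
    "src/transformers/modeling_tf_{name}.py",
    "src/transformers/tokenization_{name}.py",
    "tests/test_modeling_{name}.py",
    "tests/test_modeling_tf_{name}.py",
]


def expected_files(lowercase_modelname, repo_root="."):
    """Return the paths the generation step is expected to create."""
    root = Path(repo_root)
    return [root / p.format(name=lowercase_modelname) for p in GENERATED_FILE_PATTERNS]


def missing_files(lowercase_modelname, repo_root="."):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(lowercase_modelname, repo_root) if not p.is_file()]


if __name__ == "__main__":
    # "my_model" is a placeholder for the lowercase model name you entered.
    for path in missing_files("my_model"):
        print(f"missing: {path}")
```

Run it from the root of the cloned repository; an empty output means every expected file is present.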

-Here is the workflow for the conversion scripts:
+You can run the tests to ensure that they all pass:

- [ ] Copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] Edit this script to convert your original checkpoint weights to the current pytorch ones.
+```
+python -m pytest ./tests/test_*<model_name>*.py
+```

-## Adding tests:
+Feel free to modify each file to mimic the behavior of your model. 

-Here is the workflow for the adding tests:
+⚠️ You should be careful about the classes preceded by the following line:

- [ ] Copy the python files from the `tests` sub-folder of the present folder to the `tests` subfolder of the main
-  folder and rename them, replacing `xxx` with your model name.
- [ ] Edit the tests files to replace `XXX` (with various casing) with your model name.
- [ ] Edit the tests code as needed.
+```python
+# Copied from transformers.[...]
+```

-## Documenting your model:
+This line ensures that the copy does not diverge from the source. If it *should* diverge, because the implementation
+is different, this line needs to be deleted. If you don't delete this line and run `make fix-copies`,
+your changes will be overwritten.
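Before editing a generated file, it can be useful to list which definitions carry the marker described above. The following Python sketch scans source text for `# Copied from transformers.` comments and reports the class or function that follows each one. The function name `find_copied_definitions` and the sample input are hypothetical; this is not how `make fix-copies` itself is implemented.

```python
import re

# Matches the marker comment and captures the dotted source path after it.
COPIED_FROM = re.compile(r"^\s*# Copied from (transformers\.\S+)")
# Matches the start of a class or function definition and captures its name.
DEFINITION = re.compile(r"^\s*(?:class|def)\s+(\w+)")


def find_copied_definitions(source: str):
    """Return (definition_name, copied_from) pairs found in `source`."""
    results = []
    pending = None  # source path of the most recent marker, if any
    for line in source.splitlines():
        marker = COPIED_FROM.match(line)
        if marker:
            pending = marker.group(1)
            continue
        if pending is None:
            continue
        definition = DEFINITION.match(line)
        stripped = line.strip()
        if definition:
            results.append((definition.group(1), pending))
            pending = None
        elif stripped and not stripped.startswith("@"):
            # Something other than a decorator or blank line follows the
            # marker, so it is not attached to a definition.
            pending = None
    return results


if __name__ == "__main__":
    # Hypothetical sample text; in practice, read a generated modeling file.
    sample = (
        "# Copied from transformers.modeling_bert.BertSelfOutput with Bert->MyModel\n"
        "class MyModelSelfOutput:\n"
        "    pass\n"
    )
    for name, origin in find_copied_definitions(sample):
        print(f"{name} is kept in sync with {origin}")
```

Any definition reported here will be overwritten by `make fix-copies` unless its marker line is deleted first.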

-Here is the workflow for documentation:
+Once you have edited the files to fit your architecture, re-run the tests (editing them if a change is needed) to
+make sure everything works as expected.

- [ ] Make sure all your arguments are properly documented in your configuration and tokenizer.
- [ ] Most of the documentation of the models is automatically generated, you just have to make sure that
-  `XXX_START_DOCSTRING` contains an introduction to the model you're adding and a link to the original
-  article and that `XXX_INPUTS_DOCSTRING` contains all the inputs of your model.
- [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.
+Once the files are generated and you are happy with your changes, here's a checklist to ensure that your contribution
+will be merged quickly:

-Make sure to check you have no sphinx warnings when building the documentation locally and follow our
-[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
-
-## Final steps
-
-You can then finish the addition step by adding imports for your classes in the common files:
-
- [ ] Add import for all the relevant classes in `__init__.py`.
- [ ] Add your configuration in `configuration_auto.py`.
- [ ] Add your PyTorch and TF 2.0 model respectively in `modeling_auto.py` and `modeling_tf_auto.py`.
- [ ] Add your tokenizer in `tokenization_auto.py`.
- [ ] Add a link to your conversion script in the main conversion utility (in `commands/convert.py`)
- [ ] Edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py`
-  file.
- [ ] Add a mention of your model in the doc: `README.md` and the documentation itself
-  in `docs/source/pretrained_models.rst`. Rune `make fix-copies` to update `docs/source/index.rst` with your changes.
- [ ] Upload the pretrained weights, configurations and vocabulary files.
- [ ] Create model card(s) for your models on huggingface.co. For those last two steps, check the
-  [model sharing documentation](https://huggingface.co/transformers/model_sharing.html).
+- You should run the `make fixup` utility to fix the style of the files and to ensure the code quality meets the
+  library's standards.
+- You should complete the documentation file (`docs/source/model_doc/<model_name>.rst`) so that your model may be
+  usable.