Model templates encoder only (#8509)

* Model templates

* TensorFlow

* Remove pooler

* CI

* Tokenizer + Refactoring

* Encoder-Decoder

* Let's go testing

* Encoder-Decoder in TF

* Let's go testing in TF

* Documentation

* README

* Fixes

* Better names

* Style

* Update docs

* Choose to skip either TF or PT

* Code quality fixes

* Add to testing suite

* Update file path

* Cookiecutter path

* Update `transformers` path

* Handle rebasing

* Remove seq2seq from model templates

* Remove s2s config

* Apply Sylvain and Patrick comments

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Last fixes from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
Lysandre Debut
2020-11-13 11:59:30 -05:00
committed by GitHub
parent 42e2d02e44
commit 826f04576f
29 changed files with 3315 additions and 1977 deletions

@@ -1,89 +1,104 @@
# How to add a new model in 🤗 Transformers
# Using `cookiecutter` to generate models
This folder describes the process to add a new model in 🤗 Transformers and provides templates for the required files.
This folder contains templates to generate new models that fit the current API and pass all tests. It generates
models in both PyTorch and TensorFlow, completes the `__init__.py` and auto-modeling files, and creates the
documentation.
The library is designed to incorporate a variety of models and code bases. As such, the process for adding a new model
mostly consists of copy-pasting the relevant original code into the various sections of the templates included in
the present repository.
## Usage
One important point though is that the library has the following goals impacting the way models are incorporated:
Using the `cookiecutter` utility requires all the `dev` dependencies to be installed. Let's first clone the
repository and install it in our environment:
- One specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus
often has to be slightly adapted to allow for running in the python interpreter.
- The package is also designed to be as self-consistent as possible, with a small and reliable set of package dependencies. In
consequence, additional dependencies are usually not allowed when adding a model but can be allowed for the
inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities include
`sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before adding a
new one.
```shell script
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e ".[dev]"
```
For a quick overview of the general philosophy of the library and its organization, please check the
[QuickStart section of the documentation](https://huggingface.co/transformers/philosophy.html).
Once the installation is done, you can use the CLI command `add-new-model` to generate your models:
# Typical workflow for including a model
```shell script
transformers-cli add-new-model
```
Here is an overview of the general workflow:
This should launch the `cookiecutter` package which should prompt you to fill in the configuration.
- [ ] Add model/configuration/tokenization classes.
- [ ] Add conversion scripts.
- [ ] Add tests and a @slow integration test.
- [ ] Document your model.
- [ ] Finalize.
The `modelname` should be cased according to the plain text casing, i.e., BERT, RoBERTa, DeBERTa.
```
modelname [<ModelNAME>]:
uppercase_modelname [<MODEL_NAME>]:
lowercase_modelname [<model_name>]:
camelcase_modelname [<ModelName>]:
```
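As a rough illustration of how the four name prompts relate to one another, the sketch below derives plausible answers from a plain-text model name. The helper `casing_variants` is hypothetical and not part of `transformers` or `cookiecutter`; it only assumes the casing conventions suggested by the prompt defaults above.

```python
import re


def casing_variants(modelname: str) -> dict:
    """Derive illustrative answers for the four name prompts (hypothetical helper)."""
    # Split on whitespace, hyphens and underscores, e.g. "Big Bird" -> ["Big", "Bird"].
    words = [w for w in re.split(r"[\s_\-]+", modelname.strip()) if w]
    return {
        # Plain-text casing is kept exactly as typed, e.g. "RoBERTa".
        "modelname": modelname,
        # Constant-style name, e.g. "BIG_BIRD".
        "uppercase_modelname": "_".join(w.upper() for w in words),
        # File/module-style name, e.g. "big_bird".
        "lowercase_modelname": "_".join(w.lower() for w in words),
        # Class-prefix-style name, e.g. "BigBird" (each word capitalized, rest lowercased).
        "camelcase_modelname": "".join(w.capitalize() for w in words),
    }


if __name__ == "__main__":
    print(casing_variants("Big Bird")["camelcase_modelname"])  # BigBird
```

The values are only suggestions: the prompts accept any input, so adjust them if your model follows a different naming scheme.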
Let's detail what should be done at each step.
Fill in the `authors` with your team members:
```
authors [The HuggingFace Team]:
```
## Adding model/configuration/tokenization classes
The checkpoint identifier is the checkpoint that will be used in the examples across the files. Enter the name you want,
as it will appear on the model hub. Do not forget to include the organisation.
```
checkpoint_identifier [organisation/<model_name>-base-cased]:
```
Here is the workflow for adding model/configuration/tokenization classes:
The tokenizer should either be based on BERT if it behaves exactly like the BERT tokenizer, or be standalone otherwise.
```
Select tokenizer_type:
1 - Based on BERT
2 - Standalone
Choose from 1, 2 [1]:
```
<!---
Choose if your model is an encoder-decoder, or an encoder-only architecture.
- [ ] Copy the python files from the present folder to the main folder and rename them, replacing `xxx` with your model
name.
- [ ] Edit the files to replace `XXX` (with various casing) with your model name.
- [ ] Copy-paste or create a simple configuration class for your model in the `configuration_...` file.
- [ ] Copy-paste or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0).
- [ ] Copy-paste or create a tokenizer class for your model in the `tokenization_...` file.
If your model is an encoder-only architecture, the generated architecture will be based on the BERT model.
If your model is an encoder-decoder architecture, the generated architecture will be based on the BART model. You can,
of course, edit the files once the generation is complete.
```
Select is_encoder_decoder_model:
1 - True
2 - False
Choose from 1, 2 [1]:
```
-->
## Adding conversion scripts
Once the command has finished, you should have a total of 7 new files spread across the repository:
```
docs/source/model_doc/<model_name>.rst
src/transformers/configuration_<model_name>.py
src/transformers/modeling_<model_name>.py
src/transformers/modeling_tf_<model_name>.py
src/transformers/tokenization_<model_name>.py
tests/test_modeling_<model_name>.py
tests/test_modeling_tf_<model_name>.py
```
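To confirm that generation produced everything, one option is a small script that expands the listing above for a given `lowercase_modelname` and reports any path that is absent. This is a minimal sketch: `expected_files` and `missing_files` are hypothetical helpers, and the path templates are copied from the listing above rather than read from the library.

```python
from pathlib import Path

# Path templates taken from the file listing above; "{name}" is the lowercase model name.
FILE_TEMPLATES = [
    "docs/source/model_doc/{name}.rst",
    "src/transformers/configuration_{name}.py",
    "src/transformers/modeling_{name}.py",
    "src/transformers/modeling_tf_{name}.py",
    "src/transformers/tokenization_{name}.py",
    "tests/test_modeling_{name}.py",
    "tests/test_modeling_tf_{name}.py",
]


def expected_files(lowercase_modelname: str) -> list:
    """Return the relative paths the generator is expected to create."""
    return [template.format(name=lowercase_modelname) for template in FILE_TEMPLATES]


def missing_files(repo_root: str, lowercase_modelname: str) -> list:
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in expected_files(lowercase_modelname) if not (root / p).is_file()]


if __name__ == "__main__":
    # "my_model" is a placeholder; run from the root of the cloned repository.
    for path in missing_files(".", "my_model"):
        print("missing:", path)
```

If you chose to skip either the PyTorch or the TensorFlow implementation, some of these paths are expected to be reported as missing.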
Here is the workflow for the conversion scripts:
You can run the tests to ensure that they all pass:
- [ ] Copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] Edit this script to convert your original checkpoint weights to the current pytorch ones.
```
python -m pytest ./tests/test_*<model_name>*.py
```
## Adding tests:
Feel free to modify each file to mimic the behavior of your model.
Here is the workflow for adding tests:
⚠ You should be careful about the classes preceded by the following line:
- [ ] Copy the python files from the `tests` sub-folder of the present folder to the `tests` subfolder of the main
folder and rename them, replacing `xxx` with your model name.
- [ ] Edit the tests files to replace `XXX` (with various casing) with your model name.
- [ ] Edit the tests code as needed.
```python
# Copied from transformers.[...]
```
## Documenting your model:
This line ensures that the copy does not diverge from the source. If it *should* diverge, because the implementation
is different, this line needs to be deleted. If you don't delete this line and run `make fix-copies`,
your changes will be overwritten.
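Before editing a generated file, it can help to list every `# Copied from` marker it contains, so that you know which classes would be overwritten by `make fix-copies`. The sketch below is a hypothetical helper that scans source text with a regular expression; it is not the check the repository itself runs, and the module path in the sample is only an illustration.

```python
import re

# Matches comment lines such as "# Copied from transformers.some_module.SomeClass".
COPIED_FROM = re.compile(r"^\s*#\s*Copied from\s+(transformers\.\S+)")


def find_copied_from(source: str) -> list:
    """Return (line_number, copied_target) pairs for each marker in the source text."""
    results = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = COPIED_FROM.match(line)
        if match:
            results.append((line_number, match.group(1)))
    return results


if __name__ == "__main__":
    sample = (
        "class A:\n"
        "    pass\n"
        "\n"
        "# Copied from transformers.modeling_bert.BertSelfOutput\n"
        "class B:\n"
        "    pass\n"
    )
    for line_number, target in find_copied_from(sample):
        print(f"line {line_number}: copy of {target}")
```

Any class reported here either stays identical to its source or needs its marker deleted before you change it.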
Here is the workflow for documentation:
Once you have edited the files to fit your architecture, simply re-run the tests (and edit them if a change
is needed!) afterwards to make sure everything works as expected.
- [ ] Make sure all your arguments are properly documented in your configuration and tokenizer.
- [ ] Most of the documentation of the models is automatically generated, you just have to make sure that
`XXX_START_DOCSTRING` contains an introduction to the model you're adding and a link to the original
article and that `XXX_INPUTS_DOCSTRING` contains all the inputs of your model.
- [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.
Once the files are generated and you are happy with your changes, here's a checklist to ensure that your contribution
will be merged quickly:
Make sure to check you have no sphinx warnings when building the documentation locally and follow our
[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
## Final steps
You can then finish the addition step by adding imports for your classes in the common files:
- [ ] Add import for all the relevant classes in `__init__.py`.
- [ ] Add your configuration in `configuration_auto.py`.
- [ ] Add your PyTorch and TF 2.0 model respectively in `modeling_auto.py` and `modeling_tf_auto.py`.
- [ ] Add your tokenizer in `tokenization_auto.py`.
- [ ] Add a link to your conversion script in the main conversion utility (in `commands/convert.py`)
- [ ] Edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py`
file.
- [ ] Add a mention of your model in the doc: `README.md` and the documentation itself
in `docs/source/pretrained_models.rst`. Run `make fix-copies` to update `docs/source/index.rst` with your changes.
- [ ] Upload the pretrained weights, configurations and vocabulary files.
- [ ] Create model card(s) for your models on huggingface.co. For those last two steps, check the
[model sharing documentation](https://huggingface.co/transformers/model_sharing.html).
- You should run the `make fixup` utility to fix the style of the files and to ensure the code quality meets the
library's standards.
- You should complete the documentation file (`docs/source/model_doc/<model_name>.rst`) so that your model may be
usable.