Model templates encoder only (#8509)
* Model templates
* TensorFlow
* Remove pooler
* CI
* Tokenizer + Refactoring
* Encoder-Decoder
* Let's go testing
* Encoder-Decoder in TF
* Let's go testing in TF
* Documentation
* README
* Fixes
* Better names
* Style
* Update docs
* Choose to skip either TF or PT
* Code quality fixes
* Add to testing suite
* Update file path
* Cookiecutter path
* Update `transformers` path
* Handle rebasing
* Remove seq2seq from model templates
* Remove s2s config
* Apply Sylvain and Patrick comments
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Last fixes from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -1,89 +1,104 @@
# How to add a new model in 🤗 Transformers
# Using `cookiecutter` to generate models

This folder describes the process to add a new model in 🤗 Transformers and provides templates for the required files.
This folder contains templates to generate new models that fit the current API and pass all tests. It generates
models in both PyTorch and TensorFlow, completes the `__init__.py` and auto-modeling files, and creates the
documentation.

The library is designed to incorporate a variety of models and code bases. As such the process for adding a new model
mostly consists of copy-pasting the relevant original code into the various sections of the templates included in
the present repository.
## Usage

One important point though is that the library has the following goals impacting the way models are incorporated:
Using the `cookiecutter` utility requires all the `dev` dependencies to be installed. Let's first clone the
repository and install it in our environment:

- One specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus
often has to be slightly adapted to allow for running in the Python interpreter.
- The package is also designed to be as self-consistent as possible, with a small and reliable set of package dependencies. In
consequence, additional dependencies are usually not allowed when adding a model but can be allowed for the
inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities include
`sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before adding a
new one.
```shell script
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e ".[dev]"
```

For a quick overview of the general philosophy of the library and its organization, please check the
[QuickStart section of the documentation](https://huggingface.co/transformers/philosophy.html).
Once the installation is done, you can use the CLI command `add-new-model` to generate your models:

# Typical workflow for including a model
```shell script
transformers-cli add-new-model
```

Here is an overview of the general workflow:
This should launch the `cookiecutter` package, which should prompt you to fill in the configuration.

- [ ] Add model/configuration/tokenization classes.
- [ ] Add conversion scripts.
- [ ] Add tests and a @slow integration test.
- [ ] Document your model.
- [ ] Finalize.
The `modelname` should be cased according to the plain text casing, e.g., BERT, RoBERTa, DeBERTa.
```
modelname [<ModelNAME>]:
uppercase_modelname [<MODEL_NAME>]:
lowercase_modelname [<model_name>]:
camelcase_modelname [<ModelName>]:
```
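The four casing prompts above are all variations of one name, so a rough starting point can be derived mechanically. The sketch below is a hypothetical helper, not part of `transformers-cli` or `cookiecutter`; the camel-case rule (capitalize each word and lowercase the rest) is an assumption, so review the suggested values before typing them into the prompts.

```python
import re


def casing_variants(modelname: str) -> dict:
    """Suggest values for the four casing prompts (hypothetical helper)."""
    # Split the plain-text name on whitespace, underscores and hyphens.
    words = [w for w in re.split(r"[\s_\-]+", modelname) if w]
    snake = "_".join(w.lower() for w in words)
    return {
        "modelname": modelname,
        "uppercase_modelname": snake.upper(),
        "lowercase_modelname": snake,
        # Assumption: one capitalized chunk per word, e.g. "Big Bird" -> "BigBird".
        "camelcase_modelname": "".join(w.capitalize() for w in words),
    }


if __name__ == "__main__":
    for prompt, value in casing_variants("RoBERTa").items():
        print(f"{prompt}: {value}")
```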
Let's detail what should be done at each step.
Fill in the `authors` with your team members:
```
authors [The HuggingFace Team]:
```

## Adding model/configuration/tokenization classes
The checkpoint identifier is the checkpoint that will be used in the examples across the files. Put the name you wish,
as it will appear on the model hub. Do not forget to include the organisation.
```
checkpoint_identifier [organisation/<model_name>-base-cased]:
```

Here is the workflow for adding model/configuration/tokenization classes:
The tokenizer should either be based on BERT if it behaves exactly like the BERT tokenizer, or be standalone otherwise.
```
Select tokenizer_type:
1 - Based on BERT
2 - Standalone
Choose from 1, 2 [1]:
```
<!---
Choose if your model is an encoder-decoder, or an encoder-only architecture.

- [ ] Copy the python files from the present folder to the main folder and rename them, replacing `xxx` with your model
name.
- [ ] Edit the files to replace `XXX` (with various casing) with your model name.
- [ ] Copy-paste or create a simple configuration class for your model in the `configuration_...` file.
- [ ] Copy-paste or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0).
- [ ] Copy-paste or create a tokenizer class for your model in the `tokenization_...` file.
If your model is an encoder-only architecture, the generated architecture will be based on the BERT model.
If your model is an encoder-decoder architecture, the generated architecture will be based on the BART model. You can,
of course, edit the files once the generation is complete.
```
Select is_encoder_decoder_model:
1 - True
2 - False
Choose from 1, 2 [1]:
```
-->

## Adding conversion scripts
Once the command has finished, you should have a total of 7 new files spread across the repository:
```
docs/source/model_doc/<model_name>.rst
src/transformers/configuration_<model_name>.py
src/transformers/modeling_<model_name>.py
src/transformers/modeling_tf_<model_name>.py
src/transformers/tokenization_<model_name>.py
tests/test_modeling_<model_name>.py
tests/test_modeling_tf_<model_name>.py
```
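To confirm that all seven files listed above were actually generated, a small check along the following lines may help. It is a minimal sketch: the path templates are copied from the list above, while the function names, the example name `my_model`, and the assumption that the script runs from the repository root are illustrative only.

```python
from pathlib import Path

# Path templates copied from the list above; {name} stands for <model_name>.
GENERATED_FILE_TEMPLATES = [
    "docs/source/model_doc/{name}.rst",
    "src/transformers/configuration_{name}.py",
    "src/transformers/modeling_{name}.py",
    "src/transformers/modeling_tf_{name}.py",
    "src/transformers/tokenization_{name}.py",
    "tests/test_modeling_{name}.py",
    "tests/test_modeling_tf_{name}.py",
]


def expected_files(lowercase_modelname, repo_root="."):
    """Return the paths the generator is expected to create (hypothetical helper)."""
    root = Path(repo_root)
    return [root / t.format(name=lowercase_modelname) for t in GENERATED_FILE_TEMPLATES]


def missing_files(lowercase_modelname, repo_root="."):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(lowercase_modelname, repo_root) if not p.is_file()]


if __name__ == "__main__":
    # "my_model" is a placeholder; use the value given for lowercase_modelname.
    for path in missing_files("my_model"):
        print(f"missing: {path}")
```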
Here is the workflow for the conversion scripts:
You can run the tests to ensure that they all pass:

- [ ] Copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] Edit this script to convert your original checkpoint weights to the current PyTorch ones.
```
python -m pytest ./tests/test_*<model_name>*.py
```

## Adding tests:
Feel free to modify each file to mimic the behavior of your model.

Here is the workflow for adding tests:
⚠️ You should be careful about the classes preceded by the following line:

- [ ] Copy the Python files from the `tests` sub-folder of the present folder to the `tests` subfolder of the main
folder and rename them, replacing `xxx` with your model name.
- [ ] Edit the test files to replace `XXX` (with various casing) with your model name.
- [ ] Edit the test code as needed.
```python
# Copied from transformers.[...]
```

## Documenting your model:
This line ensures that the copy does not diverge from the source. If it *should* diverge, because the implementation
is different, this line needs to be deleted. If you don't delete this line and run `make fix-copies`,
your changes will be overwritten.
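Before editing a generated file, it can be useful to list which classes and functions still carry a `# Copied from` marker. The following is a minimal sketch that scans source text for such markers; it is not the logic used by `make fix-copies`, and the helper name and the target path in the example are hypothetical.

```python
import re

# A marker line, and the class/def line it is expected to sit above.
COPIED_FROM = re.compile(r"^\s*# Copied from (transformers\.\S+)")
DEFINITION = re.compile(r"^\s*(?:class|def)\s+(\w+)")


def find_copied_definitions(source: str):
    """Return (definition_name, copied_from_target) pairs found in `source`."""
    results = []
    pending = None  # target of the most recent marker not yet attached
    for line in source.splitlines():
        marker = COPIED_FROM.match(line)
        if marker:
            pending = marker.group(1)
            continue
        definition = DEFINITION.match(line)
        if definition and pending is not None:
            results.append((definition.group(1), pending))
        # Decorators may sit between marker and definition; anything else resets.
        if not line.strip().startswith("@"):
            pending = None
    return results


if __name__ == "__main__":
    # The target path below is a made-up placeholder, not a real module.
    example = (
        "# Copied from transformers.some_module.SomeClass with Some->MyModel\n"
        "class MyModelSelfOutput:\n"
        "    pass\n"
        "\n"
        "class MyModelEncoder:\n"
        "    pass\n"
    )
    for name, target in find_copied_definitions(example):
        print(f"{name} is kept in sync with {target}")
```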
Here is the workflow for documentation:
Once you have edited the files to fit your architecture, simply re-run the tests (and edit them if a change
is needed!) afterwards to make sure everything works as expected.

- [ ] Make sure all your arguments are properly documented in your configuration and tokenizer.
- [ ] Most of the documentation of the models is automatically generated; you just have to make sure that
`XXX_START_DOCSTRING` contains an introduction to the model you're adding and a link to the original
article and that `XXX_INPUTS_DOCSTRING` contains all the inputs of your model.
- [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.
Once the files are generated and you are happy with your changes, here's a checklist to ensure that your contribution
will be merged quickly:

Make sure to check you have no Sphinx warnings when building the documentation locally and follow our
[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).

## Final steps

You can then finish the addition step by adding imports for your classes in the common files:

- [ ] Add imports for all the relevant classes in `__init__.py`.
- [ ] Add your configuration in `configuration_auto.py`.
- [ ] Add your PyTorch and TF 2.0 model respectively in `modeling_auto.py` and `modeling_tf_auto.py`.
- [ ] Add your tokenizer in `tokenization_auto.py`.
- [ ] Add a link to your conversion script in the main conversion utility (in `commands/convert.py`)
- [ ] Edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py`
file.
- [ ] Add a mention of your model in the doc: `README.md` and the documentation itself
in `docs/source/pretrained_models.rst`. Run `make fix-copies` to update `docs/source/index.rst` with your changes.
- [ ] Upload the pretrained weights, configurations and vocabulary files.
- [ ] Create model card(s) for your models on huggingface.co. For those last two steps, check the
[model sharing documentation](https://huggingface.co/transformers/model_sharing.html).
- You should run the `make fixup` utility to fix the style of the files and to ensure the code quality meets the
library's standards.
- You should complete the documentation file (`docs/source/model_doc/<model_name>.rst`) so that your model may be
usable.