* Created using Colaboratory
* [examples] reorganize files
* remove run_tpu_glue.py as superseded by TPU support in Trainer
* Bugfix: int, not tuple
* move files around
* Rewritten batch support in pipelines.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix imports sorting 🔧
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Set pad_to_max_length=True by default on Pipeline.
* Set pad_to_max_length=False for generation pipelines.
Most generation models don't have a padding token.
* Address @joeddav review comment: Uniformized *args.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Address @joeddav review comment: Uniformized *args (second).
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* first copy & paste commit from Bert and Morgan's LSH code
* add easy way to compare to trax original code
* translate most of function
* make trax lsh self attention deterministic with numpy seed + copy paste code
* add same config
* add same config
* make layer init work
* implemented hash_vectors function for lsh attention
* continue reformer translation
* hf LSHSelfAttentionLayer gives same output as trax layer
* refactor code
* refactor code
* refactor code
* refactor
* refactor + add reformer config
* delete bogus file
* split reformer attention layer into two layers
* save intermediate step
* save intermediate step
* make test work
* add complete reformer block layer
* finish reformer layer
* implement causal and self mask
* clean reformer test and refactor code
* fix merge conflicts
* fix merge conflicts
* update init
* fix device for GPU
* fix chunk length init for tests
* include morgans optimization
* improve memory a bit
* improve comment
* factorize num_buckets
* better testing parameters
* make whole model work
* make lm model work
* add t5 copy paste tokenizer
* add chunking feed forward
* clean config
* add improved assert statements
* make tokenizer work
* improve test
* correct typo
* extend config
* add more complex test
* add new axial position embeddings
* add local block attention layer
* clean tests
* refactor
* better testing
* save intermediate progress
* clean test file
* make shorter input length work for model
* allow variable input length
* refactor
* make forward pass for pretrained model work
* add generation possibility
* finish dropout and init
* make style
* refactor
* add first version of RevNet Layers
* make forward pass work and add convert file
* make uploaded model forward pass work
* make uploaded model forward pass work
* refactor code
* add namedtuples and cache buckets
* correct head masks
* refactor
* made reformer more flexible
* make style
* remove set max length
* add attention masks
* fix up tests
* fix lsh attention mask
* make random seed optional for the moment
* improve memory in reformer
* add tests
* make style
* make sure masks work correctly
* detach gradients
* save intermediate
* correct backprop through gather
* make style
* change back num hashes
* rename to labels
* fix rotation shape
* fix detach
* update
* fix trainer
* fix backward dropout
* make reformer more flexible
* fix conflict
* fix
* fix
* add tests for fixed seed in reformer layer
* fix trainer typo
* fix typo in activations
* add fp16 tests
* add fp16 training
* support fp16
* correct gradient bug in reformer
* add fast gelu
* re-add dropout for embedding dropout
* better naming
* better naming
* renaming
* finalize test branch
* finalize tests
* add more tests
* finish tests
* fix
* fix type trainer
* fix fp16 tests
* fix tests
* fix tests
* fix tests
* fix issue with dropout
* fix dropout seeds
* correct random seed on gpu
* finalize random seed for dropout
* finalize random seed for dropout
* remove duplicate line
* correct half precision bug
* make style
* refactor
* refactor
* docstring
* remove sinusoidal position encodings for reformer
* move chunking to modeling_utils
* make style
* clean config
* make style
* fix tests
* fix auto tests
* pretrained models
* fix docstring
* update conversion file
* Update pretrained_models.rst
* fix rst
* fix rst
* update copyright
* fix test path
* fix test path
* fix small issue in test
* include reformer in generation tests
* add docs for axial position encoding
* finish docs
* Update convert_reformer_trax_checkpoint_to_pytorch.py
* remove isort
* include sams comments
* remove wrong comment in utils
* correct typos
* fix typo
* Update reformer.rst
* applied morgans optimization
* make style
* make gpu compatible
* remove bogus file
* big test refactor
* add example for chunking
* fix typo
* add to README
* First commit to add a TF version of the trainer.
* Make the TF trainer closer to what the PT trainer looks like
* Refactoring common code between the PT and TF trainer into a util file.
* Some bugfixes + better similarity with the PT trainer
* Add missing class in transformers init
* Bugfix over prediction + use classification report instead of simple metrics
* Fix name error
* Fix optimization tests + style
* Apply style
* Several bugfixes for multi-gpu training
* Apply style
* Apply style
* Add glue example for the TF trainer
* Several bugfixes + address the reviews
* Fix on the TF training args file
* Add a debug mode
* Bugfix in utils_ner.py when segment_ids is None
* Apply style
* Apply style
* Add TPU strategy
* Fix selection strategy
There's an inconsistency right now where:
- we load some models into CACHE_DIR
- and some models in the default cache
- and often, in both for the same models
When running the RUN_SLOW tests, this takes a lot of disk space, time, and bandwidth.
I'd rather always use the default cache
* Update sqrt computation so it can survive a torch.jit.trace
* Update modeling_gpt2.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Add GenerationPipeline
* Fix parameter names
* Correct parameter __call__ parameters
* Add model type attribute and correct function calls for prepare_input
* Take out trailing commas from init attributes
* Remove unnecessary tokenization line
* Implement support for multiple text inputs
* Apply generation support for multiple input text prompts
* Take out tensor coercion
* Take out batch index
* Add text prompt to return sequence
* Squeeze token tensor before decoding
* Return only a single list of sequences if only one prompt was used
* Correct results variable name
* Add GenerationPipeline to SUPPORTED_TASKS with the alias , initialized w GPT2
* Registered AutoModelWithLMHead for both pt and t
* Update docstring for GenerationPipeline
* Add kwargs parameter to model.generate
* Take out kwargs parameter after all
* Add generation pipeline example in pipeline docstring
* Fix max length by squeezing tokens tensor
* Apply ensure_tensor_on_device to pytorch tensor
* Include generation step in torch.no_grad
* Take out input from prepare_xlm_input and set 'en' as default xlm_language
* Apply framework specific encoding during prepare_input
* Format w make style
* Move GenerationPipeline import to follow proper import sorting
* Take out trailing comma from generation dict
* Apply requested changes
* Change name to TextGenerationPipeline
* Apply TextGenerationPipeline rename to __init__
* Changing alias to
* Set input mapping as input to ensure_tensor_on_device
* Fix assertion placement
* Add test_text_generation
* Add TextGenerationPipeline to PipelineCommonTests
* Take out whitespace
* Format __init__ w black
* Fix __init__ style
* Format __init__
* Add line to end of __init__
* Correct model tokenizer set for test_text_generation
* Ensure to return list of list, not list of string (to pass test)
* Limit test models to only 3 to limit runtime to address circleCI timeout error
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update tests/test_pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Remove argument docstring, __init__, add additional __call__ arguments, and reformat results to list of dict
* Fix blank result list
* Add TextGenerationPipeline to pipelines.rst
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Fix typos from adding PADDING_TEXT_TOKEN_LENGTH
* Fix incorrectly moved result list
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
* Add back generation line and make style
* Take out blank whitespace
* Apply new alias, text-generation, to test_pipelines
* Fix text generation alias in test
* Update src/transformers/pipelines.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* doc
* [tests] Add sample files for a regression task
* [HUGE] Trainer
* Feedback from @sshleifer
* Feedback from @thomwolf + logging tweak
* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes
* [glue] Use default max_seq_length of 128 like before
* [glue] move DataTrainingArguments around
* [ner] Change interface of InputExample, and align run_{tf,pl}
* Re-align the pl scripts a little bit
* ner
* [ner] Add integration test
* Fix language_modeling with API tweak
* [ci] Tweak loss target
* Don't break console output
* amp.initialize: model must be on right device before
* [multiple-choice] update for Trainer
* Re-align to 827d6d6ef0
* First pass on utility classes and python tokenizers
* finishing cleanup pass
* style and quality
* Fix tests
* Updating following @mfuntowicz comment
* style and quality
* Fix Roberta
* fix batch_size/seq_length in BatchEncoding
* add alignment methods + tests
* Fix OpenAI and Transfo-XL tokenizers
* adding trim_offsets=True default for GPT2 and RoBERTa
* style and quality
* fix tests
* add_prefix_space in roberta
* bump up tokenizers to rc7
* style
* unfortunately tensorflow doesn't like these - removing shape/seq_len for now
* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
* Adding doc and docstrings
* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>
token_type_id is converted into the segment embedding. For question answering,
this needs to highlight whether a token belongs to sequence 0 or 1.
encode_plus takes care of correctly setting this parameter automatically.
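As a rough illustration of the segment ids described above, here is a minimal sketch that assumes a BERT-style pair layout (`[CLS] question [SEP] context [SEP]`). The helper name `build_token_type_ids` is hypothetical and is not the library's API; in practice encode_plus produces these values for you.

```python
from typing import List


def build_token_type_ids(question: List[str], context: List[str]) -> List[int]:
    """Hypothetical helper: mark each token of a BERT-style pair as segment 0 or 1."""
    # Assumed layout: [CLS] + question + [SEP] -> segment 0, context + [SEP] -> segment 1
    segment_0 = [0] * (len(question) + 2)
    segment_1 = [1] * (len(context) + 1)
    return segment_0 + segment_1


ids = build_token_type_ids(["who", "wrote", "it", "?"], ["alice", "wrote", "it"])
print(ids)
```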
* Refactored use of newstest2013 to newstest2014. Fixed bug where argparse consumed first command line argument as model_size argument rather than using default model_size by forcing explicit --model_size flag inclusion
* More pythonic file handling through 'with' context
* COSMETIC - ran Black and isort
* Fixed reference to number of lines in newstest2014
* Fixed failing test. More pythonic file handling
* finish PR from tholiao
* remove commented-out lines
* make style
* make isort happy
Co-authored-by: Thomas Liao <tholiao@gmail.com>
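The argparse fix mentioned above (forcing an explicit --model_size flag) can be sketched as follows. This is only an assumed reconstruction of the failure mode: the optional positional, the default value `t5-base`, and the stray argument are illustrative, not taken from the script.

```python
import argparse

# Assumed shape of the bug: an optional positional grabs the first stray argument.
buggy = argparse.ArgumentParser()
buggy.add_argument("model_size", nargs="?", default="t5-base")  # default is illustrative
buggy_args, buggy_rest = buggy.parse_known_args(["some_other_argument"])

# Fix as described: model_size is only set through an explicit --model_size flag,
# so unrelated arguments are left over instead of being consumed.
fixed = argparse.ArgumentParser()
fixed.add_argument("--model_size", default="t5-base")
fixed_args, fixed_rest = fixed.parse_known_args(["some_other_argument"])

print(buggy_args.model_size, buggy_rest)
print(fixed_args.model_size, fixed_rest)
```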
* remove output_past from pt
* make style
* add optional input length for gpt2
* add use cache to prepare input
* save memory in gpt2
* correct gpt2 test inputs
* make past input optional for gpt2
* finish use_cache for all models
* make style
* delete modeling_gpt2 change in test file
* correct docstring
* correct is true statements for gpt2
* added model_cards for polish squad models
* corrected mistake in polish design cards
* updated model_cards for squad2_dutch model
* added links to benchmark models
Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
* Initial commit to get BERT + run_glue.py on TPU
* Add README section for TPU and address comments.
* Cleanup TPU bits from run_glue.py (#3)
TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.
We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.
* Cleanup TPU bits from run_glue.py
TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.
We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.
* No need to call `xm.mark_step()` explicitly (#4)
For gradient accumulation we accumulate over batches from a
`ParallelLoader` instance, which marks the step itself on next().
* Resolve R/W conflicts from multiprocessing (#5)
* Add XLNet in list of models for `run_glue_tpu.py` (#6)
* Add RoBERTa to list of models in TPU GLUE (#7)
* Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)
* Use barriers to reduce duplicate work/resources (#9)
* Shard eval dataset and aggregate eval metrics (#10)
* Shard eval dataset and aggregate eval metrics
Also, instead of calling `eval_loss.item()` every time do summation with
tensors on device.
* Change defaultdict to float
* Reduce the pred, label tensors instead of metrics
As brought up during review, some metrics like f1 cannot be aggregated
via averaging. GLUE task metrics depend largely on the dataset, so
instead we sync the prediction and label tensors so that the metrics can
be computed accurately on those instead.
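A small worked example of why averaging per-shard f1 can differ from f1 computed on the gathered predictions and labels. The two shards and the binary `f1_score` helper are hypothetical and only illustrate the point made in the message above.

```python
from typing import List


def f1_score(preds: List[int], labels: List[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two hypothetical shards, as if each core evaluated its own slice of the data.
shard_a = ([1, 1, 1, 1], [1, 1, 1, 1])  # (predictions, labels)
shard_b = ([1, 0, 0, 0], [0, 1, 0, 0])

# Averaging the per-shard metric ...
averaged = (f1_score(*shard_a) + f1_score(*shard_b)) / 2
# ... versus gathering predictions and labels first, then computing the metric once.
global_f1 = f1_score(shard_a[0] + shard_b[0], shard_a[1] + shard_b[1])

print(averaged, global_f1)
```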
* Only use tb_writer from master (#11)
* Apply huggingface black code formatting
* Style
* Remove `--do_lower_case` as example uses cased
* Add option to specify tensorboard logdir
This is needed for our testing framework which checks regressions
against key metrics written by the summary writer.
* Using configuration for `xla_device`
* Prefix TPU specific comments.
* num_cores clarification and namespace eval metrics
* Cache features file under `args.cache_dir`
Instead of under `args.data_dir`. This is needed as our test infra uses
data_dir with a read-only filesystem.
* Rename `run_glue_tpu` to `run_tpu_glue`
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
* [examples] Generate argparsers from type hints on dataclasses
* [HfArgumentParser] way simpler API
* Restore run_language_modeling.py for easier diff
* [HfArgumentParser] final tweaks from code review
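The idea of generating argparsers from type hints on dataclasses can be sketched as below. This is a simplified illustration, not the HfArgumentParser implementation: `TrainingArgs` and `parser_from_dataclass` are hypothetical names, and the sketch only handles fields whose type is a plain callable such as `int`, `float`, or `str`.

```python
import argparse
import dataclasses
from dataclasses import dataclass


@dataclass
class TrainingArgs:
    """Hypothetical argument container used only for this illustration."""

    learning_rate: float = 5e-5
    max_seq_length: int = 128
    task_name: str = "mrpc"


def parser_from_dataclass(cls) -> argparse.ArgumentParser:
    """Build one --flag per dataclass field, using the field type as the converter."""
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(cls):
        parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    return parser


namespace = parser_from_dataclass(TrainingArgs).parse_args(["--max_seq_length", "256"])
# Round-trip the parsed namespace back into a typed dataclass instance.
args = TrainingArgs(**vars(namespace))
print(args)
```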
* Big cleanup of `glue_convert_examples_to_features`
* Use batch_encode_plus
* Cleaner wrapping of glue_convert_examples_to_features for TF
@lysandrejik
* Cleanup syntax, thanks to @mfuntowicz
* Raise explicit error in case of user error
* Optimize causal mask using torch.where
Instead of multiplying by 1.0 float mask, use torch.where with a bool mask for increased performance.
* Maintain compatibility with torch 1.0.0 - thanks for PR feedback
* Fix typo
* reformat line for CI
* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Make fast tokenizers unittests work on Windows.
* Entirely refactored unittest for tokenizers fast.
* Remove ABC class for CommonFastTokenizerTest
* Added embeded_special_tokens tests from allenai @dirkgr
* Make embeded_special_tokens tests from allenai more generic
* Uniformize vocab_size as a property for both Fast and normal tokenizers
* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
* Fix invalid input for assert_padding when testing batch_encode_plus
* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
* Ensure tokenize() correctly forward add_special_tokens to rust.
* Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.
* unittests ensure tokenize() also throws a ValueError if provided None
* Added add_special_tokens unittest for all supported models.
* Style
* Make sure TransfoXL test run only if PyTorch is provided.
* Split up tokenizers tests for each model type.
* Fix invalid unittest with new tokenizers API.
* Filter out Roberta openai detector models from unittests.
* Introduce BatchEncoding on fast tokenizers path.
This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.
* Introduce BatchEncoding on slow tokenizers path.
Backward compatibility.
* Improve error message on BatchEncoding for slow path
* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
* Style and format.
* Added typing on all methods for PretrainedTokenizerFast
* Style and format
* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
* Style and format
* encode_plus now supports pretokenized inputs.
* Remove user warning about add_special_tokens when working on pretokenized inputs.
* Always go through the post processor.
* Added support for pretokenized input pairs on encode_plus
* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
* Added pretokenized inputs support on batch_encode_plus
* Update BatchEncoding methods name to match Encoding.
* Bump setup.py tokenizers dependency to 0.7.0rc1
* Remove unused parameters in BertTokenizerFast
* Make sure Roberta returns token_type_ids for unittests.
* Added missing typings
* Update add_tokens prototype to match tokenizers side and allow AddedToken
* Bumping tokenizers to 0.7.0rc2
* Added documentation for BatchEncoding
* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
* Fix text-classification pipeline using the wrong tokenizer
* Make pipelines work with BatchEncoding
* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correct message for batch_encode_plus none input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix invalid list comprehension for offset_mapping overriding content every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* SpecialTokensMixin can use slots for faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove update_special_tokens from fast tokenizers.
* Ensure TransfoXL unittests are run only when torch is available.
* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style
* Style 🙏🙏
* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
* Remove Roberta warning on __init__.
* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
* Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py
`convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone who is more familiar with this part of the codebase checked.
* Simplifying change to match recent commits
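To make the pad token issue above concrete, here is a minimal sketch of padding with a model-specific id rather than a hardcoded 0. The pad ids are the ones quoted in the message; the `PAD_TOKEN_IDS` table, the `pad_to_length` helper, the sample ids, and the choice of right-padding are assumptions for illustration only.

```python
from typing import List

# Pad ids as stated in the commit message above; the lookup table itself is illustrative.
PAD_TOKEN_IDS = {"bert": 0, "roberta": 1, "xlnet": 5}


def pad_to_length(input_ids: List[int], max_length: int, pad_token: int) -> List[int]:
    """Right-pad input_ids with the given pad id (right-padding is an assumption here)."""
    return input_ids + [pad_token] * (max_length - len(input_ids))


padded = pad_to_length([0, 713, 2], max_length=5, pad_token=PAD_TOKEN_IDS["roberta"])
print(padded)
```

In a real script the id would more likely come from the tokenizer itself than from a table like this one.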
* add some t5 integration tests
* finish summarization and translation integration tests for T5 - results look good
* add tf test
* fix == vs is bug
* fix tf beam search error and make tf t5 tests pass
* Using loaded checkpoint with --do_predict
Without this fix, I'm getting near-random validation performance for a trained model, and the validation performance differs per validation run. I think this happens since the `model` variable isn't set with the loaded checkpoint, so I'm using a randomly initialized model. Looking at the model activations, they differ each time I run evaluation (but they don't with this fix).
* Update checkpoint loading
* Fixing model loading
* Update the NER TF script to remove the softmax and make the pad token label id to -1
* Reformat the quality and style
Co-authored-by: Julien Plu <julien.plu@adevinta.com>
* make decoder input ids optional for t5 training
* lm_labels should not be shifted in t5
* add tests
* finish shift right functionality for PT T5
* move shift right to correct class
* cleaner code
* replace -100 values with pad token id
* add assert statement
* remove unnecessary for loop
* make style
* Add clear description of how to train T5
* correct docstring in T5
* correct typo
* correct docstring format
* update t5 model docs
* implement collins feedback
* fix typo and add more explanation for sentinel tokens
* delete unnecessary todos
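The shift-right and "-100 to pad token id" steps listed above can be sketched on plain Python lists. This is an assumed simplification, not the PT T5 code: the function name, the sample ids, and the use of 0 for both the start and pad ids are illustrative.

```python
from typing import List


def shift_right(labels: List[int], decoder_start_token_id: int, pad_token_id: int) -> List[int]:
    """Build decoder inputs by shifting labels one position to the right."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # -100 marks label positions ignored by the loss; it is not a valid token id,
    # so it is replaced with the pad token id before feeding the decoder.
    return [pad_token_id if token == -100 else token for token in shifted]


decoder_input_ids = shift_right(
    [37, 1782, 1, -100, -100], decoder_start_token_id=0, pad_token_id=0
)
print(decoder_input_ids)
```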
* force bleu
* fix wrong file name
* rename file
* different filenames for each example test
* test files should clean up after themselves
* test files should clean up after themselves
* do not force bleu
* correct typo
* fix isort
* Use tokenizer.num_added_tokens to count number of added special_tokens instead of hardcoded numbers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* run_ner.py - Do not add a label to the labels_ids if word_tokens is empty.
This can happen when using bert-base-multilingual-cased with an input containing a lone space.
In this case, the tokenizer outputs an empty word_tokens list, which leads to inconsistent behavior:
labels_ids ends up with one more entry than the tokens vector.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
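A minimal sketch of the run_ner.py guard described above: a word that tokenizes to nothing contributes neither tokens nor labels. The `align_labels` helper, the toy tokenizer, and the `-100` padding label are hypothetical stand-ins for illustration.

```python
from typing import Callable, List, Tuple


def toy_tokenize(word: str) -> List[str]:
    """Toy tokenizer: strips whitespace and splits into chunks of 3 characters."""
    stripped = word.strip()
    return [stripped[i : i + 3] for i in range(0, len(stripped), 3)]


def align_labels(
    words: List[str],
    label_ids: List[int],
    tokenize: Callable[[str], List[str]],
    pad_label_id: int = -100,
) -> Tuple[List[str], List[int]]:
    tokens: List[str] = []
    aligned: List[int] = []
    for word, label_id in zip(words, label_ids):
        word_tokens = tokenize(word)
        if len(word_tokens) == 0:
            # Skip the label too, otherwise labels outnumber tokens by one.
            continue
        tokens.extend(word_tokens)
        # Real label on the first sub-token, padding label on the rest.
        aligned.extend([label_id] + [pad_label_id] * (len(word_tokens) - 1))
    return tokens, aligned


tokens, aligned = align_labels(["Paris", " ", "is"], [1, 0, 0], toy_tokenize)
print(tokens, aligned)
```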
* Add the missing token classification for XLM
* fix styling
* Add XLMForTokenClassification to AutoModelForTokenClassification class
* Fix docstring typo for non-existing class
* Add the missing token classification for XLM
* fix styling
* fix styling
* Add XLMForTokenClassification to AutoModelForTokenClassification class
* Fix docstring typo for non-existing class
* Add missing description for AlbertForTokenClassification
* fix styling
* Add missing docstring for AlBert
* Slow tests should be slow
Co-authored-by: Sakares Saengkaew <s.sakares@gmail.com>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
* [ci] Also run test_examples in py37
(will revert at the end of the experiment)
* InputExample: use immutable dataclass
* [deps] Install dataclasses for Py<3.7
* [skip ci] Revert "[ci] Also run test_examples in py37"
This reverts commit d29afd9959786b77759b0b8fa4e6b4335b952015.
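For the immutable dataclass change to InputExample listed above, a minimal sketch of what "immutable" means in practice. The class name and fields here are illustrative assumptions, not a copy of the library's definition.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InputExampleSketch:
    """Hypothetical stand-in for an immutable example record."""

    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


example = InputExampleSketch(guid="train-1", text_a="hello")

try:
    example.label = "1"  # assignment on a frozen dataclass raises
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# To change a field, build a new instance instead of mutating in place.
updated = dataclasses.replace(example, label="1")
print(mutated, updated)
```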
The CONTRIBUTING file pins to a specific version of isort, so we might as well install that in `dev` . This makes it easier for contributors so they don't have to manually install the specific commit.
I found there are two grammar errors or typo issues in the explanation of the encoding properties.
The original sentences:
If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to
If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts.
I think "input" should be inserted after the phrase "If your".
* fix conflicts
* update bart max length test
* correct spelling mistakes
* implemented model specific encode function
* fix merge conflicts
* better naming
* save intermediate state -> need to rethink structure a bit
* leave tf problem as it is for now
* current version
* add layers.pop
* remove ipdb
* make style
* clean return cut decoding
* remove ipdbs
* Fix restoring layers in the decoders that don't exist.
* push good intermediate solution for now
* fix conflicts
* always good to refuse to merge conflicts when rebasing
* fix small bug
* improve function calls
* remove unused file
* add correct scope behavior for t5_generate
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
For the tutorial of "How to generate text", the URL link was wrong (it was linked to the tutorial of "How to train a language model").
I fixed the URL.
* added return_token_type_ids argument for tokenizers which do not generate return_type_ids by default
* fixed styling
* Style
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
* first commit
* work in progress
* make language generation task pass
* update to working version for LM
* delete print
* remove dead code
* make style
* passing
* Undo stupid chg
* docs
* undo rename
* delete-cruft
* only import if you have torch
* Dont rely on dict ordering
* Fix dict ordering upstream
* docstring link
* docstring link
* remove trailing comma for 3.5 compat
* new name
* delegate kwarging
* Update kwargs
* ✨ Alter base pl transformer to use automodels
* 🐛 Add batch size env variable to function call
* 💄 Apply black code style from Makefile
* 🚚 Move lightning base out of ner directory
* ✨ Add lightning glue example
* 💄 self
* move _feature_file to base class
* ✨ Move eval logging to custom callback
* 💄 Apply black code style
* 🐛 Add parent to pythonpath, remove copy command
* 🐛 Add missing max_length kwarg
* memory benchmark rss
* have both forward pass and line-by-line mem tracing
* cleaned up tracing
* refactored and cleaning up API
* no f-strings yet...
* add GPU mem logging
* fix GPU memory monitoring
* style and quality
* clean up and doc
* update with comments
* Switching to python 3.6+
* fix quality
This model card is intended to be shared among all models under google/bert_uncased_*
(We'll need some support from HuggingFace to get this card cross-linked from all models)
* Add TF2 version of FlauBERT
* Add TF2 version of FlauBERT
* Add documentation
* Apply style and quality
* Apply style once again
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* add empty model cards for every current DeepPavlov model
* fix: replace cyrillic `с` with `c`
* docs: add model cards for current DeepPavlov BERT models
* docs: add links for arXiv preprints
Be explicit that this is config for the transformers package (as these
layers may coexist with other custom stuff in a Keras model, plus the
Keras container itself is called config, and config["config"] is not
great)
Add explicit error handling for initializer calls that have neither
the `config` nor the `transformers_config` argument, or have both.
This was the beginnings of an attempt to address the test failure on
this layer, and instead I backed out of making this layer
keras-serializable at all ... so it was a mistake to commit this.
* Added transformers-pytorch-cpu and gpu Docker images
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added automatic jupyter launch for Docker image.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move image from alpine to Ubuntu to align with NVidia container images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added TRANSFORMERS_VERSION argument to Dockerfile.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Pytorch-GPU based Docker image
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Tensorflow images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Use python 3.7 as Tensorflow doesn't provide 3.8 compatible wheel.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove double FROM instructions on transformers-pytorch-cpu image.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added transformers-tensorflow-gpu Docker image.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* use the correct ubuntu version for tensorflow-gpu
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added pipelines example notebook
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added transformers-cpu and transformers-gpu (including both PyTorch and TensorFlow) images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Docker images don't start jupyter notebook by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Tokenizers notebook
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update images links
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update Docker images to python 3.7.6 and transformers 2.5.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added 02-transformers notebook.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Trying to realign 02-transformers notebook ?
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Transformer image schema
* Some tweaks on tokenizers notebook
* Removed old notebooks.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Attempt to provide table of contents for each notebook
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Second attempt.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Reintroduce transformer image.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Keep trying
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* It's going to fly !
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remainder of the Table of Contents
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix inlined elements for the table of contents
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Removed anaconda dependencies for Docker images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Removing notebooks ToC
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added LABEL to each docker image.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Removed old Dockerfile
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Directly use the context and include transformers from here.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Reduce overall size of compiled Docker images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Install jupyter by default and use CMD for easier launching of the images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Reduce number of layers in the images.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added README.md for notebooks.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix notebooks link in README
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix some wording issues.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added blog notebooks too.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing spelling errors in review comments.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
When supplied by Keras deserialization, the config parameter to initializers
will be a dict. So intercept it and convert to PretrainedConfig object (and
store in instance attribute for get_config to get at it) before passing to the
actual initializer. To accomplish this, and repeat as little code as possible,
use a class decorator on TF*MainLayer classes.
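A minimal sketch of the interception described above, using stand-in classes rather than the real Keras or transformers types. `FakeConfig`, `keras_serializable_sketch` and `ToyMainLayer` are hypothetical names chosen for illustration only:

```python
import functools


class FakeConfig:
    """Hypothetical stand-in for PretrainedConfig."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    @classmethod
    def from_dict(cls, d):
        return cls(**d)

    def to_dict(self):
        return dict(self.__dict__)


def keras_serializable_sketch(cls):
    """Class decorator: accept either a dict or a FakeConfig as `config`."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def wrapped_init(self, config, *args, **kwargs):
        # Deserialization hands over a plain dict; convert it first.
        if isinstance(config, dict):
            config = FakeConfig.from_dict(config)
        # Keep the config on the instance so get_config() can serialize it.
        self._stored_config = config
        original_init(self, config, *args, **kwargs)

    def get_config(self):
        return {"config": self._stored_config.to_dict()}

    cls.__init__ = wrapped_init
    cls.get_config = get_config
    return cls


@keras_serializable_sketch
class ToyMainLayer:
    def __init__(self, config):
        self.hidden_size = config.hidden_size


layer = ToyMainLayer({"hidden_size": 8})
restored = ToyMainLayer(**layer.get_config())
```

The decorator keeps the conversion logic in one place, so each decorated class only needs its usual initializer.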
* Rename and improve example
* Add test
* slightly faster test
* style
* This breaks remy prolly
* shorter test string
* no slow
* newdir structure
* New tree
* Style
* shorter
* docs
* clean
* Attempt future import
* more import hax
* add first copy-paste test for TF 2 generate
* add tf top_k_top_p_filter fn
* add generate function for TF
* add generate function for TF
* implemented generate for all models except transfoXL
* implemented generate for all models except transfoXL
* implemented generate for all models except transfoXL
* make style
* change permission of test file to correct ones
* delete ipdb
* delete ipdb
* fix bug and finish simple gpt2 integration test
* clean test file
* clean test file
* make style
* make style
* make style
* make style
* change import style
* change import style
* make style
* make style
* add decorators
* add decorators
* fix tf ctrl bug dim => axis in TF
* make style
* make style
* refactored test file
* refactored test file
* take out test_torch_tf_conversion if nothing is defined
* take out test_torch_tf_conversion if nothing is defined
* remove useless files
* remove useless files
* fix conflicts
* fix conflicts
* fix conflicts
* fix conflicts
* fix conflicts
* solve conflicts
* solve conflicts
* fix conflicts
* fix conflicts
* merge conflicts
* delete ipdb
* exposed top_k_top_p_filtering fns
* delete weirdly created w! file
* add comment to test tf common modeling
* fix conflicts
* fix conflicts
* make style
* merge conflicts
* make style
* change tf.tensor.shape to shape_list(tensor)
* Pipeline doc initial commit
* pipeline abstraction
* Remove modelcard argument from pipeline
* Task-specific pipelines can be instantiated with no model or tokenizer
* All pipelines doc
* Create self-hosted.yml
* Update self-hosted.yml
* Update self-hosted.yml
* Update self-hosted.yml
* Update self-hosted.yml
* Update self-hosted.yml
* do not run slow tests, for now
* [ci] For comparison with circleci, let's also run CPU-tests
* [ci] reorganize
* clearer filenames
* [ci] Final tweaks before merging
* rm slow tests on circle ci
* Trigger CI
* On GPU this concurrency was way too high
* Added support for Albert when fine-tuning for NER
* Added support for Albert in NER pipeline
* Added command-line options to examples/ner/run_ner.py to better control tokenization
* Added class AlbertForTokenClassification
* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens
* Added ,
* Now passes style guide enforcement
* Changes from reviews.
* Code now passes style enforcement
* Added test for AlbertForTokenClassification
* Added test for AlbertForTokenClassification
* Usage: Sequence Classification & Question Answering
* Pipeline example
* Language modeling
* TensorFlow code for Sequence classification
* Custom TF/PT toggler in docs
* QA + LM for TensorFlow
* Finish Usage for both PyTorch and TensorFlow
* Addressing Julien's comments
* More assertive
* cleanup
* Favicon
- added favicon option in conf.py along with the favicon image
- updated 🤗 logo. slightly smaller and should appear more consistent across editing programs (no more tongue on the outside of the mouth)
Co-authored-by: joshchagani <joshua@joshuachagani.com>
* Renamed files generated by tokenizers when calling save_pretrained to match Python.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added save_vocabulary tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove python quick and dirty fix for clean Rust impl.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.5.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added some save_pretrained / from_pretrained unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update tokenizers to 0.5.2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* flake8
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Making sure there is really a bug in unittest
* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* add preprocessing to add space before punctuation for transfo_xl
* improve warning messages
* make style
* compile regex at instantiation of tokenizer object
* Add disable_outgoing to pretrained items
Setting disable_outgoing=True disables outgoing traffic:
- etags are not looked up
- models are not downloaded
* parameter name change
* Remove forgotten print
* Testing that encode_plus and batch_encode_plus behave the same way
Spoiler alert: they don't
* Testing rest of arguments in batch_encode_plus
* Test tensor return in batch_encode_plus
* Addressing Sam's comments
* flake8
* Simplified with `num_added_tokens`
* Added support for Albert in NER pipeline
* Added command-line options to examples/ner/run_ner.py to better control tokenization
* Added class AlbertForTokenClassification
* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens
- I added an example using the model with pipelines to show that we have set `{"use_fast": False}` in the tokenizer.
- I added a Colab to play with the model and pipelines
- I added a Colab to discover Huggingface pipelines at the end of the document
* enable_padding should pad up to max_length if set.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added more testing on padding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* improving generation
* finalized special token behaviour for no_beam_search generation
* solved modeling_utils merge conflict
* solve merge conflicts in modeling_utils.py
* add run_generation improvements from PR #2749
* adapted language generation to not use hardcoded -1 if no padding token is available
* remove the -1 removal as hard coded -1`s are not necessary anymore
* add lightweight language generation testing for randomly initialized models - just checking that no errors are thrown
* add slow language generation tests for pretrained models using hardcoded output with pytorch seed
* delete ipdb
* check that all generated tokens are valid
* renaming
* renaming Generation -> Generate
* make style
* updated so that generate_beam_search has the same token behavior as generate_no_beam_search
* consistent return format for run_generation.py
* deleted pretrain lm generate tests -> will be added in another PR
* cleaning of unused if statements and renaming
* run_generate will always return an iterable
* make style
* consistent renaming
* improve naming, make sure generate function always returns the same tensor, add docstring
* add slow tests for all lmhead models
* make style and improve example comments modeling_utils
* better naming and refactoring in modeling_utils
* changed fast random lm generation testing design to more general one
* delete in old testing design in gpt2
* correct old variable name
* temporary fix for encoder_decoder lm generation tests - has to be updated when t5 is fixed
* adapted all fast random generate tests to new design
* better warning description in modeling_utils
* better comment
* better comment and error message
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* Remove warning when pad_to_max_length is not set.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move RoBERTa warning to the RoBERTa tokenizer instead of the GPT2 base tokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correctly return the tuple of generated file(s) when calling save_pretrained
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Override build_inputs_with_special_tokens for fast impl + unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality + format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Implemented fast version of tokenizers
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bumped tokenizers version requirements to latest 0.2.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added matching tests
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Matching OpenAI GPT tokenization !
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Matching GPT2 on tokenizers
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Expose add_prefix_space as constructor parameter for GPT2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Matching Roberta tokenization !
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Removed fast implementation of CTRL.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Binding TransformerXL tokenizers to Rust.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Updating tests accordingly.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added tokenizers as top-level modules.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Black & isort.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Rename LookupTable to WordLevel to match Rust side.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Black.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Use "fast" suffix instead of "ru" for rust tokenizers implementations.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Introduce tokenize() method on fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* encode_plus dispatches to batch_encode_plus
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* batch_encode_plus now dispatches to encode if there is only one input element.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bind all the encode_plus parameter to the forwarded batch_encode_plus call.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.3.0
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Formatting.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix tokenization_auto with support for new (python, fast) mapping schema.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Give correct fixtures path in test_tokenization_fast.py for the CLI.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Expose max_len_ properties on BertTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* _convert_encoding should keep the batch axis tensor if only one sample in the batch.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add warning message for RobertaTokenizerFast if used for MLM.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained().
This makes it easy to enable/disable Rust-based tokenizer instantiation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Let tokenizers handle all the truncation and padding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Allow providing tokenizer arguments during pipeline creation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update test_fill_mask pipeline to not use fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix too many parameters for convert_encoding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* When enabling padding, max_length should be set to None.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Avoid returning nested tensors of length 1 when calling encode_plus
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Ensure output is padded when return_tensor is not None.
Tensor creation requires the initial list input to be of the exact same size.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Disable transfoxl unittest if pytorch is not available (required to load the model)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* encode_plus should not remove the leading batch axis if return_tensor is set
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Temporarily disable fast tokenizers on QA pipelines.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix formatting issues.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update tokenizers to 0.4.0
* Update style
* Enable truncation + stride unit test on fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add unittest ensuring special_tokens set match between Python and Rust.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Ensure special_tokens are correctly set during construction.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Give more warning feedback to the user in case of padding without pad_token.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* quality & format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added possibility to add a single token as str
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added unittest for add_tokens and add_special_tokens on fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix rebase mismatch on pipelines qa default model.
QA requires cased input while the tokenizers would be uncased.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Using offset mapping relative to the original string + unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: save_vocabulary requires folder and file name
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Simplify import for Bert.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: truncate_and_pad disables padding according to the same heuristic as the one enabling padding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Remove private member access in tokenize()
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Bump tokenizers dependency to 0.4.2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* format & quality.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Use named arguments when applicable.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Relax type checking to include tuple and list object.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Document the truncate_and_pad manager behavior.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Raise an exception if return_offsets_mapping is not available with the current tokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* On pytorch we need to stack tensor to get proper new axis.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Generalize tests to different frameworks by removing hard-coded return_tensors="..."
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizer dependency for num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Improved error message for padding strategy without pad token.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bumping tokenizers dependency to 0.5.0 for release.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Optimizing convert_encoding around 4x improvement. 🚀
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Format & quality.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove perfect alignment constraint for Roberta (allowing 1% difference max)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Triggering final CI
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
I trained the model for more epochs so I improved the results. This commit will update the results of the model and add a gif using it with **transformers/pipelines**
* Created model card for nlptown/bert-base-multilingual-sentiment
* Delete model card
* Created model card for bert-base-multilingual-uncased-sentiment as README
* Preserve spaces in GPT-2 tokenizers
Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
tokenizers, enabling correct BPE encoding. Automatically inserts a space
in front of first token in encode function when adding special tokens.
* Add tokenization preprocessing method
* Add framework argument to pipeline factory
Also fixes pipeline test issue. Each test input now treated as a
distinct sequence.
PyTorch < 1.3 requires multiplication operands to be of the same type.
This was violated when using default attention mask (i.e.,
attention_mask=None in arguments) given BERT in the decoder mode.
In particular, this broke Model2Model and made the tutorial
from the quickstart fail.
TensorFlow 2.1.0 introduces a new dependency model where pip install tensorflow installs TF with GPU support.
Previously it installed with CPU support only. As a result, CircleCI looks for the NVIDIA driver version when initializing the
TensorFlow-related tests and fails because there is no NVIDIA driver running.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* pass langs parameter to certain XLM models
Adding an argument that specifies the language the SQuAD dataset is in so language-sensitive XLMs (e.g. `xlm-mlm-tlm-xnli15-1024`) don't default to language `0`.
Allows resolution of issue #1799 .
* fixing from `make style`
* fixing style (again)
* add "info" command to CLI
As a convenience, add the info directive to CLI. Running `python transformers-cli info` will return a string containing the transformers version, platform, python version, PT/TF version and GPU support
* Swap f-strings for .format
Still supporting 3.5 so can't use f-strings (sad face)
* Add reference in issue to CLI
* Add the expected fields to issue template
This way, people can still add the information manually if they want. (Though I fear they'll just ignore it.)
* Remove heading from output
* black-ify
* order of imports
Should ensure isort test passes
* use is_X_available over import..pass
* style
* fix copy-paste bug
* Rename command info -> env
Also adds the command to CONTRIBUTING.md in "Did you find a bug" section
The FlauBERT configuration file inherits from XLMConfig, and is recognized as such when loading from AutoModels as the XLMConfig is checked before the FlaubertConfig.
Changing the order solves this problem, but a test should be added.
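A small sketch of why the check order matters when one config class inherits from another. The class names and the `resolve` helper below are hypothetical stand-ins, not the library's actual AutoModel code:

```python
class XLMConfigSketch:
    """Hypothetical stand-in for the parent config class."""


class FlaubertConfigSketch(XLMConfigSketch):
    """Hypothetical stand-in for the child config class."""


def resolve(config, ordered_mapping):
    """Return the first model name whose config class matches via isinstance."""
    for config_cls, model_name in ordered_mapping:
        if isinstance(config, config_cls):
            return model_name
    raise ValueError("Unrecognized configuration class")


# Parent checked first: the child instance is wrongly captured by the parent.
parent_first = [(XLMConfigSketch, "xlm"), (FlaubertConfigSketch, "flaubert")]
# Child checked first: both classes resolve as intended.
child_first = [(FlaubertConfigSketch, "flaubert"), (XLMConfigSketch, "xlm")]

wrong = resolve(FlaubertConfigSketch(), parent_first)
right = resolve(FlaubertConfigSketch(), child_first)
```

A regression test along these lines would assert that every subclass appears before its parent in the mapping.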
* fill_mask helper
* [poc] FillMaskPipeline
* Revert "[poc] FillMaskPipeline"
This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335.
* Revert "fill_mask helper"
This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa.
* README: clarify that Pipelines can also do text-classification
cf. question at the AI&ML meetup last week, @mfuntowicz
* Fix test: test feature-extraction pipeline
* Test tweaks
* Slight refactor of existing pipeline (in preparation of new FillMaskPipeline)
* Extraneous doc
* More robust way of doing this
@mfuntowicz as we don't rely on the model name anymore (see AutoConfig)
* Also add RobertaConfig as a quickfix for wrong token_type_ids
* cs
* [BIG] FillMaskPipeline
In batch_encode_plus we have to ensure that the tokenizer has a pad_token_id so that, when padding, no None values are added as padding. That would happen with gpt2, openai, transfoxl.
closes https://github.com/huggingface/transformers/issues/2640
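A rough sketch of the guard described above. `pad_batch_sketch` is a hypothetical helper, and raising a `ValueError` is an assumption made for illustration; the real code may warn or handle the case differently:

```python
def pad_batch_sketch(batch_ids, pad_token_id):
    """Right-pad every sequence to the longest one in the batch."""
    # Assumption: refuse to pad rather than silently insert None values.
    if pad_token_id is None:
        raise ValueError("Cannot pad: the tokenizer has no pad_token_id.")
    max_len = max(len(ids) for ids in batch_ids)
    return [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch_ids]


padded = pad_batch_sketch([[5, 6, 7], [8]], pad_token_id=0)
```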
- mostly stylistic streamlining
- removed 'additional context' sections. They seem to be rarely used and might cause confusion. If more details are needed, users can add them to the 'details' section
T5WithLMHeadModel's doc string claims that indices of -1 are
ignored while computing the cross-entropy loss in the forward
pass; however, indices of -1 throw an error while indices of -100
are ignored. This commit updates the doc string to be consistent
with the class's behavior.
- It appears that `tqdm` only introduced `tqdm.auto` in 4.27.
- See https://github.com/tqdm/tqdm/releases/tag/v4.27.0.
- Without the lower bound I received the following stack trace in an environment where I already had tqdm installed:
```
File "/home/brendanr/anaconda3/envs/allennlp/lib/python3.6/site-packages/transformers/__init__.py", line 20, in <module>
from .file_utils import (TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE,
File "/home/brendanr/anaconda3/envs/allennlp/lib/python3.6/site-packages/transformers/file_utils.py", line 24, in <module>
from tqdm.auto import tqdm
ModuleNotFoundError: No module named 'tqdm.auto'
```
Created a link between the linear layer bias and the model attribute bias. This does not change anything for the user nor for the conversion scripts, but allows the `resize_token_embeddings` method to resize the bias as well as the weights of the decoder.
Added a test.
Modified QA pipeline to consider all features for each example before generating topk answers.
Current pipeline only takes one SquadExample, one SquadFeature, one start logit list, one end logit list to retrieve the answer, this is not correct as one SquadExample can produce multiple SquadFeatures.
Use -e only in docs targeted at contributors.
If a user copy-pastes command line with [--editable], they will hit
an error. If they don't know the --editable option, we're giving them
a choice to make before they can move forwards, but this isn't a choice
they need to make right now.
If a user or contributor ran `pip install -e .` on transformers < 3.0,
pip created a transformers.egg-info directory next to the transformers
directory at the root of the repository.
In transformers 3.0, the source is in a `src` subdirectory.
`pip install -e .` creates a transformers.egg-info directory there.
However, pip will still pick transformers.egg-info from the previous
location. This is a bug: https://github.com/pypa/pip/issues/5466
Users and contributors are likely to hit this problem because the
documentation for transformers 3.0 relies heavily on extra_requires
which didn't exist in earlier versions, so aren't defined in a stale
transformers.egg-info directory.
If such a directory exists, remove it. It's autogenerated, gitignored
and not supposed to contain anything of value.
I suspect the wrapper classes were created in order to prevent the
abstract base class (TF)CommonModelTester from being included in test
discovery and running, because that would fail.
I solved this by replacing the abstract base class with a mixin.
Code changes are just de-indenting and automatic reformattings
performed by black to use the extra line space.
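A minimal sketch of the mixin approach, using only the standard library. The class names are illustrative and do not mirror the actual test suite:

```python
import unittest


class CommonModelTesterMixin:
    """Shared tests. Not a TestCase, so discovery never runs it on its own."""

    model_name = None  # concrete test classes override this

    def test_has_model_name(self):
        self.assertIsInstance(self.model_name, str)


class ToyModelTest(CommonModelTesterMixin, unittest.TestCase):
    model_name = "toy"


# Run only the concrete class and collect the outcome.
suite = unittest.TestLoader().loadTestsFromTestCase(ToyModelTest)
result = unittest.TestResult()
suite.run(result)
```

Because the mixin does not inherit from `unittest.TestCase`, no wrapper class is needed to hide it from the test runner.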
This construct isn't used anymore these days.
Running python tests/test_foo.py puts the tests/ directory on
PYTHONPATH, which isn't representative of how we run tests.
Use python -m unittest tests/test_foo.py instead.
This prevents transformers from being importable simply because the CWD
is the root of the git repository, while not being importable from other
directories. That led to inconsistent behavior, especially in examples.
Once you fetch this commit, in your dev environment, you must run:
$ pip uninstall transformers
$ pip install -e .
These libraries aren't always installed in the virtual environment where
isort is running. Declaring them properly avoids mixing these
third-party imports with local imports.
This change is mostly autogenerated with:
$ python -m autoflake --in-place --recursive --remove-all-unused-imports --ignore-init-module-imports examples templates transformers utils hubconf.py setup.py
I made minor changes in the generated diff.
This change is mostly autogenerated with:
$ python -m autoflake --in-place --recursive examples templates transformers utils hubconf.py setup.py
I made minor changes in the generated diff.
This is the result of:
$ black --line-length 119 examples templates transformers utils hubconf.py setup.py
There are a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.
This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.
We're already using as many processes in parallel as we have CPU cores.
Furthermore, the number of cores may be incorrectly calculated as 36
(we've seen this in pytest-xdist), which compounds the problem.
PyTorch performance craters without this.
Set the number of CPUs manually based on the Circle CI resource class,
or else we're getting 36 CPUs, which is far too much (perhaps that's
the underlying hardware and not what Circle CI allocates to us).
Don't parallelize the custom tokenizers tests because they take less
than one second to run and parallelization actually makes them slower.
Since the file is written to the filesystem, a filesystem lock is the
way to go here. Add a dependency on the third-party filelock library to
get cross-platform functionality.
Caching models across test cases and across runs of the test suite makes
slow tests somewhat more bearable.
Use gettempdir() instead of /tmp in tests. This makes it easier to
change the location of the cache with semi-standard TMPDIR/TEMP/TMP
environment variables.
Fix #2222.
This allows moving the file instead of copying it, which is more
reliable. Also it avoids writing large amounts of data to /tmp,
which may not be large enough to accommodate it.
Refs #2222.
- Empty the output directory, if it contains any files or subdirectories.
- Create the "encoder" directory inside "save_directory", if it does not exist.
- Create the "decoder" directory inside "save_directory", if it does not exist.
- Save the encoder and the decoder in the previous two directories, respectively.
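The directory handling listed above could look roughly like this. `prepare_save_directory` is a hypothetical helper written with the standard library only, and the actual saving of the models is left out:

```python
import os
import shutil
import tempfile


def prepare_save_directory(save_directory):
    """Empty save_directory, then create its encoder/ and decoder/ subfolders."""
    if os.path.isdir(save_directory):
        for entry in os.listdir(save_directory):
            path = os.path.join(save_directory, entry)
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
    encoder_dir = os.path.join(save_directory, "encoder")
    decoder_dir = os.path.join(save_directory, "decoder")
    os.makedirs(encoder_dir, exist_ok=True)
    os.makedirs(decoder_dir, exist_ok=True)
    return encoder_dir, decoder_dir


# Demo on a throwaway directory that already contains a stale file.
demo_root = tempfile.mkdtemp()
open(os.path.join(demo_root, "stale.bin"), "w").close()
encoder_dir, decoder_dir = prepare_save_directory(demo_root)
```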
* Small clarification
Matches line 431 to line 435 for additional clarity and consistency.
* Fixed minor typo
The letter "s" was previously omitted from the word "docstrings".
We currently save the pretrained_weights of the encoder and decoder in
two separate directories `encoder` and `decoder`. However, for the
`from_pretrained` function to operate with automodels we need to
specify the type of model in the path to the weights.
The path to the encoder/decoder weights is handled by the
`PreTrainedEncoderDecoder` class in the `save_pretrained` function. Since
there is no easy way to infer the type of model that was initialized for
the encoder and decoder we add a parameter `model_type` to the function.
This is not an ideal solution as it is error prone, and the model type
should be carried by the Model classes somehow.
This is a temporary fix that should be changed before merging.
We currently create encoder attention masks (when they're not provided)
based on the shape of the inputs to the encoder. This is obviously
wrong; sequences can be of different lengths. We now create the encoder
attention mask based on the batch_size and sequence_length of the
encoder hidden states.
* Switch to plain unittest for skipping slow tests.
Add a RUN_SLOW environment variable for running them.
* Switch to plain unittest for PyTorch dependency.
* Switch to plain unittest for TensorFlow dependency.
* Avoid leaking open files in the test suite.
This prevents spurious warnings when running tests.
* Fix unicode warning on Python 2 when running tests.
The warning was:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
* Support running PyTorch tests on a GPU.
Reverts 27e015bd.
* Tests no longer require pytest.
* Make tests pass on cuda
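A sketch of how the RUN_SLOW switch from the list above can be built on plain unittest. The helper names and the set of accepted truthy values are assumptions; the repository's own implementation may differ:

```python
import os
import unittest


def parse_flag_from_env(key, default=False):
    """Read a boolean flag from the environment (accepted values are assumed)."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def slow(test_item):
    """Skip the decorated test unless RUN_SLOW is set to a truthy value."""
    # The environment is read once, when the decorator is applied.
    if not parse_flag_from_env("RUN_SLOW"):
        return unittest.skip("test is slow")(test_item)
    return test_item
```

With this in place, something like `RUN_SLOW=1 python -m unittest discover` would include the slow tests, and a plain run would report them as skipped.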
When evaluating, shouldn't we always use the SequentialSampler instead of DistributedSampler? Evaluation only runs on 1 GPU no matter what, so if you use the DistributedSampler with N GPUs, I think you'll only evaluate on 1/N of the evaluation set. That's at least what I'm finding when I run an older/modified version of this repo.
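To illustrate the concern, here is a simplified model of the two sampling strategies in plain Python. It ignores shuffling and the padding a real distributed sampler may add, and both function names are hypothetical:

```python
def sequential_indices(dataset_len):
    """Every example, in order: what a sequential sampler would yield."""
    return list(range(dataset_len))


def sharded_indices(dataset_len, num_replicas, rank):
    """Simplified per-rank shard: every num_replicas-th index starting at rank.

    Assumption: no shuffling and no padding to an even split.
    """
    return list(range(dataset_len))[rank::num_replicas]


eval_set_size = 8
seen_sequential = sequential_indices(eval_set_size)
seen_by_rank_0 = sharded_indices(eval_set_size, num_replicas=4, rank=0)
```

If only one process evaluates, the sharded variant covers a quarter of the set here, which matches the 1/N observation above.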
Whenever target_mapping is provided to the input, XLNet outputs two different attention streams.
Based on that, the attention output will be one of the two:
- a list of tensors (usual case for most transformers)
- a list of 2-tuples of tensors, one tensor for each of the attention streams
Docs and unit-tests have been updated
Custom schedulers are currently initiated by wrapping Pytorch's LambdaLR
class and passing a method of the wrapping class to the __init__
function of LambdaLR. This approach is not appropriate for several
reasons:
1. one does not need to define a class when it only defines a
__init__() method;
2. instantiating the parent class by passing a method of the child class
creates a cyclical reference which leads to memory leaks. See issues #1742 and #1134.
In this commit we replace the wrapper classes with functions that
instantiate `LambdaLR` with a custom learning rate function. We use a
closure to specify the parameter of the latter. We also do a bit of
renaming within the function to make the behaviour explicit and remove
docstrings that are no longer necessary.
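A sketch of the closure pattern in plain Python. The function name and the particular warmup-then-linear-decay shape are illustrative assumptions, not necessarily the schedules shipped in this commit:

```python
def linear_warmup_decay_lambda(num_warmup_steps, num_training_steps):
    """Build the multiplier function; the closure captures both parameters."""

    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            # Ramp up linearly from 0 to 1 during warmup.
            return current_step / max(1, num_warmup_steps)
        # Then decay linearly to 0 at num_training_steps.
        remaining = num_training_steps - current_step
        return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

    return lr_lambda


lr_lambda = linear_warmup_decay_lambda(num_warmup_steps=10, num_training_steps=110)
# With PyTorch installed, this callable would typically be passed as
# torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda).
```

Since no object holds a bound method of itself, the reference cycle described in point 2 does not arise.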
As pointed out in #1545, when using an uncased model, and adding
a new uncased token, the tokenizer does not correctly identify this
in the case that the input text contains the token in a cased format.
For instance, if we load bert-base-uncased into BertTokenizer, and
then use .add_tokens() to add "cool-token", we get the expected
result for .tokenize('this is a cool-token'). However, we get a
possibly unexpected result for .tokenize('this is a cOOl-Token'),
which in fact mirrors the result for the former from before the new
token was added.
This commit adds
- functionality to PreTrainedTokenizer to handle this
situation in case a tokenizer (currently Bert, DistilBert,
and XLNet) has the do_lower_case=True kwarg by:
1) lowercasing tokens added with .add_tokens()
2) lowercasing text at the beginning of .tokenize()
- new common test case for tokenizers
https://github.com/huggingface/transformers/issues/1545
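The two lowercasing steps can be sketched with a toy tokenizer. `ToyTokenizer` is hypothetical and only mimics the logic; splitting on hyphens stands in for real sub-word tokenization:

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer; only illustrates the lowercasing logic."""

    def __init__(self, do_lower_case=True):
        self.do_lower_case = do_lower_case
        self.added_tokens = set()

    def add_tokens(self, new_tokens):
        for token in new_tokens:
            # 1) lowercase tokens as they are added
            self.added_tokens.add(token.lower() if self.do_lower_case else token)

    def tokenize(self, text):
        # 2) lowercase the text before matching added tokens
        if self.do_lower_case:
            text = text.lower()
        pieces = []
        for word in text.split():
            if word in self.added_tokens:
                pieces.append(word)  # added tokens are kept whole
            else:
                pieces.extend(word.split("-"))  # stand-in for sub-word splitting
        return pieces


tokenizer = ToyTokenizer(do_lower_case=True)
tokenizer.add_tokens(["cool-token"])
print(tokenizer.tokenize("this is a cOOl-Token"))  # ['this', 'is', 'a', 'cool-token']
```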
We currently initialize `encoder_attention_mask` when it is `None`,
whether the stack is that of an encoder or a decoder. Since this
may lead to bugs that are difficult to track down, I added a condition
that assesses whether the current stack is a decoder.
Mish is a new activation function proposed here - https://arxiv.org/abs/1908.08681
It has seen some recent success and has been adopted in spaCy, Thinc, TensorFlow Addons and FastAI-dev.
All benchmarks recorded so far (including against ReLU, Swish and GELU) are available in the repository - https://github.com/digantamisra98/Mish
Might be a good addition to experiment with especially in the Bert Model.
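For reference, Mish is defined as `x * tanh(softplus(x))`. A scalar sketch in plain Python (a real model would apply the same formula element-wise to tensors):

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    return x * math.tanh(softplus(x))


for x in (-2.0, 0.0, 1.0, 5.0):
    print(x, mish(x))
```

Like Swish and GELU, the function is smooth and lets small negative values through instead of clipping them to zero.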
- Fix hanging when loading pretrained models from the cache without having internet access. This is a widespread issue on supercomputers whose internal compute nodes are firewalled.
The introduction of a decoder introduces 2 changes:
- We need to be able to specify a separate mask in the cross
attention to mask the positions corresponding to padding tokens in the
encoder state.
- The self-attention in the decoder needs to be causal on top of not
attending to padding tokens.
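The two masks can be sketched with nested lists for a single sequence. The function names are hypothetical, and 1 means "may attend" in this sketch:

```python
def decoder_self_attention_mask(padding_mask):
    """padding_mask: list of 1 (real token) / 0 (padding) for one sequence.

    Returns mask[i][j] = 1 if position i may attend to position j, i.e.
    j is not in the future (causal) and j is not a padding token.
    """
    n = len(padding_mask)
    return [
        [1 if j <= i and padding_mask[j] == 1 else 0 for j in range(n)]
        for i in range(n)
    ]


def cross_attention_mask(decoder_length, encoder_padding_mask):
    """Every decoder position may attend to every non-padding encoder position."""
    return [list(encoder_padding_mask) for _ in range(decoder_length)]


print(decoder_self_attention_mask([1, 1, 0]))  # [[1, 0, 0], [1, 1, 0], [1, 1, 0]]
print(cross_attention_mask(2, [1, 1, 0]))      # [[1, 1, 0], [1, 1, 0]]
```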
The definition of `get_masks` would blow up with the proper combination of
arguments. It was just a matter of moving a definition outside of a
control structure.
We currently instantiate encoders and decoders for the seq2seq model by
passing the `is_decoder` keyword argument to the `from_pretrained`
classmethod. On the other hand, the model class looks for the value
of the `is_decoder` attribute in its config.
In order for the value to propagate from the kwarg to the configuration
we simply need to define `is_decoder` as an attribute to the base
`PretrainedConfig`, with a default at `False`.
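A toy illustration of why the attribute has to exist on the base configuration, assuming keyword arguments only override attributes the configuration already defines. `ToyConfig` is a hypothetical stand-in, not the library class:

```python
class ToyConfig:
    """Hypothetical stand-in for the base configuration class."""

    def __init__(self, **kwargs):
        # Declared on the base class with a default, so every model can read it.
        self.is_decoder = kwargs.pop("is_decoder", False)

    @classmethod
    def from_pretrained(cls, **kwargs):
        config = cls()
        # Keyword arguments matching a config attribute override its value;
        # anything else is ignored in this sketch.
        for key, value in kwargs.items():
            if hasattr(config, key):
                setattr(config, key, value)
        return config


print(ToyConfig.from_pretrained().is_decoder)                 # False
print(ToyConfig.from_pretrained(is_decoder=True).is_decoder)  # True
```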
the data provided by Li Dong et al. were already tokenized, which means
that they are not compatible with all the models in the library. We
thus process the raw data directly and tokenize them using the models'
tokenizers.
We write a function to load and preprocess the CNN/Daily Mail dataset as
provided by Li Dong et al. The issue is that this dataset has already
been tokenized by the authors, so we actually need to find the original,
plain-text dataset if we want to apply it to all models.
In Rothe et al.'s "Leveraging Pre-trained Checkpoints for Sequence
Generation Tasks", Bert2Bert is initialized with pre-trained weights for
the encoder, and only pre-trained embeddings for the decoder. The
current version of the code completely randomizes the weights of the
decoder.
We write a custom function to initialize the weights of the decoder; we
first initialize the decoder with the weights and then randomize
everything but the embeddings.
Since the preloading of weights relies on the names of the class's
attributes, changing the namespace breaks loading pretrained weights on
Bert and all related models. I reverted `self_attention` to `attention`
and use `crossattention` for the decoder instead.
In the seq2seq model we need to both load pretrained weights in the
encoder and initialize the decoder randomly. Because the
`from_pretrained` method defined in the base class relies on module
names to assign weights, it would also initialize the decoder with
pretrained weights. To avoid this we override the method to only
initialize the encoder with pretrained weights.
The modifications that I introduced in a previous commit did break
Bert's internal API. I reverted these changes and added more general
classes to handle the encoder-decoder attention case.
There may be a more elegant way to deal with retro-compatibility (I am
not comfortable with the current state of the code), but I cannot see it
right now.
There is currently no way to specify the query, key and value separately
in the Attention module. However, the decoder's "encoder-decoder
attention" layers take the decoder's last output as a query, the
encoder's states as key and value. We thus modify the existing code so
query, key and value can be added separately.
This obviously raises some naming issues; `BertSelfAttention` is not
a self-attention module anymore. The way the residual is forwarded is
now awkward, etc. We will need to do some refactoring once the decoder is
fully implemented.
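To make the query/key/value distinction concrete, here is a single-head scaled dot-product attention over plain lists, without learned projections, batching or masking. It is a simplified sketch, not the module modified in this commit:

```python
import math


def attention(query, key, value):
    """Scaled dot-product attention for lists of vectors (no batching, one head)."""
    d = len(query[0])
    output = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        top = max(scores)  # subtract the max for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append(
            [sum(w * v[i] for w, v in zip(weights, value)) for i in range(len(value[0]))]
        )
    return output


decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: query, key and value all come from the same sequence.
self_attn = attention(decoder_states, decoder_states, decoder_states)
# Encoder-decoder attention: the decoder queries the encoder's states.
cross_attn = attention(decoder_states, encoder_states, encoder_states)
print(len(self_attn), len(cross_attn))  # 2 2
```

In both calls the output has one vector per query position, whatever the length of the key/value sequence.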
adding conversion script
adding first draft of modeling & tokenization
adding placeholder for test files
bunch of changes
registering the tokenizer/model/etc
tests
change link; something is very VERY wrong here
weird end-of-word thingy going on
i think the tokenization works now ; wrote the unit tests
overall structure works;load w next
the monster is alive!
works after some cleanup as well
adding emacs autosave to gitignore
currently only supporting the 48 layer one; seems to infer fine on my macbook
cleanup
fixing some documentation
fixing some documentation
tests passing?
now works on CUDA also
adding greedy?
adding greedy sampling
works well
Attention output was in bnij ordering instead of ijbn which everything
else will expect. This was an oversight on my part, and keeps the
attention inputs/outputs identical to the original code.
Also moved back from tensor slicing to index_select in rel_shift_bnij to
make the tracer happy.
Significant performance boost over the original orderings
on an already somewhat optimised branch this gave me > 2x end-to-end
throughput on a squad xlnet fine-tuning task (batch 8, seq-length 612,
fp16)
Lines 183 - 200, fixed indentation. Line 198, replaced `tokenizer_class` with `BertTokenizer`, since `tokenizer_class` is not defined in the loop it belongs to.
`glue_convert_examples_to_features` assumed that tensorflow_dataset
examples contains the features `'sentence1'` and `'sentence2'`. This
commit encapsulates the choice of features in the glue processor and
uses that to parse examples.
Now raises a warning when a head to be deleted already has been deleted. An integration test verifying the total pipeline (-> from config -> save model -> load model -> additional head pruning) has been added.
I assume that it should test the `re-load` functionality after testing the `save` functionality, however I'm also surprised that nobody points this out after such a long time, so maybe I've misunderstood the purpose. This PR is just in case :)
Currently the L2 regularization is hard-coded to "0.01", even though there is a --weight_decay flag implemented (that is unused). I'm making this flag control the weight decay used for fine-tuning in this script.
splitlines() does not work as we expect here for bert-base-chinese because there is a '\u2028' (Unicode line separator) token in the vocab file. splitlines() treats '\u2028' as a line boundary, so the vocab line '\u2028\n' is split into ['', ''] and the token is lost.
Perhaps we should use readlines() instead.
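A small demonstration of the difference. `io.StringIO` stands in for the vocab file here, and the three-token vocabulary is made up for illustration:

```python
import io

vocab_text = "a\n\u2028\nb\n"  # three tokens, one per line; the second is U+2028

# str.splitlines() also treats U+2028 as a line boundary, so the token is lost.
print(vocab_text.splitlines())  # ['a', '', '', 'b']

# Reading line by line only splits on "\n" here, so the token survives.
tokens = [line.rstrip("\n") for line in io.StringIO(vocab_text).readlines()]
print(tokens)  # ['a', '\u2028', 'b']
```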
The reason for the issue was that the number of optimization steps was computed from the number of examples, which differs from the actual size of the data loader when an example is chunked into multiple instances.
Solution in this pull request is to compute num_optimization_steps directly from len(data_loader).
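The discrepancy can be illustrated with made-up sizes. The helper name and all numbers below are illustrative assumptions, not values from the script:

```python
def num_optimization_steps(num_batches, gradient_accumulation_steps, num_epochs):
    """num_batches plays the role of len(data_loader)."""
    return num_batches // gradient_accumulation_steps * num_epochs


num_examples = 100   # documents before chunking (illustrative)
num_instances = 250  # training instances after chunking (illustrative)
batch_size = 10

steps_from_examples = num_optimization_steps(num_examples // batch_size, 1, 3)
steps_from_loader = num_optimization_steps(num_instances // batch_size, 1, 3)
print(steps_from_examples, steps_from_loader)  # 30 75
```

With the first value, a schedule that decays to zero would finish long before training actually ends.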
about: Submit a bug report to help us improve transformers
title: ''
labels: ''
assignees: ''
---
# 🐛 Bug
## Information
Model I am using (Bert, XLNet ...):
Language I am using the model on (English, Chinese ...):
The problem arises when using:
* [ ] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)
The task I am working on is:
* [ ] an official GLUE/SQuAD task: (give the name)
* [ ] my own task or dataset: (give details below)
## To reproduce
Steps to reproduce the behavior:
1.
2.
3.
<!-- If you have code snippets, error messages, stack traces please provide them here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.-->
## Expected behavior
<!-- A clear and concise description of what you would expect to happen. -->
## Environment info
<!-- You can run the command `transformers-cli env` and copy-and-paste its output below.
Don't forget to fill out the missing fields in that output! -->
- `transformers` version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
name: "\U0001F4DA Migration from pytorch-pretrained-bert or pytorch-transformers"
about: Report a problem when migrating from pytorch-pretrained-bert or pytorch-transformers
to transformers
title: ''
labels: Migration
assignees: ''
---
# 📚 Migration
## Information
<!-- Important information -->
Model I am using (Bert, XLNet ...):
Language I am using the model on (English, Chinese ...):
The problem arises when using:
* [ ] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)
The task I am working on is:
* [ ] an official GLUE/SQuAD task: (give the name)
* [ ] my own task or dataset: (give details below)
## Details
<!-- A clear and concise description of the migration issue.
If you have code snippets, please provide it here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
-->
## Environment info
<!-- You can run the command `python transformers-cli env` and copy-and-paste its output below.
Don't forget to fill out the missing fields in that output! -->
- `transformers` version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
<!-- IMPORTANT: which version of the former library do you use? -->
* `pytorch-transformers` or `pytorch-pretrained-bert` version (or branch):
## Checklist
- [ ] I have read the migration guide in the readme.
Everyone is welcome to contribute, and we value everybody's contribution. Code
is thus not the only way to help the community. Answering questions, helping
others, reaching out and improving the documentation are immensely valuable to
the community.
It also helps us if you spread the word: reference the library from blog posts
on the awesome projects it made possible, shout out on Twitter every time it has
helped you, or simply star the repo to say "thank you".
## You can contribute in so many ways!
There are 4 ways you can contribute to transformers:
* Fixing outstanding issues with the existing code;
* Implementing new models;
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.
*All are equally valuable to the community.*
## Submitting a new issue or feature request
Do your best to follow these guidelines when submitting an issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.
### Did you find a bug?
The transformers are robust and reliable thanks to the users who notify us of
the problems they encounter. So thank you for reporting an issue.
First, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on GitHub under Issues).
Did not find it? :( So we can act quickly on it, please follow these steps:
* Include your **OS type and version**, the versions of **Python**, **PyTorch** and
**Tensorflow** when applicable;
* A short, self-contained, code snippet that allows us to reproduce the bug in
less than 30s;
* Provide the *full* traceback if an exception is raised.
To get the OS and software versions automatically, you can run the following command:
```bash
python transformers-cli env
```
### Do you want to implement a new model?
Awesome! Please provide the following information:
* Short description of the model and link to the paper;
* Link to the implementation if it is open-source;
* Link to the model weights if they are available.
If you are willing to contribute the model yourself, let us know so we can best
guide you.
We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder.
### Do you want a new feature (that is not a model)?
A world-class feature request addresses the following points:
1. Motivation first:
* Is it related to a problem/frustration with the library? If so, please explain
why. Providing a code snippet that demonstrates the problem is best.
* Is it related to something you would need for a project? We'd love to hear
about it!
* Is it something you worked on and think could benefit the community?
Awesome! Tell us what problem it solved for you.
2. Write a *full paragraph* describing the feature;
3. Provide a **code snippet** that demonstrates its future use;
4. In case this is related to a paper, please attach a link;
5. Attach any additional information (drawings, screenshots, etc.) you think may help.
If your issue is well written we're already 80% of the way there by the time you
post it.
We have added **templates** to guide you in the process of adding a new example script for training or testing the models in the library. You can find them in the [`templates`](./templates) folder.
## Start contributing! (Pull Requests)
Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.
You will need basic `git` proficiency to be able to contribute to
`transformers`. `git` is not the easiest tool to use but it has the greatest
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.
Follow these steps to start contributing:
1. Fork the [repository](https://github.com/huggingface/transformers) by
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
under your GitHub user account.
2. Clone your fork to your local disk, and add the base repository as a remote:
6. Once you are satisfied (**and the checklist below is happy too**), go to the
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
to the project maintainers for review.
7. It's ok if maintainers ask you for changes. It happens to core contributors
too! So everyone can see the changes in the Pull request, work in your local
branch and push the changes to your fork. They will automatically appear in
the pull request.
### Checklist
1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
the pull request description to make sure they are linked (and people
consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`. These
are useful to avoid duplicated work, and to differentiate it from PRs ready
to be merged;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality test, no merge.
- If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
- If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`.
CircleCI does not run them.
6. All public methods must have informative docstrings.
### Tests
You can run 🤗 Transformers tests with `unittest` or `pytest`.
We like `pytest` and `pytest-xdist` because it's faster. From the root of the
repository, here's how to run tests with `pytest` for the library:
```bash
$ python -m pytest -n auto --dist=loadfile -s -v ./tests/
```
and for the examples:
```bash
$ pip install -r examples/requirements.txt # only needed the first time
$ python -m pytest -n auto --dist=loadfile -s -v ./examples/
```
In fact, that's how `make test` and `make test-examples` are implemented!
You can specify a smaller set of tests in order to test only the feature
you're working on.
By default, slow tests are skipped. Set the `RUN_SLOW` environment variable to
`yes` to run them. This will download many gigabytes of models — make sure you
have enough disk space and a good Internet connection, or a lot of patience!
#### This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md)
This section is dedicated to the benchmarks run with the library by maintainers, contributors and users. These
benchmarks will help keep track of the performance improvements that are brought to our models across versions.
## Benchmarking all models for inference
As of version 2.1 we have benchmarked all models for inference, across many different settings: using PyTorch, with
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for
TensorFlow XLA) and GPUs.
The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2)
The results are available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).
## TF2 with mixed precision, XLA, Distribution (@tlkh)
This work was done by [Timothy Liu](https://github.com/tlkh).
There are very positive results to be gained from the various TensorFlow 2.0 features:
- Automatic Mixed Precision (AMP)
- XLA compiler
- Distribution strategies (multi-GPU)
The benefits are listed here (tested on CoLA, MRPC, SST-2):
- AMP: Between 1.4x and 1.6x decrease in overall time without change in batch size
- AMP+XLA: Up to 2.5x decrease in overall time on SST-2 (larger dataset)
- Distribution: Between 1.4x and 3.4x decrease in overall time on 4xV100
- Combined: Up to 5.7x decrease in overall training time, or 9.1x training throughput
The model quality (measured by the validation accuracy) fluctuates slightly. Taking an average of 4 training runs
on a single GPU gives the following results:
- CoLA: AMP results in slightly lower acc (0.820 vs 0.824)
- MRPC: AMP results in lower acc (0.823 vs 0.835)
- SST-2: AMP results in slightly lower acc (0.918 vs 0.922)
However, in a distributed setting with 4xV100 (4x batch size), AMP can yield better results:
- CoLA: AMP results in higher acc (0.828 vs 0.812)
- MRPC: AMP results in lower acc (0.817 vs 0.827)
- SST-2: AMP results in slightly lower acc (0.926 vs 0.929)
The benchmark script is available [here](https://github.com/NVAITC/benchmarking/blob/master/tf2/bert_dist.py).
Note: on some tasks (e.g. MRPC), the dataset is too small: the overhead of compiling the model with XLA and of
setting up the distribution strategy outweighs the gains, so these features do not speed things up. The XLA compile time is also the reason why, although throughput
can increase a lot (e.g. 2.7x for single GPU), the overall (end-to-end) training speed-up is smaller (as low as 1.4x).
The benefits seen on SST-2 (a larger dataset) are much clearer.
All results can be seen on this [Google Sheet](https://docs.google.com/spreadsheets/d/1538MN224EzjbRL239sqSiUy6YY-rAjHyXhTzz_Zptls/edit#gid=960868445).
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving heads' output values and gradients to be able to compute head importance scores and prune heads as explained in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/run_bertology.py>`_, which extracts information from and prunes a model pre-trained on GLUE.
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints into models that can be loaded using the ``from_pretrained`` methods of the library.
.. note::
    Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**)
    available in any transformers >= 2.3.0 installation.
    The documentation below reflects the **transformers-cli convert** command format.
BERT
^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/transformers/convert_tf_checkpoint_to_pytorch.py>`_ script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
OpenAI GPT
^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ )
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
| Section | Description |
|---|---|
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## TensorFlow 2.0 Bert models on GLUE
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.
### GPT-2/GPT and causal language modeling
The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
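As a reminder of what the number means, perplexity is the exponential of the mean per-token cross-entropy loss. A minimal sketch with illustrative loss values (not actual outputs of the run above):

```python
import math


def perplexity(token_losses):
    """token_losses: per-token cross-entropy values in nats."""
    return math.exp(sum(token_losses) / len(token_losses))


# A mean loss of 3.0 nats corresponds to a perplexity of about 20.
print(perplexity([3.0, 3.0, 3.0]))
```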
### RoBERTa/BERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.
In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).
We use the `--mlm` flag so that the script may change its loss function.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
```
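The idea of dynamic masking mentioned above can be sketched as follows. This is a simplified illustration, not the script's implementation: the `mask_tokens` helper, the `<mask>` string and the 15% probability are assumptions, and the usual 80/10/10 replacement scheme is left out.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mlm_probability=0.15, rng=random):
    """Return a copy of tokens with a freshly sampled subset replaced by mask_token."""
    return [mask_token if rng.random() < mlm_probability else t for t in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
# Dynamic masking: a new mask is drawn every time the example is served,
# so different epochs generally see different masked positions.
for epoch in range(3):
    print(mask_tokens(tokens, rng=rng))
```

Static masking would instead call `mask_tokens` once during preprocessing and reuse the same masked copy in every epoch.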
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
Example usage:
```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```
## GLUE
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with total train
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median over 5 runs (with different seeds) for each of the metrics.
This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above to download the data for SQuAD.
##### Command for SQuAD1.0:
```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_gpu_eval_batch_size=4 \
    --per_gpu_train_batch_size=4 \
    --save_steps 5000
```
##### Command for SQuAD2.0:
```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $SQUAD_DIR/train-v2.0.json \
    --predict_file $SQUAD_DIR/dev-v2.0.json \
    --learning_rate 3e-5 \
    --num_train_epochs 4 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_gpu_eval_batch_size=2 \
    --per_gpu_train_batch_size=2 \
    --save_steps 5000
```
Larger batch size may improve the performance while costing more memory.
##### Results for SQuAD1.0 with the previously defined hyper-parameters:
```python
{
"exact":85.45884578997162,
"f1":92.5974600601065,
"total":10570,
"HasAns_exact":85.45884578997162,
"HasAns_f1":92.59746006010651,
"HasAns_total":10570
}
```
##### Results for SQuAD2.0 with the previously defined hyper-parameters:
```python
{
"exact":80.4177545691906,
"f1":84.07154997729623,
"total":11873,
"HasAns_exact":76.73751686909581,
"HasAns_f1":84.05558584352873,
"HasAns_total":5928,
"NoAns_exact":84.0874684608915,
"NoAns_f1":84.0874684608915,
"NoAns_total":5945
}
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 minutes
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc= 0.7093812375249501
```
## MM-IMDb
Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
### Training on MM-IMDb
```bash
python run_mmimdb.py \
--data_dir /path/to/mmimdb/dataset/ \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir /path/to/save/dir/ \
--do_train \
--do_eval \
--max_seq_len 512 \
--gradient_accumulation_steps 20 \
--num_image_embeds 3 \
--num_train_epochs 100 \
--patience 5
```
## Adversarial evaluation of model performances
Here is an example of evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was graciously provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
The first sequence, the "context" used for the question, has all its tokens represented by :obj:`0`, whereas the
question has all its tokens represented by :obj:`1`. Some models, like :class:`~transformers.XLNetModel` use an
additional token represented by a :obj:`2`.
Position IDs
--------------------------
The position IDs are used by the model to identify which token is at which position. Contrary to RNNs that have the
position of each token embedded within them, transformers are unaware of the position of each token. The position
IDs are created for this purpose.
They are an optional parameter. If no position IDs are passed to the model, they are automatically created as absolute
positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
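As a minimal sketch of the default behaviour described above (plain Python lists stand in for tensors; the function name and the value ``512`` for ``max_position_embeddings`` are assumptions made for illustration, not the library's code), absolute position IDs are just the indices ``0 .. sequence_length - 1`` repeated for every sequence in the batch:

```python
def default_position_ids(batch_size, sequence_length, max_position_embeddings=512):
    """Build absolute position IDs for a batch (hypothetical helper)."""
    # Absolute positions must stay in [0, max_position_embeddings - 1].
    if sequence_length > max_position_embeddings:
        raise ValueError("sequence is longer than the position embedding table")
    row = list(range(sequence_length))
    # Every sequence in the batch gets the same 0..sequence_length-1 indices.
    return [list(row) for _ in range(batch_size)]


position_ids = default_position_ids(batch_size=2, sequence_length=4)
print(position_ids)
```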
Feed Forward Chunking
--------------------------
In transformers, two feed forward layers usually follow the self attention layer in each residual attention block. The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (*e.g.* for ``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`` individually and concatenate them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n = sequence_length``. This trades increased computation time for reduced memory use, but yields a mathematically **equivalent** result.
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time complexity.
If ``chunk_size`` is set to 0, no feed forward chunking is done.
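The equivalence can be illustrated with a small sketch. It is not the library's :func:`~.transformers.apply_chunking_to_forward`; the toy ``feed_forward`` function and the single-sequence list layout ``[sequence_length, hidden_size]`` are assumptions chosen to keep the example self-contained:

```python
def feed_forward(position_vectors):
    """Toy position-wise layer: every position is transformed independently."""
    return [[2.0 * x + 1.0 for x in vector] for vector in position_vectors]


def chunked_feed_forward(position_vectors, chunk_size):
    """Apply feed_forward to chunk_size positions at a time, then concatenate."""
    if chunk_size == 0:
        # A chunk size of 0 means no chunking: process the whole sequence at once.
        return feed_forward(position_vectors)
    outputs = []
    for start in range(0, len(position_vectors), chunk_size):
        chunk = position_vectors[start:start + chunk_size]
        outputs.extend(feed_forward(chunk))
    return outputs


# One sequence with sequence_length=6 and hidden_size=2.
hidden_states = [[float(i), float(i) + 0.5] for i in range(6)]
full_output = chunked_feed_forward(hidden_states, chunk_size=0)
chunked_output = chunked_feed_forward(hidden_states, chunk_size=2)
print(full_output == chunked_output)
```

Only ``chunk_size`` positions are held in the intermediate layer at any time, which is where the memory saving would come from in a real model.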
- Low barrier to entry for educators and practitioners
State-of-the-art NLP for everyone:
- Deep learning researchers
- Hands-on practitioners
- AI/ML/NLP teachers and educators
Lower compute costs, smaller carbon footprint:
- Researchers can share trained models instead of always retraining
- Practitioners can reduce compute time and production costs
- 8 architectures with over 30 pretrained models, some in more than 100 languages
Choose the right framework for every part of a model's lifetime:
- Train state-of-the-art models in 3 lines of code
- Deep interoperability between TensorFlow 2.0 and PyTorch models
- Move a single model between TF2.0/PyTorch frameworks at will
- Seamlessly pick the right framework for training, evaluation, production
Contents
---------------------------------
The library currently contains PyTorch and TensorFlow implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:
1. `BERT <https://github.com/google-research/bert>`_ (from Google) released with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
2. `GPT <https://github.com/openai/finetune-transformer-lm>`_ (from OpenAI) released with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised>`_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
3. `GPT-2 <https://blog.openai.com/better-language-models>`_ (from OpenAI) released with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models>`_ by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
4. `Transformer-XL <https://github.com/kimiyoung/transformer-xl>`_ (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with the paper `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
9. `CTRL <https://github.com/pytorch/fairseq/tree/master/examples/ctrl>`_ (from Salesforce), released together with the paper `CTRL: A Conditional Transformer Language Model for Controllable Generation <https://www.github.com/salesforce/ctrl>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
10. `CamemBERT <https://huggingface.co/transformers/model_doc/camembert.html>`_ (from FAIR, Inria, Sorbonne Université) released together with the paper `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`_ by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot.
11. `ALBERT <https://github.com/google-research/ALBERT>`_ (from Google Research), released together with the paper `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
12. `XLM-RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`_ (from Facebook AI), released together with the paper `Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_ by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
13. `FlauBERT <https://github.com/getalp/Flaubert>`_ (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
Refer to the [contributing guide](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#tests) for details about running tests.
## OpenAI GPT original tokenization workflow
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` and `SpaCy`:
``` bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```
If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
## Note on model downloads (Continuous Integration or large-scale deployments)
If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
## Do you want to run a Transformer model on a mobile device?
You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
The base class ``PretrainedConfig`` implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
The base class ``PreTrainedModel`` implements the common methods for loading/saving a model either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
``PreTrainedModel`` also implements a few methods which are common among all the models to:
- resize the input token embeddings when new tokens are added to the vocabulary
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
XNLI
~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
the quality of cross-lingual text representations.
XNLI is a crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment
annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions:
A tokenizer is in charge of preparing the inputs for a model. The library comprises tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library `tokenizers`. The "Fast" implementations allow (1) a significant speed-up, in particular when doing batched tokenization, and (2) additional methods to map between the original string (characters and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token). Currently no "Fast" implementation is available for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT, XLMRoBERTa and XLNet models).
The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` implement the common methods for encoding string inputs into model inputs (see below) and instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` thus implement the main methods for using all the tokenizers:
- tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e. tokenizing + converting to integers),
- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
- managing special tokens such as mask, beginning-of-sentence, etc. (adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization)
``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
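To make the idea of mapping between characters and tokens concrete, here is a toy sketch. It uses a whitespace tokenizer rather than a real sub-word tokenizer, and every function name in it is a hypothetical illustration, not the library's implementation:

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and record each token's (start, end) character span."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token(offsets, char_index):
    """Index of the token covering char_index, or None if it falls on whitespace."""
    for token_index, (_, start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


def token_to_chars(offsets, token_index):
    """Character span (start, end) of the token at token_index."""
    _, start, end = offsets[token_index]
    return (start, end)


text = "Hello tiny world"
offsets = tokenize_with_offsets(text)
print(char_to_token(offsets, 7))   # character 7 is the "i" of "tiny"
print(token_to_chars(offsets, 2))  # span of "world"
```

Keeping the character offsets next to each token is what makes both directions of the mapping cheap to compute.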
Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `transformers`:
### Models always output `tuples`
The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
Here is a `pytorch-pretrained-bert` to `transformers` conversion example for a `BertForSequenceClassification` classification model:
1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute first, which can break derived model classes built based on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are directly forwarded to the model `__init__()` method, while the keyword arguments `**kwargs` (i) which match configuration class attributes are used to update said attributes, and (ii) which don't match any configuration class attributes are forwarded to the model `__init__()` method.
Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
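The keyword-argument routing described in point 2 can be sketched as follows. This is a simplified illustration only: `ToyConfig`, `route_kwargs` and `my_custom_flag` are hypothetical names, and the real `from_pretrained()` does considerably more.

```python
class ToyConfig:
    """Stand-in for a configuration class (hypothetical, for illustration)."""

    def __init__(self):
        self.num_labels = 2
        self.output_attentions = False


def route_kwargs(config, **kwargs):
    """Split kwargs into config updates and leftovers for the model __init__()."""
    model_kwargs = {}
    for key, value in kwargs.items():
        if hasattr(config, key):
            # (i) Matches a configuration attribute: update the configuration.
            setattr(config, key, value)
        else:
            # (ii) No match: would be forwarded to the model's __init__().
            model_kwargs[key] = value
    return config, model_kwargs


config, model_kwargs = route_kwargs(ToyConfig(), num_labels=5, my_custom_flag=True)
print(config.num_labels, model_kwargs)
```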
### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
- it only implements weight decay correction,
- schedules are now external (see below),
- gradient clipping is now also external (see below).
The new optimizer `AdamW` matches the PyTorch `Adam` optimizer API and lets you use standard PyTorch or apex methods for the schedule and clipping.
The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` and the same schedule:
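As a minimal sketch of what a linear warmup and decay schedule computes, the function below returns the multiplier applied to the base learning rate at a given step. The function name and the example step counts are assumptions for illustration; this is not the library's scheduler.

```python
def linear_warmup_decay_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < num_warmup_steps:
        # Warmup phase: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: shrink linearly from 1 to 0, never going below 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-3
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

A function of this shape is the kind of callable that PyTorch's `torch.optim.lr_scheduler.LambdaLR` accepts as its `lr_lambda` argument, which is one way to keep the schedule outside the optimizer.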
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary:
Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will directly create a class of the relevant architecture (ex: ``model = AutoModel.from_pretrained('bert-base-cased')`` will create an instance of ``BertModel``).
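The idea of guessing an architecture from a name can be sketched with simple substring matching. This is an illustrative assumption about how such a lookup might work, not the library's code; the pattern table and the ``guess_architecture`` helper are hypothetical, and only class names (as strings) are returned:

```python
# Order matters: more specific names must be tested before the generic "bert".
ARCHITECTURE_PATTERNS = [
    ("distilbert", "DistilBertModel"),
    ("xlm-roberta", "XLMRobertaModel"),
    ("roberta", "RobertaModel"),
    ("bert", "BertModel"),
    ("gpt2", "GPT2Model"),
]


def guess_architecture(name_or_path):
    """Return the class name whose pattern first appears in name_or_path."""
    lowered = name_or_path.lower()
    for pattern, class_name in ARCHITECTURE_PATTERNS:
        if pattern in lowered:
            return class_name
    raise ValueError("Unrecognized model identifier: " + name_or_path)


print(guess_architecture("bert-base-cased"))
print(guess_architecture("distilbert-base-uncased"))
```

Because ``"bert"`` is a substring of several other model names, the order of the table decides which architecture wins, which is why the generic pattern comes last among the BERT-like entries.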
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer
Paper
~~~~~
The Bart model was `proposed <https://arxiv.org/abs/1910.13461>`_ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
- BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_
Implementation Notes
~~~~~~~~~~~~~~~~~~~~
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use BartTokenizer.encode to get the proper splitting.
- The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``) if they are not passed. This is different than some other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to ``fairseq.encode`` starts with a space.
- ``BartForConditionalGeneration.generate`` should be used for conditional generation tasks like summarization, see the example in its docstring.
- Models that load the ``"bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.
Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings.
We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems.
The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.*
Tips:
- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card <https://huggingface.co/microsoft/DialoGPT-medium>`_.
Training:
In order to train or fine-tune DialoGPT, one can use causal language modeling training.
To cite the official paper:
*We follow the OpenAI GPT-2 to model a multiturn dialogue session
as a long text and frame the generation task as language modeling. We first
concatenate all dialog turns within a dialogue session into a long text
x_1,..., x_N (N is the sequence length), ended by the end-of-text token.*
For more information please refer to the original paper.
DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring <https://huggingface.co/transformers/model_doc/gpt2.html>`_.
The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.
The DistilBERT model was proposed in the blog post
`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__,
and the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling Bert base. It has 40% fewer
parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of Bert's performance as measured on
the GLUE language understanding benchmark.
The abstract from the paper is the following:
*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we
leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a
BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage
the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language
modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train
and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative
on-device study.*
Tips:
- DistilBert doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
- DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though; just let us know if you need this option.
The original code can be found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
This class can wrap an encoder model, such as ``BertModel``, and a decoder model with a language modeling head, such as ``BertForMaskedLM``, into an encoder-decoder model.
The ``EncoderDecoderModel`` class allows instantiating an encoder-decoder model using the ``from_encoder_decoder_pretrained`` class method, which takes a pretrained encoder and a pretrained decoder model as input.
The ``EncoderDecoderModel`` is saved using the standard ``save_pretrained()`` method and can be loaded again using the standard ``from_pretrained()`` method.
An application of this architecture could be *summarization* using two pretrained Bert models, as shown in the paper `Text Summarization with Pretrained Encoders <https://arxiv.org/abs/1908.08345>`_ by Yang Liu and Mirella Lapata.
OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training <https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional)
transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
The abstract from the paper is the following:
*Natural language understanding comprises a wide range of diverse tasks such
as textual entailment, question answering, semantic similarity assessment, and
document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for
discriminatively trained models to perform adequately. We demonstrate that large
gains on these tasks can be realized by generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task. In contrast to previous approaches, we make use of task-aware input
transformations during fine-tuning to achieve effective transfer while requiring
minimal changes to the model architecture. We demonstrate the effectiveness of
our approach on a wide range of benchmarks for natural language understanding.
Our general task-agnostic model outperforms discriminatively trained models that
use architectures specifically crafted for each task, significantly improving upon the
state of the art in 9 out of the 12 tasks studied.*
Tips:
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text, as
can be observed in the `run_generation.py` example script.
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT is one of them.
The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`_.
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
Overview
~~~~~~~~
The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
The abstract from the paper is the following:
*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*
The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`_ .
Axial Positional Encodings
~~~~~~~~~~~~~~~~~~~~~~~~~~
Axial Positional Encodings were first implemented in Google's `trax library <https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`_ and developed by the authors of this model's paper. In models that process very long input sequences, the conventional position id encodings store an embedding vector of size :math:`d` (the ``config.hidden_size``) for every position :math:`1, \ldots, n_s`, with :math:`n_s` being ``config.max_embedding_size``. *E.g.*, a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000` would result in a position encoding matrix:
.. math::

    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]

which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:

.. math::

    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]

and

.. math::

    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]

with:

.. math::

    d = d^1 + d^2 \text{ and } n_s = n_s^1 \times n_s^2 .

Therefore the following holds:

.. math::

    X_{i,j} = \begin{cases}
                X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
                X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
              \end{cases}

Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the composition of two factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where the ``config.max_embedding_size`` dimension :math:`j` is factorized into :math:`k \text{ and } l`.
This design ensures that each position embedding vector :math:`x_j` is unique.
Using the above example again, axial position encoding with :math:`d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}` (so that :math:`d^1 + d^2 = 2^{10}` and :math:`n_s^1 \times n_s^2 = 2^{19}`) can drastically reduce the number of parameters from over 500M to :math:`2^{18} + 2^{19} \approx 786000` parameters.
In practice, the parameter ``config.axial_pos_embds_dim`` is set to a ``list`` :math:`(d^1, d^2)` whose sum has to be equal to ``config.hidden_size``, and ``config.axial_pos_shape`` is set to a ``list`` :math:`(n_s^1, n_s^2)` whose product has to be equal to ``config.max_embedding_size``, which during training has to be equal to the ``sequence length`` of the ``input_ids``.
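The two constraints and the parameter saving can be checked with a few lines of plain Python. This is only a sketch: the helper names are hypothetical, positions are 0-based, and the values :math:`d^1 = d^2 = 2^9` are chosen here so that they sum to the hidden size of :math:`2^{10}` used in the example.

```python
def axial_param_count(axial_pos_embds_dim, axial_pos_shape):
    """Number of parameters stored by the two factorized matrices X^1 and X^2."""
    d1, d2 = axial_pos_embds_dim
    n1, n2 = axial_pos_shape
    return d1 * n1 + d2 * n2


def axial_indices(j, axial_pos_shape):
    """Split a flat 0-based position j into (k, l) = (j mod n_s^1, floor(j / n_s^1))."""
    n1, n2 = axial_pos_shape
    assert 0 <= j < n1 * n2, "position outside of the factorized range"
    return j % n1, j // n1


hidden_size = 2 ** 10
max_embedding_size = 2 ** 19
axial_pos_embds_dim = (2 ** 9, 2 ** 9)   # (d^1, d^2), assumed values
axial_pos_shape = (2 ** 9, 2 ** 10)      # (n_s^1, n_s^2)

# The two constraints stated in the text.
assert sum(axial_pos_embds_dim) == hidden_size
assert axial_pos_shape[0] * axial_pos_shape[1] == max_embedding_size

full = hidden_size * max_embedding_size
factorized = axial_param_count(axial_pos_embds_dim, axial_pos_shape)
print(full, factorized)  # 536870912 786432
```

Because every position maps to a distinct pair :math:`(k, l)`, the combined embedding vectors stay unique even though far fewer parameters are stored.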
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key query embedding vectors are also tied.
LSH self attention uses the locality sensitive
hashing mechanism proposed in `Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`_ to assign each of the tied key query embedding vectors to one of ``config.num_buckets`` possible buckets. The premise is that the more "similar" key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to the same bucket.
The accuracy of the LSH mechanism can be improved by increasing ``config.num_hashes`` or directly the argument ``num_hashes`` of the forward function so that the output of the LSH self attention better approximates the output of the "normal" full self attention.
The buckets are then sorted and chunked into query key embedding vector chunks each of length ``config.lsh_chunk_length``. For each chunk, the query embedding vectors attend to its key vectors (which are tied to themselves) and to the key embedding vectors of ``config.lsh_num_chunks_before`` previous neighboring chunks and ``config.lsh_num_chunks_after`` following neighboring chunks.
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`_ or this great `blog post <https://www.pragmatic.ml/reformer-deep-dive/>`_.
Note that ``config.num_buckets`` can also be factorized into a ``list``:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.
It is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length, a good value for ``num_buckets`` is calculated on the fly.
Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
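The bucket assignment can be illustrated with a single hashing round in plain Python. This is a minimal sketch of the angular LSH scheme (project with a random matrix :math:`R`, then take the argmax over the concatenation :math:`[xR; -xR]`), not the library's implementation; the function name and the toy vectors are hypothetical.

```python
import random


def lsh_buckets(vectors, num_buckets, seed=0):
    """Assign each vector to a bucket with one random projection (angular LSH)."""
    assert num_buckets % 2 == 0, "this sketch needs an even number of buckets"
    rng = random.Random(seed)
    dim = len(vectors[0])
    half = num_buckets // 2
    # Random Gaussian projection matrix of shape (dim, num_buckets / 2).
    proj = [[rng.gauss(0.0, 1.0) for _ in range(half)] for _ in range(dim)]
    buckets = []
    for vec in vectors:
        rotated = [sum(vec[i] * proj[i][b] for i in range(dim)) for b in range(half)]
        # The bucket id is the argmax over the concatenation [xR, -xR].
        scores = rotated + [-r for r in rotated]
        buckets.append(max(range(num_buckets), key=scores.__getitem__))
    return buckets


vectors = [
    [1.0, 0.0, 0.5],
    [2.0, 0.0, 1.0],    # same direction as the first vector
    [-1.0, 0.0, -0.5],  # opposite direction
]
buckets = lsh_buckets(vectors, num_buckets=8)
print(buckets)
```

Vectors pointing in the same direction always share a bucket, while the opposite vector lands in the mirrored bucket. Repeating the procedure with different random matrices corresponds to increasing the number of hashing rounds.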
Local Self Attention
~~~~~~~~~~~~~~~~~~~~
Local self attention is essentially a "normal" self attention layer with
key, query and value projections, but is chunked so that in each chunk of length ``config.local_chunk_length`` the query embedding vectors only attend to the key embedding vectors in their chunk and to the key embedding vectors of ``config.local_num_chunks_before`` previous neighboring chunks and ``config.local_num_chunks_after`` following neighboring chunks.
Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
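Which key positions a query can see under this chunking scheme can be sketched as follows. The helper is hypothetical and makes two simplifying assumptions: the window is clipped at the sequence boundaries, and no causal mask is applied.

```python
def local_attention_keys(query_pos, seq_len, chunk_length,
                         num_chunks_before=1, num_chunks_after=0):
    """Return the key positions visible to the query at `query_pos`."""
    assert seq_len % chunk_length == 0, "sequence must split into whole chunks"
    num_chunks = seq_len // chunk_length
    chunk = query_pos // chunk_length
    # Clip the window of neighboring chunks at the sequence boundaries.
    first = max(chunk - num_chunks_before, 0)
    last = min(chunk + num_chunks_after, num_chunks - 1)
    return list(range(first * chunk_length, (last + 1) * chunk_length))


keys = local_attention_keys(query_pos=9, seq_len=16, chunk_length=4)
print(keys)  # [4, 5, 6, 7, 8, 9, 10, 11]
```

Each query therefore scores against a fixed number of keys that depends on the chunk length and the number of neighboring chunks, not on the full sequence length.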
Training
~~~~~~~~~~~~~~~~~~~~
During training, we must ensure that the sequence length is set to a value that can be divided by the least common multiple of ``config.lsh_chunk_length`` and ``config.local_chunk_length`` and that the parameters of the Axial Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can easily be trained on sequences as long as 64000 tokens.
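The divisibility requirement can be checked before training with a small helper. This is a sketch with a hypothetical function name; it returns the smallest valid length that is greater than or equal to the requested sequence length, which is one way to decide how much padding is needed.

```python
import math


def padded_sequence_length(seq_len, lsh_chunk_length, local_chunk_length):
    """Round `seq_len` up to a multiple of lcm(lsh_chunk_length, local_chunk_length)."""
    gcd = math.gcd(lsh_chunk_length, local_chunk_length)
    lcm = lsh_chunk_length * local_chunk_length // gcd
    # Ceiling division without floating point arithmetic.
    return -(-seq_len // lcm) * lcm


print(padded_sequence_length(65536, 64, 128))  # 65536, already valid
print(padded_sequence_length(64000, 64, 96))   # 64128, lcm is 192
```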
For training, the ``ReformerModelWithLMHead`` should be used as follows:
::
input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_
by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
objective and training with much larger mini-batches and learning rates.
The abstract from the paper is the following:
*Language model pretraining has led to significant performance gains but careful comparison between different
approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of
every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These
results highlight the importance of previously overlooked design choices, and raise questions about the source
of recently reported improvements. We release our models and code.*
Tips:
- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
setup for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
different pre-training scheme.
- RoBERTa doesn't have ``token_type_ids``, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token ``tokenizer.sep_token`` (or ``</s>``).
- `Camembert <./camembert.html>`__ is a wrapper around RoBERTa. Refer to this page for usage examples.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_.
**DISCLAIMER:** This model is still a work in progress. If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.
Overview
~~~~~~~~
The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
Here is the abstract:
*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format.
Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*
The Authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_ .
Training
~~~~~~~~~~~~~~~~~~~~
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
This means that for training we always need an input sequence and a target sequence.
The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* prepended by a start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the ``lm_labels``. The PAD token is hereby used as the start-sequence token.
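The relation between the target sequence, the decoder inputs and the labels can be written out on plain lists of token ids. This is only a sketch: the helper is hypothetical, and the ids ``0`` for PAD and ``1`` for EOS are assumptions made for illustration (check ``tokenizer.pad_token_id`` and ``tokenizer.eos_token_id`` for the real values).

```python
PAD_TOKEN_ID = 0  # assumed id of the PAD token, reused as the start-sequence token
EOS_TOKEN_ID = 1  # assumed id of the EOS token


def make_decoder_tensors(target_ids):
    """Build (decoder_input_ids, lm_labels) from the raw target token ids."""
    # Labels: the target sequence with the EOS token appended.
    lm_labels = list(target_ids) + [EOS_TOKEN_ID]
    # Decoder inputs: the same sequence shifted one position to the right.
    decoder_input_ids = [PAD_TOKEN_ID] + list(target_ids)
    return decoder_input_ids, lm_labels


decoder_input_ids, lm_labels = make_decoder_tensors([100, 200, 300])
print(decoder_input_ids)  # [0, 100, 200, 300]
print(lm_labels)          # [100, 200, 300, 1]
```

At every decoder position the label is the token that follows the corresponding decoder input, which is what teacher forcing requires.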
T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens)
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
Each sentinel token represents a unique mask token for this sentence and should start with ``<extra_id_0>``, ``<extra_id_1>``, ... up to ``<extra_id_99>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
*E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:
::
input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
lm_labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
- Supervised training
In this setup the input sequence and output sequence form a standard sequence-to-sequence input-output mapping.
In translation, *e.g.* the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should
be processed as follows:
::
input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
lm_labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
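For the unsupervised denoising setup described above, the input and target strings can be derived mechanically from a sentence and a list of masked word spans. The sketch below works on whitespace-separated words rather than real tokens; the helper ``mask_spans`` and its ``first_id`` parameter (the number of the first sentinel token) are hypothetical.

```python
def mask_spans(words, spans, first_id=0):
    """Replace each (start, end) word span by a sentinel and build the target string."""
    inputs, targets = [], []
    cursor = 0
    sentinel = first_id
    for start, end in sorted(spans):
        inputs.extend(words[cursor:start])
        token = f"<extra_id_{sentinel}>"
        inputs.append(token)              # the span is replaced in the input
        targets.append(token)             # and announced in the target
        targets.extend(words[start:end])  # followed by the masked words
        cursor = end
        sentinel += 1
    inputs.extend(words[cursor:])
    # A final sentinel closes the target, followed by the EOS marker.
    targets.append(f"<extra_id_{sentinel}>")
    targets.append("</s>")
    return " ".join(inputs), " ".join(targets)


words = "The cute dog walks in the park".split()
source, target = mask_spans(words, [(1, 3), (5, 6)])
print(source)  # The <extra_id_0> walks in <extra_id_1> park
print(target)  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>
```

The two strings can then be passed to ``tokenizer.encode`` in the same way as in the denoising example.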
Tips
~~~~~~~~~~~~~~~~~~~~
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
and supervised tasks and for which each task is converted into a text-to-text format.
T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g. for translation: *translate English to German: ...*, for summarization: *summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_.
Starting with `v2.2.2`, you can now upload and share your fine-tuned models with the community, using the <abbr title="Command-line interface">CLI</abbr> that's built-in to the library.
**First, create an account on [https://huggingface.co/join](https://huggingface.co/join)**. Optionally, join an existing organization or create a new one. Then:
```shell
transformers-cli login
# log in using the same credentials as on huggingface.co
# (you can optionally override its filename, which can be nested inside a folder)
```
If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command:
```shell
--organization organization_name
```
Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above:
```python
"username/pretrained_model"
# or if an org:
"organization_name/pretrained_model"
```
**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc.
Your model now has a page on huggingface.co/models 🔥
Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformers models.
The library was designed with two strong goals in mind:
- be as easy and fast to use as possible:
- we strongly limited the number of user-facing abstractions to learn. In fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
- all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
- as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
- provide state-of-the-art models with performances as close as possible to the original models:
- we provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture,
- the code is usually as close to the original code base as possible, which means some PyTorch code may not be as *pytorchic* as it could be as a result of being converted from TensorFlow code.
A few other goals:
- expose the models' internals as consistently as possible:
- we give access, using a single API to the full hidden-states and attention weights,
- tokenizer and base model's API are standardized to easily switch between models.
- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
- a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
- simple ways to mask and prune transformer heads.
## Main concepts
The library is built around three types of classes for each model:
- **model classes** which are PyTorch models (`torch.nn.Module`) of the 8 model architectures currently provided in the library, e.g. `BertModel`
- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g. `BertTokenizer`
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- `from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
- `save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:
- the **MAIN CLASSES** section details the common functionalities/methods/attributes of the three main types of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
- the **PACKAGE REFERENCE** section details all the variants of each class for each model architecture and in particular the input/output that you should expect when calling each of them.
## Quick tour: Usage
Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
See full API reference for examples for each model class.
### BERT example
Let's start by preparing a tokenized input (a list of token embeddings indices to be fed to Bert) from a text string using `BertTokenizer`
Here is a quick-start example using `GPT2Tokenizer` and `GPT2LMHeadModel` class with OpenAI's pre-trained model to predict the next token from a text prompt.
First let's prepare a tokenized input from our text string using `GPT2Tokenizer`
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
```
Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).
#### Using the past
GPT-2 as well as some other models (GPT, XLNet, Transfo-XL, CTRL) make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
Here is a fully-working example using the `past` with `GPT2LMHeadModel` and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):
To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
* ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
* the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
* ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
* ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
* ``bert-base-german-dbmdz-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://github.com/dbmdz/german-bert>`__
* ``bert-base-german-dbmdz-uncased``: Trained on (uncased) German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://github.com/dbmdz/german-bert>`__
* ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
* ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
* ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
* a path or url to a pretrained model archive containing:
* ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
* ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/transformers/blob/master/transformers/modeling_bert.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
* ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
* ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
* ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
* ``*inputs``\ , ``**kwargs``: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
When using an ``uncased model``\ , make sure your tokenizer has ``do_lower_case=True`` (either in its configuration, or passed as an additional parameter).
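The effect of the uncased preprocessing (lowercasing plus accent stripping) can be reproduced with the standard library. This is only a sketch of the normalization described above, not the tokenizer's own code, and it leaves out the WordPiece step; the function name is hypothetical.

```python
import unicodedata


def uncase(text):
    """Lowercase the text and strip accent markers."""
    text = text.lower()
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncase("John Smith"))   # john smith
print(uncase("Café Müller"))  # cafe muller
```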
Usually, if you don't set any specific environment variable, ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
You can always safely delete the ``pytorch_pretrained_bert`` cache, but the pretrained model weights and vocabulary files will have to be re-downloaded from our S3.
Serialization best-practices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:
* the model itself which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
* the configuration file of the model which is saved as a JSON file, and
* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
The *default filenames* of these files are as follows:
* the model weights file: ``pytorch_model.bin``\ ,
* the configuration file: ``config.json``\ ,
* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
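Before reloading, it can help to verify that an output directory actually contains the default filenames listed above. The helper below is a hypothetical sketch that relies only on the standard library; the model type keys are labels chosen for this example, not identifiers used by the library.

```python
import os
import tempfile

# Default filenames per model family, as listed above.
DEFAULT_FILES = {
    "bert": ["pytorch_model.bin", "config.json", "vocab.txt"],
    "transfo-xl": ["pytorch_model.bin", "config.json", "vocab.txt"],
    "gpt": ["pytorch_model.bin", "config.json", "vocab.json", "merges.txt"],
    "gpt2": ["pytorch_model.bin", "config.json", "vocab.json", "merges.txt"],
}


def missing_default_files(output_dir, model_type):
    """Return the default filenames that are absent from `output_dir`."""
    return [
        name
        for name in DEFAULT_FILES[model_type]
        if not os.path.isfile(os.path.join(output_dir, name))
    ]


# Demonstration on a temporary directory that lacks the weights file.
with tempfile.TemporaryDirectory() as output_dir:
    for name in ("config.json", "vocab.txt"):
        open(os.path.join(output_dir, name), "w").close()
    missing = missing_default_files(output_dir, "bert")
print(missing)  # ['pytorch_model.bin']
```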
Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
.. code-block:: python
from transformers import WEIGHTS_NAME, CONFIG_NAME
output_dir = "./models/"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
Summarization is the task of summarizing a text / an article into a shorter text.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization.
If you would like to fine-tune a model on a summarization task, you may leverage the ``examples/summarization/bart/run_train.sh`` (leveraging pytorch-lightning) script.
Here is an example using the pipelines to do summarization.
It leverages a Bart model that was fine-tuned on the CNN / Daily Mail data set.
::
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` and ``min_length`` above.
This outputs the following summary:
::
Liana Barrientos has been married 10 times, sometimes within two weeks of each other. Prosecutors say the marriages were part of an immigration scam. She pleaded not guilty at State Supreme Court in the Bronx on Friday.
Here is an example doing summarization using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
- Define the article that should be summarized.
- Leverage the ``PreTrainedModel.generate()`` method.
- Add the T5 specific prefix "summarize: ".
Here Google's T5 model is used, which was only pre-trained on a multi-task mixed data set (including CNN / Daily Mail) but nevertheless yields very good results.
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
Translation is the task of translating a text from one language to another.
An example of a translation dataset is the WMT English to German dataset, which has English sentences as the input data
and German sentences as the target data.
Here is an example using the pipelines to do translation.
It leverages a T5 model that was only pre-trained on a multi-task mixture dataset (including WMT), but yields impressive
translation results nevertheless.
::
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
This outputs the following translation into German:
::
Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
Here is an example doing translation using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
- Define the text that should be translated.
- Leverage the ``PreTrainedModel.generate()`` method.
- Add the T5 specific prefix "translate English to German: "
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
Here is an example of evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was graciously provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
This folder contains examples which are not actively maintained (mostly contributed by the community).
Using these examples together with a recent version of the library usually requires making small (sometimes big) adaptations to get the scripts working.
log_str = '| {0} loss {1:5.2f} | {0} ppl {2:9.3f}'.format(
split, loss, math.exp(loss))
log_str = "| {0} loss {1:5.2f} | {0} ppl {2:9.3f}".format(split, loss, math.exp(loss))
return log_str
log_str = ''
log_str = ""
if valid_loss is not None:
log_str += format_log(valid_loss, 'valid')
log_str += format_log(valid_loss, "valid")
if test_loss is not None:
log_str += format_log(test_loss, 'test')
log_str += format_log(test_loss, "test")
logger.info('=' * 100)
logger.info("=" * 100)
logger.info(log_str)
logger.info('=' * 100)
logger.info("=" * 100)
if __name__ == '__main__':
if __name__ == "__main__":
main()
Some files were not shown because too many files have changed in this diff