Compare commits

...

129 Commits

Author SHA1 Message Date
LysandreJik
c781171dfa Release: v4.0.0 2020-11-30 11:33:35 -05:00
LysandreJik
ab597c84d1 Remove deprecated evalutate_during_training (#8852)
* Remove deprecated `evalutate_during_training`

* Update src/transformers/training_args_tf.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-30 11:17:43 -05:00
Sylvain Gugger
e72b4fafeb Add a direct link to the big table (#8850) 2020-11-30 10:40:02 -05:00
Fraser Greenlee
dc0dea3e42 Correct docstring. (#8845)
Related issue: https://github.com/huggingface/transformers/issues/8837
2020-11-30 10:39:52 -05:00
Patrick von Platen
4d8f5d12b3 add xlnet mems and fix merge conflicts 2020-11-30 09:45:12 +01:00
Lysandre Debut
710b0108c9 Migration guide from v3.x to v4.x (#8763)
* Migration guide from v3.x to v4.x

* Better wording

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Sylvain's comments

* Better wording.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-29 20:13:31 -05:00
Patrick von Platen
87199dee00 fix mt5 config (#8832) 2020-11-29 20:12:38 -05:00
Sylvain Gugger
68879472c4 Big model table (#8774)
* First draft

* Styling

* With all changes staged

* Update docs/source/index.rst

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Styling

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-29 20:12:24 -05:00
Patrick von Platen
8c5a2b8e36 [Flax test] Add require pytorch to flix flax test (#8816)
* try flax fix

* same for roberta
2020-11-29 20:11:34 -05:00
Kristian Holsheimer
911d8486e8 [FlaxBert] Fix non-broadcastable attention mask for batched forward-passes (#8791)
* [FlaxBert] Fix non-broadcastable attention mask for batched forward-passes

* [FlaxRoberta] Fix non-broadcastable attention mask

* Use jax.numpy instead of ordinary numpy (otherwise not jit-able)

* Partially revert "Use jax.numpy ..."

* Add tests for batched forward passes

* Avoid unnecessary OOMs due to preallocation of GPU memory by XLA

* Auto-fix style

* Re-enable GPU memory preallocation but with mem fraction < 1/paralleism
2020-11-29 20:11:23 -05:00
Lysandre
563efd36ab Fix dpr<>bart config for RAG (#8808)
* correct dpr test and bert pos fault

* fix dpr bert config problem

* fix layoutlm

* add config to dpr as well
2020-11-29 20:10:33 -05:00
Lysandre Debut
5a63232a8a Fix QA argument handler (#8765)
* Fix QA argument handler

* Attempt to get a better fix for QA (#8768)

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2020-11-29 20:06:10 -05:00
Lysandre Debut
e46890f699 MT5 should have an autotokenizer (#8743)
* MT5 should have an autotokenizer

* Different configurations should be able to point to same tokenizers
2020-11-24 09:51:34 -05:00
Lysandre Debut
df2cdd84f3 Fix slow tests v2 (#8746)
* Fix BART test

* Fix MBART tests

* Remove erroneous line from yaml

* Update tests/test_modeling_bart.py

* Quality
2020-11-24 09:51:28 -05:00
LysandreJik
c6e2876cd4 TF BERT test update 2020-11-23 18:19:54 -05:00
LysandreJik
5580cccd81 Update TF BERT test 2020-11-23 18:19:34 -05:00
Stas Bekman
ccc4f64044 consistent ignore keys + make private (#8737)
* consistent ignore keys + make private

* style

* - authorized_missing_keys    => _keys_to_ignore_on_load_missing
  - authorized_unexpected_keys => _keys_to_ignore_on_load_unexpected

* move public doc of private attributes to private comment
2020-11-23 17:55:15 -05:00
Sylvain Gugger
3408e6ffcd Change default cache path (#8734)
* Change default cache path

* Document changes

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-23 17:54:45 -05:00
Santiago Castro
a986b02e49 Fix many typos (#8708) 2020-11-23 17:54:20 -05:00
Sylvain Gugger
b6ec39e41f Document adam betas TrainingArguments (#8688) 2020-11-23 17:53:49 -05:00
Sylvain Gugger
f80ea27f80 Add sentencepiece to the CI and fix tests (#8672)
* Fix the CI and tests

* Fix quality

* Remove that m form nowhere
2020-11-23 17:53:27 -05:00
Sylvain Gugger
0603564e93 Merge remote-tracking branch 'origin/master' 2020-11-19 12:18:57 -05:00
Sylvain Gugger
1e08af383a Forgot to save... 2020-11-19 12:18:50 -05:00
LysandreJik
d86b5ffc6f Release: v4.0.0-rc-1 2020-11-19 12:00:07 -05:00
Sylvain Gugger
cb3e5c33f7 Fix a few last paths for the new repo org (#8666) 2020-11-19 11:56:42 -05:00
Matthias
a79a96ddaa fix small typo (#8644)
Fixed a small typo on the XLNet and permutation language modelling section
2020-11-19 11:24:11 -05:00
Sylvain Gugger
4208f496ee Better filtering of the model outputs in Trainer (#8633)
* Better filtering of the model outputs in Trainer

* Fix examples tests

* Add test for Lysandre
2020-11-19 10:43:15 -05:00
Lysandre Debut
f2e07e7272 Fix a bunch of slow tests (#8634)
* CI should install `sentencepiece`

* Requiring TF

* Fixing some TFDPR bugs

* remove return_dict=False/True hack

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2020-11-19 10:41:41 -05:00
elk-cloner
5362bb8a6b Tf longformer for sequence classification (#8231)
* working on LongformerForSequenceClassification

* add TFLongformerForMultipleChoice

* add TFLongformerForTokenClassification

* use add_start_docstrings_to_model_forward

* test TFLongformerForSequenceClassification

* test TFLongformerForMultipleChoice

* test TFLongformerForTokenClassification

* remove test from repo

* add test and doc for TFLongformerForSequenceClassification, TFLongformerForTokenClassification, TFLongformerForMultipleChoice

* add requested classes to modeling_tf_auto.py
update dummy_tf_objects
fix tests
fix bugs in requested classes

* pass all tests except test_inputs_embeds

* sync with master

* pass all tests except test_inputs_embeds

* pass all tests

* pass all tests

* work on test_inputs_embeds

* fix style and quality

* make multi choice work

* fix TFLongformerForTokenClassification signature

* fix TFLongformerForMultipleChoice, TFLongformerForSequenceClassification signature

* fix mult choice

* fix mc hint

* fix input embeds

* fix input embeds

* refactor input embeds

* fix copy issue

* apply sylvains changes and clean more

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-11-19 10:37:27 -05:00
Quentin Lhoest
62cd9ce9f8 fix missing return dict (#8653) 2020-11-19 15:17:18 +01:00
Amine Abdaoui
0c2677f529 [model card] : fix bert-base-15lang-cased (#8655)
the table was badly formatted because of a single line break
2020-11-19 05:41:02 -05:00
Amine Abdaoui
0a80959bdd Add cards for all Geotrend models (#8617)
* docs(bert-base-15lang-cased): add model card

* add cards for all Geotrend models

* [model cards] fix language tag for all Geotrend models
2020-11-19 04:47:24 -05:00
cronoik
dcc9c64299 Updated the Extractive Question Answering code snippets (#8636)
* Updated the Extractive Question Answering code snippets

The Extractive Question Answering code snippets do not work anymore since the models return task-specific output objects. This commit fixes the pytorch and tensorflow examples but adding `.values()` to the model call.

* Update task_summary.rst
2020-11-18 18:56:47 -05:00
Tim Isbister
28d16e7ac5 Update README.md (#8635) 2020-11-18 18:35:23 -05:00
cronoik
b290195ac7 grammar (#8639) 2020-11-18 18:04:25 -05:00
Stas Bekman
d86d57faa3 [s2s] distillation apex breaks return_dict obj (#8631)
* apex breaks return_dict obj

* style
2020-11-18 12:51:29 -08:00
Perez Ogayo
bf3611b2ab Created ModelCard for Hel-ach-en MT model (#8496)
* Updated ModelCard

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-18 14:42:13 -05:00
Yifan Peng
c95b26a719 Create README.md (#8362) 2020-11-18 13:37:14 -05:00
Manuel Romero
fdbbb6c17a Model card: T5-base fine-tuned on QuaRTz (#8369)
* Model card: T5-base fine-tuned on QuaRTz

* Update model_cards/mrm8488/t5-base-finetuned-quartz/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-18 13:34:27 -05:00
Yifan Peng
6e6d24c5d8 Create README.md (#8363) 2020-11-18 13:33:04 -05:00
Divyanshu Kakwani
35fd3d64e3 Add model card for ai4bharat/indic-bert (#8464) 2020-11-18 13:28:49 -05:00
dartrevan
38f01dfe03 Update README.md (#8405)
* Update README.md

* Update README.md
2020-11-18 13:23:08 -05:00
Abhilash Majumder
2d8fbf012a Model Card for abhilash1910/financial_roberta (#8625)
* Model Card for abhilash1910/financial_roberta

* Update model_cards/abhilash1910/financial_roberta/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-18 13:22:28 -05:00
Vishal Singh
26dc6593f3 Update README.md (#8544)
Modified Model in Action section. The class `AutoModelWithLMHead` is deprecated so changed it to `AutoModelForSeq2SeqLM` for encoder-decoder models. Removed duplicate eos token.
2020-11-18 13:19:32 -05:00
smanjil
6c8fad4f0d replace performance table with markdown (#8565)
* replace performance table with markdown

* Update model_cards/smanjil/German-MedBERT/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-18 13:17:46 -05:00
hhou435
e7f77fc52a model_cards for Chinese Couplet and Poem GPT2 models (#8620) 2020-11-18 13:06:30 -05:00
Sylvain Gugger
a0c62d2493 Fix training from scratch in new scripts (#8623) 2020-11-18 12:15:26 -05:00
Sylvain Gugger
1e62e999e8 Fixes the training resuming with gradient accumulation (#8624) 2020-11-18 12:00:11 -05:00
Patrick von Platen
cdfa56afe0 [Tokenizer Doc] Improve tokenizer summary (#8622)
* improve summary

* small fixes

* cleaned line length

* correct "" formatting

* apply sylvains suggestions
2020-11-18 17:14:15 +01:00
Nicola De Cao
2f9d49b389 Adding PrefixConstrainedLogitsProcessor (#8529)
* Adding PrefixConstrainedLogitsProcessor

* fixing RAG and style_doc

* fixing black (v20 instead of v19)

* Improving doc in generation_logits_process.py

* Improving docs and typing in generation_utils.py

* docs improvement

* adding test and fixing doc typo

* fixing doc_len

* isort on test

* fixed test

* improve docstring a bit

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-11-18 17:06:25 +01:00
Julien Plu
3bc1540070 New TF loading weights (#8490)
* New TF loading weights

* apply style

* Better naming

* Largely comment the loading method

* Apply style

* Address Patrick's comments

* Remove useless line of code

* Update Docstring

* Address Sylvain's and Lysandre's comments

* Simplify the names computation

* Typos
2020-11-18 10:48:31 -05:00
Ratthachat (Jung)
0df91ee4f7 self.self.activation_dropout -> self.activation_dropout (#8611)
(one line typo)
2020-11-18 10:30:29 -05:00
Stas Bekman
cdf1b7ae82 fix to adjust for #8530 changes (#8612) 2020-11-18 10:25:00 -05:00
Stas Bekman
2819da02f7 [s2s] broken test (#8613) 2020-11-18 10:15:53 -05:00
Michał Pogoda
9fa3ed1a7f Fix missing space in multiline warning (#8593)
Multiline string informing about missing PyTorch/TensorFlow had missing space.
2020-11-18 10:09:26 -05:00
Sylvain Gugger
8fcb6935a1 Fix DataCollatorForLanguageModeling (#8621) 2020-11-18 10:02:50 -05:00
Benjamin Minixhofer
f6fe41c96b Reset loss to zero on logging in Trainer to avoid bfloat16 issues (#8561)
* make tr_loss regular float

* Revert "make tr_loss regular float"

This reverts commit c9d7ccfaf0c4387187b0841694f01ec0ffd5f4ba.

* reset loss at each logging step

* keep track of total loss with _total_loss_scalar

* add remaining tr_loss at the end
2020-11-18 09:58:08 -05:00
cronoik
b592728eff Fixed link to the wrong paper. (#8607) 2020-11-17 19:00:44 -05:00
Sylvain Gugger
0512444ee5 Remove old doc 2020-11-17 17:34:25 -05:00
Caitlin Ostroff
5cf9c79665 Add Harry Potter Model Card (#8605)
* Add Harry Potter Model

* Update model_cards/ceostroff/harry-potter-gpt2-fanfiction/README.md

* Update model_cards/ceostroff/harry-potter-gpt2-fanfiction/README.md

* Update model_cards/ceostroff/harry-potter-gpt2-fanfiction/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-17 16:50:58 -05:00
Sylvain Gugger
dd52804f5f Remove deprecated (#8604)
* Remove old deprecated arguments

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

* Remove needless imports

* Fix tests

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-11-17 15:11:29 -05:00
Lysandre Debut
3095ee9dab Tokenizers should be framework agnostic (#8599)
* Tokenizers should be framework agnostic

* Run the slow tests

* Not testing

* Fix documentation

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-11-17 14:03:03 -05:00
Sylvain Gugger
7f3b41a306 Fix check repo utils (#8600) 2020-11-17 14:01:46 -05:00
Stas Bekman
f0435f5a61 these should run fine on multi-gpu (#8582) 2020-11-17 14:00:41 -05:00
Sylvain Gugger
36a19915ea Fix model templates (#8595)
* First fixes

* Fix imports and add init

* Fix typo

* Move init to final dest

* Fix tokenization import

* More fixes

* Styling
2020-11-17 10:35:38 -05:00
Julien Chaumond
042a6aa777 Tokenizers: ability to load from model subfolder (#8586)
* <small>tiny typo</small>

* Tokenizers: ability to load from model subfolder

* use subfolder for local files as well

* Uniformize model shortcut name => model id

* from s3 => from huggingface.co

Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
2020-11-17 08:58:45 -05:00
Sylvain Gugger
48395d6b8e Fix init for MT5 (#8591) 2020-11-17 08:52:13 -05:00
sgugger
a6cf9ca00b Add __init__ to the models folder 2020-11-17 07:39:37 -05:00
Patrick von Platen
5104223552 [MT5] More docs (#8589)
* add docs

* make style
2020-11-17 12:47:57 +01:00
Patrick von Platen
86822a358b T5 & mT5 (#8552)
* add mt5 and t5v1_1 model

* fix tests

* correct some imports

* add tf model

* finish tf t5

* improve examples

* fix copies

* clean doc
2020-11-17 12:23:09 +01:00
fajri91
9e01f988dd model_card for indolem/indobert-base-uncased (#8579) 2020-11-17 03:36:50 -05:00
Sylvain Gugger
c89bdfbe72 Reorganize repo (#8580)
* Put models in subfolders

* Styling

* Fix imports in tests

* More fixes in test imports

* Sneaky hidden imports

* Fix imports in doc files

* More sneaky imports

* Finish fixing tests

* Fix examples

* Fix path for copies

* More fixes for examples

* Fix dummy files

* More fixes for example

* More model import fixes

* Is this why you're unhappy GitHub?

* Fix imports in conver command
2020-11-16 21:43:42 -05:00
Julien Plu
901507335f Fix mixed precision issue for GPT2 (#8572)
* Fix mixed precision issue for GPT2

* Forgot one cast

* oops

* Forgotten casts
2020-11-16 14:44:19 -05:00
Sylvain Gugger
1073a2bde5 Switch return_dict to True by default. (#8530)
* Use the CI to identify failing tests

* Remove from all examples and tests

* More default switch

* Fixes

* More test fixes

* More fixes

* Last fixes hopefully

* Use the CI to identify failing tests

* Remove from all examples and tests

* More default switch

* Fixes

* More test fixes

* More fixes

* Last fixes hopefully

* Run on the real suite

* Fix slow tests
2020-11-16 11:43:00 -05:00
Sylvain Gugger
0d0a0785fd Update version to v4.0.0-dev (#8568) 2020-11-16 10:21:19 -05:00
LSinev
afb50c663a Fix GPT2DoubleHeadsModel to work with model.generate() (#6601)
* Fix passing token_type_ids during GPT2DoubleHeadsModel.generate() if used

and for GPT2LMHeadModel too

* Update tests to check token_type_ids usage in GPT2 models
2020-11-16 14:35:44 +01:00
Yusuke Mori
04d8136bde Adding the prepare_seq2seq_batch function to ProphetNet (#8515)
* Simply insert T5Tokenizer's prepare_seq2seq_batch

* Update/Add some 'import'

* fix RunTimeError caused by '.view'

* Moves .view related error avoidance from seq2seq_trainer to inside prophetnet

* Update test_tokenization_prophetnet.py

* Format the test code with black

* Re-format the test code

* Update test_tokenization_prophetnet.py

* Add importing require_torch in the test code

* Add importing BatchEncoding in the test code

* Re-format the test code on Colab
2020-11-16 14:18:25 +01:00
Stas Bekman
931b10978e [doc] typo fix (#8535)
* [doc] typo fix

@sgugger

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-16 08:05:30 -05:00
Branden Chan
6db21a06ae Clearer Model Versioning Example (#8562) 2020-11-16 06:59:10 -05:00
Mehrdad Farahani
daaa68451e Readme for Wiki Summary [Persian] bert2bert (#8558) 2020-11-16 05:04:46 -05:00
Mehrdad Farahani
06d468d3f0 Readme for News Headline Generation (bert2bert) (#8557) 2020-11-16 05:04:38 -05:00
zhezhaoa
9b7fb8a368 Create README.md for Chinese RoBERTa Miniatures (#8550)
* Create README.md

* Update model_cards/uer/chinese_roberta_L-2_H-128/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-16 05:01:28 -05:00
Thomas Wolf
f4e04cd2c6 [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073)
* Fixing roberta for slow-fast tests

* WIP getting equivalence on pipelines

* slow-to-fast equivalence - working on question-answering pipeline

* optional FAISS tests

* Pipeline Q&A

* Move pipeline tests to their own test job again

* update tokenizer to add sequence id methods

* update to tokenizers 0.9.4

* set sentencepiecce as optional

* clean up squad

* clean up pipelines to use sequence_ids

* style/quality

* wording

* Switch to use_fast = True by default

* update tests for use_fast at True by default

* fix rag tokenizer test

* removing protobuf from required dependencies

* fix NER test for use_fast = True by default

* fixing example tests (Q&A examples use slow tokenizers for now)

* protobuf in main deps extras["sentencepiece"] and example deps

* fix protobug install test

* try to fix seq2seq by switching to slow tokenizers for now

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-15 22:50:59 +01:00
Julien Plu
24184e73c4 Rework some TF tests (#8492)
* Update some tests

* Small update

* Apply style

* Use max_position_embeddings

* Create a fake attribute

* Create a fake attribute

* Update wrong name

* Wrong TransfoXL model file

* Keep the common tests agnostic
2020-11-13 17:07:17 -05:00
Patrick von Platen
f6cdafdec7 fix load weights (#8528)
* fix load weights

* delete line
2020-11-13 20:31:40 +01:00
Joe Davison
f6f4da8dd4 Add bart-large-mnli model card (#8527) 2020-11-13 14:07:25 -05:00
Julien Chaumond
725269746b Model sharing doc: more tweaks (#8520)
* More doc tweaks

* Update model_sharing.rst

* make style

* missing newline

* Add email tip

Co-authored-by: Pierric Cistac <pierric@huggingface.co>
2020-11-13 12:10:26 -05:00
LysandreJik
9d519dabb7 Fix paths in github YAML 2020-11-13 12:04:17 -05:00
Lysandre Debut
826f04576f Model templates encoder only (#8509)
* Model templates

* TensorFlow

* Remove pooler

* CI

* Tokenizer + Refactoring

* Encoder-Decoder

* Let's go testing

* Encoder-Decoder in TF

* Let's go testing in TF

* Documentation

* README

* Fixes

* Better names

* Style

* Update docs

* Choose to skip either TF or PT

* Code quality fixes

* Add to testing suite

* Update file path

* Cookiecutter path

* Update `transformers` path

* Handle rebasing

* Remove seq2seq from model templates

* Remove s2s config

* Apply Sylvain and Patrick comments

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Last fixes from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-13 11:59:30 -05:00
Patrick von Platen
42e2d02e44 [T5] Bug correction & Refactor (#8518)
* fix bug

* T5 refactor

* refactor tf

* apply sylvains suggestions
2020-11-13 16:57:31 +01:00
Sylvain Gugger
42f63e3871 Merge remote-tracking branch 'origin/master' 2020-11-13 10:30:04 -05:00
Sylvain Gugger
bb03a14edd Update doc for v3.5.1 2020-11-13 10:29:58 -05:00
Branden Chan
4df6b59318 Update deepset/roberta-base-squad2 model card (#8522)
* Update README.md

* Update README.md
2020-11-13 09:58:27 -05:00
Sylvain Gugger
0c9bae0934 Remove typo 2020-11-12 22:39:57 -05:00
Julien Plu
5d80539488 Add pretraining loss computation for TF Bert pretraining (#8470)
* Add pretraining loss computation for TF Bert pretraining

* Fix labels creation

* Fix T5 model

* restore T5 kwargs

* try a generic fix for pretraining models

* Apply style

* Overide the prepare method for the BERT tests
2020-11-12 14:08:26 -05:00
Julien Plu
91a67b7506 Use LF instead of os.linesep (#8491) 2020-11-12 13:52:40 -05:00
Julien Plu
27b3ff316a Try to understand and apply Sylvain's comments (#8458) 2020-11-12 13:43:00 -05:00
Forrest Iandola
0fa0349883 fix SqueezeBertForMaskedLM (#8479) 2020-11-12 12:19:37 -05:00
Sylvain Gugger
7933054638 Model sharing doc (#8498)
* Model sharing doc

* Style
2020-11-12 11:53:23 -05:00
Chengxi Guo
d65e0bfea3 Fix doc bug (#8500)
* fix doc bug

Signed-off-by: mymusise <mymusise1@gmail.com>

* fix example bug

Signed-off-by: mymusise <mymusise1@gmail.com>
2020-11-12 11:47:23 -05:00
zeyuyun1
924c624a46 quick fix on concatenating text to support more datasets (#8474) 2020-11-12 09:47:08 -05:00
Antonio Lanza
17b1fd804f Fix typo in roberta-base-squad2-v2 model card (#8489) 2020-11-12 05:29:37 -05:00
Julien Chaumond
c6c08ebf61 [model_cards] other chars than [\w\-_] not allowed anymore in model names
cc @Pierrci
2020-11-12 10:45:29 +01:00
Funtowicz Morgan
121c24efa4 Update deploy-docs dependencies on CI to enable Flax (#8475)
* Update deploy-docs dependencies on CI to enable Flax

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added pair of ""

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-11-11 18:31:41 -05:00
Sumithra Bhakthavatsalam
81ebd70671 [s2s] distill t5-large -> t5-small (#8376)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-11 17:58:45 -05:00
Funtowicz Morgan
a5b682329c Flax/Jax documentation (#8331)
* First addition of Flax/Jax documentation

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* make style

* Ensure input order match between Bert & Roberta

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Install dependencies "all" when building doc

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* wraps build_doc deps with ""

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing @sgugger comments.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use list to highlight JAX features.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Let's not look to much into the future for now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-11 14:53:36 -05:00
Lysandre
c7b6bbec5c Skip test until investigation 2020-11-11 12:59:40 -05:00
Beomsoo Kim
aa2a2c6579 Replaced some iadd operations on lists with proper list methods. (#8433) 2020-11-11 12:29:57 -05:00
Ratthachat (Jung)
026a2ff225 Add TFDPR (#8203)
* Create modeling_tf_dpr.py

* Add TFDPR

* Add back TFPegasus, TFMarian, TFMBart, TFBlenderBot

last commit accidentally deleted these 4 lines, so I recover them back

* Add TFDPR

* Add TFDPR

* clean up some comments, add TF input-style doc string

* Add TFDPR

* Make return_dict=False as default

* Fix return_dict bug (in .from_pretrained)

* Add get_input_embeddings()

* Create test_modeling_tf_dpr.py

The current version is already passed all 27 tests!
Please see the test run at : 
https://colab.research.google.com/drive/1czS_m9zy5k-iSJbzA_DP1k1xAAC_sdkf?usp=sharing

* fix quality

* delete init weights

* run fix copies

* fix repo consis

* del config_class, load_tf_weights

They shoud be 'pytorch only'

* add config_class back

after removing it, test failed ... so totally only removing "use_tf_weights = None" on Lysandre suggestion

* newline after .. note::

* import tf, np (Necessary for ModelIntegrationTest)

* slow_test from_pretrained with from_pt=True

At the moment we don't have TF weights (since we don't have official official TF model)
Previously, I did not run slow test, so I missed this bug

* Add simple TFDPRModelIntegrationTest

Note that this is just a test that TF and Pytorch gives approx. the same output.
However, I could not test with the official DPR repo's output yet

* upload correct tf model

* remove position_ids as missing keys

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: patrickvonplaten <patrick@huggingface.co>
2020-11-11 12:28:09 -05:00
sarnoult
a38d1c7c31 Example NER script predicts on tokenized dataset (#8468)
The new run_ner.py script tries to run prediction on the input
test set `datasets["test"]`, but it should be the tokenized set
`tokenized_datasets["test"]`
2020-11-11 10:28:23 -05:00
Julien Plu
069b63844c Fix next sentence output (#8466) 2020-11-11 15:41:39 +01:00
Julien Plu
da842e4e72 Add next sentence prediction loss computation (#8462)
* Add next sentence prediction loss computation

* Apply style

* Fix tests

* Add forgotten import

* Add forgotten import

* Use a new parameter

* Remove kwargs and use positional arguments
2020-11-11 15:02:06 +01:00
Julien Plu
23290836c3 Fix TF Longformer (#8460) 2020-11-11 12:54:15 +01:00
Julien Chaumond
8dda9167de [model_cards] harmonization 2020-11-11 12:42:50 +01:00
Pedro
eb3bd73ce3 Bug fix for modeling utilities function: apply_chunking_to_forward, chunking should be in the chunking dimension, an exception was raised if the complete shape of the inputs was not the same rather than only the chunking dimension (#8391)
Co-authored-by: pedro <pe25171@mit.edu>
2020-11-10 21:33:11 +01:00
Patrick von Platen
70708cca1a fix t5 token type ids (#8437) 2020-11-10 14:21:54 -05:00
Lysandre Debut
9fd1f56236 [No merge] TF integration testing (#7621)
* stash

* TF Integration testing for ELECTRA, BERT, Longformer

* Trigger slow tests

* Apply suggestions from code review
2020-11-10 14:02:33 -05:00
Santiago Castro
8fe6629bb4 Add missing tasks to pipeline docstring (#8428) 2020-11-10 13:44:25 -05:00
Stas Bekman
02bdfc0251 using multi_gpu consistently (#8446)
* s|multiple_gpu|multi_gpu|g; s|multigpu|multi_gpu|g'

* doc
2020-11-10 13:23:58 -05:00
Patrick von Platen
b93569457f fix t5 special tokens (#8435) 2020-11-10 18:54:17 +01:00
Julien Plu
cace39af97 Add missing import (#8444)
* Add missing import

* Fix dummy objects
2020-11-10 18:01:32 +01:00
Stas Bekman
e21340da7a [testing utils] get_auto_remove_tmp_dir more intuitive behavior (#8401)
* [testing utils] get_auto_remove_tmp_dir default change

Now that I have been using `get_auto_remove_tmp_dir default change` for a while, I realized that the defaults aren't most optimal.

99% of the time we want the tmp dir to be empty at the beginning of the test - so changing the default to `before=True` - this shouldn't impact any tests since this feature is used only during debug.

* simplify things

* update docs

* fix doc layout

* style

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* better 3-state doc

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* s/tmp/temporary/ + style

* correct the statement

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-10 11:57:21 -05:00
Julien Plu
e7e1549895 Windows dev section in the contributing file (#8436)
* Add a Windows dev section in the contributing file.

* Forgotten link

* Trigger CI

* Rework description

* Trigger CI
2020-11-10 11:19:16 -05:00
Julien Plu
8551a99232 Add auto next sentence prediction (#8432)
* Add auto next sentence prediction

* Fix style

* Add mobilebert next sentence prediction
2020-11-10 11:11:48 -05:00
Sam Shleifer
c314b1fd3b [docs] improve bart/marian/mBART/pegasus docs (#8421) 2020-11-10 10:18:34 -05:00
Sylvain Gugger
3213d3bfae Question template (#8440)
* Remove SO from question template

* Styling
2020-11-10 10:07:56 -05:00
Stas Bekman
5d4972e608 [examples] better PL version check (#8429) 2020-11-10 09:33:23 -05:00
Shichao Sun
ae1cb4ec22 [s2s/distill] hparams.tokenizer_name = hparams.teacher (#8382) 2020-11-10 09:32:01 -05:00
Lysandre
aec51e5696 v3.5.0 documentation 2020-11-10 08:58:47 -05:00
673 changed files with 15801 additions and 6757 deletions

View File

@@ -77,7 +77,7 @@ jobs:
- v0.4-torch_and_tf-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,tf-cpu,torch,testing]
- run: pip install .[sklearn,tf-cpu,torch,testing,sentencepiece]
- save_cache:
key: v0.4-{{ checksum "setup.py" }}
paths:
@@ -103,7 +103,7 @@ jobs:
- v0.4-torch-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing]
- run: pip install .[sklearn,torch,testing,sentencepiece]
- save_cache:
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
@@ -129,7 +129,7 @@ jobs:
- v0.4-tf-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,tf-cpu,testing]
- run: pip install .[sklearn,tf-cpu,testing,sentencepiece]
- save_cache:
key: v0.4-tf-{{ checksum "setup.py" }}
paths:
@@ -155,7 +155,7 @@ jobs:
- v0.4-flax-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: sudo pip install .[flax,sklearn,torch,testing]
- run: sudo pip install .[flax,sklearn,torch,testing,sentencepiece]
- save_cache:
key: v0.4-flax-{{ checksum "setup.py" }}
paths:
@@ -181,7 +181,7 @@ jobs:
- v0.4-torch-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing]
- run: pip install .[sklearn,torch,testing,sentencepiece]
- save_cache:
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
@@ -207,7 +207,7 @@ jobs:
- v0.4-tf-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,tf-cpu,testing]
- run: pip install .[sklearn,tf-cpu,testing,sentencepiece]
- save_cache:
key: v0.4-tf-{{ checksum "setup.py" }}
paths:
@@ -231,7 +231,7 @@ jobs:
- v0.4-custom_tokenizers-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[ja,testing]
- run: pip install .[ja,testing,sentencepiece]
- run: python -m unidic download
- save_cache:
key: v0.4-custom_tokenizers-{{ checksum "setup.py" }}
@@ -258,7 +258,7 @@ jobs:
- v0.4-torch_examples-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing]
- run: pip install .[sklearn,torch,sentencepiece,testing]
- run: pip install -r examples/requirements.txt
- save_cache:
key: v0.4-torch_examples-{{ checksum "setup.py" }}
@@ -281,7 +281,7 @@ jobs:
- v0.4-build_doc-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[tf,torch,sentencepiece,docs]
- run: pip install ."[all, docs]"
- save_cache:
key: v0.4-build_doc-{{ checksum "setup.py" }}
paths:
@@ -303,7 +303,7 @@ jobs:
keys:
- v0.4-deploy_doc-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install .[tf,torch,sentencepiece,docs]
- run: pip install ."[all,docs]"
- save_cache:
key: v0.4-deploy_doc-{{ checksum "setup.py" }}
paths:
@@ -324,7 +324,7 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install isort
- run: pip install .[tf,torch,flax,quality]
- run: pip install .[all,quality]
- save_cache:
key: v0.4-code_quality-{{ checksum "setup.py" }}
paths:

View File

@@ -51,4 +51,5 @@ deploy_doc "7fb8bdf" v3.0.2
deploy_doc "4b3ee9c" v3.1.0
deploy_doc "3ebb1b3" v3.2.0
deploy_doc "0613f05" v3.3.1
deploy_doc "eb0e0ce" # v3.4.0 Latest stable release
deploy_doc "eb0e0ce" v3.4.0
deploy_doc "818878d" # v3.5.1 Latest stable release

View File

@@ -1,6 +1,6 @@
---
name: "❓ Questions & Help"
about: Post your general questions on the Hugging Face forum or Stack Overflow tagged huggingface-transformers
about: Post your general questions on the Hugging Face forum: https://discuss.huggingface.co/
title: ''
labels: ''
assignees: ''
@@ -10,18 +10,17 @@ assignees: ''
# ❓ Questions & Help
<!-- The GitHub issue tracker is primarly intended for bugs, feature requests,
new models and benchmarks, and migration questions. For all other questions,
new models, benchmarks, and migration questions. For all other questions,
we direct you to the Hugging Face forum: https://discuss.huggingface.co/ .
You can also try Stack Overflow (SO) where a whole community of PyTorch and
Tensorflow enthusiast can help you out. In this case, make sure to tag your
question with the right deep learning framework as well as the
huggingface-transformers tag:
https://stackoverflow.com/questions/tagged/huggingface-transformers
-->
## Details
<!-- Description of your issue -->
<!-- You should first ask your question on the forum or SO, and only if
you didn't get an answer ask it here on GitHub. -->
**A link to original question on the forum/Stack Overflow**:
<!-- You should first ask your question on the forum, and only if
you didn't get an answer after a few days ask it here on GitHub. -->
**A link to original question on the forum**:
<!-- Your issue will be closed if you don't fill this part. -->

View File

@@ -20,7 +20,7 @@ Fixes # (issue)
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link
to the it if that's the case.
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
[documentation guidelines](https://github.com/huggingface/transformers/tree/master/docs), and
[here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/master/docs#writing-source-documentation).

View File

@@ -8,6 +8,9 @@ on:
jobs:
torch_hub_integration:
runs-on: ubuntu-latest
env:
# TODO quickfix but may need more investigation
ACTIONS_ALLOW_UNSECURE_COMMANDS: True
steps:
# no checkout necessary here.
- name: Extract branch name

View File

@@ -4,17 +4,19 @@ on:
push:
branches:
- master
- model-templates
paths:
- "src/**"
- "tests/**"
- ".github/**"
- "templates/**"
# pull_request:
repository_dispatch:
jobs:
run_tests_torch_gpu:
runs-on: [self-hosted, single-gpu]
runs-on: [self-hosted, gpu, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
@@ -46,7 +48,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime]
pip install .[torch,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
@@ -55,6 +57,14 @@ jobs:
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Create model files
run: |
source .env/bin/activate
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/encoder-bert-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/pt-encoder-bert-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/standalone.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/tf-encoder-bert-tokenizer.json --path=templates/adding_a_new_model
- name: Run all non-slow tests on GPU
env:
OMP_NUM_THREADS: 1
@@ -76,7 +86,7 @@ jobs:
run_tests_tf_gpu:
runs-on: [self-hosted, single-gpu]
runs-on: [self-hosted, gpu, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
@@ -107,7 +117,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install .[tf,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
@@ -116,6 +126,14 @@ jobs:
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('TF GPUs available:', bool(tf.config.list_physical_devices('GPU')))"
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('Number of TF GPUs available:', len(tf.config.list_physical_devices('GPU')))"
- name: Create model files
run: |
source .env/bin/activate
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/encoder-bert-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/pt-encoder-bert-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/standalone.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/tf-encoder-bert-tokenizer.json --path=templates/adding_a_new_model
- name: Run all non-slow tests on GPU
env:
OMP_NUM_THREADS: 1
@@ -135,8 +153,8 @@ jobs:
name: run_all_tests_tf_gpu_test_reports
path: reports
run_tests_torch_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
run_tests_torch_multi_gpu:
runs-on: [self-hosted, gpu, multi-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
@@ -154,7 +172,7 @@ jobs:
id: cache
with:
path: .env
key: v1.1-tests_torch_multiple_gpu-${{ hashFiles('setup.py') }}
key: v1.1-tests_torch_multi_gpu-${{ hashFiles('setup.py') }}
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
@@ -167,7 +185,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime]
pip install .[torch,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
@@ -181,11 +199,11 @@ jobs:
OMP_NUM_THREADS: 1
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_torch_multiple_gpu tests
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_torch_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_multiple_gpu_failures_short.txt
run: cat reports/tests_torch_multi_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
@@ -194,8 +212,8 @@ jobs:
name: run_all_tests_torch_multi_gpu_test_reports
path: reports
run_tests_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
run_tests_tf_multi_gpu:
runs-on: [self-hosted, gpu, multi-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
@@ -213,7 +231,7 @@ jobs:
id: cache
with:
path: .env
key: v1.1-tests_tf_multiple_gpu-${{ hashFiles('setup.py') }}
key: v1.1-tests_tf_multi_gpu-${{ hashFiles('setup.py') }}
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
@@ -226,7 +244,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install .[tf,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
@@ -240,11 +258,11 @@ jobs:
OMP_NUM_THREADS: 1
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_tf_multiple_gpu tests
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_tf_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_multiple_gpu_failures_short.txt
run: cat reports/tests_tf_multi_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}

View File

@@ -15,7 +15,7 @@ on:
jobs:
run_all_tests_torch_gpu:
runs-on: [self-hosted, single-gpu]
runs-on: [self-hosted, gpu, single-gpu]
steps:
- uses: actions/checkout@v2
@@ -49,7 +49,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime]
pip install .[torch,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
pip list
@@ -109,7 +109,7 @@ jobs:
run_all_tests_tf_gpu:
runs-on: [self-hosted, single-gpu]
runs-on: [self-hosted, gpu, single-gpu]
steps:
- uses: actions/checkout@v2
@@ -143,7 +143,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install .[tf,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
pip list
@@ -187,8 +187,8 @@ jobs:
name: run_all_tests_tf_gpu_test_reports
path: reports
run_all_tests_torch_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
run_all_tests_torch_multi_gpu:
runs-on: [self-hosted, gpu, multi-gpu]
steps:
- uses: actions/checkout@v2
@@ -222,7 +222,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime]
pip install .[torch,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
pip list
@@ -238,11 +238,11 @@ jobs:
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_torch_multiple_gpu tests
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_torch_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_multiple_gpu_failures_short.txt
run: cat reports/tests_torch_multi_gpu_failures_short.txt
- name: Run examples tests on multi-GPU
env:
@@ -250,11 +250,11 @@ jobs:
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=examples_torch_multiple_gpu examples
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_torch_examples_multi_gpu examples
- name: Failure short reports
if: ${{ always() }}
run: cat reports/examples_torch_multiple_gpu_failures_short.txt
run: cat reports/tests_torch_examples_multi_gpu_failures_short.txt
- name: Run all pipeline tests on multi-GPU
if: ${{ always() }}
@@ -265,11 +265,11 @@ jobs:
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_torch_pipeline_multiple_gpu tests
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_torch_pipeline_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_pipeline_multiple_gpu_failures_short.txt
run: cat reports/tests_torch_pipeline_multi_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
@@ -278,8 +278,8 @@ jobs:
name: run_all_tests_torch_multi_gpu_test_reports
path: reports
run_all_tests_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
run_all_tests_tf_multi_gpu:
runs-on: [self-hosted, gpu, multi-gpu]
steps:
- uses: actions/checkout@v2
@@ -313,7 +313,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install .[tf,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
pip list
@@ -329,11 +329,11 @@ jobs:
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_tf_multiple_gpu tests
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_tf_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_multiple_gpu_failures_short.txt
run: cat reports/tests_tf_multi_gpu_failures_short.txt
- name: Run all pipeline tests on multi-GPU
if: ${{ always() }}
@@ -344,11 +344,11 @@ jobs:
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_tf_pipelines_multiple_gpu tests
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_tf_pipeline_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_multiple_gpu_pipelines_failures_short.txt
run: cat reports/tests_tf_pipeline_multi_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}

1
.gitignore vendored
View File

@@ -133,7 +133,6 @@ dmypy.json
tensorflow_code
# Models
models
proc_data
# examples

View File

@@ -308,3 +308,12 @@ Check our [documentation writing guide](https://github.com/huggingface/transform
for more information.
#### This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md)
### Develop on Windows
One way one can run the make command on Window is to pass by MSYS2:
1. [Download MSYS2](https://www.msys2.org/), we assume to have it installed in C:\msys64
2. Open the command line C:\msys64\msys2.exe (it should be available from the start menu)
3. Run in the shell: `pacman -Syu` and install make with `pacman -S make`

View File

@@ -181,6 +181,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MT5](https://huggingface.co/transformers/model_doc/mt5.html)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
@@ -196,6 +197,8 @@ ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/
1. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
To cehck if each model has an implementation in PyTorch/TensorFlow/Flax or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to [this table](https://huggingface.co/transformers/index.html#bigtable)
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations. You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

View File

@@ -2,6 +2,15 @@
/* Colab dropdown */
table.center-aligned-table td {
text-align: center;
}
table.center-aligned-table th {
text-align: center;
vertical-align: middle;
}
.colab-dropdown {
position: relative;
display: inline-block;

View File

@@ -1,10 +1,11 @@
// These two things need to be updated at each release for the version selector.
// Last stable version
const stableVersion = "v3.4.0"
const stableVersion = "v3.5.0"
// Dictionary doc folder to label
const versionMapping = {
"master": "master",
"": "v3.4.0",
"": "v3.5.0/v3.5.1",
"v3.4.0": "v3.4.0",
"v3.3.1": "v3.3.0/v3.3.1",
"v3.2.0": "v3.2.0",
"v3.1.0": "v3.1.0 (stable)",

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'3.5.0'
release = u'4.0.0'
# -- General configuration ---------------------------------------------------

View File

@@ -35,6 +35,8 @@ Choose the right framework for every part of a model's lifetime:
- Move a single model between TF2.0/PyTorch frameworks at will
- Seamlessly pick the right framework for training, evaluation, production
Experimental support for Flax with a few models right now, expected to grow in the coming months.
Contents
-----------------------------------------------------------------------------------------------------------------------
@@ -44,7 +46,7 @@ The documentation is organized in five parts:
and a glossary.
- **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
transformers model
- The three last section contain the documentation of each public class and function, grouped in:
@@ -52,8 +54,8 @@ The documentation is organized in five parts:
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
The library currently contains PyTorch and Tensorflow implementations, pre-trained model weights, usage scripts and
conversion utilities for the following models:
The library currently contains PyTorch, Tensorflow and Flax implementations, pretrained model weights, usage scripts
and conversion utilities for the following models:
..
This list is updated automatically from the README with `make fix-copies`. Do not update manually!
@@ -126,43 +128,137 @@ conversion utilities for the following models:
21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
22. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
23. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__> by Jingqing Zhang, Yao Zhao,
Mohammad Saleh and Peter J. Liu.
23. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
24. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
24. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
25. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
25. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
26. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. ultilingual BERT into `DistilmBERT
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German version of
DistilBERT.
26. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
27. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
Krishna, and Kurt W. Keutzer.
27. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
28. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
28. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
29. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
29. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
30. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
30. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
31. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
31. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
32. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
32. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
33. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
33. `Other community models <https://huggingface.co/models>`__, contributed by the `community
34. `Other community models <https://huggingface.co/models>`__, contributed by the `community
<https://huggingface.co/users>`__.
.. _bigtable:
The table below represents the current support in the library for each of those models, whether they have a Python
tokenizer (called "slow"). A "fast" tokenizer backed by the 🤗 Tokenizers library, whether they have support in PyTorch,
TensorFlow and/or Flax.
..
This table is updated automatically from the auto modules with `make fix-copies`. Do not update manually!
.. rst-class:: center-aligned-table
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
+=============================+================+================+=================+====================+==============+
| ALBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BART | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Bert Generation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Blenderbot | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ELECTRA | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FairSeq Machine-Translation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLM | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Longformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Marian | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Pegasus | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RAG | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Reformer | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| T5 | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM-RoBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLMProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLNet | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| mBART | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| mT5 | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
.. toctree::
:maxdepth: 2
:caption: Get started
@@ -248,6 +344,7 @@ conversion utilities for the following models:
model_doc/marian
model_doc/mbart
model_doc/mobilebert
model_doc/mt5
model_doc/gpt
model_doc/gpt2
model_doc/pegasus

View File

@@ -70,15 +70,15 @@ to check 🤗 Transformers is properly installed.
This library provides pretrained models that will be downloaded and cached locally. Unless you specify a location with
`cache_dir=...` when you use methods like `from_pretrained`, these models will automatically be downloaded in the
folder given by the shell environment variable ``TRANSFORMERS_CACHE``. The default value for it will be the PyTorch
cache home followed by ``/transformers/`` (even if you don't have PyTorch installed). This is (by order of priority):
folder given by the shell environment variable ``TRANSFORMERS_CACHE``. The default value for it will be the Hugging
Face cache home followed by ``/transformers/``. This is (by order of priority):
* shell environment variable ``TORCH_HOME``
* shell environment variable ``XDG_CACHE_HOME`` + ``/torch/``
* default: ``~/.cache/torch/``
* shell environment variable ``HF_HOME``
* shell environment variable ``XDG_CACHE_HOME`` + ``/huggingface/``
* default: ``~/.cache/huggingface/``
So if you don't have any specific environment variable set, the cache directory will be at
``~/.cache/torch/transformers/``.
``~/.cache/huggingface/transformers/``.
**Note:** If you have set a shell environment variable for one of the predecessors of this library
(``PYTORCH_TRANSFORMERS_CACHE`` or ``PYTORCH_PRETRAINED_BERT_CACHE``), those will be used if there is no shell
@@ -97,6 +97,6 @@ You should check out our [swift-coreml-transformers](https://github.com/huggingf
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
`DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch or
At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
TensorFlow 2.0 to productizing them in CoreML, or prototype a model or an app in CoreML then research its
hyperparameters or architecture from PyTorch or TensorFlow 2.0. Super exciting!

View File

@@ -35,7 +35,7 @@ Here is an example of how to customize :class:`~transformers.Trainer` using a cu
class MyTrainer(Trainer):
def compute_loss(self, model, inputs):
labels = inputs.pop("labels")
outputs = models(**inputs)
outputs = model(**inputs)
logits = outputs[0]
return my_custom_loss(logits, labels)

View File

@@ -1,5 +1,170 @@
# Migrating from previous packages
## Migrating from transformers `v3.x` to `v4.x`
A couple of changes were introduced when the switch from version 3 to version 4 was done. Below is a summary of the
expected changes:
#### 1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.
The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set.
This introduces two breaking changes:
- The handling of overflowing tokens between the python and rust tokenizers is different.
- The rust tokenizers do not accept integers in the encoding methods.
##### How to obtain the same behavior as v3.x in v4.x
- The pipelines now contain additional features out of the box. See the [token-classification pipeline with the `grouped_entities` flag](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=textclassification#tokenclassificationpipeline).
- The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the `use_fast` flag by setting it to `False`:
In version `v3.x`:
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
to obtain the same in version `v4.x`:
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
```
#### 2. SentencePiece is removed from the required dependencies
The requirement on the SentencePiece dependency has been lifted from the `setup.py`. This is done so that we may have a channel on anaconda cloud without relying on `conda-forge`. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard `transformers` installation.
This includes the **slow** versions of:
- `XLNetTokenizer`
- `AlbertTokenizer`
- `CamembertTokenizer`
- `MBartTokenizer`
- `PegasusTokenizer`
- `T5Tokenizer`
- `ReformerTokenizer`
- `XLMRobertaTokenizer`
##### How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version `v3.x`, you should install `sentencepiece` additionally:
In version `v3.x`:
```bash
pip install transformers
```
to obtain the same in version `v4.x`:
```bash
pip install transformers[sentencepiece]
```
or
```bash
pip install transformers sentencepiece
```
#### 3. The architecture of the repo has been updated so that each model resides in its folder
The past and foreseeable addition of new models means that the number of files in the directory `src/transformers` keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.
This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path.
##### How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version `v3.x`, you should update the path used to access the layers.
In version `v3.x`:
```bash
from transformers.modeling_bert import BertLayer
```
to obtain the same in version `v4.x`:
```bash
from transformers.models.bert.modeling_bert import BertLayer
```
#### 4. Switching the `return_dict` argument to `True` by default
The [`return_dict` argument](https://huggingface.co/transformers/main_classes/output.html) enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented as keys can be used to retrieve values, while also behaving as a tuple as users may retrieve objects by index or by slice.
This is a breaking change as the limitation of that tuple is that it cannot be unpacked: `value0, value1 = outputs` will not work.
##### How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version `v3.x`, you should specify the `return_dict` argument to `False`, either in the model configuration or during the forward pass.
In version `v3.x`:
```bash
model = BertModel.from_pretrained("bert-base-cased")
outputs = model(**inputs)
```
to obtain the same in version `v4.x`:
```bash
model = BertModel.from_pretrained("bert-base-cased")
outputs = model(**inputs, return_dict=False)
```
or
```bash
model = BertModel.from_pretrained("bert-base-cased", return_dict=False)
outputs = model(**inputs)
```
#### 5. Removed some deprecated attributes
Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in [#8604](https://github.com/huggingface/transformers/pull/8604).
Here is a list of these attributes/methods/arguments and what their replacements should be:
In several models, the labels become consistent with the other models:
- `masked_lm_labels` becomes `labels` in `AlbertForMaskedLM` and `AlbertForPreTraining`.
- `masked_lm_labels` becomes `labels` in `BertForMaskedLM` and `BertForPreTraining`.
- `masked_lm_labels` becomes `labels` in `DistilBertForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `ElectraForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `LongformerForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `MobileBertForMaskedLM`.
- `masked_lm_labels` becomes `labels` in `RobertaForMaskedLM`.
- `lm_labels` becomes `labels` in `BartForConditionalGeneration`.
- `lm_labels` becomes `labels` in `GPT2DoubleHeadsModel`.
- `lm_labels` becomes `labels` in `OpenAIGPTDoubleHeadsModel`.
- `lm_labels` becomes `labels` in `T5ForConditionalGeneration`.
In several models, the caching mechanism becomes consistent with the other models:
- `decoder_cached_states` becomes `past_key_values` in all BART-like, FSMT and T5 models.
- `decoder_past_key_values` becomes `past_key_values` in all BART-like, FSMT and T5 models.
- `past` becomes `past_key_values` in all CTRL models.
- `past` becomes `past_key_values` in all GPT-2 models.
Regarding the tokenizer classes:
- The tokenizer attribute `max_len` becomes `model_max_length`.
- The tokenizer attribute `return_lengths` becomes `return_length`.
- The tokenizer encoding argument `is_pretokenized` becomes `is_split_into_words`.
Regarding the `Trainer` class:
- The `Trainer` argument `tb_writer` is removed in favor of the callback `TensorBoardCallback(tb_writer=...)`.
- The `Trainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`.
- The `Trainer` attribute `data_collator` should be a callable.
- The `Trainer` method `_log` is deprecated in favor of `log`.
- The `Trainer` method `_training_step` is deprecated in favor of `training_step`.
- The `Trainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`.
- The `Trainer` method `is_local_master` is deprecated in favor of `is_local_process_zero`.
- The `Trainer` method `is_world_master` is deprecated in favor of `is_world_process_zero`.
Regarding the `TFTrainer` class:
- The `TFTrainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`.
- The `Trainer` method `_log` is deprecated in favor of `log`.
- The `TFTrainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`.
- The `TFTrainer` method `_setup_wandb` is deprecated in favor of `setup_wandb`.
- The `TFTrainer` method `_run_model` is deprecated in favor of `run_model`.
Regarding the `TrainerArgument` class:
- The `TrainerArgument` argument `evaluate_during_training` is deprecated in favor of `evaluation_strategy`.
Regarding the Transfo-XL model:
- The Transfo-XL configuration attribute `tie_weight` becomes `tie_words_embeddings`.
- The Transfo-XL modeling method `reset_length` becomes `reset_memory_length`.
Regarding pipelines:
- The `FillMaskPipeline` argument `topk` becomes `top_k`.
## Migrating from pytorch-transformers to 🤗 Transformers
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to 🤗 Transformers.

View File

@@ -51,10 +51,10 @@ AlbertTokenizer
Albert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_albert.AlbertForPreTrainingOutput
.. autoclass:: transformers.models.albert.modeling_albert.AlbertForPreTrainingOutput
:members:
.. autoclass:: transformers.modeling_tf_albert.TFAlbertForPreTrainingOutput
.. autoclass:: transformers.models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
:members:

View File

@@ -34,6 +34,8 @@ ________________________________________________________________________________
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
Implementation Notes
@@ -42,16 +44,33 @@ Implementation Notes
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
:meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
:func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the
string you pass to :func:`fairseq.encode` starts with a space.
:func:`transformers.models.bart.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is
different than some other modeling APIs.
- Model predictions are intended to be identical to the original implementation when
:obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
:func:`fairseq.encode` starts with a space.
- :meth:`~transformers.BartForConditionalGeneration.generate` should be used for conditional generation tasks like
summarization, see the example in that docstrings.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.
.. code-block::
from transformers import BartForConditionalGeneration, BartTokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", force_bos_token_to_be_generated=True)
tok = BartTokenizer.from_pretrained("facebook/bart-large")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
batch = tok(example_english_phrase, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
BartConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -74,7 +93,7 @@ BartModel
.. autoclass:: transformers.BartModel
:members: forward
.. autofunction:: transformers.modeling_bart._prepare_bart_decoder_inputs
.. autofunction:: transformers.models.bart.modeling_bart._prepare_bart_decoder_inputs
BartForConditionalGeneration

View File

@@ -57,10 +57,10 @@ BertTokenizerFast
Bert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_bert.BertForPreTrainingOutput
.. autoclass:: transformers.models.bert.modeling_bert.BertForPreTrainingOutput
:members:
.. autoclass:: transformers.modeling_tf_bert.TFBertForPreTrainingOutput
.. autoclass:: transformers.models.bert.modeling_tf_bert.TFBertForPreTrainingOutput
:members:
@@ -188,3 +188,10 @@ TFBertForQuestionAnswering
.. autoclass:: transformers.TFBertForQuestionAnswering
:members: call
FlaxBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxBertModel
:members: __call__

View File

@@ -10,7 +10,7 @@ Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Ali
The abstract from the paper is the following:
*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
@@ -40,7 +40,7 @@ Usage:
labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
# train...
loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels, return_dict=True).loss
loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
loss.backward()

View File

@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
position, respectively, and the attention weights among words are computed using disentangled matrices on their
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*

View File

@@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
study.*

View File

@@ -5,7 +5,7 @@ Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
The abstract from the paper is the following:
@@ -71,13 +71,13 @@ DPRReaderTokenizerFast
DPR specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_dpr.DPRContextEncoderOutput
.. autoclass:: transformers.models.dpr.modeling_dpr.DPRContextEncoderOutput
:members:
.. autoclass:: transformers.modeling_dpr.DPRQuestionEncoderOutput
.. autoclass:: transformers.models.dpr.modeling_dpr.DPRQuestionEncoderOutput
:members:
.. autoclass:: transformers.modeling_dpr.DPRReaderOutput
.. autoclass:: transformers.models.dpr.modeling_dpr.DPRReaderOutput
:members:
@@ -99,3 +99,22 @@ DPRReader
.. autoclass:: transformers.DPRReader
:members: forward
TFDPRContextEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDPRContextEncoder
:members: call
TFDPRQuestionEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDPRQuestionEncoder
:members: call
TFDPRReader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDPRReader
:members: call

View File

@@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.
The abstract from the paper is the following:
*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
and then train a model to reconstruct the original tokens. While they produce good results when transferred to
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
@@ -69,10 +69,10 @@ ElectraTokenizerFast
Electra specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_electra.ElectraForPreTrainingOutput
.. autoclass:: transformers.models.electra.modeling_electra.ElectraForPreTrainingOutput
:members:
.. autoclass:: transformers.modeling_tf_electra.TFElectraForPreTrainingOutput
.. autoclass:: transformers.models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput
:members:

View File

@@ -19,7 +19,7 @@ representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018;
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*

View File

@@ -65,10 +65,10 @@ FunnelTokenizerFast
Funnel specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_funnel.FunnelForPreTrainingOutput
.. autoclass:: transformers.models.funnel.modeling_funnel.FunnelForPreTrainingOutput
:members:
.. autoclass:: transformers.modeling_tf_funnel.TFFunnelForPreTrainingOutput
.. autoclass:: transformers.models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput
:members:

View File

@@ -14,7 +14,7 @@ The abstract from the paper is the following:
*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
@@ -72,10 +72,10 @@ OpenAIGPTTokenizerFast
OpenAI specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
.. autoclass:: transformers.models.openai.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
:members:
.. autoclass:: transformers.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
.. autoclass:: transformers.models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
:members:

View File

@@ -60,10 +60,10 @@ GPT2TokenizerFast
GPT2 specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_gpt2.GPT2DoubleHeadsModelOutput
.. autoclass:: transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
:members:
.. autoclass:: transformers.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
.. autoclass:: transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
:members:

View File

@@ -6,19 +6,19 @@ Overview
The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding.
The abstract from the paper is the following:
*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
classification (from 93.07 to 94.42).*

View File

@@ -93,29 +93,47 @@ LongformerTokenizerFast
Longformer specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_longformer.LongformerBaseModelOutput
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutput
:members:
.. autoclass:: transformers.modeling_longformer.LongformerBaseModelOutputWithPooling
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling
:members:
.. autoclass:: transformers.modeling_longformer.LongformerMultipleChoiceModelOutput
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMaskedLMOutput
:members:
.. autoclass:: transformers.modeling_longformer.LongformerQuestionAnsweringModelOutput
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput
:members:
.. autoclass:: transformers.modeling_tf_longformer.TFLongformerBaseModelOutput
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerSequenceClassifierOutput
:members:
.. autoclass:: transformers.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput
:members:
.. autoclass:: transformers.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerTokenClassifierOutput
:members:
LongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
:members:
LongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -179,3 +197,24 @@ TFLongformerForQuestionAnswering
.. autoclass:: transformers.TFLongformerForQuestionAnswering
:members: call
TFLongformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForSequenceClassification
:members: call
TFLongformerForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForTokenClassification
:members: call
TFLongformerForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForMultipleChoice
:members: call

View File

@@ -19,7 +19,7 @@ Encoder Representations from Transformers) framework to learn these vision-and-l
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
@@ -67,19 +67,19 @@ LxmertTokenizerFast
Lxmert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_lxmert.LxmertModelOutput
.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertModelOutput
:members:
.. autoclass:: transformers.modeling_lxmert.LxmertForPreTrainingOutput
.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
:members:
.. autoclass:: transformers.modeling_lxmert.LxmertForQuestionAnsweringOutput
.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
:members:
.. autoclass:: transformers.modeling_tf_lxmert.TFLxmertModelOutput
.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
:members:
.. autoclass:: transformers.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
:members:

View File

@@ -5,7 +5,7 @@ MarianMT
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @patrickvonplaten.
Translations should be similar, but not identical to, output in the test set linked to in each model card.
Translations should be similar, but not identical to output in the test set linked to in each model card.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -35,24 +35,109 @@ Naming
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
group use a combination of ISO-639-5 codes and ISO-639-2 codes.
Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests.
- `Fine-tune on TPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh>`__
- `Fine-tune on GPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh>`__
- `Fine-tune on GPU with pytorch-lightning
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/distil_marian_no_teacher.sh>`__
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If a model can output multiple languages, and you should specify a language code by prepending the desired output
language to the :obj:`src_text`.
- You can see a models's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa
<https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
- Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language
codes are required.
- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
looking at the model card, or the Group Members `mapping
<https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
- If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
prepending the desired output language to the :obj:`src_text`.
- You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``
Example of translating english to many romance languages, using language codes:
New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__
require 3 character language codes:
.. code-block:: python
from transformers import MarianMTModel, MarianTokenizer
src_text = [
'>>fra<< this is a sentence in english that we want to translate to french',
'>>por<< This should go to portuguese',
'>>esp<< And this to Spanish'
]
model_name = 'Helsinki-NLP/opus-mt-en-roa'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
# ["c'est une phrase en anglais que nous voulons traduire en français",
# 'Isto deve ir para o português.',
# 'Y esto al español']
Code to see available pretrained models:
.. code-block:: python
from transformers.hf_api import HfApi
model_list = HfApi().model_list()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
Old Style Multi-Lingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These are the old style multi-lingual models ported from the OPUS-MT-Train repo: and the members of each language
group:
.. code-block:: python
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
'Helsinki-NLP/opus-mt-ROMANCE-en',
'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
'Helsinki-NLP/opus-mt-de-ZH',
'Helsinki-NLP/opus-mt-en-CELTIC',
'Helsinki-NLP/opus-mt-en-ROMANCE',
'Helsinki-NLP/opus-mt-es-NORWAY',
'Helsinki-NLP/opus-mt-fi-NORWAY',
'Helsinki-NLP/opus-mt-fi-ZH',
'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
'Helsinki-NLP/opus-mt-sv-NORWAY',
'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
Example of translating english to many romance languages, using old-style 2 character language codes
.. code-block::python
from transformers import MarianMTModel, MarianTokenizer
src_text = [
'>>fr<< this is a sentence in english that we want to translate to french',
@@ -63,52 +148,12 @@ Example of translating english to many romance languages, using language codes:
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text))
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
# ["c'est une phrase en anglais que nous voulons traduire en français",
# 'Isto deve ir para o português.',
# 'Y esto al español']
# ["c'est une phrase en anglais que nous voulons traduire en français", 'Isto deve ir para o português.', 'Y esto al español']
Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a
separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language
codes.
There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina), that
do not seem to change translations. I have not found these to provide different results than just using :obj:`>>es<<`.
For example:
- `Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see `mapping
<https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special
language code like :obj:`>>de<<` to specify output language.
- `Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many romance languages to english, no codes needed since there
is only one target language.
.. code-block:: python
GROUP_MEMBERS = {
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
Code to see available pretrained models:
.. code-block:: python
from transformers.hf_api import HfApi
model_list = HfApi().model_list()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
MarianConfig

View File

@@ -13,12 +13,19 @@ The MBart model was presented in `Multilingual Denoising Pre-training for Neural
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
on the encoder, decoder, or reconstructing parts of the text.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning mBART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- Given the large embeddings table, mBART consumes a large amount of GPU RAM, especially for fine-tuning.
:class:`MarianMTModel` is usually a better choice for bilingual machine translation.
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -37,12 +44,8 @@ the sequences for sequence-to-sequence fine-tuning.
example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels) #forward
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian, return_tensors="pt")
model(input_ids=batch['input_ids'], labels=batch['labels']) # forward pass
- Generation
@@ -55,7 +58,7 @@ the sequences for sequence-to-sequence fine-tuning.
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
article = "UN Chief Says There Is No Military Solution in Syria"
batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], src_lang="en_XX")
batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], src_lang="en_XX", return_tensors="pt")
translated_tokens = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
assert translation == "Şeful ONU declară că nu există o soluţie militară în Siria"

View File

@@ -58,10 +58,10 @@ MobileBertTokenizerFast
MobileBert specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_mobilebert.MobileBertForPreTrainingOutput
.. autoclass:: transformers.models.mobilebert.modeling_mobilebert.MobileBertForPreTrainingOutput
:members:
.. autoclass:: transformers.modeling_tf_mobilebert.TFMobileBertForPreTrainingOutput
.. autoclass:: transformers.models.mobilebert.modeling_tf_mobilebert.TFMobileBertForPreTrainingOutput
:members:

View File

@@ -0,0 +1,53 @@
MT5
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The mT5 model was presented in `mT5: A massively multilingual pre-trained text-to-text transformer
<https://arxiv.org/abs/2010.11934>`_ by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya
Siddhant, Aditya Barua, Colin Raffel.
The abstract from the paper is the following:
*The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain
state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a
multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe
the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual
benchmarks. All of the code and model checkpoints*
The original code can be found `here <https://github.com/google-research/multilingual-t5>`__.
MT5Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MT5Config
:members:
MT5Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MT5Model
:members:
MT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MT5ForConditionalGeneration
:members:
TFMT5Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMT5Model
:members:
TFMT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMT5ForConditionalGeneration
:members:

View File

@@ -31,10 +31,19 @@ All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-
- Each checkpoint is 2.2 GB on disk and 568M parameters.
- FP16 is not supported (help/ideas on this appreciated!).
- Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU.
- For XSUM, The paper reports rouge1,rouge2, rougeL of paper: 47.21/24.56/39.25. As of Aug 9, this port scores
46.91/24.34/39.1.
- Full replication results and correctly pre-processed data can be found in this `Issue
<https://github.com/huggingface/transformers/issues/6844#issue-689259666>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distill-pegasus>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
The gap is likely because of different alpha/length_penalty implementations in beam search.
Examples
_______________________________________________________________________________________________________________________
- `Script <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh>`__ to
fine-tune pegasus on the XSUM dataset. Data download instructions at `examples/seq2seq/
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
Implementation Notes
@@ -45,7 +54,7 @@ Implementation Notes
- Some key configuration differences:
- static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj`PegasusConfig.normalize_embedding=False`)
- no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
- more beams are used (:obj:`num_beams=8`)
- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
@@ -69,7 +78,7 @@ Usage Example
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest', return_tensors="pt").to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."

View File

@@ -17,7 +17,7 @@ the next token.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -25,7 +25,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
@@ -47,16 +47,16 @@ ProphetNetTokenizer
ProphetNet specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_prophetnet.ProphetNetSeq2SeqLMOutput
.. autoclass:: transformers.models.prophetnet.modeling_prophetnet.ProphetNetSeq2SeqLMOutput
:members:
.. autoclass:: transformers.modeling_prophetnet.ProphetNetSeq2SeqModelOutput
.. autoclass:: transformers.models.prophetnet.modeling_prophetnet.ProphetNetSeq2SeqModelOutput
:members:
.. autoclass:: transformers.modeling_prophetnet.ProphetNetDecoderModelOutput
.. autoclass:: transformers.models.prophetnet.modeling_prophetnet.ProphetNetDecoderModelOutput
:members:
.. autoclass:: transformers.modeling_prophetnet.ProphetNetDecoderLMOutput
.. autoclass:: transformers.models.prophetnet.modeling_prophetnet.ProphetNetDecoderLMOutput
:members:
ProphetNetModel

View File

@@ -50,10 +50,10 @@ RagTokenizer
Rag specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
.. autoclass:: transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput
:members:
.. autoclass:: transformers.modeling_rag.RetrievAugLMOutput
.. autoclass:: transformers.models.rag.modeling_rag.RetrievAugLMOutput
:members:
RagRetriever

View File

@@ -146,3 +146,10 @@ TFRobertaForQuestionAnswering
.. autoclass:: transformers.TFRobertaForQuestionAnswering
:members: call
FlaxRobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxRobertaModel
:members: __call__

View File

@@ -17,7 +17,7 @@ The abstract from the paper is the following:
task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer
approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
summarization, question answering, text classification, and more. To facilitate future work on transfer learning for
@@ -64,7 +64,7 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels, return_dict=True).loss
loss = model(input_ids=input_ids, labels=labels).loss
- Supervised training
@@ -77,7 +77,7 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels, return_dict=True).loss
loss = model(input_ids=input_ids, labels=labels).loss
T5Config

View File

@@ -49,16 +49,16 @@ TransfoXLTokenizer
TransfoXL specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_transfo_xl.TransfoXLModelOutput
.. autoclass:: transformers.models.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput
:members:
.. autoclass:: transformers.modeling_transfo_xl.TransfoXLLMHeadModelOutput
.. autoclass:: transformers.models.transfo_xl.modeling_transfo_xl.TransfoXLLMHeadModelOutput
:members:
.. autoclass:: transformers.modeling_tf_transfo_xl.TFTransfoXLModelOutput
.. autoclass:: transformers.models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLModelOutput
:members:
.. autoclass:: transformers.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput
.. autoclass:: transformers.models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput
:members:

View File

@@ -50,7 +50,7 @@ XLMTokenizer
XLM specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_xlm.XLMForQuestionAnsweringOutput
.. autoclass:: transformers.models.xlm.modeling_xlm.XLMForQuestionAnsweringOutput
:members:

View File

@@ -19,7 +19,7 @@ just the next token. Its architecture is identical to ProhpetNet, but the model
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -27,7 +27,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.

View File

@@ -53,43 +53,43 @@ XLNetTokenizer
XLNet specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_xlnet.XLNetModelOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetModelOutput
:members:
.. autoclass:: transformers.modeling_xlnet.XLNetLMHeadModelOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetLMHeadModelOutput
:members:
.. autoclass:: transformers.modeling_xlnet.XLNetForSequenceClassificationOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetForSequenceClassificationOutput
:members:
.. autoclass:: transformers.modeling_xlnet.XLNetForMultipleChoiceOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetForMultipleChoiceOutput
:members:
.. autoclass:: transformers.modeling_xlnet.XLNetForTokenClassificationOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetForTokenClassificationOutput
:members:
.. autoclass:: transformers.modeling_xlnet.XLNetForQuestionAnsweringSimpleOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetForQuestionAnsweringSimpleOutput
:members:
.. autoclass:: transformers.modeling_xlnet.XLNetForQuestionAnsweringOutput
.. autoclass:: transformers.models.xlnet.modeling_xlnet.XLNetForQuestionAnsweringOutput
:members:
.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetModelOutput
.. autoclass:: transformers.models.xlnet.modeling_tf_xlnet.TFXLNetModelOutput
:members:
.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetLMHeadModelOutput
.. autoclass:: transformers.models.xlnet.modeling_tf_xlnet.TFXLNetLMHeadModelOutput
:members:
.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForSequenceClassificationOutput
.. autoclass:: transformers.models.xlnet.modeling_tf_xlnet.TFXLNetForSequenceClassificationOutput
:members:
.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForMultipleChoiceOutput
.. autoclass:: transformers.models.xlnet.modeling_tf_xlnet.TFXLNetForMultipleChoiceOutput
:members:
.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForTokenClassificationOutput
.. autoclass:: transformers.models.xlnet.modeling_tf_xlnet.TFXLNetForTokenClassificationOutput
:members:
.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForQuestionAnsweringSimpleOutput
.. autoclass:: transformers.models.xlnet.modeling_tf_xlnet.TFXLNetForQuestionAnsweringSimpleOutput
:members:

View File

@@ -37,7 +37,7 @@ For instance:
.. code-block::
>>> tokenizer = AutoTokenizer.from_pretrained(
>>> model = AutoModel.from_pretrained(
>>> "julien-c/EsperBERTo-small",
>>> revision="v2.0.1" # tag name, or branch name, or commit hash
>>> )
@@ -46,31 +46,42 @@ Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to upload a model, you'll need to first create a git repo. This repo will live on the model hub, allowing
users to clone it and you (and your organization members) to push to it. First, you should ensure you are logged in the
``transformers-cli``:
users to clone it and you (and your organization members) to push to it.
Go in a terminal and run the following command. It should be in the virtual environment where you installed 🤗
You can create a model repo directly from the website, `here <https://huggingface.co/new>`.
Alternatively, you can use the ``transformers-cli``. The next steps describe that process:
Go to a terminal and run the following command. It should be in the virtual environment where you installed 🤗
Transformers, since that command :obj:`transformers-cli` comes from the library.
.. code-block::
.. code-block:: bash
transformers-cli login
Once you are logged in with your model hub credentials, you can start building your repositories. To create a repo:
.. code-block::
.. code-block:: bash
transformers-cli repo create your-model-name
This creates a repo on the model hub, which can be cloned. You can then add/remove from that repo as you would with any
other git repo.
This creates a repo on the model hub, which can be cloned.
.. code-block::
.. code-block:: bash
git clone https://huggingface.co/username/your-model-name
# Then commit as usual
# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install
When you have your local clone of your repo and lfs installed, you can then add/remove from that clone as you would
with any other git repo.
.. code-block:: bash
# Commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
@@ -159,24 +170,25 @@ Or, if you're using the Trainer API
.. code-block::
>>> trainer.save_model("path/to/awesome-name-you-picked")
>>> tokenizer.save_pretrained("path/to/repo/clone/your-model-name")
You can then add these files to the staging environment and verify that they have been correctly staged with the ``git
status`` command:
.. code-block::
.. code-block:: bash
git add --all
git status
Finally, the files should be comitted:
.. code-block::
.. code-block:: bash
git commit -m "First version of the your-model-name model and tokenizer."
And pushed to the remote:
.. code-block::
.. code-block:: bash
git push
@@ -199,7 +211,7 @@ don't forget to link to its model card so that people can fully trace how your m
If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
to see the steps to follow.
.. Note::
.. note::
You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
inside `path/to/awesome-name-you-picked/`.
@@ -225,3 +237,47 @@ You may specify a revision by using the ``revision`` flag in the ``from_pretrain
>>> "julien-c/EsperBERTo-small",
>>> revision="v2.0.1" # tag name, or branch name, or commit hash
>>> )
Workflow in a Colab notebook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you're in a Colab notebook (or similar) with no direct access to a terminal, here is the workflow you can use to
upload your model. You can execute each one of them in a cell by adding a ! at the beginning.
First you need to install `git-lfs` in the environment used by the notebook:
.. code-block:: bash
sudo apt-get install git-lfs
Then you can use the :obj:`transformers-cli` to create your new repo:
.. code-block:: bash
transformers-cli login
transformers-cli repo create your-model-name
Once it's created, you can clone it and configure it (replace username by your username on huggingface.co):
.. code-block:: bash
git clone https://username:password@huggingface.co/username/your-model-name
# Alternatively if you have a token,
# you can use it instead of your password
git clone https://username:token@huggingface.co/username/your-model-name
cd your-model-name
git lfs install
git config --global user.email "email@example.com"
# Tip: using the same email than for your huggingface.co account will link your commits to your profile
git config --global user.name "Your name"
Once you've saved your model inside, and your clone is setup with the right remote URL, you can add it and push it with
usual git commands.
.. code-block:: bash
git add .
git commit -m "Initial commit"
git push

View File

@@ -527,10 +527,10 @@ Pegasus
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in
* MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
causal mask to hide the future words like a regular auto-regressive transformer decoder.
@@ -560,6 +560,7 @@ A framework for translation models, using the same models as BART
The library provides a version of this model for conditional generation.
T5
-----------------------------------------------------------------------------------------------------------------------
@@ -592,6 +593,28 @@ For instance, if we have the sentence “My dog is very cute .”, and we decide
The library provides a version of this model for conditional generation.
MT5
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=mt5">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mt5-blueviolet">
</a>
<a href="model_doc/mt5.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mt5-blueviolet">
</a>
`mT5: A massively multilingual pre-trained text-to-text transformer <https://arxiv.org/abs/2010.11934>`_, Linting Xue
et al.
The model architecture is same as T5. mT5's pretraining objective includes T5's self-supervised training, but not T5's
supervised training. mT5 is trained on 101 languages.
The library provides a version of this model for conditional generation.
MBart
-----------------------------------------------------------------------------------------------------------------------
@@ -607,8 +630,8 @@ MBart
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
The model architecture and pretraining objective is same as BART, but MBart is trained on 25 languages and is intended
for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
sequence-to-sequence model by denoising full texts in multiple languages,
The library provides a version of this model for conditional generation.
@@ -635,7 +658,7 @@ ProphetNet
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
@@ -660,8 +683,8 @@ XLM-ProphetNet
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained
on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
versions for headline generation and question generation, respectively.

View File

@@ -109,7 +109,7 @@ XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream taks like classification, sequence
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:

View File

@@ -62,7 +62,7 @@ sliding the context window so that the model has more context when making each p
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token a time. This allows computation to procede much faster while still giving the model a large context to make
1 token a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers

View File

@@ -3,11 +3,11 @@ Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
For a list that includes community-uploaded models, refer to `https://huggingface.co/models
For a list that includes all community-uploaded models, refer to `https://huggingface.co/models
<https://huggingface.co/models>`__.
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Architecture | Shortcut name | Details of the model |
| Architecture | Model id | Details of the model |
+====================+============================================================+=======================================================================================================================================+
| BERT | ``bert-base-uncased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on lower-cased English text. |

View File

@@ -89,7 +89,7 @@ each other. The process is the following:
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=True)
>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> classes = ["not paraphrase", "is paraphrase"]
@@ -122,7 +122,7 @@ each other. The process is the following:
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=True)
>>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> classes = ["not paraphrase", "is paraphrase"]
@@ -211,7 +211,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)
>>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> text = r"""
... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
@@ -231,7 +231,9 @@ Here is an example of question answering using a model and a tokenizer. The proc
... input_ids = inputs["input_ids"].tolist()[0]
...
... text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
... answer_start_scores, answer_end_scores = model(**inputs)
... outputs = model(**inputs)
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... answer_start = torch.argmax(
... answer_start_scores
@@ -253,7 +255,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> text = r"""
... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
@@ -273,7 +275,9 @@ Here is an example of question answering using a model and a tokenizer. The proc
... input_ids = inputs["input_ids"].numpy()[0]
...
... text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
... answer_start_scores, answer_end_scores = model(inputs)
... outputs = model(inputs)
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... answer_start = tf.argmax(
... answer_start_scores, axis=1
@@ -301,7 +305,7 @@ Language modeling is the task of fitting a model to a corpus, which can be domai
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.
Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
@@ -373,7 +377,7 @@ Here is an example of doing masked language modeling using a model and a tokeniz
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased", return_dict=True)
>>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")
>>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
@@ -389,7 +393,7 @@ Here is an example of doing masked language modeling using a model and a tokeniz
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased", return_dict=True)
>>> model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")
>>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
@@ -437,7 +441,7 @@ of tokens.
>>> from torch.nn import functional as F
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelWithLMHead.from_pretrained("gpt2", return_dict=True)
>>> model = AutoModelWithLMHead.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and "
@@ -461,7 +465,7 @@ of tokens.
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = TFAutoModelWithLMHead.from_pretrained("gpt2", return_dict=True)
>>> model = TFAutoModelWithLMHead.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and "
@@ -513,14 +517,14 @@ Here, the model generates a random text with a total maximal length of *50* toke
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
pipeline, as is shown above for the argument ``max_length``.
Here is an example of text generation using ``XLNet`` and its tokenzier.
Here is an example of text generation using ``XLNet`` and its tokenizer.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased", return_dict=True)
>>> model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
@@ -545,7 +549,7 @@ Here is an example of text generation using ``XLNet`` and its tokenzier.
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer
>>> model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased", return_dict=True)
>>> model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
@@ -664,7 +668,7 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> from transformers import AutoModelForTokenClassification, AutoTokenizer
>>> import torch
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english", return_dict=True)
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> label_list = [
@@ -692,7 +696,7 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
>>> import tensorflow as tf
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english", return_dict=True)
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> label_list = [
@@ -790,7 +794,7 @@ CNN / Daily Mail), it yields very good results.
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
>>> model = AutoModelWithLMHead.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
@@ -799,7 +803,7 @@ CNN / Daily Mail), it yields very good results.
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer
>>> model = TFAutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
>>> model = TFAutoModelWithLMHead.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
@@ -834,7 +838,7 @@ Here is an example of doing translation using a model and a tokenizer. The proce
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarizaed.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "translate English to German: "
4. Use the ``PreTrainedModel.generate()`` method to perform the translation.
@@ -843,7 +847,7 @@ Here is an example of doing translation using a model and a tokenizer. The proce
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
>>> model = AutoModelWithLMHead.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
@@ -851,7 +855,7 @@ Here is an example of doing translation using a model and a tokenizer. The proce
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer
>>> model = TFAutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
>>> model = TFAutoModelWithLMHead.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="tf")

View File

@@ -405,32 +405,32 @@ decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
* ``require_torch`` - this test will run only under torch
* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU
* ``require_torch_multigpu`` - as ``require_torch`` plus requires at least 2 GPUs
* ``require_torch_non_multigpu`` - as ``require_torch`` plus requires 0 or 1 GPUs
* ``require_torch_multi_gpu`` - as ``require_torch`` plus requires at least 2 GPUs
* ``require_torch_non_multi_gpu`` - as ``require_torch`` plus requires 0 or 1 GPUs
* ``require_torch_tpu`` - as ``require_torch`` plus requires at least 1 TPU
Let's depict the GPU requirements in the following table:
+----------+---------------------------------+
| n gpus | decorator |
+==========+=================================+
| ``>= 0`` | ``@require_torch`` |
+----------+---------------------------------+
| ``>= 1`` | ``@require_torch_gpu`` |
+----------+---------------------------------+
| ``>= 2`` | ``@require_torch_multigpu`` |
+----------+---------------------------------+
| ``< 2`` | ``@require_torch_non_multigpu`` |
+----------+---------------------------------+
+----------+----------------------------------+
| n gpus | decorator |
+==========+==================================+
| ``>= 0`` | ``@require_torch`` |
+----------+----------------------------------+
| ``>= 1`` | ``@require_torch_gpu`` |
+----------+----------------------------------+
| ``>= 2`` | ``@require_torch_multi_gpu`` |
+----------+----------------------------------+
| ``< 2`` | ``@require_torch_non_multi_gpu`` |
+----------+----------------------------------+
For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
.. code-block:: python
@require_torch_multigpu
def test_example_with_multigpu():
@require_torch_multi_gpu
def test_example_with_multi_gpu():
If a test requires ``tensorflow`` use the ``require_tf`` decorator. For example:
@@ -454,7 +454,7 @@ last for them to work correctly. Here is an example of the correct usage:
.. code-block:: python
@parameterized.expand(...)
@require_torch_multigpu
@require_torch_multi_gpu
def test_integration_foo():
This order problem doesn't exist with ``@pytest.mark.parametrize``, you can put it first or last and it will still
@@ -716,11 +716,11 @@ Temporary files and directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite
each other's data. Also we want to get the temp files and directories removed at the end of each test that created
each other's data. Also we want to get the temporary files and directories removed at the end of each test that created
them. Therefore, using packages like ``tempfile``, which address these needs is essential.
However, when debugging tests, you need to be able to see what goes into the temp file or directory and you want to
know it's exact path and not having it randomized on every test re-run.
However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want
to know it's exact path and not having it randomized on every test re-run.
A helper class :obj:`transformers.test_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
:obj:`unittest.TestCase`, so we can easily inherit from it in the test modules.
@@ -736,32 +736,33 @@ Here is an example of its usage:
This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location.
In this and all the following scenarios the temporary directory will be auto-removed at the end of test, unless
``after=False`` is passed to the helper function.
* Create a temporary directory of my choice and delete it at the end - useful for debugging when you want to monitor a
specific directory:
* Create a unique temporary dir:
.. code-block:: python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test")
tmp_dir = self.get_auto_remove_tmp_dir()
* Create a temporary directory of my choice and do not delete it at the end---useful for when you want to look at the
temp results:
``tmp_dir`` will contain the path to the created temporary dir. It will be automatically removed at the end of the
test.
* Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test.
.. code-block:: python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", after=False)
tmp_dir = self.get_auto_remove_tmp_dir("./xxx")
* Create a temporary directory of my choice and ensure to delete it right away---useful for when you disabled deletion
in the previous test run and want to make sure the that temporary directory is empty before the new test is run:
This is useful for debug when you want to monitor a specific directory and want to make sure the previous tests didn't
leave any data in there.
.. code-block:: python
* You can override the default behavior by directly overriding the ``before`` and ``after`` args, leading to one of the
following behaviors:
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", before=True)
- ``before=True``: the temporary dir will always be cleared at the beginning of the test.
- ``before=False``: if the temporary dir already existed, any existing files will remain there.
- ``after=True``: the temporary dir will always be deleted at the end of the test.
- ``after=False``: the temporary dir will always be left intact at the end of the test.
.. note::
In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if
@@ -815,7 +816,7 @@ or the ``xfail`` way:
@pytest.mark.xfail
def test_feature_x():
Here is how to skip a test based on some internal check inside the test:
- Here is how to skip a test based on some internal check inside the test:
.. code-block:: python
@@ -838,7 +839,7 @@ or the ``xfail`` way:
def test_feature_x():
pytest.xfail("expected to fail until bug XYZ is fixed")
Here is how to skip all tests in a module if some import is missing:
- Here is how to skip all tests in a module if some import is missing:
.. code-block:: python
@@ -1054,7 +1055,7 @@ If you need to validate the output of a logger, you can use :obj:`CaptureLogger`
msg = "Testing 1, 2, 3"
logging.set_verbosity_info()
logger = logging.get_logger("transformers.tokenization_bart")
logger = logging.get_logger("transformers.models.bart.tokenization_bart")
with CaptureLogger(logger) as cl:
logger.info(msg)
assert cl.out, msg+"\n"

View File

@@ -1,223 +1,243 @@
Tokenizer summary
Summary of the tokenizers
-----------------------------------------------------------------------------------------------------------------------
In this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
part is pretty straightforward, here we will focus on the first part. More specifically, we will look at the three main
different kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
those.
On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
using :ref:`WordPiece <wordpiece>`.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
that the model uses :ref:`WordPiece <wordpiece>`.
Introduction to tokenization
Introduction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this
text is just to split it by spaces, which would give:
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
this text is to split it by spaces, which would give:
.. code-block::
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
into account. This would give:
This is a sensible first step, but if we look at the tokens ``"Transformers?"`` and ``"do."``, we notice that the
punctuation is attached to the words ``"Transformer"`` and ``"do"``, which is suboptimal. We should take the
punctuation into account so that a model does not have to learn a different representation of a word and every possible
punctuation symbol that could follow it, which would explode the number of representations the model has to learn.
Taking punctuation into account, tokenizing our exemplary text would give:
.. code-block::
["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
perform properly if you don't use the exact same rules as the persons who pretrained it.
Better. However, it is disadvantageous, how the tokenization dealt with the word ``"Don't"``. ``"Don't"`` stands for
``"do not"``, so it would be better tokenized as ``["Do", "n't"]``. This is where things start getting complicated, and
part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a
different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an
input that was tokenized with the same rules that were used to tokenize its training data.
`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. On the text above, they'd output something like:
rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like:
.. code-block::
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!
As can be seen space and punctuation tokenization, as well as rule-based tokenization, is used here. Space and
punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, :doc:`Transformer XL
<model_doc/transformerxl>` uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
language.
Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
greater than 50,000, especially if they are pretrained only on a single language.
So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
form (almost) arbitrarily long complex words by stringing together some subwords.
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
stand-alone subwords would appear more frequently while at the same time the meaning of ``"annoyingly"`` is kept by the
composite meaning of ``"annoying"`` and ``"ly"``. This is especially useful in agglutinative languages such as Turkish,
where you can form (almost) arbitrarily long complex words by stringing together subwords.
This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:
Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful
context-independent representations. In addition, subword tokenization enables the model to process words it has never
seen before, by decomposing them into known subwords. For instance, the :class:`~transformers.BertTokenizer` tokenizes
``"I have a new GPU!"`` as follows:
.. code-block::
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("I have a new GPU!")
['i', 'have', 'a', 'new', 'gp', '##u', '!']
["i", "have", "a", "new", "gp", "##u", "!"]
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The
"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).
Because we are considering the uncased model, the sentence was lowercased first. We can see that the words ``["i",
"have", "a", "new"]`` are present in the tokenizer's vocabulary, but the word ``"gpu"`` is not. Consequently, the
tokenizer splits ``"gpu"`` into known subwords: ``["gp" and "##u"]``. ``"##"`` means that the rest of the token should
be attached to the previous one, without space (for decoding or reversal of the tokenization).
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
As another example, :class:`~transformers.XLNetTokenizer` tokenizes our previously exemplary text as follows:
.. code-block::
>>> from transformers import XLNetTokenizer
>>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
Transformers has been split into "Transform" and "ers".
We'll get back to the meaning of those ``"▁"`` when we look at :ref:`SentencePiece <sentencepiece>`. As one can see,
the rare word ``"Transformers"`` has been split into the more frequent subwords ``"Transform"`` and ``"ers"``.
Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
training which is usually done on the corpus the corresponding model will be trained on.
Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization
algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained
on.
.. _byte-pair-encoding:
Byte-Pair Encoding
Byte-Pair Encoding (BPE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` use
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
Byte-Pair Encoding (BPE) was introduced in `Neural Machine Translation of Rare Words with Subword Units (Sennrich et
al., 2015) <https://arxiv.org/abs/1508.07909>`__. BPE relies on a pre-tokenizer that splits the training data into
words. Pretokenization can be as simple as space tokenization, e.g. :doc:`GPT-2 <model_doc/gpt2>`, :doc:`Roberta
<model_doc/roberta>`. More advanced pre-tokenization include rule-based tokenization, e.g. :doc:`XLM <model_doc/xlm>`,
:doc:`FlauBERT <model_doc/flaubert>` which uses Moses for most languages, or :doc:`GPT <model_doc/gpt>` which uses
Spacy and ftfy, to count the frequency of each word in the training corpus.
:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the
training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
define before training the tokenizer.
It then begins from the list of all characters and will learn merge rules to form a new token from two symbols in the
vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
word):
As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been
determined:
.. code-block::
('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
Consequently, the base vocabulary is ``["b", "g", "h", "n", "p", "s", "u"]``. Splitting all words into symbols of the
base vocabulary, we obtain:
.. code-block::
('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
then it adds 'ug' to the vocabulary. Our corpus then becomes
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
the example above ``"h"`` followed by ``"u"`` is present `10 + 5 = 15` times (10 times in the 10 occurrences of
``"hug"``, 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is ``"u"`` followed by "g",
occurring `10 + 5 + 5 = 20` times in total. Thus, the first merge rule the tokenizer learns is to group all ``"u"``
symbols followed by a ``"g"`` symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes
.. code-block::
('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
to the vocabulary.
BPE then identifies the next most common symbol pair. It's ``"u"`` followed by ``"n"``, which occurs 16 times. ``"u"``,
``"n"`` is merged to ``"un"`` and added to the vocabulary. The next most frequent symbol pair is ``"h"`` followed by
``"ug"``, occurring 15 times. Again the pair is merged and ``"hug"`` can be added to the vocabulary.
At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
represented as
At this stage, the vocabulary is ``["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`` and our set of unique words
is represented as
.. code-block::
('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be
tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
(since the base corpus uses all of them), but to special characters like emojis.
Assuming, that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied
to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance,
the word ``"bug"`` would be tokenized to ``["b", "ug"]`` but ``"mug"`` would be tokenized as ``["<unk>", "ug"]`` since
the symbol ``"m"`` is not in the base vocabulary. In general, single letters such as ``"m"`` are not replaced by the
``"<unk>"`` symbol because the training data usually includes at least one occurrence of each letter, but it is likely
to happen for very special characters like emojis.
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop the training of the tokenizer at 40,000 merges.
and chose to stop training after 40,000 merges.
Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
all unicode characters, the `GPT-2 paper
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
deal with punctuation, this manages to be able to tokenize every text without needing an unknown token. For instance,
the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens,
a special end-of-text token and the symbols learned with 50,000 merges.
A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are
considered as base characters. To have a better base vocabulary, `GPT-2
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ uses bytes
as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that
every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT2's
tokenizer can tokenize every text without the need for the <unk> symbol. :doc:`GPT-2 <model_doc/gpt>` has a vocabulary
size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned
with 50,000 merges.
.. _wordpiece:
WordPiece
=======================================================================================================================
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
given number of merge rules, the difference is that it doesn't choose the pair that is the most frequent but the one
that will maximize the likelihood on the corpus once merged.
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
Voice Seach (Schuster et al., 2012)
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
sure it's `worth it`.
So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by
its second symbol is the greatest among all symbol pairs. *E.g.* ``"u"``, followed by ``"g"`` would have only been
merged if the probability of ``"ug"`` divided by ``"u"``, ``"g"`` would have been greater than for any other symbol
pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it `loses` by merging two symbols
to make ensure it's `worth it`.
.. _unigram:
Unigram
=======================================================================================================================
Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.
Unigram is a subword tokenization algorithm introduced in `Subword Regularization: Improving Neural Network Translation
Models with Multiple Subword Candidates (Kudo, 2018) <https://arxiv.org/pdf/1804.10959.pdf>`__. In contrast to BPE or
WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each
symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and
the most common substrings. Unigram is not used directly for any of the models in the transformers, but it's used in
conjunction with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
for each subword, evaluate how much the loss would increase if the subword was removed from the vocabulary. It then
sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
like BPE or WordPiece).
At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm
computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then
removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, *i.e.* those
symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has
reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.
Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
vocabulary
Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:
.. code-block::
['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],
we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
of the tokenization according to their probabilities).
``"hugs"`` could be tokenized both as ``["hug", "s"]``, ``["h", "ug", "s"]`` or ``["h", "u", "g", "s"]``. So which one
to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that
the probability of each possible tokenization can be computed after training. The algorithm simply picks the most
likely tokenization in practice, but also offers the possibility to sample a possible tokenization according to their
probabilities.
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of
:math:`x_{i}` (with the current vocabulary), then the loss is defined as
Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
the words :math:`x_{1}, \dots, x_{N}` and that the set of all possible tokenizations for a word :math:`x_{i}` is
defined as :math:`S(x_{i})`, then the overall loss is defined as
.. math::
\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
@@ -227,15 +247,18 @@ x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all
SentencePiece
=======================================================================================================================
All the methods we have been looking at so far required some form of pretokenization, which has a central problem: not
all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to
separate words. However, not all languages use spaces to separate words. One possible solution is to use language
specific pre-tokenizers, *e.g.* :doc:`XLM <model_doc/xlm>` uses a specific Chinese, Japanese, and Thai pre-tokenizer).
To solve this problem more generally, `SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing (Kudo et al., 2018) <https://arxiv.org/pdf/1808.06226.pdf>`__ treats the input
as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
algorithm to construct the appropriate vocabulary.
That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
of them together and replace '▁' with space.
The :class:`~transformers.XLNetTokenizer` uses SentencePiece for example, which is also why in the example earlier the
``"▁"`` character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be
concatenated and ``"▁"`` is replaced by a space.
All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>`, :doc:`Marian
<model_doc/marian>`, and :doc:`T5 <model_doc/t5>`.

View File

@@ -39,7 +39,7 @@ head on top of the encoder with an output size of 2. Models are initialized in `
.. code-block:: python
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.train()
This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever

View File

@@ -23,6 +23,7 @@ from typing import Dict, List, Optional
import numpy as np
import torch
import transformers
from transformers import (
AutoConfig,
AutoModelForSequenceClassification,
@@ -33,6 +34,7 @@ from transformers import (
default_data_collator,
set_seed,
)
from transformers.trainer_utils import is_main_process
from utils_hans import HansDataset, InputFeatures, hans_processors, hans_tasks_num_labels
@@ -55,7 +57,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
@@ -124,6 +127,11 @@ def main():
bool(training_args.local_rank != -1),
training_args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed

View File

@@ -25,7 +25,7 @@ class PlotArguments:
)
plot_along_batch: bool = field(
default=False,
metadata={"help": "Whether to plot along batch size or sequence lengh. Defaults to sequence length."},
metadata={"help": "Whether to plot along batch size or sequence length. Defaults to sequence length."},
)
is_time: bool = field(
default=False,

View File

@@ -21,7 +21,7 @@ import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_albert import (
from transformers.models.albert.modeling_albert import (
ALBERT_INPUTS_DOCSTRING,
ALBERT_START_DOCSTRING,
AlbertModel,

View File

@@ -23,7 +23,7 @@ from torch import nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_bert import (
from transformers.models.bert.modeling_bert import (
BERT_INPUTS_DOCSTRING,
BERT_START_DOCSTRING,
BertEncoder,

View File

@@ -29,6 +29,7 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Tenso
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
import transformers
from pabee.modeling_pabee_albert import AlbertForSequenceClassificationWithPabee
from pabee.modeling_pabee_bert import BertForSequenceClassificationWithPabee
from transformers import (
@@ -44,6 +45,7 @@ from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers.trainer_utils import is_main_process
try:
@@ -474,7 +476,7 @@ def main():
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
@@ -630,7 +632,11 @@ def main():
bool(args.local_rank != -1),
args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seed
set_seed(args)

View File

@@ -4,7 +4,7 @@ import sys
from unittest.mock import patch
import run_glue_with_pabee
from transformers.testing_utils import TestCasePlus, require_torch_non_multigpu_but_fix_me
from transformers.testing_utils import TestCasePlus, require_torch_non_multi_gpu_but_fix_me
logging.basicConfig(level=logging.DEBUG)
@@ -20,7 +20,7 @@ def get_setup_file():
class PabeeTests(TestCasePlus):
@require_torch_non_multigpu_but_fix_me
@require_torch_non_multi_gpu_but_fix_me
def test_run_glue(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)

View File

@@ -30,6 +30,7 @@ from torch.utils.data import DataLoader, SequentialSampler, Subset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
import transformers
from transformers import (
AutoConfig,
AutoModelForSequenceClassification,
@@ -41,6 +42,7 @@ from transformers import (
glue_processors,
set_seed,
)
from transformers.trainer_utils import is_main_process
logger = logging.getLogger(__name__)
@@ -296,7 +298,7 @@ def main():
"--cache_dir",
default=None,
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances."
@@ -368,6 +370,11 @@ def main():
# Setup logging
logging.basicConfig(level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, args.n_gpu, bool(args.local_rank != -1)))
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seeds
set_seed(args.seed)

View File

@@ -29,6 +29,7 @@ from typing import Optional
from torch.utils.data import ConcatDataset
import transformers
from transformers import (
CONFIG_MAPPING,
MODEL_WITH_LM_HEAD_MAPPING,
@@ -47,6 +48,7 @@ from transformers import (
TrainingArguments,
set_seed,
)
from transformers.trainer_utils import is_main_process
logger = logging.getLogger(__name__)
@@ -79,7 +81,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
@@ -219,6 +222,11 @@ def main():
bool(training_args.local_rank != -1),
training_args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed

View File

@@ -31,6 +31,7 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
import transformers
from transformers import (
WEIGHTS_NAME,
AdamW,
@@ -41,6 +42,7 @@ from transformers import (
MMBTForClassification,
get_linear_schedule_with_warmup,
)
from transformers.trainer_utils import is_main_process
from utils_mmimdb import ImageEncoder, JsonlDataset, collate_fn, get_image_transforms, get_mmimdb_labels
@@ -348,7 +350,7 @@ def main():
"--cache_dir",
default=None,
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
@@ -476,7 +478,11 @@ def main():
bool(args.local_rank != -1),
args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seed
set_seed(args)

View File

@@ -1,7 +1,6 @@
import torch
from transformers.modeling_camembert import CamembertForMaskedLM
from transformers.tokenization_camembert import CamembertTokenizer
from transformers import CamembertForMaskedLM, CamembertTokenizer
def fill_mask(masked_input, model, tokenizer, topk=5):

View File

@@ -3,7 +3,7 @@ import json
from typing import List
from ltp import LTP
from transformers.tokenization_bert import BertTokenizer
from transformers import BertTokenizer
def _is_chinese_char(cp):

View File

@@ -31,8 +31,16 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Tenso
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import WEIGHTS_NAME, AdamW, AutoConfig, AutoTokenizer, get_linear_schedule_with_warmup
from transformers.modeling_auto import AutoModelForMultipleChoice
import transformers
from transformers import (
WEIGHTS_NAME,
AdamW,
AutoConfig,
AutoModelForMultipleChoice,
AutoTokenizer,
get_linear_schedule_with_warmup,
)
from transformers.trainer_utils import is_main_process
try:
@@ -620,6 +628,11 @@ def main():
bool(args.local_rank != -1),
args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seed
set_seed(args)

View File

@@ -13,6 +13,7 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Tenso
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
import transformers
from src.modeling_highway_bert import DeeBertForSequenceClassification
from src.modeling_highway_roberta import DeeRobertaForSequenceClassification
from transformers import (
@@ -28,6 +29,7 @@ from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers.trainer_utils import is_main_process
try:
@@ -450,7 +452,7 @@ def main():
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
@@ -580,7 +582,11 @@ def main():
bool(args.local_rank != -1),
args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seed
set_seed(args)

View File

@@ -3,7 +3,7 @@ from torch import nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_bert import (
from transformers.models.bert.modeling_bert import (
BERT_INPUTS_DOCSTRING,
BERT_START_DOCSTRING,
BertEmbeddings,

View File

@@ -3,9 +3,13 @@ from __future__ import absolute_import, division, print_function, unicode_litera
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.configuration_roberta import RobertaConfig
from transformers import RobertaConfig
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_roberta import ROBERTA_INPUTS_DOCSTRING, ROBERTA_START_DOCSTRING, RobertaEmbeddings
from transformers.models.roberta.modeling_roberta import (
ROBERTA_INPUTS_DOCSTRING,
ROBERTA_START_DOCSTRING,
RobertaEmbeddings,
)
from .modeling_highway_bert import BertPreTrainedModel, DeeBertModel, HighwayException, entropy

View File

@@ -5,7 +5,7 @@ import unittest
from unittest.mock import patch
import run_glue_deebert
from transformers.testing_utils import require_torch_non_multigpu_but_fix_me, slow
from transformers.testing_utils import require_torch_non_multi_gpu_but_fix_me, slow
logging.basicConfig(level=logging.DEBUG)
@@ -26,7 +26,7 @@ class DeeBertTests(unittest.TestCase):
logger.addHandler(stream_handler)
@slow
@require_torch_non_multigpu_but_fix_me
@require_torch_non_multi_gpu_but_fix_me
def test_glue_deebert_train(self):
train_args = """

View File

@@ -17,7 +17,7 @@ This folder contains the original code used to train Distil* as well as examples
## What is Distil*
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
We have applied the same method to other Transformer architectures and released the weights:
- GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 16.3 compared to 21.1 for **DistilGPT2** (after fine-tuning on the train set).
@@ -57,7 +57,7 @@ Here are the results on the *test* sets for 6 of the languages available in XNLI
This part of the library has only be tested with Python3.6+. There are few specific dependencies to install before launching a distillation, you can install them with the command `pip install -r requirements.txt`.
**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breakings changes compared to v1.1.0).
**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
## How to use DistilBERT
@@ -111,7 +111,7 @@ python scripts/binarized_data.py \
--dump_file data/binarized_text
```
Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smooths the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
```bash
python scripts/token_counts.py \
@@ -173,7 +173,7 @@ python -m torch.distributed.launch \
--token_counts data/token_counts.bert-base-uncased.pickle
```
**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
**Tips:** Starting distilled training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
Happy distillation!

View File

@@ -188,7 +188,7 @@ class Distiller:
def prepare_batch_mlm(self, batch):
"""
Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.
Prepare the batch: from the token_ids and the lengths, compute the attention mask and the masked label for MLM.
Input:
------
@@ -200,7 +200,7 @@ class Distiller:
-------
token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -100 where there is nothing to predict.
mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels. There is a -100 where there is nothing to predict.
"""
token_ids, lengths = batch
token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
@@ -253,7 +253,7 @@ class Distiller:
def prepare_batch_clm(self, batch):
"""
Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the labels for CLM.
Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
Input:
------

View File

@@ -30,6 +30,7 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
import transformers
from transformers import (
WEIGHTS_NAME,
AdamW,
@@ -57,6 +58,7 @@ from transformers.data.metrics.squad_metrics import (
squad_evaluate,
)
from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor
from transformers.trainer_utils import is_main_process
try:
@@ -576,7 +578,7 @@ def main():
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
@@ -745,7 +747,11 @@ def main():
bool(args.local_rank != -1),
args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seed
set_seed(args)

View File

@@ -86,7 +86,7 @@ if __name__ == "__main__":
compressed_sd[f"vocab_layer_norm.{w}"] = state_dict[f"cls.predictions.transform.LayerNorm.{w}"]
print(f"N layers selected for distillation: {std_idx}")
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
print(f"Number of params transferred for distillation: {len(compressed_sd.keys())}")
print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
print(f"Save transferred checkpoint to {args.dump_checkpoint}.")
torch.save(compressed_sd, args.dump_checkpoint)

View File

@@ -90,7 +90,7 @@ selected tokens (which may be part of words), they mask randomly selected words
to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
To fine-tune a model using whole word masking, use the following script:
```bash
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
@@ -164,7 +164,7 @@ context length for permutation language modeling.
The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used
for permutation language modeling.
Here is how to fine-tun XLNet on wikitext-2:
Here is how to fine-tune XLNet on wikitext-2:
```bash
python run_plm.py \

View File

@@ -76,7 +76,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
@@ -168,6 +169,8 @@ def main():
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
@@ -254,7 +257,7 @@ def main():
tokenize_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=[text_column_name],
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
@@ -310,9 +313,12 @@ def main():
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
trainer.save_model() # Saves the tokenizer too for easy upload
# Evaluation

View File

@@ -74,7 +74,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
@@ -179,6 +180,8 @@ def main():
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
@@ -292,7 +295,7 @@ def main():
tokenize_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=[text_column_name],
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
@@ -351,9 +354,12 @@ def main():
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
trainer.save_model() # Saves the tokenizer too for easy upload
# Evaluation

View File

@@ -76,7 +76,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
@@ -186,6 +187,8 @@ def main():
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
@@ -299,9 +302,12 @@ def main():
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
trainer.save_model() # Saves the tokenizer too for easy upload
# Evaluation

View File

@@ -64,7 +64,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
@@ -176,6 +177,8 @@ def main():
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
@@ -279,7 +282,7 @@ def main():
tokenize_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=[text_column_name],
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
@@ -341,9 +344,12 @@ def main():
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
trainer.save_model() # Saves the tokenizer too for easy upload
# Evaluation

View File

@@ -4,6 +4,7 @@ import os
from pathlib import Path
from typing import Any, Dict
import packaging
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_info
@@ -33,15 +34,17 @@ from transformers.optimization import (
logger = logging.getLogger(__name__)
try:
pkg = "pytorch_lightning"
min_ver = "1.0.4"
pkg_resources.require(f"{pkg}>={min_ver}")
except pkg_resources.VersionConflict:
logger.warning(
f"{pkg}>={min_ver} is required for a normal functioning of this module, but found {pkg}=={pkg_resources.get_distribution(pkg).version}. Try pip install -r examples/requirements.txt"
)
def require_min_ver(pkg, min_ver):
got_ver = pkg_resources.get_distribution(pkg).version
if packaging.version.parse(got_ver) < packaging.version.parse(min_ver):
logger.warning(
f"{pkg}>={min_ver} is required for a normal functioning of this module, but found {pkg}=={got_ver}. "
"Try: pip install -r examples/requirements.txt"
)
require_min_ver("pytorch_lightning", "1.0.4")
MODEL_MODES = {
"base": AutoModel,
@@ -233,7 +236,7 @@ class BaseTransformer(pl.LightningModule):
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--encoder_layerdrop",

View File

@@ -107,7 +107,12 @@ def make_support(question, source="wiki40b", method="dense", n_results=10):
return question_doc, support_list
@st.cache(hash_funcs={torch.Tensor: (lambda _: None), transformers.tokenization_bart.BartTokenizer: (lambda _: None)})
@st.cache(
hash_funcs={
torch.Tensor: (lambda _: None),
transformers.models.bart.tokenization_bart.BartTokenizer: (lambda _: None),
}
)
def answer_question(
question_doc, s2s_model, s2s_tokenizer, min_len=64, max_len=256, sampling=False, n_beams=2, top_p=0.95, temp=0.8
):

View File

@@ -210,7 +210,6 @@
" visual_feats=features,\n",
" visual_pos=normalized_boxes,\n",
" token_type_ids=inputs.token_type_ids,\n",
" return_dict=True,\n",
" output_attentions=False,\n",
" )\n",
" output_vqa = lxmert_vqa(\n",
@@ -219,7 +218,6 @@
" visual_feats=features,\n",
" visual_pos=normalized_boxes,\n",
" token_type_ids=inputs.token_type_ids,\n",
" return_dict=True,\n",
" output_attentions=False,\n",
" )\n",
" # get prediction\n",
@@ -266,4 +264,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}

View File

@@ -21,7 +21,7 @@ You can also have a look at this fun *Explain Like I'm Five* introductory [slide
One promise of extreme pruning is to obtain extremely small models that can be easily sent (and stored) on edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decreasing the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the orignal dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the original dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothetize that further memory compression ratios can be achieved with specific quantization aware trainings (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).

View File

@@ -16,7 +16,7 @@
"""Masked Version of BERT. It replaces the `torch.nn.Linear` layers with
:class:`~emmental.MaskedLinear` and add an additional parameters in the forward pass to
compute the adaptive mask.
Built on top of `transformers.modeling_bert`"""
Built on top of `transformers.models.bert.modeling_bert`"""
import logging
@@ -29,8 +29,8 @@ from torch.nn import CrossEntropyLoss, MSELoss
from emmental import MaskedBertConfig
from emmental.modules import MaskedLinear
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_bert import ACT2FN, BertLayerNorm, load_tf_weights_in_bert
from transformers.modeling_utils import PreTrainedModel, prune_linear_layer
from transformers.models.bert.modeling_bert import ACT2FN, BertLayerNorm, load_tf_weights_in_bert
logger = logging.getLogger(__name__)

View File

@@ -14,7 +14,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Binarizers take a (real value) matrice as input and produce a binary (values in {0,1}) mask of the same shape.
Binarizers take a (real value) matrix as input and produce a binary (values in {0,1}) mask of the same shape.
"""
import torch

View File

@@ -620,7 +620,7 @@ def main():
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",

View File

@@ -725,7 +725,7 @@ def main():
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(

View File

@@ -23,6 +23,7 @@ from typing import Dict, Optional
import numpy as np
import transformers
from transformers import (
AutoConfig,
AutoModelForMultipleChoice,
@@ -33,6 +34,7 @@ from transformers import (
TrainingArguments,
set_seed,
)
from transformers.trainer_utils import is_main_process
from utils_multiple_choice import MultipleChoiceDataset, Split, processors
@@ -59,7 +61,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
@@ -115,6 +118,11 @@ def main():
bool(training_args.local_rank != -1),
training_args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed

View File

@@ -33,9 +33,15 @@ from transformers import (
TFTrainingArguments,
set_seed,
)
from transformers.utils import logging as hf_logging
from utils_multiple_choice import Split, TFMultipleChoiceDataset, processors
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()
logger = logging.getLogger(__name__)
@@ -59,7 +65,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)

View File

@@ -29,6 +29,7 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
import transformers
from transformers import (
MODEL_FOR_QUESTION_ANSWERING_MAPPING,
WEIGHTS_NAME,
@@ -45,6 +46,7 @@ from transformers.data.metrics.squad_metrics import (
squad_evaluate,
)
from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor
from transformers.trainer_utils import is_main_process
try:
@@ -319,7 +321,7 @@ def evaluate(args, model, tokenizer, prefix=""):
eval_feature = features[feature_index.item()]
unique_id = int(eval_feature.unique_id)
output = [to_list(output[i]) for output in outputs]
output = [to_list(output[i]) for output in outputs.to_tuple()]
# Some models (XLNet, XLM) use 5 arguments for their predictions, while the other "simpler"
# models only use two.
@@ -530,7 +532,7 @@ def main():
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
@@ -712,7 +714,11 @@ def main():
bool(args.local_rank != -1),
args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Set seed
set_seed(args)
@@ -730,6 +736,7 @@ def main():
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
use_fast=False, # SquadDataset is not compatible with Fast tokenizers which have a smarter overflow handeling
)
model = AutoModelForQuestionAnswering.from_pretrained(
args.model_name_or_path,
@@ -778,7 +785,10 @@ def main():
# Load a trained model and vocabulary that you have fine-tuned
model = AutoModelForQuestionAnswering.from_pretrained(args.output_dir) # , force_download=True)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
# SquadDataset is not compatible with Fast tokenizers which have a smarter overflow handeling
# So we use use_fast=False here for now until Fast-tokenizer-compatible-examples are out
tokenizer = AutoTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case, use_fast=False)
model.to(args.device)
# Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory

View File

@@ -22,9 +22,11 @@ import sys
from dataclasses import dataclass, field
from typing import Optional
import transformers
from transformers import AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer, HfArgumentParser, SquadDataset
from transformers import SquadDataTrainingArguments as DataTrainingArguments
from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import is_main_process
logger = logging.getLogger(__name__)
@@ -49,7 +51,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
@@ -91,6 +94,11 @@ def main():
bool(training_args.local_rank != -1),
training_args.fp16,
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
logger.info("Training/evaluation parameters %s", training_args)
# Prepare Question-Answering task
@@ -107,6 +115,7 @@ def main():
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
use_fast=False, # SquadDataset is not compatible with Fast tokenizers which have a smarter overflow handeling
)
model = AutoModelForQuestionAnswering.from_pretrained(
model_args.model_name_or_path,

View File

@@ -33,6 +33,12 @@ from transformers import (
squad_convert_examples_to_features,
)
from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor
from transformers.utils import logging as hf_logging
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()
logger = logging.getLogger(__name__)
@@ -57,7 +63,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)

View File

@@ -27,7 +27,7 @@ class RagPyTorchDistributedRetriever(RagRetriever):
It is used to decode the question and then use the generator_tokenizer.
generator_tokenizer (:class:`~transformers.PretrainedTokenizer`):
The tokenizer used for the generator part of the RagModel.
index (:class:`~transformers.retrieval_rag.Index`, optional, defaults to the one defined by the configuration):
index (:class:`~transformers.models.rag.retrieval_rag.Index`, optional, defaults to the one defined by the configuration):
If specified, use this index instead of the one built using the configuration
"""

View File

@@ -95,7 +95,7 @@ def evaluate_batch_retrieval(args, rag_model, questions):
truncation=True,
)["input_ids"].to(args.device)
question_enc_outputs = rag_model.rag.question_encoder(retriever_input_ids, return_dict=True)
question_enc_outputs = rag_model.rag.question_encoder(retriever_input_ids)
question_enc_pool_output = question_enc_outputs.pooler_output
result = rag_model.retriever(

View File

@@ -204,7 +204,6 @@ class GenerativeQAModule(BaseTransformer):
decoder_input_ids=decoder_input_ids,
use_cache=False,
labels=lm_labels,
return_dict=True,
**rag_kwargs,
)

View File

@@ -7,7 +7,7 @@ export PYTHONPATH="../":"${PYTHONPATH}"
python examples/rag/finetune.py \
--data_dir $DATA_DIR \
--output_dir $OUTPUT_DIR \
--model_name_or_path $MODLE_NAME_OR_PATH \
--model_name_or_path $MODEL_NAME_OR_PATH \
--model_type rag_sequence \
--fp16 \
--gpus 8 \

View File

@@ -11,16 +11,12 @@ import numpy as np
from datasets import Dataset
import faiss
from transformers.configuration_bart import BartConfig
from transformers.configuration_dpr import DPRConfig
from transformers.configuration_rag import RagConfig
from transformers import BartConfig, BartTokenizer, DPRConfig, DPRQuestionEncoderTokenizer, RagConfig
from transformers.file_utils import is_datasets_available, is_faiss_available, is_psutil_available, is_torch_available
from transformers.retrieval_rag import CustomHFIndex
from transformers.testing_utils import require_torch_non_multigpu_but_fix_me
from transformers.tokenization_bart import BartTokenizer
from transformers.tokenization_bert import VOCAB_FILES_NAMES as DPR_VOCAB_FILES_NAMES
from transformers.tokenization_dpr import DPRQuestionEncoderTokenizer
from transformers.tokenization_roberta import VOCAB_FILES_NAMES as BART_VOCAB_FILES_NAMES
from transformers.models.bert.tokenization_bert import VOCAB_FILES_NAMES as DPR_VOCAB_FILES_NAMES
from transformers.models.rag.retrieval_rag import CustomHFIndex
from transformers.models.roberta.tokenization_roberta import VOCAB_FILES_NAMES as BART_VOCAB_FILES_NAMES
from transformers.testing_utils import require_torch_non_multi_gpu_but_fix_me
sys.path.append(os.path.join(os.getcwd())) # noqa: E402 # noqa: E402 # isort:skip
@@ -137,7 +133,7 @@ class RagRetrieverTest(TestCase):
question_encoder=DPRConfig().to_dict(),
generator=BartConfig().to_dict(),
)
with patch("transformers.retrieval_rag.load_dataset") as mock_load_dataset:
with patch("transformers.models.rag.retrieval_rag.load_dataset") as mock_load_dataset:
mock_load_dataset.return_value = dataset
retriever = RagPyTorchDistributedRetriever(
config,
@@ -179,7 +175,7 @@ class RagRetrieverTest(TestCase):
retriever.init_retrieval(port)
return retriever
@require_torch_non_multigpu_but_fix_me
@require_torch_non_multi_gpu_but_fix_me
def test_pytorch_distributed_retriever_retrieve(self):
n_docs = 1
retriever = self.get_dummy_pytorch_distributed_retriever(init_retrieval=True)
@@ -195,7 +191,7 @@ class RagRetrieverTest(TestCase):
self.assertEqual(doc_dicts[1]["id"][0], "0") # max inner product is reached with first doc
self.assertListEqual(doc_ids.tolist(), [[1], [0]])
@require_torch_non_multigpu_but_fix_me
@require_torch_non_multi_gpu_but_fix_me
def test_custom_hf_index_retriever_retrieve(self):
n_docs = 1
retriever = self.get_dummy_custom_hf_index_retriever(init_retrieval=True, from_disk=False)
@@ -211,7 +207,7 @@ class RagRetrieverTest(TestCase):
self.assertEqual(doc_dicts[1]["id"][0], "0") # max inner product is reached with first doc
self.assertListEqual(doc_ids.tolist(), [[1], [0]])
@require_torch_non_multigpu_but_fix_me
@require_torch_non_multi_gpu_but_fix_me
def test_custom_pytorch_distributed_retriever_retrieve_from_disk(self):
n_docs = 1
retriever = self.get_dummy_custom_hf_index_retriever(init_retrieval=True, from_disk=True)

View File

@@ -18,3 +18,4 @@ fire
pytest
conllu
sentencepiece != 0.1.92
protobuf

View File

@@ -380,7 +380,7 @@ cp xsum/test* all_pl
then use `all_pl` as DATA in the command above.
#### Direct Knowledge Distillation (KD)
+ In this method, we use try to enforce that the student and teacher produce similar encoder_outputs, logits, and hidden_states using `BartSummarizationDistiller`.
+ In this method, we use try to enforce that the student and teacher produce similar encoder_outputs, logits, and hidden_states using `SummarizationDistiller`.
+ This method was used for `sshleifer/distilbart-xsum-12-6`, `6-6`, and `9-6` checkpoints were produced.
+ You must use [`distillation.py`](./distillation.py). Note that this command initializes the student for you.

View File

@@ -3,7 +3,8 @@
python finetune_trainer.py \
--learning_rate=3e-5 \
--fp16 \
--do_train --do_eval --do_predict --evaluate_during_training \
--do_train --do_eval --do_predict \
--evaluation_strategy steps \
--predict_with_generate \
--n_val 1000 \
"$@"

Some files were not shown because too many files have changed in this diff Show More