Commit Graph

133 Commits

Author SHA1 Message Date
Nicolas Patry
d33dc7966a Improve truncation_side (#14947)
* Enabling `truncation_side` for Slow and Fast tokenizer.

Co-Authored-by: Niels Rogge <48327001+NielsRogge@users.noreply.github.com>

* Disable failing tests.

* Layout xlm.

* assert -> assertEqual.

Co-authored-by: Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
2022-01-03 16:18:39 +01:00
Patrick von Platen
a1392883ce [AutoProcessor] Correct AutoProcessor and automatically add processor… (#14881)
* [AutoProcessor] Correct AutoProcessor and automatically add processor class

* up

* up

* up

* up

* up

* up

* up

* up

* continue tomorrow

* up

* up

* up

* make processor class private

* fix loop
2021-12-30 09:56:43 +01:00
Sylvain Gugger
b5e2b183af Doc styler examples (#14953)
* Fix bad examples

* Add black formatting to style_doc

* Use first nonempty line

* Put it at the right place

* Don't add spaces to empty lines

* Better templates

* Deal with triple quotes in docstrings

* Result of style_doc

* Enable mdx treatment and fix code examples in MDXs

* Result of doc styler on doc source files

* Last fixes

* Break copy from
2021-12-27 19:07:46 -05:00
Stas Bekman
133c5e40c4 [doc] consistent True/False/None default format (#14951)
* [doc] consistent True/False/None default format

* Update src/transformers/models/xlnet/modeling_xlnet.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-12-27 14:31:40 -08:00
Sylvain Gugger
87e6e4fe5c Doc styler v2 (#14950)
* New doc styler

* Fix issue with args at the start

* Code sample fixes

* Style code examples in MDX

* Fix more patterns

* Typo

* Typo

* More patterns

* Do without black for now

* Get more info in error

* Docstring style

* Re-enable check

* Quality

* Fix add_end_docstring decorator

* Fix docstring
2021-12-27 16:31:21 -05:00
Sylvain Gugger
87a033d9fa Properly indent return block (#14887) 2021-12-22 12:28:45 -05:00
Sylvain Gugger
27b3031de2 Mass conversion of documentation from rst to Markdown (#14866)
* Convert docstrings of all configurations and tokenizers

* Processors and fixes

* Last modeling files and fixes to models

* Pipeline modules

* Utils files

* Data submodule

* All the other files

* Style

* Missing examples

* Style again

* Fix copies

* Say bye bye to rst docstrings forever
2021-12-21 15:06:33 -05:00
Stas Bekman
b6ec956976 [logging] implement warning_advice / TRANSFORMERS_NO_ADVISORY_WARNINGS (#14669)
* [logging] implement warning_advice / TRANSFORMERS_NO_ADVISORY_WARNINGS

* reword
2021-12-20 20:48:38 -08:00
Vladimir Maryasin
6c4d688ffa add cache_dir for tokenizer verification loading (#14508)
When loading a pretrained tokenizer, a verification is done to ensure
that the actual tokenizer class matches the class it was called from.
If the tokenizer is absent, its config file is loaded from the repo.

However, the cache_dir for downloading is not provided, which leads to
ignoring of the user-specified cache_dir, storing files in several
places and and may result in incorrect warnings when the default
cache_dir is unreachsble.

This commit fixes that.
2021-11-24 06:22:03 -05:00
Li-Huai (Allan) Lin
9e37c5cdf8 Fix list index out of range when padding nested empty lists (#13876)
* Fix index out of range when padding

* Apply suggestions from code review

* Style
2021-11-10 21:34:52 +01:00
Apoorv Garg
6326aa4bf0 Correct order of overflowing tokens for LayoutLmV2 tokenizer (#13495)
* correct order of overflowing tokens for LayoutLmV2 tokenizer

* test to check order of overflowing_tokens for a seq of input_ids

* fix up quality

* added suggested changes

* check that tests the bbox sequence

* pair_input test added

* pass quality test

* check bbox sequence added

* unittest method

* comments added

* add overflowing bbox test

* improved "seq_1"

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* improve code quality

Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2021-11-09 07:49:53 -05:00
Sylvain Gugger
dfb00bf644 Expand dynamic supported objects to configs and tokenizers (#14296)
* Dynamic configs

* Add config test

* Better tests

* Add tokenizer and test

* Add to from_config

* With save
2021-11-08 15:28:25 -05:00
Sylvain Gugger
4a18337bae Honor existing attention mask in tokenzier.pad (#13926)
* Honor existing attention mask in tokenzier.pad

* Fix initialization of attention mask

* Roll the implem on all subclasses

* Fix tests
2021-10-11 09:12:09 -04:00
Mishig Davaadorj
5f34163b88 Add missing character (#13922) 2021-10-07 18:10:19 +02:00
Alex Hedges
57420b103e Add missing whitespace to multiline strings (#13916) 2021-10-07 09:22:11 -04:00
Alex Hedges
46efc58024 Improve error message when loading models from Hub (#13836)
* Improve error message when loading models from Hub

* Adjust error message wording
2021-10-05 08:09:10 -04:00
Yuta Hayashibe
90f980ed35 Fix warning situation: UserWarning: max_length is ignored when padding=True" (#13829)
* Removed wrong warning

* Raise a warning when `max_length` is given with wrong `truncation`

* Update the error message

* Update the warning message

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-10-01 09:29:08 -04:00
Yuta Hayashibe
66b01ce864 Warn for unexpected argument combinations (#13509)
* Warn for unexpected argument combinations

* Updated the waning message for pad_to_max_length
2021-09-24 09:14:23 -04:00
Stas Bekman
f1c22dae7d [tokenizer] use use_auth_token for config (#13523)
* [tokenizer] use use_auth_token for config

* args order
2021-09-13 07:31:35 -04:00
Apoorv Garg
b91e65afe0 Correct order of overflowing_tokens for slow tokenizer (#13179)
* correct order of overflowing_tokens for slow tokenizer (issue fix #13148)

* python 3.9 requires sentencepiece version 0.1.94 or above

* slicing of ids fixed in truncated_sequence()

* Update setup.py

* Correct order of overflowing tokens for pair of sentences

* code reformatted

* Update tokenization_utils_base.py

* reformatting file

* test to check single_input added

* missing function restored

* test to check pair_input overflowing tokens order

* test to check pair_input overflowing tokens order

* test to check pair_input overflowing tokens order

* added an error message for pair of seq and longest_first strategy

* test for pair_input modified

* variable name corrected

* fixed a typo in error message

* requested changes implemented

* required test added

* Corrected the message to match test message

* added error message for Luke Tokenizer

* lost test recovered

* docstring for truncate_sequences and prepare_for_model updated

* docstring for luke tokenizer updated

* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING

* aligned text and fixed puncuatations

* improved style and quality of code

* fixed error_msg in truncate_sequences

* replaced encode_plus method with regular call method

* clean up

* rephrased the docstring
2021-09-02 05:58:23 -04:00
arfy slowy
01977466f4 fix: typo spelling grammar (#13212)
* fix: typo spelling grammar

* fix: make fixup
2021-08-30 08:09:14 -04:00
Bram Vanroy
401377e679 Add error message concerning revision (#13266)
* add error message concerning revision

* Update src/transformers/configuration_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* re-add double line endings

* is not None instead of implicit bool casting

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-08-26 04:32:57 -04:00
Lysandre Debut
5af8df5afb Some model_types cannot be in the mapping (#13259)
* Some tokenizers cannot be in the mapping

* Style
2021-08-25 12:56:16 -04:00
Bram Vanroy
39db2f3c19 Allow local_files_only for fast pretrained tokenizers (#13225)
* allow local_files_only for fast pretrained tokenizers

* make style
2021-08-24 03:05:33 -04:00
SaulLu
7223844df9 Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056)
* add test

* add change in PretrainedTokenizerBase

* change Luke

* deactivate

* add the possibility to add additional special tokens for M2M100

* format

* add special test for canine

* proposed changes for mbart

* proposed changes for mbart50

* proposed changes for byt5

* proposed changes for canine

* proposed changes for t5

* test fast and slow

* remove comment

* remove comment

* add fast version for all tests

* replace break by continue

* add more comments

* add check to avoid duplicates

* remove comment

* format

* proposed change for wave2vec2

* reverse changes mbart

* uncomment

* format
2021-08-23 14:35:18 +02:00
Sylvain Gugger
9870093f7b [WIP] Disentangle auto modules from other modeling files (#13023)
* Initial work

* All auto models

* All tf auto models

* All flax auto models

* Tokenizers

* Add feature extractors

* Fix typos

* Fix other typo

* Use the right config

* Remove old mapping names and update logic in AutoTokenizer

* Update check_table

* Fix copies and check_repo script

* Fix last test

* Add back name

* clean up

* Update template

* Update template

* Forgot a )

* Use alternative to fixup

* Fix TF model template

* Address review comments

* Address review comments

* Style
2021-08-06 13:12:30 +02:00
Sylvain Gugger
5f43623843 Add possibility to ignore imports in test_fecther (#12801)
* Add possibility to ignore imports in test_fecther

* Style
2021-07-26 09:48:19 -04:00
Sylvain Gugger
786ced3639 Add versioning system to fast tokenizer files (#12713)
* Add versioning system to fast tokenizer files

* Deal with offline mode

* Use staging env in tests

* Style

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Style

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-07-21 08:24:36 -04:00
Sylvain Gugger
da72ac6e26 Fix push_to_hub docstring and make it appear in doc (#12770) 2021-07-17 15:52:33 +02:00
Tomohiro Endo
08d609bfb8 Add tokenizers class mismatch detection between cls and checkpoint (#12619)
* Detect mismatch by analyzing config

* Fix comment

* Fix import

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* Revise based on reviews

* remove kwargs

* Fix exception

* Fix handling exception again

* Disable mismatch test in PreTrainedTokenizerFast

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2021-07-17 15:52:21 +02:00
SaulLu
6e87010060 Preserve list type of additional_special_tokens in special_token_map (#12759)
* preserve type of `additional_special_tokens` in `special_token_map`

* format

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-07-16 18:26:54 +02:00
Stas Bekman
7a22a02a70 [tokenizer.prepare_seq2seq_batch] change deprecation to be easily actionable (#12669)
* change deprecation to be easily actionable

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* rework as suggested

* one warning together

* fix format

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-07-13 09:19:04 -07:00
Sylvain Gugger
0f43e742d9 Fix typo 2021-07-12 10:32:51 -04:00
Sylvain Gugger
53c60babe4 Clean push to hub API (#12187)
* Clean push to hub API

* Create working dir if it does not exist

* Different tweak

* New API + all models + test Flax

* Adds the Trainer clean up

* Update src/transformers/file_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* (nit) output types

* No need to set clone_from when folder exists

* Update src/transformers/trainer.py

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Add generated_from_trainer tag

* Update to new version

* Fixes

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2021-06-23 10:11:19 -04:00
Sylvain Gugger
adb70eda4d AutoTokenizer: infer the class from the tokenizer config if possible (#12208)
* AutoTokenizer: infer the class from the tokenizer config if possible

* Add tests

* Update src/transformers/models/auto/tokenization_auto.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-06-17 12:39:22 -04:00
SaulLu
476ba679dd Feature to use the PreTrainedTokenizerFast class as a stand-alone tokenizer (#11810)
* feature for tokenizer without slow/legacy version

* format

* modify common test

* add tests

* add PreTrainedTokenizerFast to AutoTokenizer

* format

* change tokenizer common test in order to be able to run test without a slow version

* update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class`

* add autokenizer test

* replace  `if self.tokenizer_class is not None` with ` if self.tokenizer_class is None`

* remove obsolete change in comment

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/tokenization_utils_fast.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change `get_main_tokenizer` into `get_tokenizers`

* clarify `get_tokenizers` method

* homogenize with `test_slow_tokenizer` and `test_rust_tokenizer`

* add `test_rust_tokenizer = False` to tokenizer which don't define a fast version

* `test_rust_tokenizer = False` for BertJapaneseTokenizer

* `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-14 11:58:44 +02:00
NielsRogge
f3cf8ae7b3 Add LUKE (#11223)
* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <ikuya@ikuya.net>
2021-05-03 09:07:29 -04:00
Stas Bekman
282f3ac3ef [debug utils] activation/weights underflow/overflow detector (#11274)
* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-30 11:15:46 -07:00
Sylvain Gugger
ad1f7bef13 Reformat to make code clearer in tokenizer call (#11497)
* Reformat to make code clearer

* Reformat to make code clearer
2021-04-29 07:51:09 -04:00
Hamel Husain
c0eb218a55 Update PreTrainedTokenizerBase to check/handle batch length for text_pair parameter (#11486)
* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add error message

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-28 10:11:17 -04:00
Sylvain Gugger
b03b2a653d Style 2021-04-26 11:45:04 -04:00
Kostas Stathoulopoulos
6715e3b6a1 Clarify description of the is_split_into_words argument (#11449)
* Improve documentation for is_split_into_words argument

* Change description wording
2021-04-26 11:29:36 -04:00
Patrick von Platen
32dbb2d954 make style (#11442) 2021-04-26 13:50:34 +02:00
Sylvain Gugger
bf2e0cf70b Trainer push to hub (#11328)
* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-04-23 09:17:37 -04:00
Takuya Makino
881945c0b5 Add space (#11373) 2021-04-22 17:48:58 +05:30
Sylvain Gugger
2550b41aa2 Tokenizer fast save (#11234)
* Save fast tokenizers in both formats

* Fix for HerBERT

* Proper fix

* Properly test new behavior
2021-04-15 09:32:32 -04:00
Joel Stremmel
9337c6c668 make embeddings plural in warning message (#11228) 2021-04-14 10:13:25 -04:00
Lysandre Debut
c0d97cee13 Adds a note to resize the token embedding matrix when adding special … (#11120)
* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space
2021-04-07 10:06:45 -04:00
Stas Bekman
3d39226a51 s|Pretrained|PreTrained| (#11048) 2021-04-04 18:08:42 -07:00
Sylvain Gugger
acc3bd9d2a Enforce string-formatting with f-strings (#10980)
* First third

* Styling and fix mistake

* Quality

* All the rest

* Treat %s and %d

* typo

* Missing )

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-03-31 10:00:27 -04:00