Commit Graph

114 Commits

Author SHA1 Message Date
Ben Eyal
9f9ddcc2de 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string (#15775)
* Add test for SentencePiece not adding special tokens to strings

* Add SentencePieceStringConversionMixin to fix issue 15003

* Fix conversion from tokens to string for most SentencePiece tokenizers

Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer

* Fix MarianTokenizer, adjust SentencePiece test to accomodate vocab

* Fix DebertaV2Tokenizer

* Ignore LayoutXLMTokenizer in SentencePiece string conversion test

* Run 'make style' and 'make quality'

* Clean convert_tokens_to_string test

Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.

* Remove commented out code

* Improve robustness of convert_tokens_to_string test

Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.

* Inline and remove SentencePieceStringConversionMixin

The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.

* Run 'make style' and 'make quality'

* Revert removal of space in convert_tokens_to_string

* Remove redundant import

* Revert test text to original

* Uncomment the lowercasing of the reverse_text variable

* Mimic Rust tokenizer behavior for tokenizers

- Albert
- Barthez
- Camembert
- MBart50
- T5

* Fix accidentally skipping test in wrong tokenizer

* Add test for equivalent Rust and slow tokenizer behavior

* Override _decode in BigBirdTokenizer to mimic Rust behavior

* Override _decode in FNetTokenizer to mimic Rust behavior

* Override _decode in XLNetTokenizer to mimic Rust behavior

* Remove unused 're' import

* Update DebertaV2Tokenizer to mimic Rust tokenizer

* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.

* Ignore problematic tests in Deberta V2

* Add comment on why the Deberta V2 tests are skipped
2022-11-02 15:45:38 -04:00
Yih-Dar
f9257843b5 Fix incorrect model<->tokenizer mapping in tokenization testing (#19872)
* Fix model-tokenizer mapping

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-10-25 16:02:13 +02:00
Sylvain Gugger
3e4900208a Tokenizer from_pretrained should not use local files named like tokenizer files (#19626) 2022-10-14 14:06:56 -04:00
Sylvain Gugger
34be08efcd More tests for regression in cached non existence (#19216)
* More tests for regression in cached non existence

* Style
2022-09-27 09:36:34 -04:00
Sylvain Gugger
ca485e562b Add tests for legacy load by url and fix bugs (#19078) 2022-09-16 23:20:02 +02:00
Sylvain Gugger
9017ba4ca4 Fix tokenizer load from one file (#19073)
* Fix tokenizer load from one file

* Add a test

* Style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2022-09-16 16:11:47 -04:00
Sylvain Gugger
f7ce4f1ff7 Fix custom tokenizers test (#19052)
* Fix CI for custom tokenizers

* Add nightly tests

* Run CI, run!

* Fix paths

* Typos

* Fix test
2022-09-15 11:31:09 -04:00
Lucain
169b8cde47 Fix mock in test_cached_files_are_used_when_internet_is_down (#18804) 2022-08-29 15:56:08 +02:00
SaulLu
6667b0d7bf add warning to let the user know that the __call__ method is faster than encode + pad for a fast tokenizer (#18693)
* add warning to let the user know that the  method is slower that  for a fast tokenizer

* user warnings

* fix layoutlmv2

* fix layout*

* change warnings into logger.warning
2022-08-24 06:27:56 -04:00
Sylvain Gugger
5cd4032368 Use new huggingface_hub tools for download models (#18438)
* Draft new cached_file

* Initial draft for config and model

* Small fixes

* Fix first batch of tests

* Look in cache when internet is down

* Fix last tests

* Bad black, not fixing all quality errors

* Make diff less

* Implement change for TF and Flax models

* Add tokenizer and feature extractor

* For compatibility with main

* Add utils to move the cache and auto-do it at first use.

* Quality

* Deal with empty commit shas

* Deal with empty etag

* Address review comments
2022-08-05 10:12:40 -04:00
Sylvain Gugger
01db72abd4 Rewrite push_to_hub to use upload_files (#18366)
* Rewrite push_to_hub to use upload_files

* Adapt the doc a bit

* Address review comments and clean doc
2022-08-01 12:07:30 -04:00
Yulv-git
95113d1365 Fix some typos. (#17560)
* Fix some typos.

Signed-off-by: Yulv-git <yulvchi@qq.com>

* Fix typo.

Signed-off-by: Yulv-git <yulvchi@qq.com>

* make fixup.
2022-07-11 05:00:13 -04:00
Guillaume Klein
3eed5530ec Fix properties of unset special tokens in non verbose mode (#17797)
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2022-06-23 14:40:13 +02:00
Lysandre Debut
6a5272b205 Prepare transformers for v0.8.0 huggingface-hub release (#17716)
* Prepare CI for v0.8.0

* pin hfh (revert before merge)

* Revert "pin hfh (revert before merge)"

This reverts commit a0103140e1c77b810ffcb735192968bc03be3e1f.

* Test rc3

* Test latest rc

* Unpin to the RC

Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
2022-06-21 11:51:18 -04:00
Patrick von Platen
f394a2a50d [Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract) (#17457)
* [Json dump] Make json prettier

* correct more tokenizeirs

* more patterns

* add aggressive test

* the aggressive test was actually useful :-)

* more tests

* Apply suggestions from code review
2022-05-31 17:07:30 +02:00
Sylvain Gugger
afe5d42d8d Black preview (#17217)
* Black preview

* Fixup too!

* Fix check copies

* Use the same version as the CI

* Bump black
2022-05-12 16:25:55 -04:00
davidleonfdez
9f8bfe703c Fix #16660 (tokenizers setters of ids of special tokens) (#16661)
* Fix setters of *_token_id properties of SpecialTokensMixin

* Test setters of common tokens ids

* Move to a separate test checks of setters of tokens ids

* Add independent test for ByT5

* Add Canine test

* Test speech to text
2022-04-13 07:49:06 -04:00
SaulLu
be9474bd35 add a test checking the format of convert_tokens_to_string's output (#16540)
* add new tests

* add comment to overridden tests
2022-04-04 16:57:24 +02:00
Sylvain Gugger
c595b6e6a9 Make Transformers use cache files when hf.co is down (#16362)
* Make Transformers use cache files when hf.co is down

* Fix tests

* Was there a random circleCI failure?

* Isolate patches

* Style

* Comment out the failure since it doesn't fail anymore

* Better comment
2022-03-23 15:56:49 -04:00
Sylvain Gugger
45f56580a7 Allow custom code for Processors (#15649)
* Allow custom code for Processors

* Add more test

* Test all auto_map configs are properly set
2022-02-15 09:44:35 -05:00
SaulLu
39b5d1a63a fix set truncation attribute in __init__ of PreTrainedTokenizerBase (#15456)
* change truncation_side in init of `PreTrainedTokenizerBase`

Co-authored-by: LSinev <LSinev@users.noreply.github.com>

* add test

* Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"

This reverts commit 7a98b87962d2635c7e4d4f00db3948b694624843.

* fix kwargs

* Revert "fix kwargs"

This reverts commit 67b0a5270e8cf1dbf70e6b0232e94c0452b6946f.

* Update tests/test_tokenization_common.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* delete truncation_side variable

* reorganize test

* format

* complete doc

* Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`""

This reverts commit d5a10a7e2680539e5d9e98ae5d896c893d224b80.

* fix typo

* fix typos to render documentation

* Revert "Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"""

This reverts commit 16cf58811943a08f43409a7c83eaa330686591d0.

* format

Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-02 23:18:09 +01:00
Sylvain Gugger
44b21f117b Save code of registered custom models (#15379)
* Allow dynamic modules to use relative imports

* Work for configs

* Fix last merge conflict

* Save code of registered custom objects

* Map strings to strings

* Fix test

* Add tokenizer

* Rework tests

* Tests

* Ignore fixtures py files for tests

* Tokenizer test + fix collection

* With full path

* Rework integration

* Fix typo

* Remove changes in conftest

* Test for tokenizers

* Add documentation

* Update docs/source/custom_models.mdx

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add file structure and file content

* Add more doc

* Style

* Update docs/source/custom_models.mdx

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Address review comments

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
2022-02-02 10:44:37 -05:00
SaulLu
7b8bdd8601 fix the tokenizer_config.json file for the slow tokenizer when a fast version is available (#15319)
* add new test

* update test

* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`

* add `tokenizer_file` for the fast only tokenizer

* change global variables layoutxml

* remove `"tokenizer_file"` from DPR tokenizer's Global variables

* remove `tokenizer_file` from herbert slow tokenizer init

* `"tokenizer_file"` from LED tokenizer's Global variables

* remove `tokenizer_file` from mbart slow tokenizer init

* remove `tokenizer_file` from slow tokenizer template

* adapt to versioning

* adapt the `test_tokenizer_mismatch_warning` test

* clean test

* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py

* Revert "remove `tokenizer_file` from mbart slow tokenizer init"

This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.

* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"

This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.

* Revert "remove `tokenizer_file` from herbert slow tokenizer init"

This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.

* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"

This reverts commit da0895330bedfafc81ae3073470a9348c669f032.

* set `tokenizer_file` in super `__init__` of mbart
2022-02-01 16:48:25 +01:00
SaulLu
6d585fe0f0 replace assert with exception for padding_side arg in PreTrainedTokenizerBase __init__ (#15454)
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`

* add test

* fix kwargs

* reformat test

* format

* format

* fix typo to render the documentation
2022-02-01 16:13:58 +01:00
SaulLu
ade7371a41 improve saving strategy of sentencepiece tokenizer (#15328)
* add new test

* add a feature to same the sentencepiece tokenizer model when the init file was deleted

* update marian

* update m2m_100

* fix marian

* update speech to text

* override test for layoutxlm

* fix saving bartpho

* remove harcoded values bartpho

* special token string version

* finish bartpho

* override layoutxml test

* add mbart

* move special tokens list

* format

* Revert "format"

This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.

* simplify list of string of special tokens

* Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2022-01-27 16:24:51 +01:00
Nicolas Patry
d2183a46fb Remove old asserts. (#15012) 2022-01-06 09:45:41 -05:00
Nicolas Patry
d33dc7966a Improve truncation_side (#14947)
* Enabling `truncation_side` for Slow and Fast tokenizer.

Co-Authored-by: Niels Rogge <48327001+NielsRogge@users.noreply.github.com>

* Disable failing tests.

* Layout xlm.

* assert -> assertEqual.

Co-authored-by: Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
2022-01-03 16:18:39 +01:00
Nicolas Patry
d7d60df0ec Fixing a pathological case for slow tokenizers (#14981)
* Fixing a pathological case for slow tokenizers

* Update src/transformers/tokenization_utils.py
2021-12-30 09:10:34 +01:00
Li-Huai (Allan) Lin
66ea739168 Improve tokenizer tests (#13594)
* Use new method to acquire tokenizers

* Resolve TODOs.

* Style

* Fix

* Enable do_lower_case in test_tokenize_special_tokens

* Apply suggestion from code review

* Fix mask token handling

* Revert "Fix mask token handling"

This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648.

* Fix FNet mask token tokenization

* Complete everything

* Apply suggestions from code review
2021-12-03 08:39:10 +01:00
Li-Huai (Allan) Lin
9e37c5cdf8 Fix list index out of range when padding nested empty lists (#13876)
* Fix index out of range when padding

* Apply suggestions from code review

* Style
2021-11-10 21:34:52 +01:00
Sylvain Gugger
dfb00bf644 Expand dynamic supported objects to configs and tokenizers (#14296)
* Dynamic configs

* Add config test

* Better tests

* Add tokenizer and test

* Add to from_config

* With save
2021-11-08 15:28:25 -05:00
Sylvain Gugger
558f8543ba Update Transformers to huggingface_hub >= 0.1.0 (#14251)
* Update Transformers to huggingface_hub >= 0.1.0

* Forgot to save...

* Style

* Fix test
2021-11-02 18:58:42 -04:00
Sylvain Gugger
4a18337bae Honor existing attention mask in tokenzier.pad (#13926)
* Honor existing attention mask in tokenzier.pad

* Fix initialization of attention mask

* Roll the implem on all subclasses

* Fix tests
2021-10-11 09:12:09 -04:00
Nicolas Patry
d70919e6d5 Adding support for tokens being suffixes or part of each other. (#13918)
* Adding support for tokens being suffixes or part of each other.

* Better test name.
2021-10-08 10:10:38 +02:00
Nicolas Patry
7079a99e76 Fixing 1-length special tokens cut. (#13862) 2021-10-05 12:26:54 +02:00
Li-Huai (Allan) Lin
da8beaaf76 Fix special tokens not correctly tokenized (#13489)
* Fix special tokens not correctly tokenized

* Add testing

* Fix

* Fix

* Use user workflows instead of directly assigning variables

* Enable test of fast tokenizers

* Update test of canine tokenizer
2021-09-17 10:28:28 -04:00
Nicolas Patry
3dd538c4d3 [Tentative] Moving slow tokenizer to the Trie world. (#13220)
* Moving slow tokenizer to the Trie world.

* Adding more docstrings to the Trie.

* Fixing doctest (incompatible wiht our format? )

* Update src/transformers/tokenization_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding a lot more comment into the internals of this algorithm.

* Cleaner doc.

* Fixing the namings.

* Update src/transformers/tokenization_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* quality.

* Fixing longest first match.

* Small improvements to cuts + more test + canine resistant test.

* Fixing fast test.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-09-09 17:26:16 +02:00
Apoorv Garg
b91e65afe0 Correct order of overflowing_tokens for slow tokenizer (#13179)
* correct order of overflowing_tokens for slow tokenizer (issue fix #13148)

* python 3.9 requires sentencepiece version 0.1.94 or above

* slicing of ids fixed in truncated_sequence()

* Update setup.py

* Correct order of overflowing tokens for pair of sentences

* code reformatted

* Update tokenization_utils_base.py

* reformatting file

* test to check single_input added

* missing function restored

* test to check pair_input overflowing tokens order

* test to check pair_input overflowing tokens order

* test to check pair_input overflowing tokens order

* added an error message for pair of seq and longest_first strategy

* test for pair_input modified

* variable name corrected

* fixed a typo in error message

* requested changes implemented

* required test added

* Corrected the message to match test message

* added error message for Luke Tokenizer

* lost test recovered

* docstring for truncate_sequences and prepare_for_model updated

* docstring for luke tokenizer updated

* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING

* aligned text and fixed puncuatations

* improved style and quality of code

* fixed error_msg in truncate_sequences

* replaced encode_plus method with regular call method

* clean up

* rephrased the docstring
2021-09-02 05:58:23 -04:00
SaulLu
c4d78f01de Fix tokenizer saving during training with Trainer (#12806)
* add test in trainer and test tokenizer saving wi
th trainer

* quality

* reverse trainer changes

* replace test in test_trainer by a test for all the tokenizers

* format

* add can_save_slow_tokenizer attribute to all tokenizers

* fix Herbert

* format

* Change comment in error

* add comments and a new assert

* Update src/transformers/models/albert/tokenization_albert_fast.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change ValueError barthez

* change ValueError BigBird

* change ValueError Camembert

* change ValueError Mbart50

* change ValueError Pegasus

* change ValueError ReFormer

* change ValueError T5

* change ValueError RoBERTa

* XLNET fast

* Update tests/test_tokenization_common.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change `assert` into `self.assertIn`

* format

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-09-01 16:32:56 +02:00
SaulLu
7223844df9 Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056)
* add test

* add change in PretrainedTokenizerBase

* change Luke

* deactivate

* add the possibility to add additional special tokens for M2M100

* format

* add special test for canine

* proposed changes for mbart

* proposed changes for mbart50

* proposed changes for byt5

* proposed changes for canine

* proposed changes for t5

* test fast and slow

* remove comment

* remove comment

* add fast version for all tests

* replace break by continue

* add more comments

* add check to avoid duplicates

* remove comment

* format

* proposed change for wave2vec2

* reverse changes mbart

* uncomment

* format
2021-08-23 14:35:18 +02:00
Tomohiro Endo
08d609bfb8 Add tokenizers class mismatch detection between cls and checkpoint (#12619)
* Detect mismatch by analyzing config

* Fix comment

* Fix import

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* Revise based on reviews

* remove kwargs

* Fix exception

* Fix handling exception again

* Disable mismatch test in PreTrainedTokenizerFast

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2021-07-17 15:52:21 +02:00
SaulLu
6e87010060 Preserve list type of additional_special_tokens in special_token_map (#12759)
* preserve type of `additional_special_tokens` in `special_token_map`

* format

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-07-16 18:26:54 +02:00
SaulLu
3aa37b945e Add test for a WordLevel tokenizer model (#12437)
* add a test for a WordLevel tokenizer

* adapt common test to new tokenizer
2021-07-01 12:37:07 +02:00
Sylvain Gugger
dc42e770b8 Easily train a new fast tokenizer from a given one (#12361)
* [WIP] Easily train a new fast tokenizer from a given one

* Fix test

* Roll out to other tokenizers and add tests

* Fix bug with unk id and add emoji to test

* Really use something different in test

* Implement special tokens map

* Map special tokens in the Transformers tokenizers

* Fix test

* Make test more robust

* Fix test for BPE

* More robust map and test

Co-authored-by SaulLu

* Test file

* Stronger tests

Co-authored-by: SaulLu <lucilesaul.com@gmail.com>

* Map unk token for Wordpiece and address review comment

* Fix lowercase test and address review comment

* Fix all tests

* Simplify test

* Fix tests for realsies

* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)

* Propose change in tests regarding lower case

* add new test for special tokens types

* put back the test part about decoding

* add feature: the AddedToken is re-build with the different mapped content

* Address review comment: simplify AddedToken building

Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* Update src/transformers/tokenization_utils_fast.py

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2021-06-29 15:00:08 -04:00
Sylvain Gugger
53c60babe4 Clean push to hub API (#12187)
* Clean push to hub API

* Create working dir if it does not exist

* Different tweak

* New API + all models + test Flax

* Adds the Trainer clean up

* Update src/transformers/file_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* (nit) output types

* No need to set clone_from when folder exists

* Update src/transformers/trainer.py

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Add generated_from_trainer tag

* Update to new version

* Fixes

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2021-06-23 10:11:19 -04:00
SaulLu
476ba679dd Feature to use the PreTrainedTokenizerFast class as a stand-alone tokenizer (#11810)
* feature for tokenizer without slow/legacy version

* format

* modify common test

* add tests

* add PreTrainedTokenizerFast to AutoTokenizer

* format

* change tokenizer common test in order to be able to run test without a slow version

* update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class`

* add autokenizer test

* replace  `if self.tokenizer_class is not None` with ` if self.tokenizer_class is None`

* remove obsolete change in comment

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/tokenization_utils_fast.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change `get_main_tokenizer` into `get_tokenizers`

* clarify `get_tokenizers` method

* homogenize with `test_slow_tokenizer` and `test_rust_tokenizer`

* add `test_rust_tokenizer = False` to tokenizer which don't define a fast version

* `test_rust_tokenizer = False` for BertJapaneseTokenizer

* `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-14 11:58:44 +02:00
Philip May
3857f2b4e3 fix deberta 2 tokenizer integration test (#12017) 2021-06-07 04:55:55 -04:00
Philip May
fcad801825 Add regression tests for slow sentencepiece tokenizers. (#11737)
* add test_vocab_size for sentencepiece tok.

* add test_get_vocab for sentencepiece tok.

* add test_convert_token_and_id for sentencepiece tok.

* add test_tokenize_and_convert_tokens_to_string for all tok.

* improve test_tokenize_and_convert_tokens_to_string for sp. tok.

* add common tokenizer integration tests
- for albert
- for barthez

* add tokenizer integration tests to bert gen.

* add most tokenizer integration tests

* fix camembert tokenizer integration test

* add tokenizer integration test to marian

* add tokenizer integration test to reformer

* add typing and doc to tokenizer_integration_test_util

* fix tokenizer integration test of reformer

* improve test_sentencepiece_tokenize_and_convert_tokens_to_string

* empty commit to trigger CI

* fix tokenizer integration test of reformer

* remove code not needed anymore

* empty commit to trigger CI

* empty commit to trigger CI
2021-06-01 09:24:39 -04:00
Philip May
37ed3ab719 Enable option for subword regularization in more tokenizers. (#11417)
* improve slow class tok usage at xlm rob

* add subword regularization for barthez

* improve barthez tok. test

* fix tokenizer tests

* add subword regularization for camembert

* add subword regularization for deberta v2 tokenizer

* add more doc to deberta v2 tokenizer

* add subword regularization for speech to text tok.

* fix sp_model_kwargs type in speech 2 text tok.

* add subword regularization for M2M100 tok.

* add more concrete type hints

* fix tests for m2m100 and s2t tok.

* add missing Any import

* fix syntax error in m2m100 tok.

* fix unpickle of m2m100 and s2t tok.

* fix test of m2m100 and s2t tok.

* improve unpickle of deberta v2 tok.

* add test for pickle of barthez & camembert

* fix pickle of barthez & camembert

* add test for deberta v2 tok. pickle

* fix m2m100 tok. pickle

* fix s2t tok. pickle

* add subword regularization to albert tok.

* refactor subword reg. test into TokenizerTesterMixin

improve albert tok. test

remove sample argument form albert tok.

check subword reg. using TokenizerTesterMixin

improve tok. tests

improve xlm roberta tok. tests

improve xlm roberta tok. tests

* add subword regularization for big bird t.

* improve xlm roberta tok. test

* add subword regularization for mbart50 tok.

* add subword regularization for pegasus tok.

* add subword regularization for reformer tok.

* add subword regularization for T5 tok.

* fix t5 tok. test formatting

* add subword regularization for xlm_proph. tok.

* add subword regularization for xlnet tok.

* add subword regularization for gert_gen tok.

* add typing to tokenizers

* add typing to xlm rob. tok

* add subword regularization for marian tok.

* add reverse tok. test

* fix marian tok test

* fix marian tok test

* fix casing in tok. tests

* fix style of tok. common test

* fix deberta v2 tok test

* add type annotations to tok. tests

* add type annotations to tok. __init__

* add typing to kokenizer

* add type annotations to tok. __init__

* don't specify the default when it's None

* fix barthez tok. doc

* move sentencepiece tok. tests to TokenizerTesterMixin

* fix unused imports

* fix albert tok. test

* add comment to sentencepiece test options

* fix Any import at big bird tok.

* fix Any import at xlm prophetnet tok.

* empty commit to trigger CI
2021-05-13 02:44:55 -04:00
Lysandre Debut
09b0bcfea9 Enable added tokens (#11325)
* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition
2021-05-04 08:13:57 -04:00