[`Tokenizer`] Fix slow and fast serialization (#26570)

* fix

* last attempt

* current work

* fix forward compatibility

* save all special tokens

* current state

* revert additional changes

* updates

* remove tokenizer.model

* add a test and the fix

* nit

* revert one more break

* fix typefield issue

* quality

* more tests

* fix fields for FC

* more nits?

* new additional changes

* how

* some updates

* simplify all

* more nits

* revert some things to original

* nice

* nits

* a small hack

* more nits

* ahhaha

* fixup

* update

* make test run on ci

* use subtesting

* update

* Update .circleci/create_circleci_config.py

* updates

* fixup

* nits

* replace typo

* fix the test

* nits

* update

* None max dif pls

* a partial fix

* had to revert one thing

* test the fast

* updates

* fixup

* and more nits

* more fixes

* update

* Oupsy 👁️

* nits

* fix marian

* on our way to heaven

* Update src/transformers/models/t5/tokenization_t5.py

Co-authored-by: Lysandre Debut <hi@lysand.re>

* fixup

* Update src/transformers/tokenization_utils_fast.py

Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>

* fix phobert

* skip some things, test more

* nits

* fixup

* fix deberta

* update

* update

* more updates

* skip one test

* more updates

* fix camembert

* can't test this one

* more good fixes

* kind of a major update

- separate what is only done in fast in fast init and refactor
- add_token(AddedToken(..., special=True)) ignores it in fast
- better loading

* fixup

* more fixups

* fix pegasus and mpnet

* remove skipped tests

* fix phoneme tokenizer if self.verbose

* fix individual models

* update common tests

* update testing files

* all over again

* nits

* skip test for markup lm

* fixups

* fix order of addition in fast by sorting the added tokens decoder

* proper defaults for deberta

* correct default for fnet

* nits on add tokens, string initialized to special if special

* skip irrelevant herbert tests

* main fixes

* update test added_tokens_serialization

* the fix for bart like models and class instantiating

* update bart

* nit!

* update idefix test

* fix whisper!

* some fixup

* fixups

* revert some of the wrong changes

* fixup

* fixup

* skip marian

* skip the correct tests

* skip for tf and flax as well

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>

Author: Arthur
Date: 2023-10-18 16:30:53 +02:00
Committed by: GitHub
Parent: 34678db4a1
Commit: ef7e93699a

49 changed files with 511 additions and 245 deletions

									
tests/models/llama/test_tokenization_llama.py (2 lines changed)

@@ -517,7 +517,7 @@ class LlamaIntegrationTest(unittest.TestCase):
     def test_special_token_special_word(self):
         # the word inform should be split as ['in', 'form']
         tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", legacy=False)
-        tokenizer.add_tokens(["<REPR_END>"], special_tokens=False)
+        tokenizer.add_tokens([AddedToken("<REPR_END>", rstrip=True, lstrip=True)], special_tokens=False)
         out1 = tokenizer.decode(
             tokenizer.encode("<REPR_END>inform", add_special_tokens=False), spaces_between_special_tokens=False
         )
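
The test change above registers `<REPR_END>` as an `AddedToken` with `lstrip=True` and `rstrip=True`, which asks the tokenizer to absorb whitespace on either side of the token. The snippet below is a rough, standalone illustration of that idea using only the standard library. It is not the transformers or tokenizers implementation, and the helper name `split_on_added_token` is hypothetical.

```python
import re


def split_on_added_token(text, token, lstrip=False, rstrip=False):
    """Split `text` around every occurrence of `token`.

    Hypothetical helper for illustration only: when `lstrip` / `rstrip`
    are set, whitespace directly before / after the token is consumed
    together with the token instead of staying in the neighbouring piece.
    """
    pattern = re.escape(token)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"

    pieces = []
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(token)  # emit the bare token, without absorbed spaces
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


plain = split_on_added_token("hello <REPR_END> inform", "<REPR_END>")
stripped = split_on_added_token(
    "hello <REPR_END> inform", "<REPR_END>", lstrip=True, rstrip=True
)
print(plain)
print(stripped)
```

With both flags off, the spaces stay attached to the surrounding text; with both on, they are swallowed by the token match, so the following word starts without a leading space.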
