[Tokenizer] Fix slow and fast serialization (#26570)

* fix

* last attempt

* current work

* fix forward compatibility

* save all special tokens

* current state

* revert additional changes

* updates

* remove tokenizer.model

* add a test and the fix

* nit

* revert one more break

* fix typefield issue

* quality

* more tests

* fix fields for FC

* more nits?

* new additional changes

* how

* some updates

* simplify all

* more nits

* revert some things to original

* nice

* nits

* a small hack

* more nits

* ahhaha

* fixup

* update

* make test run on ci

* use subtesting

* update

* Update .circleci/create_circleci_config.py

* updates

* fixup

* nits

* replace typo

* fix the test

* nits

* update

* None max dif pls

* a partial fix

* had to revert one thing

* test the fast

* updates

* fixup

* and more nits

* more fixes

* update

* Oupsy 👁️

* nits

* fix marian

* on our way to heaven

* Update src/transformers/models/t5/tokenization_t5.py

Co-authored-by: Lysandre Debut <hi@lysand.re>

* fixup

* Update src/transformers/tokenization_utils_fast.py

Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>

* fix phobert

* skip some things, test more

* nits

* fixup

* fix deberta

* update

* update

* more updates

* skip one test

* more updates

* fix camembert

* can't test this one

* more good fixes

* kind of a major update

- separate what is only done in fast in fast init and refactor
- add_token(AddedToken(..., special=True)) ignores it in fast
- better loading

* fixup

* more fixups

* fix pegasus and mpnet

* remove skipped tests

* fix phoneme tokenizer if self.verbose

* fix individual models

* update common tests

* update testing files

* all over again

* nits

* skip test for markup lm

* fixups

* fix order of addition in fast by sorting the added tokens decoder

* proper defaults for deberta

* correct default for fnet

* nits on add tokens, string initialized to special if special

* skip irrelevant herbert tests

* main fixes

* update test added_tokens_serialization

* the fix for bart like models and class instantiating

* update bart

* nit!

* update idefics test

* fix whisper!

* some fixup

* fixups

* revert some of the wrong changes

* fixup

* fixup

* skip marian

* skip the correct tests

* skip for tf and flax as well

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>
Arthur
2023-10-18 16:30:53 +02:00
committed by GitHub
parent 34678db4a1
commit ef7e93699a
49 changed files with 511 additions and 245 deletions

@@ -62,8 +62,8 @@ class PegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         self.assertEqual(vocab_keys[0], "<pad>")
         self.assertEqual(vocab_keys[1], "</s>")
-        self.assertEqual(vocab_keys[-1], "<unk_102>")
-        self.assertEqual(len(vocab_keys), 1_104)
+        self.assertEqual(vocab_keys[104], "<unk_102>")
+        self.assertEqual(len(vocab_keys), 1_103)

     def test_vocab_size(self):
         self.assertEqual(self.get_tokenizer().vocab_size, 1_103)
@@ -129,13 +129,9 @@ class PegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
             revision="ba85d0851d708441f91440d509690f1ab6353415",
         )

-    @unittest.skip("Need to fix this after #26538")
-    def test_training_new_tokenizer(self):
-        pass
-
-    @unittest.skip("Need to fix this after #26538")
-    def test_training_new_tokenizer_with_special_tokens_change(self):
-        pass
+    # @unittest.skip("We have to use from_slow")
+    # def test_added_tokens_serialization(self):
+    #     pass


 @require_sentencepiece
@@ -219,3 +215,7 @@ class BigBirdPegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
             token_ids,
             [182, 117, 142, 587, 4211, 120, 117, 263, 112, 804, 109, 856, 25016, 3137, 464, 109, 26955, 3137, 1],
         )
+
+    # @unittest.skip("We have to use from_slow")
+    # def test_added_tokens_serialization(self):
+    #     pass