[Tokenizer] Fix slow and fast serialization (#26570)

* fix * last attempt * current work * fix forward compatibility * save all special tokens * current state * revert additional changes * updates * remove tokenizer.model * add a test and the fix * nit * revert one more break * fix typefield issue * quality * more tests * fix fields for FC * more nits? * new additional changes * how * some updates * simplify all * more nits * revert some things to original * nice * nits * a small hack * more nits * ahhaha * fixup * update * make test run on ci * use subtesting * update * Update .circleci/create_circleci_config.py * updates * fixup * nits * replace typo * fix the test * nits * update * None max dif pls * a partial fix * had to revert one thing * test the fast * updates * fixup * and more nits * more fixes * update * Oupsy 👁️ * nits * fix marian * on our way to heaven * Update src/transformers/models/t5/tokenization_t5.py Co-authored-by: Lysandre Debut <hi@lysand.re> * fixup * Update src/transformers/tokenization_utils_fast.py Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com> * fix phobert * skip some things, test more * nits * fixup * fix deberta * update * update * more updates * skip one test * more updates * fix camembert * can't test this one * more good fixes * kind of a major update - seperate what is only done in fast in fast init and refactor - add_token(AddedToken(..., speicla = True)) ignores it in fast - better loading * fixup * more fixups * fix pegasus and mpnet * remove skipped tests * fix phoneme tokenizer if self.verbose * fix individual models * update common tests * update testing files * all over again * nits * skip test for markup lm * fixups * fix order of addition in fast by sorting the added tokens decoder * proper defaults for deberta * correct default for fnet * nits on add tokens, string initialized to special if special * skip irrelevant herbert tests * main fixes * update test added_tokens_serialization * the fix for bart like models and class instanciating * update bart * nit! * update idefix test * fix whisper! * some fixup * fixups * revert some of the wrong chanegs * fixup * fixup * skip marian * skip the correct tests * skip for tf and flax as well --------- Co-authored-by: Lysandre Debut <hi@lysand.re> Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>
2023-10-18 16:30:53 +02:00
parent 34678db4a1
commit ef7e93699a
49 changed files with 511 additions and 245 deletions
--- a/tests/models/pegasus/test_tokenization_pegasus.py
+++ b/tests/models/pegasus/test_tokenization_pegasus.py
@@ -62,8 +62,8 @@ class PegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

        self.assertEqual(vocab_keys[0], "<pad>")
        self.assertEqual(vocab_keys[1], "</s>")
-        self.assertEqual(vocab_keys[-1], "<unk_102>")
-        self.assertEqual(len(vocab_keys), 1_104)
+        self.assertEqual(vocab_keys[104], "<unk_102>")
+        self.assertEqual(len(vocab_keys), 1_103)

    def test_vocab_size(self):
        self.assertEqual(self.get_tokenizer().vocab_size, 1_103)
@@ -129,13 +129,9 @@ class PegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
            revision="ba85d0851d708441f91440d509690f1ab6353415",
        )

-    @unittest.skip("Need to fix this after #26538")
-    def test_training_new_tokenizer(self):
-        pass
-
-    @unittest.skip("Need to fix this after #26538")
-    def test_training_new_tokenizer_with_special_tokens_change(self):
-        pass
+    # @unittest.skip("We have to use from_slow")
+    # def test_added_tokens_serialization(self):
+    #     pass


@require_sentencepiece
@@ -219,3 +215,7 @@ class BigBirdPegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
            token_ids,
            [182, 117, 142, 587, 4211, 120, 117, 263, 112, 804, 109, 856, 25016, 3137, 464, 109, 26955, 3137, 1],
        )
+
+    # @unittest.skip("We have to use from_slow")
+    # def test_added_tokens_serialization(self):
+    #     pass