[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valide for beginning of words (#24622)
* patch `_tokenize` function * more tests * properly fix * fixup * Update src/transformers/models/t5/tokenization_t5.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix without ifs * update * protect import * add python processing * is first needed * add doc and update with lefacy * updaate * fix T5 SPM converter * styling * fix T5 warning * add is_seqio_available * remove is_first * revert some changes * more tests and update * update llama test batterie * fixup * refactor T5 spm common tests * draft the llama tests * update * uopdate test * nits * refine * name nit * fix t5 tests * fix T5 * update * revert convert slow to fast changes that fail lots of tests * legacy support * fixup * nits is first not defined * don't use legacy behaviour for switch transformers * style * My attempt to check. * nits * fixes * update * fixup * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * updates * fixup * add legacy warning * fixup * warning_once nit * update t5 documentation test * update llama tok documentation * add space to warning * nits * nit * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * last nits --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This commit is contained in:
@@ -347,13 +347,16 @@ class UMT5ModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin
|
||||
@require_tokenizers
|
||||
class Umt5IntegrationTest(unittest.TestCase):
|
||||
@slow
|
||||
@unittest.skip(
|
||||
"Unless we stop stripping left and right by default for all special tokens, the expected ids obtained here will not match the original ones. Wait for https://github.com/huggingface/transformers/pull/23909 to be merged"
|
||||
)
|
||||
def test_small_integration_test(self):
|
||||
"""
|
||||
For comparison run the kaggle notbook available here : https://www.kaggle.com/arthurzucker/umt5-inference
|
||||
"""
|
||||
|
||||
model = UMT5ForConditionalGeneration.from_pretrained("google/umt5-small", return_dict=True).to(torch_device)
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/umt5-small", use_fast=False)
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/umt5-small", use_fast=False, legacy=False)
|
||||
input_text = [
|
||||
"Bonjour monsieur <extra_id_0> bien <extra_id_1>.",
|
||||
"No se como puedo <extra_id_0>.",
|
||||
@@ -373,7 +376,7 @@ class Umt5IntegrationTest(unittest.TestCase):
|
||||
]
|
||||
)
|
||||
# fmt: on
|
||||
self.assertEqual(input_ids, EXPECTED_IDS)
|
||||
torch.testing.assert_allclose(input_ids, EXPECTED_IDS)
|
||||
|
||||
generated_ids = model.generate(input_ids.to(torch_device))
|
||||
EXPECTED_FILLING = [
|
||||
@@ -384,4 +387,4 @@ class Umt5IntegrationTest(unittest.TestCase):
|
||||
"<pad><extra_id_0>nyone who<extra_id_1> drink<extra_id_2> a<extra_id_3> alcohol<extra_id_4> A<extra_id_5> A. This<extra_id_6> I<extra_id_7><extra_id_52><extra_id_53></s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",
|
||||
]
|
||||
filling = tokenizer.batch_decode(generated_ids)
|
||||
self.assertTrue(filling, EXPECTED_FILLING)
|
||||
self.assertEqual(filling, EXPECTED_FILLING)
|
||||
|
||||
Reference in New Issue
Block a user