[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valide for beginning of words (#24622)

* patch `_tokenize` function

* more tests

* properly fix

* fixup

* Update src/transformers/models/t5/tokenization_t5.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix without ifs

* update

* protect import

* add python processing

* is first needed

* add doc and update with lefacy

* updaate

* fix T5 SPM converter

* styling

* fix T5 warning

* add is_seqio_available

* remove is_first

* revert some changes

* more tests and update

* update llama test batterie

* fixup

* refactor T5 spm common tests

* draft the llama tests

* update

* uopdate test

* nits

* refine

* name nit

* fix t5 tests

* fix T5

* update

* revert convert slow to fast changes that fail lots of tests

* legacy support

* fixup

* nits is first not defined

* don't use legacy behaviour for switch transformers

* style

* My attempt to check.

* nits

* fixes

* update

* fixup

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates

* fixup

* add legacy warning

* fixup

* warning_once nit

* update t5 documentation test

* update llama tok documentation

* add space to warning

* nits

* nit

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* last nits

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This commit is contained in:
Arthur
2023-07-11 22:02:18 +09:00
committed by GitHub
parent b3ab3fac1d
commit b15343de6f
11 changed files with 387 additions and 52 deletions

View File

@@ -77,6 +77,7 @@ from .utils import (
is_safetensors_available,
is_scipy_available,
is_sentencepiece_available,
is_seqio_available,
is_soundfile_availble,
is_spacy_available,
is_sudachi_available,
@@ -442,6 +443,13 @@ def require_sentencepiece(test_case):
return unittest.skipUnless(is_sentencepiece_available(), "test requires SentencePiece")(test_case)
def require_seqio(test_case):
"""
Decorator marking a test that requires SentencePiece. These tests are skipped when SentencePiece isn't installed.
"""
return unittest.skipUnless(is_seqio_available(), "test requires Seqio")(test_case)
def require_scipy(test_case):
"""
Decorator marking a test that requires Scipy. These tests are skipped when SentencePiece isn't installed.