🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string (#15775)

* Add test for SentencePiece not adding special tokens to strings

* Add SentencePieceStringConversionMixin to fix issue 15003

* Fix conversion from tokens to string for most SentencePiece tokenizers

Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer

* Fix MarianTokenizer, adjust SentencePiece test to accomodate vocab

* Fix DebertaV2Tokenizer

* Ignore LayoutXLMTokenizer in SentencePiece string conversion test

* Run 'make style' and 'make quality'

* Clean convert_tokens_to_string test

Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.

* Remove commented out code

* Improve robustness of convert_tokens_to_string test

Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.

* Inline and remove SentencePieceStringConversionMixin

The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.

* Run 'make style' and 'make quality'

* Revert removal of space in convert_tokens_to_string

* Remove redundant import

* Revert test text to original

* Uncomment the lowercasing of the reverse_text variable

* Mimic Rust tokenizer behavior for tokenizers

- Albert
- Barthez
- Camembert
- MBart50
- T5

* Fix accidentally skipping test in wrong tokenizer

* Add test for equivalent Rust and slow tokenizer behavior

* Override _decode in BigBirdTokenizer to mimic Rust behavior

* Override _decode in FNetTokenizer to mimic Rust behavior

* Override _decode in XLNetTokenizer to mimic Rust behavior

* Remove unused 're' import

* Update DebertaV2Tokenizer to mimic Rust tokenizer

* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.

* Ignore problematic tests in Deberta V2

* Add comment on why the Deberta V2 tests are skipped
This commit is contained in:
Ben Eyal
2022-11-02 21:45:38 +02:00
committed by GitHub
parent fb7cbe236b
commit 9f9ddcc2de
18 changed files with 379 additions and 40 deletions

View File

@@ -385,6 +385,33 @@ class TokenizerTesterMixin:
self.assertEqual(reverse_text, text)
special_tokens = tokenizer.all_special_tokens
special_tokens_string = tokenizer.convert_tokens_to_string(special_tokens)
for special_token in special_tokens:
self.assertIn(special_token, special_tokens_string)
if self.test_rust_tokenizer:
rust_tokenizer = self.get_rust_tokenizer()
special_tokens_string_rust = rust_tokenizer.convert_tokens_to_string(special_tokens)
self.assertEqual(special_tokens_string, special_tokens_string_rust)
def test_sentencepiece_tokenize_and_decode(self):
if not self.test_sentencepiece:
return
text = "This is text to test the tokenizer."
if self.test_rust_tokenizer:
tokenizer = self.get_tokenizer()
rust_tokenizer = self.get_rust_tokenizer()
slow_ids = tokenizer(text).input_ids
fast_ids = rust_tokenizer(text).input_ids
self.assertEqual(slow_ids, fast_ids)
slow_decoded = tokenizer.decode(slow_ids)
fast_decoded = rust_tokenizer.decode(slow_ids)
self.assertEqual(slow_decoded, fast_decoded)
def test_subword_regularization_tokenizer(self) -> None:
if not self.test_sentencepiece:
return