🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string (#15775)
* Add test for SentencePiece not adding special tokens to strings
* Add SentencePieceStringConversionMixin to fix issue 15003
* Fix conversion from tokens to string for most SentencePiece tokenizers
  Tokenizers fixed:
  - AlbertTokenizer
  - BarthezTokenizer
  - CamembertTokenizer
  - FNetTokenizer
  - M2M100Tokenizer
  - MBart50Tokenizer
  - PegasusTokenizer
  - Speech2TextTokenizer
* Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
* Fix DebertaV2Tokenizer
* Ignore LayoutXLMTokenizer in SentencePiece string conversion test
* Run 'make style' and 'make quality'
* Clean convert_tokens_to_string test
  Instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it.
* Remove commented out code
* Improve robustness of convert_tokens_to_string test
  Instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens.
* Inline and remove SentencePieceStringConversionMixin
  The convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer.
* Run 'make style' and 'make quality'
* Revert removal of space in convert_tokens_to_string
* Remove redundant import
* Revert test text to original
* Uncomment the lowercasing of the reverse_text variable
* Mimic Rust tokenizer behavior for tokenizers
  - Albert
  - Barthez
  - Camembert
  - MBart50
  - T5
* Fix accidentally skipping test in wrong tokenizer
* Add test for equivalent Rust and slow tokenizer behavior
* Override _decode in BigBirdTokenizer to mimic Rust behavior
* Override _decode in FNetTokenizer to mimic Rust behavior
* Override _decode in XLNetTokenizer to mimic Rust behavior
* Remove unused 're' import
* Update DebertaV2Tokenizer to mimic Rust tokenizer
* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
* Ignore problematic tests in Deberta V2
* Add comment on why the Deberta V2 tests are skipped
@@ -385,6 +385,33 @@ class TokenizerTesterMixin:
        self.assertEqual(reverse_text, text)

        special_tokens = tokenizer.all_special_tokens
        special_tokens_string = tokenizer.convert_tokens_to_string(special_tokens)
        for special_token in special_tokens:
            self.assertIn(special_token, special_tokens_string)

        if self.test_rust_tokenizer:
            rust_tokenizer = self.get_rust_tokenizer()
            special_tokens_string_rust = rust_tokenizer.convert_tokens_to_string(special_tokens)
            self.assertEqual(special_tokens_string, special_tokens_string_rust)

    def test_sentencepiece_tokenize_and_decode(self):
        if not self.test_sentencepiece:
            return

        text = "This is text to test the tokenizer."
        if self.test_rust_tokenizer:
            tokenizer = self.get_tokenizer()
            rust_tokenizer = self.get_rust_tokenizer()

            slow_ids = tokenizer(text).input_ids
            fast_ids = rust_tokenizer(text).input_ids
            self.assertEqual(slow_ids, fast_ids)

            slow_decoded = tokenizer.decode(slow_ids)
            fast_decoded = rust_tokenizer.decode(slow_ids)
            self.assertEqual(slow_decoded, fast_decoded)

    def test_subword_regularization_tokenizer(self) -> None:
        if not self.test_sentencepiece:
            return
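
For readers without the rest of the diff at hand, here is a minimal sketch of the kind of `convert_tokens_to_string` logic the new assertions exercise: ordinary pieces are buffered and decoded together, while special tokens are appended verbatim so they cannot be dropped by the decoder. This is an illustration under stated assumptions, not the code from this commit. `toy_decode` is a hypothetical stand-in for a SentencePiece model's decode step, the function takes the special tokens as an explicit argument instead of reading them from a tokenizer, and the spacing rule around special tokens is an assumption.

```python
from typing import Callable, Collection, List, Sequence

SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker


def toy_decode(pieces: Sequence[str]) -> str:
    """Hypothetical stand-in for a SentencePiece decode call (not the real API)."""
    return "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()


def convert_tokens_to_string(
    tokens: Sequence[str],
    special_tokens: Collection[str],
    decode: Callable[[Sequence[str]], str] = toy_decode,
) -> str:
    """Join tokens into a string without passing special tokens to the decoder."""
    out = ""
    pending: List[str] = []  # ordinary pieces waiting to be decoded together
    prev_is_special = False
    for token in tokens:
        if token in special_tokens:
            # Assumed spacing rule: one space before a run of special tokens.
            if not prev_is_special:
                out += " "
            # Flush the buffered pieces, then append the special token as-is.
            out += decode(pending) + token
            pending = []
            prev_is_special = True
        else:
            pending.append(token)
            prev_is_special = False
    out += decode(pending)  # flush whatever is left after the last special token
    return out.strip()


specials = {"[CLS]", "[SEP]"}
result = convert_tokens_to_string(
    ["[CLS]", SPIECE_UNDERLINE + "hello", SPIECE_UNDERLINE + "world", "[SEP]"],
    specials,
)
print(result)  # [CLS] hello world[SEP]
```

With this shape, converting a list made up only of special tokens returns a string that contains every one of them, which is the property the `assertIn` loop in the test checks.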