🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string (#15775)

* Add test for SentencePiece not adding special tokens to strings

* Add SentencePieceStringConversionMixin to fix issue 15003

* Fix conversion from tokens to string for most SentencePiece tokenizers

Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer

* Fix MarianTokenizer, adjust SentencePiece test to accomodate vocab

* Fix DebertaV2Tokenizer

* Ignore LayoutXLMTokenizer in SentencePiece string conversion test

* Run 'make style' and 'make quality'

* Clean convert_tokens_to_string test

Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.

* Remove commented out code

* Improve robustness of convert_tokens_to_string test

Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.

* Inline and remove SentencePieceStringConversionMixin

The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.

* Run 'make style' and 'make quality'

* Revert removal of space in convert_tokens_to_string

* Remove redundant import

* Revert test text to original

* Uncomment the lowercasing of the reverse_text variable

* Mimic Rust tokenizer behavior for tokenizers

- Albert
- Barthez
- Camembert
- MBart50
- T5

* Fix accidentally skipping test in wrong tokenizer

* Add test for equivalent Rust and slow tokenizer behavior

* Override _decode in BigBirdTokenizer to mimic Rust behavior

* Override _decode in FNetTokenizer to mimic Rust behavior

* Override _decode in XLNetTokenizer to mimic Rust behavior

* Remove unused 're' import

* Update DebertaV2Tokenizer to mimic Rust tokenizer

* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.

* Ignore problematic tests in Deberta V2

* Add comment on why the Deberta V2 tests are skipped
This commit is contained in:
Ben Eyal
2022-11-02 21:45:38 +02:00
committed by GitHub
parent fb7cbe236b
commit 9f9ddcc2de
18 changed files with 379 additions and 40 deletions

View File

@@ -37,7 +37,7 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
super().setUp()
# We have a SentencePiece fixture for testing
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB)
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")
tokenizer.save_pretrained(self.tmpdirname)
def get_input_output_texts(self, tokenizer):
@@ -55,7 +55,6 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
def test_get_vocab(self):
vocab_keys = list(self.get_tokenizer().get_vocab().keys())
self.assertEqual(vocab_keys[0], "<pad>")
self.assertEqual(vocab_keys[1], "<unk>")
self.assertEqual(vocab_keys[-1], "[PAD]")
@@ -80,6 +79,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
self.assertListEqual(rust_tokens, tokens_target)
@unittest.skip("There is an inconsistency between slow and fast tokenizer due to a bug in the fast one.")
def test_sentencepiece_tokenize_and_convert_tokens_to_string(self):
pass
@unittest.skip("There is an inconsistency between slow and fast tokenizer due to a bug in the fast one.")
def test_sentencepiece_tokenize_and_decode(self):
pass
def test_split_by_punct(self):
# fmt: off
sequence = "I was born in 92000, and this is falsé."

View File

@@ -1946,3 +1946,11 @@ class LayoutXLMTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
@unittest.skip("Doesn't support another framework than PyTorch")
def test_np_encode_plus_sent_to_model(self):
pass
@unittest.skip("Doesn't use SentencePiece")
def test_sentencepiece_tokenize_and_convert_tokens_to_string(self):
pass
@unittest.skip("Doesn't use SentencePiece")
def test_sentencepiece_tokenize_and_decode(self):
pass