HuggingFace_transformer/tests/models
Ben Eyal 9f9ddcc2de 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string (#15775)
* Add test for SentencePiece not adding special tokens to strings

* Add SentencePieceStringConversionMixin to fix issue 15003

* Fix conversion from tokens to string for most SentencePiece tokenizers

Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer

* Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab

* Fix DebertaV2Tokenizer

* Ignore LayoutXLMTokenizer in SentencePiece string conversion test

* Run 'make style' and 'make quality'

* Clean convert_tokens_to_string test

Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutXLMTokenizationTest and do nothing in it.

* Remove commented out code

* Improve robustness of convert_tokens_to_string test

Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.
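
A rough sketch of the check described above, not the actual test from this
commit. The helper name `missing_special_tokens` and the `convert` callable
are hypothetical; in the real test suite `convert` would presumably be a
tokenizer's `convert_tokens_to_string`.

```python
from typing import Callable, List, Sequence


def missing_special_tokens(
    convert: Callable[[List[str]], str],
    special_tokens: Sequence[str],
) -> List[str]:
    """Convert all special tokens to one string and report any that vanished.

    `convert` is a hypothetical stand-in for a tokenizer's
    convert_tokens_to_string method.
    """
    text = convert(list(special_tokens))
    # A special token "survives" if it appears verbatim in the output string.
    return [token for token in special_tokens if token not in text]


if __name__ == "__main__":
    specials = ["<s>", "</s>", "<pad>"]
    # A converter that keeps special tokens passes the check.
    print(missing_special_tokens(lambda toks: " ".join(toks), specials))
    # A converter that drops everything fails it.
    print(missing_special_tokens(lambda toks: "", specials))
```

Checking for the tokens themselves, rather than comparing lengths, keeps the
check independent of how a given vocabulary re-tokenizes the text.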

* Inline and remove SentencePieceStringConversionMixin

The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.
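
A minimal sketch of the idea, not the code merged in this commit: runs of
ordinary pieces are decoded together, while special tokens bypass the
SentencePiece decoder and are copied through verbatim. `decode_pieces` is a
hypothetical stand-in for the SentencePiece model's decoder, and joining the
segments with single spaces is an assumption of this sketch; the exact
spacing differs per tokenizer.

```python
from typing import Callable, Collection, List


def decode_pieces(pieces: List[str]) -> str:
    """Hypothetical stand-in for a SentencePiece decoder: concatenate the
    pieces and turn the U+2581 word-boundary marker into spaces."""
    return "".join(pieces).replace("\u2581", " ").strip()


def convert_tokens_to_string(
    tokens: List[str],
    special_tokens: Collection[str],
    decode: Callable[[List[str]], str] = decode_pieces,
) -> str:
    parts: List[str] = []    # decoded segments and special tokens, in order
    pending: List[str] = []  # ordinary pieces waiting to be decoded together
    for token in tokens:
        if token in special_tokens:
            # Flush the ordinary pieces seen so far, then keep the special
            # token as-is instead of passing it to the decoder.
            if pending:
                parts.append(decode(pending))
                pending = []
            parts.append(token)
        else:
            pending.append(token)
    if pending:
        parts.append(decode(pending))
    # Assumption of this sketch: segments are separated by single spaces.
    return " ".join(parts)


if __name__ == "__main__":
    tokens = ["[CLS]", "\u2581hello", "\u2581wor", "ld", "[SEP]"]
    print(convert_tokens_to_string(tokens, {"[CLS]", "[SEP]"}))
```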

* Run 'make style' and 'make quality'

* Revert removal of space in convert_tokens_to_string

* Remove redundant import

* Revert test text to original

* Uncomment the lowercasing of the reverse_text variable

* Mimic Rust tokenizer behavior for tokenizers

- Albert
- Barthez
- Camembert
- MBart50
- T5

* Fix accidentally skipping test in wrong tokenizer

* Add test for equivalent Rust and slow tokenizer behavior

* Override _decode in BigBirdTokenizer to mimic Rust behavior

* Override _decode in FNetTokenizer to mimic Rust behavior

* Override _decode in XLNetTokenizer to mimic Rust behavior

* Remove unused 're' import

* Update DebertaV2Tokenizer to mimic Rust tokenizer

* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.

* Ignore problematic tests in Deberta V2

* Add comment on why the Deberta V2 tests are skipped
2022-11-02 15:45:38 -04:00