[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
This commit is contained in:
Anthony MOI
2020-06-15 17:12:51 -04:00
committed by GitHub
parent ebba39e4e1
commit 36434220fc
25 changed files with 3821 additions and 2680 deletions

View File

@@ -53,6 +53,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
"\u0120newer",
"\u0120wider",
"<unk>",
"<|endoftext|>",
]
vocab_tokens = dict(zip(vocab, range(len(vocab))))
merges = ["#version: 0.2", "\u0120 l", "\u0120l o", "\u0120lo w", "e r", ""]
@@ -73,7 +74,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
kwargs.update(self.special_tokens_map)
return GPT2TokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
def get_input_output_texts(self):
def get_input_output_texts(self, tokenizer):
input_text = "lower newer"
output_text = "lower newer"
return input_text, output_text
@@ -118,3 +119,8 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
input_tokens = tokens + [rust_tokenizer.unk_token]
input_bpe_tokens = [14, 15, 10, 9, 3, 2, 15, 19]
self.assertListEqual(rust_tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
def test_pretokenized_inputs(self, *args, **kwargs):
# It's very difficult to mix/test pretokenization with byte-level
# And get both GPT2 and Roberta to work at the same time (mostly an issue of adding a space before the string)
pass