Add pad_to_multiple_of on tokenizers (reimport) (#5054)

* Add new parameter `pad_to_multiple_of` on tokenizers.

* unittest for pad_to_multiple_of

* Add .name when logging enum.

* Fix missing .items() on dict in tests.

* Add special check + warning if the tokenizer doesn't have proper pad_token.

* Use the correct logger format specifier.

* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.

* Skip test if tokenizer doesn't have pad_token

* Fix RobertaTokenizer on empty input

* Format.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix and updating to simpler API

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
This commit is contained in:
Funtowicz Morgan
2020-06-26 11:55:57 +02:00
committed by GitHub
parent 7cc15bdd96
commit 135791e8ef
5 changed files with 117 additions and 18 deletions

View File

@@ -244,7 +244,7 @@ class RobertaTokenizer(GPT2Tokenizer):
def prepare_for_tokenization(self, text, is_pretokenized=False, **kwargs):
add_prefix_space = kwargs.pop("add_prefix_space", self.add_prefix_space)
if (is_pretokenized or add_prefix_space) and text:
if (is_pretokenized or add_prefix_space) and (len(text) > 0 and not text[0].isspace()):
text = " " + text
return (text, kwargs)