Add pad_to_multiple_of on tokenizers (reimport) (#5054)
* Add new parameter `pad_to_multiple_of` on tokenizers. * unittest for pad_to_multiple_of * Add .name when logging enum. * Fix missing .items() on dict in tests. * Add special check + warning if the tokenizer doesn't have proper pad_token. * Use the correct logger format specifier. * Ensure tokenizer with no pad_token do not modify the underlying padding strategy. * Skip test if tokenizer doesn't have pad_token * Fix RobertaTokenizer on empty input * Format. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * fix and updating to simpler API Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
This commit is contained in:
@@ -244,7 +244,7 @@ class RobertaTokenizer(GPT2Tokenizer):
|
||||
|
||||
def prepare_for_tokenization(self, text, is_pretokenized=False, **kwargs):
|
||||
add_prefix_space = kwargs.pop("add_prefix_space", self.add_prefix_space)
|
||||
if (is_pretokenized or add_prefix_space) and text:
|
||||
if (is_pretokenized or add_prefix_space) and (len(text) > 0 and not text[0].isspace()):
|
||||
text = " " + text
|
||||
return (text, kwargs)
|
||||
|
||||
|
||||
Reference in New Issue
Block a user