Consider do_lower_case in PreTrainedTokenizer

As pointed out in #1545, when a new lowercase token is added to an
uncased model, the tokenizer fails to recognize that token if the
input text contains it with different casing.

For instance, if we load bert-base-uncased into BertTokenizer and
then use .add_tokens() to add "cool-token", we get the expected
result for .tokenize('this is a cool-token'). However, we get a
possibly unexpected result for .tokenize('this is a cOOl-Token'):
the added token is not matched, so the output is the same as the
output for 'this is a cool-token' before the new token was added.
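
The mismatch can be illustrated without loading a model. The sketch
below is a toy, not the transformers implementation: the helper name
split_on_added_token is hypothetical, and it only assumes that added
tokens are located with a case-sensitive str.split, as the
split_on_token helper in the diff does.

```python
# Toy illustration only; not the transformers implementation.
def split_on_added_token(tok, text):
    """Split text around tok, keeping tok itself as a separate piece."""
    pieces = []
    parts = text.split(tok)  # str.split is case-sensitive
    for i, part in enumerate(parts):
        part = part.strip()
        if part:
            pieces.append(part)
        if i < len(parts) - 1:
            pieces.append(tok)  # re-insert the token between the parts
    return pieces


# The lowercase text contains the added token verbatim, so it is isolated.
lowercase_pieces = split_on_added_token("cool-token", "this is a cool-token")

# The cased text never matches, so it is passed on in one piece and the
# underlying tokenizer treats it as if the token had never been added.
cased_pieces = split_on_added_token("cool-token", "this is a cOOl-Token")
```

Here lowercase_pieces is ['this is a', 'cool-token'], while
cased_pieces is the single unsplit string.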

This commit adds:
- functionality in PreTrainedTokenizer to handle this situation when
  a tokenizer (currently Bert, DistilBert, and XLNet) has the
  do_lower_case=True kwarg, by:
    1) lowercasing tokens added with .add_tokens()
    2) lowercasing text at the beginning of .tokenize()
- a new common test case for tokenizers
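
A minimal sketch of the two steps, assuming a hypothetical ToyTokenizer
class that is not part of transformers. It keeps only the init_kwargs
lookup from the diff; a regex-based base tokenizer stands in for
WordPiece, and re.split stands in for the library's split_on_token
logic.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with added-token support."""

    def __init__(self, **init_kwargs):
        self.init_kwargs = init_kwargs
        self.added_tokens = []

    def add_tokens(self, new_tokens):
        added = 0
        for token in new_tokens:
            # Step 1: lowercase added tokens for uncased tokenizers.
            if self.init_kwargs.get("do_lower_case", False):
                token = token.lower()
            if token not in self.added_tokens:
                self.added_tokens.append(token)
                added += 1
        return added

    def _base_tokenize(self, text):
        # Assumption: words and punctuation marks become separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)

    def tokenize(self, text):
        # Step 2: lowercase the text before looking for added tokens.
        if self.init_kwargs.get("do_lower_case", False):
            text = text.lower()
        if not self.added_tokens:
            return self._base_tokenize(text)
        pattern = "(" + "|".join(re.escape(t) for t in self.added_tokens) + ")"
        tokens = []
        for piece in re.split(pattern, text):
            if piece in self.added_tokens:
                tokens.append(piece)  # keep the added token whole
            else:
                tokens.extend(self._base_tokenize(piece))
        return tokens


uncased = ToyTokenizer(do_lower_case=True)
uncased.add_tokens(["Cool-Token"])
uncased_result = uncased.tokenize("this is a cOOl-Token")

cased = ToyTokenizer()
cased.add_tokens(["cool-token"])
cased_result = cased.tokenize("this is a cOOl-Token")
```

With do_lower_case=True the added token is stored and matched in
lowercase, so uncased_result ends with 'cool-token'. Without the kwarg
the text is left untouched and the cased input falls through to the
base tokenizer, which splits it into 'cOOl', '-', 'Token'.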

https://github.com/huggingface/transformers/issues/1545
Michael Watkins
2019-11-06 13:18:16 +02:00
parent 8aba81a0b6
commit 7246d3c2f9
2 changed files with 35 additions and 1 deletions

@@ -512,6 +512,8 @@ class PreTrainedTokenizer(object):
         to_add_tokens = []
         for token in new_tokens:
             assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode))
+            if self.init_kwargs.get('do_lower_case', False):
+                token = token.lower()
             if token != self.unk_token and \
                     self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token) and \
                     token not in to_add_tokens:
@@ -605,6 +607,9 @@ class PreTrainedTokenizer(object):
             Take care of added tokens.
         """
+        if self.init_kwargs.get('do_lower_case', False):
+            text = text.lower()
         def split_on_token(tok, text):
             result = []
             split_text = text.split(tok)