Fix llama tokenizer (#22402)
* draft
* update tokenization llama and conversion script
* more updates
* initial commit
* style
* default pad to None
* draft tokenization tests
* update test
* update tokenization tests
* nits
* update
* versioning test
* major fix
* fix more tests
* finish fixing special masks
* last nit
* more nits
* add encode decode tests
* add more
* fix token type ids
* style
@@ -3932,7 +3932,7 @@ class TokenizerTesterMixin:
tokenizer_fast.save_pretrained(tmp_dir_2)
tokenizer = BertTokenizer.from_pretrained(tmp_dir_2)

assert tokenizer_fast.clean_up_tokenization_spaces is False
assert tokenizer.clean_up_tokenization_spaces is False
decoded = tokenizer.decode(tokens)
assert decoded == "[CLS] this shouldn ' t be ! he ' ll go . [SEP]"
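The assertion in this hunk expects the decoded string to keep its spaces around punctuation, because clean_up_tokenization_spaces is False. As a rough illustration of what the flag would change, here is a minimal standalone sketch. The helper name clean_up_spaces is hypothetical, and the replacement rules are an assumption modelled on a simple chain of string substitutions; the library's actual cleanup may differ in detail.

```python
# Hypothetical helper, not part of this commit: sketches the kind of
# whitespace cleanup that the clean_up_tokenization_spaces flag toggles.
def clean_up_spaces(text: str) -> str:
    # Assumed rules: drop the space before common punctuation and
    # rejoin English contractions that were split by tokenization.
    replacements = [
        (" .", "."),
        (" ?", "?"),
        (" !", "!"),
        (" ,", ","),
        (" ' ", "'"),
        (" n't", "n't"),
        (" 'm", "'m"),
        (" 's", "'s"),
        (" 've", "'ve"),
        (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# The string the test expects when cleanup is disabled.
raw = "[CLS] this shouldn ' t be ! he ' ll go . [SEP]"

# With cleanup disabled the decoded text is returned unchanged;
# with cleanup enabled the spacing around punctuation is collapsed.
cleaned = clean_up_spaces(raw)
print(raw)
print(cleaned)
```

Under these assumed rules the cleaned form reads "[CLS] this shouldn't be! he'll go. [SEP]", which is why the test pins the uncleaned string: it checks that the False setting survives save_pretrained and from_pretrained.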