Upgrading to tokenizers 0.19.0 (#30289)
* [DO NOT MERGE] Testing tokenizers 0.19.0rc0
* Accounting for the breaking change.
* Ruff.
* Upgrading to tokenizers `0.19` (new release with prepend_scheme fixed and new surface for BPE tiktoken bug).
@@ -46,12 +46,16 @@ class SentencePieceUnigramTokenizer(BaseTokenizer):
         )
         tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
             [
-                pre_tokenizers.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space),
+                pre_tokenizers.Metaspace(
+                    replacement=replacement, add_prefix_space="always" if add_prefix_space else "never"
+                ),
                 pre_tokenizers.Digits(individual_digits=True),
                 pre_tokenizers.Punctuation(),
             ]
         )
-        tokenizer.decoder = decoders.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space)
+        tokenizer.decoder = decoders.Metaspace(
+            replacement=replacement, add_prefix_space="always" if add_prefix_space else "never"
+        )
 
         tokenizer.post_processor = TemplateProcessing(
             single=f"$A {self.special_tokens['eos']['token']}",
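The hunk above repeats the same boolean-to-string conversion in two places (the pre-tokenizer and the decoder). A minimal sketch of that conversion as a shared helper follows. The helper names `to_prepend_scheme` and `metaspace_kwargs` are hypothetical and not part of this commit, and the `prepend_scheme` key is an assumption based on the parameter named in the commit message; the hunk itself passes the string under `add_prefix_space`, which this sketch does not verify against any tokenizers release.

```python
def to_prepend_scheme(add_prefix_space: bool) -> str:
    """Map the legacy boolean flag to the string scheme used in the hunk."""
    return "always" if add_prefix_space else "never"


def metaspace_kwargs(replacement: str = "\u2581", add_prefix_space: bool = True) -> dict:
    """Hypothetical helper: build the keyword arguments once so the
    pre-tokenizer and the decoder cannot drift apart.

    Assumption: the target Metaspace constructor accepts a keyword named
    `prepend_scheme`; adjust the key if the installed release differs.
    """
    return {
        "replacement": replacement,
        "prepend_scheme": to_prepend_scheme(add_prefix_space),
    }


if __name__ == "__main__":
    # Show the arguments produced for both settings of the legacy flag.
    print(metaspace_kwargs(add_prefix_space=True))
    print(metaspace_kwargs(add_prefix_space=False))
```

If the installed tokenizers release exposes `prepend_scheme` on `Metaspace`, the dictionary could be unpacked into both constructors, for example `pre_tokenizers.Metaspace(**metaspace_kwargs(replacement, add_prefix_space))`; that call is untested here.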