BLOOM minor changes on tokenizer (#17823)

* few fixes:
  - hardcode tokenizer padding side
  - remove unused args

* few fixes:
  - added new attribute on TokenizerTesterMixin
  - added new slow test
  - remove unused arg on tokenizer class

* make style

* Update src/transformers/models/bloom/tokenization_bloom_fast.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* make quality

* apply changes
  - remove new attribute
  - redefine test on the class

* add comments

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
@@ -127,3 +127,10 @@ class BloomTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         output_tokens = list(map(tokenizer.encode, input_text))
         predicted_text = list(map(lambda x: tokenizer.decode(x, clean_up_tokenization_spaces=False), output_tokens))
         self.assertListEqual(predicted_text, input_text)
+
+    def test_pretrained_model_lists(self):
+        # The test has to be overridden because BLOOM uses ALiBi positional embeddings, which do not have
+        # any sequence length constraints. The parent class's version of this test would fail since it
+        # relies on the maximum sequence length of the positional embeddings.
+        self.assertGreaterEqual(len(self.tokenizer_class.pretrained_vocab_files_map), 1)
+        self.assertGreaterEqual(len(list(self.tokenizer_class.pretrained_vocab_files_map.values())[0]), 1)
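The override pattern in this hunk can be illustrated with a minimal, self-contained sketch. It does not import `transformers`: `FakeTokenizer`, `ParentMixin`, and `OverridingTest` are hypothetical stand-ins, and the parent check shown here (one `max_model_input_sizes` entry per checkpoint) is an assumption about what the mixin test relies on, not a copy of it.

```python
import unittest


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer class; not the transformers API."""

    pretrained_vocab_files_map = {
        "tokenizer_file": {"some/checkpoint": "https://example.invalid/tokenizer.json"}
    }
    # Left empty on purpose: with ALiBi there is no fixed maximum
    # sequence length to advertise per checkpoint.
    max_model_input_sizes = {}


class ParentMixin:
    """Hypothetical mixin; assumed to require one max length per checkpoint."""

    tokenizer_class = None

    def test_pretrained_model_lists(self):
        vocab_map = self.tokenizer_class.pretrained_vocab_files_map
        first_entry = list(vocab_map.values())[0]
        self.assertGreaterEqual(len(vocab_map), 1)
        # Assumed parent behaviour: this is the check that cannot hold
        # when no maximum input sizes are declared.
        self.assertEqual(len(first_entry), len(self.tokenizer_class.max_model_input_sizes))


class OverridingTest(ParentMixin, unittest.TestCase):
    tokenizer_class = FakeTokenizer

    def test_pretrained_model_lists(self):
        # Keep only the checks that do not depend on a maximum sequence length.
        vocab_map = self.tokenizer_class.pretrained_vocab_files_map
        self.assertGreaterEqual(len(vocab_map), 1)
        self.assertGreaterEqual(len(list(vocab_map.values())[0]), 1)
```

Redefining the method on the test class keeps the change local to one model's tests, which matches the "remove new attribute, redefine test on the class" step in the commit message.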