BLOOM minor changes on tokenizer (#17823)

* few fixes:

- hardcode tokenizer padding side
- remove unused args

* few fixes:

- added new attribute on TokenizerTesterMixin
- added new slow test
- remove unused arg on tokenizer class

* make style

* Update src/transformers/models/bloom/tokenization_bloom_fast.py

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* make quality

* apply changes

- remove new attribute
- redefine test on the class

* add comments

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Younes Belkada
2022-06-23 15:57:12 +02:00
committed by GitHub
parent 6f29029b05
commit 18c263c4b6
3 changed files with 35 additions and 12 deletions

@@ -127,3 +127,10 @@ class BloomTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
        output_tokens = list(map(tokenizer.encode, input_text))
        predicted_text = list(map(lambda x: tokenizer.decode(x, clean_up_tokenization_spaces=False), output_tokens))
        self.assertListEqual(predicted_text, input_text)

    def test_pretrained_model_lists(self):
        # The test has to be overridden because BLOOM uses ALiBi positional embeddings, which do not have
        # any sequence length constraints. The parent class's version of this test would fail since it
        # relies on the maximum sequence length of the positional embeddings.
        self.assertGreaterEqual(len(self.tokenizer_class.pretrained_vocab_files_map), 1)
        self.assertGreaterEqual(len(list(self.tokenizer_class.pretrained_vocab_files_map.values())[0]), 1)
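
The reason for the override can be illustrated with a small, self-contained sketch. It is not the code from this commit: `FakeTokenizer`, `ParentLikeMixin`, `make_cases` and `run_case` are hypothetical names. The extra length check in the mixin is an assumption about what the parent test does, based only on the comment above, and `max_model_input_sizes` is assumed to be the attribute that holds those limits.

```python
import unittest


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer class (not the real BLOOM tokenizer)."""

    # One vocab file type mapped to one checkpoint, and an empty max-length table,
    # mimicking a model whose positional embeddings impose no length limit.
    pretrained_vocab_files_map = {"tokenizer_file": {"some/checkpoint": "tokenizer.json"}}
    max_model_input_sizes = {}


class ParentLikeMixin:
    """Simplified stand-in for the shared tester mixin (assumed behaviour)."""

    tokenizer_class = FakeTokenizer

    def test_pretrained_model_lists(self):
        files = list(self.tokenizer_class.pretrained_vocab_files_map.values())[0]
        self.assertGreaterEqual(len(files), 1)
        # Assumed extra check: every checkpoint declares a maximum input size.
        self.assertEqual(len(files), len(self.tokenizer_class.max_model_input_sizes))


def make_cases():
    """Build the two test cases: one inheriting the parent test, one overriding it."""

    class Inherited(ParentLikeMixin, unittest.TestCase):
        pass

    class Overridden(ParentLikeMixin, unittest.TestCase):
        def test_pretrained_model_lists(self):
            # Keep only the checks that do not depend on a maximum sequence length.
            vocab_map = self.tokenizer_class.pretrained_vocab_files_map
            self.assertGreaterEqual(len(vocab_map), 1)
            self.assertGreaterEqual(len(list(vocab_map.values())[0]), 1)

    return Inherited, Overridden


def run_case(case):
    """Run every test in one TestCase class and report whether all passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


if __name__ == "__main__":
    inherited, overridden = make_cases()
    print("inherited passes:", run_case(inherited))
    print("overridden passes:", run_case(overridden))
```

Under these assumptions the inherited test fails because the length table is empty, while the overridden test passes because it only checks that at least one pretrained vocab file is registered.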