VLM: special multimodal Tokenizer (#34461)

* kinda works

* update

* add tests

* update

* use special tokens in processors

* typo

* fix copies

* fix

* fix moshi after rebase

* update

* fix tests

* update

* Update docs/source/en/main_classes/tokenizer.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* update docs

* test for load time adding tokens

* fix some more tests which are now fetched better

* one more fix

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Raushan Turganbay
2024-11-04 16:37:51 +01:00
committed by GitHub
parent ef976a7e18
commit 187439c3fa
35 changed files with 248 additions and 335 deletions

@@ -385,6 +385,7 @@ class LlamaIntegrationTest(unittest.TestCase):
assert fast == [1, 319, 4559, 1243]
fast_tokenizer.add_eos_token = True
print(fast_tokenizer.add_eos_token)
fast = fast_tokenizer.encode("A sample test", add_special_tokens=True)
assert fast == [1, 319, 4559, 1243, 2]