[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests and lighten them up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style, quality, split herbert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
Thomas Wolf
2020-10-18 20:51:24 +02:00
committed by GitHub
parent c65863ce53
commit ba8c4d0ac0
140 changed files with 6551 additions and 3961 deletions


@@ -19,7 +19,7 @@ import unittest
from transformers import is_torch_available
from transformers.file_utils import cached_property
from transformers.hf_api import HfApi
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
if is_torch_available():
@@ -53,6 +53,8 @@ class ModelManagementTests(unittest.TestCase):
@require_torch
+@require_sentencepiece
+@require_tokenizers
class MarianIntegrationTest(unittest.TestCase):
src = "en"
tgt = "de"
@@ -110,6 +112,8 @@ class MarianIntegrationTest(unittest.TestCase):
return generated_words
+@require_sentencepiece
+@require_tokenizers
class TestMarian_EN_DE_More(MarianIntegrationTest):
@slow
def test_forward(self):
@@ -154,6 +158,8 @@ class TestMarian_EN_DE_More(MarianIntegrationTest):
self.assertIsInstance(config, MarianConfig)
+@require_sentencepiece
+@require_tokenizers
class TestMarian_EN_FR(MarianIntegrationTest):
src = "en"
tgt = "fr"
@@ -171,6 +177,8 @@ class TestMarian_EN_FR(MarianIntegrationTest):
self._assert_generated_batch_equal_expected()
+@require_sentencepiece
+@require_tokenizers
class TestMarian_FR_EN(MarianIntegrationTest):
src = "fr"
tgt = "en"
@@ -188,6 +196,8 @@ class TestMarian_FR_EN(MarianIntegrationTest):
self._assert_generated_batch_equal_expected()
+@require_sentencepiece
+@require_tokenizers
class TestMarian_RU_FR(MarianIntegrationTest):
src = "ru"
tgt = "fr"
@@ -199,6 +209,8 @@ class TestMarian_RU_FR(MarianIntegrationTest):
self._assert_generated_batch_equal_expected()
+@require_sentencepiece
+@require_tokenizers
class TestMarian_MT_EN(MarianIntegrationTest):
src = "mt"
tgt = "en"
@@ -210,6 +222,8 @@ class TestMarian_MT_EN(MarianIntegrationTest):
self._assert_generated_batch_equal_expected()
+@require_sentencepiece
+@require_tokenizers
class TestMarian_en_zh(MarianIntegrationTest):
src = "en"
tgt = "zh"
@@ -221,6 +235,8 @@ class TestMarian_en_zh(MarianIntegrationTest):
self._assert_generated_batch_equal_expected()
+@require_sentencepiece
+@require_tokenizers
class TestMarian_en_ROMANCE(MarianIntegrationTest):
"""Multilingual on target side."""