[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)
* splitting fast and slow tokenizers [WIP]
* [WIP] splitting sentencepiece and tokenizers dependencies
* update dummy objects
* add name_or_path to models and tokenizers
* prefix added to file names
* prefix
* styling + quality
* splitting all the tokenizer files - sorting sentencepiece based ones
* update tokenizer version up to 0.9.0
* remove hard dependency on sentencepiece 🎉
* and removed hard dependency on tokenizers 🎉
* update conversion script
* update missing models
* fixing tests
* move test_tokenization_fast to main tokenization tests - fix bugs
* bump up tokenizers
* fix bert_generation
* update and fix several tokenizers
* keep sentencepiece in deps for now
* fix funnel and deberta tests
* fix fsmt
* fix marian tests
* fix layoutlm
* fix squeezebert and gpt2
* fix T5 tokenization
* fix xlnet tests
* style
* fix mbart
* bump up tokenizers to 0.9.2
* fix model tests
* fix tf models
* fix seq2seq examples
* fix tests without sentencepiece
* fix slow => fast conversion without sentencepiece
* update auto and bert generation tests
* fix mbart tests
* fix auto and common test without tokenizers
* fix tests without tokenizers
* clean up tests, lighten up when tokenizers + sentencepiece are both off
* style quality and tests fixing
* add sentencepiece to doc/examples reqs
* leave sentencepiece on for now
* style quality split herbert and fix pegasus
* WIP Herbert fast
* add sample_text_no_unicode and fix herbert tokenization
* skip FSMT example test for now
* fix style
* fix fsmt in example tests
* update following Lysandre and Sylvain's comments
* Update src/transformers/testing_utils.py
* Update src/transformers/testing_utils.py
* Update src/transformers/tokenization_utils_base.py
* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
32
tests/fixtures/sample_text_no_unicode.txt
vendored
Normal file
@@ -0,0 +1,32 @@
+Text should be one-sentence-per-line, with empty lines between documents.
+This sample text is public domain and was randomly selected from Project Guttenberg.
+
+The rain had only ceased with the gray streaks of morning at Blazing Star, and the settlement awoke to a moral sense of cleanliness, and the finding of forgotten knives, tin cups, and smaller camp utensils, where the heavy showers had washed away the debris and dust heaps before the cabin doors.
+Indeed, it was recorded in Blazing Star that a fortunate early riser had once picked up on the highway a solid chunk of gold quartz which the rain had freed from its incumbering soil, and washed into immediate and glittering popularity.
+Possibly this may have been the reason why early risers in that locality, during the rainy season, adopted a thoughtful habit of body, and seldom lifted their eyes to the rifted or india-ink washed skies above them.
+"Cass" Beard had risen early that morning, but not with a view to discovery.
+A leak in his cabin roof,--quite consistent with his careless, improvident habits,--had roused him at 4 A. M., with a flooded "bunk" and wet blankets.
+The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor's to supply the deficiency.
+This was nearly opposite.
+Mr. Cassius crossed the highway, and stopped suddenly.
+Something glittered in the nearest red pool before him.
+Gold, surely!
+But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible, but a bit of jeweler's handicraft in the form of a plain gold ring.
+Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
+Like most of his fellow gold-seekers, Cass was superstitious.
+
+The fountain of classic wisdom, Hypatia herself.
+As the ancient sage--the name is unimportant to a monk--pumped water nightly that he might study by day, so I, the guardian of cloaks and parasols, at the sacred doors of her lecture-room, imbibe celestial knowledge.
+From my youth I felt in me a soul above the matter-entangled herd.
+She revealed to me the glorious fact, that I am a spark of Divinity itself.
+A fallen star, I am, sir!' continued he, pensively, stroking his lean stomach--'a fallen star!--fallen, if the dignity of philosophy will allow of the simile, among the hogs of the lower world--indeed, even into the hog-bucket itself. Well, after all, I will show you the way to the Archbishop's.
+There is a philosophic pleasure in opening one's treasures to the modest young.
+Perhaps you will assist me by carrying this basket of fruit?' And the little man jumped up, put his basket on Philammon's head, and trotted off up a neighbouring street.
+Philammon followed, half contemptuous, half wondering at what this philosophy might be, which could feed the self-conceit of anything so abject as his ragged little apish guide;
+but the novel roar and whirl of the street, the perpetual stream of busy faces, the line of curricles, palanquins, laden asses, camels, elephants, which met and passed him, and squeezed him up steps and into doorways, as they threaded their way through the great Moon-gate into the ample street beyond, drove everything from his mind but wondering curiosity, and a vague, helpless dread of that great living wilderness, more terrible than any dead wilderness of sand which he had left behind.
+Already he longed for the repose, the silence of the Laura--for faces which knew him and smiled upon him; but it was too late to turn back now.
+His guide held on for more than a mile up the great main street, crossed in the centre of the city, at right angles, by one equally magnificent, at each end of which, miles away, appeared, dim and distant over the heads of the living stream of passengers, the yellow sand-hills of the desert;
+while at the end of the vista in front of them gleamed the blue harbour, through a network of countless masts.
+At last they reached the quay at the opposite end of the street;
+and there burst on Philammon's astonished eyes a vast semicircle of blue sea, ringed with palaces and towers.
+He stopped involuntarily; and his little guide stopped also, and looked askance at the young monk, to watch the effect which that grand panorama should produce on him.
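
The fixture's first line describes its layout: one sentence per line, with empty lines between documents. A minimal sketch of reading text in that layout could look like the following; the function name `split_documents` is hypothetical and does not come from the test suite.

```python
def split_documents(text):
    """Split one-sentence-per-line text into documents separated by empty lines."""
    documents, current = [], []
    for line in text.splitlines():
        sentence = line.strip()
        if sentence:
            # A non-empty line is one sentence of the current document.
            current.append(sentence)
        elif current:
            # An empty line closes the current document.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


sample = "First sentence.\nSecond sentence.\n\nAnother document.\n"
docs = split_documents(sample)
print(len(docs))  # number of documents found in the sample
```

Applied to the fixture above, the two header lines would form the first "document", followed by the two prose excerpts.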
@@ -20,7 +20,7 @@ import timeout_decorator # noqa

 from transformers import is_torch_available
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
@@ -207,6 +207,8 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
     def test_inputs_embeds(self):
         pass

+    @require_sentencepiece
+    @require_tokenizers
     def test_tiny_model(self):
         model_name = "sshleifer/bart-tiny-random"
         tiny = AutoModel.from_pretrained(model_name) # same vocab size
@@ -439,6 +441,8 @@ TOLERANCE = 1e-4


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class BartModelIntegrationTests(unittest.TestCase):
     @cached_property
     def default_tokenizer(self):

@@ -19,7 +19,7 @@ import unittest

 from transformers import is_torch_available
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
@@ -131,6 +131,8 @@ class BlenderbotTesterMixin(ModelTesterMixin, unittest.TestCase):

 @unittest.skipUnless(torch_device != "cpu", "3B test too slow on CPU.")
 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class Blenderbot3BIntegrationTests(unittest.TestCase):
     ckpt = "facebook/blenderbot-3B"


@@ -16,7 +16,7 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device


 if is_torch_available():
@@ -26,6 +26,8 @@ if is_torch_available():


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class CamembertModelIntegrationTest(unittest.TestCase):
     @slow
     def test_output_embeds_base_model(self):

@@ -20,7 +20,7 @@ import unittest
 import numpy as np

 from transformers import is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
@@ -236,6 +236,8 @@ class DebertaModelTest(ModelTesterMixin, unittest.TestCase):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class DebertaModelIntegrationTest(unittest.TestCase):
     @unittest.skip(reason="Model not available yet")
     def test_inference_masked_lm(self):

@@ -22,7 +22,7 @@ import timeout_decorator # noqa
 from parameterized import parameterized
 from transformers import is_torch_available
 from transformers.file_utils import WEIGHTS_NAME, cached_property
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
@@ -393,6 +393,8 @@ pairs = [


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class FSMTModelIntegrationTests(unittest.TestCase):
     tokenizers_cache = {}
     models_cache = {}

@@ -17,7 +17,7 @@
 import unittest

 from transformers import FunnelTokenizer, is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
@@ -417,6 +417,8 @@ class FunnelBaseModelTest(ModelTesterMixin, unittest.TestCase):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class FunnelModelIntegrationTest(unittest.TestCase):
     def test_inference_tiny_model(self):
         batch_size = 13

@@ -17,7 +17,7 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
@@ -329,6 +329,8 @@ class LongformerModelTest(ModelTesterMixin, unittest.TestCase):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class LongformerModelIntegrationTest(unittest.TestCase):
     def _get_hidden_states(self):
         return torch.tensor(

@@ -19,7 +19,7 @@ import unittest
 from transformers import is_torch_available
 from transformers.file_utils import cached_property
 from transformers.hf_api import HfApi
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device


 if is_torch_available():
@@ -53,6 +53,8 @@ class ModelManagementTests(unittest.TestCase):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class MarianIntegrationTest(unittest.TestCase):
     src = "en"
     tgt = "de"
@@ -110,6 +112,8 @@ class MarianIntegrationTest(unittest.TestCase):
         return generated_words


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_EN_DE_More(MarianIntegrationTest):
     @slow
     def test_forward(self):
@@ -154,6 +158,8 @@ class TestMarian_EN_DE_More(MarianIntegrationTest):
         self.assertIsInstance(config, MarianConfig)


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_EN_FR(MarianIntegrationTest):
     src = "en"
     tgt = "fr"
@@ -171,6 +177,8 @@ class TestMarian_EN_FR(MarianIntegrationTest):
         self._assert_generated_batch_equal_expected()


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_FR_EN(MarianIntegrationTest):
     src = "fr"
     tgt = "en"
@@ -188,6 +196,8 @@ class TestMarian_FR_EN(MarianIntegrationTest):
         self._assert_generated_batch_equal_expected()


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_RU_FR(MarianIntegrationTest):
     src = "ru"
     tgt = "fr"
@@ -199,6 +209,8 @@ class TestMarian_RU_FR(MarianIntegrationTest):
         self._assert_generated_batch_equal_expected()


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_MT_EN(MarianIntegrationTest):
     src = "mt"
     tgt = "en"
@@ -210,6 +222,8 @@ class TestMarian_MT_EN(MarianIntegrationTest):
         self._assert_generated_batch_equal_expected()


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_en_zh(MarianIntegrationTest):
     src = "en"
     tgt = "zh"
@@ -221,6 +235,8 @@ class TestMarian_en_zh(MarianIntegrationTest):
         self._assert_generated_batch_equal_expected()


+@require_sentencepiece
+@require_tokenizers
 class TestMarian_en_ROMANCE(MarianIntegrationTest):
     """Multilingual on target side."""


@@ -2,7 +2,7 @@ import unittest

 from transformers import is_torch_available
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_modeling_bart import TOLERANCE, _long_tensor, assert_tensors_close

@@ -24,6 +24,8 @@ RO_CODE = 250020


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class AbstractSeq2SeqIntegrationTest(unittest.TestCase):
     maxDiff = 1000 # longer string compare tracebacks
     checkpoint_name = None
@@ -43,6 +45,8 @@ class AbstractSeq2SeqIntegrationTest(unittest.TestCase):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class MBartEnroIntegrationTest(AbstractSeq2SeqIntegrationTest):
     checkpoint_name = "facebook/mbart-large-en-ro"
     src_text = [
@@ -134,6 +138,8 @@ class MBartEnroIntegrationTest(AbstractSeq2SeqIntegrationTest):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class MBartCC25IntegrationTest(AbstractSeq2SeqIntegrationTest):
     checkpoint_name = "facebook/mbart-large-cc25"
     src_text = [

@@ -17,7 +17,7 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
@@ -411,6 +411,8 @@ TOLERANCE = 1e-3


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class MobileBertModelIntegrationTests(unittest.TestCase):
     @slow
     def test_inference_no_head(self):

@@ -3,7 +3,7 @@ import unittest
 from transformers import AutoConfig, AutoTokenizer, is_torch_available
 from transformers.configuration_pegasus import task_specific_params
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
 from transformers.utils.logging import ERROR, set_verbosity

 from .test_modeling_bart import PGE_ARTICLE
@@ -19,6 +19,8 @@ set_verbosity(ERROR)


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class PegasusXSUMIntegrationTest(AbstractSeq2SeqIntegrationTest):
     checkpoint_name = "google/pegasus-xsum"
     src_text = [PGE_ARTICLE, XSUM_ENTRY_LONGER]

@@ -23,13 +23,12 @@ from unittest.mock import patch

 import numpy as np

+from transformers import BartTokenizer, T5Tokenizer
 from transformers.file_utils import cached_property, is_datasets_available, is_faiss_available, is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
-from transformers.tokenization_bart import BartTokenizer
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
 from transformers.tokenization_bert import VOCAB_FILES_NAMES as DPR_VOCAB_FILES_NAMES
 from transformers.tokenization_dpr import DPRQuestionEncoderTokenizer
 from transformers.tokenization_roberta import VOCAB_FILES_NAMES as BART_VOCAB_FILES_NAMES
-from transformers.tokenization_t5 import T5Tokenizer

 from .test_modeling_bart import ModelTester as BartModelTester
 from .test_modeling_dpr import DPRModelTester
@@ -89,6 +88,7 @@ def require_retrieval(test_case):

 @require_torch
 @require_retrieval
+@require_sentencepiece
 class RagTestMixin:

     all_model_classes = (
@@ -438,6 +438,8 @@ class RagDPRT5Test(RagTestMixin, unittest.TestCase):

 @require_torch
 @require_retrieval
+@require_sentencepiece
+@require_tokenizers
 class RagModelIntegrationTests(unittest.TestCase):
     @cached_property
     def sequence_model(self):

@@ -16,7 +16,14 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import require_multigpu, require_torch, slow, torch_device
+from transformers.testing_utils import (
+    require_multigpu,
+    require_sentencepiece,
+    require_tokenizers,
+    require_torch,
+    slow,
+    torch_device,
+)

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
@@ -680,6 +687,8 @@ class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.T


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class ReformerIntegrationTests(unittest.TestCase):
     """
     These integration tests test the current layer activations and gradients againts the output of the Hugging Face Reformer model at time of integration: 29/06/2020. During integration, the model was tested against the output of the official Trax ReformerLM model for various cases ("lsh" only, "local" only, masked / non-masked, different chunk length, ....). In order to recover the original trax integration tests, one should use patrickvonplaten's fork of trax and the code that lives on the branch `reformer_trax_tests`.

@@ -17,7 +17,7 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
@@ -394,6 +394,8 @@ class RobertaModelTest(ModelTesterMixin, unittest.TestCase):
         self.assertTrue(torch.all(torch.eq(position_ids, expected_positions)))


+@require_sentencepiece
+@require_tokenizers
 class RobertaModelIntegrationTest(unittest.TestCase):
     @slow
     def test_inference_masked_lm(self):

@@ -17,7 +17,7 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
@@ -271,6 +271,8 @@ class SqueezeBertModelTest(ModelTesterMixin, unittest.TestCase):
             self.assertIsNotNone(model)


+@require_sentencepiece
+@require_tokenizers
 class SqueezeBertModelIntegrationTest(unittest.TestCase):
     @slow
     def test_inference_classification_head(self):

@@ -20,7 +20,7 @@ import unittest

 from transformers import is_torch_available
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
@@ -29,9 +29,8 @@ from .test_modeling_common import ModelTesterMixin, ids_tensor
 if is_torch_available():
     import torch

-    from transformers import T5Config, T5ForConditionalGeneration, T5Model
+    from transformers import T5Config, T5ForConditionalGeneration, T5Model, T5Tokenizer
     from transformers.modeling_t5 import T5_PRETRAINED_MODEL_ARCHIVE_LIST
-    from transformers.tokenization_t5 import T5Tokenizer


 class T5ModelTester:
@@ -546,6 +545,8 @@ def use_task_specific_params(model, task):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class T5ModelIntegrationTests(unittest.TestCase):
     @cached_property
     def model(self):

@@ -16,7 +16,7 @@
 import unittest

 from transformers import is_tf_available
-from transformers.testing_utils import require_tf, slow
+from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow


 if is_tf_available():
@@ -27,6 +27,8 @@ if is_tf_available():


 @require_tf
+@require_sentencepiece
+@require_tokenizers
 class TFCamembertModelIntegrationTest(unittest.TestCase):
     @slow
     def test_output_embeds_base_model(self):

@@ -16,7 +16,7 @@
 import unittest

 from transformers import is_tf_available
-from transformers.testing_utils import require_tf, slow
+from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow

 from .test_configuration_common import ConfigTester
 from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
@@ -332,6 +332,8 @@ class TFFlaubertModelTest(TFModelTesterMixin, unittest.TestCase):


 @require_tf
+@require_sentencepiece
+@require_tokenizers
 class TFFlaubertModelIntegrationTest(unittest.TestCase):
     @slow
     def test_output_embeds_base_model(self):

@@ -17,7 +17,7 @@
 import unittest

 from transformers import is_tf_available
-from transformers.testing_utils import require_tf, slow
+from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow

 from .test_configuration_common import ConfigTester
 from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
@@ -304,6 +304,8 @@ class TFLongformerModelTest(TFModelTesterMixin, unittest.TestCase):


 @require_tf
+@require_sentencepiece
+@require_tokenizers
 class TFLongformerModelIntegrationTest(unittest.TestCase):
     def _get_hidden_states(self):
         return tf.convert_to_tensor(

@@ -17,7 +17,7 @@
 import unittest

 from transformers import RobertaConfig, is_tf_available
-from transformers.testing_utils import require_tf, slow
+from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow

 from .test_configuration_common import ConfigTester
 from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
@@ -222,6 +222,8 @@ class TFRobertaModelTest(TFModelTesterMixin, unittest.TestCase):


 @require_tf
+@require_sentencepiece
+@require_tokenizers
 class TFRobertaModelIntegrationTest(unittest.TestCase):
     @slow
     def test_inference_masked_lm(self):

@@ -18,7 +18,7 @@ import unittest

 from transformers import T5Config, is_tf_available
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_tf, slow
+from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow

 from .test_configuration_common import ConfigTester
 from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
@@ -285,6 +285,8 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):


 @require_tf
+@require_sentencepiece
+@require_tokenizers
 class TFT5ModelIntegrationTests(unittest.TestCase):
     @cached_property
     def model(self):

@@ -16,7 +16,7 @@
 import unittest

 from transformers import is_tf_available
-from transformers.testing_utils import require_tf, slow
+from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow


 if is_tf_available():
@@ -27,6 +27,8 @@ if is_tf_available():


 @require_tf
+@require_sentencepiece
+@require_tokenizers
 class TFFlaubertModelIntegrationTest(unittest.TestCase):
     @slow
     def test_output_embeds_base_model(self):

@@ -17,7 +17,7 @@
 import unittest

 from transformers import is_torch_available
-from transformers.testing_utils import slow
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, slow


 if is_torch_available():
@@ -26,6 +26,8 @@ if is_torch_available():
     from transformers import XLMRobertaModel


+@require_sentencepiece
+@require_tokenizers
 class XLMRobertaModelIntegrationTest(unittest.TestCase):
     @slow
     def test_xlm_roberta_base(self):

@@ -10,7 +10,7 @@ from transformers.convert_graph_to_onnx import (
     infer_shapes,
     quantize,
 )
-from transformers.testing_utils import require_tf, require_torch, slow
+from transformers.testing_utils import require_tf, require_tokenizers, require_torch, slow


 class FuncContiguousArgs:
@@ -94,6 +94,7 @@ class OnnxExportTestCase(unittest.TestCase):
             self.fail(e)

     @require_torch
+    @require_tokenizers
     def test_infer_dynamic_axis_pytorch(self):
         """
         Validate the dynamic axis generated for each parameters are correct
@@ -105,6 +106,7 @@ class OnnxExportTestCase(unittest.TestCase):
         self._test_infer_dynamic_axis(model, tokenizer, "pt")

     @require_tf
+    @require_tokenizers
     def test_infer_dynamic_axis_tf(self):
         """
         Validate the dynamic axis generated for each parameters are correct

@@ -3,7 +3,7 @@ from typing import Iterable, List, Optional

 from transformers import pipeline
 from transformers.pipelines import SUPPORTED_TASKS, Conversation, DefaultArgumentHandler, Pipeline
-from transformers.testing_utils import require_tf, require_torch, slow, torch_device
+from transformers.testing_utils import require_tf, require_tokenizers, require_torch, slow, torch_device


 DEFAULT_DEVICE_NUM = -1 if torch_device == "cpu" else 0
@@ -342,6 +342,7 @@ class MonoColumnInputTestCase(unittest.TestCase):
             )

     @require_torch
+    @require_tokenizers
     def test_torch_summarization(self):
         invalid_inputs = [4, "<mask>"]
         mandatory_keys = ["summary_text"]
@@ -377,6 +378,7 @@ class MonoColumnInputTestCase(unittest.TestCase):
             )

     @require_torch
+    @require_tokenizers
     def test_torch_translation(self):
         invalid_inputs = [4, "<mask>"]
         mandatory_keys = ["translation_text"]
@@ -399,6 +401,7 @@ class MonoColumnInputTestCase(unittest.TestCase):
             self._test_mono_column_pipeline(nlp, VALID_INPUTS, mandatory_keys, invalid_inputs=invalid_inputs)

     @require_torch
+    @require_tokenizers
     def test_torch_text2text(self):
         invalid_inputs = [4, "<mask>"]
         mandatory_keys = ["generated_text"]

@@ -14,7 +14,13 @@ from transformers.configuration_bart import BartConfig
 from transformers.configuration_dpr import DPRConfig
 from transformers.configuration_rag import RagConfig
 from transformers.retrieval_rag import RagRetriever
-from transformers.testing_utils import require_datasets, require_faiss, require_torch
+from transformers.testing_utils import (
+    require_datasets,
+    require_faiss,
+    require_sentencepiece,
+    require_tokenizers,
+    require_torch,
+)
 from transformers.tokenization_bart import BartTokenizer
 from transformers.tokenization_bert import VOCAB_FILES_NAMES as DPR_VOCAB_FILES_NAMES
 from transformers.tokenization_dpr import DPRQuestionEncoderTokenizer
@@ -189,6 +195,8 @@ class RagRetrieverTest(TestCase):
         self.assertListEqual(doc_ids.tolist(), [[1], [0]])

     @require_torch
+    @require_tokenizers
+    @require_sentencepiece
     def test_hf_index_retriever_call(self):
         import torch

|
||||
|
@@ -17,7 +17,8 @@
import os
import unittest

from transformers.tokenization_albert import AlbertTokenizer, AlbertTokenizerFast
from transformers import AlbertTokenizer, AlbertTokenizerFast
from transformers.testing_utils import require_sentencepiece, require_tokenizers

from .test_tokenization_common import TokenizerTesterMixin

@@ -25,6 +26,8 @@ from .test_tokenization_common import TokenizerTesterMixin
SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/spiece.model")


@require_sentencepiece
@require_tokenizers
class AlbertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = AlbertTokenizer

@@ -33,6 +33,7 @@ from transformers.testing_utils import (
    DUMMY_DIFF_TOKENIZER_IDENTIFIER,
    DUMMY_UNKWOWN_IDENTIFIER,
    SMALL_MODEL_IDENTIFIER,
    require_tokenizers,
)
from transformers.tokenization_auto import TOKENIZER_MAPPING

@@ -70,6 +71,7 @@ class AutoTokenizerTest(unittest.TestCase):
        self.assertIsInstance(tokenizer, (BertTokenizer, BertTokenizerFast))
        self.assertEqual(tokenizer.vocab_size, 12)

    @require_tokenizers
    def test_tokenizer_identifier_with_correct_config(self):
        for tokenizer_class in [BertTokenizer, BertTokenizerFast, AutoTokenizer]:
            tokenizer = tokenizer_class.from_pretrained("wietsedv/bert-base-dutch-cased")
@@ -82,6 +84,7 @@ class AutoTokenizerTest(unittest.TestCase):

            self.assertEqual(tokenizer.max_len, 512)

    @require_tokenizers
    def test_tokenizer_identifier_non_existent(self):
        for tokenizer_class in [BertTokenizer, BertTokenizerFast, AutoTokenizer]:
            with self.assertRaises(EnvironmentError):
@@ -101,12 +104,16 @@ class AutoTokenizerTest(unittest.TestCase):
                        msg="Testing if {} is child of {}".format(child_config.__name__, parent_config.__name__)
                    ):
                        self.assertFalse(issubclass(child_config, parent_config))
                        self.assertFalse(issubclass(child_model_py, parent_model_py))

                        # Check for Slow tokenizer implementation if provided
                        if child_model_py and parent_model_py:
                            self.assertFalse(issubclass(child_model_py, parent_model_py))

                        # Check for Fast tokenizer implementation if provided
                        if child_model_fast and parent_model_fast:
                            self.assertFalse(issubclass(child_model_fast, parent_model_fast))

    @require_tokenizers
    def test_from_pretrained_use_fast_toggle(self):
        self.assertIsInstance(AutoTokenizer.from_pretrained("bert-base-cased"), BertTokenizer)
        self.assertIsInstance(AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True), BertTokenizerFast)

@@ -4,16 +4,19 @@ import unittest

from transformers import BartTokenizer, BartTokenizerFast, BatchEncoding
from transformers.file_utils import cached_property
from transformers.testing_utils import require_torch
from transformers.testing_utils import require_tokenizers, require_torch
from transformers.tokenization_roberta import VOCAB_FILES_NAMES

from .test_tokenization_common import TokenizerTesterMixin
from .test_tokenization_common import TokenizerTesterMixin, filter_roberta_detectors


@require_tokenizers
class TestTokenizationBart(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = BartTokenizer
    rust_tokenizer_class = BartTokenizerFast
    test_rust_tokenizer = True
    from_pretrained_filter = filter_roberta_detectors
    # from_pretrained_kwargs = {'add_prefix_space': True}

    def setUp(self):
        super().setUp()
@@ -56,7 +59,7 @@ class TestTokenizationBart(TokenizerTesterMixin, unittest.TestCase):

    def get_rust_tokenizer(self, **kwargs):
        kwargs.update(self.special_tokens_map)
        return BartTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
        return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)

    def get_input_output_texts(self, tokenizer):
        return "lower newer", "lower newer"
@@ -145,3 +148,38 @@ class TestTokenizationBart(TokenizerTesterMixin, unittest.TestCase):
        self.assertTrue((labels[:, 0] == tokenizer.bos_token_id).all().item())
        self.assertTrue((input_ids[:, -1] == tokenizer.eos_token_id).all().item())
        self.assertTrue((labels[:, -1] == tokenizer.eos_token_id).all().item())

    def test_pretokenized_inputs(self):
        pass

    def test_embeded_special_tokens(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                sentence = "A, <mask> AllenNLP sentence."
                tokens_r = tokenizer_r.encode_plus(sentence, add_special_tokens=True, return_token_type_ids=True)
                tokens_p = tokenizer_p.encode_plus(sentence, add_special_tokens=True, return_token_type_ids=True)

                # token_type_ids should put 0 everywhere
                self.assertEqual(sum(tokens_r["token_type_ids"]), sum(tokens_p["token_type_ids"]))

                # attention_mask should put 1 everywhere, so sum over length should be 1
                self.assertEqual(
                    sum(tokens_r["attention_mask"]) / len(tokens_r["attention_mask"]),
                    sum(tokens_p["attention_mask"]) / len(tokens_p["attention_mask"]),
                )

                tokens_r_str = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
                tokens_p_str = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])

                # Rust correctly handles the space before the mask while Python doesn't
                self.assertSequenceEqual(tokens_p["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
                self.assertSequenceEqual(tokens_r["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])

                self.assertSequenceEqual(
                    tokens_p_str, ["<s>", "A", ",", "<mask>", "ĠAllen", "N", "LP", "Ġsentence", ".", "</s>"]
                )
                self.assertSequenceEqual(
                    tokens_r_str, ["<s>", "A", ",", "<mask>", "ĠAllen", "N", "LP", "Ġsentence", ".", "</s>"]
                )

@@ -17,27 +17,29 @@
import os
import unittest

from transformers.testing_utils import slow
from transformers import BertTokenizerFast
from transformers.testing_utils import require_tokenizers, slow
from transformers.tokenization_bert import (
    VOCAB_FILES_NAMES,
    BasicTokenizer,
    BertTokenizer,
    BertTokenizerFast,
    WordpieceTokenizer,
    _is_control,
    _is_punctuation,
    _is_whitespace,
)

from .test_tokenization_common import TokenizerTesterMixin
from .test_tokenization_common import TokenizerTesterMixin, filter_non_english


@require_tokenizers
class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BertTokenizer
    rust_tokenizer_class = BertTokenizerFast
    test_rust_tokenizer = True
    space_between_special_tokens = True
    from_pretrained_filter = filter_non_english

    def setUp(self):
        super().setUp()
@@ -245,3 +247,55 @@ class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

        assert encoded_sentence == [101] + text + [102]
        assert encoded_pair == [101] + text + [102] + text_2 + [102]

    def test_offsets_with_special_characters(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                sentence = f"A, naïve {tokenizer_r.mask_token} AllenNLP sentence."
                tokens = tokenizer_r.encode_plus(
                    sentence,
                    return_attention_mask=False,
                    return_token_type_ids=False,
                    return_offsets_mapping=True,
                    add_special_tokens=True,
                )

                do_lower_case = tokenizer_r.do_lower_case if hasattr(tokenizer_r, "do_lower_case") else False
                expected_results = (
                    [
                        ((0, 0), tokenizer_r.cls_token),
                        ((0, 1), "A"),
                        ((1, 2), ","),
                        ((3, 5), "na"),
                        ((5, 6), "##ï"),
                        ((6, 8), "##ve"),
                        ((9, 15), tokenizer_r.mask_token),
                        ((16, 21), "Allen"),
                        ((21, 23), "##NL"),
                        ((23, 24), "##P"),
                        ((25, 33), "sentence"),
                        ((33, 34), "."),
                        ((0, 0), tokenizer_r.sep_token),
                    ]
                    if not do_lower_case
                    else [
                        ((0, 0), tokenizer_r.cls_token),
                        ((0, 1), "a"),
                        ((1, 2), ","),
                        ((3, 8), "naive"),
                        ((9, 15), tokenizer_r.mask_token),
                        ((16, 21), "allen"),
                        ((21, 23), "##nl"),
                        ((23, 24), "##p"),
                        ((25, 33), "sentence"),
                        ((33, 34), "."),
                        ((0, 0), tokenizer_r.sep_token),
                    ]
                )

                self.assertEqual(
                    [e[1] for e in expected_results], tokenizer_r.convert_ids_to_tokens(tokens["input_ids"])
                )
                self.assertEqual([e[0] for e in expected_results], tokens["offset_mapping"])

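Note on the offsets test above: each `(start, end)` pair in `offset_mapping` indexes back into the original string. The sketch below illustrates the same idea with a plain regex splitter, not the WordPiece tokenizer exercised by the test; the function name `tokenize_with_offsets` is hypothetical and the sample sentence omits the mask token.

```python
import re


def tokenize_with_offsets(text):
    """Split into word runs and single punctuation marks, keeping (start, end) character spans."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


# "\u00ef" is the precomposed character ï, so it occupies exactly one position.
sentence = "A, na\u00efve AllenNLP sentence."
pairs = tokenize_with_offsets(sentence)

# Every span slices the original text back to its token.
round_trip_ok = all(sentence[start:end] == token for token, (start, end) in pairs)
```

The real test additionally expects `(0, 0)` for special tokens such as `[CLS]` and `[SEP]`, since they do not correspond to any span of the input.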
@@ -17,9 +17,9 @@
import os
import unittest

from transformers import BertGenerationTokenizer
from transformers.file_utils import cached_property
from transformers.testing_utils import require_torch, slow
from transformers.tokenization_bert_generation import BertGenerationTokenizer
from transformers.testing_utils import require_sentencepiece, require_torch, slow

from .test_tokenization_common import TokenizerTesterMixin

@@ -29,6 +29,7 @@ SPIECE_UNDERLINE = "▁"
SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/test_sentencepiece.model")


@require_sentencepiece
class BertGenerationTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BertGenerationTokenizer

@@ -19,12 +19,12 @@ import pickle
import unittest

from transformers.testing_utils import custom_tokenizers
from transformers.tokenization_bert import WordpieceTokenizer
from transformers.tokenization_bert_japanese import (
    VOCAB_FILES_NAMES,
    BertJapaneseTokenizer,
    CharacterTokenizer,
    MecabTokenizer,
    WordpieceTokenizer,
)

from .test_tokenization_common import TokenizerTesterMixin

@@ -17,8 +17,8 @@
import os
import unittest

from transformers.testing_utils import _torch_available
from transformers.tokenization_camembert import CamembertTokenizer, CamembertTokenizerFast
from transformers import CamembertTokenizer, CamembertTokenizerFast
from transformers.testing_utils import _torch_available, require_sentencepiece, require_tokenizers

from .test_tokenization_common import TokenizerTesterMixin

@@ -28,6 +28,8 @@ SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixture
FRAMEWORK = "pt" if _torch_available else "tf"


@require_sentencepiece
@require_tokenizers
class CamembertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = CamembertTokenizer

@@ -14,16 +14,18 @@
# limitations under the License.


import inspect
import os
import pickle
import re
import shutil
import tempfile
from collections import OrderedDict
from itertools import takewhile
from typing import TYPE_CHECKING, Dict, List, Tuple, Union

from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast
from transformers.testing_utils import require_tf, require_torch, slow
from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast, is_torch_available
from transformers.testing_utils import get_tests_dir, require_tf, require_tokenizers, require_torch, slow
from transformers.tokenization_utils import AddedToken


@@ -31,6 +33,18 @@ if TYPE_CHECKING:
    from transformers import PretrainedConfig, PreTrainedModel, TFPreTrainedModel


NON_ENGLISH_TAGS = ["chinese", "dutch", "french", "finnish", "german", "multilingual"]


def filter_non_english(_, pretrained_name: str):
    """ Filter out all pretrained models for non-English languages """
    return not any([lang in pretrained_name for lang in NON_ENGLISH_TAGS])


def filter_roberta_detectors(_, pretrained_name: str):
    return "detector" not in pretrained_name


def merge_model_tokenizer_mappings(
    model_mapping: Dict["PretrainedConfig", Union["PreTrainedModel", "TFPreTrainedModel"]],
    tokenizer_mapping: Dict["PretrainedConfig", Tuple["PreTrainedTokenizer", "PreTrainedTokenizerFast"]],
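Note on the filter helpers above: they take a throwaway first argument because they are stored as class attributes (`from_pretrained_filter = filter_non_english`) and then called through `self`, so Python binds the test instance to `_`. A minimal sketch of that mechanism follows; the class `FakeTesterMixin`, its `selected` method, and the checkpoint list are illustrative assumptions, not code from this PR.

```python
NON_ENGLISH_TAGS = ["chinese", "dutch", "french", "finnish", "german", "multilingual"]


def filter_non_english(_, pretrained_name: str) -> bool:
    """Keep a checkpoint name only if it carries no non-English tag."""
    return not any(lang in pretrained_name for lang in NON_ENGLISH_TAGS)


class FakeTesterMixin:
    """Hypothetical stand-in for a tester class that selects checkpoints by name."""

    # Illustrative names only; the real list comes from pretrained_vocab_files_map.
    checkpoint_names = [
        "bert-base-uncased",
        "bert-base-german-cased",
        "bert-base-multilingual-cased",
        "bert-base-cased",
    ]
    # Stored as a class attribute, so `self.from_pretrained_filter(name)` passes
    # the instance as the first (ignored) argument.
    from_pretrained_filter = filter_non_english

    def selected(self):
        return [
            name
            for name in self.checkpoint_names
            if self.from_pretrained_filter is None or self.from_pretrained_filter(name)
        ]


kept = FakeTesterMixin().selected()
```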
@@ -59,8 +73,32 @@ class TokenizerTesterMixin:
    rust_tokenizer_class = None
    test_rust_tokenizer = False
    space_between_special_tokens = False
    from_pretrained_kwargs = None
    from_pretrained_filter = None
    from_pretrained_vocab_key = "vocab_file"

    def setUp(self) -> None:
        # Tokenizer.filter makes it possible to filter which Tokenizer to test based on all the
        # information available in Tokenizer (name, rust class, python class, vocab key name)
        if self.test_rust_tokenizer:
            tokenizers_list = [
                (
                    self.rust_tokenizer_class,
                    pretrained_name,
                    self.from_pretrained_kwargs if self.from_pretrained_kwargs is not None else {},
                )
                for pretrained_name in self.rust_tokenizer_class.pretrained_vocab_files_map[
                    self.from_pretrained_vocab_key
                ].keys()
                if self.from_pretrained_filter is None
                or (self.from_pretrained_filter is not None and self.from_pretrained_filter(pretrained_name))
            ]
            self.tokenizers_list = tokenizers_list[:1]  # Let's just test the first pretrained vocab for speed
        else:
            self.tokenizers_list = []
        with open(f"{get_tests_dir()}/fixtures/sample_text.txt", encoding="utf-8") as f_data:
            self._data = f_data.read().replace("\n\n", "\n").strip()

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()

    def tearDown(self):
@@ -123,6 +161,15 @@ class TokenizerTesterMixin:
            for i in range(len(batch_encode_plus_sequences["input_ids"]))
        ]

    def test_rust_tokenizer_signature(self):
        if not self.test_rust_tokenizer:
            return

        signature = inspect.signature(self.rust_tokenizer_class.__init__)

        self.assertIn("tokenizer_file", signature.parameters)
        self.assertIsNone(signature.parameters["tokenizer_file"].default)

    def test_rust_and_python_full_tokenizers(self):
        if not self.test_rust_tokenizer:
            return
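Note on `test_rust_tokenizer_signature` above: it relies on `inspect.signature` to confirm that `tokenizer_file` exists as an `__init__` parameter and defaults to `None`. The sketch below shows the same check against a hypothetical class; `FakeFastTokenizer` and `has_optional_parameter` are illustrative names, not part of the library.

```python
import inspect


class FakeFastTokenizer:
    """Hypothetical stand-in for a fast tokenizer class."""

    def __init__(self, vocab_file, tokenizer_file=None, **kwargs):
        self.vocab_file = vocab_file
        self.tokenizer_file = tokenizer_file


def has_optional_parameter(cls, name):
    """Return True if `cls.__init__` accepts `name` and its default is None."""
    params = inspect.signature(cls.__init__).parameters
    return name in params and params[name].default is None
```

A required parameter such as `vocab_file` reports `inspect.Parameter.empty` as its default, so it does not pass the `is None` check.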
@@ -206,7 +253,6 @@ class TokenizerTesterMixin:

                shutil.rmtree(tmpdirname)

        # Now let's start the test
        tokenizers = self.get_tokenizers(model_max_length=42)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
@@ -237,6 +283,39 @@ class TokenizerTesterMixin:

                shutil.rmtree(tmpdirname)

        # Test that we can also use the non-legacy saving format for fast tokenizers
        tokenizers = self.get_tokenizers(model_max_length=42)
        for tokenizer in tokenizers:
            if not tokenizer.is_fast:
                continue
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                # Isolate this from the other tests because we save additional tokens/etc
                tmpdirname = tempfile.mkdtemp()

                sample_text = " He is very happy, UNwant\u00E9d,running"
                tokenizer.add_tokens(["bim", "bambam"])
                additional_special_tokens = tokenizer.additional_special_tokens
                additional_special_tokens.append("new_additional_special_token")
                tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
                before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
                before_vocab = tokenizer.get_vocab()
                tokenizer.save_pretrained(tmpdirname)

                after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
                after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
                after_vocab = after_tokenizer.get_vocab()
                self.assertListEqual(before_tokens, after_tokens)
                self.assertDictEqual(before_vocab, after_vocab)
                self.assertIn("bim", after_vocab)
                self.assertIn("bambam", after_vocab)
                self.assertIn("new_additional_special_token", after_tokenizer.additional_special_tokens)
                self.assertEqual(after_tokenizer.model_max_length, 42)

                tokenizer = tokenizer.__class__.from_pretrained(tmpdirname, model_max_length=43)
                self.assertEqual(tokenizer.model_max_length, 43)

                shutil.rmtree(tmpdirname)

    def test_pickle_tokenizer(self):
        """Google pickle __getstate__ __setstate__ if you are struggling with this."""
        tokenizers = self.get_tokenizers()
@@ -258,6 +337,7 @@ class TokenizerTesterMixin:

                self.assertListEqual(subwords, subwords_loaded)

    @require_tokenizers
    def test_pickle_added_tokens(self):
        tok1 = AddedToken("<s>", rstrip=True, lstrip=True, normalized=False, single_word=True)
        tok2 = pickle.loads(pickle.dumps(tok1))
@@ -419,6 +499,7 @@ class TokenizerTesterMixin:

                self.assertEqual(text_2, output_text)

    @require_tokenizers
    def test_encode_decode_with_spaces(self):
        tokenizers = self.get_tokenizers(do_lower_case=False)
        for tokenizer in tokenizers:
@@ -437,6 +518,15 @@ class TokenizerTesterMixin:
                self.assertIn(decoded, [output, output.lower()])

    def test_pretrained_model_lists(self):
        # We should have at least one default checkpoint for each tokenizer
        # We should specify the max input length as well (used in some part to list the pretrained checkpoints)
        self.assertGreaterEqual(len(self.tokenizer_class.pretrained_vocab_files_map), 1)
        self.assertGreaterEqual(len(list(self.tokenizer_class.pretrained_vocab_files_map.values())[0]), 1)
        self.assertEqual(
            len(list(self.tokenizer_class.pretrained_vocab_files_map.values())[0]),
            len(self.tokenizer_class.max_model_input_sizes),
        )

        weights_list = list(self.tokenizer_class.max_model_input_sizes.keys())
        weights_lists_2 = []
        for file_id, map_list in self.tokenizer_class.pretrained_vocab_files_map.items():
@@ -1226,6 +1316,7 @@ class TokenizerTesterMixin:
                        encoded_sequences_batch_padded_2[key],
                    )

    @require_tokenizers
    def test_added_token_serializable(self):
        tokenizers = self.get_tokenizers(do_lower_case=False)
        for tokenizer in tokenizers:
@@ -1652,3 +1743,772 @@ class TokenizerTesterMixin:
                self.assertEqual(batch_encoder_only.input_ids.shape[1], 3)
                self.assertEqual(batch_encoder_only.attention_mask.shape[1], 3)
                self.assertNotIn("decoder_input_ids", batch_encoder_only)

    def test_is_fast(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                # Check is_fast is set correctly
                self.assertFalse(tokenizer_p.is_fast)
                self.assertTrue(tokenizer_r.is_fast)

    def test_fast_only_inputs(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                # Ensure None raises an error
                self.assertRaises(TypeError, tokenizer_r.tokenize, None)
                self.assertRaises(TypeError, tokenizer_r.encode, None)
                self.assertRaises(TypeError, tokenizer_r.encode_plus, None)
                self.assertRaises(TypeError, tokenizer_r.batch_encode_plus, None)

    def test_alignement_methods(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                words = ["Wonderful", "no", "inspiration", "example", "with", "subtoken"]
                text = " ".join(words)
                batch_size = 3

                encoding = tokenizer_r.encode_plus(text, add_special_tokens=False)

                batch_encoding = tokenizer_r.batch_encode_plus([text] * batch_size, add_special_tokens=False)
                num_tokens = len(encoding["input_ids"])

                last_word_index = len(words) - 1
                last_token_index = num_tokens - 1
                last_batch_index = batch_size - 1
                last_char_index = len(text) - 1

                # words, tokens
                self.assertEqual(len(encoding.words(0)), num_tokens)
                self.assertEqual(max(encoding.words(0)), last_word_index)
                self.assertEqual(min(encoding.words(0)), 0)
                self.assertEqual(len(batch_encoding.words(last_batch_index)), num_tokens)
                self.assertEqual(max(batch_encoding.words(last_batch_index)), last_word_index)
                self.assertEqual(min(batch_encoding.words(last_batch_index)), 0)
                self.assertEqual(len(encoding.tokens(0)), num_tokens)

                # Assert token_to_word
                self.assertEqual(encoding.token_to_word(0), 0)
                self.assertEqual(encoding.token_to_word(0, 0), 0)
                self.assertEqual(encoding.token_to_word(last_token_index), last_word_index)
                self.assertEqual(encoding.token_to_word(0, last_token_index), last_word_index)
                self.assertEqual(batch_encoding.token_to_word(1, 0), 0)
                self.assertEqual(batch_encoding.token_to_word(0, last_token_index), last_word_index)
                self.assertEqual(batch_encoding.token_to_word(last_batch_index, last_token_index), last_word_index)

                # Assert word_to_tokens
                self.assertEqual(encoding.word_to_tokens(0).start, 0)
                self.assertEqual(encoding.word_to_tokens(0, 0).start, 0)
                self.assertEqual(encoding.word_to_tokens(last_word_index).end, last_token_index + 1)
                self.assertEqual(encoding.word_to_tokens(0, last_word_index).end, last_token_index + 1)
                self.assertEqual(batch_encoding.word_to_tokens(1, 0).start, 0)
                self.assertEqual(batch_encoding.word_to_tokens(0, last_word_index).end, last_token_index + 1)
                self.assertEqual(
                    batch_encoding.word_to_tokens(last_batch_index, last_word_index).end, last_token_index + 1
                )

                # Assert token_to_chars
                self.assertEqual(encoding.token_to_chars(0).start, 0)
                self.assertEqual(encoding.token_to_chars(0, 0).start, 0)
                self.assertEqual(encoding.token_to_chars(last_token_index).end, last_char_index + 1)
                self.assertEqual(encoding.token_to_chars(0, last_token_index).end, last_char_index + 1)
                self.assertEqual(batch_encoding.token_to_chars(1, 0).start, 0)
                self.assertEqual(batch_encoding.token_to_chars(0, last_token_index).end, last_char_index + 1)
                self.assertEqual(
                    batch_encoding.token_to_chars(last_batch_index, last_token_index).end, last_char_index + 1
                )

                # Assert char_to_token
                self.assertEqual(encoding.char_to_token(0), 0)
                self.assertEqual(encoding.char_to_token(0, 0), 0)
                self.assertEqual(encoding.char_to_token(last_char_index), last_token_index)
                self.assertEqual(encoding.char_to_token(0, last_char_index), last_token_index)
                self.assertEqual(batch_encoding.char_to_token(1, 0), 0)
                self.assertEqual(batch_encoding.char_to_token(0, last_char_index), last_token_index)
                self.assertEqual(batch_encoding.char_to_token(last_batch_index, last_char_index), last_token_index)

                # Assert char_to_word
                self.assertEqual(encoding.char_to_word(0), 0)
                self.assertEqual(encoding.char_to_word(0, 0), 0)
                self.assertEqual(encoding.char_to_word(last_char_index), last_word_index)
                self.assertEqual(encoding.char_to_word(0, last_char_index), last_word_index)
                self.assertEqual(batch_encoding.char_to_word(1, 0), 0)
                self.assertEqual(batch_encoding.char_to_word(0, last_char_index), last_word_index)
                self.assertEqual(batch_encoding.char_to_word(last_batch_index, last_char_index), last_word_index)

                # Assert word_to_chars
                self.assertEqual(encoding.word_to_chars(0).start, 0)
                self.assertEqual(encoding.word_to_chars(0, 0).start, 0)
                self.assertEqual(encoding.word_to_chars(last_word_index).end, last_char_index + 1)
                self.assertEqual(encoding.word_to_chars(0, last_word_index).end, last_char_index + 1)
                self.assertEqual(batch_encoding.word_to_chars(1, 0).start, 0)
                self.assertEqual(batch_encoding.word_to_chars(0, last_word_index).end, last_char_index + 1)
                self.assertEqual(
                    batch_encoding.word_to_chars(last_batch_index, last_word_index).end, last_char_index + 1
                )

    def test_tokenization_python_rust_equals(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                # Ensure basic input match
                input_p = tokenizer_p.encode_plus(self._data)
                input_r = tokenizer_r.encode_plus(self._data)

                for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
                    self.assertSequenceEqual(input_p[key], input_r[key])

                input_pairs_p = tokenizer_p.encode_plus(self._data, self._data)
                input_pairs_r = tokenizer_r.encode_plus(self._data, self._data)

                for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
                    self.assertSequenceEqual(input_pairs_p[key], input_pairs_r[key])

                # Ensure truncation match
                input_p = tokenizer_p.encode_plus(self._data, max_length=512, truncation=True)
                input_r = tokenizer_r.encode_plus(self._data, max_length=512, truncation=True)

                for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
                    self.assertSequenceEqual(input_p[key], input_r[key])

                # Ensure truncation with stride match
                input_p = tokenizer_p.encode_plus(
                    self._data, max_length=512, truncation=True, stride=3, return_overflowing_tokens=True
                )
                input_r = tokenizer_r.encode_plus(
                    self._data, max_length=512, truncation=True, stride=3, return_overflowing_tokens=True
                )

                for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
                    self.assertSequenceEqual(input_p[key], input_r[key][0])

    def test_num_special_tokens_to_add_equal(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                # Check we have the same number of added_tokens for both pair and non-pair inputs.
                self.assertEqual(
                    tokenizer_r.num_special_tokens_to_add(False), tokenizer_p.num_special_tokens_to_add(False)
                )
                self.assertEqual(
                    tokenizer_r.num_special_tokens_to_add(True), tokenizer_p.num_special_tokens_to_add(True)
                )

    def test_max_length_equal(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                # Check we have the correct max_length for both pair and non-pair inputs.
                self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
                self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)

    def test_special_tokens_map_equal(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                # Assert the set of special tokens match.
                self.assertSequenceEqual(
                    tokenizer_p.special_tokens_map.items(),
                    tokenizer_r.special_tokens_map.items(),
                )

    def test_add_tokens(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                vocab_size = len(tokenizer_r)
                self.assertEqual(tokenizer_r.add_tokens(""), 0)
                self.assertEqual(tokenizer_r.add_tokens("testoken"), 1)
                self.assertEqual(tokenizer_r.add_tokens(["testoken1", "testtoken2"]), 2)
                self.assertEqual(len(tokenizer_r), vocab_size + 3)

                self.assertEqual(tokenizer_r.add_special_tokens({}), 0)
                self.assertEqual(tokenizer_r.add_special_tokens({"bos_token": "[BOS]", "eos_token": "[EOS]"}), 2)
                self.assertRaises(
                    AssertionError, tokenizer_r.add_special_tokens, {"additional_special_tokens": "<testtoken1>"}
                )
                self.assertEqual(tokenizer_r.add_special_tokens({"additional_special_tokens": ["<testtoken2>"]}), 1)
                self.assertEqual(
                    tokenizer_r.add_special_tokens({"additional_special_tokens": ["<testtoken3>", "<testtoken4>"]}), 2
                )
                self.assertEqual(len(tokenizer_r), vocab_size + 8)

    def test_offsets_mapping(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                text = "Wonderful no inspiration example with subtoken"
                pair = "Along with an awesome pair"

                # No pair
                tokens_with_offsets = tokenizer_r.encode_plus(
                    text, return_special_tokens_mask=True, return_offsets_mapping=True, add_special_tokens=True
                )
                added_tokens = tokenizer_r.num_special_tokens_to_add(False)
                offsets = tokens_with_offsets["offset_mapping"]

                # Assert there is the same number of tokens and offsets
                self.assertEqual(len(offsets), len(tokens_with_offsets["input_ids"]))

                # Assert there are only added_tokens special_tokens
                self.assertEqual(sum(tokens_with_offsets["special_tokens_mask"]), added_tokens)

                # Pairs
                tokens_with_offsets = tokenizer_r.encode_plus(
                    text, pair, return_special_tokens_mask=True, return_offsets_mapping=True, add_special_tokens=True
                )
                added_tokens = tokenizer_r.num_special_tokens_to_add(True)
                offsets = tokens_with_offsets["offset_mapping"]

                # Assert there is the same number of tokens and offsets
                self.assertEqual(len(offsets), len(tokens_with_offsets["input_ids"]))

                # Assert there are only added_tokens special_tokens
                self.assertEqual(sum(tokens_with_offsets["special_tokens_mask"]), added_tokens)

||||
    def test_batch_encode_dynamic_overflowing(self):
        """
        When calling batch_encode with multiple sequences, it can return a different number of
        overflowing encodings for each sequence:
        [
          Sequence 1: [Encoding 1, Encoding 2],
          Sequence 2: [Encoding 1],
          Sequence 3: [Encoding 1, Encoding 2, ... Encoding N]
        ]
        This needs to be padded so that it can be represented as a tensor
        """
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            tokenizer = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

            with self.subTest(
                "{} ({}, {})".format(tokenizer.__class__.__name__, pretrained_name, tokenizer.__class__.__name__)
            ):

                returned_tensor = "pt" if is_torch_available() else "tf"

                if not tokenizer.pad_token or tokenizer.pad_token_id < 0:
                    return

                tokens = tokenizer.encode_plus(
                    "HuggingFace is solving NLP one commit at a time",
                    max_length=6,
                    padding=True,
                    truncation=True,
                    return_tensors=returned_tensor,
                    return_overflowing_tokens=True,
                )

                for key in filter(lambda x: "overflow_to_sample_mapping" not in x, tokens.keys()):
                    self.assertEqual(len(tokens[key].shape), 2)

                # Mono sample
                tokens = tokenizer.batch_encode_plus(
                    ["HuggingFace is solving NLP one commit at a time"],
                    max_length=6,
                    padding=True,
                    truncation="only_first",
                    return_tensors=returned_tensor,
                    return_overflowing_tokens=True,
                )

                for key in filter(lambda x: "overflow_to_sample_mapping" not in x, tokens.keys()):
                    self.assertEqual(len(tokens[key].shape), 2)
                    self.assertEqual(tokens[key].shape[-1], 6)

                # Multi sample
                tokens = tokenizer.batch_encode_plus(
                    ["HuggingFace is solving NLP one commit at a time", "Very tiny input"],
                    max_length=6,
                    padding=True,
                    truncation="only_first",
                    return_tensors=returned_tensor,
                    return_overflowing_tokens=True,
                )

                for key in filter(lambda x: "overflow_to_sample_mapping" not in x, tokens.keys()):
                    self.assertEqual(len(tokens[key].shape), 2)
                    self.assertEqual(tokens[key].shape[-1], 6)
|
||||
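Note on the shape requirement in test_batch_encode_dynamic_overflowing: each input sequence may yield a different number of overflowing encodings, so every encoding has to be padded to a common width before the batch can become a rectangular tensor. The sketch below shows that idea on plain lists of ids. `chunk_and_pad`, its parameters, and the returned keys are hypothetical names chosen for illustration; the sketch ignores special tokens and strides and is not how the tokenizers library implements overflow.

```python
from typing import Dict, List


def chunk_and_pad(batch: List[List[int]], max_length: int, pad_id: int = 0) -> Dict[str, List]:
    """Split each sequence into rows of max_length ids and pad the last row (toy sketch)."""
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    overflow_to_sample_mapping: List[int] = []
    for sample_idx, ids in enumerate(batch):
        # An empty sequence still produces one fully padded row.
        starts = range(0, len(ids), max_length) if ids else [0]
        for start in starts:
            chunk = ids[start : start + max_length]
            n_pad = max_length - len(chunk)
            input_ids.append(chunk + [pad_id] * n_pad)
            attention_mask.append([1] * len(chunk) + [0] * n_pad)
            # Remember which input sequence this row came from.
            overflow_to_sample_mapping.append(sample_idx)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "overflow_to_sample_mapping": overflow_to_sample_mapping,
    }


encoded = chunk_and_pad([[11, 12, 13, 14, 15, 16, 17, 18], [21, 22, 23]], max_length=6)
```

Every row of `input_ids` and `attention_mask` has the same length here, which is the property the test checks through `shape[-1] == 6`; the mapping list is one-dimensional, which is why the test filters that key out before checking for two dimensions.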
    def test_compare_pretokenized_inputs(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                if hasattr(tokenizer_p, "add_prefix_space") and not tokenizer_p.add_prefix_space:
                    continue  # Too hard to test for now

                # Input string
                pretokenized_input_simple = "This is a sample input".split()
                pretokenized_input_pair = "This is a sample pair".split()

                # Test encode for pretokenized inputs
                output_r = tokenizer_r.encode(
                    pretokenized_input_simple, is_split_into_words=True, add_special_tokens=False
                )
                output_p = tokenizer_p.encode(
                    pretokenized_input_simple, is_split_into_words=True, add_special_tokens=False
                )
                self.assertEqual(output_p, output_r)

                kwargs = {
                    "is_split_into_words": True,
                    # "return_token_type_ids": True,  # Use the defaults for each tokenizer
                    # "return_attention_mask": True,  # Use the defaults for each tokenizer
                    "return_overflowing_tokens": False,
                    "return_special_tokens_mask": True,
                    "return_offsets_mapping": False,  # Not implemented in python tokenizers
                    # "add_special_tokens": False,
                }
                batch_kwargs = {
                    "is_split_into_words": True,
                    # "return_token_type_ids": True,  # Use the defaults for each tokenizer
                    # "return_attention_mask": True,  # Use the defaults for each tokenizer
                    "return_overflowing_tokens": False,
                    "return_special_tokens_mask": True,
                    "return_offsets_mapping": False,  # Not implemented in python tokenizers
                    # "add_special_tokens": False,
                }
                # Test encode_plus for pretokenized inputs
                output_r = tokenizer_r.encode_plus(pretokenized_input_simple, **kwargs)
                output_p = tokenizer_p.encode_plus(pretokenized_input_simple, **kwargs)
                for key in output_p.keys():
                    self.assertEqual(output_p[key], output_r[key])

                # Test batch_encode_plus for pretokenized inputs
                input_batch = ([pretokenized_input_simple] * 2) + [pretokenized_input_simple + pretokenized_input_pair]
                output_r = tokenizer_r.batch_encode_plus(input_batch, **batch_kwargs)
                output_p = tokenizer_p.batch_encode_plus(input_batch, **batch_kwargs)
                for key in output_p.keys():
                    self.assertEqual(output_p[key], output_r[key])

                # Test encode for pretokenized inputs pairs
                output_r = tokenizer_r.encode(
                    pretokenized_input_simple, pretokenized_input_pair, is_split_into_words=True
                )
                output_p = tokenizer_p.encode(
                    pretokenized_input_simple, pretokenized_input_pair, is_split_into_words=True
                )
                self.assertEqual(output_p, output_r)

                # Test encode_plus for pretokenized input pairs
                output_r = tokenizer_r.encode_plus(pretokenized_input_simple, pretokenized_input_pair, **kwargs)
                output_p = tokenizer_p.encode_plus(pretokenized_input_simple, pretokenized_input_pair, **kwargs)
                for key in output_p.keys():
                    self.assertEqual(output_p[key], output_r[key])

                # Test batch_encode_plus for pretokenized input pairs
                input_batch_pair = ([pretokenized_input_simple, pretokenized_input_pair] * 2) + [
                    pretokenized_input_simple + pretokenized_input_pair,
                    pretokenized_input_pair,
                ]
                output_r = tokenizer_r.batch_encode_plus(input_batch_pair, **batch_kwargs)
                output_p = tokenizer_p.batch_encode_plus(input_batch_pair, **batch_kwargs)
                for key in output_p.keys():
                    self.assertEqual(output_p[key], output_r[key])

    def test_create_token_type_ids(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                input_simple = [1, 2, 3]
                input_pair = [1, 2, 3]

                # Generate output
                output_r = tokenizer_r.create_token_type_ids_from_sequences(input_simple)
                output_p = tokenizer_p.create_token_type_ids_from_sequences(input_simple)
                self.assertEqual(output_p, output_r)

                # Generate pair output
                output_r = tokenizer_r.create_token_type_ids_from_sequences(input_simple, input_pair)
                output_p = tokenizer_p.create_token_type_ids_from_sequences(input_simple, input_pair)
                self.assertEqual(output_p, output_r)

    def test_build_inputs_with_special_tokens(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                # # Input string
                # input_simple = tokenizer_p.tokenize("This is a sample input", add_special_tokens=False)
                # input_pair = tokenizer_p.tokenize("This is a sample pair", add_special_tokens=False)

                # # Generate output
                # output_r = tokenizer_r.build_inputs_with_special_tokens(input_simple)
                # output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple)
                # self.assertEqual(output_p, output_r)

                # # Generate pair output
                # output_r = tokenizer_r.build_inputs_with_special_tokens(input_simple, input_pair)
                # output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple, input_pair)
                # self.assertEqual(output_p, output_r)

                # Input token ids
                input_simple = tokenizer_p.encode("This is a sample input", add_special_tokens=False)
                input_pair = tokenizer_p.encode("This is a sample pair", add_special_tokens=False)

                # Generate output
                output_r = tokenizer_r.build_inputs_with_special_tokens(input_simple)
                output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple)
                self.assertEqual(output_p, output_r)

                # Generate pair output
                output_r = tokenizer_r.build_inputs_with_special_tokens(input_simple, input_pair)
                output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple, input_pair)
                self.assertEqual(output_p, output_r)

    def test_padding(self, max_length=50):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                def assert_padded_input_match(input_r: list, input_p: list, max_length: int):

                    # Ensure we match max_length
                    self.assertEqual(len(input_r), max_length)
                    self.assertEqual(len(input_p), max_length)

                    # Ensure the number of padded tokens is the same
                    padded_tokens_r = list(takewhile(lambda i: i == tokenizer_r.pad_token_id, reversed(input_r)))
                    padded_tokens_p = list(takewhile(lambda i: i == tokenizer_p.pad_token_id, reversed(input_p)))
                    self.assertSequenceEqual(padded_tokens_r, padded_tokens_p)

                def assert_batch_padded_input_match(input_r: dict, input_p: dict, max_length: int):
                    for i_r in input_r.values():
                        self.assertEqual(len(i_r), 2), self.assertEqual(len(i_r[0]), max_length), self.assertEqual(
                            len(i_r[1]), max_length
                        )
                        self.assertEqual(len(i_r), 2), self.assertEqual(len(i_r[0]), max_length), self.assertEqual(
                            len(i_r[1]), max_length
                        )

                    for i_r, i_p in zip(input_r["input_ids"], input_p["input_ids"]):
                        assert_padded_input_match(i_r, i_p, max_length)

                    for i_r, i_p in zip(input_r["attention_mask"], input_p["attention_mask"]):
                        self.assertSequenceEqual(i_r, i_p)

                # Encode - Simple input
                input_r = tokenizer_r.encode("This is a simple input", max_length=max_length, pad_to_max_length=True)
                input_p = tokenizer_p.encode("This is a simple input", max_length=max_length, pad_to_max_length=True)
                assert_padded_input_match(input_r, input_p, max_length)
                input_r = tokenizer_r.encode("This is a simple input", max_length=max_length, padding="max_length")
                input_p = tokenizer_p.encode("This is a simple input", max_length=max_length, padding="max_length")
                assert_padded_input_match(input_r, input_p, max_length)

                input_r = tokenizer_r.encode("This is a simple input", padding="longest")
                input_p = tokenizer_p.encode("This is a simple input", padding=True)
                assert_padded_input_match(input_r, input_p, len(input_r))

                # Encode - Pair input
                input_r = tokenizer_r.encode(
                    "This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
                )
                input_p = tokenizer_p.encode(
                    "This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
                )
                assert_padded_input_match(input_r, input_p, max_length)
                input_r = tokenizer_r.encode(
                    "This is a simple input", "This is a pair", max_length=max_length, padding="max_length"
                )
                input_p = tokenizer_p.encode(
                    "This is a simple input", "This is a pair", max_length=max_length, padding="max_length"
                )
                assert_padded_input_match(input_r, input_p, max_length)
                input_r = tokenizer_r.encode("This is a simple input", "This is a pair", padding=True)
                input_p = tokenizer_p.encode("This is a simple input", "This is a pair", padding="longest")
                assert_padded_input_match(input_r, input_p, len(input_r))

                # Encode_plus - Simple input
                input_r = tokenizer_r.encode_plus(
                    "This is a simple input", max_length=max_length, pad_to_max_length=True
                )
                input_p = tokenizer_p.encode_plus(
                    "This is a simple input", max_length=max_length, pad_to_max_length=True
                )
                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)
                self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
                input_r = tokenizer_r.encode_plus(
                    "This is a simple input", max_length=max_length, padding="max_length"
                )
                input_p = tokenizer_p.encode_plus(
                    "This is a simple input", max_length=max_length, padding="max_length"
                )
                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)
                self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])

                input_r = tokenizer_r.encode_plus("This is a simple input", padding="longest")
                input_p = tokenizer_p.encode_plus("This is a simple input", padding=True)
                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], len(input_r["input_ids"]))

                self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])

                # Encode_plus - Pair input
                input_r = tokenizer_r.encode_plus(
                    "This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
                )
                input_p = tokenizer_p.encode_plus(
                    "This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
                )
                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)
                self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
                input_r = tokenizer_r.encode_plus(
                    "This is a simple input", "This is a pair", max_length=max_length, padding="max_length"
                )
                input_p = tokenizer_p.encode_plus(
                    "This is a simple input", "This is a pair", max_length=max_length, padding="max_length"
                )
                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)
                self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
                input_r = tokenizer_r.encode_plus("This is a simple input", "This is a pair", padding="longest")
                input_p = tokenizer_p.encode_plus("This is a simple input", "This is a pair", padding=True)
                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], len(input_r["input_ids"]))
                self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])

                # Batch_encode_plus - Simple input
                input_r = tokenizer_r.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"],
                    max_length=max_length,
                    pad_to_max_length=True,
                )
                input_p = tokenizer_p.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"],
                    max_length=max_length,
                    pad_to_max_length=True,
                )
                assert_batch_padded_input_match(input_r, input_p, max_length)

                input_r = tokenizer_r.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"],
                    max_length=max_length,
                    padding="max_length",
                )
                input_p = tokenizer_p.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"],
                    max_length=max_length,
                    padding="max_length",
                )
                assert_batch_padded_input_match(input_r, input_p, max_length)

                input_r = tokenizer_r.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"],
                    max_length=max_length,
                    padding="longest",
                )
                input_p = tokenizer_p.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"],
                    max_length=max_length,
                    padding=True,
                )
                assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]))

                input_r = tokenizer_r.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"], padding="longest"
                )
                input_p = tokenizer_p.batch_encode_plus(
                    ["This is a simple input 1", "This is a simple input 2"], padding=True
                )
                assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]))

                # Batch_encode_plus - Pair input
                input_r = tokenizer_r.batch_encode_plus(
                    [
                        ("This is a simple input 1", "This is a simple input 2"),
                        ("This is a simple pair 1", "This is a simple pair 2"),
                    ],
                    max_length=max_length,
                    truncation=True,
                    padding="max_length",
                )
                input_p = tokenizer_p.batch_encode_plus(
                    [
                        ("This is a simple input 1", "This is a simple input 2"),
                        ("This is a simple pair 1", "This is a simple pair 2"),
                    ],
                    max_length=max_length,
                    truncation=True,
                    padding="max_length",
                )
                assert_batch_padded_input_match(input_r, input_p, max_length)

                input_r = tokenizer_r.batch_encode_plus(
                    [
                        ("This is a simple input 1", "This is a simple input 2"),
                        ("This is a simple pair 1", "This is a simple pair 2"),
                    ],
                    padding=True,
                )
                input_p = tokenizer_p.batch_encode_plus(
                    [
                        ("This is a simple input 1", "This is a simple input 2"),
                        ("This is a simple pair 1", "This is a simple pair 2"),
                    ],
                    padding="longest",
                )
                assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]))

                # Using pad on single examples after tokenization
                input_r = tokenizer_r.encode_plus("This is a input 1")
                input_r = tokenizer_r.pad(input_r)

                input_p = tokenizer_p.encode_plus("This is a input 1")
                input_p = tokenizer_p.pad(input_p)

                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], len(input_r["input_ids"]))

                # Using pad on single examples after tokenization
                input_r = tokenizer_r.encode_plus("This is a input 1")
                input_r = tokenizer_r.pad(input_r, max_length=max_length, padding="max_length")

                input_p = tokenizer_p.encode_plus("This is a input 1")
                input_p = tokenizer_p.pad(input_p, max_length=max_length, padding="max_length")

                assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)

                # Using pad after tokenization
                input_r = tokenizer_r.batch_encode_plus(
                    ["This is a input 1", "This is a much longer input whilch should be padded"]
                )
                input_r = tokenizer_r.pad(input_r)

                input_p = tokenizer_p.batch_encode_plus(
                    ["This is a input 1", "This is a much longer input whilch should be padded"]
                )
                input_p = tokenizer_p.pad(input_p)

                assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]))

                # Using pad after tokenization
                input_r = tokenizer_r.batch_encode_plus(
                    ["This is a input 1", "This is a much longer input whilch should be padded"]
                )
                input_r = tokenizer_r.pad(input_r, max_length=max_length, padding="max_length")

                input_p = tokenizer_p.batch_encode_plus(
                    ["This is a input 1", "This is a much longer input whilch should be padded"]
                )
                input_p = tokenizer_p.pad(input_p, max_length=max_length, padding="max_length")

                assert_batch_padded_input_match(input_r, input_p, max_length)
|
||||
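Note on the padding comparison used in test_padding: the helper assert_padded_input_match walks each id list from the end with `takewhile` and collects ids for as long as they equal the pad id, so it compares only the trailing run of padding. The standalone sketch below shows the same idiom on plain lists. `count_trailing_padding` is a hypothetical helper name introduced here, and right-side padding is assumed.

```python
from itertools import takewhile
from typing import List


def count_trailing_padding(ids: List[int], pad_id: int) -> int:
    """Count how many pad ids sit at the end of the sequence (right padding assumed)."""
    # reversed() walks from the last id; takewhile stops at the first non-pad id,
    # so pad ids that appear in the middle of the sequence are not counted.
    return sum(1 for _ in takewhile(lambda i: i == pad_id, reversed(ids)))


trailing = count_trailing_padding([101, 7, 0, 9, 102, 0, 0, 0], pad_id=0)
none_padded = count_trailing_padding([101, 7, 102], pad_id=0)
all_padded = count_trailing_padding([0, 0], pad_id=0)
```

One consequence of this design is that a tokenizer configured for left padding would report zero trailing pads, so a comparison built this way only says something about right-padded outputs.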
    def test_save_pretrained(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                tmpdirname2 = tempfile.mkdtemp()

                tokenizer_r_files = tokenizer_r.save_pretrained(tmpdirname2)
                tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2)
                # Check that both tokenizers save the same files
                self.assertSequenceEqual(tokenizer_r_files, tokenizer_p_files)

                # Check that everything loads correctly in the same way
                tokenizer_rp = tokenizer_r.from_pretrained(tmpdirname2)
                tokenizer_pp = tokenizer_p.from_pretrained(tmpdirname2)

                # Check special tokens are set accordingly on Rust and Python
                for key in tokenizer_pp.special_tokens_map:
                    self.assertTrue(hasattr(tokenizer_rp, key))
                    # self.assertEqual(getattr(tokenizer_rp, key), getattr(tokenizer_pp, key))
                    # self.assertEqual(getattr(tokenizer_rp, key + "_id"), getattr(tokenizer_pp, key + "_id"))

                shutil.rmtree(tmpdirname2)

    def test_embeded_special_tokens(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                sentence = "A, <mask> AllenNLP sentence."
                tokens_r = tokenizer_r.encode_plus(
                    sentence,
                    add_special_tokens=True,
                )
                tokens_p = tokenizer_p.encode_plus(
                    sentence,
                    add_special_tokens=True,
                )

                for key in tokens_p.keys():
                    self.assertEqual(tokens_r[key], tokens_p[key])

                if "token_type_ids" in tokens_r:
                    self.assertEqual(sum(tokens_r["token_type_ids"]), sum(tokens_p["token_type_ids"]))

                tokens_r = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
                tokens_p = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
                self.assertSequenceEqual(tokens_r, tokens_p)

    def test_compare_add_special_tokens(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)

                simple_num_special_tokens_to_add = tokenizer_r.num_special_tokens_to_add(pair=False)
                # pair_num_special_tokens_to_add = tokenizer_r.num_special_tokens_to_add(pair=True)

                for text in ["", " "]:
                    # tokenize()
                    no_special_tokens = tokenizer_r.tokenize(text, add_special_tokens=False)
                    with_special_tokens = tokenizer_r.tokenize(text, add_special_tokens=True)
                    self.assertEqual(
                        len(no_special_tokens), len(with_special_tokens) - simple_num_special_tokens_to_add
                    )

                    # encode()
                    no_special_tokens = tokenizer_r.encode(text, add_special_tokens=False)
                    with_special_tokens = tokenizer_r.encode(text, add_special_tokens=True)
                    self.assertEqual(
                        len(no_special_tokens), len(with_special_tokens) - simple_num_special_tokens_to_add
                    )

                    # encode_plus()
                    no_special_tokens = tokenizer_r.encode_plus(text, add_special_tokens=False)
                    with_special_tokens = tokenizer_r.encode_plus(text, add_special_tokens=True)
                    for key in no_special_tokens.keys():
                        self.assertEqual(
                            len(no_special_tokens[key]),
                            len(with_special_tokens[key]) - simple_num_special_tokens_to_add,
                        )

                    # batch_encode_plus()
                    no_special_tokens = tokenizer_r.batch_encode_plus([text, text], add_special_tokens=False)
                    with_special_tokens = tokenizer_r.batch_encode_plus([text, text], add_special_tokens=True)
                    for key in no_special_tokens.keys():
                        for i_no, i_with in zip(no_special_tokens[key], with_special_tokens[key]):
                            self.assertEqual(len(i_no), len(i_with) - simple_num_special_tokens_to_add)

    def test_compare_prepare_for_model(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
                string_sequence = "Asserting that both tokenizers are equal"
                python_output = tokenizer_p.prepare_for_model(
                    tokenizer_p.encode(string_sequence, add_special_tokens=False)
                )
                rust_output = tokenizer_r.prepare_for_model(
                    tokenizer_r.encode(string_sequence, add_special_tokens=False)
                )
                for key in python_output:
                    self.assertEqual(python_output[key], rust_output[key])

@@ -14,12 +14,13 @@
 # limitations under the License.


-from transformers.testing_utils import slow
-from transformers.tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
+from transformers import DistilBertTokenizer, DistilBertTokenizerFast
+from transformers.testing_utils import require_tokenizers, slow

 from .test_tokenization_bert import BertTokenizationTest


+@require_tokenizers
 class DistilBertTokenizationTest(BertTokenizationTest):

     tokenizer_class = DistilBertTokenizer
|
||||
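Note on the @require_tokenizers decorator added in these hunks: with tokenizers now an optional dependency, test classes that need the fast tokenizers have to be skipped when the package is absent. The sketch below shows one way such a guard can be built from the standard library. `is_package_available` and `require_package` are hypothetical names for illustration; this is a minimal sketch of the pattern and is not the code in transformers.testing_utils.

```python
import importlib.util
import unittest


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str):
    """Build a decorator that skips a test (or test class) when `name` is missing."""

    def decorator(test_item):
        if not is_package_available(name):
            return unittest.skip(f"test requires {name}")(test_item)
        return test_item

    return decorator


@require_package("json")  # part of the standard library, so this class runs
class RunsWhenAvailable(unittest.TestCase):
    def test_runs(self):
        self.assertTrue(is_package_available("json"))


@require_package("surely_not_an_installed_package_xyz")  # assumed to be absent
class SkippedWhenMissing(unittest.TestCase):
    def test_never_runs(self):
        self.fail("this test should have been skipped")
```

Decorating the class rather than each method keeps the skip in one place, which matches how the hunks above apply the decorator directly above each test class.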
@@ -14,8 +14,7 @@
 # limitations under the License.


-from transformers.testing_utils import slow
-from transformers.tokenization_dpr import (
+from transformers import (
     DPRContextEncoderTokenizer,
     DPRContextEncoderTokenizerFast,
     DPRQuestionEncoderTokenizer,
@@ -24,11 +23,13 @@ from transformers.tokenization_dpr import (
     DPRReaderTokenizer,
     DPRReaderTokenizerFast,
 )
+from transformers.testing_utils import require_tokenizers, slow
 from transformers.tokenization_utils_base import BatchEncoding

 from .test_tokenization_bert import BertTokenizationTest


+@require_tokenizers
 class DPRContextEncoderTokenizationTest(BertTokenizationTest):

     tokenizer_class = DPRContextEncoderTokenizer
@@ -36,6 +37,7 @@ class DPRContextEncoderTokenizationTest(BertTokenizationTest):
     test_rust_tokenizer = True


+@require_tokenizers
 class DPRQuestionEncoderTokenizationTest(BertTokenizationTest):

     tokenizer_class = DPRQuestionEncoderTokenizer
@@ -43,6 +45,7 @@ class DPRQuestionEncoderTokenizationTest(BertTokenizationTest):
     test_rust_tokenizer = True


+@require_tokenizers
 class DPRReaderTokenizationTest(BertTokenizationTest):

     tokenizer_class = DPRReaderTokenizer

File diff suppressed because it is too large
@@ -17,14 +17,18 @@
|
||||
import os
|
||||
import unittest
|
||||
|
||||
from transformers.tokenization_funnel import VOCAB_FILES_NAMES, FunnelTokenizer, FunnelTokenizerFast
|
||||
from transformers import FunnelTokenizer, FunnelTokenizerFast
|
||||
from transformers.testing_utils import require_tokenizers
|
||||
from transformers.tokenization_funnel import VOCAB_FILES_NAMES
|
||||
|
||||
from .test_tokenization_common import TokenizerTesterMixin
|
||||
|
||||
|
||||
@require_tokenizers
|
||||
class FunnelTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
|
||||
|
||||
tokenizer_class = FunnelTokenizer
|
||||
rust_tokenizer_class = FunnelTokenizerFast
|
||||
test_rust_tokenizer = True
|
||||
space_between_special_tokens = True
|
||||
|
||||
|
||||
@@ -18,16 +18,20 @@ import json
|
||||
import os
|
||||
import unittest
|
||||
|
||||
from transformers.tokenization_gpt2 import VOCAB_FILES_NAMES, GPT2Tokenizer, GPT2TokenizerFast
|
||||
from transformers import GPT2Tokenizer, GPT2TokenizerFast
|
||||
from transformers.testing_utils import require_tokenizers
|
||||
from transformers.tokenization_gpt2 import VOCAB_FILES_NAMES
|
||||
|
||||
from .test_tokenization_common import TokenizerTesterMixin
|
||||
|
||||
|
||||
@require_tokenizers
|
||||
class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
|
||||
|
||||
tokenizer_class = GPT2Tokenizer
|
||||
rust_tokenizer_class = GPT2TokenizerFast
|
||||
test_rust_tokenizer = True
|
||||
from_pretrained_kwargs = {"add_prefix_space": True}
|
||||
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -125,3 +129,47 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         # It's very difficult to mix/test pretokenization with byte-level
         # And get both GPT2 and Roberta to work at the same time (mostly an issue of adding a space before the string)
         pass
+
+    def test_padding(self, max_length=15):
+        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
+                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+                # Simple input
+                s = "This is a simple input"
+                s2 = ["This is a simple input 1", "This is a simple input 2"]
+                p = ("This is a simple input", "This is a pair")
+                p2 = [
+                    ("This is a simple input 1", "This is a simple input 2"),
+                    ("This is a simple pair 1", "This is a simple pair 2"),
+                ]
+
+                # Simple input tests
+                self.assertRaises(ValueError, tokenizer_r.encode, s, max_length=max_length, padding="max_length")
+
+                # Simple input
+                self.assertRaises(ValueError, tokenizer_r.encode_plus, s, max_length=max_length, padding="max_length")
+
+                # Simple input
+                self.assertRaises(
+                    ValueError,
+                    tokenizer_r.batch_encode_plus,
+                    s2,
+                    max_length=max_length,
+                    padding="max_length",
+                )
+
+                # Pair input
+                self.assertRaises(ValueError, tokenizer_r.encode, p, max_length=max_length, padding="max_length")
+
+                # Pair input
+                self.assertRaises(ValueError, tokenizer_r.encode_plus, p, max_length=max_length, padding="max_length")
+
+                # Pair input
+                self.assertRaises(
+                    ValueError,
+                    tokenizer_r.batch_encode_plus,
+                    p2,
+                    max_length=max_length,
+                    padding="max_length",
+                )

@@ -18,12 +18,14 @@ import json
 import os
 import unittest

-from transformers.testing_utils import slow
-from transformers.tokenization_herbert import VOCAB_FILES_NAMES, HerbertTokenizer, HerbertTokenizerFast
+from transformers import HerbertTokenizer, HerbertTokenizerFast
+from transformers.testing_utils import get_tests_dir, require_tokenizers, slow
+from transformers.tokenization_herbert import VOCAB_FILES_NAMES

 from .test_tokenization_common import TokenizerTesterMixin


+@require_tokenizers
 class HerbertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = HerbertTokenizer
@@ -33,6 +35,10 @@ class HerbertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     def setUp(self):
         super().setUp()

+        # Use a simpler test file without Japanese/Chinese characters
+        with open(f"{get_tests_dir()}/fixtures/sample_text_no_unicode.txt", encoding="utf-8") as f_data:
+            self._data = f_data.read().replace("\n\n", "\n").strip()
+
         vocab = [
             "<s>",
             "</s>",
@@ -17,14 +17,20 @@
 import os
 import unittest

-from transformers.tokenization_layoutlm import VOCAB_FILES_NAMES, LayoutLMTokenizer
+from transformers import LayoutLMTokenizer, LayoutLMTokenizerFast
+from transformers.testing_utils import require_tokenizers
+from transformers.tokenization_layoutlm import VOCAB_FILES_NAMES

 from .test_tokenization_common import TokenizerTesterMixin


+@require_tokenizers
 class LayoutLMTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = LayoutLMTokenizer
+    rust_tokenizer_class = LayoutLMTokenizerFast
+    test_rust_tokenizer = True
+    space_between_special_tokens = True

     def setUp(self):
         super().setUp()
@@ -17,12 +17,14 @@
 import os
 import unittest

+from transformers import LxmertTokenizer, LxmertTokenizerFast
+from transformers.testing_utils import require_tokenizers
 from transformers.tokenization_bert import VOCAB_FILES_NAMES
-from transformers.tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast

 from .test_tokenization_common import TokenizerTesterMixin


+@require_tokenizers
 class LxmertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = LxmertTokenizer
@@ -20,9 +20,12 @@ import unittest
 from pathlib import Path
 from shutil import copyfile

-from transformers.testing_utils import _torch_available
-from transformers.tokenization_marian import MarianTokenizer, save_json, vocab_files_names
-from transformers.tokenization_utils import BatchEncoding
+from transformers import BatchEncoding, MarianTokenizer
+from transformers.testing_utils import _sentencepiece_available, _torch_available, require_sentencepiece
+
+
+if _sentencepiece_available:
+    from transformers.tokenization_marian import save_json, vocab_files_names

 from .test_tokenization_common import TokenizerTesterMixin

@@ -35,6 +38,7 @@ ORG_NAME = "Helsinki-NLP/"
 FRAMEWORK = "pt" if _torch_available else "tf"


+@require_sentencepiece
 class MarianTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = MarianTokenizer
@@ -1,11 +1,26 @@
 import tempfile
 import unittest

-from transformers import AutoTokenizer, BatchEncoding, MBartTokenizer, MBartTokenizerFast, is_torch_available
-from transformers.testing_utils import require_torch
+from transformers import (
+    SPIECE_UNDERLINE,
+    AutoTokenizer,
+    BatchEncoding,
+    MBartTokenizer,
+    MBartTokenizerFast,
+    is_torch_available,
+)
+from transformers.testing_utils import (
+    _sentencepiece_available,
+    require_sentencepiece,
+    require_tokenizers,
+    require_torch,
+)

 from .test_tokenization_common import TokenizerTesterMixin
-from .test_tokenization_xlm_roberta import SAMPLE_VOCAB, SPIECE_UNDERLINE
+
+
+if _sentencepiece_available:
+    from .test_tokenization_xlm_roberta import SAMPLE_VOCAB


 if is_torch_available():
@@ -15,6 +30,8 @@ EN_CODE = 250004
 RO_CODE = 250020


+@require_sentencepiece
+@require_tokenizers
 class MBartTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     tokenizer_class = MBartTokenizer
     rust_tokenizer_class = MBartTokenizerFast
@@ -105,6 +122,8 @@ class MBartTokenizationTest(TokenizerTesterMixin, unittest.TestCase):


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class MBartEnroIntegrationTest(unittest.TestCase):
     checkpoint_name = "facebook/mbart-large-en-ro"
     src_text = [
@@ -18,11 +18,14 @@ import json
 import os
 import unittest

-from transformers.tokenization_openai import VOCAB_FILES_NAMES, OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
+from transformers import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
+from transformers.testing_utils import require_tokenizers
+from transformers.tokenization_openai import VOCAB_FILES_NAMES

 from .test_tokenization_common import TokenizerTesterMixin


+@require_tokenizers
 class OpenAIGPTTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = OpenAIGPTTokenizer
@@ -80,3 +83,47 @@ class OpenAIGPTTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         input_tokens = tokens + ["<unk>"]
         input_bpe_tokens = [14, 15, 20]
         self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+
+    def test_padding(self, max_length=15):
+        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
+                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+                # Simple input
+                s = "This is a simple input"
+                s2 = ["This is a simple input 1", "This is a simple input 2"]
+                p = ("This is a simple input", "This is a pair")
+                p2 = [
+                    ("This is a simple input 1", "This is a simple input 2"),
+                    ("This is a simple pair 1", "This is a simple pair 2"),
+                ]
+
+                # Simple input tests
+                self.assertRaises(ValueError, tokenizer_r.encode, s, max_length=max_length, padding="max_length")
+
+                # Simple input
+                self.assertRaises(ValueError, tokenizer_r.encode_plus, s, max_length=max_length, padding="max_length")
+
+                # Simple input
+                self.assertRaises(
+                    ValueError,
+                    tokenizer_r.batch_encode_plus,
+                    s2,
+                    max_length=max_length,
+                    padding="max_length",
+                )
+
+                # Pair input
+                self.assertRaises(ValueError, tokenizer_r.encode, p, max_length=max_length, padding="max_length")
+
+                # Pair input
+                self.assertRaises(ValueError, tokenizer_r.encode_plus, p, max_length=max_length, padding="max_length")
+
+                # Pair input
+                self.assertRaises(
+                    ValueError,
+                    tokenizer_r.batch_encode_plus,
+                    p2,
+                    max_length=max_length,
+                    padding="max_length",
+                )
@@ -1,8 +1,8 @@
 import unittest

+from transformers import PegasusTokenizer, PegasusTokenizerFast
 from transformers.file_utils import cached_property
-from transformers.testing_utils import get_tests_dir, require_torch
-from transformers.tokenization_pegasus import PegasusTokenizer, PegasusTokenizerFast
+from transformers.testing_utils import get_tests_dir, require_sentencepiece, require_tokenizers, require_torch

 from .test_tokenization_common import TokenizerTesterMixin

@@ -10,6 +10,8 @@ from .test_tokenization_common import TokenizerTesterMixin
 SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece_no_bos.model")


+@require_sentencepiece
+@require_tokenizers
 class PegasusTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = PegasusTokenizer
@@ -17,9 +17,9 @@
 import os
 import unittest

+from transformers import SPIECE_UNDERLINE, ReformerTokenizer, ReformerTokenizerFast
 from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torch, slow
-from transformers.tokenization_reformer import SPIECE_UNDERLINE, ReformerTokenizer, ReformerTokenizerFast
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow

 from .test_tokenization_common import TokenizerTesterMixin

@@ -27,6 +27,8 @@ from .test_tokenization_common import TokenizerTesterMixin
 SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/test_sentencepiece.model")


+@require_sentencepiece
+@require_tokenizers
 class ReformerTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = ReformerTokenizer
@@ -18,16 +18,19 @@ import json
 import os
 import unittest

-from transformers.testing_utils import slow
-from transformers.tokenization_roberta import VOCAB_FILES_NAMES, AddedToken, RobertaTokenizer, RobertaTokenizerFast
+from transformers import AddedToken, RobertaTokenizer, RobertaTokenizerFast
+from transformers.testing_utils import require_tokenizers, slow
+from transformers.tokenization_roberta import VOCAB_FILES_NAMES

 from .test_tokenization_common import TokenizerTesterMixin


+@require_tokenizers
 class RobertaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     tokenizer_class = RobertaTokenizer
     rust_tokenizer_class = RobertaTokenizerFast
     test_rust_tokenizer = True
+    from_pretrained_kwargs = {"cls_token": "<s>"}

     def setUp(self):
         super().setUp()
@@ -158,3 +161,38 @@ class RobertaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         mask_loc = encoded.index(mask_ind)
         first_char = tokenizer.convert_ids_to_tokens(encoded[mask_loc + 1])[0]
         self.assertNotEqual(first_char, space_encoding)
+
+    def test_pretokenized_inputs(self):
+        pass
+
+    def test_embeded_special_tokens(self):
+        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+            with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)):
+                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+                tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+                sentence = "A, <mask> AllenNLP sentence."
+                tokens_r = tokenizer_r.encode_plus(sentence, add_special_tokens=True, return_token_type_ids=True)
+                tokens_p = tokenizer_p.encode_plus(sentence, add_special_tokens=True, return_token_type_ids=True)
+
+                # token_type_ids should put 0 everywhere
+                self.assertEqual(sum(tokens_r["token_type_ids"]), sum(tokens_p["token_type_ids"]))
+
+                # attention_mask should put 1 everywhere, so sum over length should be 1
+                self.assertEqual(
+                    sum(tokens_r["attention_mask"]) / len(tokens_r["attention_mask"]),
+                    sum(tokens_p["attention_mask"]) / len(tokens_p["attention_mask"]),
+                )
+
+                tokens_r_str = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
+                tokens_p_str = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
+
+                # Rust correctly handles the space before the mask while python doesn't
+                self.assertSequenceEqual(tokens_p["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
+                self.assertSequenceEqual(tokens_r["input_ids"], [0, 250, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
+
+                self.assertSequenceEqual(
+                    tokens_p_str, ["<s>", "A", ",", "<mask>", "ĠAllen", "N", "LP", "Ġsentence", ".", "</s>"]
+                )
+                self.assertSequenceEqual(
+                    tokens_r_str, ["<s>", "A", ",", "<mask>", "ĠAllen", "N", "LP", "Ġsentence", ".", "</s>"]
+                )
@@ -14,15 +14,18 @@
 # limitations under the License.


-from transformers.testing_utils import slow
-from transformers.tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
+from transformers import SqueezeBertTokenizer, SqueezeBertTokenizerFast
+from transformers.testing_utils import require_tokenizers, slow

 from .test_tokenization_bert import BertTokenizationTest


+@require_tokenizers
 class SqueezeBertTokenizationTest(BertTokenizationTest):

     tokenizer_class = SqueezeBertTokenizer
+    rust_tokenizer_class = SqueezeBertTokenizerFast
+    test_rust_tokenizer = True

     def get_rust_tokenizer(self, **kwargs):
         return SqueezeBertTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
@@ -16,11 +16,9 @@

 import unittest

-from transformers import BatchEncoding
+from transformers import SPIECE_UNDERLINE, BatchEncoding, T5Tokenizer, T5TokenizerFast
 from transformers.file_utils import cached_property
-from transformers.testing_utils import _torch_available, get_tests_dir
-from transformers.tokenization_t5 import T5Tokenizer, T5TokenizerFast
-from transformers.tokenization_xlnet import SPIECE_UNDERLINE
+from transformers.testing_utils import _torch_available, get_tests_dir, require_sentencepiece, require_tokenizers

 from .test_tokenization_common import TokenizerTesterMixin

@@ -30,6 +28,8 @@ SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")
 FRAMEWORK = "pt" if _torch_available else "tf"


+@require_sentencepiece
+@require_tokenizers
 class T5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = T5Tokenizer
@@ -19,7 +19,7 @@ from typing import Callable, Optional
 import numpy as np

 from transformers import BatchEncoding, BertTokenizer, BertTokenizerFast, PreTrainedTokenizer, TensorType
-from transformers.testing_utils import require_tf, require_torch, slow
+from transformers.testing_utils import require_tf, require_tokenizers, require_torch, slow
 from transformers.tokenization_gpt2 import GPT2Tokenizer


@@ -68,6 +68,7 @@ class TokenizerUtilsTest(unittest.TestCase):
         self.assertEqual(TensorType("pt"), TensorType.PYTORCH)
         self.assertEqual(TensorType("np"), TensorType.NUMPY)

+    @require_tokenizers
     def test_batch_encoding_pickle(self):
         import numpy as np

@@ -92,6 +93,7 @@ class TokenizerUtilsTest(unittest.TestCase):
         )

     @require_tf
+    @require_tokenizers
     def test_batch_encoding_pickle_tf(self):
         import tensorflow as tf

@@ -112,6 +114,7 @@ class TokenizerUtilsTest(unittest.TestCase):
         )

     @require_torch
+    @require_tokenizers
     def test_batch_encoding_pickle_pt(self):
         import torch

@@ -128,6 +131,7 @@ class TokenizerUtilsTest(unittest.TestCase):
             tokenizer_r("Small example to encode", return_tensors=TensorType.PYTORCH), torch.equal
         )

+    @require_tokenizers
     def test_batch_encoding_is_fast(self):
         tokenizer_p = BertTokenizer.from_pretrained("bert-base-cased")
         tokenizer_r = BertTokenizerFast.from_pretrained("bert-base-cased")
@@ -17,9 +17,9 @@
 import os
 import unittest

+from transformers import SPIECE_UNDERLINE, XLMRobertaTokenizer, XLMRobertaTokenizerFast
 from transformers.file_utils import cached_property
-from transformers.testing_utils import slow
-from transformers.tokenization_xlm_roberta import SPIECE_UNDERLINE, XLMRobertaTokenizer, XLMRobertaTokenizerFast
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, slow

 from .test_tokenization_common import TokenizerTesterMixin

@@ -27,6 +27,8 @@ from .test_tokenization_common import TokenizerTesterMixin
 SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/test_sentencepiece.model")


+@require_sentencepiece
+@require_tokenizers
 class XLMRobertaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = XLMRobertaTokenizer
@@ -17,8 +17,8 @@
 import os
 import unittest

-from transformers.testing_utils import slow
-from transformers.tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer, XLNetTokenizerFast
+from transformers import SPIECE_UNDERLINE, XLNetTokenizer, XLNetTokenizerFast
+from transformers.testing_utils import require_sentencepiece, require_tokenizers, slow

 from .test_tokenization_common import TokenizerTesterMixin

@@ -26,6 +26,8 @@ from .test_tokenization_common import TokenizerTesterMixin
 SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/test_sentencepiece.model")


+@require_sentencepiece
+@require_tokenizers
 class XLNetTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = XLNetTokenizer
@@ -23,7 +23,7 @@ import numpy as np

 from transformers import AutoTokenizer, PretrainedConfig, TrainingArguments, is_torch_available
 from transformers.file_utils import WEIGHTS_NAME
-from transformers.testing_utils import get_tests_dir, require_torch, slow
+from transformers.testing_utils import get_tests_dir, require_sentencepiece, require_tokenizers, require_torch, slow


 if is_torch_available():
@@ -151,6 +151,8 @@ if is_torch_available():


 @require_torch
+@require_sentencepiece
+@require_tokenizers
 class TrainerIntegrationTest(unittest.TestCase):
     def setUp(self):
         args = TrainingArguments(".")
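
For readers unfamiliar with the pattern these hunks apply, here is a minimal sketch of how a `require_*`-style decorator can skip tests when an optional dependency is missing. It is not the code in `transformers.testing_utils`; `is_package_available`, `require_package`, `run_demo` and the two demo test classes are hypothetical names chosen for illustration, and only the standard library is assumed.

```python
import importlib.util
import unittest


def is_package_available(name: str) -> bool:
    """Return True if the top-level package `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str):
    """Hypothetical decorator factory: skip the decorated test or class if `name` is missing."""
    return unittest.skipUnless(is_package_available(name), f"test requires {name}")


@require_package("json")  # stdlib module, so this class always runs
class AlwaysRuns(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)


@require_package("surely_not_an_installed_package_xyz")  # assumed to be absent
class AlwaysSkipped(unittest.TestCase):
    def test_never_runs(self):
        self.fail("should have been skipped")


def run_demo() -> unittest.TestResult:
    """Run both demo classes and return the collected result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(AlwaysRuns))
    suite.addTests(loader.loadTestsFromTestCase(AlwaysSkipped))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the decorator marks a missing dependency as a skip rather than a failure, the same test module can be collected whether or not the optional package is installed. Module-level imports that would fail without the package still need their own guard, as the `if _sentencepiece_available:` blocks in the Marian and MBart hunks do.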