Add WhisperTokenizerFast (#21222)

* Add WhisperTokenizerFast * Fixup * Up * Up * Improve tests * Update src/transformers/models/whisper/tokenization_whisper_fast.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Keep stride in whisper pipelien test * Remove unknown token special case * Reduce vocabulary size in tests * Fix vocab size assertion * Sync copied changes from WhisperTokenizer * Skip pipeline tests * Update assertion * Remove Whisper tokenizer dependency on sentencepiece * Format --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-02-21 06:58:54 +01:00
parent 8b3db33a76
commit deafc24388
12 changed files with 568 additions and 8 deletions
--- a/docs/source/en/index.mdx
+++ b/docs/source/en/index.mdx
@@ -406,7 +406,7 @@ Flax), PyTorch, and/or TensorFlow.
 |           Wav2Vec2            |       ✅       |       ❌       |       ✅        |         ✅         |      ✅      |
 |      Wav2Vec2-Conformer       |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |             WavLM             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-|            Whisper            |       ✅       |       ❌       |       ✅        |         ✅         |      ✅      |
+|            Whisper            |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
 |            X-CLIP             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |             X-MOD             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |             XGLM              |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
--- a/docs/source/en/model_doc/whisper.mdx
+++ b/docs/source/en/model_doc/whisper.mdx
@@ -45,6 +45,15 @@ The original code can be found [here](https://github.com/openai/whisper).
    - create_token_type_ids_from_sequences
    - save_vocabulary

+## WhisperTokenizerFast
+
+[[autodoc]] WhisperTokenizerFast
+    - set_prefix_tokens
+    - build_inputs_with_special_tokens
+    - get_special_tokens_mask
+    - create_token_type_ids_from_sequences
+    - save_vocabulary
+
 ## WhisperFeatureExtractor

 [[autodoc]] WhisperFeatureExtractor