Add WhisperTokenizerFast (#21222)

* Add WhisperTokenizerFast

* Fixup

* Up

* Up

* Improve tests

* Update src/transformers/models/whisper/tokenization_whisper_fast.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Keep stride in Whisper pipeline test

* Remove unknown token special case

* Reduce vocabulary size in tests

* Fix vocab size assertion

* Sync copied changes from WhisperTokenizer

* Skip pipeline tests

* Update assertion

* Remove Whisper tokenizer dependency on sentencepiece

* Format

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Jonatan Kłosko
2023-02-21 06:58:54 +01:00
committed by GitHub
parent 8b3db33a76
commit deafc24388
12 changed files with 568 additions and 8 deletions

@@ -406,7 +406,7 @@ Flax), PyTorch, and/or TensorFlow.
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Whisper | ✅ | ❌ | ✅ | ✅ | ✅ |
+| Whisper | ✅ | ✅ | ✅ | ✅ | ✅ |
| X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| X-MOD | ❌ | ❌ | ✅ | ❌ | ❌ |
| XGLM | ✅ | ✅ | ✅ | ✅ | ✅ |

@@ -45,6 +45,15 @@ The original code can be found [here](https://github.com/openai/whisper).
- create_token_type_ids_from_sequences
- save_vocabulary

+## WhisperTokenizerFast
+
+[[autodoc]] WhisperTokenizerFast
+- set_prefix_tokens
+- build_inputs_with_special_tokens
+- get_special_tokens_mask
+- create_token_type_ids_from_sequences
+- save_vocabulary
+
## WhisperFeatureExtractor

[[autodoc]] WhisperFeatureExtractor