Conversion from slow to fast for BPE spm vocabs contained an error. (#10120)

* Conversion from slow to fast for BPE spm vocabs contained an error.

- Only one test (tokenizers + slow) currently exercises the modified path:
reformer, which does not modify any ids, so the bug went unnoticed
until now.
- The real issue is that the `vocab` variable was overwritten by the output of
SentencePieceExtractor, so the slow-specific vocab oddities were
completely ignored.
- The bug was reported here: https://github.com/huggingface/transformers/issues/9518
- Ran the complete tokenization test suite, slow tests included, without errors
(`RUN_SLOW=1 pytest -sv tests/test_tokenization_*`)
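
The variable overwrite described above can be illustrated with a minimal sketch. This is not the actual converter code: `slow_vocab`, `extract_from_spm_file`, `convert_buggy`, and `convert_fixed` are hypothetical stand-ins, and the token lists are made up. It only shows how rebinding `vocab` to the extractor's output silently drops entries that exist only in the slow tokenizer's view of the vocab.

```python
# Hypothetical stand-ins; none of these names come from the transformers code base.

def slow_vocab():
    """Vocab as the slow tokenizer sees it, including an extra leading token."""
    return [("<pad>", 0.0), ("<unk>", 0.0), ("a", -1.0), ("b", -2.0), ("ab", -3.0)]


def extract_from_spm_file():
    """Stand-in for an extractor reading the raw model file: returns (vocab, merges)."""
    raw_vocab = {"<unk>": 0, "a": 1, "b": 2, "ab": 3}
    merges = [("a", "b")]
    return raw_vocab, merges


def convert_buggy():
    vocab = slow_vocab()
    # Bug: rebinding `vocab` here throws away the slow-specific entries.
    vocab, merges = extract_from_spm_file()
    return vocab, merges


def convert_fixed():
    vocab = slow_vocab()
    # Keep only the merges from the extractor; derive ids from the slow vocab.
    _, merges = extract_from_spm_file()
    bpe_vocab = {token: idx for idx, (token, _score) in enumerate(vocab)}
    return bpe_vocab, merges


buggy_vocab, _ = convert_buggy()
fixed_vocab, _ = convert_fixed()
print("<pad>" in buggy_vocab, "<pad>" in fixed_vocab)
```

In the sketch, a tokenizer whose slow vocab matches the raw file exactly would give the same result on both paths, which is consistent with the bug staying silent for reformer.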

* Remove rebase error.

* Adding the fixture.

Nicolas Patry

2021-02-13 14:24:53 +01:00

committed by

GitHub

parent dd3a7f9641

commit c9837a0d27

3 changed files with 28 additions and 3 deletions

tests/fixtures/test_sentencepiece_bpe.model (binary file, not shown)
