Fix for fast tokenizers save_pretrained compatibility with Python. (#2933)

* Renamed file generate by tokenizers when calling save_pretrained to match python.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added save_vocabulary tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove python quick and dirty fix for clean Rust impl.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.5.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added some save_pretrained / from_pretrained unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.5.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality and format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* flake8

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Making sure there is really a bug in unittest

* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
This commit is contained in:
Funtowicz Morgan
2020-02-25 00:20:42 +01:00
committed by GitHub
parent ee60840ee6
commit 4cd9c0971c
4 changed files with 83 additions and 24 deletions

View File

@@ -92,7 +92,7 @@ setup(
packages=find_packages("src"),
install_requires=[
"numpy",
"tokenizers == 0.5.0",
"tokenizers == 0.5.2",
# accessing files from S3 directly
"boto3",
# filesystem locks e.g. to prevent parallel downloads