Add a TF in-graph tokenizer for BERT (#17701)

* Add a TF in-graph tokenizer for BERT

* Add from_pretrained

* Add proper truncation, option handling to match other tokenizers

* Add proper imports and guards

* Add test, fix all the bugs exposed by said test

* Fix truncation of paired texts in graph mode, more test updates

* Small fixes, add a (very careful) test for savedmodel

* Add tensorflow-text dependency, make fixup

* Update documentation

* Update documentation

* make fixup

* Slight changes to tests

* Add some docstring examples

* Update tests

* Update tests and add proper lowercasing/normalization

* make fixup

* Add docstring for padding!

* Mark slow tests

* make fixup

* Fall back to BertTokenizerFast if BertTokenizer is unavailable

* Fall back to BertTokenizerFast if BertTokenizer is unavailable

* make fixup

* Properly handle tensorflow-text dummies
Matt
2022-06-27 12:06:21 +01:00
committed by GitHub
parent 401fcca6c5
commit ee0d001de7
12 changed files with 402 additions and 3 deletions

@@ -155,6 +155,7 @@ _deps = [
     "starlette",
     "tensorflow-cpu>=2.3",
     "tensorflow>=2.3",
+    "tensorflow-text",
     "tf2onnx",
     "timeout-decorator",
     "timm",
@@ -238,8 +239,8 @@ extras = {}
 extras["ja"] = deps_list("fugashi", "ipadic", "unidic_lite", "unidic")
 extras["sklearn"] = deps_list("scikit-learn")
-extras["tf"] = deps_list("tensorflow", "onnxconverter-common", "tf2onnx")
-extras["tf-cpu"] = deps_list("tensorflow-cpu", "onnxconverter-common", "tf2onnx")
+extras["tf"] = deps_list("tensorflow", "onnxconverter-common", "tf2onnx", "tensorflow-text")
+extras["tf-cpu"] = deps_list("tensorflow-cpu", "onnxconverter-common", "tf2onnx", "tensorflow-text")
 extras["torch"] = deps_list("torch")
 extras["accelerate"] = deps_list("accelerate")
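
Note on the dependency change: `tensorflow-text` is only pulled in through the `tf` and `tf-cpu` extras, so code that relies on it generally has to check for it before importing. A minimal sketch of such an availability guard follows, using only the standard library. The helper names `is_package_available` and `require_package` are hypothetical and chosen for illustration; they are not the guards this commit adds.

```python
import importlib.util


def is_package_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    # Hypothetical helper: find_spec returns None when the module is not installed.
    return importlib.util.find_spec(module_name) is not None


def require_package(module_name: str, pip_name: str) -> None:
    """Raise a readable ImportError when an optional dependency is missing."""
    if not is_package_available(module_name):
        raise ImportError(
            f"This feature needs `{module_name}`. "
            f"Install it with `pip install {pip_name}`."
        )


if __name__ == "__main__":
    # The pip package `tensorflow-text` is imported as `tensorflow_text`.
    print("tensorflow_text available:", is_package_available("tensorflow_text"))
```

Checking with `find_spec` avoids importing a heavy package just to learn whether it exists, and the explicit error message tells the user which package to install.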