Support reading tiktoken tokenizer.model file (#31656)

* use existing TikTokenConverter to read tiktoken tokenizer.model file

* del test file

* create titktoken integration file

* adding tiktoken llama test

* ALTNATIVE IMPLEMENTATION: supports llama 405B

* fix one char

* remove redundant line

* small fix

* rm unused import

* flag for converting from tiktokeng

* remove unneeded file

* ruff

* remove llamatiktokenconverter, stick to general converter

* tiktoken support v2

* update test

* remove stale changes

* udpate doc

* protect import

* use is_protobuf_available

* add templateprocessor in tiktokenconverter

* reverting templateprocessor from tiktoken support

* update test

* add require_tiktoken

* dev-ci

* trigger build

* trigger build again

* dev-ci

* [build-ci-image] tiktoken

* dev-ci

* dev-ci

* dev-ci

* dev-ci

* change tiktoken file name

* feedback review

* feedback rev

* applying feedback, removing tiktoken converters

* conform test

* adding docs for review

* add doc file for review

* add doc file for review

* add doc file for review

* support loading model without config.json file

* Revert "support loading model without config.json file"

This reverts commit 2753602e51c34cef2f184eb11f36d2ad1b02babb.

* remove dev var

* updating docs

* safely import protobuf

* fix protobuf import error

* fix protobuf import error

* trying isort to fix ruff error

* fix ruff error

* try to fix ruff again

* try to fix ruff again

* try to fix ruff again

* doc table of contents

* add fix for consistency.dockerfile torchaudio

* ruff

* applying feedback

* minor typo

* merging with push-ci-image

* clean up imports

* revert dockerfile consistency
This commit is contained in:
Ita Zaporozhets
2024-09-06 08:24:02 -04:00
committed by GitHub
parent 342e800086
commit e48e5f1f13
13 changed files with 195 additions and 21 deletions

View File

@@ -99,6 +99,7 @@ _deps = [
"accelerate>=0.26.0",
"av==9.2.0", # Latest version of PyAV (10.0.0) has issues with audio stream.
"beautifulsoup4",
"blobfile",
"codecarbon==1.2.0",
"cookiecutter==1.7.3",
"dataclasses",
@@ -177,6 +178,7 @@ _deps = [
"tensorflow-probability<0.24",
"tf2onnx",
"timeout-decorator",
"tiktoken",
"timm<=0.9.16",
"tokenizers>=0.19,<0.20",
"torch",
@@ -311,6 +313,7 @@ extras["codecarbon"] = deps_list("codecarbon")
extras["video"] = deps_list("decord", "av")
extras["sentencepiece"] = deps_list("sentencepiece", "protobuf")
extras["tiktoken"] = deps_list("tiktoken", "blobfile")
extras["testing"] = (
deps_list(
"pytest",