Nicolas Patry 1670be4bde Adding Llama FastTokenizer support. (#22264)
* Adding Llama FastTokenizer support.

- Requires a tokenizers version that includes https://github.com/huggingface/tokenizers/pull/1183
- Only supports `byte_fallback` for Llama; raises otherwise (safety net).
- Many open questions remain around special tokens.
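
For readers unfamiliar with byte fallback: a character missing from the vocabulary is split into its UTF-8 bytes, and each byte is emitted as a `<0xNN>` token. The toy sketch below illustrates the idea only. It is not the converter's or the tokenizers library's implementation; the function names and the tiny per-character vocabulary are hypothetical.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy per-character encoder (hypothetical, for illustration only)."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Unknown character: emit one <0xNN> token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Inverse of the toy encoder: merge byte tokens back into UTF-8 text."""
    raw = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            raw.append(int(tok[3:5], 16))
        else:
            raw.extend(tok.encode("utf-8"))
    return raw.decode("utf-8")


vocab = {"a", "b", " "}  # assumed toy vocabulary
tokens = encode_with_byte_fallback("a 生", vocab)
print(tokens)  # ['a', ' ', '<0xE7>', '<0x94>', '<0x9F>']
print(decode_with_byte_fallback(tokens))  # a 生
```

Without byte fallback, the same character would collapse to a single unknown token and could not be decoded back.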

How to test:

```python
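from tokenizers import Tokenizer
from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Reference tokenizer; convert_slow_tokenizer expects the slow variant.
tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

# Set to True to reuse a previously converted tokenizer saved in tok.json.
LOAD_FROM_FILE = False

if LOAD_FROM_FILE:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")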

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    # Both tokenizers must produce identical ids.
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids

    assert encoded == encoded2, f"{encoded} != {encoded2}"

    # Decoded text must also match, ignoring surrounding whitespace
    # in the reference tokenizer's output.
    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)

    assert decoded.strip() == decoded2, f"{decoded!r} != {decoded2!r}"
```
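
The script above stops at the first failing assert. When chasing the special-token case, it can help to collect every mismatch instead. A minimal sketch, assuming a hypothetical helper `find_mismatches`; the two stand-in encoders exist only so the snippet runs without downloading a model:

```python
def find_mismatches(strings, encode_ref, encode_new):
    """Return (string, ref_ids, new_ids) for every string whose ids differ."""
    mismatches = []
    for s in strings:
        ref_ids, new_ids = encode_ref(s), encode_new(s)
        if ref_ids != new_ids:
            mismatches.append((s, ref_ids, new_ids))
    return mismatches


# Stand-in encoders (hypothetical), not real tokenizers.
def encode_ref(s):
    return [ord(c) for c in s]


def encode_new(s):
    # Deliberately disagrees on strings that start with "<s>".
    if s.startswith("<s>"):
        return [0] + [ord(c) for c in s[3:]]
    return [ord(c) for c in s]


report = find_mismatches(["This is a test", "<s> Something"], encode_ref, encode_new)
for s, ref_ids, new_ids in report:
    print(f"{s!r}: {ref_ids} != {new_ids}")
```

With the real objects from the script, the encoders would be `lambda s: tokenizer(s)["input_ids"]` and `lambda s: new_tokenizer.encode(s).ids`.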

The converter + some test script.

The test script.

Tmp save.

Adding Fast tokenizer + tests.

Adding the tokenization tests.

Correct combination.

Small fix.

Fixing tests.

Fixing with latest update.

Rebased.

Fix copies + normalized added tokens + copies.

Adding doc.

TMP.

Doc + split files.

Doc.

Versions + try import.

Fix Camembert + warnings -> Error.

Fix by ArthurZucker.

Not a decorator.

* Fixing comments.

* Adding more to docstring.

* Doc rewriting.
2023-04-06 09:53:03 +02:00