HuggingFace_transformer/tests
Nicolas Patry a3bd763732 Better heuristic for token-classification pipeline. (#12611)
* Better heuristic for token-classification pipeline.

Looking at the problem again makes things much simpler: when we look
at the ids from a tokenizer, we have no way in **general** to tell
whether a given substring is part of a word or not.

However, within the pipeline, the offsets still give us access to the
original string, so we can simply check whether the character preceding
a token (if it exists) is a space. This will obviously be wrong for
tokenizers that contain spaces within tokens and for tokenizers whose
offsets include spaces too (I don't think there are many).

This heuristic is hopefully fully backward compatible and can still
handle non-word-based tokenizers.

* Updating test with real values.

* We still need the older "correct" heuristic to prevent fusing
punctuation.

* Adding a real warning when important.
2021-07-26 16:21:26 +02:00
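
The space-based heuristic described in the commit message can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline's actual code: `is_subword` and `flag_subwords` are hypothetical helper names, and the offsets are written by hand to stand in for a tokenizer's offset mapping.

```python
from typing import List, Tuple


def is_subword(sentence: str, start: int) -> bool:
    """Hypothetical helper: guess whether the token starting at `start`
    continues the previous word, using only the original string."""
    # A token at the very beginning of the text has no previous character.
    if start == 0:
        return False
    # Otherwise treat it as a continuation unless a space precedes it.
    return sentence[start - 1] != " "


def flag_subwords(
    sentence: str, offsets: List[Tuple[int, int]]
) -> List[Tuple[str, bool]]:
    """Pair each token's surface text with the heuristic's verdict."""
    return [(sentence[s:e], is_subword(sentence, s)) for s, e in offsets]


sentence = "Hello Huggingface."
# Hand-written (start, end) offsets, assumed for illustration only.
offsets = [(0, 5), (6, 13), (13, 17), (17, 18)]
flags = flag_subwords(sentence, offsets)
print(flags)
```

In this sketch the final `.` is also flagged as a continuation because no space precedes it, which is the kind of punctuation fusing that the older heuristic mentioned in the commit message is kept to prevent.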