@@ -284,6 +284,12 @@ The tokenizer also accept pre-tokenized inputs. This is particularly useful when
|
|||||||
predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
|
predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
|
||||||
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
|
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
|
||||||
|
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
|
||||||
|
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
|
||||||
|
like BPE).
|
||||||
|
|
||||||
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
|
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
|
||||||
tokenizer. For instance, we have:
|
tokenizer. For instance, we have:
|
||||||
|
|
||||||
|
|||||||
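To make the warning added in this hunk concrete, here is a minimal sketch of what "pre-tokenized" means: the input is already split into words, and each word still has to be split into subwords. The toy vocabulary, ``split_word`` and ``encode_pretokenized`` are hypothetical helpers written for illustration only. They do not call the library, and the greedy longest-match split is a stand-in rather than the algorithm any particular tokenizer uses.

```python
from typing import List, Set, Tuple

# Hypothetical toy vocabulary; a real tokenizer ships its own.
TOY_VOCAB: Set[str] = {"hello", "token", "##izer", "##s"}


def split_word(word: str, vocab: Set[str]) -> List[str]:
    """Split one word into subwords with a greedy longest-match search."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Continuation pieces are marked with a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No matching piece: the whole word is treated as unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode_pretokenized(words: List[str], vocab: Set[str]) -> Tuple[List[str], List[int]]:
    """Subword-tokenize a list of words and record which word each piece came from."""
    tokens: List[str] = []
    word_ids: List[int] = []
    for index, word in enumerate(words):
        for piece in split_word(word.lower(), vocab):
            tokens.append(piece)
            word_ids.append(index)
    return tokens, word_ids


# The input is already split into words, but not yet into subwords.
words = ["Hello", "tokenizers"]
tokens, word_ids = encode_pretokenized(words, TOY_VOCAB)
print(tokens)
print(word_ids)
```

The ``word_ids`` list is what makes word-level inputs convenient for NER or POS tagging: a label attached to a word can be propagated to every subword that came from it.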
@@ -1088,7 +1088,8 @@ ENCODE_KWARGS_DOCSTRING = r"""
|
|||||||
returned to provide some overlap between truncated and overflowing sequences. The value of this
|
returned to provide some overlap between truncated and overflowing sequences. The value of this
|
||||||
argument defines the number of overlapping tokens.
|
argument defines the number of overlapping tokens.
|
||||||
is_pretokenized (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
is_pretokenized (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
Whether or not the input is already tokenized.
|
Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer
|
||||||
|
will skip the pre-tokenization step. This is useful for NER or token classification.
|
||||||
pad_to_multiple_of (:obj:`int`, `optional`):
|
pad_to_multiple_of (:obj:`int`, `optional`):
|
||||||
If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable
|
If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable
|
||||||
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
|
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
|
||||||
|
|||||||
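The ``pad_to_multiple_of`` behaviour described in this hunk comes down to simple length arithmetic, sketched below. ``pad_to_multiple`` and the default ``pad_id=0`` are hypothetical names and values chosen for illustration; this is not the library's implementation.

```python
from typing import List


def pad_to_multiple(ids: List[int], multiple: int, pad_id: int = 0) -> List[int]:
    """Right-pad ``ids`` so that its length is a multiple of ``multiple``."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    remainder = len(ids) % multiple
    if remainder == 0:
        # Already aligned: return a copy unchanged.
        return list(ids)
    return list(ids) + [pad_id] * (multiple - remainder)


# Five token ids padded up to the next multiple of 8.
padded = pad_to_multiple([101, 7592, 2088, 102, 17], 8)
print(len(padded))
```

Padding to the next multiple rather than to a fixed maximum length keeps sequences short while still giving the hardware aligned shapes.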