Clarify description of the is_split_into_words argument (#11449)

* Improve documentation for is_split_into_words argument

* Change description wording
Kostas Stathoulopoulos
2021-04-26 16:29:36 +01:00
committed by GitHub
parent ab2cabb964
commit 6715e3b6a1
3 changed files with 9 additions and 5 deletions

@@ -1286,8 +1286,9 @@ ENCODE_KWARGS_DOCSTRING = r"""
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (:obj:`bool`, `optional`, defaults to :obj:`False`):
-Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer
-will skip the pre-tokenization step. This is useful for NER or token classification.
+Whether or not the input is already pre-tokenized (e.g., split into words). If set to :obj:`True`,
+the tokenizer assumes the input is already split into words (for instance, by splitting it on
+whitespace) which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (:obj:`int`, `optional`):
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
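
The reworded description says that with is_split_into_words set to True the tokenizer still tokenizes each supplied word; it only skips splitting the text into words itself. The snippet below is a minimal pure-Python sketch of that behaviour and of why it helps with NER labels. It is not the library's implementation: `TOY_VOCAB`, `toy_wordpiece` and `toy_encode` are hypothetical names invented for illustration, and the greedy subword split is a simplified stand-in for a real subword model.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical toy vocabulary; a real tokenizer ships its own.
TOY_VOCAB = {"hugging", "##face", "is", "in", "new", "york"}


def toy_wordpiece(word: str) -> List[str]:
    """Greedy longest-match-first split of one word into toy subword pieces."""
    word = word.lower()
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Continuation pieces carry a "##" prefix, as in WordPiece-style vocabularies.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOY_VOCAB:
                break
            end -= 1
        if end == start:
            # No vocabulary entry matched: treat the whole word as unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def toy_encode(
    text: Union[str, Sequence[str]], is_split_into_words: bool = False
) -> Tuple[List[str], List[int]]:
    """Return (tokens, word_ids), where word_ids[i] is the index of the source word."""
    if is_split_into_words:
        if isinstance(text, str):
            raise TypeError("expected a list of words when is_split_into_words=True")
        words = list(text)  # trust the caller's word boundaries
    else:
        if not isinstance(text, str):
            raise TypeError("expected a single string when is_split_into_words=False")
        words = text.split()  # stand-in for the pre-tokenization step
    tokens: List[str] = []
    word_ids: List[int] = []
    for index, word in enumerate(words):
        for piece in toy_wordpiece(word):  # each word is still split into subwords
            tokens.append(piece)
            word_ids.append(index)
    return tokens, word_ids


# Word-level NER labels can be carried over to subword tokens via word_ids.
words = ["HuggingFace", "is", "in", "New", "York"]
labels = ["B-ORG", "O", "O", "B-LOC", "I-LOC"]
tokens, word_ids = toy_encode(words, is_split_into_words=True)
token_labels = [labels[i] for i in word_ids]
```

With the real library, the analogous call would pass the list of words to the tokenizer together with `is_split_into_words=True`; fast tokenizers expose a comparable word-to-token mapping through `word_ids()` on the returned encoding. Those details should be checked against the installed version.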