Example of pad_to_multiple_of for padding and truncation guide & docstring update (#22278)

* added an example of pad_to_multiple_of

* make style

* addressed feedback
This commit is contained in:
Maria Khalusova
2023-03-20 14:18:55 -04:00
committed by GitHub
parent fb0a38b4f2
commit 7bd8650512
2 changed files with 4 additions and 2 deletions

View File

@@ -50,6 +50,7 @@ The following table summarizes the recommended way to setup padding and truncati
| | | `tokenizer(batch_sentences, padding='longest')` | | | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | | | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | | | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| | padding to a multiple of a value | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8) |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or | | truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` | | | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or | | | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |

View File

@@ -1342,8 +1342,9 @@ ENCODE_KWARGS_DOCSTRING = r"""
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification. which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (`int`, *optional*): pad_to_multiple_of (`int`, *optional*):
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable If set will pad the sequence to a multiple of the provided value. Requires `padding` to be activated.
the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta). This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
return_tensors (`str` or [`~utils.TensorType`], *optional*): return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are: If set, will return tensors instead of list of python integers. Acceptable values are: