Create concept guide section (#16369)

*  create concept guide section

* 🖍 make fixup

* 🖍 apply feedback

Co-authored-by: Steven <stevhliu@gmail.com>
Steven Liu
2022-03-25 12:51:43 -07:00
committed by GitHub
parent ed2ee373d0
commit b320d87ece
8 changed files with 113 additions and 815 deletions


@@ -494,65 +494,4 @@ A processor combines a feature extractor and tokenizer. Load a processor with [`
Notice the processor has added `input_values` and `labels`. The sampling rate has also been correctly downsampled to 16kHz.
Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.
## Everything you always wanted to know about padding and truncation
We have seen the commands that will work for most cases (pad your batch to the length of the longest sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are `padding`, `truncation` and `max_length`.
- `padding` controls the padding. It can be a boolean or a string which should be:

  - `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
    a single sequence).
  - `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted
    by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence,
    padding will still be applied to it.
  - `False` or `'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
    behavior.

- `truncation` controls the truncation. It can be a boolean or a string which should be:

  - `True` or `'longest_first'` to truncate to a maximum length specified by the `max_length` argument or
    the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will
    truncate token by token, removing a token from the longest sequence in the pair until the proper length is
    reached.
  - `'only_second'` to truncate to a maximum length specified by the `max_length` argument or the maximum
    length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
    the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `'only_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum
    length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
    the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
    default behavior.

- `max_length` controls the length of the padding/truncation. It can be an integer or `None`, in which case
  it will default to the maximum length the model can accept. If the model has no specific maximum input length,
  truncation/padding to `max_length` is deactivated.
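
To make the three pair-truncation strategies concrete, here is a minimal pure-Python sketch that operates on plain lists of token ids. The helper `truncate_pair` is hypothetical and is not part of the library; it ignores special tokens and always removes tokens from the right, so treat it as an illustration of the behavior described above rather than the tokenizer's actual implementation.

```python
from typing import List, Tuple


def truncate_pair(
    first: List[int],
    second: List[int],
    max_length: int,
    strategy: str = "longest_first",
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: trim a pair of token-id lists to at most `max_length` tokens in total."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    first, second = list(first), list(second)  # work on copies
    while len(first) + len(second) > max_length:
        if strategy == "only_first":
            if not first:
                raise ValueError("first sequence is empty; cannot truncate further")
            first.pop()
        elif strategy == "only_second":
            if not second:
                raise ValueError("second sequence is empty; cannot truncate further")
            second.pop()
        elif strategy == "longest_first":
            # Remove one token at a time from whichever sequence is currently longer.
            # Assumption: on a tie, the second sequence is shortened.
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return first, second


question = [1, 2, 3, 4]
context = [5, 6, 7, 8, 9, 10]

print(truncate_pair(question, context, 6, "longest_first"))  # ([1, 2, 3], [5, 6, 7])
print(truncate_pair(question, context, 6, "only_second"))    # ([1, 2, 3, 4], [5, 6])
print(truncate_pair(question, context, 6, "only_first"))     # ([], [5, 6, 7, 8, 9, 10])
```

Note how `'longest_first'` balances the two sequences, while `'only_first'` and `'only_second'` leave the other sequence untouched.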
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace `truncation=True` with a `STRATEGY` selected from
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated, as detailed before.
| Truncation | Padding | Instruction |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation | no padding | `tokenizer(batch_sentences)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or |
| | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
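
The combinations in the table can be mimicked on plain lists of token ids. The sketch below uses a hypothetical helper, `prepare_batch`, with an assumed model limit of 8 tokens and an assumed padding id of 0; it only handles single sequences (not pairs) and is meant to show how `padding`, `truncation` and `max_length` interact, not how the tokenizer is implemented.

```python
from typing import List, Optional, Union


def prepare_batch(
    batch: List[List[int]],
    padding: Union[bool, str] = False,
    truncation: Union[bool, str] = False,
    max_length: Optional[int] = None,
    model_max_length: int = 8,  # assumed model limit, for illustration only
    pad_id: int = 0,            # assumed padding token id
) -> List[List[int]]:
    """Hypothetical helper: apply truncation, then padding, to a batch of token-id lists."""
    # `max_length=None` falls back to the maximum length the model accepts.
    limit = max_length if max_length is not None else model_max_length

    if truncation not in (False, "do_not_truncate"):
        batch = [list(seq[:limit]) for seq in batch]
    else:
        batch = [list(seq) for seq in batch]

    if padding in (False, "do_not_pad"):
        return batch
    if padding in (True, "longest"):
        target = max((len(seq) for seq in batch), default=0)
    elif padding == "max_length":
        target = limit
    else:
        raise ValueError(f"unknown padding value: {padding!r}")

    # Sequences already longer than `target` are left as they are.
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


batch = [[1, 2, 3], list(range(4, 14))]  # lengths 3 and 10


def lengths(seqs):
    return [len(s) for s in seqs]


print(lengths(prepare_batch(batch)))                                    # [3, 10]
print(lengths(prepare_batch(batch, padding=True)))                      # [10, 10]
print(lengths(prepare_batch(batch, truncation=True)))                   # [3, 8]
print(lengths(prepare_batch(batch, padding=True, truncation=True)))     # [8, 8]
print(prepare_batch(batch, padding="max_length", truncation=True, max_length=5))
# [[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]]
```

Padding without truncation can leave sequences longer than the target, which is why padding and truncation are usually combined when a rectangular batch is needed.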