Create concept guide section (#16369)
* ✨ create concept guide section * 🖍 make fixup * 🖍 apply feedback Co-authored-by: Steven <stevhliu@gmail.com>
@@ -494,65 +494,4 @@ A processor combines a feature extractor and tokenizer. Load a processor with [`
Notice the processor has added `input_values` and `labels`. The audio has also been correctly downsampled to a sampling rate of 16kHz.

Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.

## Everything you always wanted to know about padding and truncation

We have seen the commands that work for most cases: pad your batch to the length of the longest sentence and
truncate to the maximum length the model can accept. However, the API supports more strategies if you need them. The
three arguments you need to know are `padding`, `truncation` and `max_length`.

- `padding` controls padding. It can be a boolean or one of the following strings:

  - `True` or `'longest'`: pad to the longest sequence in the batch (no padding is applied if you only provide
    a single sequence).
  - `'max_length'`: pad to the length specified by the `max_length` argument, or to the maximum length accepted
    by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence,
    padding is still applied to it.
  - `False` or `'do_not_pad'`: do not pad the sequences. As we have seen before, this is the default
    behavior.

- `truncation` controls truncation. It can be a boolean or one of the following strings:

  - `True` or `'longest_first'`: truncate to the maximum length specified by the `max_length` argument, or to
    the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This
    truncates token by token, removing a token from the longest sequence in the pair until the proper length is
    reached.
  - `'only_second'`: truncate to the maximum length specified by the `max_length` argument, or to the maximum
    length accepted by the model if no `max_length` is provided (`max_length=None`). This only truncates
    the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `'only_first'`: truncate to the maximum length specified by the `max_length` argument, or to the maximum
    length accepted by the model if no `max_length` is provided (`max_length=None`). This only truncates
    the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `False` or `'do_not_truncate'`: do not truncate the sequences. As we have seen before, this is the
    default behavior.

- `max_length` controls the length of the padding/truncation. It can be an integer or `None`, in which case
  it defaults to the maximum length the model can accept. If the model has no specific maximum input length,
  truncation/padding to `max_length` is deactivated.

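To make the strategies above more concrete, here is a minimal pure-Python sketch of what they do to lists of token ids. It is an illustration only, not the tokenizer's actual implementation: `truncate_pair` and `pad_batch` are hypothetical helpers, special tokens are ignored, tokens are removed from and padding is added to the right-hand side, and `'longest_first'` is assumed to trim the second sequence when both have the same length.

```python
from typing import List, Optional, Tuple, Union


def truncate_pair(
    first: List[int],
    second: List[int],
    max_length: int,
    strategy: str = "longest_first",
) -> Tuple[List[int], List[int]]:
    """Trim a pair of token-id lists so their combined length is <= max_length."""
    first, second = list(first), list(second)
    excess = len(first) + len(second) - max_length
    if excess <= 0 or strategy == "do_not_truncate":
        return first, second
    if strategy == "only_first":
        if excess > len(first):
            raise ValueError("first sequence is too short to absorb the truncation")
        return first[: len(first) - excess], second
    if strategy == "only_second":
        if excess > len(second):
            raise ValueError("second sequence is too short to absorb the truncation")
        return first, second[: len(second) - excess]
    if strategy == "longest_first":
        # Remove one token at a time from whichever sequence is currently longer.
        # Ties go to the second sequence (an assumption of this sketch).
        for _ in range(excess):
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
        return first, second
    raise ValueError(f"unknown truncation strategy: {strategy!r}")


def pad_batch(
    batch: List[List[int]],
    padding: Union[bool, str] = False,
    max_length: Optional[int] = None,
    model_max_length: Optional[int] = None,
    pad_id: int = 0,
) -> List[List[int]]:
    """Right-pad every sequence in the batch according to the padding strategy."""
    if padding in (False, "do_not_pad"):
        return [list(seq) for seq in batch]
    if padding in (True, "longest"):
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        target = max_length if max_length is not None else model_max_length
        if target is None:
            # No usable length is known, so padding to max_length is deactivated.
            return [list(seq) for seq in batch]
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    # Sequences already at or beyond the target are left unchanged.
    return [list(seq) + [pad_id] * (target - len(seq)) for seq in batch]


first, second = [1, 2, 3, 4, 5, 6], [7, 8, 9]
print(truncate_pair(first, second, 5))                 # ([1, 2, 3], [7, 8])
print(truncate_pair(first, second, 7, "only_second"))  # ([1, 2, 3, 4, 5, 6], [7])
print(pad_batch([[1, 2, 3], [4]], padding=True))       # [[1, 2, 3], [4, 0, 0]]
```

Note how `padding=True` leaves a single-sequence batch untouched, because that sequence is already the longest one, while `padding='max_length'` would still pad it.
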
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace `truncation=True` with a `STRATEGY` selected from
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated, as detailed before.

| Truncation                           | Padding                           | Instruction |
|--------------------------------------|-----------------------------------|-------------|
| no truncation                        | no padding                        | `tokenizer(batch_sentences)` |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True)` or |
|                                      |                                   | `tokenizer(batch_sentences, padding='longest')` |
|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding                        | `tokenizer(batch_sentences, truncation=True)` or |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY)` |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
|                                      | padding to specific length        | Not possible |
| truncation to specific length        | no padding                        | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
|                                      | padding to max model input length | Not possible |
|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |

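The two "Not possible" rows follow from the fact that a single `max_length` argument drives both truncation and `'max_length'` padding. The sketch below shows this by resolving the two target lengths from the arguments described earlier. `resolve_targets` is a hypothetical helper written for this illustration, not part of the tokenizer API, and the value 512 used as the model's maximum input length is an assumption.

```python
from typing import Optional, Tuple, Union


def resolve_targets(
    padding: Union[bool, str] = False,
    truncation: Union[bool, str] = False,
    max_length: Optional[int] = None,
    model_max_length: Optional[int] = None,
) -> Tuple[Optional[int], Union[int, str, None]]:
    """Return (truncate_to, pad_to) for a given combination of arguments."""
    # One shared limit: the explicit max_length, else the model's maximum.
    limit = max_length if max_length is not None else model_max_length

    truncate_to = None if truncation in (False, "do_not_truncate") else limit

    if padding in (True, "longest"):
        pad_to: Union[int, str, None] = "longest in batch"
    elif padding == "max_length":
        pad_to = limit
    else:
        pad_to = None
    return truncate_to, pad_to


# Truncating and padding with 'max_length' always end up at the same length,
# so "truncate to the model maximum but pad to 42" cannot be expressed.
print(resolve_targets("max_length", True, model_max_length=512))                 # (512, 512)
print(resolve_targets("max_length", True, max_length=42, model_max_length=512))  # (42, 42)
print(resolve_targets(True, True, max_length=42, model_max_length=512))          # (42, 'longest in batch')
```
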