Create concept guide section (#16369)

* ✨ create concept guide section
* 🖍 make fixup
* 🖍 apply feedback

Co-authored-by: Steven <stevhliu@gmail.com>
2022-03-25 12:51:43 -07:00
parent ed2ee373d0
commit b320d87ece
8 changed files with 113 additions and 815 deletions
--- a/docs/source/preprocessing.mdx
+++ b/docs/source/preprocessing.mdx
@@ -494,65 +494,4 @@ A processor combines a feature extractor and tokenizer. Load a processor with [`

 Notice the processor has added `input_values` and `labels`. The sampling rate has also been correctly downsampled to 16kHz.

-Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.
-
-## Everything you always wanted to know about padding and truncation
-
-We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
-truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
-three arguments you need to know for this are `padding`, `truncation` and `max_length`.
-
- `padding` controls the padding. It can be a boolean or a string which should be:
-
-  - `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
-    a single sequence).
-  - `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted
-    by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence,
-    padding will still be applied to it.
-  - `False` or `'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
-    behavior.
-
- `truncation` controls the truncation. It can be a boolean or a string which should be:
-
-  - `True` or `'longest_first'` to truncate to a maximum length specified by the `max_length` argument or
-    the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will
-    truncate token by token, removing a token from the longest sequence in the pair until the proper length is
-    reached.
-  - `'only_second'` to truncate to a maximum length specified by the `max_length` argument or the maximum
-    length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
-    the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
-  - `'only_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum
-    length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
-    the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
-  - `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
-    default behavior.
-
- `max_length` controls the length of the padding/truncation. It can be an integer or `None`, in which case
-  it will default to the maximum length the model can accept. If the model has no specific maximum input length,
-  truncation/padding to `max_length` is deactivated.
-
-Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
-any of the following examples, you can replace `truncation=True` by a `STRATEGY` selected in
-`['only_first', 'only_second', 'longest_first']`, e.g. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.
-
-| Truncation                           | Padding                           | Instruction                                                                                 |
-|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
-| no truncation                        | no padding                        | `tokenizer(batch_sentences)`                                                           |
-|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True)` or                                          |
-|                                      |                                   | `tokenizer(batch_sentences, padding='longest')`                                        |
-|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')`                                     |
-|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', max_length=42)`                      |
-| truncation to max model input length | no padding                        | `tokenizer(batch_sentences, truncation=True)` or                                       |
-|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY)`                                      |
-|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True)` or                         |
-|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)`                        |
-|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or                 |
-|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)`                |
-|                                      | padding to specific length        | Not possible                                                                                |
-| truncation to specific length        | no padding                        | `tokenizer(batch_sentences, truncation=True, max_length=42)` or                        |
-|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)`                       |
-|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or          |
-|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)`         |
-|                                      | padding to max model input length | Not possible                                                                                |
-|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or  |
-|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
+Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.