From 1f6885bad0a7584cab593151a121f230ec085c00 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Tue, 1 Nov 2022 10:37:20 -0700
Subject: [PATCH] add dataset (#20005)

---
 docs/source/en/quicktour.mdx | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/docs/source/en/quicktour.mdx b/docs/source/en/quicktour.mdx
index 4f17485342..2227480d58 100644
--- a/docs/source/en/quicktour.mdx
+++ b/docs/source/en/quicktour.mdx
@@ -432,19 +432,30 @@ Depending on your task, you'll typically pass the following parameters to [`Trai
    >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    ```
 
-4. Your preprocessed train and test datasets:
+4. Load a dataset:
 
    ```py
-   >>> train_dataset = dataset["train"]  # doctest: +SKIP
-   >>> eval_dataset = dataset["eval"]  # doctest: +SKIP
+   >>> from datasets import load_dataset
+
+   >>> dataset = load_dataset("rotten_tomatoes")
    ```
 
-5. A [`DataCollator`] to create a batch of examples from your dataset:
+5. Create a function to tokenize the dataset, and apply it over the entire dataset with [`~datasets.Dataset.map`]:
 
    ```py
-   >>> from transformers import DefaultDataCollator
+   >>> def tokenize_dataset(dataset):
+   ...     return tokenizer(dataset["text"])
 
-   >>> data_collator = DefaultDataCollator()
+
+   >>> dataset = dataset.map(tokenize_dataset, batched=True)
+   ```
+
+6. A [`DataCollatorWithPadding`] to create a batch of examples from your dataset:
+
+   ```py
+   >>> from transformers import DataCollatorWithPadding
+
+   >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    ```
 
 Now gather all these classes in [`Trainer`]:
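
Note on the collator change: the tokenize step in this patch calls the tokenizer without padding, so the mapped examples have different lengths and have to be padded when they are batched. That is presumably why the patch replaces `DefaultDataCollator` with `DataCollatorWithPadding`. The following is a minimal pure-Python sketch of that dynamic-padding idea, not the library's implementation. The helper `pad_batch`, the sample token ids, and the pad id of 0 are all assumptions made for illustration.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]], pad_token_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in this batch.

    Hypothetical helper: it only mimics the idea of dynamic padding.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Pad token ids with pad_token_id and mark padded positions with 0.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Made-up token ids standing in for unpadded tokenizer output.
features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 3185, 2001, 102], "attention_mask": [1, 1, 1, 1, 1]},
]

batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the usual motivation for padding in the collator instead of in the tokenize function.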