Framework split (#16030)
* First files
* More files
* Last files
* Style
@@ -157,6 +157,8 @@ Apply the `group_texts` function over the entire dataset:

For causal language modeling, use [`DataCollatorForLanguageModeling`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

<frameworkcontent>
<pt>
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

```py
@@ -164,7 +166,21 @@ You can use the end of sequence token as the padding token, and set `mlm=False`.

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
===PT-TF-SPLIT===
```

For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.

```py
>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```
</pt>
<tf>
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

```py
>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
@@ -175,13 +191,10 @@ For masked language modeling, use the same [`DataCollatorForLanguageModeling`] e
```py
>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
===PT-TF-SPLIT===
>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
```
</tf>
</frameworkcontent>
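
For readers who want to see what the causal language modeling batch described in this hunk looks like, here is a minimal plain-Python sketch. It is not the library's implementation: `collate_causal_lm` and `IGNORE_INDEX` are hypothetical names chosen for illustration, and the sketch assumes that the one-element shift mentioned in the text is applied later, when the loss is computed, so the labels here are an unshifted copy of the inputs.

```python
# Illustrative stand-in for a causal-LM collator; names are hypothetical.
IGNORE_INDEX = -100  # assumed label value for positions the loss should skip


def collate_causal_lm(examples, pad_token_id):
    """Pad token-id lists to the longest one in the batch and copy inputs to labels."""
    longest = max(len(ids) for ids in examples)
    input_ids, attention_mask, labels = [], [], []
    for ids in examples:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

The batch is only as wide as its longest example, which is the "dynamic padding" behavior the paragraph describes.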

## Causal language modeling
@@ -89,6 +89,8 @@ tokenized_swag = swag.map(preprocess_function, batched=True)

`DataCollatorForMultipleChoice` will flatten all the model inputs, apply padding, and then unflatten the results:

<frameworkcontent>
<pt>
```py
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
@@ -128,7 +130,10 @@ tokenized_swag = swag.map(preprocess_function, batched=True)
...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
...         return batch
===PT-TF-SPLIT===
```
</pt>
<tf>
```py
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
@@ -168,6 +173,8 @@ tokenized_swag = swag.map(preprocess_function, batched=True)
...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
...         return batch
```
</tf>
</frameworkcontent>
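
The flatten, pad, unflatten sequence named in this hunk can be sketched without any framework. The function below is a hypothetical illustration using nested lists, not the `DataCollatorForMultipleChoice` code shown in the diff (most of which is elided by the hunk boundaries); it assumes each example is a list of answer choices and each choice is a list of token ids.

```python
def collate_multiple_choice(features, pad_token_id=0):
    """Flatten choices, pad them to a common length, then regroup per example."""
    batch_size = len(features)
    num_choices = len(features[0])
    # 1. Flatten: (batch_size, num_choices, seq) -> (batch_size * num_choices, seq)
    flat = [choice for example in features for choice in example]
    # 2. Pad every flattened sequence to the longest one in the batch
    longest = max(len(ids) for ids in flat)
    padded = [ids + [pad_token_id] * (longest - len(ids)) for ids in flat]
    # 3. Unflatten: regroup into (batch_size, num_choices, longest)
    return [padded[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]


features = [[[1, 2], [3]], [[4, 5, 6], [7, 8]]]
print(collate_multiple_choice(features))
# [[[1, 2, 0], [3, 0, 0]], [[4, 5, 6], [7, 8, 0]]]
```

Flattening first means a single padding pass sees every choice, so all choices of all examples end up with the same length before they are regrouped.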

## Fine-tune with Trainer
@@ -134,15 +134,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference

Use [`DefaultDataCollator`] to create a batch of examples. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.

<frameworkcontent>
<pt>
```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
===PT-TF-SPLIT===
```
</pt>
<tf>
```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")
```
</tf>
</frameworkcontent>

## Fine-tune with Trainer
@@ -74,15 +74,22 @@ tokenized_imdb = imdb.map(preprocess_function, batched=True)

Use [`DataCollatorWithPadding`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
===PT-TF-SPLIT===
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
```
</tf>
</frameworkcontent>
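
To make the efficiency claim in this hunk concrete, the sketch below counts how many token slots a model would process under two padding strategies. It is a simplified illustration under stated assumptions: `padded_token_count` is a hypothetical helper, the sequence lengths are made up, and up-front padding is modeled as padding every example to the longest sequence among all of them.

```python
def padded_token_count(lengths, batch_size, dynamic):
    """Total token slots after padding, given the length of each sequence."""
    global_longest = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        # Dynamic padding targets the longest sequence in this batch only.
        target = max(chunk) if dynamic else global_longest
        total += target * len(chunk)
    return total


lengths = [12, 15, 9, 200, 14, 11, 10, 13]  # one outlier of 200 tokens
fixed = padded_token_count(lengths, batch_size=4, dynamic=False)
dyn = padded_token_count(lengths, batch_size=4, dynamic=True)
print(fixed)  # 1600
print(dyn)    # 856
```

With per-batch padding, the long outlier inflates only the batch it lands in rather than every batch.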

## Fine-tune with Trainer
@@ -93,15 +93,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference

Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
===PT-TF-SPLIT===
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
</tf>
</frameworkcontent>
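
Unlike the text-only collators, this hunk says both the text and the labels are padded. The sketch below shows one way that could look in plain Python: inputs and labels are padded independently, each to its own longest length in the batch. `collate_seq2seq` is a hypothetical function, not the library code, and padding labels with `-100` is an assumption made here so that padded label positions can be skipped by the loss.

```python
def collate_seq2seq(features, pad_token_id, label_pad_token_id=-100):
    """Pad inputs and labels separately, each to its longest length in the batch."""
    longest_input = max(len(f["input_ids"]) for f in features)
    longest_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = longest_input - len(f["input_ids"])
        n_lab = longest_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # Labels get their own pad value so they are not confused with real tokens.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


features = [
    {"input_ids": [11, 12, 13, 14], "labels": [21, 22]},
    {"input_ids": [15, 16], "labels": [23, 24, 25]},
]
batch = collate_seq2seq(features, pad_token_id=0)
print(batch["input_ids"])  # [[11, 12, 13, 14], [15, 16, 0, 0]]
print(batch["labels"])     # [[21, 22, -100], [23, 24, 25]]
```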

## Fine-tune with Trainer
@@ -134,15 +134,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference

Use [`DataCollatorForTokenClassification`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForTokenClassification

>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
===PT-TF-SPLIT===
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForTokenClassification

>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
```
</tf>
</frameworkcontent>

## Fine-tune with Trainer
@@ -95,15 +95,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference

Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
===PT-TF-SPLIT===
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
</tf>
</frameworkcontent>

## Fine-tune with Trainer