Generate: move generation_*.py src files into generation/*.py (#20096)
* move generation_*.py src files into generation/*.py * populate generation.__init__ with lazy loading * move imports and references from generation.xxx.object to generation.object
This commit is contained in:
@@ -58,7 +58,7 @@ This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The
|
||||
- Model predictions are intended to be identical to the original implementation when
|
||||
`forced_bos_token_id=0`. This only works, however, if the string you pass to
|
||||
[`fairseq.encode`] starts with a space.
|
||||
- [`~generation_utils.GenerationMixin.generate`] should be used for conditional generation tasks like
|
||||
- [`~generation.GenerationMixin.generate`] should be used for conditional generation tasks like
|
||||
summarization, see the example in that docstrings.
|
||||
- Models that load the *facebook/bart-large-cnn* weights will not have a `mask_token_id`, or be able to perform
|
||||
mask-filling tasks.
|
||||
@@ -188,4 +188,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
## FlaxBartForCausalLM
|
||||
|
||||
[[autodoc]] FlaxBartForCausalLM
|
||||
- __call__
|
||||
- __call__
|
||||
|
||||
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
|
||||
*Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains.*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Donut high-level overview. Taken from the <a href="https://arxiv.org/abs/2111.15664">original paper</a>. </small>
|
||||
|
||||
@@ -40,7 +40,7 @@ Tips:
|
||||
## Inference
|
||||
|
||||
Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
|
||||
[`~generation_utils.GenerationMixin.generate`] to autoregressively generate text given the input image.
|
||||
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
|
||||
|
||||
The [`DonutFeatureExtractor`] class is responsible for preprocessing the input image and
|
||||
[`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The
|
||||
@@ -211,4 +211,4 @@ We refer to the [tutorial notebooks](https://github.com/NielsRogge/Transformers-
|
||||
## DonutSwinModel
|
||||
|
||||
[[autodoc]] DonutSwinModel
|
||||
- forward
|
||||
- forward
|
||||
|
||||
@@ -53,7 +53,7 @@ Tips:
|
||||
|
||||
### Generation
|
||||
|
||||
The [`~generation_utils.GenerationMixin.generate`] method can be used to generate text using GPT-J
|
||||
The [`~generation.GenerationMixin.generate`] method can be used to generate text using GPT-J
|
||||
model.
|
||||
|
||||
```python
|
||||
|
||||
@@ -38,7 +38,7 @@ Tips:
|
||||
## Inference
|
||||
|
||||
Speech2Text2's [`SpeechEncoderDecoderModel`] model accepts raw waveform input values from speech and
|
||||
makes use of [`~generation_utils.GenerationMixin.generate`] to translate the input speech
|
||||
makes use of [`~generation.GenerationMixin.generate`] to translate the input speech
|
||||
autoregressively to the target language.
|
||||
|
||||
The [`Wav2Vec2FeatureExtractor`] class is responsible for preprocessing the input speech and
|
||||
|
||||
@@ -225,7 +225,7 @@ batch) leads to very slow training on TPU.
|
||||
|
||||
## Inference
|
||||
|
||||
At inference time, it is recommended to use [`~generation_utils.GenerationMixin.generate`]. This
|
||||
At inference time, it is recommended to use [`~generation.GenerationMixin.generate`]. This
|
||||
method takes care of encoding the input and feeding the encoded hidden states via cross-attention layers to the decoder
|
||||
and auto-regressively generates the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers.
|
||||
There's also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how
|
||||
@@ -244,7 +244,7 @@ Das Haus ist wunderbar.
|
||||
```
|
||||
|
||||
Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when doing generation without using
|
||||
[`~generation_utils.GenerationMixin.generate`], make sure you start it with the `pad_token_id`.
|
||||
[`~generation.GenerationMixin.generate`], make sure you start it with the `pad_token_id`.
|
||||
|
||||
The example above only shows a single example. You can also do batched inference, like so:
|
||||
|
||||
|
||||
@@ -30,7 +30,7 @@ show that the TrOCR model outperforms the current state-of-the-art models on bot
|
||||
tasks.*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> TrOCR architecture. Taken from the <a href="https://arxiv.org/abs/2109.10282">original paper</a>. </small>
|
||||
|
||||
@@ -53,7 +53,7 @@ Tips:
|
||||
## Inference
|
||||
|
||||
TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of
|
||||
[`~generation_utils.GenerationMixin.generate`] to autoregressively generate text given the input image.
|
||||
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
|
||||
|
||||
The [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] class is responsible for preprocessing the input image and
|
||||
[`RobertaTokenizer`/`XLMRobertaTokenizer`] decodes the generated target tokens to the target string. The
|
||||
@@ -64,20 +64,20 @@ into a single instance to both extract the input features and decode the predict
|
||||
|
||||
``` py
|
||||
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
|
||||
>>> import requests
|
||||
>>> import requests
|
||||
>>> from PIL import Image
|
||||
|
||||
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
|
||||
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
|
||||
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
|
||||
|
||||
>>> # load image from the IAM dataset
|
||||
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
|
||||
>>> # load image from the IAM dataset
|
||||
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
|
||||
|
||||
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
|
||||
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
|
||||
>>> generated_ids = model.generate(pixel_values)
|
||||
|
||||
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
||||
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
||||
```
|
||||
|
||||
See the [model hub](https://huggingface.co/models?filter=trocr) to look for TrOCR checkpoints.
|
||||
|
||||
@@ -24,7 +24,7 @@ The abstract from the paper is the following:
|
||||
Tips:
|
||||
|
||||
- The model usually performs well without requiring any finetuning.
|
||||
- The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation_utils.GenerationMixin.generate`] function for inference.
|
||||
- The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation.GenerationMixin.generate`] function for inference.
|
||||
- Inference is currently only implemented for short-form i.e. audio is pre-segmented into <=30s segments. Long-form (including timestamps) will be implemented in a future release.
|
||||
- One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user