[docs] Redesign (#31757)

* toctree * not-doctested.txt * collapse sections * feedback * update * rewrite get started sections * fixes * fix * loading models * fix * customize models * share * fix link * contribute part 1 * contribute pt 2 * fix toctree * tokenization pt 1 * Add new model (#32615) * v1 - working version * fix * fix * fix * fix * rename to correct name * fix title * fixup * rename files * fix * add copied from on tests * rename to `FalconMamba` everywhere and fix bugs * fix quantization + accelerate * fix copies * add `torch.compile` support * fix tests * fix tests and add slow tests * copies on config * merge the latest changes * fix tests * add few lines about instruct * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix * fix tests --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix hfoption tag * tokenization pt. 
2 * image processor * fix toctree * backbones * feature extractor * fix file name * processor * update not-doctested * update * make style * fix toctree * revision * make fixup * fix toctree * fix * make style * fix hfoption tag * pipeline * pipeline gradio * pipeline web server * add pipeline * fix toctree * not-doctested * prompting * llm optims * fix toctree * fixes * cache * text generation * fix * chat pipeline * chat stuff * xla * torch.compile * cpu inference * toctree * gpu inference * agents and tools * gguf/tiktoken * finetune * toctree * trainer * trainer pt 2 * optims * optimizers * accelerate * parallelism * fsdp * update * distributed cpu * hardware training * gpu training * gpu training 2 * peft * distrib debug * deepspeed 1 * deepspeed 2 * chat toctree * quant pt 1 * quant pt 2 * fix toctree * fix * fix * quant pt 3 * quant pt 4 * serialization * torchscript * scripts * tpu * review * model addition timeline * modular * more reviews * reviews * fix toctree * reviews reviews * continue reviews * more reviews * modular transformers * more review * zamba2 * fix * all frameworks * pytorch * supported model frameworks * flashattention * rm check_table * not-doctested.txt * rm check_support_list.py * feedback * updates/feedback * review * feedback * fix * update * feedback * updates * update --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-03 10:33:46 -08:00
parent 6aa9888463
commit c0f8d055ce
423 changed files with 10925 additions and 14569 deletions
--- a/docs/source/en/pipeline_tutorial.md
+++ b/docs/source/en/pipeline_tutorial.md
@@ -1,4 +1,4 @@
-<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -14,333 +14,332 @@ rendered properly in your Markdown viewer.

 -->

-# Pipelines for inference
+# Pipeline

-The [`pipeline`] makes it simple to use any model from the [Hub](https://huggingface.co/models) for inference on any language, computer vision, speech, and multimodal tasks. Even if you don't have experience with a specific modality or aren't familiar with the underlying code behind the models, you can still use them for inference with the [`pipeline`]! This tutorial will teach you to:
+The [`Pipeline`] is a simple but powerful inference API that is readily available for a variety of machine learning tasks with any model from the Hugging Face [Hub](https://hf.co/models).

-* Use a [`pipeline`] for inference.
-* Use a specific tokenizer or model.
-* Use a [`pipeline`] for audio, vision, and multimodal tasks.
+Tailor the [`Pipeline`] to your task with task-specific parameters, such as adding timestamps to an automatic speech recognition (ASR) pipeline for transcribing meeting notes. [`Pipeline`] supports GPUs, Apple Silicon, and half-precision weights to accelerate inference and save memory.

-<Tip>
+<Youtube id=tiZFewofSLM/>

-Take a look at the [`pipeline`] documentation for a complete list of supported tasks and available parameters.
+Transformers has two kinds of pipeline classes: a generic [`Pipeline`] and many individual task-specific pipelines like [`TextGenerationPipeline`] or [`VisualQuestionAnsweringPipeline`]. Load a task-specific pipeline by passing its task identifier to the `task` parameter of [`Pipeline`]. You can find the task identifier for each pipeline in its API documentation.

-</Tip>
+Each task is configured to use a default pretrained model and preprocessor, but this can be overridden with the `model` parameter if you want to use a different model.

-## Pipeline usage
-
-While each task has an associated [`pipeline`], it is simpler to use the general [`pipeline`] abstraction which contains 
-all the task-specific pipelines. The [`pipeline`] automatically loads a default model and a preprocessing class capable 
-of inference for your task. Let's take the example of using the [`pipeline`] for automatic speech recognition (ASR), or
-speech-to-text.
-
-
-1. Start by creating a [`pipeline`] and specify the inference task:
+For example, to use the [`TextGenerationPipeline`] with [Gemma 2](./model_doc/gemma2), set `task="text-generation"` and `model="google/gemma-2-2b"`.

 ```py
->>> from transformers import pipeline
+from transformers import pipeline

->>> transcriber = pipeline(task="automatic-speech-recognition")
+pipeline = pipeline(task="text-generation", model="google/gemma-2-2b")
+pipeline("the secret to baking a really good cake is ")
+[{'generated_text': 'the secret to baking a really good cake is 1. the right ingredients 2. the'}]
 ```

-2. Pass your input to the [`pipeline`]. In the case of speech recognition, this is an audio input file:
+When you have more than one input, pass them as a list.

 ```py
->>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
-{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}
+from transformers import pipeline
+
+pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="cuda")
+pipeline(["the secret to baking a really good cake is ", "a baguette is "])
+[[{'generated_text': 'the secret to baking a really good cake is 1. the right ingredients 2. the'}],
+ [{'generated_text': 'a baguette is 100% bread.\n\na baguette is 100%'}]]
 ```

-Not the result you had in mind? Check out some of the [most downloaded automatic speech recognition models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending) 
-on the Hub to see if you can get a better transcription.
+This guide will introduce you to the [`Pipeline`], demonstrate its features, and show how to configure its various parameters.

-Let's try the [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2) model from OpenAI. Whisper was released 
-2 years later than Wav2Vec2, and was trained on close to 10x more data. As such, it beats Wav2Vec2 on most downstream 
-benchmarks. It also has the added benefit of predicting punctuation and casing, neither of which are possible with  
-Wav2Vec2.
+## Tasks

-Let's give it a try here to see how it performs. Set `torch_dtype="auto"` to automatically load the most memory-efficient data type the weights are stored in.
+[`Pipeline`] is compatible with many machine learning tasks across different modalities. Pass an appropriate input to the pipeline and it will handle the rest.
+
+Here are some examples of how to use [`Pipeline`] for different tasks and modalities.
+
+<hfoptions id="tasks">
+<hfoption id="summarization">

 ```py
->>> transcriber = pipeline(model="openai/whisper-large-v2", torch_dtype="auto")
->>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
+from transformers import pipeline
+
+pipeline = pipeline(task="summarization", model="google/pegasus-billsum")
+pipeline("Section was formerly set out as section 44 of this title. As originally enacted, this section contained two further provisions that 'nothing in this act shall be construed as in any wise affecting the grant of lands made to the State of California by virtue of the act entitled 'An act authorizing a grant to the State of California of the Yosemite Valley, and of the land' embracing the Mariposa Big-Tree Grove, approved June thirtieth, eighteen hundred and sixty-four; or as affecting any bona-fide entry of land made within the limits above described under any law of the United States prior to the approval of this act.' The first quoted provision was omitted from the Code because the land, granted to the state of California pursuant to the Act cite, was receded to the United States. Resolution June 11, 1906, No. 27, accepted the recession.")
+[{'summary_text': 'Instructs the Secretary of the Interior to convey to the State of California all right, title, and interest of the United States in and to specified lands which are located within the Yosemite and Mariposa National Forests, California.'}]
+```
+
+</hfoption>
+<hfoption id="automatic speech recognition">
+
+```py
+from transformers import pipeline
+
+pipeline = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
+pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
 {'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
 ```

-Now this result looks more accurate! For a deep-dive comparison on Wav2Vec2 vs Whisper, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/asr_models).
-We really encourage you to check out the Hub for models in different languages, models specialized in your field, and more.
-You can check out and compare model results directly from your browser on the Hub to see if it fits or 
-handles corner cases better than other ones.
-And if you don't find a model for your use case, you can always start [training](training) your own!
-
-If you have several inputs, you can pass your input as a list:
+</hfoption>
+<hfoption id="image classification">

 ```py
-transcriber(
-    [
-        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",
-        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac",
-    ]
-)
+from transformers import pipeline
+
+pipeline = pipeline(task="image-classification", model="google/vit-base-patch16-224")
+pipeline(images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
+[{'label': 'lynx, catamount', 'score': 0.43350091576576233},
+ {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor',
+  'score': 0.034796204417943954},
+ {'label': 'snow leopard, ounce, Panthera uncia',
+  'score': 0.03240183740854263},
+ {'label': 'Egyptian cat', 'score': 0.02394474856555462},
+ {'label': 'tiger cat', 'score': 0.02288915030658245}]
 ```

-Pipelines are great for experimentation as switching from one model to another is trivial; however, there are some ways to optimize them for larger workloads than experimentation. See the following guides that dive into iterating over whole datasets or using pipelines in a webserver:
-of the docs:
-* [Using pipelines on a dataset](#using-pipelines-on-a-dataset)
-* [Using pipelines for a webserver](./pipeline_webserver)
+</hfoption>
+<hfoption id="visual question answering">
+
+```py
+from transformers import pipeline
+
+pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip-vqa-base")
+pipeline(
+    image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
+    question="What is in the image?",
+)
+[{'answer': 'statue of liberty'}]
+```
+
+</hfoption>
+</hfoptions>

 ## Parameters

-[`pipeline`] supports many parameters; some are task specific, and some are general to all pipelines.
-In general, you can specify parameters anywhere you want:
+At a minimum, [`Pipeline`] only requires a task identifier, model, and the appropriate input. But there are many parameters available to configure the pipeline with, from task-specific parameters to optimizing performance.

-```py
-transcriber = pipeline(model="openai/whisper-large-v2", my_parameter=1)
-
-out = transcriber(...)  # This will use `my_parameter=1`.
-out = transcriber(..., my_parameter=2)  # This will override and use `my_parameter=2`.
-out = transcriber(...)  # This will go back to using `my_parameter=1`.
-```
-
-Let's check out 3 important ones:
+This section introduces you to some of the more important parameters.

 ### Device

-If you use `device=n`, the pipeline automatically puts the model on the specified device.
-This will work regardless of whether you are using PyTorch or Tensorflow.
+[`Pipeline`] is compatible with many hardware types, including GPUs, CPUs, Apple Silicon, and more. Configure the hardware type with the `device` parameter. By default, [`Pipeline`] runs on a CPU which is given by `device=-1`.
+
+<hfoptions id="device">
+<hfoption id="GPU">
+
+To run [`Pipeline`] on a GPU, set `device` to the associated CUDA device id. For example, `device=0` runs on the first GPU.

 ```py
-transcriber = pipeline(model="openai/whisper-large-v2", device=0)
+from transformers import pipeline
+
+pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device=0)
+pipeline("the secret to baking a really good cake is ")
 ```

-If the model is too large for a single GPU and you are using PyTorch, you can set `torch_dtype='float16'` to enable FP16 precision inference. Usually this would not cause significant performance drops but make sure you evaluate it on your models!
+You could also let [Accelerate](https://hf.co/docs/accelerate/index), a library for distributed training, automatically choose how to load and store the model weights on the appropriate device. This is especially useful if you have multiple devices. Accelerate loads and stores the model weights on the fastest device first, and then moves the weights to other devices (CPU, hard drive) as needed. Set `device_map="auto"` to let Accelerate choose the device.

-Alternatively, you can set `device_map="auto"` to automatically 
-determine how to load and store the model weights. Using the `device_map` argument requires the 🤗 [Accelerate](https://huggingface.co/docs/accelerate)
-package:
-
-```bash
-pip install --upgrade accelerate
-```
-
-The following code automatically loads and stores model weights across devices:
+> [!TIP]
+> Make sure [Accelerate](https://hf.co/docs/accelerate/basic_tutorials/install) is installed.
+>
+> ```py
+> !pip install -U accelerate
+> ```

 ```py
-transcriber = pipeline(model="openai/whisper-large-v2", device_map="auto")
+from transformers import pipeline
+
+pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device_map="auto")
+pipeline("the secret to baking a really good cake is ")
 ```

-Note that if  `device_map="auto"` is passed, there is no need to add the argument `device=device` when instantiating your `pipeline` as you may encounter some unexpected behavior!
+</hfoption>
+<hfoption id="Apple silicon">

-### Batch size
-
-By default, pipelines will not batch inference for reasons explained in detail [here](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching). The reason is that batching is not necessarily faster, and can actually be quite slower in some cases.
-
-But if it works in your use case, you can use:
+To run [`Pipeline`] on Apple silicon, set `device="mps"`.

 ```py
-transcriber = pipeline(model="openai/whisper-large-v2", device=0, batch_size=2)
-audio_filenames = [f"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/{i}.flac" for i in range(1, 5)]
-texts = transcriber(audio_filenames)
+from transformers import pipeline
+
+pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="mps")
+pipeline("the secret to baking a really good cake is ")
 ```

-This runs the pipeline on the 4 provided audio files, but it will pass them in batches of 2
-to the model (which is on a GPU, where batching is more likely to help) without requiring any further code from you. 
-The output should always match what you would have received without batching. It is only meant as a way to help you get more speed out of a pipeline.
+</hfoption>
+</hfoptions>

-Pipelines can also alleviate some of the complexities of batching because, for some pipelines, a single item (like a long audio file) needs to be chunked into multiple parts to be processed by a model. The pipeline performs this [*chunk batching*](./main_classes/pipelines#pipeline-chunk-batching) for you.
+### Batch inference

-### Task specific parameters
-
-All tasks provide task specific parameters which allow for additional flexibility and options to help you get your job done.
-For instance, the [`transformers.AutomaticSpeechRecognitionPipeline.__call__`] method has a `return_timestamps` parameter which sounds promising for subtitling videos:
+[`Pipeline`] can also process batches of inputs with the `batch_size` parameter. Batch inference may improve speed, especially on a GPU, but it isn't guaranteed. Other variables such as hardware, data, and the model itself can affect whether batch inference improves speed. For this reason, batch inference is disabled by default.

+In the example below, when there are 4 inputs and `batch_size` is set to 2, [`Pipeline`] passes a batch of 2 inputs to the model at a time.

 ```py
->>> transcriber = pipeline(model="openai/whisper-large-v2", return_timestamps=True)
->>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
-{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.', 'chunks': [{'timestamp': (0.0, 11.88), 'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its'}, {'timestamp': (11.88, 12.38), 'text': ' creed.'}]}
+from transformers import pipeline
+
+pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="cuda", batch_size=2)
+pipeline(["the secret to baking a really good cake is", "a baguette is", "paris is the", "hotdogs are"])
+[[{'generated_text': 'the secret to baking a really good cake is to use a good cake mix.\n\ni’'}],
+ [{'generated_text': 'a baguette is'}],
+ [{'generated_text': 'paris is the most beautiful city in the world.\n\ni’ve been to paris 3'}],
+ [{'generated_text': 'hotdogs are a staple of the american diet. they are a great source of protein and can'}]]
 ```
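As a rough, library-independent illustration of the grouping described above, the sketch below splits a list of inputs into consecutive batches. The helper `make_batches` is hypothetical and is not part of Transformers; the batching inside [`Pipeline`] involves more than this (padding, for example), so treat it only as a mental model.

```python
def make_batches(inputs, batch_size):
    """Split a list of inputs into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]


prompts = ["the secret to baking a really good cake is", "a baguette is", "paris is the", "hotdogs are"]
batches = make_batches(prompts, batch_size=2)
for batch in batches:
    # Each batch is what would be handed to the model in a single forward pass.
    print(len(batch), batch)
```

With 4 inputs and `batch_size=2`, the model would see 2 batches of 2 inputs each. If the number of inputs is not a multiple of `batch_size`, the last batch is simply smaller.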

-As you can see, the model inferred the text and also outputted **when** the various sentences were pronounced.
+Another good use case for batch inference is for streaming data in [`Pipeline`].

-There are many parameters available for each task, so check out each task's API reference to see what you can tinker with!
-For instance, the [`~transformers.AutomaticSpeechRecognitionPipeline`] has a `chunk_length_s` parameter which is helpful 
-for working on really long audio files (for example, subtitling entire movies or hour-long videos) that a model typically 
-cannot handle on its own:
+```py
+from transformers import pipeline
+from transformers.pipelines.pt_utils import KeyDataset
+import datasets

-```python
->>> transcriber = pipeline(model="openai/whisper-large-v2", chunk_length_s=30)
->>> transcriber("https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav")
-{'text': " So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know. You get started maybe a little slowly, but you get enough done in the first week that with some heavier days later on, everything gets done and things stay civil. And I would want to do that like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen every single paper. But then came my 90-page senior thesis, a paper you're supposed to spend a year on. I knew for a paper like that, my normal workflow was not an option, it was way too big a project. So I planned things out and I decided I kind of had to go something like this. This is how the year would go. So I'd start off light and I'd bump it up"}
+# KeyDataset is a utility that returns the item in the dict returned by the dataset
+dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
+pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
+for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
+    print(out)
 ```

-If you can't find a parameter that would really help you out, feel free to [request it](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)!
+Keep the following general rules of thumb in mind for determining whether batch inference can help improve performance.

+1. The only way to know for sure is to measure performance on your model, data, and hardware.
+2. Don't batch inference if you're constrained by latency (a live inference product for example).
+3. Don't batch inference if you're using a CPU.
+4. Don't batch inference if you don't know the `sequence_length` of your data. Measure performance, iteratively add to `sequence_length`, and include out-of-memory (OOM) checks to recover from failures.
+5. Do batch inference if your `sequence_length` is regular, and keep pushing it until you reach an OOM error. The larger the GPU, the more helpful batch inference is.
+6. Do make sure you can handle OOM errors if you decide to do batch inference.
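
One possible way to act on the OOM advice above is to retry with a smaller batch whenever a batch fails. The sketch below is a minimal, framework-free version of that idea: `run_with_backoff` and `fake_run_batch` are hypothetical names, and the built-in `MemoryError` stands in for the real exception (with PyTorch on a GPU you would typically catch `torch.cuda.OutOfMemoryError` instead).

```python
def run_with_backoff(run_batch, inputs, batch_size, oom_error=MemoryError):
    """Process inputs in batches, halving the batch size whenever a batch runs out of memory."""
    results = []
    start = 0
    while start < len(inputs):
        batch = inputs[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except oom_error:
            if batch_size == 1:
                raise  # a single input does not fit, so there is nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results, batch_size


def fake_run_batch(batch):
    # Stand-in for a model call: pretend that more than 2 inputs at once exhausts memory.
    if len(batch) > 2:
        raise MemoryError("simulated out-of-memory")
    return [text.upper() for text in batch]


outputs, final_batch_size = run_with_backoff(fake_run_batch, ["a", "b", "c", "d", "e"], batch_size=8)
print(outputs, final_batch_size)
```

Starting from a deliberately large batch size and backing off finds a workable value without knowing the memory limit in advance. The returned batch size can be reused for later runs on the same hardware.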

-## Using pipelines on a dataset
+### Task-specific parameters

-The pipeline can also run inference on a large dataset. The easiest way we recommend doing this is by using an iterator:
+[`Pipeline`] accepts any parameters that are supported by each individual task pipeline. Make sure to check out each individual task pipeline to see what type of parameters are available. If you can't find a parameter that is useful for your use case, please feel free to open a GitHub [issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml) to request it!
+
+The examples below demonstrate some of the task-specific parameters available.
+
+<hfoptions id="task-specific-parameters">
+<hfoption id="automatic speech recognition">
+
+Pass the `return_timestamps="word"` parameter to [`Pipeline`] to return when each word was spoken.
+
+```py
+from transformers import pipeline
+
+pipeline = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
+pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", return_timestamps="word")
+{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.',
+ 'chunks': [{'text': ' I', 'timestamp': (0.0, 1.1)},
+  {'text': ' have', 'timestamp': (1.1, 1.44)},
+  {'text': ' a', 'timestamp': (1.44, 1.62)},
+  {'text': ' dream', 'timestamp': (1.62, 1.92)},
+  {'text': ' that', 'timestamp': (1.92, 3.7)},
+  {'text': ' one', 'timestamp': (3.7, 3.88)},
+  {'text': ' day', 'timestamp': (3.88, 4.24)},
+  {'text': ' this', 'timestamp': (4.24, 5.82)},
+  {'text': ' nation', 'timestamp': (5.82, 6.78)},
+  {'text': ' will', 'timestamp': (6.78, 7.36)},
+  {'text': ' rise', 'timestamp': (7.36, 7.88)},
+  {'text': ' up', 'timestamp': (7.88, 8.46)},
+  {'text': ' and', 'timestamp': (8.46, 9.2)},
+  {'text': ' live', 'timestamp': (9.2, 10.34)},
+  {'text': ' out', 'timestamp': (10.34, 10.58)},
+  {'text': ' the', 'timestamp': (10.58, 10.8)},
+  {'text': ' true', 'timestamp': (10.8, 11.04)},
+  {'text': ' meaning', 'timestamp': (11.04, 11.4)},
+  {'text': ' of', 'timestamp': (11.4, 11.64)},
+  {'text': ' its', 'timestamp': (11.64, 11.8)},
+  {'text': ' creed.', 'timestamp': (11.8, 12.3)}]}
+```
+
+</hfoption>
+<hfoption id="text generation">
+
+Pass `return_full_text=False` to [`Pipeline`] to only return the generated text instead of the full text (prompt and generated text).
+
+[`~TextGenerationPipeline.__call__`] also supports additional keyword arguments from the [`~GenerationMixin.generate`] method. To return more than one generated sequence, set `num_return_sequences` to a value greater than 1.
+
+```py
+from transformers import pipeline
+
+pipeline = pipeline(task="text-generation", model="openai-community/gpt2")
+pipeline("the secret to baking a good cake is", num_return_sequences=4, return_full_text=False)
+[{'generated_text': ' how easy it is for me to do it with my hands. You must not go nuts, or the cake is going to fall out.'},
+ {'generated_text': ' to prepare the cake before baking. The key is to find the right type of icing to use and that icing makes an amazing frosting cake.\n\nFor a good icing cake, we give you the basics'},
+ {'generated_text': " to remember to soak it in enough water and don't worry about it sticking to the wall. In the meantime, you could remove the top of the cake and let it dry out with a paper towel.\n"},
+ {'generated_text': ' the best time to turn off the oven and let it stand 30 minutes. After 30 minutes, stir and bake a cake in a pan until fully moist.\n\nRemove the cake from the heat for about 12'}]
+```
+
+</hfoption>
+</hfoptions>
+
+## Chunk batching
+
+There are some instances where you need to process data in chunks.
+
+- for some data types, a single input (for example, a really long audio file) may need to be chunked into multiple parts before it can be processed
+- for some tasks, like zero-shot classification or question answering, a single input may need multiple forward passes which can cause issues with the `batch_size` parameter
+
+The [ChunkPipeline](https://github.com/huggingface/transformers/blob/99e0ab6ed888136ea4877c6d8ab03690a1478363/src/transformers/pipelines/base.py#L1387) class is designed to handle these use cases. Both pipeline classes are used in the same way, but since [ChunkPipeline](https://github.com/huggingface/transformers/blob/99e0ab6ed888136ea4877c6d8ab03690a1478363/src/transformers/pipelines/base.py#L1387) can automatically handle batching, you don't need to worry about the number of forward passes your inputs trigger. Instead, you can optimize `batch_size` independently of the inputs.
+
+The example below shows how it differs from [`Pipeline`].
+
+```py
+# ChunkPipeline
+all_model_outputs = []
+for preprocessed in pipeline.preprocess(inputs):
+    model_outputs = pipeline.forward(preprocessed)
+    all_model_outputs.append(model_outputs)
+outputs = pipeline.postprocess(all_model_outputs)
+
+# Pipeline
+preprocessed = pipeline.preprocess(inputs)
+model_outputs = pipeline.forward(preprocessed)
+outputs = pipeline.postprocess(model_outputs)
+```
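
To make the chunked flow concrete without loading a model, here is a toy sketch that follows the same three steps. The functions `preprocess`, `forward`, and `postprocess` are hypothetical stand-ins that count words in text chunks; they only mirror the shape of the loop above and are not the Transformers implementation.

```python
def preprocess(text, chunk_size=3):
    """Yield fixed-size word chunks from one long input (a stand-in for chunking long audio)."""
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield words[i:i + chunk_size]


def forward(chunk):
    """Stand-in for a model forward pass: count the words in one chunk."""
    return {"num_words": len(chunk)}


def postprocess(all_model_outputs):
    """Merge the per-chunk outputs into a single result for the original input."""
    return {
        "num_chunks": len(all_model_outputs),
        "num_words": sum(output["num_words"] for output in all_model_outputs),
    }


# One input produces several chunks, one forward pass per chunk, and one merged output.
all_model_outputs = [forward(chunk) for chunk in preprocess("one two three four five six seven")]
result = postprocess(all_model_outputs)
print(result)
```

The point of the pattern is that the caller still passes one input and receives one output, even though several forward passes happened in between.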
+
+## Large datasets
+
+For inference with large datasets, you can iterate directly over the dataset itself. This avoids immediately allocating memory for the entire dataset, and you don't need to worry about creating batches yourself. Try [Batch inference](#batch-inference) with the `batch_size` parameter to see if it improves performance.
+
+```py
+from transformers.pipelines.pt_utils import KeyDataset
+from transformers import pipeline
+from datasets import load_dataset
+
+dataset = load_dataset("imdb", name="plain_text", split="unsupervised")
+pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
+for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
+    print(out)
+```
+
+Other ways to run inference on large datasets with [`Pipeline`] include using an iterator or generator.

 ```py
 def data():
    for i in range(1000):
        yield f"My example {i}"

-
-pipe = pipeline(model="openai-community/gpt2", device=0)
+pipeline = pipeline(model="openai-community/gpt2", device=0)
 generated_characters = 0
-for out in pipe(data()):
+for out in pipeline(data()):
    generated_characters += len(out[0]["generated_text"])
 ```
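
The memory benefit of a generator comes from lazy evaluation: items are only produced when something asks for them. The standard-library sketch below shows this with a hypothetical `iter_batches` helper. It is an illustration of the idea, not how [`Pipeline`] consumes iterables internally.

```python
from itertools import islice

produced = []


def data():
    for i in range(1000):
        produced.append(i)  # record which items have actually been generated
        yield f"My example {i}"


def iter_batches(iterable, batch_size):
    """Lazily group any iterable into lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = iter_batches(data(), batch_size=8)
first = next(batches)
# Only the items needed for the first batch have been generated so far.
print(len(first), len(produced))
```

Because nothing beyond the current batch is materialized, the same loop works for inputs that are too large to hold in memory at once.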

-The iterator `data()` yields each result, and the pipeline automatically
-recognizes the input is iterable and will start fetching the data while
-it continues to process it on the GPU (this uses [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) under the hood).
-This is important because you don't have to allocate memory for the whole dataset
-and you can feed the GPU as fast as possible.
+## Large models

-Since batching could speed things up, it may be useful to try tuning the `batch_size` parameter here.
-
-The simplest way to iterate over a dataset is to just load one from 🤗 [Datasets](https://github.com/huggingface/datasets/):
+[Accelerate](https://hf.co/docs/accelerate/index) enables a couple of optimizations for running large models with [`Pipeline`]. Make sure Accelerate is installed first.

 ```py
-# KeyDataset is a util that will just output the item we're interested in.
-from transformers.pipelines.pt_utils import KeyDataset
-from datasets import load_dataset
-
-pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0)
-dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]")
-
-for out in pipe(KeyDataset(dataset, "audio")):
-    print(out)
+!pip install -U accelerate
 ```

+The `device_map="auto"` setting is useful for automatically distributing the model across the fastest devices (GPUs) first before dispatching to other slower devices if available (CPU, hard drive).

-## Using pipelines for a webserver
+[`Pipeline`] supports half-precision weights (`torch.float16`), which can be significantly faster and save memory. The loss in accuracy is negligible for most models, especially for larger ones. If your hardware supports it, you can enable `torch.bfloat16` instead for more range.

-<Tip>
-Creating an inference engine is a complex topic which deserves it's own
-page.
-</Tip>
+> [!TIP]
+> Inputs are internally converted to torch.float16 and it only works for models with a PyTorch backend.

-[Link](./pipeline_webserver)
-
-## Vision pipeline
-
-Using a [`pipeline`] for vision tasks is practically identical.
-
-Specify your task and pass your image to the classifier. The image can be a link, a local path or a base64-encoded image. For example, what species of cat is shown below?
-
-![pipeline-cat-chonk](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg)
+Lastly, [`Pipeline`] also accepts quantized models to reduce memory usage even further. Make sure you have the [bitsandbytes](https://hf.co/docs/bitsandbytes/installation) library installed first, and then add `load_in_8bit=True` to `model_kwargs` in the pipeline.

 ```py
->>> from transformers import pipeline
-
->>> vision_classifier = pipeline(model="google/vit-base-patch16-224")
->>> preds = vision_classifier(
-...     images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
-... )
->>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
->>> preds
-[{'score': 0.4335, 'label': 'lynx, catamount'}, {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}, {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}, {'score': 0.0239, 'label': 'Egyptian cat'}, {'score': 0.0229, 'label': 'tiger cat'}]
-```
-
-## Text pipeline
-
-Using a [`pipeline`] for NLP tasks is practically identical.
-
-```py
->>> from transformers import pipeline
-
->>> # This model is a `zero-shot-classification` model.
->>> # It will classify text, except you are free to choose any label you might imagine
->>> classifier = pipeline(model="facebook/bart-large-mnli")
->>> classifier(
-...     "I have a problem with my iphone that needs to be resolved asap!!",
-...     candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
-... )
-{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]}
-```
-
-## Multimodal pipeline
-
-The [`pipeline`] supports more than one modality. For example, a visual question answering (VQA) task combines text and image. Feel free to use any image link you like and a question you want to ask about the image. The image can be a URL or a local path to the image.
-
-For example, if you use this [invoice image](https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png):
-
-```py
->>> from transformers import pipeline
-
->>> vqa = pipeline(model="impira/layoutlm-document-qa")
->>> output = vqa(
-...     image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png",
-...     question="What is the invoice number?",
-... )
->>> output[0]["score"] = round(output[0]["score"], 3)
->>> output
-[{'score': 0.425, 'answer': 'us-001', 'start': 16, 'end': 16}]
-```
-
-<Tip>
-
-To run the example above you need to have [`pytesseract`](https://pypi.org/project/pytesseract/) installed in addition to 🤗 Transformers:
-
-```bash
-sudo apt install -y tesseract-ocr
-pip install pytesseract
-```
-
-</Tip>
-
-## Using `pipeline` on large models with 🤗 `accelerate`:
-
-You can easily run `pipeline` on large models using 🤗 `accelerate`! First make sure you have installed `accelerate` with `pip install accelerate`. 
-
-First load your model using `device_map="auto"`! We will use `facebook/opt-1.3b` for our example.
-
-```py
-# pip install accelerate
 import torch
-from transformers import pipeline
+from transformers import pipeline, BitsAndBytesConfig

-pipe = pipeline(model="facebook/opt-1.3b", torch_dtype=torch.bfloat16, device_map="auto")
-output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
+pipe = pipeline(model="google/gemma-7b", torch_dtype=torch.bfloat16, device_map="auto", model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_8bit=True)})
+pipe("the secret to baking a good cake is ")
+[{'generated_text': 'the secret to baking a good cake is 1. the right ingredients 2. the right'}]
 ```
-
-You can also pass 8-bit loaded models if you install `bitsandbytes` and add the argument `load_in_8bit=True`
-
-```py
-# pip install accelerate bitsandbytes
-import torch
-from transformers import pipeline
-
-pipe = pipeline(model="facebook/opt-1.3b", device_map="auto", model_kwargs={"load_in_8bit": True})
-output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
-```
-
-Note that you can replace the checkpoint with any Hugging Face model that supports large model loading, such as BLOOM.
-
-## Creating web demos from pipelines with `gradio`
-
-Pipelines are automatically supported in [Gradio](https://github.com/gradio-app/gradio/), a library that makes creating beautiful and user-friendly machine learning apps on the web a breeze. First, make sure you have Gradio installed:
-
-```
-pip install gradio
-```
-
-Then, you can create a web demo around an image classification pipeline (or any other pipeline) in a single line of code by calling Gradio's [`Interface.from_pipeline`](https://www.gradio.app/docs/interface#interface-from-pipeline) function to launch the pipeline. This creates an intuitive drag-and-drop interface in your browser:
-
-```py
-from transformers import pipeline
-import gradio as gr
-
-pipe = pipeline("image-classification", model="google/vit-base-patch16-224")
-
-gr.Interface.from_pipeline(pipe).launch()
-```
-
-
-![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/panda-classification.png)
-
-By default, the web demo runs on a local server. If you'd like to share it with others, you can generate a temporary public
-link by setting `share=True` in `launch()`. You can also host your demo on [Hugging Face Spaces](https://huggingface.co/spaces) for a permanent link.