From 8b1347149437fa4624d59be26f8724ca87f05cfe Mon Sep 17 00:00:00 2001 From: Maria Khalusova Date: Fri, 15 Sep 2023 12:15:07 -0400 Subject: [PATCH] [docs] IDEFICS guide and task guides restructure (#26035) * initial commit for the IDEFICS task guide * conversational example * updated TOC * fixed typos * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * addressed feedback * bad_words_ids * Apply suggestions from code review Co-authored-by: Victor SANH * rank classification note * feedback addressed --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Victor SANH --- docs/source/en/_toctree.yml | 5 + docs/source/en/tasks/idefics.md | 426 ++++++++++++++++++++++++++++++++ 2 files changed, 431 insertions(+) create mode 100644 docs/source/en/tasks/idefics.md diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 69feee1053..5c847c30a0 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -88,6 +88,11 @@ - local: generation_strategies title: Customize the generation strategy title: Generation + - isExpanded: false + sections: + - local: tasks/idefics + title: Image tasks with IDEFICS + title: Prompting title: Task Guides - sections: - local: fast_tokenizers diff --git a/docs/source/en/tasks/idefics.md b/docs/source/en/tasks/idefics.md new file mode 100644 index 0000000000..0e81efca12 --- /dev/null +++ b/docs/source/en/tasks/idefics.md @@ -0,0 +1,426 @@ + + +# Image tasks with IDEFICS + +[[open-in-colab]] + +While individual tasks can be tackled by fine-tuning specialized models, an alternative approach +that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. +For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. +This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can +solve image-text tasks with a large multimodal model called IDEFICS. + +[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198), +a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image +and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, +create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) +and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed +versions of the model adapted for conversational use cases. + +This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, +being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether +this approach suits your use case better than fine-tuning specialized models for each individual task. + +In this guide, you'll learn how to: +- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#loading-the-quantized-version-of-the-model) +- Use IDEFICS for: + - [Image captioning](#image-captioning) + - [Prompted image captioning](#prompted-image-captioning) + - [Few-shot prompting](#few-shot-prompting) + - [Visual question answering](#visual-question-answering) + - [Image classificaiton](#image-classification) + - [Image-guided text generation](#image-guided-text-generation) +- [Run inference in batch mode](#running-inference-in-batch-mode) +- [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use) + +Before you begin, make sure you have all the necessary libraries installed. + +```bash +pip install -q bitsandbytes sentencepiece accelerate transformers +``` + + +To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory. + + +## Loading the model + +Let's start by loading the model's 9 billion parameters checkpoint: + +```py +>>> checkpoint = "HuggingFaceM4/idefics-9b" +``` + +Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. +The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of +preparing text and image inputs for the model. + +```py +>>> import torch + +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor + +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto") +``` + +Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized +manner given existing devices. + +### Quantized model + +If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the +processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed +on the fly while loading. + +```py +>>> import torch +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig + +>>> quantization_config = BitsAndBytesConfig( +... load_in_4bit=True, +... bnb_4bit_compute_dtype=torch.float16, +... ) + +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> model = IdeficsForVisionText2Text.from_pretrained( +... checkpoint, +... quantization_config=quantization_config, +... device_map="auto" +... ) +``` + +Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for. + +## Image captioning + +Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired +people navigate through different situations, for instance, explore image content online. + +To illustrate the task, get an image to be captioned, e.g.: + +
+ Image of a puppy in a flower bed +
+ +Photo by [Hendo Wang](https://unsplash.com/@hendoo). + +IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the +model, only the preprocessed input image. Without a text prompt, the model will start generating text from the +BOS (beginning-of-sequence) token thus creating a caption. + +As image input to the model, you can use either an image object (`PIL.Image`) or a url from which the image can be retrieved. + +```py +>>> prompt = [ +... "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80", +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +A puppy in a flower bed +``` + + + +It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing +the `max_new_tokens`: the model will want to generate a new `` or `` token when there +is no image being generated by the model. +You can set it on-the-fly as in this guide, or store in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide. + + +## Prompted image captioning + +You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take +another image to illustrate: + +
+ Image of the Eiffel Tower at night +
+ +Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai). + +Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs. + +```py +>>> prompt = [ +... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "This is an image of ", +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +This is an image of the Eiffel Tower in Paris, France. +``` + +## Few-shot prompting + +While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with +other restrictions or requirements that increase task's complexity. Few-shot prompting can be used to enable in-context learning. +By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples. + +Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model +that in addition to learning what the object in an image is, we would also like to get some interesting information about it. +Then, let's see, if we can get the same response format for an image of the Statue of Liberty: + +
+ Image of the Statue of Liberty +
+ +Photo by [Juan Mayobre](https://unsplash.com/@jmayobres). + +```py +>>> prompt = ["User:", +... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n", +... "User:", +... "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80", +... "Describe this image.\nAssistant:" +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +User: Describe this image. +Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. +User: Describe this image. +Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall. +``` + +Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, +feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.). + +## Visual question answering + +Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image +captioning it can be used in accessibility applications, but also in education (reasoning about visual materials), customer +service (questions about products based on images), and image retrieval. + +Let's get a new image for this task: + +
+ Image of a couple having a picnic +
+ +Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos). + +You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: + +```py +>>> prompt = [ +... "Instruction: Provide an answer to the question. Use the image to answer.\n", +... "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "Question: Where are these people and what's the weather like? Answer:" +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Provide an answer to the question. Use the image to answer. + Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day. +``` + +## Image classification + +IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing +labeled examples from those specific categories. Given a list of categories and using its image and text understanding +capabilities, the model can infer which category the image likely belongs to. + +Say, we have this image of a vegetable stand: + +
+ Image of a vegetable stand +
+ +Photo by [Peter Wendt](https://unsplash.com/@peterwendt). + +We can instruct the model to classify the image into one of the categories that we have: + +```py +>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office'] +>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n", +... "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "Category: " +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=4, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office']. +Category: Vegetables +``` + +In the example above we instruct the model to classify the image into a single category, however, you can also prompt the model to do rank classification. + +## Image-guided text generation + +For more creative applications, you can use image-guided text generation to generate text based on an image. This can be +useful to create descriptions of products, ads, descriptions of a scene, etc. + +Let's prompt IDEFICS to write a story based on a simple image of a red door: + +
+ Image of a red door with a pumpkin on the steps +
+ +Photo by [Craig Tidball](https://unsplash.com/@devonshiremedia). + +```py +>>> prompt = ["Instruction: Use the image to write a story. \n", +... "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80", +... "Story: \n"] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Use the image to write a story. + Story: +Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world. + +One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat. + +The little girl ran inside and told her mother about the man. + +Her mother said, “Don’t worry, honey. He’s just a friendly ghost.” + +The little girl wasn’t sure if she believed her mother, but she went outside anyway. + +When she got to the door, the man was gone. + +The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep. + +He was wearing a long black coat and a top hat. + +The little girl ran +``` + +Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost. + + + +For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help +you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) +to learn more. + + +## Running inference in batch mode + +All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference +for a batch of examples by passing a list of prompts: + +```py +>>> prompts = [ +... [ "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "This is an image of ", +... ], +... [ "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "This is an image of ", +... ], +... [ "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "This is an image of ", +... ], +... ] + +>>> inputs = processor(prompts, return_tensors="pt") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> for i,t in enumerate(generated_text): +... print(f"{i}:\n{t}\n") +0: +This is an image of the Eiffel Tower in Paris, France. + +1: +This is an image of a couple on a picnic blanket. + +2: +This is an image of a vegetable stand. +``` + +## IDEFICS instruct for conversational use + +For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: +`HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`. + +These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction +fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings. + +The use and prompting for the conversational use is very similar to using the base models: + +```py +>>> import torch +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor + +>>> device = "cuda" if torch.cuda.is_available() else "cpu" + +>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct" +>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device) +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> prompts = [ +... [ +... "User: What is in this image?", +... "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG", +... "", + +... "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.", + +... "\nUser:", +... "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052", +... "And who is that?", + +... "\nAssistant:", +... ], +... ] + +>>> # --batched mode +>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device) +>>> # --single sample mode +>>> # inputs = processor(prompts[0], return_tensors="pt").to(device) + +>>> # Generation args +>>> exit_condition = processor.tokenizer("", add_special_tokens=False).input_ids +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> for i, t in enumerate(generated_text): +... print(f"{i}:\n{t}\n") +```