From 28f3d431d4b8b74a458a5583297d5101483edb74 Mon Sep 17 00:00:00 2001 From: Nicolas Patry Date: Tue, 6 Dec 2022 10:47:31 +0100 Subject: [PATCH] Rework the pipeline tutorial (#20437) * [WIP] Rework the pipeline tutorial - Switch to `asr` instead of another NLP task. - It also has simpler to understand results. - Added a section with interaction with `datasets`. - Added a section with writing a simple webserver. * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Addressing comments. * Links. * Fixing docs format. * Adding pipeline_webserver to _toctree. * Warnig -> Tip warnings={true}. * Fix link ? * Links ? * Fixing link, adding chunk batching. * Oops. * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/pipeline_tutorial.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/_toctree.yml | 2 + docs/source/en/pipeline_tutorial.mdx | 240 ++++++++++++++++++-------- docs/source/en/pipeline_webserver.mdx | 161 +++++++++++++++++ 3 files changed, 334 insertions(+), 69 deletions(-) create mode 100644 docs/source/en/pipeline_webserver.mdx diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 1669a3f0c0..50ed4fc106 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -146,6 +146,8 @@ title: BERTology - local: perplexity title: Perplexity of fixed-length models + - local: pipeline_webserver + title: Pipelines for webserver inference title: Conceptual guides - sections: - sections: diff --git a/docs/source/en/pipeline_tutorial.mdx b/docs/source/en/pipeline_tutorial.mdx index 4f21c8e4a2..40ed561add 100644 --- a/docs/source/en/pipeline_tutorial.mdx +++ 
b/docs/source/en/pipeline_tutorial.mdx @@ -33,100 +33,189 @@ While each task has an associated [`pipeline`], it is simpler to use the general ```py >>> from transformers import pipeline ->>> generator = pipeline(task="text-generation") +>>> generator = pipeline(task="automatic-speech-recognition") ``` 2. Pass your input text to the [`pipeline`]: ```py ->>> generator( -... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone" -... ) # doctest: +SKIP -[{'generated_text': 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Seven for the Iron-priests at the door to the east, and thirteen for the Lord Kings at the end of the mountain'}] +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'} ``` -If you have more than one input, pass your input as a list: +Not the result you had in mind? Check out some of the [most downloaded automatic speech recognition models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) on the Hub to see if you can get a better transcription. +Let's try [openai/whisper-large](https://huggingface.co/openai/whisper-large): ```py ->>> generator( -... [ -... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone", -... "Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne", -... ] -... ) # doctest: +SKIP +>>> generator = pipeline(model="openai/whisper-large") +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} ``` -Any additional parameters for your task can also be included in the [`pipeline`]. 
The `text-generation` task has a [`~generation.GenerationMixin.generate`] method with several parameters for controlling the output. For example, if you want to generate more than one output, set the `num_return_sequences` parameter: +Now this result looks more accurate! +We really encourage you to check out the Hub for models in different languages, models specialized in your field, and more. +You can check out and compare model results directly from your browser on the Hub to see if it fits or +handles corner cases better than other ones. +And if you don't find a model for your use case, you can always start [training](training) your own! + +If you have several inputs, you can pass your input as a list: ```py ->>> generator( -... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone", -... num_return_sequences=2, -... ) # doctest: +SKIP +generator( + [ + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", + ] +) ``` -### Choose a model and tokenizer +If you want to iterate over a whole dataset, or want to use it for inference in a webserver, check out dedicated parts -The [`pipeline`] accepts any model from the [Hub](https://huggingface.co/models). There are tags on the Hub that allow you to filter for a model you'd like to use for your task. Once you've picked an appropriate model, load it with the corresponding `AutoModelFor` and [`AutoTokenizer`] class. For example, load the [`AutoModelForCausalLM`] class for a causal language modeling task: +[Using pipelines on a dataset](#using-pipelines-on-a-dataset) + +[Using pipelines for a webserver](./pipeline_webserver) + +## Parameters + +[`pipeline`] supports many parameters; some are task specific, and some are general to all pipelines. 
+In general, you can specify parameters either when creating the pipeline or when calling it: ```py ->>> from transformers import AutoTokenizer, AutoModelForCausalLM - ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +generator = pipeline(model="openai/whisper-large", my_parameter=1) +out = generator(...) # This will use `my_parameter=1`. +out = generator(..., my_parameter=2) # This will override and use `my_parameter=2`. +out = generator(...) # This will go back to using `my_parameter=1`. ``` -Create a [`pipeline`] for your task, and specify the model and tokenizer you've loaded: +Let's check out 3 important ones: + +### Device + +If you use `device=n`, the pipeline automatically puts the model on the specified device. +This will work regardless of whether you are using PyTorch or TensorFlow. ```py ->>> from transformers import pipeline - ->>> generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer) +generator = pipeline(model="openai/whisper-large", device=0) ``` -Pass your input text to the [`pipeline`] to generate some text: +If the model is too large for a single GPU, you can set `device_map="auto"` to allow 🤗 [Accelerate](https://huggingface.co/docs/accelerate) to automatically determine how to load and store the model weights. ```py ->>> generator( -... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone" -... ) # doctest: +SKIP -[{'generated_text': 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Seven for the Dragon-lords (for them to rule in a world ruled by their rulers, and all who live within the realm'}] +#!pip install accelerate +generator = pipeline(model="openai/whisper-large", device_map="auto") ``` -## Audio pipeline +### Batch size -The [`pipeline`] also supports audio tasks like audio classification and automatic speech recognition. 
+By default, pipelines will not batch inference for reasons explained in detail [here](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching). The reason is that batching is not necessarily faster, and can actually be much slower in some cases. -For example, let's classify the emotion in this audio clip: +But if it works in your use case, you can use: ```py ->>> from datasets import load_dataset ->>> import torch - ->>> torch.manual_seed(42) # doctest: +IGNORE_RESULT ->>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") ->>> audio_file = ds[0]["audio"]["path"] +generator = pipeline(model="openai/whisper-large", device=0, batch_size=2) +audio_filenames = [f"audio_{i}.flac" for i in range(10)] +texts = generator(audio_filenames) ``` -Find an [audio classification](https://huggingface.co/models?pipeline_tag=audio-classification) model on the Model Hub for emotion recognition and load it in the [`pipeline`]: +This runs the pipeline on the 10 provided audio files, but it will pass them in batches of 2 +to the model (which is on a GPU, where batching is more likely to help) without requiring any further code from you. +The output should always match what you would have received without batching. It is only meant as a way to help you get more speed out of a pipeline. + +Pipelines can also alleviate some of the complexities of batching because, for some pipelines, a single item (like a long audio file) needs to be chunked into multiple parts to be processed by a model. The pipeline performs this [*chunk batching*](./main_classes/pipelines#pipeline-chunk-batching) for you. + +### Task specific parameters + +All tasks provide task specific parameters which allow for additional flexibility and options to help you get your job done. 
+For instance, the [`transformers.AutomaticSpeechRecognitionPipeline.__call__`] method has a `return_timestamps` parameter which sounds promising for subtitling videos: + ```py ->>> from transformers import pipeline - ->>> audio_classifier = pipeline( -... task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" -... ) +>>> # Not using whisper, as it cannot provide timestamps. +>>> generator = pipeline(model="facebook/wav2vec2-large-960h-lv60-self", return_timestamps="word") +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP AND LIVE OUT THE TRUE MEANING OF ITS CREED', + 'chunks': [ + {'text': 'I', 'timestamp': (1.22, 1.24)}, + {'text': 'HAVE', 'timestamp': (1.42, 1.58)}, + {'text': 'A', 'timestamp': (1.66, 1.68)}, + {'text': 'DREAM', 'timestamp': (1.76, 2.14)}, + {'text': 'BUT', 'timestamp': (3.68, 3.8)}, + {'text': 'ONE', 'timestamp': (3.94, 4.06)}, + {'text': 'DAY', 'timestamp': (4.16, 4.3)}, + {'text': 'THIS', 'timestamp': (6.36, 6.54)}, + {'text': 'NATION', 'timestamp': (6.68, 7.1)}, + {'text': 'WILL', 'timestamp': (7.32, 7.56)}, + {'text': 'RISE', 'timestamp': (7.8, 8.26)}, + {'text': 'UP', 'timestamp': (8.38, 8.48)}, + {'text': 'AND', 'timestamp': (10.08, 10.18)}, + {'text': 'LIVE', 'timestamp': (10.26, 10.48)}, + {'text': 'OUT', 'timestamp': (10.58, 10.7)}, + {'text': 'THE', 'timestamp': (10.82, 10.9)}, + {'text': 'TRUE', 'timestamp': (10.98, 11.18)}, + {'text': 'MEANING', 'timestamp': (11.26, 11.58)}, + {'text': 'OF', 'timestamp': (11.66, 11.7)}, + {'text': 'ITS', 'timestamp': (11.76, 11.88)}, + {'text': 'CREED', 'timestamp': (12.0, 12.38)} +]} ``` -Pass the audio file to the [`pipeline`]: +As you can see, the model inferred the text and also outputted **when** the various words were pronounced +in the sentence. 
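To make the subtitling idea concrete, here is a minimal sketch of turning word-level `chunks` like the ones above into SubRip (SRT) cues. The helper names `to_srt_time` and `chunks_to_srt`, and the choice of grouping a fixed number of words per cue, are illustrative assumptions; they are not part of the transformers API.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks, words_per_cue=4):
    """Group word chunks into numbered cues of `words_per_cue` words each."""
    cues = []
    for index, start in enumerate(range(0, len(chunks), words_per_cue), start=1):
        group = chunks[start : start + words_per_cue]
        begin = group[0]["timestamp"][0]  # start time of the first word
        end = group[-1]["timestamp"][1]  # end time of the last word
        text = " ".join(chunk["text"] for chunk in group)
        cues.append(f"{index}\n{to_srt_time(begin)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# The first four word chunks from the output shown above.
example_chunks = [
    {"text": "I", "timestamp": (1.22, 1.24)},
    {"text": "HAVE", "timestamp": (1.42, 1.58)},
    {"text": "A", "timestamp": (1.66, 1.68)},
    {"text": "DREAM", "timestamp": (1.76, 2.14)},
]
print(chunks_to_srt(example_chunks))
```

A fixed word count per cue is the simplest grouping; splitting on long pauses between timestamps would likely read better in a real subtitle file.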
+ +There are many parameters available for each task, so check out each task's API reference to see what you can tinker with! +For instance, the [`~transformers.AutomaticSpeechRecognitionPipeline`] has a `chunk_length_s` parameter which is helpful for working on really long audio files (for example, subtitling entire movies or hour-long videos) that a model typically cannot handle on its own. + + +If you can't find a parameter that would really help you out, feel free to [request it](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)! + + +## Using pipelines on a dataset + +The pipeline can also run inference on a large dataset. The easiest way to do this is with an iterator: ```py ->>> preds = audio_classifier(audio_file) ->>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] ->>> preds -[{'score': 0.1315, 'label': 'calm'}, {'score': 0.1307, 'label': 'neutral'}, {'score': 0.1274, 'label': 'sad'}, {'score': 0.1261, 'label': 'fearful'}, {'score': 0.1242, 'label': 'happy'}] +def data(): + for i in range(1000): + yield f"My example {i}" + + +pipe = pipeline(model="gpt2", device=0) +generated_characters = 0 +for out in pipe(data()): + generated_characters += len(out[0]["generated_text"]) ``` +The iterator `data()` yields each input, and the pipeline automatically +recognizes the input is iterable and will start fetching the data while +it continues to process it on the GPU (this uses [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) under the hood). +This is important because you don't have to allocate memory for the whole dataset +and you can feed the GPU as fast as possible. + +Since batching could speed things up, it may be useful to try tuning the `batch_size` parameter here. 
+ +The simplest way to iterate over a dataset is to just load one from 🤗 [Datasets](https://github.com/huggingface/datasets/): + +```py +# KeyDataset is a util that will just output the item we're interested in. +from transformers.pipelines.pt_utils import KeyDataset + +pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0) +dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]") + +for out in pipe(KeyDataset(dataset, "audio")): + print(out) +``` + + +## Using pipelines for a webserver + + +Creating an inference engine is a complex topic which deserves its own +page. + + +[Link](./pipeline_webserver) + ## Vision pipeline Using a [`pipeline`] for vision tasks is practically identical. @@ -138,7 +227,7 @@ Specify your task and pass your image to the classifier. The image can be a link ```py >>> from transformers import pipeline ->>> vision_classifier = pipeline(task="image-classification") +>>> vision_classifier = pipeline(model="google/vit-base-patch16-224") >>> preds = vision_classifier( ... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" ... ) @@ -147,25 +236,38 @@ Specify your task and pass your image to the classifier. The image can be a link [{'score': 0.4335, 'label': 'lynx, catamount'}, {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}, {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}, {'score': 0.0239, 'label': 'Egyptian cat'}, {'score': 0.0229, 'label': 'tiger cat'}] ``` -## Multimodal pipeline +### Text pipeline -The [`pipeline`] supports more than one modality. For example, a visual question answering (VQA) task combines text and image. Feel free to use any image link you like and a question you want to ask about the image. The image can be a URL or a local path to the image. 
- -For example, if you use the same image from the vision pipeline above: - -```py ->>> image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" ->>> question = "Where is the cat?" -``` - -Create a pipeline for `vqa` and pass it the image and question: +Using a [`pipeline`] for NLP tasks is practically identical. ```py >>> from transformers import pipeline ->>> vqa = pipeline(task="vqa") ->>> preds = vqa(image=image, question=question) ->>> preds = [{"score": round(pred["score"], 4), "answer": pred["answer"]} for pred in preds] ->>> preds -[{'score': 0.911, 'answer': 'snow'}, {'score': 0.8786, 'answer': 'in snow'}, {'score': 0.6714, 'answer': 'outside'}, {'score': 0.0293, 'answer': 'on ground'}, {'score': 0.0272, 'answer': 'ground'}] +>>> # This model is a `zero-shot-classification` model. +>>> # It will classify text, except you are free to choose any label you might imagine +>>> classifier = pipeline(model="facebook/bart-large-mnli") +>>> classifier( +... "I have a problem with my iphone that needs to be resolved asap!!", +... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"], +... ) +{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', + 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], + 'scores': [0.504,0.479,0.013,0.003,0.002]} +``` + +### Multimodal pipeline + +The [`pipeline`] supports more than one modality. For example, a visual question answering (VQA) task combines text and image. Feel free to use any image link you like and a question you want to ask about the image. The image can be a URL or a local path to the image. + +For example, if you use this [invoice image](https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png): + +```py +>>> from transformers import pipeline + +>>> vqa = pipeline(model="impira/layoutlm-document-qa") +>>> vqa( +... 
image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png", +... question="What is the invoice number?", +... ) +[{'score': 0.635722279548645, 'answer': '1110212019', 'start': 22, 'end': 22}] ``` diff --git a/docs/source/en/pipeline_webserver.mdx b/docs/source/en/pipeline_webserver.mdx new file mode 100644 index 0000000000..d9f12fa2b3 --- /dev/null +++ b/docs/source/en/pipeline_webserver.mdx @@ -0,0 +1,161 @@ +# Using pipelines for a webserver + + +Creating an inference engine is a complex topic, and the "best" solution +will most likely depend on your problem space. Are you on CPU or GPU? Do +you want the lowest latency, the highest throughput, support for +many models, or just highly optimize 1 specific model? +There are many ways to tackle this topic, so what we are going to present is a good default +to get started which may not necessarily be the optimal solution for you. + + + +The key thing to understand is that we can use an iterator, just like you would [on a +dataset](pipeline_tutorial#using-pipelines-on-a-dataset), since a webserver is basically a system that waits for requests and +handles them as they come in. + +Usually webservers are multiplexed (multithreaded, async, etc.) to handle various +requests concurrently. Pipelines on the other hand (and mostly the underlying models) +are not really great for parallelism: they take up a lot of RAM and inference is a compute-intensive job, so it's best to give them all the available resources while they are running. + +We are going to solve that by having the webserver handle the light load of receiving +and sending requests, and having a single thread handling the actual work. +This example is going to use `starlette`. The actual framework is not really +important, but you might have to tune or change the code if you are using another +one to achieve the same effect. 
+ +Create `server.py`: + +```py +from starlette.applications import Starlette +from starlette.responses import JSONResponse +from starlette.routing import Route +from transformers import pipeline +import asyncio + + +async def homepage(request): + payload = await request.body() + string = payload.decode("utf-8") + response_q = asyncio.Queue() + await request.app.model_queue.put((string, response_q)) + output = await response_q.get() + return JSONResponse(output) + + +async def server_loop(q): + pipe = pipeline(model="bert-base-uncased") + while True: + (string, response_q) = await q.get() + out = pipe(string) + await response_q.put(out) + + +app = Starlette( + routes=[ + Route("/", homepage, methods=["POST"]), + ], +) + + +@app.on_event("startup") +async def startup_event(): + q = asyncio.Queue() + app.model_queue = q + asyncio.create_task(server_loop(q)) +``` + +Now you can start it with: +```bash +uvicorn server:app +``` + +And you can query it: +```bash +curl -X POST -d "test [MASK]" http://localhost:8000/ +#[{"score":0.7742936015129089,"token":1012,"token_str":".","sequence":"test."},...] +``` + +And there you go, now you have a good idea of how to create a webserver! + +What is really important is that we load the model only **once**, so there are no copies +of the model on the webserver. This way, no unnecessary RAM is being used. +Then the queuing mechanism allows you to do fancy stuff like maybe accumulating a few +items before inferring to use dynamic batching: + +```py +(string, rq) = await q.get() +strings = [string] +queues = [rq] +while True: + try: + (string, rq) = await asyncio.wait_for(q.get(), timeout=0.001) # 1ms + except asyncio.exceptions.TimeoutError: + break + strings.append(string) + queues.append(rq) +# Run everything accumulated so far as a single batch. +outs = pipe(strings, batch_size=len(strings)) +for (rq, out) in zip(queues, outs): + await rq.put(out) +``` + + +Do not activate this without checking it makes sense for your load! 
+ + +The proposed code is optimized for readability, not for being the best code. +First of all, there's no batch size limit which is usually not a +great idea. Next, the timeout is reset on every queue fetch, meaning you could +wait much more than 1ms before running the inference (delaying the first request +by that much). + +It would be better to have a single 1ms deadline. + +This will always wait for 1ms even if the queue is empty, which might not be the +best since you probably want to start doing inference if there's nothing in the queue. +But maybe it does make sense if batching is really crucial for your use case. +Again, there's really no one best solution. + + +## A few things you might want to consider + +### Error checking + +There's a lot that can go wrong in production: out of memory, out of space, +loading the model might fail, the query might be wrong, the query might be +correct but still fail to run because of a model misconfiguration, and so on. + +Generally, it's good if the server outputs the errors to the user, so +adding a lot of `try..except` statements to show those errors is a good +idea. But keep in mind it may also be a security risk to reveal all those errors depending +on your security context. + +### Circuit breaking + +Webservers usually behave better when they do circuit breaking. It means they +return proper errors when they're overloaded instead of just waiting for the query indefinitely. Return a 503 error right away instead of waiting for a very long time, or returning a 504 after a long time. + +This is relatively easy to implement in the proposed code since there is a single queue. +Looking at the queue size is a basic way to start returning errors before your +webserver fails under load. + +### Blocking the main thread + +Currently PyTorch is not async aware, and computation will block the main +thread while running. That means it would be better if PyTorch was forced to run +on its own thread/process. 
This wasn't done here because the code is a lot more +complex (mostly because threads and async and queues don't play nice together). +But ultimately it does the same thing. + +This would be important if the inference of single items were long (> 1s) because +in this case, it means every query during inference would have to wait for 1s before +even receiving an error. + +### Dynamic batching + +In general, batching is not necessarily an improvement over passing 1 item at +a time (see [batching details](./main_classes/pipelines#pipeline-batching) for more information). But it can be very effective +when used in the correct setting. In the API, there is no dynamic +batching by default (too much opportunity for a slowdown). But for BLOOM inference - +which is a very large model - dynamic batching is **essential** to provide a decent experience for everyone.
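
The two caveats raised about the dynamic batching snippet (no batch size limit, and a timeout that resets on every fetch) can be addressed with a single deadline. The following is only a sketch under stated assumptions: `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names introduced here, and the queue holds plain strings instead of `(string, response_queue)` pairs so the example runs without a model.

```python
import asyncio


async def collect_batch(q, max_batch_size=8, max_wait_s=0.001):
    """Wait for one item, then gather more until one shared deadline or the size cap."""
    batch = [await q.get()]  # block until at least one request arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s  # computed once, never reset
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(q.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def demo():
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(f"request {i}")
    # A generous 50ms window keeps the demo deterministic on slow machines.
    first = await collect_batch(q, max_batch_size=3, max_wait_s=0.05)
    second = await collect_batch(q, max_batch_size=3, max_wait_s=0.05)
    return first, second


first, second = asyncio.run(demo())
print(first)   # capped by max_batch_size
print(second)  # remaining items, returned once the deadline passes
```

In `server_loop`, the returned batch would be passed to `pipe(...)` in one call, with the matching response queues tracked alongside each string as in the snippet from the patch. Whether the extra wait pays off still depends on your load.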