[Large PR] Entire rework of pipelines. (#13308)
* Enabling dataset iteration on pipelines. Enabling dataset iteration on pipelines. Unifying parameters under `set_parameters` function. Small fix. Last fixes after rebase Remove print. Fixing text2text `generate_kwargs` No more `self.max_length`. Fixing tf only conversational. Consistency in start/stop index over TF/PT. Speeding up drastically on TF (nasty bug where max_length would increase a ton.) Adding test for support for non fast tokenizers. Fixign GPU usage on zero-shot. Fix working on Tf. Update src/transformers/pipelines/base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Update src/transformers/pipelines/base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Small cleanup. Remove all asserts + simple format. * Fixing audio-classification for large PR. * Overly explicity null checking. * Encapsulating GPU/CPU pytorch manipulation directly within `base.py`. * Removed internal state for parameters of the pipeline. Instead of overriding implicitly internal state, we moved to real named arguments on every `preprocess`, `_forward`, `postprocess` function. Instead `_sanitize_parameters` will be used to split all kwargs of both __init__ and __call__ into the 3 kinds of named parameters. * Move import warnings. * Small fixes. * Quality. * Another small fix, using the CI to debug faster. * Last fixes. * Last fix. * Small cleanup of tensor moving. * is not None. * Adding a bunch of docs + a iteration test. * Fixing doc style. * KeyDataset = None guard. * RRemoving the Cuda test for pipelines (was testing). * Even more simple iteration test. * Correct import . * Long day. * Fixes in docs. * [WIP] migrating object detection. * Fixed the target_size bug. * Fixup. * Bad variable name. * Fixing `ensure_on_device` respects original ModelOutput.
This commit is contained in:
143
docs/source/add_new_pipeline.rst
Normal file
143
docs/source/add_new_pipeline.rst
Normal file
@@ -0,0 +1,143 @@
|
||||
..
|
||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
|
||||
How to add a pipeline to 🤗 Transformers?
|
||||
=======================================================================================================================
|
||||
|
||||
First and foremost, you need to decide the raw entries the pipeline will be able to take. It can be strings, raw bytes,
|
||||
dictionnaries or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as
|
||||
possible as it makes compatibility easier (even through other languages via JSON). Those will be the :obj:`inputs` of
|
||||
the pipeline (:obj:`preprocess`).
|
||||
|
||||
Then define the :obj:`outputs`. Same policy as the :obj:`inputs`. The simpler, the better. Those will be the outputs of
|
||||
:obj:`postprocess` method.
|
||||
|
||||
Start by inheriting the base class :obj:`Pipeline`. with the 4 methods needed to implement :obj:`preprocess`,
|
||||
:obj:`_forward`, :obj:`postprocess` and :obj:`_sanitize_parameters`.
|
||||
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import Pipeline
|
||||
|
||||
class MyPipeline(Pipeline):
|
||||
def _sanitize_parameters(self, **kwargs)
|
||||
preprocess_kwargs = {}
|
||||
if "maybe_arg" in kwargs:
|
||||
preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
|
||||
return preprocess_kwargs, {}, {}
|
||||
|
||||
def preprocess(self, inputs, maybe_arg=2)
|
||||
model_input = Tensor(....)
|
||||
return {"model_input": model_input}
|
||||
|
||||
def _forward(self, model_inputs)
|
||||
# model_inputs == {"model_input": model_input}
|
||||
oututs = self.model(**model_inputs)
|
||||
# Maybe {"logits": Tensor(...)}
|
||||
return outputs
|
||||
|
||||
def postprocess(self, model_outputs)
|
||||
best_class = model_outputs["logits"].softmax(-1)
|
||||
return best_class
|
||||
|
||||
|
||||
The structure of this breakdown is to support relatively seemless support for CPU/GPU, while supporting doing
|
||||
pre/postprocessing on the CPU on different threads
|
||||
|
||||
:obj:`preprocess` will take the original defined inputs, and turn them something feedable to the model. It might
|
||||
contain more information and is usally a :obj:`Dict`.
|
||||
|
||||
:obj:`_forward` is the implementation detail and is not meant to be called directly :obj:`forward` is the preferred
|
||||
called method as it contains safeguards to make sure everything is working on the expected device. If anything is
|
||||
linked to a real model it belongs in the :obj:`_forward` method, anything else is in the preprocess/postrocess.
|
||||
|
||||
:obj:`postprocess` methods will take the output of :obj:`_forward` and turn it into the final output that were decided
|
||||
earlier.
|
||||
|
||||
:obj:`_sanitize_parameters` exists to allow users to pass any parameters whenever they wish, be it at initialization
|
||||
time ``pipeline(...., maybe_arg=4)`` or at call time ``pipe = pipeline(...); output = pipe(...., maybe_arg=4)``.
|
||||
|
||||
The returns of :obj:`_sanitize_parameters` are the 3 dicts of kwargs that will be passed directly to :obj:`preprocess`,
|
||||
:obj:`_forward` and :obj:`postprocess`. Don't fill anything if the caller didn't call with any extra parameter. That
|
||||
allows to keep the default arguments in the function definition which is always more "natural".
|
||||
|
||||
A classic example would be a :obj:`top_k` argument in the post processing in classification tasks.
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> pipe = pipeline("my-new-task")
|
||||
>>> pipe("This is a test")
|
||||
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05}
|
||||
{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]
|
||||
|
||||
>>> pipe("This is a test", top_k=2)
|
||||
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]
|
||||
|
||||
In order to achieve that, we'll update our :obj:`postprocess` method with a default parameter to :obj:`5`. and edit
|
||||
:obj:`_sanitize_parameters` to allow this new parameter.
|
||||
|
||||
|
||||
.. code-block::
|
||||
|
||||
|
||||
def postprocess(self, model_outputs, top_k=5)
|
||||
best_class = model_outputs["logits"].softmax(-1)
|
||||
# Add logic to handle top_k
|
||||
return best_class
|
||||
|
||||
def _sanitize_parameters(self, **kwargs)
|
||||
preprocess_kwargs = {}
|
||||
if "maybe_arg" in kwargs:
|
||||
preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
|
||||
|
||||
postprocess_kwargs = {}
|
||||
if "top_k" in kwargs:
|
||||
preprocess_kwargs["top_k"] = kwargs["top_k"]
|
||||
return preprocess_kwargs, {}, postprocess_kwargs
|
||||
|
||||
Try to keep the inputs/outputs very simple and ideally JSON-serializable as it makes the pipeline usage very easy
|
||||
without requiring users to understand new kind of objects. It's also relatively common to support many different types
|
||||
of arguments for ease of use (audio files, can be filenames, URLs or pure bytes)
|
||||
|
||||
|
||||
|
||||
Adding it to the list of supported tasks
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Go to ``src/transformers/pipelines/__init__.py`` and fill in :obj:`SUPPORTED_TASKS` with your newly created pipeline.
|
||||
If possible it should provide a default model.
|
||||
|
||||
Adding tests
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Create a new file ``tests/test_pipelines_MY_PIPELINE.py`` with example with the other tests.
|
||||
|
||||
The :obj:`run_pipeline_test` function will be very generic and run on small random models on every possible
|
||||
architecture as defined by :obj:`model_mapping` and :obj:`tf_model_mapping`.
|
||||
|
||||
This is very important to test future compatibilty, meaning if someone adds a new model for
|
||||
:obj:`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's
|
||||
impossible to check for actual values, that's why There is a helper :obj:`ANY` that will simply attempt to match the
|
||||
output of the pipeline TYPE.
|
||||
|
||||
You also *need* to implement 2 (ideally 4) tests.
|
||||
|
||||
- :obj:`test_small_model_pt` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)
|
||||
and test the pipeline outputs. The results should be the same as :obj:`test_small_model_tf`.
|
||||
- :obj:`test_small_model_tf` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)
|
||||
and test the pipeline outputs. The results should be the same as :obj:`test_small_model_pt`.
|
||||
- :obj:`test_large_model_pt` (:obj:`optional`): Tests the pipeline on a real pipeline where the results are supposed to
|
||||
make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
|
||||
sure there is no drift in future releases
|
||||
- :obj:`test_large_model_tf` (:obj:`optional`): Tests the pipeline on a real pipeline where the results are supposed to
|
||||
make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
|
||||
sure there is no drift in future releases
|
||||
@@ -506,6 +506,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
migration
|
||||
contributing
|
||||
add_new_model
|
||||
add_new_pipeline
|
||||
fast_tokenizers
|
||||
performance
|
||||
parallelism
|
||||
|
||||
@@ -46,12 +46,53 @@ The pipeline abstraction
|
||||
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
|
||||
pipeline but requires an additional argument which is the `task`.
|
||||
|
||||
Simple call on one item:
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> pipe = pipeline("text-classification")
|
||||
>>> pipe("This restaurant is awesome")
|
||||
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]
|
||||
|
||||
To call a pipeline on many items, you can either call with a `list`.
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> pipe = pipeline("text-classification")
|
||||
>>> pipe(["This restaurant is awesome", "This restaurant is aweful"])
|
||||
[{'label': 'POSITIVE', 'score': 0.9998743534088135},
|
||||
{'label': 'NEGATIVE', 'score': 0.9996669292449951}]
|
||||
|
||||
|
||||
To iterate of full datasets it is recommended to use a :obj:`dataset` directly. This means you don't need to allocate
|
||||
the whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on
|
||||
GPU. If it doesn't don't hesitate to create an issue.
|
||||
|
||||
.. code-block::
|
||||
|
||||
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
|
||||
dataset = datasets.load_dataset("superb", name="asr", split="test")
|
||||
|
||||
# KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item
|
||||
# as we're not interested in the `target` part of the dataset.
|
||||
for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
|
||||
print(out)
|
||||
# {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
|
||||
# {"text": ....}
|
||||
# ....
|
||||
|
||||
|
||||
.. autofunction:: transformers.pipeline
|
||||
|
||||
Implementing a pipeline
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
:doc:`Implementing a new pipeline <../add_new_pipeline>`
|
||||
|
||||
The task specific pipelines
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
||||
AudioClassificationPipeline
|
||||
=======================================================================================================================
|
||||
|
||||
|
||||
@@ -67,8 +67,8 @@ make them readable. For instance:
|
||||
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
|
||||
[{'label': 'POSITIVE', 'score': 0.9998}]
|
||||
|
||||
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a
|
||||
`batch`, returning a list of dictionaries like this one:
|
||||
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning
|
||||
a list of dictionaries like this one:
|
||||
|
||||
.. code-block::
|
||||
|
||||
@@ -79,6 +79,8 @@ That's encouraging! You can use it on a list of sentences, which will be preproc
|
||||
label: POSITIVE, with score: 0.9998
|
||||
label: NEGATIVE, with score: 0.5309
|
||||
|
||||
To use with a large dataset, look at :doc:`iterating over a pipeline <./main_classes/pipelines>`
|
||||
|
||||
You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
|
||||
fairly neutral.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user