AutoImageProcessor (#20111)

* AutoImageProcessor skeleton * Update references * Add mapping in init * Add model image processors to __init__ for importing * Add AutoImageProcessor tests * Fix up * Image Processor documentation * Remove pdb * Update docs/source/en/model_doc/mobilevit.mdx * Update docs * Don't add whitespace on json files * Remove fixtures * Move checking model config down * Fix up * Add check for image processor * Remove FeatureExtractorMixin in docstrings * Rename model_tmpfile to config_tmpfile * Don't make None if not in image processor map
2022-11-08 19:54:41 +00:00
parent c08a1e26ab
commit 4eb918e656
51 changed files with 1371 additions and 123 deletions
--- a/docs/source/en/model_doc/flava.mdx
+++ b/docs/source/en/model_doc/flava.mdx
@@ -16,17 +16,17 @@ specific language governing permissions and limitations under the License.

 The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.

-The paper aims at creating a single unified foundation model which can work across vision, language 
+The paper aims at creating a single unified foundation model which can work across vision, language
 as well as vision-and-language multimodal tasks.

 The abstract from the paper is the following:

-*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety 
-of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal 
-(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising 
-direction would be to use a single holistic universal model, as a "foundation", that targets all modalities 
-at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and 
-cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate 
+*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
+of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
+(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
+direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
+at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
+cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
 impressive performance on a wide range of 35 tasks spanning these target modalities.*


@@ -61,6 +61,11 @@ This model was contributed by [aps](https://huggingface.co/aps). The original co

 [[autodoc]] FlavaFeatureExtractor

+## FlavaImageProcessor
+
+[[autodoc]] FlavaImageProcessor
+    - preprocess
+
 ## FlavaForPreTraining

 [[autodoc]] FlavaForPreTraining