AutoImageProcessor (#20111)
* AutoImageProcessor skeleton
* Update references
* Add mapping in init
* Add model image processors to __init__ for importing
* Add AutoImageProcessor tests
* Fix up
* Image Processor documentation
* Remove pdb
* Update docs/source/en/model_doc/mobilevit.mdx
* Update docs
* Don't add whitespace on json files
* Remove fixtures
* Move checking model config down
* Fix up
* Add check for image processor
* Remove FeatureExtractorMixin in docstrings
* Rename model_tmpfile to config_tmpfile
* Don't make None if not in image processor map
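
Several of the items above ("Add mapping in init", "Move checking model config down", "Don't make None if not in image processor map") describe a name-based lookup. The following is only a rough sketch of that kind of lookup, not the code from this commit: the registry `IMAGE_PROCESSOR_NAMES`, the helper `resolve_image_processor_name`, and the exact fallback order are assumptions made for illustration.

```python
from collections import OrderedDict

# Hypothetical, trimmed-down registry: model type -> image processor class name.
# The real library keeps its own (much larger) mapping; this one is illustrative.
IMAGE_PROCESSOR_NAMES = OrderedDict(
    [
        ("flava", "FlavaImageProcessor"),
        ("mobilevit", "MobileViTImageProcessor"),
    ]
)


def resolve_image_processor_name(preprocessor_config, model_type=None):
    """Pick an image processor class name for a checkpoint (illustrative helper).

    preprocessor_config: dict loaded from the checkpoint's preprocessor config.
    model_type: optional model type string taken from the model config.
    """
    # 1. An explicit entry in the preprocessor config takes priority.
    name = preprocessor_config.get("image_processor_type")
    if name is not None:
        return name

    # 2. Only then fall back to the model type, and only if it is registered.
    if model_type in IMAGE_PROCESSOR_NAMES:
        return IMAGE_PROCESSOR_NAMES[model_type]

    # 3. Fail loudly instead of silently returning None.
    raise ValueError(
        f"No image processor registered for model type {model_type!r}; "
        f"known types: {', '.join(IMAGE_PROCESSOR_NAMES)}"
    )
```

With this sketch, `resolve_image_processor_name({}, "flava")` returns `"FlavaImageProcessor"`, while an unregistered model type raises `ValueError` rather than yielding `None`.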
@@ -16,17 +16,17 @@ specific language governing permissions and limitations under the License.
 
 The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
 
-The paper aims at creating a single unified foundation model which can work across vision, language
+The paper aims at creating a single unified foundation model which can work across vision, language
 as well as vision-and-language multimodal tasks.
 
 The abstract from the paper is the following:
 
-*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
-of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
-(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
-direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
-at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
-cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
+*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
+of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
+(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
+direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
+at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
+cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
 impressive performance on a wide range of 35 tasks spanning these target modalities.*
 
 
@@ -61,6 +61,11 @@ This model was contributed by [aps](https://huggingface.co/aps). The original co
 
 [[autodoc]] FlavaFeatureExtractor
 
+## FlavaImageProcessor
+
+[[autodoc]] FlavaImageProcessor
+    - preprocess
+
 ## FlavaForPreTraining
 
 [[autodoc]] FlavaForPreTraining