VisionTextDualEncoder (#13511)
* init vision_text_dual_encoder
* fix merge
* remove extra heads
* fix tests
* remove VISION_TEXT_DUAL_ENCODER_PRETRAINED_CONFIG_ARCHIVE_MAP
* remove archive map
* fix imports
* fix more imports
* fix init
* delete tokenizers
* fix imports
* clean
* support clip's vision model
* handle None config
* begin tests
* more test and few fixes
* warn about newly init weights
* more tests
* add loss to model
* remove extra classes from doc
* add processor
* doc and small fixes
* add start docstr
* update flax model
* flax tests
* more flax tests
* doc
* quality
* doc and quality
* fix doc
* doc
* remove comments
* update warning
* quality
* fix docs
* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* replace asserts, fix imports
* update imports
* fix import
* address some review comments
* fix check
* reduce tolerance
* fix test
* add flax integration test
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address Sylvain's comments
* fix style
* add pt_flax_equivalence test in PT tests
* add pt integration test
* update test
* use pre-trained checkpoint in examples

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
56
docs/source/model_doc/vision_text_dual_encoder.rst
Normal file
@@ -0,0 +1,56 @@
..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

VisionTextDualEncoder
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`~transformers.VisionTextDualEncoderModel` can be used to initialize a vision-text dual encoder model with
any pretrained vision autoencoding model as the vision encoder (*e.g.* :doc:`ViT <vit>`, :doc:`BEiT <beit>`, :doc:`DeiT
<deit>`) and any pretrained text autoencoding model as the text encoder (*e.g.* :doc:`RoBERTa <roberta>`, :doc:`BERT
<bert>`). Two projection layers are added, one on top of the vision encoder and one on top of the text encoder, to
project the output embeddings to a shared latent space. The projection layers are randomly initialized, so the model
should be fine-tuned on a downstream task. The model can be used to align the vision and text embeddings with CLIP-like
contrastive image-text training, and can then be used for zero-shot vision tasks such as image classification or
retrieval.

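To make the projection and contrastive alignment described above concrete, here is a minimal, dependency-free sketch of
a CLIP-style symmetric contrastive loss. It illustrates the general recipe only and is not the library's
implementation: the helper names (``project``, ``clip_style_loss``), the bias-free projections, the toy feature values
and the ``logit_scale`` value are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def project(features, weight):
    """Bias-free linear projection; `weight` has shape (projection_dim, feature_dim)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embeds, text_embeds, logit_scale=1.0):
    """Average of the text-to-image and image-to-text contrastive losses."""
    image_embeds = [l2_normalize(e) for e in image_embeds]
    text_embeds = [l2_normalize(e) for e in text_embeds]
    logits_per_text = [
        [logit_scale * sum(a * b for a, b in zip(t, i)) for i in image_embeds]
        for t in text_embeds
    ]
    logits_per_image = [list(col) for col in zip(*logits_per_text)]  # transpose
    return (
        cross_entropy_on_diagonal(logits_per_text)
        + cross_entropy_on_diagonal(logits_per_image)
    ) / 2


# Hypothetical encoder outputs with different sizes per modality (3-d vision, 4-d text).
image_features = [[1.0, 0.1, 0.5], [0.1, 1.0, -0.3]]
text_features = [[0.9, 0.2, 0.0, 0.7], [0.0, 1.1, 0.4, -0.2]]

# Hypothetical projection weights mapping both modalities into a shared 2-d space.
vision_projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

image_embeds = [project(f, vision_projection) for f in image_features]
text_embeds = [project(f, text_projection) for f in text_features]

# Matching pairs on the diagonal give a lower loss than mismatched pairs.
aligned_loss = clip_style_loss(image_embeds, text_embeds, logit_scale=10.0)
shuffled_loss = clip_style_loss(image_embeds, text_embeds[::-1], logit_scale=10.0)
```

Training pushes the loss down by raising the similarity of matching image-text pairs relative to all other pairs in the
batch, which is what makes the shared space usable for zero-shot classification or retrieval afterwards.
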
In `LiT: Zero-Shot Transfer with Locked-image Text Tuning <https://arxiv.org/abs/2111.07991>`__ it is shown how
leveraging pre-trained (locked/frozen) image and text models for contrastive learning yields significant improvement on
new zero-shot vision tasks such as image classification or retrieval.

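As a rough illustration of the locked-image idea, the sketch below builds a tiny, randomly initialized dual encoder
and freezes its vision encoder before running one contrastive training step. The configuration sizes and random inputs
are hypothetical and chosen only so the example runs quickly without downloading anything, and freezing only
``vision_model`` is an assumption about how one might approximate the locked setup. In practice the encoders would
come from pretrained checkpoints, for example via
:meth:`~transformers.VisionTextDualEncoderModel.from_vision_text_pretrained`.

```python
import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionTextDualEncoderConfig,
    VisionTextDualEncoderModel,
)

torch.manual_seed(0)

# Hypothetical tiny configs; real use would load pretrained encoders instead.
vision_config = ViTConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    image_size=32,
    patch_size=16,
)
text_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=32,
)
config = VisionTextDualEncoderConfig.from_vision_text_configs(
    vision_config, text_config, projection_dim=16
)
model = VisionTextDualEncoderModel(config=config)

# Lock the image tower: its weights receive no gradient updates.
for param in model.vision_model.parameters():
    param.requires_grad = False

# Random stand-ins for a batch of two image-text pairs.
pixel_values = torch.randn(2, 3, 32, 32)
input_ids = torch.randint(0, 100, (2, 8))
attention_mask = torch.ones_like(input_ids)

outputs = model(
    input_ids=input_ids,
    pixel_values=pixel_values,
    attention_mask=attention_mask,
    return_loss=True,  # compute the contrastive image-text loss
)
outputs.loss.backward()
```

After the backward pass, only the text encoder and the two projection layers hold gradients, so an optimizer built
from the parameters with ``requires_grad=True`` would leave the vision encoder untouched.
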
VisionTextDualEncoderConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.VisionTextDualEncoderConfig
    :members:


VisionTextDualEncoderProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.VisionTextDualEncoderProcessor
    :members:


VisionTextDualEncoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.VisionTextDualEncoderModel
    :members: forward


FlaxVisionTextDualEncoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxVisionTextDualEncoderModel
    :members: __call__