Make Swin work with VisionEncoderDecoderModel (#15527)

* Add attribute_map
* Add mention in docs
* Set hidden_size attribute correctly
* Add note about Transformer-based models only

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MBP.localdomain>
@@ -13,8 +13,8 @@ specific language governing permissions and limitations under the License.
 # Vision Encoder Decoder Models
 
 The [`VisionEncoderDecoderModel`] can be used to initialize an image-to-text-sequence model with any
-pretrained vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit))
-and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert)).
+pretrained Transformer-based vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit), [Swin](swin))
+and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert), [DistilBERT](distilbert)).
 
 The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for
 example) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
@@ -90,6 +90,10 @@ class SwinConfig(PretrainedConfig):
     ```"""
     model_type = "swin"
 
+    attribute_map = {
+        "num_attention_heads": "num_heads",
+    }
+
     def __init__(
         self,
         image_size=224,
@@ -130,3 +134,6 @@ class SwinConfig(PretrainedConfig):
         self.path_norm = patch_norm
         self.layer_norm_eps = layer_norm_eps
         self.initializer_range = initializer_range
+        # we set the hidden_size attribute in order to make Swin work with VisionEncoderDecoderModel
+        # this indicates the channel dimension after the last stage of the model
+        self.hidden_size = embed_dim * 8
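
Note on the two config changes: a minimal, self-contained sketch of what `attribute_map` and `hidden_size` provide to code that expects the generic attribute names. `SketchConfig` is a hypothetical stand-in written for illustration and is not the `PretrainedConfig` implementation. The default values, and the generalization of the factor 8 to `2 ** (number of stages - 1)`, are assumptions (the commit itself hard-codes `embed_dim * 8`, which matches a four-stage model where each later stage doubles the channel count).

```python
class SketchConfig:
    """Hypothetical stand-in for a config class (not the transformers implementation)."""

    # Generic attribute name -> model-specific attribute that actually stores the value.
    attribute_map = {"num_attention_heads": "num_heads"}

    def __init__(self, embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24)):
        # Illustrative defaults, assumed for this sketch.
        self.embed_dim = embed_dim
        self.depths = list(depths)
        self.num_heads = list(num_heads)
        # Assumption: every stage after the first doubles the channel dimension,
        # so four stages give embed_dim * 2**3 == embed_dim * 8, as in the commit.
        self.hidden_size = embed_dim * 2 ** (len(self.depths) - 1)

    def __getattr__(self, name):
        # Only called when normal lookup fails: redirect aliased names.
        attribute_map = type(self).attribute_map
        if name in attribute_map:
            return getattr(self, attribute_map[name])
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Writes through an alias land on the model-specific attribute.
        name = type(self).attribute_map.get(name, name)
        super().__setattr__(name, value)


config = SketchConfig()
print(config.hidden_size)          # 768 for embed_dim=96
print(config.num_attention_heads)  # same list as config.num_heads
```

With the alias and `hidden_size` in place, generic code can read `num_attention_heads` and `hidden_size` from the encoder config without knowing the Swin-specific names.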