AutoImageProcessor (#20111)

* AutoImageProcessor skeleton

* Update references

* Add mapping in init

* Add model image processors to __init__ for importing

* Add AutoImageProcessor tests

* Fix up

* Image Processor documentation

* Remove pdb

* Update docs/source/en/model_doc/mobilevit.mdx

* Update docs

* Don't add whitespace on json files

* Remove fixtures

* Move checking model config down

* Fix up

* Add check for image processor

* Remove FeatureExtractorMixin in docstrings

* Rename model_tmpfile to config_tmpfile

* Don't make None if not in image processor map
amyeroberts
2022-11-08 19:54:41 +00:00
committed by GitHub
parent c08a1e26ab
commit 4eb918e656
51 changed files with 1371 additions and 123 deletions
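
The commit message above mentions a mapping for image processors and a change to when the model config is checked. As a rough illustration of that kind of lookup, here is a minimal sketch. It is not the code from this commit: `PROCESSOR_NAMES`, `resolve_image_processor_name`, and the `image_processor_type` / `model_type` keys are assumptions chosen for the example, and raising on an unknown model type is a design choice of the sketch rather than confirmed behaviour.

```python
# Illustrative sketch only: every name here is hypothetical, not the transformers API.
PROCESSOR_NAMES = {
    "mobilevit": "MobileViTImageProcessor",
    "vit": "ViTImageProcessor",
}


def resolve_image_processor_name(config: dict) -> str:
    """Return the image processor class name for a config dictionary."""
    # An explicit entry in the config takes priority.
    explicit = config.get("image_processor_type")
    if explicit is not None:
        return explicit
    # Otherwise fall back to the model type and look it up in the mapping.
    model_type = config.get("model_type")
    if model_type in PROCESSOR_NAMES:
        return PROCESSOR_NAMES[model_type]
    # Unknown model types raise instead of silently returning None.
    raise ValueError(f"No image processor registered for model type {model_type!r}")


print(resolve_image_processor_name({"model_type": "mobilevit"}))
```

Checking the explicit entry first means the model type is only consulted when the config does not already name a processor.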

@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
## Overview
-The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers.
+The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers.
The abstract from the paper is the following:
@@ -25,10 +25,10 @@ Tips:
- MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
- One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
-- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
-- As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
+- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
+- As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
-You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) to generate a
+You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) to generate a
TensorFlow Lite model:
```py
@@ -52,7 +52,7 @@ with open(tflite_filename, "wb") as f:
```
The resulting model will be just **about an MB** making it a good fit for mobile applications where resources and network
-bandwidth can be constrained.
+bandwidth can be constrained.
This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets).
@@ -68,6 +68,12 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
- __call__
- post_process_semantic_segmentation
+## MobileViTImageProcessor
+[[autodoc]] MobileViTImageProcessor
+    - preprocess
+    - post_process_semantic_segmentation
## MobileViTModel
[[autodoc]] MobileViTModel
@@ -86,14 +92,14 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
## TFMobileViTModel
[[autodoc]] TFMobileViTModel
-    - call
+    - call
## TFMobileViTForImageClassification
[[autodoc]] TFMobileViTForImageClassification
-    - call
+    - call
## TFMobileViTForSemanticSegmentation
[[autodoc]] TFMobileViTForSemanticSegmentation
-    - call
+    - call