PerceptionLM (#37878)

* plm template

* A working plm with fixed image features

* hacked processor

* First version that reproduced PLM output using PE from timm.

* Simplify and fix tie_word_embeddings

* Use PIL resize. Simplify converstion.

* First version that works with video input.

* simplifed image preprocessing (not batched)

* Minor fixes after rebasing on main.

* Video processor based on new API.

* Revert to use _preprocess for image processor.

* refactor with modular

* fix tie_word_embedding

* Testing with timm PE

* check in missed converstion from modular to model.py

* First working version of PLM with Eva PE. PLM-1B and 3B outputs are exactly the same as before. PLM-8B output has some differences.

* address review comments

* Fixed batching if video and image examples mixed.

* Simplify PE configuration.

* Enable AutoModel for PerceptionEncoder.

* Update PE config style.

* update all headers

* Minor fixes.

* Move lm_head to PerceptionLMForConditionalGeneration.
Fix vit_G model specification.

* Fix for testing_modeling_perception_lm.py

* Image processing refactoring to use more common parts.

* Fix processor test.

* update tests to use model from hub

* More test fixes.

* integration test GT update after rebasing; probably due to video preprocessing

* update test media path to hub

* Stop tracking local scripts

* address some review comments

* refactor image processing.

* small fixes

* update documentation and minor fixes

* remove scripts

* Minor fix for CI

* Fix image processing

* CI and doc fix

* CI formatting fix

* ruff fix

* ruff formatting

* ran utils/sort_auto_mappings.py

* update docstring

* more docstring udpates

* add vision_input_type default fallback for image processing

* more verbose variable naming

* test update

* Remove PE and PEConfig use AutoModel(TimmWrapper) instead

* Minor cleanup.

* Minor Fix: remove any ref to PE. Ruff format and check.

* fix docstring

* Fix modular/model consistency.Improvex docstringfor  .

* Fix PerceptionLMForConditionalGenerationModelTest

* ruff fix

* fix for check_repo

* minor formatting

* dummy size arg to fix for processor test.

* Update docstring for PerceptionLMConfig

* Minor fixes from review feedback.

* Revert some minor changes per reviewer feedback.

* update base_model_prefix

* address reviewer feedback

* fix comment in modeling file

* address reviewer feedback

* ruff format

* Pre-merge test update.

* reapply modular and fix checkpoint name

* processor test path

* use modular a bit more

* remove dead code

* add token decorator

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
This commit is contained in:
Shuming Hu
2025-07-11 02:07:32 -07:00
committed by GitHub
parent 4b47b2b8ea
commit bf607f6d3b
20 changed files with 3262 additions and 1 deletions

View File

@@ -1039,6 +1039,8 @@
title: PaliGemma
- local: model_doc/perceiver
title: Perceiver
- local: model_doc/perception_lm
title: PerceptionLM
- local: model_doc/phi4_multimodal
title: Phi4 Multimodal
- local: model_doc/pix2struct

View File

@@ -0,0 +1,68 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# PerceptionLM
## Overview
The PerceptionLM model was proposed in [PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding](https://ai.meta.com/research/publications/perceptionlm-open-access-data-and-models-for-detailed-visual-understanding/) by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of
a vision encoder with a small scale (<8B parameters) LLM decoder.
The abstract from the paper is the following:
*Vision-language models are integral to computer vision research, yet many high-performing models
remain closed-source, obscuring their data, design and training recipe. The research community
has responded by using distillation from black-box models to label training data, achieving strong
benchmark results, at the cost of measurable scientific progress. However, without knowing the details
of the teacher model and its data sources, scientific progress remains difficult to measure. In this
paper, we study building a Perception Language Model (PLM) in a fully open and reproducible
framework for transparent research in image and video understanding. We analyze standard training
pipelines without distillation from proprietary models and explore large-scale synthetic data to identify
critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M
human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded
video captions. Additionally, we introduce PLMVideoBench, a suite for evaluating challenging video
understanding tasks focusing on the ability to reason about “what”, “where”, “when”, and “how” of a
video. We make our work fully reproducible by providing data, training recipes, code & models.*
This model was contributed by [shumingh](https://huggingface.co/shumingh).
The original code can be found [here](https://github.com/facebookresearch/perception_models).
## PerceptionLMConfig
[[autodoc]] PerceptionLMConfig
## PerceptionLMProcessor
[[autodoc]] PerceptionLMProcessor
## PerceptionLMImageProcessorFast
[[autodoc]] PerceptionLMImageProcessorFast
## PerceptionLMVideoProcessor
[[autodoc]] PerceptionLMVideoProcessor
## PerceptionLMModel
[[autodoc]] PerceptionLMModel
## PerceptionLMForConditionalGeneration
[[autodoc]] PerceptionLMForConditionalGeneration
- forward