Commit 6515c25

Authored by haotongl, xenova, and qubvel
Add Prompt Depth Anything Model (#35401)
* add prompt depth anything model by modular transformer
* add prompt depth anything docs and imports
* update code style according transformers doc
* update code style: import order issue is fixed by custom_init_isort
* fix depth shape from B,1,H,W to B,H,W which is as the same as Depth Anything
* move prompt depth anything to vision models in _toctree.yml
* update backbone test; there is no need for resnet18 backbone test
* update init file & pass RUN_SLOW tests
* update len(prompt_depth) to prompt_depth.shape[0] (Co-authored-by: Joshua Lochner <admin@xenova.com>)
* fix torch_int/model_doc
* fix typo
* update PromptDepthAnythingImageProcessor
* fix typo
* fix typo for prompt depth anything doc
* update promptda overview image link of huggingface repo
* fix some typos in promptda doc
* Update image processing to include pad_image, prompt depth position, and related explanations for better clarity and functionality.
* add copy disclaimer for prompt depth anything image processing
* fix some format typos in image processing and conversion scripts
* fix nn.ReLU(False) to nn.ReLU()
* rename residual layer as it's a sequential layer
* move size compute to a separate line/variable for easier debug in modular prompt depth anything
* fix modular format for prompt depth anything
* update modular prompt depth anything
* fix scale to meter and some internal funcs warp
* fix code style in image_processing_prompt_depth_anything.py
* fix issues in image_processing_prompt_depth_anything.py
* fix issues in image_processing_prompt_depth_anything.py
* fix issues in prompt depth anything
* update converting script similar to mllamma
* update testing for modeling prompt depth anything
* update testing for image_processing_prompt_depth_anything
* fix assertion in image_processing_prompt_depth_anything
* Update src/transformers/models/prompt_depth_anything/modular_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update src/transformers/models/prompt_depth_anything/modular_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update src/transformers/models/prompt_depth_anything/image_processing_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update src/transformers/models/prompt_depth_anything/image_processing_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update src/transformers/models/prompt_depth_anything/image_processing_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update docs/source/en/model_doc/prompt_depth_anything.md (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update docs/source/en/model_doc/prompt_depth_anything.md (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* update some testing
* fix testing
* fix
* add return doc for forward of prompt depth anything
* Update src/transformers/models/prompt_depth_anything/modular_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* Update tests/models/prompt_depth_anything/test_modeling_prompt_depth_anything.py (Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>)
* fix prompt depth order
* fix format for testing prompt depth anything
* fix minor issues in prompt depth anything doc
* fix format for modular prompt depth anything
* revert format for modular prompt depth anything
* revert format for modular prompt depth anything
* update format for modular prompt depth anything
* fix parallel testing errors
* fix doc for prompt depth anything
* Add header
* Fix imports
* Licence header

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
1 parent 6629177 · commit 6515c25

18 files changed · +2537 −0

docs/source/en/_toctree.yml

+2

@@ -735,6 +735,8 @@
       title: NAT
     - local: model_doc/poolformer
       title: PoolFormer
+    - local: model_doc/prompt_depth_anything
+      title: Prompt Depth Anything
     - local: model_doc/pvt
       title: Pyramid Vision Transformer (PVT)
     - local: model_doc/pvt_v2

docs/source/en/model_doc/prompt_depth_anything.md

+96

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Prompt Depth Anything

## Overview

The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://arxiv.org/abs/2412.14015) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang.

The abstract from the paper is as follows:

*Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/prompt_depth_anything_architecture.jpg"
alt="drawing" width="600"/>

<small> Prompt Depth Anything overview. Taken from the <a href="https://arxiv.org/pdf/2412.14015">original paper</a>.</small>
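The multi-scale prompt fusion the abstract describes can be pictured with a minimal sketch: at each decoder scale, the metric LiDAR prompt is resized to the feature map's resolution, embedded by a small convolutional stack, and added to the decoder features. The module and argument names below are illustrative only, not the library's actual internals (see `PromptDepthAnythingForDepthEstimation` for those):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptFusionBlock(nn.Module):
    """Illustrative sketch of multi-scale prompt fusion (not the real module).

    The metric LiDAR prompt (B, 1, H, W) is resized to a decoder scale,
    embedded with convolutions, and added to that scale's features.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.depth_embed = nn.Sequential(
            nn.Conv2d(1, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )

    def forward(self, features: torch.Tensor, prompt_depth: torch.Tensor) -> torch.Tensor:
        # Match the prompt's spatial size to this scale's feature map
        prompt = F.interpolate(
            prompt_depth, size=features.shape[-2:], mode="bilinear", align_corners=False
        )
        # Inject the embedded metric prompt into the decoder features
        return features + self.depth_embed(prompt)


# Toy usage: fuse a 192x256 LiDAR prompt into 48x64 decoder features
features = torch.randn(1, 64, 48, 64)
prompt_depth = torch.rand(1, 1, 192, 256)
fused = PromptFusionBlock(hidden_dim=64)(features, prompt_depth)
print(fused.shape)  # torch.Size([1, 64, 48, 64])
```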

## Usage example

The Transformers library allows you to use the model with just a few lines of code:

```python
>>> import torch
>>> import requests
>>> import numpy as np

>>> from PIL import Image
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> # the prompt depth can be None, and the model will output a monocular relative depth.

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000  # convert meters to millimeters
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16"))  # 16-bit depth image in mm
```
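As the comment in the snippet notes, the LiDAR prompt is optional. A short sketch, reusing the objects created above, of the prompt-free path, in which the model falls back to monocular relative (non-metric) depth:

```python
>>> inputs = image_processor(images=image, return_tensors="pt")  # no prompt_depth
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )
>>> relative_depth = post_processed_output[0]["predicted_depth"]  # relative, not metric
```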

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Prompt Depth Anything.

- [Prompt Depth Anything Demo](https://huggingface.co/spaces/depth-anything/PromptDA)
- [Prompt Depth Anything Interactive Results](https://promptda.github.io/interactive.html)

If you are interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## PromptDepthAnythingConfig

[[autodoc]] PromptDepthAnythingConfig

## PromptDepthAnythingForDepthEstimation

[[autodoc]] PromptDepthAnythingForDepthEstimation
    - forward

## PromptDepthAnythingImageProcessor

[[autodoc]] PromptDepthAnythingImageProcessor
    - preprocess
    - post_process_depth_estimation

src/transformers/__init__.py

+14
@@ -711,6 +711,7 @@
     "models.plbart": ["PLBartConfig"],
     "models.poolformer": ["PoolFormerConfig"],
     "models.pop2piano": ["Pop2PianoConfig"],
+    "models.prompt_depth_anything": ["PromptDepthAnythingConfig"],
     "models.prophetnet": [
         "ProphetNetConfig",
         "ProphetNetTokenizer",
@@ -1299,6 +1300,7 @@
     _import_structure["models.pix2struct"].extend(["Pix2StructImageProcessor"])
     _import_structure["models.pixtral"].append("PixtralImageProcessor")
     _import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
+    _import_structure["models.prompt_depth_anything"].extend(["PromptDepthAnythingImageProcessor"])
     _import_structure["models.pvt"].extend(["PvtImageProcessor"])
     _import_structure["models.qwen2_vl"].extend(["Qwen2VLImageProcessor"])
     _import_structure["models.rt_detr"].extend(["RTDetrImageProcessor"])
@@ -3335,6 +3337,12 @@
             "Pop2PianoPreTrainedModel",
         ]
     )
+    _import_structure["models.prompt_depth_anything"].extend(
+        [
+            "PromptDepthAnythingForDepthEstimation",
+            "PromptDepthAnythingPreTrainedModel",
+        ]
+    )
     _import_structure["models.prophetnet"].extend(
         [
             "ProphetNetDecoder",
@@ -5921,6 +5929,7 @@
     from .models.pop2piano import (
         Pop2PianoConfig,
     )
+    from .models.prompt_depth_anything import PromptDepthAnythingConfig
     from .models.prophetnet import (
         ProphetNetConfig,
         ProphetNetTokenizer,
@@ -6530,6 +6539,7 @@
         PoolFormerFeatureExtractor,
         PoolFormerImageProcessor,
     )
+    from .models.prompt_depth_anything import PromptDepthAnythingImageProcessor
     from .models.pvt import PvtImageProcessor
     from .models.qwen2_vl import Qwen2VLImageProcessor
     from .models.rt_detr import RTDetrImageProcessor
@@ -8166,6 +8176,10 @@
         Pop2PianoForConditionalGeneration,
         Pop2PianoPreTrainedModel,
     )
+    from .models.prompt_depth_anything import (
+        PromptDepthAnythingForDepthEstimation,
+        PromptDepthAnythingPreTrainedModel,
+    )
     from .models.prophetnet import (
         ProphetNetDecoder,
         ProphetNetEncoder,

src/transformers/models/__init__.py

+1
@@ -219,6 +219,7 @@
     plbart,
     poolformer,
     pop2piano,
+    prompt_depth_anything,
     prophetnet,
     pvt,
     pvt_v2,

src/transformers/models/auto/configuration_auto.py

+2
@@ -241,6 +241,7 @@
         ("plbart", "PLBartConfig"),
         ("poolformer", "PoolFormerConfig"),
         ("pop2piano", "Pop2PianoConfig"),
+        ("prompt_depth_anything", "PromptDepthAnythingConfig"),
         ("prophetnet", "ProphetNetConfig"),
         ("pvt", "PvtConfig"),
         ("pvt_v2", "PvtV2Config"),
@@ -593,6 +594,7 @@
         ("plbart", "PLBart"),
         ("poolformer", "PoolFormer"),
         ("pop2piano", "Pop2Piano"),
+        ("prompt_depth_anything", "PromptDepthAnything"),
         ("prophetnet", "ProphetNet"),
         ("pvt", "PVT"),
         ("pvt_v2", "PVTv2"),

src/transformers/models/auto/image_processing_auto.py

+1
@@ -127,6 +127,7 @@
         ("pix2struct", ("Pix2StructImageProcessor",)),
         ("pixtral", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
         ("poolformer", ("PoolFormerImageProcessor",)),
+        ("prompt_depth_anything", ("PromptDepthAnythingImageProcessor",)),
         ("pvt", ("PvtImageProcessor",)),
         ("pvt_v2", ("PvtImageProcessor",)),
         ("qwen2_5_vl", ("Qwen2VLImageProcessor", "Qwen2VLImageProcessorFast")),

src/transformers/models/auto/modeling_auto.py

+1
@@ -942,6 +942,7 @@
         ("depth_pro", "DepthProForDepthEstimation"),
         ("dpt", "DPTForDepthEstimation"),
         ("glpn", "GLPNForDepthEstimation"),
+        ("prompt_depth_anything", "PromptDepthAnythingForDepthEstimation"),
         ("zoedepth", "ZoeDepthForDepthEstimation"),
     ]
 )
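Together with the config and image-processor registrations above, this mapping lets the `Auto*` factories resolve the new model from the `prompt_depth_anything` model type. A quick sanity-check sketch (the checkpoint name is taken from the usage example in the docs):

```python
from transformers import AutoConfig, AutoImageProcessor, AutoModelForDepthEstimation

repo = "depth-anything/prompt-depth-anything-vits-hf"

# The config's model_type string keys all three auto-mappings added in this commit
config = AutoConfig.from_pretrained(repo)
print(type(config).__name__)  # PromptDepthAnythingConfig

image_processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModelForDepthEstimation.from_pretrained(repo)
```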

src/transformers/models/prompt_depth_anything/__init__.py

+31

# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_prompt_depth_anything import PromptDepthAnythingConfig
    from .image_processing_prompt_depth_anything import PromptDepthAnythingImageProcessor
    from .modeling_prompt_depth_anything import (
        PromptDepthAnythingForDepthEstimation,
        PromptDepthAnythingPreTrainedModel,
    )
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
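For context, the `_LazyModule` swap above keeps importing the package cheap: the torch-heavy submodules are only imported when one of their attributes is first accessed. A rough illustration of that lazy-loading effect (an assumption about runtime behavior, not an official snippet):

```python
import importlib

# Importing the package itself does not yet import the modeling code
pkg = importlib.import_module("transformers.models.prompt_depth_anything")

# First attribute access triggers the real submodule import
model_cls = pkg.PromptDepthAnythingForDepthEstimation
print(model_cls.__module__)
# transformers.models.prompt_depth_anything.modeling_prompt_depth_anything
```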
