[Performance]: GridSample in converted model runs very slowly on Arc770 dGPU #28448
Comments
Hi, do you also have the same issue with the iGPU or CPU in your system? It could be that the grid_sample kernel was simply not optimized at all and you are running the slow reference version. Fixing this would probably involve writing an optimized version instead.
Yeah, so it looks like grid_sample_ref needs to be optimized, and perhaps a grid_sample_opt version should be added...
ref ticket: CVS-161002
hey @schrodingho
Thank you for looking into this. You can refer to my forked repo, which includes a script named
hi @schrodingho, it would be great if you could confirm that it helps in your case/env/etc. The code should be correct at this point, and it should be as numerically stable as the ref version, so you shouldn't see any difference in output other than getting it faster. I will try to optimize it further on this branch, so it may take a while before it is merged to master.
OpenVINO Version
Master Branch
Operating System
Windows System
Device used for inference
dGPU
OpenVINO installation
PyPi
Programming Language
Python
Hardware Architecture
x86 (64 bits)
Model used
https://github.com/autonomousvision/unimatch
Model quantization
No
Target Platform
OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0
Performance issue description
I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.
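For reference, the per-layer breakdown came from benchmark_app's performance-counter reports; a command along the following lines reproduces that kind of report (the model path and report folder are placeholders, and GPU.1 is the Arc A770 listed above under Target Platform):

```sh
# Profile the converted IR on the Arc A770 (GPU.1) and dump per-layer counters.
# "unimatch.xml" and "reports/" are placeholder paths.
benchmark_app -m unimatch.xml -d GPU.1 -report_type detailed_counters -report_folder reports
```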
To reduce latency, I replaced the PyTorch function
F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True)
with a decomposed version (from this implementation). After benchmarking, this modification reduced the latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or could you suggest other optimizations, such as a custom GridSample OpenCL kernel? I've attached the benchmark_app results and reports for reference (ori_unimatch for the original model and opt_unimatch for the modified one).

ori_unimatch:
[benchmark_app detailed counters screenshot for ori_unimatch]
opt_unimatch:
[benchmark_app detailed counters screenshot for opt_unimatch]
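For reference, a decomposition along these lines is what such a replacement typically looks like. This is a sketch in the spirit of the widely used bilinear_grid_sample decomposition (e.g. the one shipped with mmcv) and may differ in details from the exact implementation linked above; it rewrites bilinear sampling with zeros padding as pad, gather, and elementwise ops:

```python
import torch
import torch.nn.functional as F


def bilinear_grid_sample(im: torch.Tensor, grid: torch.Tensor,
                         align_corners: bool = True) -> torch.Tensor:
    """Decomposed stand-in for F.grid_sample(im, grid, mode='bilinear',
    padding_mode='zeros'), built only from pad/gather/elementwise ops.

    im:   (N, C, H, W) feature map
    grid: (N, Hg, Wg, 2) sampling grid with x, y normalized to [-1, 1]
    """
    n, c, h, w = im.shape
    gn, gh, gw, _ = grid.shape
    assert n == gn

    x = grid[..., 0]
    y = grid[..., 1]

    # Map normalized coordinates to pixel coordinates.
    if align_corners:
        x = (x + 1) / 2 * (w - 1)
        y = (y + 1) / 2 * (h - 1)
    else:
        x = ((x + 1) * w - 1) / 2
        y = ((y + 1) * h - 1) / 2

    x = x.reshape(n, -1)
    y = y.reshape(n, -1)

    x0, y0 = torch.floor(x), torch.floor(y)
    x1, y1 = x0 + 1, y0 + 1

    # Bilinear weights for the four neighbouring pixels.
    wa = ((x1 - x) * (y1 - y)).unsqueeze(1)  # weight of (x0, y0)
    wb = ((x1 - x) * (y - y0)).unsqueeze(1)  # weight of (x0, y1)
    wc = ((x - x0) * (y1 - y)).unsqueeze(1)  # weight of (x1, y0)
    wd = ((x - x0) * (y - y0)).unsqueeze(1)  # weight of (x1, y1)

    # Zero-pad the image so out-of-range indices read zeros
    # (emulates padding_mode='zeros').
    im_padded = F.pad(im, pad=(1, 1, 1, 1), mode='constant', value=0)
    ph, pw = h + 2, w + 2
    x0 = (x0.long() + 1).clamp(0, pw - 1)
    x1 = (x1.long() + 1).clamp(0, pw - 1)
    y0 = (y0.long() + 1).clamp(0, ph - 1)
    y1 = (y1.long() + 1).clamp(0, ph - 1)

    # Gather the four neighbours for every output location.
    im_flat = im_padded.reshape(n, c, -1)
    idx_a = (x0 + y0 * pw).unsqueeze(1).expand(-1, c, -1)
    idx_b = (x0 + y1 * pw).unsqueeze(1).expand(-1, c, -1)
    idx_c = (x1 + y0 * pw).unsqueeze(1).expand(-1, c, -1)
    idx_d = (x1 + y1 * pw).unsqueeze(1).expand(-1, c, -1)

    Ia = torch.gather(im_flat, 2, idx_a)
    Ib = torch.gather(im_flat, 2, idx_b)
    Ic = torch.gather(im_flat, 2, idx_c)
    Id = torch.gather(im_flat, 2, idx_d)

    return (Ia * wa + Ib * wb + Ic * wc + Id * wd).reshape(n, c, gh, gw)
```

A quick sanity check before exporting is to compare it against the original op on random inputs, e.g. `torch.allclose(bilinear_grid_sample(im, grid), F.grid_sample(im, grid, mode="bilinear", padding_mode="zeros", align_corners=True), atol=1e-6)`.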
Step-by-step reproduction
- Download GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
- Use gmflow_demo.sh in Scripts to run the model.
- Change F.grid_sample in /unimatch/matching.py to this implementation, and redo steps 4 and 5.

Issue submission checklist