[Performance]: GridSample in converted model runs very slowly on Arc770 dGPU #28448
Comments
Hi, do you also have the same issue with the iGPU or CPU in your system? It could be that the grid_sample kernel was simply not optimized at all and you are running the slow reference version. Fixing this would probably involve writing an optimized version instead.
Yeah, so it looks like grid_sample_ref needs to be optimized, and perhaps a grid_sample_opt version should be added...
ref ticket: CVS-161002
hey @schrodingho
Thank you for looking into this. You can refer to my forked repo, which includes a script named
hi @schrodingho, it would be great if you could confirm that it helps in your case/env/etc. The code should be correct at this point, and it should be as numerically stable as the ref version, so you shouldn't see any difference in output other than getting it faster. I will try to optimize it further on this branch, so it may take a while before it is merged to master.
OpenVINO Version
Master Branch
Operating System
Windows System
Device used for inference
dGPU
OpenVINO installation
PyPi
Programming Language
Python
Hardware Architecture
x86 (64 bits)
Model used
https://github.com/autonomousvision/unimatch
Model quantization
No
Target Platform
OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0
Performance issue description
I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.
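For reference, the per-layer breakdown came from benchmark_app's performance-counter reports; a command along the following lines reproduces that kind of report (the model path and report folder are placeholders, and GPU.1 is the Arc A770 listed above under Target Platform):

```sh
# Profile the converted IR on the Arc A770 (GPU.1) and dump per-layer counters.
# "unimatch.xml" and "reports/" are placeholder paths.
benchmark_app -m unimatch.xml -d GPU.1 -report_type detailed_counters -report_folder reports
```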
To reduce latency, I replaced the PyTorch function
F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True)
with a decomposed version (from this implementation). After benchmarking, this modification reduced the latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or could you suggest other optimizations, such as a custom GridSample OpenCL kernel? I've attached the benchmark_app results and reports for reference (ori_unimatch for the original model and opt_unimatch for the modified one).

ori_unimatch:
[benchmark_app detailed counters screenshot for ori_unimatch]
opt_unimatch:
[benchmark_app detailed counters screenshot for opt_unimatch]
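For reference, a decomposition along these lines is what such a replacement typically looks like. This is a sketch in the spirit of the widely used bilinear_grid_sample decomposition (e.g. the one shipped with mmcv) and may differ in details from the exact implementation linked above; it rewrites bilinear sampling with zeros padding as pad, gather, and elementwise ops:

```python
import torch
import torch.nn.functional as F


def bilinear_grid_sample(im: torch.Tensor, grid: torch.Tensor,
                         align_corners: bool = True) -> torch.Tensor:
    """Decomposed stand-in for F.grid_sample(im, grid, mode='bilinear',
    padding_mode='zeros'), built only from pad/gather/elementwise ops.

    im:   (N, C, H, W) feature map
    grid: (N, Hg, Wg, 2) sampling grid with x, y normalized to [-1, 1]
    """
    n, c, h, w = im.shape
    gn, gh, gw, _ = grid.shape
    assert n == gn

    x = grid[..., 0]
    y = grid[..., 1]

    # Map normalized coordinates to pixel coordinates.
    if align_corners:
        x = (x + 1) / 2 * (w - 1)
        y = (y + 1) / 2 * (h - 1)
    else:
        x = ((x + 1) * w - 1) / 2
        y = ((y + 1) * h - 1) / 2

    x = x.reshape(n, -1)
    y = y.reshape(n, -1)

    x0, y0 = torch.floor(x), torch.floor(y)
    x1, y1 = x0 + 1, y0 + 1

    # Bilinear weights for the four neighbouring pixels.
    wa = ((x1 - x) * (y1 - y)).unsqueeze(1)  # weight of (x0, y0)
    wb = ((x1 - x) * (y - y0)).unsqueeze(1)  # weight of (x0, y1)
    wc = ((x - x0) * (y1 - y)).unsqueeze(1)  # weight of (x1, y0)
    wd = ((x - x0) * (y - y0)).unsqueeze(1)  # weight of (x1, y1)

    # Zero-pad the image so out-of-range indices read zeros
    # (emulates padding_mode='zeros').
    im_padded = F.pad(im, pad=(1, 1, 1, 1), mode='constant', value=0)
    ph, pw = h + 2, w + 2
    x0 = (x0.long() + 1).clamp(0, pw - 1)
    x1 = (x1.long() + 1).clamp(0, pw - 1)
    y0 = (y0.long() + 1).clamp(0, ph - 1)
    y1 = (y1.long() + 1).clamp(0, ph - 1)

    # Gather the four neighbours for every output location.
    im_flat = im_padded.reshape(n, c, -1)
    idx_a = (x0 + y0 * pw).unsqueeze(1).expand(-1, c, -1)
    idx_b = (x0 + y1 * pw).unsqueeze(1).expand(-1, c, -1)
    idx_c = (x1 + y0 * pw).unsqueeze(1).expand(-1, c, -1)
    idx_d = (x1 + y1 * pw).unsqueeze(1).expand(-1, c, -1)

    Ia = torch.gather(im_flat, 2, idx_a)
    Ib = torch.gather(im_flat, 2, idx_b)
    Ic = torch.gather(im_flat, 2, idx_c)
    Id = torch.gather(im_flat, 2, idx_d)

    return (Ia * wa + Ib * wb + Ic * wc + Id * wd).reshape(n, c, gh, gw)
```

A quick sanity check before exporting is to compare it against the original op on random inputs, e.g. `torch.allclose(bilinear_grid_sample(im, grid), F.grid_sample(im, grid, mode="bilinear", padding_mode="zeros", align_corners=True), atol=1e-6)`.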
Step-by-step reproduction
- Download GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
- Use gmflow_demo.sh in Scripts to run the model.
- Change F.grid_sample in /unimatch/matching.py to this implementation, and redo steps 4 and 5.

Issue submission checklist