Background
When processing large batches of tall images (10,000-100,000 rows, 64 columns), I hit a data race that causes out-of-bounds reads in the labeling_hennequin_2018_HA4.cu algorithm implementation. It only manifests in roughly every 10-100 runs of the kernel, and the implicit-sync optimization is the cause. The fix is simply to remove the #ifdef check, i.e. to call __syncthreads() unconditionally. I used compute-sanitizer --tool racecheck to confirm the race. On my data, adding the synchronization led to a ~5% runtime regression in the StripLabeling() kernel. My machine runs RHEL with an Ampere GPU and CUDA 11.4. I can't provide reproducer code since I'm not allowed to share it.
Solution
I think the check is supposed to be __CUDA_ARCH__ >= 700 instead of __CUDA_ARCH__ < 700? From what I remember, NVIDIA has been removing implicit synchronization over time, not adding it, though I couldn't name the exact feature the #ifdef block is trying to exploit. Personally, I would just comment out the #ifdef entirely and leave a comment explaining why; that's what I did in my own code. The specific optimization is mentioned in the original HA4 paper (I think), so adding an explanation at that point should dissuade the next person from making the same mistake.
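To illustrate for other readers, here is a minimal, self-contained sketch of the pattern being discussed. This is hypothetical code, not the actual YACCLAB kernel; the kernel name, the array shapes, and the exact form of the guard are all assumptions made for the example.

```cuda
#include <cstdio>

// Hypothetical sketch of the pattern under discussion; the real
// labeling_hennequin_2018_HA4.cu kernel differs in detail.
__global__ void strip_labeling_sketch(const int* in, int* out, int n)
{
    __shared__ int tile[64];
    int t = threadIdx.x;

    if (t < n) tile[t] = in[t];  // phase 1: each thread writes shared memory

    // The problematic version guarded this barrier with something like:
    //   #if __CUDA_ARCH__ < 700
    //       __syncthreads();
    //   #endif
    // relying on implicit warp-lockstep execution otherwise. Volta (sm_70)
    // introduced independent thread scheduling, so that implicit ordering is
    // no longer guaranteed and the barrier must run unconditionally.
    __syncthreads();

    if (t < n) out[t] = tile[n - 1 - t];  // phase 2: read another thread's write
}

int main()
{
    // Tiny launch just to exercise the kernel (requires a CUDA-capable GPU).
    const int n = 64;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    strip_labeling_sketch<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[0] = %d\n", h_out[0]);  // 63 with the barrier in place
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Without the barrier, phase 2 can read tile[] slots that another thread has not yet written, which is exactly the kind of race compute-sanitizer's racecheck tool reports.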
Sure, I can do that. Do you happen to know if it's just inverting the comparison operator? If not, I'll go through the original paper and figure out what it should be 👍
N7Alpha added a commit to N7Alpha/YACCLAB that referenced this issue on Feb 13, 2025