[SYCL] Generate volatile load/store/memcpy instructions in LowerWGScope #1257
Conversation
The LowerWGScope pass generates copies between private and shared memory. The logic is to share a private value from the leader work item with the other work items through shared memory. Example in pseudo code:

...
if (Leader work item)
  store %PrivateValue to @SharedGlobal   -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal    -> all WIs load the shared value
...

The generated load/store operations are not supposed to be moved across the memory barrier, but barrier intrinsics like @llvm.nvvm.barrier0() are not handled specially by LLVM middle-end passes and are recognized only by the PTX backend. So, middle-end optimizations could perform code movement that results in the load happening before the store. For example, GVN could perform LoadPRE based on GlobalsAA:

...
crit_edge:
  load %PrivateValue from @SharedGlobal  -> all WIs load the shared value
if (Leader work item)
  store %PrivateValue to @SharedGlobal   -> leader shares the value
memory_barrier()
...

That is why we generate volatile memory accesses in the LowerWGScope pass: to guarantee that such optimizations are not performed by LLVM passes.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
@@ -921,7 +921,7 @@ Value *spirv::genLinearLocalID(Instruction &Before, const Triple &TT) {
     unsigned Align = M.getDataLayout().getPreferredAlignment(G);
     G->setAlignment(Align);
   }
-  Value *Res = new LoadInst(G, "", &Before);
+  Value *Res = new LoadInst(G, "", true, &Before);
This is the wrong way to fix the problem.
Please, add "convergent" attribute to the barrier function at line 944.
NOTE: it's not the only function missing this attribute.
Please, add a unit test for this pass to clang LIT test suite.
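A minimal sketch of what marking the declaration convergent could look like from inside the pass (the module lookup and intrinsic name are assumptions for illustration, not the actual LowerWGScope code):

  #include "llvm/IR/Attributes.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Hypothetical: ensure the barrier declaration carries "convergent" so
  // optimizers know they may not introduce new control dependencies around
  // calls to it.
  if (Function *BarrierF = M.getFunction("llvm.nvvm.barrier0"))
    if (!BarrierF->hasFnAttribute(Attribute::Convergent))
      BarrierF->addFnAttr(Attribute::Convergent);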
@Naghasan, do we need to add this attribute to the _Z22__spirv_ControlBarrierN5__spv5ScopeES0_j implementation in libclc?
It's probably safer to do so, regardless of how LLVM handles attribute merging.
Thx, I am going to try adding the attribute, but I am not sure that this will fix the problem.
Could you please take a look at the example I provided in the issue: #1258
The declaration already contains the convergent attribute:
; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #3
Nevertheless, the illegal transformation still happens.
tail call void @llvm.nvvm.barrier0() #4
...
attributes #3 = { convergent nounwind }
attributes #4 = { nounwind }
The call must be convergent too, i.e. it should use attribute group #3 (a sketch of enforcing this from the pass follows the list below).
Note that these might require it too:
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #2
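On the point about the call site itself, a minimal sketch of how the pass could mark the barrier call convergent when it emits it (the builder variable and callee here are assumptions for illustration, not the actual pass code):

  // Hypothetical: emit the barrier call and mark the call site convergent so
  // the attribute is present at the call as well as on the declaration.
  CallInst *BarrierCall = Builder.CreateCall(BarrierF);
  BarrierCall->setConvergent();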
I'm almost 100% sure that applying convergent will fix the problem.
I'm 146% sure that using volatile for load/store instructions is not the right approach.
Discussed with Alexey offline; unfortunately, the convergent attribute doesn't help. As I described, the problem happens because of GlobalsAA, and the convergent attribute doesn't affect alias analysis. The real problem seems to be how global variables are created in the LowerWGScope pass: they are created with internal linkage. As a result, alias analysis thinks that the global doesn't escape and gives the wrong answer that the barrier references but doesn't modify the memory location of the global. It looks like global variables produced by LowerWGScope should have external linkage. I am working on a new patch.
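For illustration, a minimal sketch of the linkage change being described (the type, name, initializer, and helper are assumptions, not the actual LowerWGScope code):

  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Hypothetical: create the shared global with external linkage so GlobalsAA
  // cannot conclude that it never escapes and that the barrier cannot modify it.
  GlobalVariable *createSharedGlobal(Module &M, Type *Ty, StringRef Name) {
    return new GlobalVariable(M, Ty, /*isConstant=*/false,
                              GlobalValue::ExternalLinkage,
                              /*Initializer=*/nullptr, Name);
  }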
Created a separate PR with the new patch: #1257. For some reason, I am not able to add Victor as a reviewer. This PR is going to be closed.
This load and store must not be reordered in any case, since they access the same address (if the IR is accurate), and this is a WAR dependence. No memory barrier is needed to enforce that. So really, volatile is a workaround which does not break semantics (but it masks the problem).
Then this would be a problem in the alias analysis. First, the load/store address is the same, and the linkage does not matter. This would cause problems in a single-threaded program.