[SYCL] Generate volatile load/store/memcpy instructions in LowerWGScope #1257
Conversation
The LowerWGScope pass generates copies between private and shared memory. The logic is to share a private value from the leader work item with the other work items through shared memory. Example in pseudo code:

...
if (Leader work item)
  store %PrivateValue to @SharedGlobal   -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal    -> all WIs load the shared value
...

The generated load/store operations are not supposed to be moved across the memory barrier, but barrier intrinsics like @llvm.nvvm.barrier0() are not handled specially by LLVM middle-end passes and are recognized only by the PTX backend. So, middle-end optimizations could perform code movement that results in the load happening before the store. For example, GVN could perform LoadPRE based on GlobalsAA:

...
crit_edge:
  load %PrivateValue from @SharedGlobal  -> all WIs load the shared value
if (Leader work item)
  store %PrivateValue to @SharedGlobal   -> leader shares the value
memory_barrier()
...

That is why we generate volatile memory accesses in the LowerWGScope pass: to guarantee that such optimizations are not performed by LLVM passes.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
@@ -921,7 +921,7 @@ Value *spirv::genLinearLocalID(Instruction &Before, const Triple &TT) {
     unsigned Align = M.getDataLayout().getPreferredAlignment(G);
     G->setAlignment(Align);
   }
-  Value *Res = new LoadInst(G, "", &Before);
+  Value *Res = new LoadInst(G, "", true, &Before);
This is the wrong way to fix the problem.
Please, add "convergent" attribute to the barrier function at line 944.
NOTE: it's not the only function missing this attribute.
Please, add a unit test for this pass to clang LIT test suite.
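A minimal sketch of what marking the declaration convergent could look like from inside the pass (the module lookup and intrinsic name are assumptions for illustration, not the actual LowerWGScope code):

  #include "llvm/IR/Attributes.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Hypothetical: ensure the barrier declaration carries "convergent" so
  // optimizers know they may not introduce new control dependencies around
  // calls to it.
  if (Function *BarrierF = M.getFunction("llvm.nvvm.barrier0"))
    if (!BarrierF->hasFnAttribute(Attribute::Convergent))
      BarrierF->addFnAttr(Attribute::Convergent);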
@Naghasan, do we need to add this attribute to the _Z22__spirv_ControlBarrierN5__spv5ScopeES0_j implementation in libclc?
It's probably safer to do so, regardless of how LLVM handles attribute merging.
Thx, I am going to try adding the attribute, but I am not sure that this will fix the problem.
Could you please take a look at the example I provided in the issue: #1258
The declaration already contains the convergent attribute:
; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #3
Nevertheless, the illegal transformation still happens.
tail call void @llvm.nvvm.barrier0() #4
...
attributes #3 = { convergent nounwind }
attributes #4 = { nounwind }
The call must be convergent too, i.e. it should use attribute group #3 (a sketch of enforcing this from the pass follows the list below).
Note that these might require it too:
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() #2
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #2
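On the point about the call site itself, a minimal sketch of how the pass could mark the barrier call convergent when it emits it (the builder variable and callee here are assumptions for illustration, not the actual pass code):

  // Hypothetical: emit the barrier call and mark the call site convergent so
  // the attribute is present at the call as well as on the declaration.
  CallInst *BarrierCall = Builder.CreateCall(BarrierF);
  BarrierCall->setConvergent();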
I'm almost 100% sure that applying convergent will fix the problem.
I'm 146% sure that using volatile for load/store instructions is not the right approach.
Discussed with Alexey offline; unfortunately, the convergent attribute doesn't help. As I described, the problem happens because of GlobalsAA, and the convergent attribute doesn't affect alias analysis. The real problem seems to be how global variables are created in the LowerWGScope pass: they are created with internal linkage. As a result, alias analysis thinks that the global doesn't escape and gives the wrong answer that the barrier references but doesn't modify the memory location of the global. It looks like global variables produced by LowerWGScope should have external linkage. I am working on a new patch.
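For illustration, a minimal sketch of the linkage change being described (the type, name, initializer, and helper are assumptions, not the actual LowerWGScope code):

  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Hypothetical: create the shared global with external linkage so GlobalsAA
  // cannot conclude that it never escapes and that the barrier cannot modify it.
  GlobalVariable *createSharedGlobal(Module &M, Type *Ty, StringRef Name) {
    return new GlobalVariable(M, Ty, /*isConstant=*/false,
                              GlobalValue::ExternalLinkage,
                              /*Initializer=*/nullptr, Name);
  }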
Created a separate PR with the new patch: #1257. For some reason, I am not able to add Victor as a reviewer. This PR is going to be closed.
This load and store must not be reordered in any case, since they access the same address (if the IR is accurate), and this is a WAR dependence. No memory barrier is needed to enforce that. So really, volatile is a workaround which does not break semantics (but it masks the problem).
Then this would be a problem in the alias analysis. First, the load/store address is the same, and the linkage does not matter. This would cause problems in a single-threaded program.