Skip to content

[SYCL] Generate volatile load/store/memcpy instructions in LowerWGScope #1257

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
wants to merge 1 commit into from

Conversation

againull
Copy link
Contributor

@againull againull commented Mar 6, 2020

LowerWGScope pass generates copies between private and shared memory.
Logic is to share private value from leader work item to other work
items through shared memory. Example in pseudo code:


...
if (Leader work item)
  store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal -> all WIs load the shared value
...

Generated load/store operations are not supposed to be moved across
memory barrier but barrier intrinsics like @llvm.nvvm.barrier0() are
not handled specially by LLVM middle end passes and recognized only by
PTX backend. So, middle end optimizations could perform code movement
resulting in load before store. For example, GVN could perform LoadPRE
based on GlobalsAA:

...
crit_edge:
  load %PrivateValue from @SharedGlobal -> all WIs load the shared value

if (Leader work item)
  store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
...

That is why generate volatile memory accesses in LowerWGScope pass to
guarantee that such optimizations are not performed by llvm passes.

Signed-off-by: Artur Gainullin artur.gainullin@intel.com

LowerWGScope pass generates copies between private and shared memory.
Logic is to share private value from leader work item to other work
items through shared memory. Example in pseudo code:

...
if (Leader work item)
  store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
load %PrivateValue from @SharedGlobal -> all WIs load the shared value
...

Generated load/store operations are not supposed to be moved across
memory barrier but barrier intrinsics like  @llvm.nvvm.barrier0() are
not handled specially by LLVM middle end passes and recognized only by
PTX backend. So, middle end optimizations could perform code movement
resulting in load before store. For example, GVN could perform LoadPRE
based on GlobalsAA:

...
crit_edge:
  load %PrivateValue from @SharedGlobal -> all WIs load the shared value

if (Leader work item)
  store %PrivateValue to @SharedGlobal -> leader shares the value
memory_barrier()
...

That is why generate volatile memory accesses in LowerWGScope pass to
guarantee that such optimizations are not performed by llvm passes.

Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
@@ -921,7 +921,7 @@ Value *spirv::genLinearLocalID(Instruction &Before, const Triple &TT) {
unsigned Align = M.getDataLayout().getPreferredAlignment(G);
G->setAlignment(Align);
}
Value *Res = new LoadInst(G, "", &Before);
Value *Res = new LoadInst(G, "", true, &Before);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong way to fix the problem.
Please, add "convergent" attribute to the barrier function at line 944.
NOTE: it's not the only function missing this attribute.

Please, add a unit test for this pass to clang LIT test suite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Naghasan, do we need add this attribute to _Z22__spirv_ControlBarrierN5__spv5ScopeES0_j implementation in libclc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably safer to do so, regardless of how llvm handles attribute merging.

Copy link
Contributor Author

@againull againull Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, I am going to try add attribute but I am not sure that this will fix the problem.
Could you please take a look at the example I provided in the issue: #1258

Declaration contains convergent attribute:

; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #3

Nevertheless illegal transformation happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  tail call void @llvm.nvvm.barrier0() #4
...
attributes #3 = { convergent nounwind }
attributes #4 = { nounwind }

Call must be convergent too i.e. #3.
Note these might require it too:

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() #2

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() #2

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() #2

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() #2

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost 100% sure that applying convergent will fix the problem.
I'm 146% sure that using volatile for load/store instructions is not the right approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with Alexey offline, unfortunately convergent attribute doesn't help. As I described the problem, it happens because of globals AA, convergent attribute doesn't affect alias analysis. It looks like problem is how global variables are created in LowerWGScope pass. They are created with internal linkage. As a result alias analysis thinks that this global doesn't escape and gives wrong answer that barrier references but doesn't modify memory location of the global. It looks like globals variables produced by LowerWGScope should have external linkage. I am working on the new patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created separate PR with new patch #1257. For some reasons, I am not able to add Victor as a reviewer. This PR is going to be closed.

@againull againull closed this Mar 10, 2020
@kbobrovs
Copy link
Contributor

  load %PrivateValue from @SharedGlobal -> all WIs load the shared value
  store %PrivateValue to @SharedGlobal -> leader shares the value

these load a store must not be reordered in any case, since they access the same address (if the IR is accurate), and this is a WAR dependence. No memory barrier is needed to enforce that. So the real
problem is in the transformation which violates this.

Volatile is a workaround which does not break semantics (but masks the problem).

As a result alias analysis thinks that this global doesn't escape and gives wrong answer that barrier references but doesn't modify memory location of the global.

Then this would be a problem of the alias analysis. First, load/store address is the same, and it does not matter what's the linkage. This would cause problems in 1 threaded program.
Second, globals with internal linkage can well be accessed by multiple threads. It may just slightly help the alias analysis, but I don't see how it makes the reordering correct.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants