Add Cooperative Groups API integration #87
Conversation
@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine, which is what brought about this need in the first place. Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed; happy to modify things as I go. Also, for this first pass, I would like to stay focused only on the grid-level components of the Cooperative Groups API, as well as the basic cooperative launch host-side function. We can add multi-device support and the other cooperative group components later.
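For context, here is a rough sketch of how a grid-level primitive might look from a kernel's point of view. The `cg` module path and the `this_grid()`/`sync()` names are assumptions for illustration, not necessarily what this PR exposes.

```rust
// A minimal device-side sketch; `cuda_std::cg::this_grid()` and
// `sync()` are assumed names, not confirmed from this PR's diff.
use cuda_std::prelude::*;

#[kernel]
pub unsafe fn two_phase(data: *mut f32, len: usize) {
    let idx = thread::index_1d() as usize;
    if idx < len {
        *data.add(idx) *= 2.0;
    }

    // Grid-wide barrier: every thread in the whole launched grid must
    // arrive here before any thread continues. Only valid when the
    // kernel was scheduled with a cooperative launch.
    let grid = cuda_std::cg::this_grid();
    grid.sync();

    if idx < len {
        // Safe to read values written by any other block in phase one.
        *data.add(idx) += 1.0;
    }
}
```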
(force-pushed from 4bbc882 to e44a8bc)
This works as follows:
- Users build their CUDA code via `CudaBuilder` as normal.
- If they want to use the cooperative groups API, then in their `build.rs`, just after building their PTX, they will (see the sketch below):
  - create a `cuda_builder::cg::CooperativeGroups` instance,
  - add any needed opts for building the Cooperative Groups API bridge code (`-arch=sm_*` and so on),
  - add their newly built PTX code to be linked with the CG API, which can include multiple PTX, cubin or fatbin files,
  - call `.compile(..)`, which will spit out a fully linked `cubin`.
- In the user's main application code, instead of using `launch!` to schedule their GPU work, they will now use `launch_cooperative!`.
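Here is a minimal sketch of that `build.rs` flow. Only the `cuda_builder::cg::CooperativeGroups` type and the `.compile(..)` call are named above; the `add_opt`/`add_ptx` method names and the `compile` signature are assumptions for illustration.

```rust
// build.rs — sketch of the proposed workflow. Methods marked "assumed"
// are illustrative, not confirmed from this PR's diff.
use cuda_builder::CudaBuilder;
use cuda_builder::cg::CooperativeGroups;

fn main() {
    let out = std::env::var("OUT_DIR").unwrap();
    let ptx = format!("{out}/kernels.ptx");

    // 1. Build the GPU crate to PTX as usual.
    CudaBuilder::new("kernels")
        .copy_to(&ptx)
        .build()
        .unwrap();

    // 2. Link the PTX against the Cooperative Groups bridge code,
    //    emitting a fully linked cubin.
    CooperativeGroups::new()
        .add_opt("-arch=sm_70")                  // assumed option API
        .add_ptx(&ptx)                           // assumed; PTX/cubin/fatbin
        .compile(format!("{out}/kernels.cubin")) // signature assumed
        .unwrap();
}
```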
(force-pushed from e44a8bc to aefa92a)
This looks neat, but if I'm not mistaken, those functions map directly to single PTX intrinsics; wouldn't it be easier to use inline assembly? Though I haven't actually looked into this, so I'm not sure whether they map to more than one PTX instruction.
I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was using a base C++ program compiled down to PTX to verify, in addition to cross-referencing the PTX ISA spec. However, many of the interfaces were not as clear, and this seemed like a more reliable way to generate the needed code. Perhaps we can replace some of the clear interfaces with ASM instead; happy to iterate on this in the future.
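To make the trade-off concrete, here is what the inline-assembly route looks like for an intrinsic that unambiguously maps to a single PTX instruction: a block-wide barrier (`bar.sync 0`, the instruction behind `__syncthreads()`). This is an illustrative sketch, not code from this PR; grid-wide synchronization, by contrast, is not a single instruction, which is where the generated bridge code earns its keep.

```rust
// Sketch only: inline PTX for a block-level barrier on the nvptx target.
use core::arch::asm;

/// Block-wide barrier, equivalent to CUDA's `__syncthreads()`.
///
/// # Safety
/// Every thread in the block must reach this call, and it must not be
/// placed in divergent control flow.
#[inline(always)]
pub unsafe fn sync_threads() {
    asm!("bar.sync 0;");
}
```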
Hello! We are rebooting this project; sorry your PR didn't get merged! Is this still relevant?
todo
- `cuLaunchCooperativeKernel` in a nice interface. We can add the cooperative multi-device bits later, along with all of the other bits from the cooperative API.
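For the host side, here is a sketch of what a cooperative launch could look like, assuming `launch_cooperative!` mirrors the existing `launch!` macro from `cust` (grid dims, block dims, shared-memory bytes, stream, then kernel arguments) and dispatches through `cuLaunchCooperativeKernel` under the hood. The module-loading call and kernel name are placeholders.

```rust
use cust::prelude::*;

fn run(cubin: &[u8], data: &mut DeviceBuffer<f32>) -> cust::error::CudaResult<()> {
    // Load the fully linked cubin produced by the build step above.
    let module = Module::from_cubin(cubin, &[])?;
    let func = module.get_function("two_phase")?; // placeholder kernel name
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Cooperative launches require the whole grid to be resident at once,
    // so the grid size must respect the device's occupancy limits.
    let (grid_size, block_size) = (64u32, 256u32);
    unsafe {
        launch_cooperative!( // assumed to mirror `launch!`
            func<<<grid_size, block_size, 0, stream>>>(
                data.as_device_ptr(),
                data.len()
            )
        )?;
    }
    stream.synchronize()
}
```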