Add Cooperative Groups API integration #87

Open · wants to merge 1 commit into base: main
Conversation

@thedodd thedodd commented Sep 16, 2022

This works as follows:

  • Users build their CUDA code via CudaBuilder as normal.
  • If they want to use the Cooperative Groups API, then in their build.rs, just after building their PTX, they will:
    • Create a cuda_builder::cg::CooperativeGroups instance,
    • Add any opts needed for building the Cooperative Groups API bridge code (-arch=sm_* and so on),
    • Add their newly built PTX code to be linked with the CG API, which can include multiple PTX, cubin, or fatbin files,
    • Call .compile(..), which will emit a fully linked cubin.
  • In the user's main application code, instead of using launch! to schedule their GPU work, they will now use launch_cooperative!.
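As a sketch, the build.rs flow above might look roughly like this. CudaBuilder is the existing cuda_builder entry point; the cg::CooperativeGroups builder is the API proposed in this PR, and the specific method names (nvcc_arg, link_ptx) and paths here are illustrative assumptions, not a final interface:

```rust
// build.rs — hypothetical sketch of the proposed flow. `CudaBuilder` exists
// today; the `cg::CooperativeGroups` builder is proposed in this PR, and the
// method names used on it here are assumptions.
use cuda_builder::CudaBuilder;

fn main() {
    // 1. Build the CUDA crate to PTX as normal.
    CudaBuilder::new("kernels")
        .copy_to("target/kernels.ptx")
        .build()
        .unwrap();

    // 2. Link the freshly built PTX against the Cooperative Groups
    //    bridge code, producing a fully linked cubin.
    cuda_builder::cg::CooperativeGroups::new()
        .nvcc_arg("-arch=sm_70")          // opts for the CG bridge build
        .link_ptx("target/kernels.ptx")   // may also be cubin or fatbin
        .compile("target/kernels.cubin")  // emits a fully linked cubin
        .unwrap();
}
```

The application then loads the cubin instead of the raw PTX and schedules work with launch_cooperative! rather than launch!.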

todo

  • Update cust to expose cuLaunchCooperativeKernel behind a nice interface. We can add the cooperative multi-device bits later, along with the rest of the cooperative API.
  • Update the cuda_builder crate to expose a nice API for linking PTX code with the Cooperative Groups API bridge code.
  • Remove the Justfile; I was only using it for POC testing.
  • Remove the PTX code from this branch; currently it is here only for reference.
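The cust side of the first todo item could be sketched roughly as below. Only cuLaunchCooperativeKernel itself is the real driver API entry point; the wrapper name and signature are illustrative assumptions about what the "nice interface" might look like:

```rust
// Hypothetical sketch of a cust-side wrapper over the CUDA driver API's
// cuLaunchCooperativeKernel. The wrapper name and signature are
// illustrative, not cust's final interface.
use std::ffi::c_void;

use cust_raw::{cuLaunchCooperativeKernel, CUfunction, CUresult, CUstream};

/// Launches `func` cooperatively: the driver guarantees all blocks are
/// co-resident, so the kernel may perform grid-wide synchronization.
pub unsafe fn launch_cooperative(
    func: CUfunction,
    grid: (u32, u32, u32),
    block: (u32, u32, u32),
    shared_mem_bytes: u32,
    stream: CUstream,
    params: *mut *mut c_void,
) -> CUresult {
    cuLaunchCooperativeKernel(
        func,
        grid.0, grid.1, grid.2,
        block.0, block.1, block.2,
        shared_mem_bytes,
        stream,
        params,
    )
}
```

A launch_cooperative! macro would then wrap this the same way launch! wraps the ordinary kernel launch path.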

@thedodd
Contributor Author

thedodd commented Sep 16, 2022

@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine which brought about this need in the first place.

Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed overall. Let me know what you think, happy to modify things as I go.

Also, for this first pass, I would like to keep focused only on the grid-level components of the cooperative groups API, as well as the basic cooperative launch host-side function. We can add multi-device and the other cooperative group components later.

@thedodd thedodd force-pushed the cooperative-groups-1 branch 5 times, most recently from 4bbc882 to e44a8bc Compare September 18, 2022 03:28
@thedodd thedodd force-pushed the cooperative-groups-1 branch from e44a8bc to aefa92a Compare September 18, 2022 03:36
@thedodd thedodd marked this pull request as ready for review September 18, 2022 03:36
@thedodd thedodd changed the title Add bridging PTX code for cooperative_groups API Add Cooperative Groups API integration Sep 18, 2022
@RDambrosio016
Member

This looks neat, but if I'm not mistaken, those functions map directly to single PTX intrinsics. Wouldn't it be easier to use inline assembly? Though I haven't actually looked into this, so I'm not sure whether they map to more than one PTX instruction.

@thedodd
Contributor Author

thedodd commented Sep 21, 2022

> wouldn't it be easier to use inline assembly?

I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was verifying against a baseline C++ program compiled down to PTX, in addition to cross-referencing the PTX ISA spec. However, many of the interfaces were not as clear, and this seemed like a more reliable way to generate the needed code.

Perhaps we can replace some of the clear interfaces with some ASM instead. Happy to iterate on this in the future.
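For the interfaces that clearly are single instructions, the inline-assembly route might look like this minimal sketch. It only compiles for the nvptx64 target, and lane_id is an illustrative stand-in for the pattern, not one of the Cooperative Groups functions:

```rust
// Minimal sketch: wrapping a single PTX instruction in Rust inline assembly.
// nvptx64-only; `lane_id` illustrates the pattern and is not part of this PR.
use core::arch::asm;

#[inline(always)]
pub unsafe fn lane_id() -> u32 {
    let id: u32;
    // %laneid is a predefined PTX special register (0..31 within a warp).
    asm!(
        "mov.u32 {}, %laneid;",
        out(reg32) id,
        options(nomem, nostack),
    );
    id
}
```

Anything that expands to multiple instructions, or depends on the cooperative launch machinery (like grid-wide sync), is probably still better served by the compiled bridge code.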

@LegNeato
Contributor

Hello! We are rebooting this project. Sorry for your PR not getting merged!

Is this still relevant?
