Skip to content

Commit bcfd504

Browse files
authored
Rollup merge of rust-lang#38559 - japaric:ptx2, r=alexcrichton
PTX support, take 2 - You can generate PTX using `--emit=asm` and the right (custom) target. Which then you can run on a NVIDIA GPU. - You can compile `core` to PTX. [Xargo] also works and it can compile some other crates like `collections` (but I doubt all of those make sense on a GPU) [Xargo]: https://github.com/japaric/xargo - You can create "global" functions, which can be "called" by the host, using the `"ptx-kernel"` ABI, e.g. `extern "ptx-kernel" fn kernel() { .. }`. Every other function is a "device" function and can only be called by the GPU. - Intrinsics like `__syncthreads()` and `blockIdx.x` are available as `"platform-intrinsics"`. These intrinsics are *not* in the `core` crate but any Rust user can create "bindings" to them using an `extern "platform-intrinsics"` block. See example at the end. - Trying to emit PTX with `-g` (debuginfo); you get an LLVM error. But I don't think PTX can contain debuginfo anyway so `-g` should be ignored and a warning should be printed ("`-g` doesn't work with this target" or something). - "Single source" support. You *can't* write a single source file that contains both host and device code. I think that should be possible to implement that outside the compiler using compiler plugins / build scripts. - The equivalent to CUDA `__shared__` which it's used to declare memory that's shared between the threads of the same block. This could be implemented using attributes: `#[shared] static mut SCRATCH_MEMORY: [f32; 64]` but hasn't been implemented yet. - Built-in targets. This PR doesn't add targets to the compiler just yet but one can create custom targets to be able to emit PTX code (see the example at the end). The idea is to have people experiment with this feature before committing to it (built-in targets are "insta-stable") - All functions must be "inlined". IOW, the `.rlib` must always contain the LLVM bitcode of all the functions of the crate it was produced from. Otherwise, you end with "undefined references" in the final PTX code but you won't get *any* linker error because no linker is involved. IOW, you'll hit a runtime error when loading the PTX into the GPU. The workaround is to use `#[inline]` on non-generic functions and to never use `#[inline(never)]` but this may not always be possible because e.g. you could be relying on third party code. - Should `--emit=asm` generate a `.ptx` file instead of a `.s` file? TL;DR Use Xargo to turn a crate into a PTX module (a `.s` file). Then pass that PTX module, as a string, to the GPU and run it. The full code is in [this repository]. This section gives an overview of how to run Rust code on a NVIDIA GPU. [this repository]: https://github.com/japaric/cuda - Create a custom target. Here's the 64-bit NVPTX target (NOTE: the comments are not valid because this is supposed to be a JSON file; remove them before you use this file): ``` js // nvptx64-nvidia-cuda.json { "arch": "nvptx64", // matches LLVM "cpu": "sm_20", // "oldest" compute capability supported by LLVM "data-layout": "e-i64:64-v16:16-v32:32-n16:32:64", "llvm-target": "nvptx64-nvidia-cuda", "max-atomic-width": 0, // LLVM errors with any other value :-( "os": "cuda", // matches LLVM "panic-strategy": "abort", "target-endian": "little", "target-pointer-width": "64", "target-vendor": "nvidia", // matches LLVM -- not required } ``` (There's a 32-bit target specification in the linked repository) - Write a kernel ``` rust extern "platform-intrinsic" { fn nvptx_block_dim_x() -> i32; fn nvptx_block_idx_x() -> i32; fn nvptx_thread_idx_x() -> i32; } /// Copies an array of `n` floating point numbers from `src` to `dst` pub unsafe extern "ptx-kernel" fn memcpy(dst: *mut f32, src: *const f32, n: usize) { let i = (nvptx_block_dim_x() as isize) .wrapping_mul(nvptx_block_idx_x() as isize) .wrapping_add(nvptx_thread_idx_x() as isize); if (i as usize) < n { *dst.offset(i) = *src.offset(i); } } ``` - Emit PTX code ``` $ xargo rustc --target nvptx64-nvidia-cuda --release -- --emit=asm Compiling core v0.0.0 (file://..) (..) Compiling nvptx-builtins v0.1.0 (https://github.com/japaric/nvptx-builtins) Compiling kernel v0.1.0 $ cat target/nvptx64-nvidia-cuda/release/deps/kernel-*.s // // Generated by LLVM NVPTX Back-End // .version 3.2 .target sm_20 .address_size 64 // .globl memcpy .visible .entry memcpy( .param .u64 memcpy_param_0, .param .u64 memcpy_param_1, .param .u64 memcpy_param_2 ) { .reg .pred %p<2>; .reg .s32 %r<5>; .reg .s64 %rd<12>; ld.param.u64 %rd7, [memcpy_param_2]; mov.u32 %r1, %ntid.x; mov.u32 %r2, %ctaid.x; mul.wide.s32 %rd8, %r2, %r1; mov.u32 %r3, %tid.x; cvt.s64.s32 %rd9, %r3; add.s64 %rd10, %rd9, %rd8; setp.ge.u64 %p1, %rd10, %rd7; @%p1 bra LBB0_2; ld.param.u64 %rd3, [memcpy_param_0]; ld.param.u64 %rd4, [memcpy_param_1]; cvta.to.global.u64 %rd5, %rd4; cvta.to.global.u64 %rd6, %rd3; shl.b64 %rd11, %rd10, 2; add.s64 %rd1, %rd6, %rd11; add.s64 %rd2, %rd5, %rd11; ld.global.u32 %r4, [%rd2]; st.global.u32 [%rd1], %r4; LBB0_2: ret; } ``` - Run it on the GPU ``` rust // `kernel.ptx` is the `*.s` file we got in the previous step const KERNEL: &'static str = include_str!("kernel.ptx"); driver::initialize()?; let device = Device(0)?; let ctx = device.create_context()?; let module = ctx.load_module(KERNEL)?; let kernel = module.function("memcpy")?; let h_a: Vec<f32> = /* create some random data */; let h_b = vec![0.; N]; let d_a = driver::allocate(bytes)?; let d_b = driver::allocate(bytes)?; // Copy from host to GPU driver::copy(h_a, d_a)?; // Run `memcpy` on the GPU kernel.launch(d_b, d_a, N)?; // Copy from GPU to host driver::copy(d_b, h_b)?; // Verify assert_eq!(h_a, h_b); // `d_a`, `d_b`, `h_a`, `h_b` are dropped/freed here ``` --- cc @alexcrichton @brson @rkruppe > What has changed since rust-lang#34195? - `core` now can be compiled into PTX. Which makes it very easy to turn `no_std` crates into "kernels" with the help of Xargo. - There's now a way, the `"ptx-kernel"` ABI, to generate "global" functions. The old PR required a manual step (it was hack) to "convert" "device" functions into "global" functions. (Only "global" functions can be launched by the host) - Everything is unstable. There are not "insta stable" built-in targets this time (\*). The users have to use a custom target to experiment with this feature. Also, PTX instrinsics, like `__syncthreads` and `blockIdx.x`, are now implemented as `"platform-intrinsics"` so they no longer live in the `core` crate. (\*) I'd actually like to have in-tree targets because that makes this target more discoverable, removes the need to lug around .json files, etc. However, bundling a target with the compiler immediately puts it in the path towards stabilization. Which gives us just two cycles to find and fix any problem with the target specification. Afterwards, it becomes hard to tweak the specification because that could be a breaking change. A possible solution could be "unstable built-in targets". Basically, to use an unstable target, you'll have to also pass `-Z unstable-options` to the compiler. And unstable targets, being unstable, wouldn't be available on stable. > Why should this be merged? - To let people experiment with the feature out of tree. Having easy access to the feature (in every nightly) allows this. I also think that, as it is, it should be possible to start prototyping type-safe single source support using build scripts, macros and/or plugins. - It's a straightforward implementation. No different that adding support for any other architecture.
2 parents 4e2e01e + aac5ff7 commit bcfd504

File tree

16 files changed

+345
-4
lines changed

16 files changed

+345
-4
lines changed

src/bootstrap/native.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,7 @@ pub fn llvm(build: &Build, target: &str) {
8181
.profile(profile)
8282
.define("LLVM_ENABLE_ASSERTIONS", assertions)
8383
.define("LLVM_TARGETS_TO_BUILD",
84-
"X86;ARM;AArch64;Mips;PowerPC;SystemZ;JSBackend;MSP430;Sparc")
84+
"X86;ARM;AArch64;Mips;PowerPC;SystemZ;JSBackend;MSP430;Sparc;NVPTX")
8585
.define("LLVM_INCLUDE_EXAMPLES", "OFF")
8686
.define("LLVM_INCLUDE_TESTS", "OFF")
8787
.define("LLVM_INCLUDE_DOCS", "OFF")
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
{
2+
"intrinsic_prefix": "_",
3+
"llvm_prefix": "llvm.cuda.",
4+
"intrinsics": [
5+
{
6+
"intrinsic": "syncthreads",
7+
"width": ["0"],
8+
"llvm": "syncthreads",
9+
"ret": "V",
10+
"args": []
11+
}
12+
]
13+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
{
2+
"platform": "nvptx",
3+
"number_info": {
4+
"signed": {}
5+
},
6+
"width_info": {}
7+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
{
2+
"intrinsic_prefix": "_",
3+
"llvm_prefix": "llvm.nvvm.read.ptx.sreg.",
4+
"intrinsics": [
5+
{
6+
"intrinsic": "block_dim_x",
7+
"width": ["0"],
8+
"llvm": "ntid.x",
9+
"ret": "S32",
10+
"args": []
11+
},
12+
{
13+
"intrinsic": "block_dim_y",
14+
"width": ["0"],
15+
"llvm": "ntid.y",
16+
"ret": "S32",
17+
"args": []
18+
},
19+
{
20+
"intrinsic": "block_dim_z",
21+
"width": ["0"],
22+
"llvm": "ntid.z",
23+
"ret": "S32",
24+
"args": []
25+
},
26+
{
27+
"intrinsic": "block_idx_x",
28+
"width": ["0"],
29+
"llvm": "ctaid.x",
30+
"ret": "S32",
31+
"args": []
32+
},
33+
{
34+
"intrinsic": "block_idx_y",
35+
"width": ["0"],
36+
"llvm": "ctaid.y",
37+
"ret": "S32",
38+
"args": []
39+
},
40+
{
41+
"intrinsic": "block_idx_z",
42+
"width": ["0"],
43+
"llvm": "ctaid.z",
44+
"ret": "S32",
45+
"args": []
46+
},
47+
{
48+
"intrinsic": "grid_dim_x",
49+
"width": ["0"],
50+
"llvm": "nctaid.x",
51+
"ret": "S32",
52+
"args": []
53+
},
54+
{
55+
"intrinsic": "grid_dim_y",
56+
"width": ["0"],
57+
"llvm": "nctaid.y",
58+
"ret": "S32",
59+
"args": []
60+
},
61+
{
62+
"intrinsic": "grid_dim_z",
63+
"width": ["0"],
64+
"llvm": "nctaid.z",
65+
"ret": "S32",
66+
"args": []
67+
},
68+
{
69+
"intrinsic": "thread_idx_x",
70+
"width": ["0"],
71+
"llvm": "tid.x",
72+
"ret": "S32",
73+
"args": []
74+
},
75+
{
76+
"intrinsic": "thread_idx_y",
77+
"width": ["0"],
78+
"llvm": "tid.y",
79+
"ret": "S32",
80+
"args": []
81+
},
82+
{
83+
"intrinsic": "thread_idx_z",
84+
"width": ["0"],
85+
"llvm": "tid.z",
86+
"ret": "S32",
87+
"args": []
88+
}
89+
]
90+
}

src/librustc_llvm/build.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -96,7 +96,7 @@ fn main() {
9696

9797
let optional_components =
9898
["x86", "arm", "aarch64", "mips", "powerpc", "pnacl", "systemz", "jsbackend", "msp430",
99-
"sparc"];
99+
"sparc", "nvptx"];
100100

101101
// FIXME: surely we don't need all these components, right? Stuff like mcjit
102102
// or interpreter the compiler itself never uses.

src/librustc_llvm/ffi.rs

+1
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,7 @@ pub enum CallConv {
4242
X86StdcallCallConv = 64,
4343
X86FastcallCallConv = 65,
4444
ArmAapcsCallConv = 67,
45+
PtxKernel = 71,
4546
X86_64_SysV = 78,
4647
X86_64_Win64 = 79,
4748
X86_VectorCall = 80,

src/librustc_llvm/lib.rs

+5
Original file line numberDiff line numberDiff line change
@@ -376,6 +376,11 @@ pub fn initialize_available_targets() {
376376
LLVMInitializeSparcTargetMC,
377377
LLVMInitializeSparcAsmPrinter,
378378
LLVMInitializeSparcAsmParser);
379+
init_target!(llvm_component = "nvptx",
380+
LLVMInitializeNVPTXTargetInfo,
381+
LLVMInitializeNVPTXTarget,
382+
LLVMInitializeNVPTXTargetMC,
383+
LLVMInitializeNVPTXAsmPrinter);
379384
}
380385

381386
pub fn last_error() -> Option<String> {

src/librustc_platform_intrinsics/lib.rs

+3
Original file line numberDiff line numberDiff line change
@@ -95,6 +95,7 @@ static VOID: Type = Type::Void;
9595
mod x86;
9696
mod arm;
9797
mod aarch64;
98+
mod nvptx;
9899

99100
impl Intrinsic {
100101
pub fn find(name: &str) -> Option<Intrinsic> {
@@ -104,6 +105,8 @@ impl Intrinsic {
104105
arm::find(name)
105106
} else if name.starts_with("aarch64_") {
106107
aarch64::find(name)
108+
} else if name.starts_with("nvptx_") {
109+
nvptx::find(name)
107110
} else {
108111
None
109112
}
+92
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,92 @@
1+
// Copyright 2015 The Rust Project Developers. See the COPYRIGHT
2+
// file at the top-level directory of this distribution and at
3+
// http://rust-lang.org/COPYRIGHT.
4+
//
5+
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
6+
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
7+
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
8+
// option. This file may not be copied, modified, or distributed
9+
// except according to those terms.
10+
11+
// DO NOT EDIT: autogenerated by etc/platform-intrinsics/generator.py
12+
// ignore-tidy-linelength
13+
14+
#![allow(unused_imports)]
15+
16+
use {Intrinsic, Type};
17+
use IntrinsicDef::Named;
18+
19+
// The default inlining settings trigger a pathological behaviour in
20+
// LLVM, which causes makes compilation very slow. See #28273.
21+
#[inline(never)]
22+
pub fn find(name: &str) -> Option<Intrinsic> {
23+
if !name.starts_with("nvptx") { return None }
24+
Some(match &name["nvptx".len()..] {
25+
"_syncthreads" => Intrinsic {
26+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
27+
output: &::VOID,
28+
definition: Named("llvm.cuda.syncthreads")
29+
},
30+
"_block_dim_x" => Intrinsic {
31+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
32+
output: &::I32,
33+
definition: Named("llvm.nvvm.read.ptx.sreg.ntid.x")
34+
},
35+
"_block_dim_y" => Intrinsic {
36+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
37+
output: &::I32,
38+
definition: Named("llvm.nvvm.read.ptx.sreg.ntid.y")
39+
},
40+
"_block_dim_z" => Intrinsic {
41+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
42+
output: &::I32,
43+
definition: Named("llvm.nvvm.read.ptx.sreg.ntid.z")
44+
},
45+
"_block_idx_x" => Intrinsic {
46+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
47+
output: &::I32,
48+
definition: Named("llvm.nvvm.read.ptx.sreg.ctaid.x")
49+
},
50+
"_block_idx_y" => Intrinsic {
51+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
52+
output: &::I32,
53+
definition: Named("llvm.nvvm.read.ptx.sreg.ctaid.y")
54+
},
55+
"_block_idx_z" => Intrinsic {
56+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
57+
output: &::I32,
58+
definition: Named("llvm.nvvm.read.ptx.sreg.ctaid.z")
59+
},
60+
"_grid_dim_x" => Intrinsic {
61+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
62+
output: &::I32,
63+
definition: Named("llvm.nvvm.read.ptx.sreg.nctaid.x")
64+
},
65+
"_grid_dim_y" => Intrinsic {
66+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
67+
output: &::I32,
68+
definition: Named("llvm.nvvm.read.ptx.sreg.nctaid.y")
69+
},
70+
"_grid_dim_z" => Intrinsic {
71+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
72+
output: &::I32,
73+
definition: Named("llvm.nvvm.read.ptx.sreg.nctaid.z")
74+
},
75+
"_thread_idx_x" => Intrinsic {
76+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
77+
output: &::I32,
78+
definition: Named("llvm.nvvm.read.ptx.sreg.tid.x")
79+
},
80+
"_thread_idx_y" => Intrinsic {
81+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
82+
output: &::I32,
83+
definition: Named("llvm.nvvm.read.ptx.sreg.tid.y")
84+
},
85+
"_thread_idx_z" => Intrinsic {
86+
inputs: { static INPUTS: [&'static Type; 0] = []; &INPUTS },
87+
output: &::I32,
88+
definition: Named("llvm.nvvm.read.ptx.sreg.tid.z")
89+
},
90+
_ => return None,
91+
})
92+
}

src/librustc_trans/abi.rs

+5
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,8 @@ use cabi_mips64;
2525
use cabi_asmjs;
2626
use cabi_msp430;
2727
use cabi_sparc;
28+
use cabi_nvptx;
29+
use cabi_nvptx64;
2830
use machine::{llalign_of_min, llsize_of, llsize_of_alloc};
2931
use type_::Type;
3032
use type_of;
@@ -353,6 +355,7 @@ impl FnType {
353355
Win64 => llvm::X86_64_Win64,
354356
SysV64 => llvm::X86_64_SysV,
355357
Aapcs => llvm::ArmAapcsCallConv,
358+
PtxKernel => llvm::PtxKernel,
356359

357360
// These API constants ought to be more specific...
358361
Cdecl => llvm::CCallConv,
@@ -608,6 +611,8 @@ impl FnType {
608611
"wasm32" => cabi_asmjs::compute_abi_info(ccx, self),
609612
"msp430" => cabi_msp430::compute_abi_info(ccx, self),
610613
"sparc" => cabi_sparc::compute_abi_info(ccx, self),
614+
"nvptx" => cabi_nvptx::compute_abi_info(ccx, self),
615+
"nvptx64" => cabi_nvptx64::compute_abi_info(ccx, self),
611616
a => ccx.sess().fatal(&format!("unrecognized arch \"{}\" in target specification", a))
612617
}
613618

src/librustc_trans/cabi_nvptx.rs

+53
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
// Copyright 2016 The Rust Project Developers. See the COPYRIGHT
2+
// file at the top-level directory of this distribution and at
3+
// http://rust-lang.org/COPYRIGHT.
4+
//
5+
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
6+
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
7+
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
8+
// option. This file may not be copied, modified, or distributed
9+
// except according to those terms.
10+
11+
// Reference: PTX Writer's Guide to Interoperability
12+
// http://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability
13+
14+
#![allow(non_upper_case_globals)]
15+
16+
use llvm::Struct;
17+
18+
use abi::{self, ArgType, FnType};
19+
use context::CrateContext;
20+
use type_::Type;
21+
22+
fn ty_size(ty: Type) -> usize {
23+
abi::ty_size(ty, 4)
24+
}
25+
26+
fn classify_ret_ty(ccx: &CrateContext, ret: &mut ArgType) {
27+
if ret.ty.kind() == Struct && ty_size(ret.ty) > 32 {
28+
ret.make_indirect(ccx);
29+
} else {
30+
ret.extend_integer_width_to(32);
31+
}
32+
}
33+
34+
fn classify_arg_ty(ccx: &CrateContext, arg: &mut ArgType) {
35+
if arg.ty.kind() == Struct && ty_size(arg.ty) > 32 {
36+
arg.make_indirect(ccx);
37+
} else {
38+
arg.extend_integer_width_to(32);
39+
}
40+
}
41+
42+
pub fn compute_abi_info(ccx: &CrateContext, fty: &mut FnType) {
43+
if !fty.ret.is_ignore() {
44+
classify_ret_ty(ccx, &mut fty.ret);
45+
}
46+
47+
for arg in &mut fty.args {
48+
if arg.is_ignore() {
49+
continue;
50+
}
51+
classify_arg_ty(ccx, arg);
52+
}
53+
}

src/librustc_trans/cabi_nvptx64.rs

+53
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
// Copyright 2016 The Rust Project Developers. See the COPYRIGHT
2+
// file at the top-level directory of this distribution and at
3+
// http://rust-lang.org/COPYRIGHT.
4+
//
5+
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
6+
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
7+
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
8+
// option. This file may not be copied, modified, or distributed
9+
// except according to those terms.
10+
11+
// Reference: PTX Writer's Guide to Interoperability
12+
// http://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability
13+
14+
#![allow(non_upper_case_globals)]
15+
16+
use llvm::Struct;
17+
18+
use abi::{self, ArgType, FnType};
19+
use context::CrateContext;
20+
use type_::Type;
21+
22+
fn ty_size(ty: Type) -> usize {
23+
abi::ty_size(ty, 8)
24+
}
25+
26+
fn classify_ret_ty(ccx: &CrateContext, ret: &mut ArgType) {
27+
if ret.ty.kind() == Struct && ty_size(ret.ty) > 64 {
28+
ret.make_indirect(ccx);
29+
} else {
30+
ret.extend_integer_width_to(64);
31+
}
32+
}
33+
34+
fn classify_arg_ty(ccx: &CrateContext, arg: &mut ArgType) {
35+
if arg.ty.kind() == Struct && ty_size(arg.ty) > 64 {
36+
arg.make_indirect(ccx);
37+
} else {
38+
arg.extend_integer_width_to(64);
39+
}
40+
}
41+
42+
pub fn compute_abi_info(ccx: &CrateContext, fty: &mut FnType) {
43+
if !fty.ret.is_ignore() {
44+
classify_ret_ty(ccx, &mut fty.ret);
45+
}
46+
47+
for arg in &mut fty.args {
48+
if arg.is_ignore() {
49+
continue;
50+
}
51+
classify_arg_ty(ccx, arg);
52+
}
53+
}

0 commit comments

Comments
 (0)