CUDA Utils is a header-only library that simplifies complex CUDA kernel code. It provides intuitive wrapper classes for multi-dimensional tensors in global and shared memory, making CUDA programming more readable and less error-prone, especially in advanced use cases such as high-performance GEMM implementations.

The following examples show how CUDA Utils improves readability and reduces complexity in CUDA kernels. They are based on real-world usage in high-performance GEMM kernels such as QuadMul, OctoMul, and OctoQuadMul.
Without CUDA Utils, a masked write requires hand-computed flattened indices:

```cpp
if (input_mask[batch_idx * num_heads * input_dim1 * input_dim2 +
               head_idx * input_dim1 * input_dim2 +
               mask_i * input_dim2 + mask_j] == 0) {
    output_tensor[batch_idx * num_heads * output_dim1 * output_dim2 +
                  head_idx * output_dim1 * output_dim2 +
                  i * output_dim2 + j] = -INFINITY;
}
```

With CUDA Utils, the same logic reads directly in tensor coordinates:

```cpp
GMemTensor4D<float> output(output_tensor, batch_size, num_heads, output_dim1, output_dim2);
GMemTensor4D<int> mask(input_mask, batch_size, num_heads, input_dim1, input_dim2);

if (mask.get(batch_idx, head_idx, mask_i, mask_j) == 0) {
    output.set(batch_idx, head_idx, i, j, -INFINITY);
}
```
The same applies to asynchronous copies from global to shared memory. Without CUDA Utils:

```cpp
uint8_t *shared_ptr = &shared_A[stage][row * Config::kTileSizeK + col];
uint8_t *global_ptr = &A[batch_idx * M * Config::K +
                         (block_row_start + row) * Config::K +
                         k_offset + col];
__pipeline_memcpy_async(shared_ptr, global_ptr, sizeof(Data128B));
```

With CUDA Utils:

```cpp
__pipeline_memcpy_async(
    smemA.get_ptr(stage, row, col),
    gmemA.get_ptr(batch_idx, block_row_start + row, k_offset + col),
    sizeof(Data128B));
```
- Improved Readability: Complex indexing operations become self-explanatory.
- Reduced Errors: Multi-dimensional index calculations are encapsulated, minimizing indexing errors.
- Performance-Oriented: Designed for high-performance computing with efficient memory access patterns.
- Type-Safe Memory Reinterpretation: `get_reinterpreted<>()` and `set_reinterpreted<>()` methods allow safe and easy reinterpretation of memory.
- Simplified Shared Memory Management: Easier setup and access to shared memory in complex kernels.
License: MIT