A comprehensive CUDA project showcasing matrix multiplication optimization techniques, from a naive baseline to highly optimized kernels with register tiling and vectorization.
- 5 Different Kernel Implementations with progressive optimizations
- Comprehensive Benchmarking against cuBLAS and CUTLASS
- Performance Analysis with detailed metrics
- Modular Architecture for easy experimentation
- Size-Specialized Kernels for optimal performance
CAMM/
├── Kernel/ # CUDA kernel implementations
│ ├── matmul_naive/ # Basic matrix multiplication
│ ├── mat_mul_coalesced/ # Memory coalescing optimization
│ ├── mat_mul_sharedmem/ # Shared memory optimization
│ └── mat_mul_register_tiling/ # Register tiling with specialization
├── Header/
│ └── matmul_kernels.cuh # Kernel function declarations
├── utils/ # Benchmarking and utility functions
│ ├── benchmark_matmul_*.cu # Individual kernel benchmarks
│ ├── main.cu # Main benchmarking suite
│ └── cpu_benchmarking.cpp # CPU reference implementation
├── Benchmarks/ # Performance results (ignored by git)
└── cutlass/ # NVIDIA CUTLASS library integration
- Description: Basic matrix multiplication without optimizations
- Grid/Block: Standard 2D grid configuration
- Use Case: Baseline performance reference
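A minimal sketch of a kernel in this style (the signature is illustrative, not necessarily the repo's exact `matmul_naive` interface): one thread per output element, with no attention paid to how neighboring threads touch memory.

```cuda
// Naive GEMM: C = A * B, one thread per element of C.
// A is M x K, B is K x N, C is M x N, all row-major.
// Mapping threadIdx.x to the row means adjacent threads in a warp read A
// and write C with strides of K and N floats -- uncoalesced traffic.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```

Launched with the standard 2D configuration, e.g. `dim3 block(16, 16); dim3 grid((M + 15) / 16, (N + 15) / 16);`.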
- Description: Optimized memory access patterns for better bandwidth utilization
- Optimization: Ensures coalesced global memory access
- Performance: ~2-3x improvement over naive implementation
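The usual fix, sketched below under the same assumptions as the naive sketch, is simply to swap the thread-to-element mapping so that consecutive threads in a warp compute consecutive columns of the same row; reads of B and writes of C then fall on consecutive addresses and coalesce into wide transactions.

```cuda
// Coalesced GEMM: same algorithm as the naive kernel, but threadIdx.x now
// maps to the column. Threads of a warp share a row, so B[k*N+col] and
// C[row*N+col] hit consecutive addresses and coalesce, while the shared
// A[row*K+k] read is broadcast to the whole warp.
__global__ void matmul_coalesced(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```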
- Description: Utilizes shared memory to reduce global memory accesses
- Optimization: Tile-based computation with shared memory blocking
- Performance: ~4-6x improvement over naive implementation
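A hedged sketch of the tiling idea (TILE = 16 assumed; the repo's tile size may differ): each block cooperatively stages one tile of A and one of B in shared memory, so every global element is loaded once per tile rather than once per output element.

```cuda
#define TILE 16

// Tiled GEMM: the K dimension is walked one TILE-wide slab at a time,
// with each slab staged in shared memory before being consumed.
__global__ void matmul_sharedmem(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Guarded loads handle edges when the sizes don't divide by TILE.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t * TILE + threadIdx.x < K)
                ? A[row * K + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t * TILE + threadIdx.y < K && col < N)
                ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles while still in use
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```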
- Description: Advanced optimization using register-level tiling
- Features:
- Base register tiling implementation
- Size-specialized kernels for 128x128 and 512x512 matrices
- Optimized grid dimensions: gridDim(16,16), blockDim(16,16)
- Performance: ~8-12x improvement over naive implementation
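A minimal sketch of the register-tiling idea under assumed tile sizes (BM = BN = 32, BK = 8, TM = TN = 2; the repo's kernels may choose differently): each thread keeps a TM x TN micro-tile of C in registers, so every value staged in shared memory is reused TM or TN times.

```cuda
constexpr int BM = 32, BN = 32, BK = 8;  // block tile of C, K-slab depth
constexpr int TM = 2,  TN = 2;           // per-thread register tile

__global__ void matmul_register_tiling(const float* A, const float* B,
                                       float* C, int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    int tx = threadIdx.x, ty = threadIdx.y;  // blockDim assumed (16, 16)
    int tid = ty * blockDim.x + tx;          // 0..255
    int rowBase = blockIdx.y * BM + ty * TM;
    int colBase = blockIdx.x * BN + tx * TN;

    float acc[TM][TN] = {};                  // C micro-tile in registers

    for (int t = 0; t < K; t += BK) {
        // 256 threads cooperatively load the 32x8 A-tile and 8x32 B-tile,
        // one element each, with bounds guards for ragged edges.
        int ar = tid / BK, ac = tid % BK;
        int br = tid / BN, bc = tid % BN;
        int aRow = blockIdx.y * BM + ar, aCol = t + ac;
        int bRow = t + br,               bCol = blockIdx.x * BN + bc;
        As[ar][ac] = (aRow < M && aCol < K) ? A[aRow * K + aCol] : 0.0f;
        Bs[br][bc] = (bRow < K && bCol < N) ? B[bRow * N + bCol] : 0.0f;
        __syncthreads();

        for (int k = 0; k < BK; ++k) {
            // Stage one column of As and one row of Bs in registers,
            // then accumulate the TM x TN outer product.
            float aReg[TM], bReg[TN];
            for (int i = 0; i < TM; ++i) aReg[i] = As[ty * TM + i][k];
            for (int j = 0; j < TN; ++j) bReg[j] = Bs[k][tx * TN + j];
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += aReg[i] * bReg[j];
        }
        __syncthreads();
    }

    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            if (rowBase + i < M && colBase + j < N)
                C[(rowBase + i) * N + colBase + j] = acc[i][j];
}
```

With these assumed tile sizes, a 512x512 problem launches as gridDim(16,16), blockDim(16,16), matching the launch dimensions listed above; larger TM/TN values raise arithmetic intensity further at the cost of register pressure.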
- NVIDIA GPU with CUDA Compute Capability 6.0+
- CUDA Toolkit 11.0+
- GCC/G++ compiler
- CMake (optional)
# Naive implementation
nvcc -o naive utils/benchmark_matmul_naive.cu Kernel/matmul_naive/*.cu
# Coalesced memory access
nvcc -o coalesced utils/benchmark_matmul_coalesced.cu Kernel/mat_mul_coalesced/*.cu
# Shared memory optimization
nvcc -o shared utils/benchmark_matmul_sharedmem.cu Kernel/mat_mul_sharedmem/*.cu
# Register tiling
nvcc -o register utils/benchmark_matmul_register_tiling.cu Kernel/mat_mul_register_tiling/*.cu
# Compile main benchmarking application
nvcc -o benchmark utils/main.cu Kernel/*/*.cu -I./Header
# Compare against cuBLAS
nvcc -o cublas_bench utils/benchmark_matmul_cublas.cu -lcublas
# Compare against CUTLASS
nvcc -o cutlass_bench utils/benchmark_matmul_cutlass.cu -I./cutlass/include
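For context, the core of the cuBLAS comparison reduces to a single `cublasSgemm` call. A minimal sketch, assuming row-major M x K and K x N inputs already resident on the device (the actual benchmark file may structure this differently):

```cuda
#include <cublas_v2.h>

// Row-major C = A * B via cuBLAS. cuBLAS assumes column-major storage, so
// we compute C^T = B^T * A^T by swapping the operand order -- the standard
// trick that needs no explicit transposes. dA, dB, dC are device pointers.
void gemm_cublas(const float* dA, const float* dB, float* dC,
                 int M, int N, int K) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K, &alpha,
                dB, N,   // B: K x N row-major == N x K column-major
                dA, K,   // A: M x K row-major == K x M column-major
                &beta,
                dC, N);  // C: M x N row-major == N x M column-major
    cublasDestroy(handle);
}
```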
# Recommended optimization flags (adjust -arch to match your GPU)
nvcc -O3 -arch=sm_75 -use_fast_math -Xptxas -O3 -o <output> <source_files>
# Run an individual kernel benchmark (compiled above)
./naive
# Run the main benchmarking suite
./benchmark
# Compare with cuBLAS
./cublas_bench
# Compare with CUTLASS
./cutlass_bench
| Kernel Type | Relative Performance | Memory Efficiency | Best Use Case |
|---|---|---|---|
| Naive | 1x (baseline) | Low | Learning/Reference |
| Coalesced | 2-3x | Medium | Small matrices |
| Shared Memory | 4-6x | High | Medium matrices |
| Register Tiling | 8-12x | Very High | Large matrices |
- General sizes: Use register tiling implementation
- Very large matrices: Consider cuBLAS integration
- Memory Coalescing: Ensuring aligned memory access patterns
- Shared Memory Utilization: Reducing global memory bandwidth requirements
- Register Tiling: Maximizing register usage and reducing memory latency
- Thread Block Optimization: Optimal thread block dimensions
- Vectorized Operations: Using vector load/store instructions (see the float4 sketch after this list)
- Size Specialization: Kernel variants optimized for specific matrix dimensions
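As a flavor of the vectorization point above, here is a minimal, self-contained sketch (not taken from the repo's kernels): a `float4` load/store moves 16 bytes per instruction, so the same bandwidth needs a quarter of the memory instructions. It assumes 16-byte-aligned pointers and an element count divisible by 4.

```cuda
// Vectorized copy: each thread moves one float4 (four floats) per
// load/store pair. Requires 16-byte alignment and n % 4 == 0.
__global__ void copy_vec4(const float* __restrict__ in,
                          float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // vector index
    if (i < n / 4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];
        reinterpret_cast<float4*>(out)[i] = v;
    }
}
```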
- Create the kernel implementation in `Kernel/<kernel_name>/`
- Add its declaration to `Header/matmul_kernels.cuh` (see the example after this list)
- Create a benchmark in `utils/benchmark_<kernel_name>.cu`
- Update the main benchmarking suite
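For example, the declaration added to `Header/matmul_kernels.cuh` might look like this (the name and signature are placeholders; mirror the conventions already in the header):

```cuda
// Placeholder declaration for a new kernel -- match the parameter
// conventions of the existing declarations in matmul_kernels.cuh.
__global__ void matmul_mykernel(const float* A, const float* B, float* C,
                                int M, int N, int K);
```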
- Benchmark results are automatically saved to `Benchmarks/` (git-ignored)
- Use consistent matrix sizes for fair comparisons
- Run multiple iterations for statistical significance
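If you add your own benchmarks, a minimal sketch of an iteration-averaged timing loop using CUDA events (the repo's utilities may do this differently):

```cuda
#include <cuda_runtime.h>

// Average kernel time over many launches using CUDA events. Warm-up
// launches absorb one-off costs (context setup, cold caches) first.
float time_kernel_ms(void (*launch)(), int warmup = 3, int iters = 50) {
    for (int i = 0; i < warmup; ++i) launch();
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for all timed launches to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;           // mean time per launch in milliseconds
}
```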
- Follow the existing code structure
- Add comprehensive benchmarks for new implementations
- Document optimization techniques used
- Ensure compatibility with existing build system
- NVIDIA CUDA Programming Guide
- CUTLASS: CUDA Templates for Linear Algebra Subroutines
- cuBLAS Library Documentation
- More coming soon
Note: Performance results may vary based on GPU architecture, CUDA version, and system configuration. Benchmark on your target hardware for accurate performance characteristics.