- Overview
- Repository Structure
- Prerequisites
- Installation
- Usage
- CUDA Kernels
- Performance
- License
- Author
## Overview

This project implements integral image computation for grayscale images using CUDA. It leverages GPU parallel processing to achieve high performance.

The parallel computation on the GPU alternates two kernels in the following order (a minimal sketch is given after the list):
- Row-wise Scan
- Transpose
- Row-wise Scan
- Transpose
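This ordering works because the first pass produces horizontal prefix sums, the transpose turns columns into rows so the second pass accumulates them vertically, and the final transpose restores the original orientation. The sketch below illustrates the idea with deliberately naive kernels; the names, signatures, and launch parameters are assumptions made for this README, not the project's actual implementation.

```cuda
// Illustrative sketch of the scan/transpose pipeline. Kernel names and
// launch parameters are hypothetical and kept deliberately simple.
#include <cuda_runtime.h>

// One thread per row: sequential inclusive prefix sum along that row.
__global__ void RowScanNaive(const unsigned int* in, unsigned int* out,
                             int width, int height)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height)
        return;
    unsigned int acc = 0;
    for (int col = 0; col < width; ++col) {
        acc += in[row * width + col];
        out[row * width + col] = acc;
    }
}

// One thread per element: writes the element into its transposed position.
__global__ void TransposeNaive(const unsigned int* in, unsigned int* out,
                               int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[col * height + row] = in[row * width + col];
}

// Host-side ordering: row-wise scan, transpose, row-wise scan, transpose.
void IntegralImage(const unsigned int* d_in, unsigned int* d_buf,
                   unsigned int* d_out, int width, int height)
{
    dim3 block2d(16, 16);
    dim3 grid1((width + 15) / 16, (height + 15) / 16);
    dim3 grid2((height + 15) / 16, (width + 15) / 16);

    RowScanNaive<<<(height + 255) / 256, 256>>>(d_in, d_buf, width, height);
    TransposeNaive<<<grid1, block2d>>>(d_buf, d_out, width, height);
    RowScanNaive<<<(width + 255) / 256, 256>>>(d_out, d_buf, height, width);
    TransposeNaive<<<grid2, block2d>>>(d_buf, d_out, height, width);
}
```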
For the scan step, two kernel versions are provided: a naive implementation and an optimized one.
An integral image, also known as a summed-area table, is a representation that allows for fast computation of the sum of values in a rectangular subset of an image.
For example, given an input image, each entry of its integral image holds the sum of all pixels above and to the left of that position, inclusive, as shown in the worked example below.
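A small worked example (the values are chosen here purely for illustration):

```
Input image        Integral image
1 2 3               1  3  6
4 5 6               5 12 21
7 8 9              12 27 45
```

Any rectangular sum can then be recovered with four lookups by inclusion-exclusion. A minimal helper, written here only to illustrate the lookup pattern (it is not part of the project's sources), might look like:

```cuda
// Sum of the rectangle spanning (x0, y0) to (x1, y1), inclusive, using four
// lookups into a row-major integral image of the given width.
// Illustrative sketch only; not part of the project's code.
unsigned int RectSum(const unsigned int* integral, int width,
                     int x0, int y0, int x1, int y1)
{
    unsigned int total = integral[y1 * width + x1];
    if (x0 > 0)
        total -= integral[y1 * width + (x0 - 1)];
    if (y0 > 0)
        total -= integral[(y0 - 1) * width + x1];
    if (x0 > 0 && y0 > 0)
        total += integral[(y0 - 1) * width + (x0 - 1)];
    return total;
}
```

With the matrices above, the bottom-right 2x2 block sums to 45 - 12 - 6 + 1 = 28, which matches 5 + 6 + 8 + 9.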
## Repository Structure

```
.
├── python/
│   ├── pycuda_test.py   # Python script using PyCUDA for invoking CUDA kernels and managing the workflow
│   └── numba_test.py    # Python script using Numba for invoking CUDA kernels and managing the workflow
├── c++/
│   ├── main.cu          # CUDA source file containing benchmarking logic
│   └── kernel.cu        # CUDA kernel definitions
└── integralimage        # Script for compiling the project and running benchmarks
```
## Prerequisites

- CUDA-capable NVIDIA GPU
- CUDA Toolkit
- C++ compiler
- CMake
- Python 3.x (for Python interface)
- Python Libraries:
- numpy
- pycuda
- numba
## Installation

- Clone the repository:

  ```
  git clone https://github.com/AlessioBugetti/integral-image-processing.git
  cd integral-image-processing
  ```

- Install the Python dependencies:

  ```
  pip install -r python/requirements.txt
  ```

- Ensure the CUDA environment is set up:
  - Install the NVIDIA drivers.
  - Install the CUDA Toolkit.
  - Verify the installation with:

    ```
    nvcc --version
    ```
## Usage

Build the project:

```
./integralimage build
```

Run the benchmarks:

```
./integralimage run
```

To use the Python interface, run either:

```
python pycuda_test.py
```

or

```
python numba_test.py
```
## CUDA Kernels

- `SumRows`: naive computation of the row-wise scan (prefix sum) of a matrix
- `SinglePassRowWiseScan`: optimized computation of the row-wise scan (prefix sum) of a matrix
- `Transpose`: transposes a matrix using block-level tiling with shared memory (a generic sketch of this pattern follows)
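The shared-memory tiled transpose is a well-known CUDA pattern. The sketch below illustrates it under assumed tile dimensions and launch configuration (`dim3 block(TILE_DIM, BLOCK_ROWS)`, `dim3 grid((width + TILE_DIM - 1) / TILE_DIM, (height + TILE_DIM - 1) / TILE_DIM)`); the project's actual `Transpose` kernel may differ in these details.

```cuda
// Generic tiled transpose sketch: each block stages a TILE_DIM x TILE_DIM
// tile in shared memory so that both the global loads and the global stores
// are coalesced. Tile sizes are assumptions, not the project's values.
#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void TransposeTiled(const unsigned int* in, unsigned int* out,
                               int width, int height)
{
    // Extra column avoids shared-memory bank conflicts on transposed reads.
    __shared__ unsigned int tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced loads of the tile from the input matrix.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices and write the transposed tile with coalesced stores.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

Padding the shared-memory tile to `TILE_DIM + 1` columns avoids bank conflicts when the tile is read back column-wise during the store phase.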
## Performance

The implementation includes benchmarking capabilities that measure:

- Sequential CPU execution time
- CUDA execution time of the naive integral image implementation
- CUDA execution time of the optimized integral image implementation
- Speedup of both CUDA implementations relative to the CPU implementation

Measurements are averaged over multiple iterations to ensure reliable results.
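GPU timings of this kind are commonly collected with CUDA events. The helper below is a sketch of that approach, with the function name and structure assumed for this README rather than taken from the project's benchmarking code:

```cuda
#include <cuda_runtime.h>

// Average GPU time (in milliseconds) of launch() over `iterations` runs,
// measured with CUDA events. Illustrative sketch only.
template <typename Launch>
float AverageGpuTimeMs(Launch launch, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        launch();                     // enqueue the kernel(s) under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait until all enqueued work is done

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsedMs / iterations;    // average time per iteration
}
```

A CPU baseline can be timed analogously with `std::chrono`, and the speedup is the ratio of the two averaged times.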
## License

This project is licensed under the GPL-3.0-only license. See the LICENSE file for details.
## Author

Alessio Bugetti - alessiobugetti98@gmail.com