-
Notifications
You must be signed in to change notification settings - Fork 356
Optimization using cuFFT Callbacks
We'll consider 3 of the kernels within the inner loop of the PyCBC analysis code.
-
correlate
multiples a complex template and data segment together to create the 'correlation' vector
-
ifft
Using cuFFT, we take the correlation vector and inverse FFT it to produce an SNR time series. This is a C2C out of place transform (2^20 points unless otherwise noted)
-
threshold
Peaks in the SNR timeseries are located by windows maximums within a window size.
- No callback used
- Fuse the correlate and iFFT together
- (2) plus simply return zero for the second half of the correlation vector.
Future Variations that should be safe
- Store template as real phase, calculate both phases within input callback
- Store SNR timeseries as fp16
- Move portion of thresholding code into output callback.
- Use exact non-zero input boundaries (How do we pass in the boundaries?)
- Use exact non-corrupted output boundaries (How do we pass in the boundaries?)
Future Variations that need accuracy checking
- Load template and/or data as fp16
Callback wishlist
- Callbacks for input/output of intermediate steps in the FFT
- CUDA 6.5
- Vector sizes are 2^20 points complex single precision unless otherwise noted
- FFT batching is not used.
- in-situ analysis on Gaussian data
- host performance and GPU dead time are ignored.
- Results obtained by running under COMPUTE_PROFILE=1 and parsing the resultant log files.
- No Callbacks
Operation | Kernel | Average gputime (µs) |
---|---|---|
FFT | _ZN12spRadix0256B10k | 208.00 x 2 |
FFT | _ZN12spRadix0016B10k | 180.25 |
Correlate | correlate | 175.18 |
Threshold | threshold_and_cluste | 75.41 |
Threshold | threshold_and_cluste | 5.14 |
- Fuse Correlate FFT
Operation | Kernel | Average gputime (µs) |
---|---|---|
FFT | _ZN12spRadix0256C18k (callback) | 319.51 |
FFT | _ZN12spRadix0256B10k | 198.09 |
FFT | _ZN12spRadix0016B10k | 181.27 |
Threshold | threshold_and_cluste | 75.48 |
Threshold | threshold_and_cluste | 5.23 |
__device__ cufftComplex in_call(void* input, size_t offset,
void* caller_info, void* shared) {
cufftComplex r;
cufftComplex s = ((cufftComplex*) input)[offset];
cufftComplex h = ((cufftComplex*) %s)[offset];
r.x = h.x * s.x + h.y * s.y;
r.y = h.x * s.y - h.y * s.x;
return r;
}
- Fuse Correlate FFT, and return zeros for second half of vector
Operation | Kernel | Average gputime (µs) |
---|---|---|
FFT | _ZN12spRadix0256C18k (callback) | 254.62 |
FFT | _ZN12spRadix0256B10k | 198.03 |
FFT | _ZN12spRadix0016B10k | 181.27 |
Threshold | threshold_and_cluste | 75.58 |
Threshold | threshold_and_cluste | 5.14 |
__device__ cufftComplex in_call(void* input, size_t offset,
void* caller_info, void* shared) {
if (offset > %s)
return (cufftComplex){0, 0};
else{
cufftComplex r;
cufftComplex s = ((cufftComplex*) input)[offset];
cufftComplex h = ((cufftComplex*) %s)[offset];
r.x = h.x * s.x + h.y * s.y;
r.y = h.x * s.y - h.y * s.x;
return r;
}
}
Adding a callback for the store operation has very strange results at the moment. Using the (3) load callback and adding a simple store callback that just writes out the results with no changes made, results in a very large performance regression in that last FFT kernel. One might expect there to be some overhead using the store callback, and so would be slightly slower than the (3) case alone.
Operation | Kernel | Average gputime (µs) |
---|---|---|
FFT | _ZN12spRadix0256C18k (load callback) | 254.61 |
FFT | _ZN12spRadix0256B10k | 198.12 |
FFT | _ZN12spRadix0016A18k (store callback) | 558.46 |
Threshold | threshold_and_cluste | 75.65 |
Threshold | threshold_and_cluste | 5.14 |
__device__ void out_call(void *out, size_t offset, cufftComplex element,
void *caller_info, void *shared){
((cufftComplex*) out)[offset] = element;
}
Even if the store callback is a NOOP so that nothing is written to the output memory, it still seems to be anomalously slow. Even with some overhead, I would have expected this to be faster than the non-callback case as you save on the memory writes.
Operation | Kernel | Average gputime (µs) |
---|---|---|
FFT | _ZN12spRadix0256C18k (load callback) | 254.97 |
FFT | _ZN12spRadix0256B10k | 197.76 |
FFT | _ZN12spRadix0016A18k (store callback) | 478.47 |
Threshold | threshold_and_cluste | 76.08 |
Threshold | threshold_and_cluste | 4.64 |
__device__ void out_call(void *out, size_t offset, cufftComplex element,
void *caller_info, void *shared){
}