Run the runner.sh script (./runner.sh) in an environment with pandas and numpy, e.g. in an nvidia-docker container:
nvidia-docker run -it --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:21.06-py3
The runner.sh script takes four arguments, which are passed on to our Python runner (an illustrative sketch follows the list):
- The device ID of the GPU to use for the runs
- Timeout threshold in seconds for kernels (a kernel is terminated if it exceeds this)
- The maximum matrix size. Decrease this if you run out of memory on your GPU.
- Any additional nvcc flags as a single string, except for -o (the output path is already derived from the device ID).
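As a rough illustration of how these four arguments might reach the Python runner, here is a minimal sketch; the argument names, order, and argparse interface below are assumptions for illustration, not the runner's actual code.

```python
# Hypothetical sketch of how the four runner.sh arguments could be parsed
# by the Python runner; names and order are assumptions, not the real interface.
import argparse

parser = argparse.ArgumentParser(description="GPU kernel benchmark runner (sketch)")
parser.add_argument("device_id", type=int,
                    help="GPU device ID to use for the runs")
parser.add_argument("timeout", type=float,
                    help="per-kernel timeout in seconds; the kernel is terminated if exceeded")
parser.add_argument("max_matrix_size", type=int,
                    help="largest square matrix dimension; decrease if the GPU runs out of memory")
parser.add_argument("nvcc_flags", type=str,
                    help="extra nvcc flags as one string (no -o; output path derived from device ID)")
args = parser.parse_args()
```

Under that assumption, an invocation such as ./runner.sh 0 10 2024 "-O3" would select GPU 0, a 10 s timeout, a maximum matrix size of 2024, and the -O3 flag; the concrete values here are only examples.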
The results archive contains completed runs for a GPU architecture and can be used to compare the performance of different models.
The results archive currently contains our T4 results, compiled with nvcc's default compilation target of sm_52. The results were created by splitting the kernels into six chunks and running each chunk independently on 3 GPUs (18 T4 GPUs in total). These per-GPU results can then be used to analyze variance between GPUs and to average the means across devices.
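As a sketch of that kind of analysis (the file layout and the kernel and runtime_ms column names are assumptions, not the archive's actual schema):

```python
# Minimal sketch: aggregate per-GPU results to inspect cross-GPU variance.
# Glob pattern and column names (kernel, runtime_ms) are assumed for illustration.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("results/*.csv")]
results = pd.concat(frames, ignore_index=True)

# Mean and standard deviation of each kernel's runtime across all GPUs.
summary = (results.groupby("kernel")["runtime_ms"]
                  .agg(["mean", "std", "count"])
                  .sort_values("std", ascending=False))
print(summary.head())
```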
Square matrices with dimensions: 240, 496, 784, 1016, 1232, 1680, 2024
{8,8},{16,16},{24,24},{32,32},{1,64},{1,128},{1,192},{1,256},{1,320},{1,384},{1,448},{1,512},{1,576},{1,640},{1,704},{1,768},{1,832},{1,896},{1,960},{1,1024}
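Purely for illustration, the sweep could be enumerated as below; pairing every matrix size with every block configuration is an assumption about how the runner iterates.

```python
# Illustrative enumeration of the benchmark sweep; taking the full cross product
# of matrix sizes and block configurations is an assumption.
from itertools import product

matrix_sizes = [240, 496, 784, 1016, 1232, 1680, 2024]
block_dims = [(8, 8), (16, 16), (24, 24), (32, 32)] + [(1, n) for n in range(64, 1025, 64)]

for size, (bx, by) in product(matrix_sizes, block_dims):
    print(f"matrix {size}x{size}, block {bx}x{by}")
```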
N/A results are dropped at the end of the measurement process.
Some runs include multiple samples of each kernel; e.g., the T4 dataset contains three independent runs, resulting in 2 933 905 non-N/A data points.
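A minimal sketch of that filtering step with pandas, assuming a hypothetical combined CSV and a runtime_ms column (neither is part of the actual archive layout):

```python
# Sketch: drop N/A measurements and count the remaining data points.
# The file name and the runtime_ms column are hypothetical.
import pandas as pd

results = pd.read_csv("results/t4_combined.csv")
clean = results.dropna(subset=["runtime_ms"])
print(f"{len(clean)} non-N/A data points out of {len(results)}")
```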