- Naive software rasterization
  - Tiled
  - Triangles
- SPIR-V compilation on CPU
  - Packetization (multiple instances compiled into one kernel)
  - SIMD1/SIMD4/SIMD64
  - Control flow vectorization
- Vertex shaders
  - Minimal viable
- Pixel shaders
  - Minimal viable
  - Packetization (multiple instances compiled into one kernel)
- 2D Texture sampling
  - Only nearest neighbor
- Read/write 1D buffers
- Software rasterization
  - Lines
  - AVX2
  - Experiment with int16 fixed point
  - Multi-threading
- SPIR-V compilation on CPU
  - Optimizations
  - Tune for AVX2
  - Data transformation AoS -> AoSoA
- Vertex shaders
  - Clipping distance
- Pixel shaders
  - Discards
  - Write depth
  - Optimizations
- 2D Texture sampling
  - Implicit mip selection with derivatives
- Read/write 2D Textures
As a starting point, I came up with the following algorithm. It is not very efficient, but it is easy to implement.
- Calculate the dominator tree.
- Split the edges into forward and backward edges.
- Sort the CFG topologically, using the partial order defined by the forward edges. Prioritize children within the same strongly connected component.
- Allocate a mask register (uint64_t) per basic block.
- Each basic block clears its mask register on exit.
- Jumping to a basic block sets the bits for the active lanes in that block's mask register.
- For each basic block, create a dispatch node that jumps to the basic block if any mask bit is set, or to the next dispatch node otherwise. This creates a 'dispatch chain' that skips basic blocks whose mask bits are all clear.
- Conditional jumps are replaced with jumps into the dispatch chain.
- Back edges are unconditional jumps to the loop header.
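The mask and dispatch-chain steps can be tried out in a small lockstep simulation. This is only a sketch under assumptions, not the project's code generator: the toy CFG, the block names, the `run_linearized` helper and the hand-written block order are all hypothetical, and Python integers stand in for the `uint64_t` mask registers. The dominator tree and sorting steps are skipped; the order is simply written out by hand. A jump to an earlier position in the order plays the role of the back edge to the loop header.

```python
# Hypothetical toy CFG (not taken from the project). Each lane runs:
#   acc = 0; while n > 0: acc += n; n -= 1
#   acc *= 2
# A block updates one lane's state and returns the name of that lane's
# next block, or None when the lane is finished.

def entry(s):
    s["acc"] = 0
    return "header"

def header(s):
    return "body" if s["n"] > 0 else "exit"

def body(s):
    s["acc"] += s["n"]
    s["n"] -= 1
    return "header"  # back edge to the loop header

def exit_block(s):
    s["acc"] *= 2
    return None

BLOCKS = {"entry": entry, "header": header, "body": body, "exit": exit_block}
# Assumed linear order: topological over forward edges, loop blocks together.
ORDER = ["entry", "header", "body", "exit"]


def run_linearized(blocks, order, lanes):
    """Run all lanes in lockstep over the linearized block order."""
    pos = {name: i for i, name in enumerate(order)}
    mask = {name: 0 for name in order}      # one mask register per block
    mask[order[0]] = (1 << len(lanes)) - 1  # every lane starts in the entry
    trace = []                              # (block, active mask) per execution
    i = 0
    while i < len(order):                   # walk the dispatch chain
        name = order[i]
        active = mask[name]
        if active == 0:                     # dispatch node: skip this block
            i += 1
            continue
        mask[name] = 0                      # the block clears its mask on exit
        trace.append((name, active))
        nxt = i + 1                         # default: next dispatch node
        for lane, state in enumerate(lanes):
            if (active >> lane) & 1:
                target = blocks[name](state)
                if target is not None:
                    mask[target] |= 1 << lane    # a jump sets the lane's bit
                    nxt = min(nxt, pos[target])  # back edge: resume at header
        i = nxt
    return trace


lanes = [{"n": n} for n in (0, 3, 1, 2)]
trace = run_linearized(BLOCKS, ORDER, lanes)
print([s["acc"] for s in lanes])  # [0, 12, 2, 6]
print(trace[-1])                  # ('exit', 15)
```

Because the loop blocks come before the exit block in the order, lanes that leave the loop early wait in the exit block's mask, and the exit block runs once with all four lanes active.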
HLSL Source:
[[vk::binding(0, 0)]] RWBuffer <uint> g_buf_0;
[[vk::binding(1, 0)]] RWBuffer <uint> g_buf_1;
uint get_num(uint t) {
  if (t < 888) {
    while (true) {
      t = (t ^ (t << 1)) + 1;
      if ((t & 7) == 7)
        continue;
      t = t * (t - 1) + 1;
      if (t > 200)
        return t;
      if ((t & 8) != 0)
        break;
    }
    t = (t << 2) + 1;
    return t;
  } else {
    return t + t * t * t;
  }
}
[numthreads(4, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
  g_buf_0[tid.x] = get_num(g_buf_1[tid.x]);
}
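When checking the output of a vectorized kernel, it can help to have a scalar reference for `get_num`. The following is a sketch of such a reference in Python, assuming only that HLSL `uint` arithmetic wraps at 32 bits; the `MASK32` constant is introduced here for illustration and is not part of the project.

```python
MASK32 = 0xFFFFFFFF  # assumption: HLSL uint is 32 bits wide and wraps


def get_num(t):
    """Scalar port of the HLSL get_num with 32-bit wrap-around."""
    if t < 888:
        while True:
            t = ((t ^ (t << 1)) + 1) & MASK32
            if (t & 7) == 7:
                continue
            t = (t * (t - 1) + 1) & MASK32
            if t > 200:
                return t
            if (t & 8) != 0:
                break
        t = ((t << 2) + 1) & MASK32
        return t
    else:
        return (t + t * t * t) & MASK32


# Four inputs, matching the four threads of numthreads(4, 1, 1).
print([get_num(t) for t in (0, 1, 2, 3)])  # [53, 53, 365, 125]
```

Comparing these values lane by lane against `g_buf_0` after a SIMD4 run is one way to spot masking bugs, since the four lanes leave the loop through different paths.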
Initial SPIRV CFG:
Linearized CFG:
LLVM IR for SIMD4 mode:
cd vulkenstein
python3 tests/run_all_tests.py
- Used for JIT code generation
- Used for headers, spirv disassembly
- Used for microbenchmarking on Linux
cd 3rdparty/libpfc
make
su
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
insmod pfc.ko
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_EXPORT_COMPILE_COMMANDS=1 .. && make
- Aart J. C. Bik. The Software Vectorization Handbook. Intel Press, 2004.
- Automatic SIMD Vectorization of SSA-based Control Flow Graphs
- Solving the structured control flow problem once and for all