bugfix: fix cu118 cub usage (#410)

Related issue: sgl-project/sglang#771

This PR fixes the usage of the `FlagHeads` CUB API in the sampling kernels. As [documented](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockDiscontinuity.html), the default `FlagHeads` API always flags the first element, which is not the expected behavior when the first element is not `true`:

> For thread0, item input[0] is always flagged.

This PR sets the `tile_predecessor_item` argument to 0, so that `input[0]` is compared against that value instead of being flagged unconditionally. CUDA 12+ does not have this issue because there we use the new `SubtractLeft` API instead of `FlagHeads`.
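To illustrate the difference, here is a minimal host-side C++ sketch of the head-flag semantics described above. It is not CUB code: the function names `flag_heads_default` and `flag_heads_with_predecessor` are hypothetical, and the `!=` comparison is an assumption standing in for `BoolDiffOp`.

```cpp
#include <cstddef>
#include <vector>

// Model of the default behavior: the first item is always flagged,
// every other item is flagged when it differs from its left neighbor.
// (Assumption: the difference operator behaves like `a != b`.)
std::vector<bool> flag_heads_default(const std::vector<bool>& input) {
  std::vector<bool> flags(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    flags[i] = (i == 0) ? true : (input[i - 1] != input[i]);
  }
  return flags;
}

// Model of the behavior with a tile predecessor: the first item is
// compared against `tile_predecessor` like any other item.
std::vector<bool> flag_heads_with_predecessor(const std::vector<bool>& input,
                                              bool tile_predecessor) {
  std::vector<bool> flags(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    const bool left = (i == 0) ? tile_predecessor : input[i - 1];
    flags[i] = (left != input[i]);
  }
  return flags;
}
```

Under these assumptions, the input `{false, false, true, true}` gets flags at indices 0 and 2 with the default model, but only at index 2 when the predecessor is `false` (0). The latter marks just the first `true` element, which is the behavior this fix is after.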
flashinfer-ai · Jul 30, 2024 · 58d3593
1 parent aaa929a
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/include/flashinfer/sampling.cuh b/include/flashinfer/sampling.cuh
@@ -118,7 +118,7 @@ __device__ __forceinline__ void DeviceSamplingFromProb(
         .SubtractLeft<VEC_SIZE>(greater_than_u, greater_than_u_diff, BoolDiffOp());
 #else
     BlockAdjacentDifference<bool, BLOCK_THREADS>(temp_storage->block_prim.adj_diff)
-        .FlagHeads<VEC_SIZE>(greater_than_u_diff, greater_than_u, BoolDiffOp());
+        .FlagHeads<VEC_SIZE>(greater_than_u_diff, greater_than_u, BoolDiffOp(), 0);
 #endif
     __syncthreads();