Releases · flashinfer-ai/flashinfer

05 Aug 00:37

yzh119

v0.2.9

9158fef

v0.2.9 Latest

What's Changed

Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
Made AR output optional + esthetic changes by @nvmbreughe in #1265
init add gemm fp8 using cudnn backend by @ttyio in #1264
Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
CI: install nvidia-nvshmem-cu12 by @EmilienM in #1262
feat: enable trtllm-gen mla MTP by @yyihuang in #1258
Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
add trtllm-gen context attention by @IwakuraRein in #1239
feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
Add missing import in comm/__init__.py by @joker-eph in #1275
hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
Unify groupwise fp8 GEMM test by @cyx-6 in #1281
fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
Add shuffle matrix flag by @aleozlx in #1272
Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
patch error handling by @aleozlx in #1293
Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
add mm_fp4 use cudnn backend by @ttyio in #1288
fix: minor errors in cubin loader by @yyihuang in #1295
perfix: use lightweight API to query device property by @azhurkevich in #1298
refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
[Feature] SM level profiler by @Edenzzzz in #1305
Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
Update cutlass fp4 moe kernels by @wenscarl in #1294
Fix the bug of the kernel-selection heuristic in trtllm-gen by @PerkzZheng in #1307
test qkvo quantization not equal to 1. by @weireweire in #1314
[fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. by @happierpig in #1290
Addition of flashinfer_benchmark.py for benchmarking routines by @bkryu in #1323
minor: update devcontainer by @yyihuang in #1329
Fix redundant argument in TrtllmGenDecodeModule by @IwakuraRein in #1326
Optimizations for TRTLLM MNNVL Allreduce by @timlee0212 in #1321
add torch float4_e2m1fn_x2 check for cudnn fp4 backend by @ttyio in #1333
only add cudnn dependency for x86 platform by @ttyio in #1332
Make Fp8 MoE routing_bias optional by @aleozlx in #1319
feat: Add weight layout option for trtllm-gen fused moe by @aleozlx in #1297
[Fix] remove torch 2.8 requirement for FP4 GEMM by @elfiegg in #1334
Bug fix: fix duplicate launch in POD by @Edenzzzz in #1267
Add blockwise-scaled FP8 GEMM via TRTLLM-Gen. by @sergachev in #1320
feat: support output nvfp4 in trtllm-gen function call. by @weireweire in #1318
Fix bench deepgemm setting by @cyx-6 in #1344
fix: fix trtllm-gen mla error on new interface by @yyihuang in #1348
[Bugfix] Change max_size for LRU by @elfiegg in #1349
Support loading autotuned results from json for cutlass fp4 moe backends by @kaixih in #1310
Refactor scripts in benchmarks to use flashinfer.testing.bench_gpu_time by @bkryu in #1337
bugfix: Change default index in routingTopKExperts by @amirkl94 in #1347
Support passing kv_data_type to MultiLevelCascadeAttentionWrapper.plan() by @sarckk in #1350
Add trtllm-gen prefill test. Fix related wrapper issue. by @weireweire in #1346
feat: Support logits_soft_cap for Persistent attn; fix kv split limit by @Edenzzzz in #1324
chore: remove cpp benchmarks, tests, cmake path, as they are deprecated by @hypdeb in #1345
minor: add trtllm_gen_mla benchmark by @yyihuang in #1316
cleanup: retire aot-build-utils by @yzh119 in #1354
minor: more informative error message for buffer overflow by @Edenzzzz in #1357
gen_trtllm_comm_module: fix device capability detection by @dtrifiro in #1356
Refactor Fused Moe Module by @wenscarl in #1309
Add native cudnn_decode for improved cudnn decode performance by @Anerudhan in #1283
Update CI docker container to use latest cudnn by @yzh119 in #1362
feature: add fp4 mm using trtllm backend by @ttyio in #1355
support trtllm-gen prefill fp4 output by @weireweire in #1360
Allow cudnn prefill kernels to be called natively by @Anerudhan in #1317
bugfix: fix ci for aot-compile by @yzh119 in #1364
feat: auto deduce use_oneshot from token_num in all-reduce by @yyihuang in #1365
add cutlass backend for mm_fp4 by @ttyio in #1296
Support scale factor start index for fp4 mha prefill/decode by @weireweire in #1363
test: add cuda graph to comm test by @yyihuang in #1366
ci: add requests to ci docker container by @yzh119 in #1370
Artifact downloading and single sourced artifact path by @cyx-6 in #1369
[fix] remove (view) transpose to keep consistent with majorness MN requirement. by @elfiegg in #1358
hotfix: update mxfp4 groupwise-scaled gemm unittests by @yzh119 in #1359
bugfix: fixed cutlass fused moe usage of FP4QuantizationSFLayout::SWIZZLED by @yzh119 in #1371
ci: add blackwell unittest scripts by @yzh119 in #1372
Update documentation index by @cyx-6 in #1374
bugfix: do cudnn related error check only when cudnn backend is enabled. by @ttyio in #1377
bugfix: Add guard for fp4/fp8 related include headers by @yzh119 in #1376
refactor: download trtllm gemm metadata from server by @ttyio in #1378
Fix sphinx error by @cyx-6 in #1380
release: bump version to v0.2.9 by @yzh119 in #1381

New Contributors

...

Contributors

directhex, ttyio, and 28 other contributors

27 Jul 05:18

yzh119

v0.2.9rc2

cf39366

v0.2.9rc2 Pre-release

What's Changed

Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
Made AR output optional + esthetic changes by @nvmbreughe in #1265
init add gemm fp8 using cudnn backend by @ttyio in #1264
Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
CI: install nvidia-nvshmem-cu12 by @EmilienM in #1262
feat: enable trtllm-gen mla MTP by @yyihuang in #1258
Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
add trtllm-gen context attention by @IwakuraRein in #1239
feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
Add missing import in comm/__init__.py by @joker-eph in #1275
hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
Unify groupwise fp8 GEMM test by @cyx-6 in #1281
fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
Add shuffle matrix flag by @aleozlx in #1272
Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
patch error handling by @aleozlx in #1293
Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
add mm_fp4 use cudnn backend by @ttyio in #1288
fix: minor errors in cubin loader by @yyihuang in #1295
perfix: use lightweight API to query device property by @azhurkevich in #1298
refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
[Feature] SM level profiler by @Edenzzzz in #1305
Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
Update cutlass fp4 moe kernels by @wenscarl in #1294
Fix the bug of the kernel-selection heuristic in trtllm-gen by @PerkzZheng in #1307
test qkvo quantization not equal to 1. by @weireweire in #1314
[fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. by @happierpig in #1290
Addition of flashinfer_benchmark.py for benchmarking routines by @bkryu in #1323
minor: update devcontainer by @yyihuang in #1329
Fix redundant argument in TrtllmGenDecodeModule by @IwakuraRein in #1326
Optimizations for TRTLLM MNNVL Allreduce by @timlee0212 in #1321
add torch float4_e2m1fn_x2 check for cudnn fp4 backend by @ttyio in #1333
only add cudnn dependency for x86 platform by @ttyio in #1332
Make Fp8 MoE routing_bias optional by @aleozlx in #1319
feat: Add weight layout option for trtllm-gen fused moe by @aleozlx in #1297
[Fix] remove torch 2.8 requirement for FP4 GEMM by @elfiegg in #1334
Bug fix: fix duplicate launch in POD by @Edenzzzz in #1267

New Contributors

@vlev02 made their first contribution in #1254
@ttyio made their first contribution in #1264
@azhurkevich made their first contribution in #1214
@weireweire made their first contribution in #1242
@IwakuraRein made their first contribution in #1239
@nvpohanh made their first contribution in #1286
@directhex made their first contribution in #1279
@ilmarkov made their first contribution in #1284
@elfiegg made their first contribution in #1303
@PerkzZheng made their first contribution in #1307
@bkryu made their first contribution in #1323
@timlee0212 made their first contribution in #1321

Full Changelog: v0.2.8...v0.2.9rc2

Contributors

directhex, ttyio, and 21 other contributors

23 Jul 08:01

yzh119

v0.2.9rc1

d3d76b7

v0.2.9rc1 Pre-release

What's Changed

Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
Made AR output optional + esthetic changes by @nvmbreughe in #1265
init add gemm fp8 using cudnn backend by @ttyio in #1264
Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
CI: install nvidia-nvshmem-cu12 by @EmilienM in #1262
feat: enable trtllm-gen mla MTP by @yyihuang in #1258
Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
add trtllm-gen context attention by @IwakuraRein in #1239
feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
Add missing import in comm/__init__.py by @joker-eph in #1275
hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
Unify groupwise fp8 GEMM test by @cyx-6 in #1281
fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
Add shuffle matrix flag by @aleozlx in #1272
Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
patch error handling by @aleozlx in #1293
Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
add mm_fp4 use cudnn backend by @ttyio in #1288
fix: minor errors in cubin loader by @yyihuang in #1295
perfix: use lightweight API to query device property by @azhurkevich in #1298
refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
[Feature] SM level profiler by @Edenzzzz in #1305
Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
Update cutlass fp4 moe kernels by @wenscarl in #1294

New Contributors

@vlev02 made their first contribution in #1254
@ttyio made their first contribution in #1264
@azhurkevich made their first contribution in #1214
@weireweire made their first contribution in #1242
@IwakuraRein made their first contribution in #1239
@nvpohanh made their first contribution in #1286
@directhex made their first contribution in #1279
@ilmarkov made their first contribution in #1284
@elfiegg made their first contribution in #1303

Full Changelog: v0.2.8...v0.2.9rc1

Contributors

directhex, ttyio, and 17 other contributors

15 Jul 06:52

yzh119

v0.2.8

3f8317c

v0.2.8

What's Changed

[fix] fix BatchAttention CTA_TILE_KV mask issue by @happierpig in #1206
feat: enable and update all-reduce fused quantization by @yyihuang in #1164
Fix the issue with auxiliary kernel launch and grid dim calculation by @Anerudhan in #1208
Fix test_groupwise_scaled_gemm_fp8.py by @jinyangyuan-nvidia in #1211
[TVM] Remove enable_pdl from TVM binding interface by @MasterJH5574 in #1217
misc: minor adds in readme by @yyihuang in #1218
bugfix: fix blackwell fmha hanging issue for empty kv_len by @yzh119 in #1198
update trtllm-gen decode attention kernel launcher by @wenscarl in #1189
Handle allocation cutlass fused MoE output to caller by @wenscarl in #1225
Fix missing hash in the cudnn cubin path by @Anerudhan in #1227
bugfix: add logits processor to pyproject.toml by @yzh119 in #1224
fix: add trtllm-allreduce-fusion api notes and fix memory error by @yyihuang in #1229
feat: Add non-causal cudnn prefill kernels by @Anerudhan in #1230
minor: update oneshot handling, add params notes by @yyihuang in #1232
Enable cudnn decode and add tests for the cudnn decode kernel by @Anerudhan in #1221
docker: add cuda-python to CI docker image by @yzh119 in #1233
bugfix: Fix building without get_requires*() invocation by @mgorny in #1226
bugfix: support uint8_t for vec_t class template by @chenyang78 in #1234
feat: trtllm-gen fp8 moe kernels by @aleozlx in #1212
Patch fp8 cubin availability by @aleozlx in #1240
[comm] TRT-LLM's Multi-Node NVLink All-Reduce Kernel by @nvmbreughe in #1213
feat: Support MXFP8 x MXFP4 CUTLASS grouped GEMM by @jinyangyuan-nvidia in #1241
feat: add trtllm-gen mla cubin by @yyihuang in #1222
Add DeepGEMM kernels by @cyx-6 in #1209
Remove sm100+ requirement for trtllm allreduce kernels by @yzh119 in #1249
Defer mpi import for comm module by @yzh119 in #1250
feat: support environment variable overrides for NVSHMEM paths and linker flags by @EmilienM in #1253
release: bump version to v0.2.8 by @yzh119 in #1257
TRT-LLM's Multi-Node NVLink AR + fused RMSNorm kernel by @nvmbreughe in #1255

New Contributors

@jinyangyuan-nvidia made their first contribution in #1211
@mgorny made their first contribution in #1226
@chenyang78 made their first contribution in #1234
@aleozlx made their first contribution in #1212
@nvmbreughe made their first contribution in #1213
@EmilienM made their first contribution in #1253

Full Changelog: v0.2.7.post1...v0.2.8

Contributors

mgorny, chenyang78, and 11 other contributors

08 Jul 18:30

yzh119

v0.2.8rc1

728e8bb

v0.2.8rc1 Pre-release

What's Changed

[fix] fix BatchAttention CTA_TILE_KV mask issue by @happierpig in #1206
feat: enable and update all-reduce fused quantization by @yyihuang in #1164
Fix the issue with auxiliary kernel launch and grid dim calculation by @Anerudhan in #1208
Fix test_groupwise_scaled_gemm_fp8.py by @jinyangyuan-nvidia in #1211
[TVM] Remove enable_pdl from TVM binding interface by @MasterJH5574 in #1217
misc: minor adds in readme by @yyihuang in #1218
bugfix: fix blackwell fmha hanging issue for empty kv_len by @yzh119 in #1198
update trtllm-gen decode attention kernel launcher by @wenscarl in #1189
Handle allocation cutlass fused MoE output to caller by @wenscarl in #1225
Fix missing hash in the cudnn cubin path by @Anerudhan in #1227
bugfix: add logits processor to pyproject.toml by @yzh119 in #1224
fix: add trtllm-allreduce-fusion api notes and fix memory error by @yyihuang in #1229
feat: Add non-causal cudnn prefill kernels by @Anerudhan in #1230
minor: update oneshot handling, add params notes by @yyihuang in #1232
Enable cudnn decode and add tests for the cudnn decode kernel by @Anerudhan in #1221
docker: add cuda-python to CI docker image by @yzh119 in #1233
bugfix: Fix building without get_requires*() invocation by @mgorny in #1226
bugfix: support uint8_t for vec_t class template by @chenyang78 in #1234

New Contributors

@jinyangyuan-nvidia made their first contribution in #1211
@mgorny made their first contribution in #1226
@chenyang78 made their first contribution in #1234

Full Changelog: v0.2.7.post1...v0.2.8rc1

Contributors

mgorny, chenyang78, and 7 other contributors

01 Jul 18:14

yzh119

v0.2.7.post1

3fb73b3

v0.2.7.post1

What's Changed

[feat] optimize persistent batch attention perf. by @happierpig in #1200
Feature/cudnn dynamic cubin by @Anerudhan in #1187
Fix flashinfer.comm module missing by @BBuf in #1203
chore: bump flashinfer v0.2.7.post1 by @zhyncs in #1205

New Contributors

@Anerudhan made their first contribution in #1187
@BBuf made their first contribution in #1203

Full Changelog: v0.2.7...v0.2.7.post1

Contributors

Anerudhan, BBuf, and 2 other contributors

30 Jun 19:39

yzh119

v0.2.7

4d3fb6d

v0.2.7

What's Changed

ci: Update images for self-hosted ARM64 runner by @yongwww in #1128
Fix pointer dtype bug in rope by @Edenzzzz in #1129
feat: update and test create_ipc_buffer by @yyihuang in #1130
misc: update runllm widget by @yzh119 in #1132
misc: correct runllm widget (again) by @MasterJH5574 in #1133
[Feature] Support PDL for batch Prefill and Decode by @Edenzzzz in #1117
fix: negative zero by type trait --> binary value by @yyihuang in #1136
fix: sync after create_workspace by @yyihuang in #1138
refactor: use functools.cache instead of global dict for caching modules by @yzh119 in #1135
[feat] add unified batch attention w/ correctness tests. by @happierpig in #1137
Fix FA2 and FA3 multi-item scoring and cuda illegal memory access error by @arde171 in #1140
feat: Add support for FLASHINFER_EXTRA_LDFLAGS environment variable by @jennifgcrl in #1144
misc: remove sync between persistent runners and use packed_causal_kv_end for SM90Plan by @Edenzzzz in #1146
[fix] fix precision errors when applying causal mask on Qwen-2.5 series models by @happierpig in #1148
ci: Install mpi4py by @yongwww in #1149
feat: add trtllm moe_allreduce_fusion by @yyihuang in #1108
feat: add trtllm all-reduce fusion by @yyihuang in #1131
Add more logging to TRTLLM-GEN debug trace (NFC) by @joker-eph in #1158
feat: update non-fused moe by @yyihuang in #1161
Add fp4 quantization swizzling tests by @wenscarl in #1157
refactor: communication module by @yyihuang in #1162
feat: add finalize_moe_allreduce from trtllm by @yyihuang in #1159
feat: experimental support of green ctx by @yzh119 in #1163
feat: Fused temperature online softmax kernel by @xslingcn in #1153
MNNVL MoE All-to-All Support by @cyx-6 in #1134
feat: nvshmem python bindings by @yzh119 in #1160
Fix missing symbols in trtllm_utils.so by @tiran in #1168
feat: logits processor fusion rule for temperature softmax by @xslingcn in #1170
Expose fp4 blockscale swizzling kernel by @wenscarl in #1176
add nvshmem sum_reduce for mnnvl allreduce by @Amir-19 in #1152
bugfix: softmax NaN results caused by large -inf masks by @xslingcn in #1178
[CI] Update is_last_build by @yongwww in #1183
[feat] support block sparse attention w/ variable block sizes and head-wise sparse patterns by @happierpig in #1177
bugfix: fix invalid blackwell fmha unittests by @yzh119 in #1181
feat: support green ctx creation by a list of SM counts by @Conless in #1190
fix: trtllm_comm module aot arch issues by @yyihuang in #1196
bugfix: fix broken docs build by adding missing dependencies by @Conless in #1197
chore: bump v0.2.7 by @zhyncs in #1199

New Contributors

@jennifgcrl made their first contribution in #1144
@tiran made their first contribution in #1168
@Amir-19 made their first contribution in #1152
@Conless made their first contribution in #1190

Full Changelog: v0.2.6.post1...v0.2.7

Contributors

tiran, joker-eph, and 14 other contributors

07 Jun 03:24

yzh119

v0.2.6.post1

bc50f1a

v0.2.6.post1

What's Changed

[CI] Add x86_64 tag for x86 self-hosted runner by @yongwww in #1126
hotfix: fix installation script behavior by @yzh119 in #1125

Full Changelog: v0.2.6...v0.2.6.post1

Contributors

yongwww and yzh119

06 Jun 19:13

yzh119

v0.2.6

608a343

v0.2.6

What's Changed

ci: select 2_28 manylinux builder for new torch+cuda versions by @yzh119 in #1000
misc: update README.md by @yzh119 in #1003
bugfix: Fix illegal memory access due to custom mask ptr by @yongchaoding in #1008
misc: fix kv-layout doc references by @Edenzzzz in #1009
misc: more benchmark scripts in Python by @yzh119 in #1010
misc: fix instrument code for mla profiler by @yzh119 in #1014
bugfix: import wrapper of mla decode by @dhy2000 in #1013
feat: update decode attention APIs by @yzh119 in #1007
doc: use latest protobuf for profiler by @xslingcn in #1021
feat: SM-constraint Communication Kernels by @yyihuang in #994
feat: ragged tensor padding kernel for blackwell kernel alignment by @yzh119 in #1025
bugfix: fix custom mask not being reset after converting custom mask into causal or non-causal by @yongchaoding in #1028
fix: add zero init for KV tiled copy by @happierpig in #1029
[NVIDIA] Add Cutlass MLA backend by @kaixih in #1031
Add workflow to build aarch64 wheel by @yongwww in #1036
Non-blocking host-to-device copy in the ragged prefill wrapper by @nandor in #1040
fix: remove default ubuntu user in Lunar/Noble by @rickyfeng0119 in #1042
feat: Softmax free sampling by @kf-zhang in #1035
feat: add functional per-head FP8 quantization for FA3 by @happierpig in #1033
add multi-item scoring by @arde171 in #1015
[nvidia] cutlass fp8 blockwise/groupwise gemm support by @cyx-6 in #1045
[nvidia] cutlass fp8 groupwise grouped gemm support by @cyx-6 in #1047
fix: top_k_mask_logits hangs on -inf inputs by @xslingcn in #1050
Benchmark: POD vs batched prefill by @Edenzzzz in #1052
[nvidia] initial support for blackwell kernels by @yzh119 in #1039
Fix KV chunking for POD. by @AKKamath in #1054
bugfix: temporarily disable split-kv in blackwell mla by @yzh119 in #1055
bugfix: remove device allocation by @yzh119 in #1056
Parameterize prefix mask call (needed by POD-Attention) by @AKKamath in #1059
bugfix: move cum_m calculation inside kernels by @yzh119 in #1060
misc: add pull request template by @yzh119 in #1062
bugfix: Cast build paths to str before setuputils Extension by @farnasirim in #1058
Add PyTorch 2.7.0 build by @huydhn in #1063
bugfix: adding lse output to blackwell fmha kernels by @yzh119 in #1071
bugfix: follow user-specified sm_scale for blackwell cutlass fmha by @yzh119 in #1072
misc: jit: Introduce JitSpec and Generate ninja file by @abcdabcd987 in #1065
fix: fix a typo in docs by @acelyc111 in #1077
misc: jit: Deprecate load_cuda_ops() by @abcdabcd987 in #1066
misc: jit: fix missing _get_glibcxx_abi_build_flags by @abcdabcd987 in #1080
misc: jit: Refactor gen JitSpec out of get_xxx_module by @abcdabcd987 in #1069
misc: jit: Replace parallel_load_modules() with build_jit_specs() by @abcdabcd987 in #1070
misc: jit: Import jit_env as a module by @abcdabcd987 in #1073
misc: aot: Add script to build all AOT ops by @abcdabcd987 in #1067
misc: aot: Refactor AOT packaging by @abcdabcd987 in #1075
misc: aot: Remove has_prebuilt_ops by @abcdabcd987 in #1076
ci: upgrade docker ci image by @yzh119 in #1082
bugfix: fix custom allreduce compilation in AOT mode by @yzh119 in #1083
perf: accelerate blackwell grouped gemm by @yzh119 in #1086
misc: update pull request template by @yzh119 in #1088
Fix Cutlass grouped GEMM stride by @cyx-6 in #1081
bugfix: fix fp8 attention kernels aot compilation issue by @yzh119 in #1087
comm: refactor and initialize flashinfer.comm module by @yzh119 in #1089
misc: cleanup by @b8zhong in #1092
misc: followup by @b8zhong in #1093
[nvidia] Add Blackwell FMHA decode kernel from TRT-LLM by @joker-eph in #1051
bugfix: fix ninja generation rule for non-cuda input by @yzh119 in #1097
jit: Update TVM JIT binding with the latest FFI refactor by @MasterJH5574 in #1100
SM100 Groupwise GeMM K-Major Scale Supports by @cyx-6 in #1102
misc: aot: Add platform tag to wheel by @abcdabcd987 in #1105
feat: composable logits processor by @xslingcn in #1099
feat: add trtllm all-reduce (non-MoE) by @yyihuang in #1096
bugfix: host-precomputed plan function for blackwell fmha by @yzh119 in #1106
doc: fix LogitsPipe example by @xslingcn in #1110
bugfix: bugfix for blackwell mla split-k by @yzh119 in #1109
Add CUTLASS fused moe kernels from TensorRT-LLM. by @wenscarl in #1113
fix: initialize lamport buffer only once after creating new workspace by @yyihuang in #1111
hotfix: fix the blackwell fmha stream by @yzh119 in #1116
fix head_dim not defined if sm_scale is not None by @majian4work in #1119
doc: add Ask-AI widget by @xslingcn in #1121
bugfix: Fix test and output shape of fp4 quantize by @wenscarl in #1114
misc: update slack link by @yzh119 in #1120
release: bump version to v0.2.6 by @yzh119 in #1122

New Contributors

@yongchaoding made their first contribution in #1008
@Edenzzzz made their first contribution in #1009
@dhy2000 made their first contribution in #1013
@kaixih made their first contribution in #1031
@yongwww made their first contribution in #1036
@rickyfeng0119 made their first contribution in #1042
@kf-zhang made their first contribution in #1035
@arde171 made their first contribution in #1015
@farnasirim made their first contribution in #1058
@huydhn made their first contribution in #1063
@acelyc111 made their first contribution in #1077
@b8zhong made their first contribution in #1092
@joker-eph made their first contribution in #1051
@wenscarl made their first contribution in #1113
@majian4work made their first contribution in #1119

Full Changelog: v0.2.5...v0.2.6

Contributors

huydhn, nandor, and 22 other contributors

Assets 2

04 Apr 00:41

yzh119

v0.2.5

592b110

v0.2.5

What's Changed

Fix compilation with FP16_QK_REDUCTION enabled. by @diptorupd in #962
misc: Use environment variable to control JIT verbose flag by @yzh119 in #981
Triton rms_norm kernels by @nandor in #983
Allow passing workspace base directory via environment variable by @jsuchome in #973
[CHORE] Rename output_emitted_token_num -> output_emitted_draft_token_num by @jon-chuang in #977
ci: switch to on-demand instances if spot instance is interrupted by @yzh119 in #987
misc: update devcontainer by @yzh119 in #986
ci: add torch 2.6+cu126 wheel by @yzh119 in #985
misc: fix devcontainer conda path by @yzh119 in #989
perf: prefetch page indices for mla kernel by @yzh119 in #991
SM-constraint-GEMM by triton persistent kernel by @yyihuang in #982
3rdparty: upgrade cutlass to 3.9 by @yzh119 in #997
perf: add -DNDEBUG compilation flag by @yzh119 in #998
release: bump version to v0.2.5 by @yzh119 in #999

New Contributors

@jsuchome made their first contribution in #973
@jon-chuang made their first contribution in #977
@yyihuang made their first contribution in #982

Full Changelog: v0.2.4...v0.2.5

Contributors

jsuchome, nandor, and 4 other contributors
