v0.21.0

@awni released this 22 Nov 20:18
bb303c4

Highlights

  • Support 3- and 6-bit quantization: benchmarks
  • Much faster memory-efficient attention for head dimensions 64 and 80: benchmarks
  • Much faster sdpa inference kernel for longer sequences: benchmarks
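
For readers unfamiliar with what a 3- or 6-bit width means in practice, here is a minimal pure-Python sketch of affine group quantization. It is an illustration of the general technique, not MLX's kernels or packing layout, and it assumes a group of 64 values; the helper names `quantize_group`, `dequantize_group`, and `packed_bytes` are hypothetical.

```python
def quantize_group(values, bits):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1              # highest code, e.g. 7 for 3 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round to the nearest code and clamp to guard against float error.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo               # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float values."""
    return [c * scale + bias for c in codes]


def packed_bytes(num_values, bits):
    """Bytes needed to store `num_values` codes packed back to back."""
    return (num_values * bits + 7) // 8


# Assumed group of 64 evenly spaced values, for illustration only.
group = [i / 10 for i in range(-32, 32)]
codes3, scale3, bias3 = quantize_group(group, bits=3)
codes6, scale6, bias6 = quantize_group(group, bits=6)
```

With 64 values per group, the codes alone take 24 bytes at 3 bits and 48 bytes at 6 bits, and the worst-case rounding error is half of the per-group scale. Widths that are not powers of two do not align to byte boundaries, which is why packing needs the rounding-up step in `packed_bytes`.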

Core

  • contiguous op (C++ only) + primitive
  • BFS width limit to reduce memory consumption during eval
  • Fast CPU quantization
  • Faster indexing math in several kernels:
    • unary, binary, ternary, copy, compiled, reduce
  • Improve dispatch threads for a few kernels:
    • conv, gemm splitk, custom kernels
  • More buffer donation with no-ops to reduce memory use
  • Use CMAKE_OSX_DEPLOYMENT_TARGET to pick Metal version
  • Dispatch Metal bf16 type at runtime when using the JIT

NN

  • nn.AvgPool3d and nn.MaxPool3d
  • Support groups in nn.Conv2d
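
As a rough guide to what these two additions change, the sketch below computes the output shape of a 3D pooling window and the weight count of a grouped 2D convolution using the standard formulas. It is pure Python and does not call MLX. The channels-last `(N, D, H, W, C)` layout, the square/cubic kernels, and the helper names `pool3d_output_shape` and `grouped_conv2d_weight_count` are assumptions made for illustration.

```python
def pool3d_output_shape(input_shape, kernel_size, stride=None, padding=0):
    """Output shape of a 3D pooling window over an (N, D, H, W, C) input."""
    stride = kernel_size if stride is None else stride
    n, d, h, w, c = input_shape
    # Standard sliding-window formula applied to each spatial axis.
    spatial = [(size + 2 * padding - kernel_size) // stride + 1
               for size in (d, h, w)]
    return (n, *spatial, c)


def grouped_conv2d_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a grouped 2D convolution."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels // groups input channels.
    return out_channels * kernel_size * kernel_size * (in_channels // groups)


pooled = pool3d_output_shape((2, 8, 16, 16, 4), kernel_size=2)
dense = grouped_conv2d_weight_count(8, 16, kernel_size=3, groups=1)
grouped = grouped_conv2d_weight_count(8, 16, kernel_size=3, groups=4)
```

Here a 2x2x2 window with matching stride halves each spatial axis, and splitting the convolution into 4 groups cuts the weight count by a factor of 4 (1152 to 288).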

Bug fixes

  • Fix per-example mask + docs in sdpa
  • Fix FFT synchronization bug (use dispatch method everywhere)
  • Throw for invalid *fft{2,n} cases
  • Fix OOB access in qmv
  • Fix donation in sdpa to reduce memory use
  • Allocate safetensors header on the heap to avoid stack overflow
  • Fix sibling memory leak
  • Fix view segfault for scalar inputs
  • Fix concatenate vmap