feat: accelerate f16 distance #2885

Draft
wants to merge 5 commits into
base: main

@eddyxu eddyxu commented Sep 15, 2024

Env:

  • Ubuntu 24.04 / macOS 15
  • AWS VMs or Apple M2 Max MacBook Pro
  • GCC 13 (Ubuntu), clang 18 (Ubuntu / Mac). GCC is not installed on Mac by default
  • RUSTFLAGS=""
  • Rustc 1.81

Ran command:

CC={clang|gcc} cargo bench --bench l2/cosine/dot [--features fp16kernels]  -- "half::binary16::f16, auto-vectorization"

| CPU | CC | L2 (f16) | Dot (f16) | Cosine (f16) | branch + feature |
|---|---|---|---|---|---|
| AMD Zen3 | | 867.81 ms | 701.42 ms | 1.3697 | main |
| AMD Zen3 | gcc 13 | 887.41 ms | 905.89 ms | 920.16 ms | main + fp16kernels |
| AMD Zen3 | clang 18 | 119.64 ms | 118.90 ms | 121.82 ms | main + fp16kernels |
| AMD Zen3 | gcc 13 | 887.04 ms | 878.89 ms | 915.79 ms | lei/f16_bench |
| AMD Zen3 | clang 18 | 120.78 ms | 113.93 ms | 120.68 ms | lei/f16_bench |
| Skylake | clang | 1.5729 s | | | main |
| Skylake | gcc | 1.4302 s | 1.4184 s | 1.4276 s | main + fp16kernels |
| Skylake | clang | 290.73 ms | 260.39 ms | 287.47 ms | main + fp16kernels |
| Skylake | gcc | 1.4337 s | 1.4161 s | 1.4273 s | lei/f16_bench |
| Skylake | clang | 578.46 ms | 582.08 ms | 888.80 ms | lei/f16_bench |
| Sapphire Rapids | | 1.4047 s | 1.1850 s | 2.3802 s | main |
| Sapphire Rapids | gcc | 1.2236 s | 616.14 ms | 1.5293 s | main + fp16kernels |
| Sapphire Rapids | clang | 308.18 ms | 283.11 ms | 293.49 ms | main + fp16kernels |
| Sapphire Rapids | gcc | 887.84 ms | 857.94 ms | 897.96 ms | lei/f16_bench |
| Sapphire Rapids | clang | 274.20 ms | 276.86 ms | 314.43 ms | lei/f16_bench |
| Graviton 3 (m7g.xlarge) | | 2.9608 s | 2.7640 s | 4.7155 s | main |
| Graviton 3 | gcc | 234.97 ms | 218.71 ms | 230.73 ms | main + fp16kernels |
| Graviton 3 | clang | 209.75 ms | 209.26 ms | 239.20 ms | main + fp16kernels |
| Graviton 3 | gcc | 129.63 ms | 120.84 ms | 230.57 ms | lei/f16_bench |
| Graviton 3 | clang | 130.93 ms | 118.42 ms | 235.08 ms | lei/f16_bench |
| Apple M2 Max | clang | 85.693 ms | 64.815 ms | 87.479 ms | main + fp16kernels |
| Apple M2 Max | clang | 416.78 ms | 345.76 ms | 691.80 ms | main |
| Apple M2 Max | clang | 64.450 ms | 63.911 ms | 109.16 ms | lei/f16_bench |

Conclusion:

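  • We need to use clang: on x86_64 with fp16kernels, gcc 13 is several times slower than clang 18 (for example, 887.41 ms vs 119.64 ms for L2 on AMD Zen3), while on Graviton 3 the two compilers are close.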

@github-actions github-actions bot added the enhancement New feature or request label Sep 15, 2024
@eddyxu eddyxu added the WIP work in progress label Sep 15, 2024
Comment on lines +58 to +65
#if defined(__aarch64__)
// on aarch64 with fp16, this is 2x faster.
FP16 sub = x[i] - y[i];
#else
float sub = x[i] - y[i];
#endif
// Use float32 as the accumulator to avoid overflow.
sum += sub * sub;
Contributor

Should we just have simd/generic, simd/x86, and simd/aarch64?

Contributor Author

As 3 different functions?
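
One way to read the suggestion in this thread is a compile-time split into per-architecture variants behind a single entry point. The following is a minimal sketch under that assumption, not the PR's code: the names `l2_f16_generic`, `l2_f16_aarch64`, and `l2_f16` are hypothetical, and `_Float16` stands in for the PR's `FP16` type (with a fallback to `float` on compilers that do not provide it). An x86 variant would follow the same pattern and is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: _Float16 stands in for the PR's FP16 type. Compilers that do
 * not provide _Float16 fall back to float so the sketch still builds. */
#if defined(__FLT16_MAX__)
typedef _Float16 FP16;
#else
typedef float FP16;
#endif

/* Generic variant (hypothetical name): widen both inputs to float32 before
 * subtracting, then accumulate the squared difference in float32. */
float l2_f16_generic(const FP16 *x, const FP16 *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float sub = (float)x[i] - (float)y[i];
        sum += sub * sub;
    }
    return sum;
}

#if defined(__aarch64__)
/* aarch64 variant (hypothetical name): keep the difference in FP16, as in the
 * reviewed hunk, while the accumulator stays float32. */
float l2_f16_aarch64(const FP16 *x, const FP16 *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        FP16 sub = x[i] - y[i];
        sum += sub * sub;
    }
    return sum;
}
#endif

/* Single entry point (hypothetical name); the variant is chosen at compile
 * time, so callers never see the #if. */
float l2_f16(const FP16 *x, const FP16 *y, size_t n) {
    assert(x != NULL && y != NULL);
#if defined(__aarch64__)
    return l2_f16_aarch64(x, y, n);
#else
    return l2_f16_generic(x, y, n);
#endif
}
```

Whether the variants live in one file or in separate `simd/generic`, `simd/x86`, and `simd/aarch64` units is a layout choice; the sketch only shows that the `#if` can move out of the loop body and into the dispatch.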

eddyxu added a commit that referenced this pull request Sep 18, 2024