BFloat16.jl support in kernels #2441

Open
maleadt opened this issue Jul 12, 2024 · 2 comments
Labels
cuda kernels Stuff about writing CUDA kernels.

Comments

@maleadt
Member

maleadt commented Jul 12, 2024

Julia 1.11 introduces BFloat16 codegen support, so let's use this issue to track support for it in CUDA.jl kernels.

Right now, it looks like we support the type, but we somehow still emit conversions to and from Float32 around the arithmetic:

julia> BFloat16s.llvm_storage
true

julia> BFloat16s.llvm_arithmetic
true

julia> function kernel(x)
           @inbounds x[threadIdx().x] += BFloat16(1)
           return
       end

julia> x = CuArray{BFloat16}(undef, 1024);

julia> @device_code_llvm debuginfo=:none @cuda kernel(x)
; PTX CompilerJob of MethodInstance for kernel(::CuDeviceVector{BFloat16, 1}) for sm_89
define ptx_kernel void @_Z6kernel13CuDeviceArrayI8BFloat16Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = bitcast i8 addrspace(1)* %.fca.0.extract to bfloat addrspace(1)*
  %3 = zext i32 %1 to i64
  %4 = getelementptr inbounds bfloat, bfloat addrspace(1)* %2, i64 %3
  %5 = load bfloat, bfloat addrspace(1)* %4, align 2
  %6 = fpext bfloat %5 to float
  %7 = fadd float %6, 1.000000e+00
  %8 = fptrunc float %7 to bfloat
  store bfloat %8, bfloat addrspace(1)* %4, align 2
  ret void
}

In addition, the logic in BFloat16s.jl isn't great: we determine support based on the host processor. It's not clear whether we can do better, though; this looks a lot like the literal Int issue, where we can't make GPU code use Int32 when the host's Int is Int64.
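For illustration, a hypothetical sketch of that analogy (the kernel name widen_demo! and the launch below are made up, not from this issue): an integer literal in kernel code is the host's Int, so index arithmetic silently widens to 64 bits on the device, just as BFloat16s.llvm_arithmetic reflects the host CPU rather than the GPU target being compiled for.

using CUDA

# Hypothetical kernel: the literal `1` is a host Int (Int64 on a 64-bit machine),
# so `i + 1` widens to Int64 even though the index itself is Int32.
function widen_demo!(x)
    i = Int32(threadIdx().x)
    @inbounds x[i] = i + 1
    return
end

x = CuArray{Int32}(undef, 32);
@cuda threads=32 widen_demo!(x)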

maleadt added the bug (Something isn't working) label on Jul 12, 2024
@maleadt
Member Author

maleadt commented Sep 16, 2024

Update: it looks like we now hit an instruction selection error:

julia> using CUDA, BFloat16s

julia> function foobar(C::AbstractArray, a::Number, b::Number)
           @inbounds C[] = a*b
           return
       end
foobar (generic function with 1 method)

julia> @cuda foobar(CuArray(Float64[0]), one(BFloat16), one(Int32))
ERROR: LLVM error: Cannot select: 0x22563be0: f64 = fp_extend 0x22563b70, /home/tim/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:210 @[ number.jl:7 @[ /home/tim/Julia/pkg/CUDA/src/device/array.jl:166 @[ /home/tim/Julia/pkg/CUDA/src/device/array.jl:178 @[ REPL[3]:2 ] ] ] ]
  0x22563b70: bf16 = fmul 0x22563b00, 0x22563320, /home/tim/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:227 @[ promotion.jl:430 @[ REPL[3]:2 ] ]
    0x22563b00: bf16 = sint_to_fp 0x22563390, /home/tim/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:188 @[ number.jl:7 @[ promotion.jl:375 @[ promotion.jl:400 @[ promotion.jl:430 @[ REPL[3]:2 ] ] ] ] ]
      0x22563390: i32,ch = load<(dereferenceable invariant load (s32) from `i32 addrspace(101)* null`, addrspace 101)> 0x20cf8dc0, TargetExternalSymbol:i64'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_3', undef:i64
        0x22563940: i64 = TargetExternalSymbol'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_3'
        0x22563010: i64 = undef
    0x22563320: bf16,ch = load<(dereferenceable invariant load (s16) from `bfloat addrspace(101)* null`, addrspace 101)> 0x20cf8dc0, TargetExternalSymbol:i64'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_2', undef:i64
      0x225637f0: i64 = TargetExternalSymbol'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_2'
      0x22563010: i64 = undef
In function: _Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32
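For comparison, an untested sketch of a possible workaround (the name foobar_f32 is hypothetical, and whether this actually avoids the selection error on a given target is an assumption): keep BFloat16 for the arguments but widen to Float32 before the multiply, so no bf16 fmul/sint_to_fp nodes reach instruction selection.

using CUDA, BFloat16s

# Hypothetical variant that performs the arithmetic in Float32.
function foobar_f32(C::AbstractArray, a::Number, b::Number)
    @inbounds C[] = Float32(a) * Float32(b)
    return
end

@cuda foobar_f32(CuArray(Float64[0]), one(BFloat16), one(Int32))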

@CarloLucibello
Contributor

CarloLucibello commented Jan 4, 2025

I tried a few high-level operations; the current situation is this:

using BFloat16s, CUDA

CUDA.allowscalar(false)
A = rand(BFloat16, 3, 3) |> cu
x = rand(BFloat16, 3) |> cu

A * x     # ok
tanh.(x)  # ok

x .* x    # ERROR: LLVM error: Cannot select: 0x12b83210: bf16 = fmul 0x12b839f0, 0x18ddf5c0 ...
x.^2      # ERROR: LLVM error: Cannot select: 0x163e2e30: bf16 = fmul 0xf042050, 0xf042050 ...

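A possible workaround at this level (my untested sketch; it assumes only the bf16 arithmetic selection is the problem, not the bf16 loads and stores): widen to Float32 for the computation and truncate back to BFloat16 for storage.

# Untested sketch: do the arithmetic in Float32, keep BFloat16 as the storage type.
y = BFloat16.(Float32.(x) .* Float32.(x))
z = BFloat16.(Float32.(x) .^ 2)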
My environment is the following:

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.5
NVIDIA driver 555.42.2

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+555.42.2

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.2
- LLVM: 16.0.6

3 devices:
  0: NVIDIA TITAN RTX (sm_75, 23.038 GiB / 24.000 GiB available)
  1: NVIDIA TITAN RTX (sm_75, 23.239 GiB / 24.000 GiB available)
  2: NVIDIA TITAN RTX (sm_75, 23.444 GiB / 24.000 GiB available)

maleadt added the cuda kernels (Stuff about writing CUDA kernels.) label and removed the bug (Something isn't working) label on Jan 6, 2025
maleadt changed the title from "BFloat16.jl support" to "BFloat16.jl support in kernels" on Jan 6, 2025