BFloat16.jl support in kernels #2441

Open
maleadt opened this issue Jul 12, 2024 · 2 comments
Labels
cuda kernels Stuff about writing CUDA kernels.

Comments

@maleadt
Member

maleadt commented Jul 12, 2024

Julia 1.11 introduces BFloat16 codegen support, so let's use this issue to track support for it in CUDA.jl kernels.

Right now, it looks like we support the type, but we somehow still emit conversions to and from Float32 around the arithmetic:

julia> BFloat16s.llvm_storage
true

julia> BFloat16s.llvm_arithmetic
true

julia> function kernel(x)
           @inbounds x[threadIdx().x] += BFloat16(1)
           return
       end

julia> x = CuArray{BFloat16}(undef, 1024);

julia> @device_code_llvm debuginfo=:none @cuda kernel(x)
; PTX CompilerJob of MethodInstance for kernel(::CuDeviceVector{BFloat16, 1}) for sm_89
define ptx_kernel void @_Z6kernel13CuDeviceArrayI8BFloat16Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = bitcast i8 addrspace(1)* %.fca.0.extract to bfloat addrspace(1)*
  %3 = zext i32 %1 to i64
  %4 = getelementptr inbounds bfloat, bfloat addrspace(1)* %2, i64 %3
  %5 = load bfloat, bfloat addrspace(1)* %4, align 2
  %6 = fpext bfloat %5 to float
  %7 = fadd float %6, 1.000000e+00
  %8 = fptrunc float %7 to bfloat
  store bfloat %8, bfloat addrspace(1)* %4, align 2
  ret void
}

In addition, the logic in BFloat16s.jl isn't great: we determine support based on the host processor. It's not clear whether we can do better, though; this looks a lot like the literal Int issue, where we can't make GPU code use Int32 when the host's Int is Int64.
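For illustration, a hypothetical sketch of that analogy (the kernel name widen_demo! and the launch below are made up, not from this issue): an integer literal in kernel code is the host's Int, so index arithmetic silently widens to 64 bits on the device, just as BFloat16s.llvm_arithmetic reflects the host CPU rather than the GPU target being compiled for.

using CUDA

# Hypothetical kernel: the literal `1` is a host Int (Int64 on a 64-bit machine),
# so `i + 1` widens to Int64 even though the index itself is Int32.
function widen_demo!(x)
    i = Int32(threadIdx().x)
    @inbounds x[i] = i + 1
    return
end

x = CuArray{Int32}(undef, 32);
@cuda threads=32 widen_demo!(x)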

maleadt added the bug (Something isn't working) label on Jul 12, 2024
@maleadt
Member Author

maleadt commented Sep 16, 2024

Update: it looks like we now hit an instruction selection error:

julia> using CUDA, BFloat16s

julia> function foobar(C::AbstractArray, a::Number, b::Number)
           @inbounds C[] = a*b
           return
       end
foobar (generic function with 1 method)

julia> @cuda foobar(CuArray(Float64[0]), one(BFloat16), one(Int32))
ERROR: LLVM error: Cannot select: 0x22563be0: f64 = fp_extend 0x22563b70, /home/tim/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:210 @[ number.jl:7 @[ /home/tim/Julia/pkg/CUDA/src/device/array.jl:166 @[ /home/tim/Julia/pkg/CUDA/src/device/array.jl:178 @[ REPL[3]:2 ] ] ] ]
  0x22563b70: bf16 = fmul 0x22563b00, 0x22563320, /home/tim/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:227 @[ promotion.jl:430 @[ REPL[3]:2 ] ]
    0x22563b00: bf16 = sint_to_fp 0x22563390, /home/tim/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:188 @[ number.jl:7 @[ promotion.jl:375 @[ promotion.jl:400 @[ promotion.jl:430 @[ REPL[3]:2 ] ] ] ] ]
      0x22563390: i32,ch = load<(dereferenceable invariant load (s32) from `i32 addrspace(101)* null`, addrspace 101)> 0x20cf8dc0, TargetExternalSymbol:i64'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_3', undef:i64
        0x22563940: i64 = TargetExternalSymbol'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_3'
        0x22563010: i64 = undef
    0x22563320: bf16,ch = load<(dereferenceable invariant load (s16) from `bfloat addrspace(101)* null`, addrspace 101)> 0x20cf8dc0, TargetExternalSymbol:i64'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_2', undef:i64
      0x225637f0: i64 = TargetExternalSymbol'_Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32_param_2'
      0x22563010: i64 = undef
In function: _Z6foobar13CuDeviceArrayI7Float64Ll1ELl1EE8BFloat165Int32
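For comparison, an untested sketch of a possible workaround (the name foobar_f32 is hypothetical, and whether this actually avoids the selection error on a given target is an assumption): keep BFloat16 for the arguments but widen to Float32 before the multiply, so no bf16 fmul/sint_to_fp nodes reach instruction selection.

using CUDA, BFloat16s

# Hypothetical variant that performs the arithmetic in Float32.
function foobar_f32(C::AbstractArray, a::Number, b::Number)
    @inbounds C[] = Float32(a) * Float32(b)
    return
end

@cuda foobar_f32(CuArray(Float64[0]), one(BFloat16), one(Int32))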

@CarloLucibello
Contributor

CarloLucibello commented Jan 4, 2025

I tried a few high-level operations; the current situation is this:

using BFloat16s, CUDA

CUDA.allowscalar(false)
A = rand(BFloat16, 3, 3) |> cu
x = rand(BFloat16, 3) |> cu

A * x     # ok
tanh.(x)  # ok

x .* x    # ERROR: LLVM error: Cannot select: 0x12b83210: bf16 = fmul 0x12b839f0, 0x18ddf5c0 ...
x.^2      # ERROR: LLVM error: Cannot select: 0x163e2e30: bf16 = fmul 0xf042050, 0xf042050 ...

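A possible workaround at this level (my untested sketch; it assumes only the bf16 arithmetic selection is the problem, not the bf16 loads and stores): widen to Float32 for the computation and truncate back to BFloat16 for storage.

# Untested sketch: do the arithmetic in Float32, keep BFloat16 as the storage type.
y = BFloat16.(Float32.(x) .* Float32.(x))
z = BFloat16.(Float32.(x) .^ 2)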
My environment is the following:

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.5
NVIDIA driver 555.42.2

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+555.42.2

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.2
- LLVM: 16.0.6

3 devices:
  0: NVIDIA TITAN RTX (sm_75, 23.038 GiB / 24.000 GiB available)
  1: NVIDIA TITAN RTX (sm_75, 23.239 GiB / 24.000 GiB available)
  2: NVIDIA TITAN RTX (sm_75, 23.444 GiB / 24.000 GiB available)

maleadt added the cuda kernels (Stuff about writing CUDA kernels.) label and removed the bug (Something isn't working) label on Jan 6, 2025
maleadt changed the title from "BFloat16.jl support" to "BFloat16.jl support in kernels" on Jan 6, 2025