> docker build --network=host -t ci-flash_attention_flash_attention.ubuntu.amd --pull -f docker/flash_attention.ubuntu.amd.Dockerfile ./docker
Sending build context to Docker daemon  274.4kB
Step 1/8 : ARG BASE_DOCKER=rocm/pytorch-nightly
Step 2/8 : FROM $BASE_DOCKER
latest: Pulling from rocm/pytorch-nightly
Digest: sha256:fde1d1f2805cee71e27ebc701a123c64628302e7e5df40408c1c34b3cba58495
Status: Image is up to date for rocm/pytorch-nightly:latest
 ---> 4b23063b26d6
Step 3/8 : WORKDIR /workspace
 ---> Running in 2c7e2062e0ea
Removing intermediate container 2c7e2062e0ea
 ---> 0e76a94dd324
Step 4/8 : RUN ls /opt/conda/envs
 ---> Running in de33c073e2d1
py_3.8
Removing intermediate container de33c073e2d1
 ---> bc15461bc4f0
Step 5/8 : RUN pip install ninja
 ---> Running in 0aac2ebd7f48
Collecting ninja
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 19.1 MB/s eta 0:00:00
Installing collected packages: ninja
Successfully installed ninja-1.11.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Removing intermediate container 0aac2ebd7f48
 ---> e2c579dd1e98
Step 6/8 : RUN git clone -b flash_attention_for_rocm --recurse-submodules https://github.com/ROCmSoftwarePlatform/flash-attention.git
 ---> Running in b96a89654dde
Cloning into 'flash-attention'...
Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'csrc/cutlass'
Submodule 'csrc/flash_attn_rocm/composable_kernel' (https://github.com/ROCmSoftwarePlatform/composable_kernel) registered for path 'csrc/flash_attn_rocm/composable_kernel'
Cloning into '/workspace/flash-attention/csrc/cutlass'...
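The steps logged so far imply a Dockerfile roughly like the following. This is a sketch reconstructed from the build output above, not the actual `docker/flash_attention.ubuntu.amd.Dockerfile`; the log shows 8 steps, and the remaining steps appear later in the transcript, so only what is visible here is reproduced.

```dockerfile
# Sketch reconstructed from the build log (steps 1-6 of 8 shown so far)
ARG BASE_DOCKER=rocm/pytorch-nightly
FROM $BASE_DOCKER

WORKDIR /workspace

# Sanity check: list the conda environments shipped in the base image
RUN ls /opt/conda/envs

# Build dependency for the C++/HIP extension
RUN pip install ninja

# ROCm port of flash-attention, with cutlass and composable_kernel submodules
RUN git clone -b flash_attention_for_rocm --recurse-submodules \
    https://github.com/ROCmSoftwarePlatform/flash-attention.git
```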
Cloning into '/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel'...
Submodule path 'csrc/cutlass': checked out 'c4f6b8c6bc94ff69048492fb34df0dfaf1983933'
Submodule path 'csrc/flash_attn_rocm/composable_kernel': checked out '5ff2d646e893de55adebaa988e5dc547cbc21954'
Removing intermediate container b96a89654dde
 ---> d73388c98228
Step 7/8 : RUN cd /workspace/flash-attention && python setup.py install
 ---> Running in bc2eb5303461
Warning: Torch did not find available GPUs on this system. If your intention is to cross-compile, this is not an error. By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2), Volta (compute capability 7.0), Turing (compute capability 7.5), and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0). If you wish to cross-compile for a single specific architecture, export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
torch.__version__ = 2.3.0a0+gitac0bed0
RTZ IS USED
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/ck.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/ck.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include/ck/library/utility/device_memory.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include/ck/library/utility/device_memory_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/gemm_specialization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/gemm_specialization.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_specialization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_specialization.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/integral_constant.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/integral_constant.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/enable_if.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/enable_if.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/number.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/number.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional2.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/data_type.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/data_type.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math_v2.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/f8_utils.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/f8_utils.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/random_gen.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/random_gen.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type_convert.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type_convert.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_id.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_id.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/quantization_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/quantization_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/src/utils.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/utils_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/params.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/params_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence_helper.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional4.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional4.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple_helper.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_element_picker.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_element_picker.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_helper.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array_multi_index.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array_multi_index_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array_multi_index.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array_multi_index_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/multi_index.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/multi_index_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional3.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional3_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/ignore.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/ignore.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/magic_division.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/magic_division.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/c_style_pointer_cast.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/c_style_pointer_cast.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/is_known_at_compile_time.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/is_known_at_compile_time.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/transpose_vectors.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/transpose_vectors.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/inner_product.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/inner_product.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/thread_group.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/thread_group.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/debug.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/debug_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wave_read_first_lane.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wave_read_first_lane.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/generic_memory_space_atomic.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/generic_memory_space_atomic.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/synchronization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/synchronization_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_address_space.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_address_space.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/static_buffer.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/static_buffer.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/dynamic_buffer.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/dynamic_buffer.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_inline_asm.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_inline_asm.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_xdlops.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_xdlops.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/philox_rand.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/philox_rand.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_helper_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_helper_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_layout.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_layout.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/stream_config.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/stream_config.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_base.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_base.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/masking_specialization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/masking_specialization.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_grouped_gemm_softmax_gemm_permute.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_grouped_gemm_softmax_gemm_permute.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/matrix_padder.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/matrix_padder_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_adaptor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_adaptor_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/block_to_ctile_map_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_space_filling_curve.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_space_filling_curve_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/xdlops_gemm_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/cluster_descriptor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/cluster_descriptor_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor/static_tensor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor/static_tensor.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v3r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v3r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v4r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v4r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v6r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v6r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v6r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v6r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_enums.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_enums.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_common.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_common.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_operator.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_operator.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_functions_accumulate.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_functions_accumulate.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_shift.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_shift.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/reduction_functions_blockwise.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/reduction_functions_blockwise_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/reduction_functions_threadwise.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/reduction_functions_threadwise.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_softmax.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_softmax_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_dropout.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_dropout.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/device_prop.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/device_prop.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/kernel_launch.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/kernel_launch_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_fwd_xdl_cshuffle_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_fwd_xdl_cshuffle_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_ydotygrad.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_ydotygrad_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_batched_gemm_softmax_gemm_permute.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_batched_gemm_softmax_gemm_permute.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_fwd_xdl_cshuffle_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_fwd_xdl_cshuffle_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/device_gemm_trait.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_template.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_template_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_invoker.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_invoker_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_template.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/static_switch.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/static_switch.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_runner.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_runner_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/flash_api.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip ->
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [ok] 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [ok]
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 10
running install
running bdist_egg
running egg_info
creating flash_attn.egg-info
writing flash_attn.egg-info/PKG-INFO
writing dependency_links to flash_attn.egg-info/dependency_links.txt
writing requirements to flash_attn.egg-info/requires.txt
writing top-level names to flash_attn.egg-info/top_level.txt
writing manifest file 'flash_attn.egg-info/SOURCES.txt'
reading manifest file 'flash_attn.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
Successfully preprocessed all matching files.
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning:
setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
  self.initialize_options()
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
  self.initialize_options()
warning: no files found matching '*.cu' under directory 'flash_attn'
warning: no files found matching '*.h' under directory 'flash_attn'
warning: no files found matching '*.cuh' under directory 'flash_attn'
warning: no files found matching '*.cpp' under directory 'flash_attn'
warning: no files found matching '*.hpp' under directory 'flash_attn'
warning: no files found matching '*.cu' under directory 'flash_attn_rocm'
warning: no files found matching '*.h' under directory 'flash_attn_rocm'
warning: no files found matching '*.cuh' under directory 'flash_attn_rocm'
warning: no files found matching '*.cpp' under directory 'flash_attn_rocm'
warning: no files found matching '*.hpp' under directory 'flash_attn_rocm'
adding license file 'LICENSE'
adding license file 'AUTHORS'
writing manifest file 'flash_attn.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating
build/lib.linux-x86_64-cpython-38 creating build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-38/flash_attn creating build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules creating build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops creating build/lib.linux-x86_64-cpython-38/flash_attn/layers copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers copying 
flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers creating build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils creating build/lib.linux-x86_64-cpython-38/flash_attn/losses copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses creating build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models running build_ext building 'flash_attn_2_cuda' extension creating /workspace/flash-attention/build/temp.linux-x86_64-cpython-38 creating 
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc
creating /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm
creating /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src
Emitting ninja build file /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
FAILED:
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [2/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include 
-I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [3/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 
-D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: 
/opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [4/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [5/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o 
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o
/opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
[steps [6/50] through [17/50] trimmed: each repeats the same hipcc invocation, differing only in the source file, for the other batched and grouped flash_bwd_runner hdim32/hdim64/hdim128 fp16/bf16 causal/noncausal variants, and each fails with the same clang++ "cannot determine amdgcn architecture" error]
[18/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H
'-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [19/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm 
-I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [20/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 
-D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: 
/opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [21/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [22/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG 
-U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [23/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: 
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [24/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include 
-I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [25/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 
-O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; 
consider passing it via '--offload-arch' [26/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [27/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o
/opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'

[Build steps 28/50 through 34/50 fail identically -- same hipcc flags, same clang++ error -- for the following translation units:
  flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip]

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2095, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/envs/py_3.8/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 312, in <module>
    setup(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/install.py", line 84, in run
    self.do_egg_install()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/install.py", line 132, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/install_lib.py", line 111, in build
    self.run_command('build_ext')
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 88, in run
    _build_ext.run(self)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
    self.build_extensions()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 870, in build_extensions
    build_ext.build_extensions(self)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
    self._build_extensions_serial()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
    self.build_extension(ext)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
    _build_ext.build_extension(self, ext)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
    objects = self.compiler.compile(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 683, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1773, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2111, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
The command '/bin/sh -c cd /workspace/flash-attention && python setup.py install' returned a non-zero code: 1
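The root cause is the `--offload-arch=native` flag: it makes hipcc run `/opt/rocm/llvm/bin/amdgpu-arch` to probe for a GPU, and no GPU device is visible inside `docker build`, so every compile fails. A minimal workaround sketch, assuming this fork's `setup.py` honors a `GPU_ARCHS` override (the variable name may differ by branch) and using `gfx90a` (MI200) purely as an example target:

```shell
# No GPU is visible during `docker build`, so --offload-arch=native cannot
# detect one. Pin the target architecture explicitly instead.
# Assumption: GPU_ARCHS is read by this fork's setup.py; gfx90a (MI200) is
# an example value -- use the arch of the GPU the image will run on
# (e.g. gfx908 for MI100).
export GPU_ARCHS="gfx90a"
echo "GPU_ARCHS=${GPU_ARCHS}"

# Then re-run the failing step with the arch pinned:
#   cd /workspace/flash-attention && python setup.py install
```

Alternatively, run the compile step at container runtime with the devices attached (`docker run --device=/dev/kfd --device=/dev/dri ...`), where `amdgpu-arch` can detect the architecture itself.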