> docker build --network=host -t ci-flash_attention_flash_attention.ubuntu.amd --pull -f docker/flash_attention.ubuntu.amd.Dockerfile ./docker
Sending build context to Docker daemon  274.4kB
Step 1/8 : ARG BASE_DOCKER=rocm/pytorch-nightly
Step 2/8 : FROM $BASE_DOCKER
latest: Pulling from rocm/pytorch-nightly
Digest: sha256:fde1d1f2805cee71e27ebc701a123c64628302e7e5df40408c1c34b3cba58495
Status: Image is up to date for rocm/pytorch-nightly:latest
 ---> 4b23063b26d6
Step 3/8 : WORKDIR /workspace
 ---> Running in 2c7e2062e0ea
Removing intermediate container 2c7e2062e0ea
 ---> 0e76a94dd324
Step 4/8 : RUN ls /opt/conda/envs
 ---> Running in de33c073e2d1
py_3.8
Removing intermediate container de33c073e2d1
 ---> bc15461bc4f0
Step 5/8 : RUN pip install ninja
 ---> Running in 0aac2ebd7f48
Collecting ninja
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 19.1 MB/s eta 0:00:00
Installing collected packages: ninja
Successfully installed ninja-1.11.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Removing intermediate container 0aac2ebd7f48
 ---> e2c579dd1e98
Step 6/8 : RUN git clone -b flash_attention_for_rocm --recurse-submodules https://github.com/ROCmSoftwarePlatform/flash-attention.git
 ---> Running in b96a89654dde
Cloning into 'flash-attention'...
Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'csrc/cutlass'
Submodule 'csrc/flash_attn_rocm/composable_kernel' (https://github.com/ROCmSoftwarePlatform/composable_kernel) registered for path 'csrc/flash_attn_rocm/composable_kernel'
Cloning into '/workspace/flash-attention/csrc/cutlass'...
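The steps logged so far imply a Dockerfile roughly like the following. This is a sketch reconstructed from the build output above, not the actual `docker/flash_attention.ubuntu.amd.Dockerfile`; the log shows 8 steps, and the remaining steps appear later in the transcript, so only what is visible here is reproduced.

```dockerfile
# Sketch reconstructed from the build log (steps 1-6 of 8 shown so far)
ARG BASE_DOCKER=rocm/pytorch-nightly
FROM $BASE_DOCKER

WORKDIR /workspace

# Sanity check: list the conda environments shipped in the base image
RUN ls /opt/conda/envs

# Build dependency for the C++/HIP extension
RUN pip install ninja

# ROCm port of flash-attention, with cutlass and composable_kernel submodules
RUN git clone -b flash_attention_for_rocm --recurse-submodules \
    https://github.com/ROCmSoftwarePlatform/flash-attention.git
```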
Cloning into '/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel'...
Submodule path 'csrc/cutlass': checked out 'c4f6b8c6bc94ff69048492fb34df0dfaf1983933'
Submodule path 'csrc/flash_attn_rocm/composable_kernel': checked out '5ff2d646e893de55adebaa988e5dc547cbc21954'
Removing intermediate container b96a89654dde
 ---> d73388c98228
Step 7/8 : RUN cd /workspace/flash-attention && python setup.py install
 ---> Running in bc2eb5303461
Warning: Torch did not find available GPUs on this system. If your intention is to cross-compile, this is not an error. By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2), Volta (compute capability 7.0), Turing (compute capability 7.5), and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0). If you wish to cross-compile for a single specific architecture, export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
torch.__version__ = 2.3.0a0+gitac0bed0
RTZ IS USED
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/ck.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/ck.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include/ck/library/utility/device_memory.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include/ck/library/utility/device_memory_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/gemm_specialization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/gemm_specialization.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_specialization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_specialization.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/integral_constant.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/integral_constant.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/enable_if.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/enable_if.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/number.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/number.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional2.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/data_type.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/data_type.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/math_v2.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/f8_utils.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/f8_utils.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/random_gen.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/random_gen.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type_convert.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/type_convert.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_id.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_id.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/quantization_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/quantization_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/src/utils.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/utils_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/params.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/params_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/sequence_helper.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional4.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional4.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/tuple_helper.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_element_picker.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_element_picker.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/container_helper.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array_multi_index.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/array_multi_index_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array_multi_index.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/statically_indexed_array_multi_index_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/multi_index.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/multi_index_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional3.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/functional3_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/ignore.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/ignore.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/magic_division.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/magic_division.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/c_style_pointer_cast.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/c_style_pointer_cast.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/is_known_at_compile_time.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/is_known_at_compile_time.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/transpose_vectors.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/transpose_vectors.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/inner_product.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/inner_product.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/thread_group.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/thread_group.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/debug.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/debug_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wave_read_first_lane.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_wave_read_first_lane.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/generic_memory_space_atomic.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/generic_memory_space_atomic.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/synchronization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/synchronization_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_address_space.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_address_space.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/static_buffer.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/static_buffer.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/dynamic_buffer.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/dynamic_buffer.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_inline_asm.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_inline_asm.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_xdlops.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/amd_xdlops.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/common_header_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/philox_rand.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/philox_rand.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/multi_index_transform_helper_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_helper.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_descriptor_helper_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_layout.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/tensor_layout.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/stream_config.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/stream_config.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_base.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_base.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/masking_specialization.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/masking_specialization.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_grouped_gemm_softmax_gemm_permute.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_grouped_gemm_softmax_gemm_permute.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/matrix_padder.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/matrix_padder_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_adaptor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_adaptor_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/block_to_ctile_map_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_space_filling_curve.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/tensor_space_filling_curve_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/warp/xdlops_gemm_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_xdlops_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/cluster_descriptor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_description/cluster_descriptor_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor/static_tensor.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor/static_tensor.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v3r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v3r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v4r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v4r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v6r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v6r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v6r1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/thread_group_tensor_slice_transfer_v6r1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_enums.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_enums.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_common.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_common.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_operator.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_operator.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_functions_accumulate.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/reduction_functions_accumulate.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_shift.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/utility/get_shift.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/reduction_functions_blockwise.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/reduction_functions_blockwise_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/reduction_functions_threadwise.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/thread/reduction_functions_threadwise.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_softmax.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_softmax_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_dropout.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_dropout.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/operator_transform/transform_contraction_to_gemm_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/device_prop.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/device_prop.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/kernel_launch.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/kernel_launch_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_fwd_xdl_cshuffle_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_fwd_xdl_cshuffle_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_ydotygrad.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_ydotygrad_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_light_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_mha_bwd_xdl_cshuffle_qloop_light_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_batched_gemm_softmax_gemm_permute.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/device_batched_gemm_softmax_gemm_permute.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_fwd_xdl_cshuffle_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_fwd_xdl_cshuffle_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v1.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v1_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v2.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_batched_mha_bwd_xdl_cshuffle_qloop_light_v2_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/device_gemm_trait.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/device_gemm_trait_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_template.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_template_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_invoker.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/bwd_device_gemm_invoker_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_template.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_template_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/fwd_device_gemm_invoker_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/static_switch.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/static_switch.hpp [skipped, no changes]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_runner.hpp -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_runner_hip.hpp [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/flash_api.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip ->
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [ok] 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [ok] /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [ok]
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [ok]
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 10
running install
running bdist_egg
running egg_info
creating flash_attn.egg-info
writing flash_attn.egg-info/PKG-INFO
writing dependency_links to flash_attn.egg-info/dependency_links.txt
writing requirements to flash_attn.egg-info/requires.txt
writing top-level names to flash_attn.egg-info/top_level.txt
writing manifest file 'flash_attn.egg-info/SOURCES.txt'
reading manifest file 'flash_attn.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
Successfully preprocessed all matching files.
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning:
setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
  self.initialize_options()
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
  self.initialize_options()
warning: no files found matching '*.cu' under directory 'flash_attn'
warning: no files found matching '*.h' under directory 'flash_attn'
warning: no files found matching '*.cuh' under directory 'flash_attn'
warning: no files found matching '*.cpp' under directory 'flash_attn'
warning: no files found matching '*.hpp' under directory 'flash_attn'
warning: no files found matching '*.cu' under directory 'flash_attn_rocm'
warning: no files found matching '*.h' under directory 'flash_attn_rocm'
warning: no files found matching '*.cuh' under directory 'flash_attn_rocm'
warning: no files found matching '*.cpp' under directory 'flash_attn_rocm'
warning: no files found matching '*.hpp' under directory 'flash_attn_rocm'
adding license file 'LICENSE'
adding license file 'AUTHORS'
writing manifest file 'flash_attn.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating
build/lib.linux-x86_64-cpython-38 creating build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-38/flash_attn creating build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules creating build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops creating build/lib.linux-x86_64-cpython-38/flash_attn/layers copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers copying 
flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers creating build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils creating build/lib.linux-x86_64-cpython-38/flash_attn/losses copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses creating build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models running build_ext building 'flash_attn_2_cuda' extension creating /workspace/flash-attention/build/temp.linux-x86_64-cpython-38 creating 
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc
creating /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm
creating /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src
Emitting ninja build file /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
FAILED:
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [2/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include 
-I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [3/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 
-D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: 
/opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [4/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [5/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o 
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o
/opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
[steps [6/50] through [17/50] trimmed: each repeats the same hipcc invocation, differing only in the source file, for the other batched and grouped flash_bwd_runner hdim32/hdim64/hdim128 fp16/bf16 causal/noncausal variants, and each fails with the same clang++ "cannot determine amdgcn architecture" error]
[18/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H
'-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [19/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm 
-I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/device_memory_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [20/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 
-D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: 
/opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [21/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [22/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG 
-U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [23/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: 
/workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [24/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include 
-I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [25/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 
-O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; 
consider passing it via '--offload-arch' [26/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include 
-I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch' [27/50] /opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c 
/workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
FAILED: /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o
/opt/rocm/bin/hipcc -I/workspace/flash-attention/csrc/flash_attn_rocm -I/workspace/flash-attention/csrc/flash_attn_rocm/src -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/workspace/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.8/include/python3.8 -c -c /workspace/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /workspace/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'

[Build steps 28/50 through 34/50 fail identically -- same hipcc flags, same clang++ error -- for the following translation units:
  flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip
  flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip]

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2095, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/envs/py_3.8/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 312, in <module>
    setup(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/install.py", line 84, in run
    self.do_egg_install()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/install.py", line 132, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/install_lib.py", line 111, in build
    self.run_command('build_ext')
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 88, in run
    _build_ext.run(self)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
    self.build_extensions()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 870, in build_extensions
    build_ext.build_extensions(self)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
    self._build_extensions_serial()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
    self.build_extension(ext)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
    _build_ext.build_extension(self, ext)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
    objects = self.compiler.compile(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 683, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1773, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2111, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
The command '/bin/sh -c cd /workspace/flash-attention && python setup.py install' returned a non-zero code: 1
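The root cause is the `--offload-arch=native` flag: it makes hipcc run `/opt/rocm/llvm/bin/amdgpu-arch` to probe for a GPU, and no GPU device is visible inside `docker build`, so every compile fails. A minimal workaround sketch, assuming this fork's `setup.py` honors a `GPU_ARCHS` override (the variable name may differ by branch) and using `gfx90a` (MI200) purely as an example target:

```shell
# No GPU is visible during `docker build`, so --offload-arch=native cannot
# detect one. Pin the target architecture explicitly instead.
# Assumption: GPU_ARCHS is read by this fork's setup.py; gfx90a (MI200) is
# an example value -- use the arch of the GPU the image will run on
# (e.g. gfx908 for MI100).
export GPU_ARCHS="gfx90a"
echo "GPU_ARCHS=${GPU_ARCHS}"

# Then re-run the failing step with the arch pinned:
#   cd /workspace/flash-attention && python setup.py install
```

Alternatively, run the compile step at container runtime with the devices attached (`docker run --device=/dev/kfd --device=/dev/dri ...`), where `amdgpu-arch` can detect the architecture itself.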