convert Yi34B model fail #120

Closed
leiwen83 opened this issue Dec 6, 2023 · 17 comments

@leiwen83

leiwen83 commented Dec 6, 2023

Hi,

I tried to convert the Yi34B model with tools/llama_convert.py, but hit an error...

python tools/llama_convert.py  -i /data/tmp/Yi-34B -o /data/tmp/Yi-34B_xfaster/

=============== Argument ===============
saved_dir: /data/tmp/Yi-34B_xfaster/
in_file: /data/tmp/Yi-34B
processes: 8
weight_data_type: fp32
========================================
Loading checkpoint shards:   0%|                                 | 0/7 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "tools/llama_convert.py", line 225, in <module>
    split_and_convert(args)
  File "tools/llama_convert.py", line 91, in split_and_convert
    model = LlamaForCausalLM.from_pretrained(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3228, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 720, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/bin/python3.8/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 7168]) in "weight" (which has shape torch.Size([7168, 7168])), this look incorrect.
@marvin-Yu
Contributor

Yi requires a newer version of transformers; you can get past this error by upgrading to 4.34.0. Although Yi's structure is consistent with Llama, XFT has not been adapted for it yet, so it is uncertain whether it will be seamlessly compatible. Contributions adapting the code for Yi are welcome. thx~
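
A quick sanity check before rerunning the converter (a minimal sketch; 4.34.0 is simply the version suggested above):

# Verify the installed transformers version before rerunning tools/llama_convert.py.
# Assumes transformers (and its packaging dependency) is already installed.
import transformers
from packaging import version

installed = version.parse(transformers.__version__)
required = version.parse("4.34.0")
print(f"transformers {installed}; >= {required}: {installed >= required}")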

@leiwen83
Author

leiwen83 commented Dec 7, 2023

After upgrading transformers to 4.34, the model conversion works now.

However, the sample code core dumps when I run it: it prints the prompt back and then reports an illegal instruction.

Once upon a time, there existed a little girl who liked to have Illegal instruction (core dumped)

@Duyi-Wang
Contributor

After upgrading transformers to 4.34, the model conversion works now.

However, the sample code core dumps when I run it: it prints the prompt back and then reports an illegal instruction.

Once upon a time, there existed a little girl who liked to have Illegal instruction (core dumped)

What's your CPU platform? Could you provide the output of lscpu?

@leiwen83
Author

leiwen83 commented Dec 7, 2023

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-95
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel(R) Corporation
  Model name:            Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
    BIOS Model name:     Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  24
    Socket(s):           2
    Stepping:            7
    CPU max MHz:         3900.0000
    CPU min MHz:         1000.0000
    BogoMIPS:            4800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.5 MiB (48 instances)
  L1i:                   1.5 MiB (48 instances)
  L2:                    48 MiB (48 instances)
  L3:                    71.5 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-23,48-71
  NUMA node1 CPU(s):     24-47,72-95
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: Split huge pages
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT vulnerable

@jackywei

jackywei commented Dec 7, 2023

This is a 2nd gen Intel Xeon processor (CLX), so it only has AVX512 instructions; please refer to https://github.com/intel/xFasterTransformer/wiki/How-to-build#build-on-icx--clx.
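
A quick way to check what the machine actually supports (a rough sketch, Linux-only, not part of xFT; it just reads /proc/cpuinfo for the ISA flags discussed in this thread):

# Rough ISA capability check on Linux by reading /proc/cpuinfo.
# avx512f is present on CLX (as in the lscpu output above);
# avx512_bf16 / amx_bf16 / amx_int8 would only appear on newer Xeons.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for isa in ("avx512f", "avx512_vnni", "avx512_bf16", "avx512_fp16", "amx_bf16", "amx_int8"):
    print(f"{isa:12s} {'yes' if isa in flags else 'no'}")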

@leiwen83
Author

leiwen83 commented Dec 7, 2023

This is a 2nd gen Intel Xeon processor (CLX), so it only has AVX512 instructions; please refer to https://github.com/intel/xFasterTransformer/wiki/How-to-build#build-on-icx--clx.

Hi ~
I changed CMakeLists.txt to:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 3e4739d..d391f22 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -76,11 +76,11 @@ else()
 endif()

 # Enable AVX512_FP16 optimization
-# add_definitions(-DAVX512_FP32_WEIGHT_ONLY_FP16=true)
-add_definitions(-DAVX512_FP16_WEIGHT_ONLY_FP16=true)
-add_definitions(-DAVX512_BF16_WEIGHT_ONLY_BF16=true)
-# add_definitions(-DAVX512_FP32_WEIGHT_ONLY_INT8=true)
-add_definitions(-DAVX512_FP16_WEIGHT_ONLY_INT8=true)
+add_definitions(-DAVX512_FP32_WEIGHT_ONLY_FP16=true)
+# add_definitions(-DAVX512_FP16_WEIGHT_ONLY_FP16=true)
+# add_definitions(-DAVX512_BF16_WEIGHT_ONLY_BF16=true)
+add_definitions(-DAVX512_FP32_WEIGHT_ONLY_INT8=true)
+# add_definitions(-DAVX512_FP16_WEIGHT_ONLY_INT8=true)
 # add_definitions(-DDEBUG=true)
 # add_definitions(-DSTEP_BY_STEP_ATTN=true)
 # add_definitions(-DTIMELINE=true)

However, it silently quits when calling xfastertransformer.AutoModel.from_pretrained.
Before that call, there are some warning messages:

2023:12:07-14:21:14:(22287) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2023:12:07-14:21:14:(22287) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
/data/aigc/workset/xFasterTransformer/src/utils/matmul_helper.h:296: Need to define WEIGHT_ONLY_BF16 kernel data type.

@leiwen83
Author

leiwen83 commented Dec 7, 2023

After switching to fp16:
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="fp16")

it seems to work, but very slowly: it took almost one minute to generate the first token, and then there was no response afterwards...

So is there any benchmark data for different sizes of LLMs?

@Duyi-Wang
Contributor

Did you run it with the BF16 dtype the first time? Our BF16 solution needs AMX hardware support (4th gen Xeon).
What's your command to run it? You may not have configured the core resources properly.
For example:

# If you want to run on 1 node
numactl -C 0-23 python script.py
# or bind to a node with a specified thread count
OMP_NUM_THREADS=24 numactl -N 0 python script.py

# If you want to run on 2 nodes, please use multi-rank mode with oneCCL
OMP_NUM_THREADS=24 mpirun --map-by socket -n 2 python script.py
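
To confirm the binding actually took effect, a small standard-library check (not part of xFT) can be run from inside the same Python process:

# Show which CPUs this process is allowed to run on and the OpenMP thread setting.
import os

cpus = sorted(os.sched_getaffinity(0))
print(f"bound to {len(cpus)} CPUs: {cpus}")
print(f"OMP_NUM_THREADS = {os.environ.get('OMP_NUM_THREADS')}")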

@leiwen83
Author

leiwen83 commented Dec 8, 2023

Did you run it with the BF16 dtype the first time? Our BF16 solution needs AMX hardware support (4th gen Xeon). What's your command to run it? You may not have configured the core resources properly. For example:

# If you want to run on 1 node
numactl -C 0-23 python script.py
# or bind to a node with a specified thread count
OMP_NUM_THREADS=24 numactl -N 0 python script.py

# If you want to run on 2 nodes, please use multi-rank mode with oneCCL
OMP_NUM_THREADS=24 mpirun --map-by socket -n 2 python script.py

When run with bf16, it stops after printing the warning message below:
/data/aigc/workset/xFasterTransformer/src/utils/matmul_helper.h:296: Need to define WEIGHT_ONLY_BF16 kernel data type.

@Duyi-Wang
Contributor

Did you run it with the BF16 dtype the first time? Our BF16 solution needs AMX hardware support (4th gen Xeon). What's your command to run it? You may not have configured the core resources properly. For example:

# If you want to run on 1 node
numactl -C 0-23 python script.py
# or bind to a node with a specified thread count
OMP_NUM_THREADS=24 numactl -N 0 python script.py

# If you want to run on 2 nodes, please use multi-rank mode with oneCCL
OMP_NUM_THREADS=24 mpirun --map-by socket -n 2 python script.py

When run with bf16, it stops after printing the warning message below: /data/aigc/workset/xFasterTransformer/src/utils/matmul_helper.h:296: Need to define WEIGHT_ONLY_BF16 kernel data type.

Yes, since BF16 requires AMX hardware support and you disabled the BF16 option during compilation, this is expected. BF16 is currently not supported on the CLX platform.

@Duyi-Wang
Contributor

What's your command to run FP16?

@leiwen83
Author

leiwen83 commented Dec 8, 2023

What's your command to run FP16?

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="fp16")

@Duyi-Wang
Contributor

What's your command to run FP16?

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="fp16")

So you just ran this in the Python CLI?
If the cores are not properly bound and the process runs across NUMA nodes, the expected performance will not be achieved; it can be very slow.

You can open the Python CLI like this:

# If you want to run on 1 node
numactl -C 0-23 python

If you want to run on 2 nodes, you need to save the Python code into script.py and then:

# If you want to run on 2 nodes, please use multi-rank mode with oneCCL
OMP_NUM_THREADS=24 mpirun --map-by socket -n 2 python script.py
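
For reference, a minimal script.py could look like the sketch below. This is only an illustrative sketch: the paths are the placeholder paths from this thread, the tokenizer is loaded from the original HF checkpoint, and the generate() call follows the pattern of xFT's example scripts, so the exact arguments may differ.

# Minimal single-prompt generation sketch for xFasterTransformer (illustrative only).
from transformers import AutoTokenizer
import xfastertransformer

MODEL_PATH = "/data/tmp/Yi-34B_xfaster/"   # converted xFT weights (path from this thread)
TOKEN_PATH = "/data/tmp/Yi-34B"            # original HF checkpoint, used for the tokenizer

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="fp16")

input_ids = tokenizer("Once upon a time, there existed a little girl", return_tensors="pt").input_ids
# For the 2-node mpirun case, xFT's multi-rank examples add extra rank handling;
# this sketch only covers the single-process case.
output = model.generate(input_ids, max_length=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))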

@leiwen83
Author

leiwen83 commented Dec 8, 2023

With this: numactl -C 0-23 python

The generation speed is much faster now. For fp16, is AMX being used?
And is there any data comparing the performance of 2nd gen and 4th gen Xeon?

@Duyi-Wang
Contributor

With this: numactl -C 0-23 python

The generation speed is much faster now. For fp16, is AMX being used? And is there any data comparing the performance of 2nd gen and 4th gen Xeon?

AMX on SPR only supports BF16 and INT8 for now, but fp16 will still perform much better on SPR since SPR supports avx512_fp16 instructions. SPR performance is much better than CLX even without AMX, since it has more cores and higher memory bandwidth.
You can benchmark the performance using the scripts in the benchmark directory.

@Duyi-Wang
Contributor

@leiwen83 Feel free to join our WeChat Group for additional details and prompt assistance. https://github.com/intel/xFasterTransformer/wiki

@pujiang2018
Contributor

close as fixed.
