
Commit bb486b8

Online GPU slicing (ggml-org#11)
* move gpu slicing python code into a module
* remove dead code in exporting gpu split
* streamline solver and export with one entrypoint
* new powerinfer.py module
* wip: invoke Python to generate gpu split on the fly
* wip: load gpu split on demand
* wip: new gpu split file format
* wip: generate and load new gpu idx format
* wip: generate and load gpu index on the fly
* minor: calculate total VRAM offloading via FFN splitting
* add option to disable gpu index
* bugfix
* wip: bug fix for segment fault
* bugfix
* bugfix and testing
* temporary fix for neuron factor in solving
* fix: generated gpu idx path
* Update README about gpu index

1 parent ded0613 commit bb486b8

File tree: 16 files changed (+417 / -479 lines)

README.md (+17 -6)

@@ -71,6 +71,7 @@ And new features coming soon:
 ```bash
 git clone https://github.com/SJTU-IPADS/PowerInfer
 cd PowerInfer
+pip install -r requirements.txt # install Python helpers' dependencies
 ```
 ### Build
 In order to build PowerInfer you have two different options. These commands are supposed to be run from the root directory of the project.
@@ -89,7 +90,8 @@ cmake --build build --config Release

 ## Model Weights

-PowerInfer models are stored in a special format called *PowerInfer GGUF* based on GGUF format, consisting of both LLM weights and predictor weights. You can download PowerInfer GGUF weights from Hugging Face or convert them from the original model weights and predictor weights.
+PowerInfer models are stored in a special format called *PowerInfer GGUF* based on GGUF format, consisting of both LLM weights and predictor weights.
+You can obtain PowerInfer GGUF weights at `*.powerinfer.gguf` as well as profiled model activation statistics under `activation/` for 'hot'-neuron offloading from each Hugging Face model repo under "PowerInfer GGUF Format" column. You can also convert them from the original model weights and predictor weights.

 | Base Model | PowerInfer GGUF Format | Original Model | Predictor |
 |------------|------------------|----------------|---------------------|
@@ -102,14 +104,16 @@ PowerInfer models are stored in a special format called *PowerInfer GGUF* based

 For CPU-only and CPU-GPU hybrid inference with all available VRAM, you can use the following instructions to run PowerInfer:
 ```bash
-./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
+./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
+# ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
 ```
+
 If you want to limit the VRAM usage of GPU:
 ```bash
-./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
+./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
+# ./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8
 ```
-
-As for now, it requires an offline-generated "GPU index" file to split FFNs on GPU. And we found these files are hard to maintain and distribute. We will ship automatic FFN split based on VRAM capacity via [#11](https://github.com/SJTU-IPADS/PowerInfer/pull/11) very soon.
+Under CPU-GPU hybrid inference, PowerInfer will automatically offload all dense activation blocks to GPU and split FFN on GPU if possible.

 ## Evaluation

@@ -119,6 +123,13 @@ As for now, it requires an offline-generated "GPU index" file to split FFNs on G

 PowerInfer achieves up to 11x and 8x speedup for FP16 and INT4 models!

+## FAQs
+1. What if I encountered `CUDA_ERROR_OUT_OF_MEMORY`?
+   - You can try to run with `--reset-gpu-index` argument to rebuild GPU index for this model to avoid any stale cache.
+   - Due to our current implementation, model offloading might not be accurate as expected. You can try with `--vram-budget` with a slightly lower value or `--disable-gpu-index` to disable FFN offloading.
+2. What if...
+   - Issues are welcomed! Please feel free to open an issue and attach your running environment and running parameters. We will try our best to help you.
+
 ## TODOs
 We will release the code and data in the following order, please stay tuned!

@@ -130,7 +141,7 @@ We will release the code and data in the following order, please stay tuned!
 - [ ] Support Metal for Mac
 - [ ] Release code for OPT models
 - [ ] Release predictor training code
-- [ ] Support online split for FFN network
+- [x] Support online split for FFN network
 - [ ] Support Multi-GPU
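The updated README points to per-model Hugging Face repos that ship both the `*.powerinfer.gguf` weights and the profiled activation statistics under `activation/`. As a hedged illustration of that download step (not part of this commit), the sketch below uses `huggingface_hub.snapshot_download`; the repo id and local directory are assumed names for the example.

```python
# Sketch: fetch a PowerInfer GGUF model together with its activation statistics
# from a Hugging Face repo. Repo id and local path are illustrative assumptions.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF",     # assumed repo name
    allow_patterns=["*.powerinfer.gguf", "activation/*"],  # weights + 'hot'-neuron stats
    local_dir="./ReluLLaMA-7B-PowerInfer-GGUF",
)
print("downloaded to", local_dir)
```

The downloaded `*.powerinfer.gguf` file is what the `-m` flag in the commands above expects.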

common/common.cpp (+8 -25)

@@ -471,12 +471,10 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
                 break;
             }
             params.lora_base = argv[i];
-        } else if (arg == "--gpu-index") {
-            if (++i >= argc) {
-                invalid_param = true;
-                break;
-            }
-            params.gpu_index = argv[i];
+        } else if (arg == "--reset-gpu-index") {
+            params.reset_gpu_index = true;
+        } else if (arg == "--disable-gpu-index") {
+            params.disale_gpu_index = true;
         } else if (arg == "--mmproj") {
             if (++i >= argc) {
                 invalid_param = true;
@@ -910,6 +908,8 @@ struct llama_model_params llama_model_params_from_gpt_params(const gpt_params &
     mparams.tensor_split = params.tensor_split;
     mparams.use_mmap = params.use_mmap;
     mparams.use_mlock = params.use_mlock;
+    mparams.reset_gpu_index = params.reset_gpu_index;
+    mparams.disable_gpu_index = params.disale_gpu_index;

     return mparams;
 }
@@ -968,24 +968,6 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
         return std::make_tuple(nullptr, nullptr);
     }

-    if (llama_use_sparse_inference(model)) {
-        fprintf(stderr, "%s: postprocessing PowerInfer model '%s'\n", __func__, params.model.c_str());
-        if (!params.gpu_index.empty()) {
-            int err = llama_model_apply_gpu_idx_from_file(model, params.gpu_index.c_str(), true);
-            if (err != 0) {
-                fprintf(stderr, "%s: error: failed to apply mlp adapter\n", __func__);
-                llama_free_model(model);
-                return std::make_tuple(nullptr, nullptr);
-            }
-        }
-
-        if (llama_model_apply_augmentation(model) != 0) {
-            fprintf(stderr, "%s: error: failed to apply augmentation\n", __func__);
-            llama_free_model(model);
-            return std::make_tuple(nullptr, nullptr);
-        }
-    }
-
     auto cparams = llama_context_params_from_gpt_params(params);
     llama_context * lctx = llama_new_context_with_model(model, cparams);
     if (lctx == NULL) {
@@ -1357,7 +1339,8 @@ void dump_non_result_info_yaml(FILE * stream, const gpt_params & params, const l
         fprintf(stream, " - %s: %f\n", std::get<0>(la).c_str(), std::get<1>(la));
     }
     fprintf(stream, "lora_base: %s\n", params.lora_base.c_str());
-    fprintf(stream, "gpu_index: %s\n", params.gpu_index.c_str());
+    fprintf(stream, "reset_gpu_index: %s\n", params.reset_gpu_index ? "true" : "false");
+    fprintf(stream, "disable_gpu_index: %s\n", params.disale_gpu_index? "true": "false");
     fprintf(stream, "main_gpu: %d # default: 0\n", params.main_gpu);
     fprintf(stream, "memory_f32: %s # default: false\n", !params.memory_f16 ? "true" : "false");
     fprintf(stream, "mirostat: %d # default: 0 (disabled)\n", sparams.mirostat);

common/common.h (+2 -1)

@@ -91,7 +91,8 @@ struct gpt_params {
     std::vector<std::tuple<std::string, float>> lora_adapter; // lora adapter path with user defined scale
     std::string lora_base = ""; // base model path for the lora adapter

-    std::string gpu_index = ""; // sparse activation mlp adapter path
+    bool reset_gpu_index = false; // refresh the gpu index file
+    bool disale_gpu_index = false; // disable loading gpu index and splitting ffn

     int ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
     int ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line

examples/batched/batched.cpp (+4 -20)

@@ -48,12 +48,11 @@ int main(int argc, char ** argv) {
         params.n_threads = std::atoi(argv[6]);
     }

-    if (argc >= 8) {
-        params.gpu_index = argv[7];
-    }
+    // For testing purposes, we always reset the GPU index
+    params.reset_gpu_index = true;

-    printf("params: model = %s, prompt = %s, n_parallel = %d, n_len = %d, n_gpu_layers = %d, n_threads = %d, gpu_index = %s\n",
-        params.model.c_str(), params.prompt.c_str(), n_parallel, n_len, n_gpu_layers, params.n_threads, params.gpu_index.c_str());
+    printf("params: model = %s, prompt = %s, n_parallel = %d, n_len = %d, n_gpu_layers = %d, n_threads = %d, reset_gpu_index = true\n",
+        params.model.c_str(), params.prompt.c_str(), n_parallel, n_len, n_gpu_layers, params.n_threads);

     if (params.prompt.empty()) {
         params.prompt = "Hello my name is";
@@ -76,21 +75,6 @@ int main(int argc, char ** argv) {
         return 1;
     }

-    if (!params.gpu_index.empty()) {
-        int err = llama_model_apply_gpu_idx_from_file(model, params.gpu_index.c_str(), true);
-        if (err != 0) {
-            fprintf(stderr, "%s: error: failed to apply mlp adapter\n", __func__);
-            llama_free_model(model);
-            return 1;
-        }
-    }
-
-    if (llama_model_apply_augmentation(model) != 0) {
-        fprintf(stderr, "%s: error: failed to apply model augmentation\n", __func__);
-        llama_free_model(model);
-        return 1;
-    }
-
     // tokenize the prompt

     std::vector<llama_token> tokens_list;

ggml.c (+4 -4)

@@ -17497,7 +17497,7 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
     }

     const int n_threads = cplan->n_threads;
-#ifdef LLAMA_CUBLAS
+#ifdef GGML_USE_CUBLAS
     struct ggml_compute_state_shared state_shared = {
         /*.cgraph =*/ cgraph,
         /*.cgraph_plan =*/ cplan,
@@ -17534,7 +17534,7 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
             .ith = j,
             .shared = &state_shared,
         };
-#ifdef LLAMA_CUBLAS
+#ifdef GGML_USE_CUBLAS
         const int rc = ggml_thread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread_hybrid, &workers[j]);
 #else
         const int rc = ggml_thread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
@@ -17551,7 +17551,8 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
     const int64_t perf_start_time_us = ggml_perf_time_us();

     // this is a work thread too
-#ifdef LLAMA_CUBLAS
+
+#ifdef GGML_USE_CUBLAS
     int compute_status = (size_t) ggml_graph_compute_thread_hybrid(&workers[0]);
 #else
     int compute_status = (size_t) ggml_graph_compute_thread(&workers[0]);
@@ -19590,7 +19591,6 @@ struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_p
         sparse_deriv = GGML_DENSE_INFERENCE;
     } else if (strncmp(magic, GGUF_POWERINFER_MAGIC, sizeof(magic)) == 0) {
         sparse_deriv = GGML_SPARSE_INFERENCE;
-        fprintf(stderr, "%s: PowerInfer derived model detected. Sparse inference will be used.\n", __func__);
     } else {
         fprintf(stderr, "%s: invalid magic characters %s.\n", __func__, magic);
         fclose(file);

gguf-py/gguf/constants.py (+6)

@@ -74,6 +74,9 @@ class Tokenizer:
     class PowerInfer:
         SPARSE_THRESHOLD = "powerinfer.sparse_threshold"

+    class Split:
+        VRAM_CAPACITY = "split.vram_capacity"
+

 #
 # recommended mapping of model tensor names for storage in gguf
@@ -385,6 +388,9 @@ class GGMLQuantizationType(IntEnum):
     Q5_K = 13
     Q6_K = 14
     Q8_K = 15
+    I8 = 16,
+    I16 = 17
+    I32 = 18,


 class GGUFEndian(IntEnum):
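The new `Split.VRAM_CAPACITY` key gives the online solver a place to record the VRAM capacity that a generated GPU split was planned against. Below is a minimal sketch, assuming the gguf-py `GGUFWriter` API, of how such a key-value pair could be written; the output filename, architecture string, and 8 GiB figure are illustrative assumptions, not part of this commit.

```python
# Sketch: store a planned VRAM capacity under the new "split.vram_capacity" key.
# File name, architecture string, and budget value are illustrative only.
import gguf

VRAM_BUDGET_BYTES = 8 * 1024**3  # assumed 8 GiB budget

writer = gguf.GGUFWriter("gpu-split.example.gguf", "llama")
writer.add_uint64("split.vram_capacity", VRAM_BUDGET_BYTES)  # key added above as Split.VRAM_CAPACITY

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```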
