<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="Content-Security-Policy"
content="default-src 'self' data:;
connect-src data: 'self';
script-src 'self' 'unsafe-inline' 'unsafe-eval';
style-src 'self' data: 'unsafe-inline';
img-src 'self' blob: data:;">
<!--<meta name="viewport"
content="width=device-width,
height=device-height,
initial-scale=1.0,
minimum-scale=1.0">-->
<script src="./resources/plotly-2.23.2.min.js" charset="utf-8"></script>
<script>
window.addEventListener("load", async function() {
  // Fetch the benchmark data and render one Plotly graph per dataset.
  const datasets = await fetch("./data/html-data/data.json").then(x => x.json());
  for (const setname of Object.keys(datasets)) {
    const element = document.createElement("div");
    element.id = setname;
    document.body.appendChild(element);
    const data = datasets[setname];
    const layout = {
      //margin: { t: 0 },
      title: setname,
      showlegend: true
    };
    const options = {
      scrollZoom: true
    };
    Plotly.newPlot(element, data, layout, options);
  }
});
</script>
<style>
body {
/*font-family: sans-serif;*/
line-height: 1.7;
width: min(1260px, 80%);
margin: 0px auto;
padding: 1em;
box-sizing: border-box;
}
h1 {
text-align: center;
margin: 4em;
}
</style>
<title>Speed of LLaMa CPU-based Inference Across Select System Configurations 🍅️</title>
</head>
<body>
<h1>Speed of LLaMa CPU-based Inference Across Select System Configurations</h1>
<p>This page compares the speed of CPU-only inference with llama.cpp across various system and inference configurations, to shed light on how configuration changes affect inference speed.
<div style="columns: 2">
<div style="break-inside: avoid-column;">
<h4>Measured Metrics</h4>
<ul>
<li>Relative Load Time <code>load_time_median</code>
<li>Token Sample Time <code>sample_time_median</code>
<li>Prompt Token Evaluation Time <code>prompt_eval_time_median</code>
<li>Token Evaluation Time <code>eval_time_median</code>
<li>Relative Total Time <code>total_time_median</code>
</ul>
<p>Refer to the llama.cpp documentation for more information; a sample of the timing summary these metrics are drawn from appears below the configuration lists.
</div>
<div style="break-inside: avoid-column;">
<h4>System Configuration</h4>
<ul>
<li>AMD 7950X (16c/32t), X670E-E
<li>128GiB DDR5 6400MT/s CL32-39-39-102
<li>SAMSUNG 970 EVO Plus SSD 1TB NVMe M.2 V-NAND
</ul>
</div>
<div style="break-inside: avoid-column;">
<h4>llama.cpp Configuration</h4>
<ul>
<li>Version: ac7876a
<li>LLM Models: LLaMa 7B, 13B, 30B, and 65B
<li>CLI parameters used: <code>-t</code>, <code>-n 40</code>, <code>--ctx-size</code> (see the example invocation below)
</ul>
</div>
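<p>For reference, a single run resembled the following invocation (a hypothetical example; the model path, thread count, and context size varied per run):
<pre>
./main -m ./models/7B/ggml-model-q4_0.bin -t 16 -n 40 --ctx-size 512
</pre>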
<div style="break-inside: avoid-column;">
<h4>System Configuration Variations</h4>
<ul>
<li>128GiB 4 DIMM @ 3?00MT/s, <code>schedutil</code> OS CPU frequency governor.
<li>64GiB 2 DIMM @ 5200MT/s, <code>schedutil</code> OS CPU frequency governor.
<li>64GiB 2 DIMM @ 5200MT/s, <code>performance</code> OS CPU frequency governor.
</ul>
</div>
<div style="break-inside: avoid-column;">
<h4>llama.cpp Configuration Variations</h4>
<ul>
<li>Concurrent Instances: 1, 3
<li>Threads: 1–20
<li>Context sizes: 512, 2048 tokens
<li>Quantization: ggml @ q4_0
</ul>
</div>
</div>
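<p>The <code>*_median</code> metrics listed above are taken from llama.cpp's end-of-run timing summary, which looks approximately like this (the numbers below are illustrative, not measured results):
<pre>
llama_print_timings:        load time =  1200.00 ms
llama_print_timings:      sample time =    25.00 ms /    40 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time =   800.00 ms /     8 tokens (  100.00 ms per token)
llama_print_timings:        eval time =  4800.00 ms /    39 runs   (  123.08 ms per token)
llama_print_timings:       total time =  6900.00 ms
</pre>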
<h2>Working with the Graphs & Data</h2>
<p>The graphs on this page are best viewed on a desktop computer.
<p>The horizontal x-axis denotes the number of threads. The vertical y-axis denotes time, measured in milliseconds.
<p>For a less cluttered view, hide all of the curves first, then toggle only the curves you want to examine: double-click one of the labels in the legend (this isolates that curve and hides all the others), then click once on each additional curve you want to view.
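<p>To start from an empty plot instead, every curve can be hidden programmatically. A minimal sketch using Plotly's <code>restyle</code> API (the element id here is hypothetical; the real ids are the dataset names assigned by the loader script):
<pre>
// Set every trace to "legendonly": the curves disappear from the plot
// but stay in the legend, where clicking toggles them back on.
let graph = document.getElementById("some-dataset"); // hypothetical id
Plotly.restyle(graph, { visible: "legendonly" });
</pre>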
<p>The curve label format is:
<pre>
RAMSPEED-DIMMCOUNT-FREQGOV-NINSTANCE-PARAM-CTX-MODEL
</pre>
<p>For example, the label <code>5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin</code> pertains to a run made while the system had 2 DIMMs of RAM operating at 5200MT/s, the CPU frequency governor was set to <code>schedutil</code>, and 3 separate instances of llama.cpp were running the ggml-model-q4_0.bin build of the 7B model with a 512-token context window.
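<p>Because the model name itself contains <code>-</code> characters, only the first six fields should be split off when taking a label apart. An illustrative helper (not part of this page's code):
<pre>
function parseLabel(label) {
  // RAMSPEED-DIMMCOUNT-FREQGOV-NINSTANCE-PARAM-CTX-MODEL
  let [ramspeed, dimmcount, freqgov, ninstance, param, ctx, ...rest] = label.split("-");
  return { ramspeed, dimmcount, freqgov, ninstance, param, ctx, model: rest.join("-") };
}
// parseLabel("5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin").model
// => "ggml-model-q4_0.bin"
</pre>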
<p>The data used for these graphs is available for download as a password-protected 7z archive <a download href="./data/html-data/data-QVlr1kKzDjc=.7z">here</a>. Use the password <code>QVlr1kKzDjc=</code> to extract it.
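<p>With the 7-Zip command-line tool, for example, the archive can be extracted like so:
<pre>
7z x -pQVlr1kKzDjc= data-QVlr1kKzDjc=.7z
</pre>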
<p>These graphs are best viewed while consuming at least one tomato 🍅️.
<h2>Interactive Graphs</h2>
</body>
</html>