-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathNOTES
79 lines (58 loc) · 4.2 KB
/
NOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
Library parameter initialization
--------------------------------
All runtime configuration parameters are stored in armas_conf_t structure. Function
armas_conf_default() returns pointer to default library configuration block which is initially
set to compile time values. Runtime configuration is affected by ARMAS_CONFIG environment
variable. Function armas_init() reads and parses this environment variable and sets
default configuration block according to the environment variable.
Runtime configuration can be obtained by explicit call to armas_init() or by the first call to
armas_conf_default() function. All public library functions call this function if provided with
NULL configuration block parameter.
Basic configuration
-------------------
ARMAS_CONFIG has the format "MB,NB,KB,LB,WB,NPROC"
Low level blocking configuration for BLAS3 level functions is defined by parameters MB, NB and
KB. In matrix-matrix multiplication (GEMM) MB,NB block is computed as sum of [MB,KB]x[KB,NB]
blocks. Other BLAS3 level functions use these blocking parameters as appropriate. The
environment variable defines the relative sizes of these parameters. Actual parameter values
computed from values of cache memory environment variable ARMAS_CACHE. ARMAS_CACHE has format
"L3MEM,L1MEM", where L3MEM is the number of bytes reserved for per thread cache memory. This
buffer is divided to two blocks, one of size KBxMB elements and other of size KBxNB elements.
Each column of KB elements in block is aligned to CPU cache-line. (For extended precision
routines cache buffer is divided to three blocks.) Silent assumption is that MB <= NB <= KB.
The L1MEM parameter is used to compute internal parameter RB < MB, that controls access
to innermost block which is looped over NB times. Values of L3MEM and L1MEM are
integer numbers followed by optional letter 'k', 'K', 'm' or 'M'.
Parameter LB defines the blocking size for LAPACK level functions. Blocked code progresses
in LB size blocks.
Parameter WB defines parallel scheduling for BLAS3 level functions. For RECURSIVE and BLOCKED
scheduling WB is used to calculate the number of parallel execution threads. The number
of threads used is MIN(CEIL(<number-of-elements>/WB), NPROC). For TILED
scheduling the work is divided to [WB,WB] size tasks and scheduled to NPROC threads.
Parameter NPROC is maximum number of processors to use in parallel execution of BLAS3 level
subroutines.
Scheduling configuration
------------------------
Runtime scheduling configuration is defined with the ARMAS_SCHED environment variable. It has
format <SCHEDPARAMS>,<CPUSPEC-LIST>. The SCHEDPARAMS is character string where the first
character defines scheduling policy: recursive (R), blocked (B) or tiled (T). Second
character defines how blocked and tiled scheduling distribute tasks to worker threads:
round-robin (R) or random (Z). Optional third character defines whether task is enqueued
to one (1) worker or to two (2) workers in power-of-two fashion.
RECURSIVE scheduling for K tasks means that K-1 threads are recursively started and
the last K'th block is computed in calling thread's context. Then on return from the
recursive calls the started threads are waited to complete (joined).
In BLOCKED scheduling for K tasks all tasks are distributed to K workers and calling
thread waits on a counting semaphore for the completition of all tasks.
In TILED scheduling P tasks are distributed to K workers and calling thread wait for
the completition of all P tasks. The value of P is computed from the number of elements
in the target matrix and the blocking parameter WB.
The CPUSPEC-LIST is comma separated list of CPUSPECS. Each CPUSPEC is either cpu range or
cpu mask spesificion or simple cpu number. The range specifications has form
of FIRST-LAST[/STEP] where FIRST and LAST define inclusive range and an options STEP to
select every STEP'th cpu from the range. Cpu mask specification has format [START:]HEXMASK
where START is cpu number of first cpu and HEXMASK is 64bit mask. Simple cpu is as the name
says simple cpu number. Following three specifications are equivalent: '8-15/2', '8:0x55'
and '8,10,12,14'.
Runtime scheduling configuration ARMAS_SCHED is meaningfull only when NPROC parameter
has value greater than one.