Requirements
Our purpose is to write a shard allocator that assigns m shards to n nodes (typically m >> n).
The basic attributes of the nodes are stored in the nodes.json file: each node's id, total disk space, and used disk space.
Shard attributes are stored in shards.json: shard id, collection id, and shard size.
(A collection can be viewed as an Elasticsearch index; 'index' is used in the rest of this document.)
To assign shards in a balanced way, we design a weight function based on a node's used disk space, the shard count on the node, and the shard count of a given index on the node.
Users can adjust the weight of each of these three factors based on their shard usage.
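
As a concrete illustration, the node and shard attributes described above might map onto simple Java value classes like the following. The field names are assumptions for illustration and may not match the actual keys in nodes.json and shards.json; the primary/replica flag is inferred from the Design section rather than stated here.

// Sketch of the input records described above. Field names are assumed and may
// not match the actual JSON keys in nodes.json / shards.json.
class Node {
    String id;            // node id
    long totalDiskBytes;  // total disk space
    long usedDiskBytes;   // used disk space
}

class Shard {
    String id;            // shard id
    String collectionId;  // collection (index) id
    long sizeBytes;       // shard size
    boolean primary;      // true for a primary copy, false for a replica (assumed flag)
}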
Design
Shard allocation can be seen as a variant of the bin-packing problem, an optimization problem that is NP-hard.
In our case, we want to distribute m shards to n nodes so as to minimize the load/traffic on the most loaded node.
We follow the greedy approach used by Elasticsearch, which iterates over all shards and, for each shard, chooses the most eligible node.
To define a criterion for the most eligible node, we define a cost function that outputs a weight measuring the "busyness" of a node.
Hence, in the greedy search, each incoming shard is assigned to the "idlest" node, namely the node with the smallest weight.
Weight function:
Weight = WeightShard * theta0 + WeightIndex * theta1 + diskUsage * theta2
(Users can adjust theta0, theta1, and theta2 to tune the importance of each factor.)
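
As a minimal sketch, assuming WeightShard and WeightIndex are the node's overall and per-index shard counts relative to the cluster average (in the style of the Elasticsearch weight function referenced at the end of this document) and diskUsage is the used fraction of the node's disk, the weight function could look like this; the actual WeightFunction class may differ in names and normalization.

// Illustrative sketch of the weight function above; not the exact WeightFunction class.
class WeightFunction {
    private final float theta0, theta1, theta2; // user-tunable balance factors

    WeightFunction(float theta0, float theta1, float theta2) {
        this.theta0 = theta0;
        this.theta1 = theta1;
        this.theta2 = theta2;
    }

    float weight(int shardsOnNode, float avgShardsPerNode,
                 int indexShardsOnNode, float avgIndexShardsPerNode,
                 long usedDiskBytes, long totalDiskBytes) {
        float weightShard = shardsOnNode - avgShardsPerNode;           // shard-count balance
        float weightIndex = indexShardsOnNode - avgIndexShardsPerNode; // per-index balance
        float diskUsage = (float) usedDiskBytes / totalDiskBytes;      // disk balance
        return weightShard * theta0 + weightIndex * theta1 + diskUsage * theta2;
    }
}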
1. How to pick the unassigned shard?
a. We choose primary shards first
b. We choose primary shards of the same index first
2. During the node search, we first rule out ineligible nodes. The conditions include:
a. The node already holds a copy of the same shard (same index, same shard id) -> ineligible, since a primary and its replica cannot be on the same node, and neither can two replicas of the same shard
b. The node's used space plus the size of the shard being allocated would exceed 90% of the node's total disk space
3. Using this weight function we achieve the following:
a. Shards of the same index are spread across as many nodes as possible
b. Nodes with lower disk usage are more eligible
c. Shards are distributed across nodes as evenly as possible
4. If two nodes have the same weight, we break the tie:
a. Choose the node whose highest primary shard id is next to the current shard id
5. The time complexity is O(m * n). A sketch of this greedy loop is given below.
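
The sketch below, written against the illustrative Node, Shard, and WeightFunction classes above, shows one way the greedy loop in steps 1-5 could be implemented. It is a sketch under those assumptions, not the actual Balancer code, and the step-4 tie-break is omitted for brevity.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the greedy allocation loop (steps 1-3 and 5); names are illustrative.
class GreedyAllocatorSketch {
    // Maps a node id to the shards currently assigned to that node.
    private final Map<String, List<Shard>> assignments = new HashMap<>();

    Map<String, List<Shard>> allocateUnassigned(List<Shard> unassigned, List<Node> nodes,
                                                WeightFunction wf) {
        // Step 1: primaries first, shards of the same index grouped together.
        unassigned.sort(Comparator.comparing((Shard s) -> !s.primary)
                                  .thenComparing(s -> s.collectionId));
        for (Shard shard : unassigned) {
            Node best = null;
            float bestWeight = Float.MAX_VALUE;
            for (Node node : nodes) {
                List<Shard> onNode = assignments.computeIfAbsent(node.id, k -> new ArrayList<>());
                // Step 2a: a node may hold at most one copy of a given shard.
                boolean hasCopy = onNode.stream().anyMatch(
                        s -> s.collectionId.equals(shard.collectionId) && s.id.equals(shard.id));
                if (hasCopy) continue;
                // Step 2b: reject nodes that would exceed 90% of total disk space.
                if (node.usedDiskBytes + shard.sizeBytes > 0.9 * node.totalDiskBytes) continue;
                // Step 3: evaluate the weight function and keep the node with the smallest weight.
                int indexShardsOnNode = (int) onNode.stream()
                        .filter(s -> s.collectionId.equals(shard.collectionId)).count();
                float w = wf.weight(onNode.size(), avgShardsPerNode(nodes, null),
                        indexShardsOnNode, avgShardsPerNode(nodes, shard.collectionId),
                        node.usedDiskBytes, node.totalDiskBytes);
                if (w < bestWeight) {
                    best = node;
                    bestWeight = w;
                }
            }
            if (best != null) {
                assignments.get(best.id).add(shard);
                best.usedDiskBytes += shard.sizeBytes; // keep disk bookkeeping up to date
            }
        }
        return assignments;
    }

    // Average number of assigned shards per node, optionally restricted to one index.
    private float avgShardsPerNode(List<Node> nodes, String index) {
        long total = assignments.values().stream()
                .flatMap(List::stream)
                .filter(s -> index == null || s.collectionId.equals(index))
                .count();
        return (float) total / nodes.size();
    }
}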
Implementation
ShardAllocator.java is the entry point. It reads the input files, writes the output files, and calls BalancedShardsAllocator.
BalancedShardsAllocator.allocate() creates a WeightFunction instance and passes it to the Balancer class.
The Balancer class provides a public method (allocateUnassigned()) that traverses all the primary and replica shards in the order described in the Design section.
For each shard, it calls into a private method implementing the core algorithm that decides the most eligible node.
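
A rough sketch of this call flow is shown below, reusing the illustrative classes from the earlier sections; the signatures, the JSON parsing, and the output handling are assumptions, not the actual code.

import java.util.List;
import java.util.Map;

// Rough sketch of the call flow described above; GreedyAllocatorSketch stands in
// for the real Balancer, and all signatures here are assumed.
class ShardAllocatorFlowSketch {
    static Map<String, List<Shard>> run(List<Node> nodes, List<Shard> shards,
                                        float theta0, float theta1, float theta2) {
        // ShardAllocator parses nodes.json and shards.json into these lists (parsing omitted),
        // then the allocator builds the weight function and delegates to the balancer.
        WeightFunction wf = new WeightFunction(theta0, theta1, theta2);
        GreedyAllocatorSketch balancer = new GreedyAllocatorSketch();
        Map<String, List<Shard>> assignment = balancer.allocateUnassigned(shards, nodes, wf);
        // ShardAllocator then writes this assignment back out to a file (format not specified here).
        return assignment;
    }
}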
Verification
Unit tests are included to verify the balance of the shard allocator.
The following scenarios are covered by the unit tests (a sketch of scenario 1 is shown after the list):
1. 1 index, 3 shards with same size, 3 nodes with same disk, 1 rep
a. Each node has 1 shard
2. 1 index, 3 shards with same size, 3 nodes with same disk, 2 rep
a. Each node has 2 shards
b. Primary and replicas are on different nodes
3. 1 index, 3 shards with same size, 3 nodes with same disk, 3 rep
a. Each node has 3 shards
b. Primary and replicas are on different nodes
4. When the node disks are full, no shards are assigned
5. When the node disks have enough space for only 2 of the 4 shards
a. Primaries are allocated first
b. Shards of the same index are allocated
6. When one node's used disk space is much lower than that of the other two nodes
a. Shards are still evenly distributed across the 3 nodes
b. Same shard count on each node
7. Same test as 6, but the disk balance factor is set higher
a. Shards are NOT evenly distributed across the 3 nodes
b. The node with less used disk is favored and receives more shards
8. 1 index; shards 1 and 2 with a smaller size, shards 3 and 4 with a larger size; 2 nodes, 1 rep
a. Shard 1 is allocated with Shard 3
b. Shard 2 is allocated with Shard 4
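
As a minimal sketch of scenario 1, a test could look roughly like the following. It is written against the illustrative GreedyAllocatorSketch and WeightFunction classes from the Design section rather than the real allocator API, it reads "1 rep" as a single (primary) copy per shard, and the balance factors are chosen arbitrarily for the sketch.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.junit.jupiter.api.Test;

// Sketch of scenario 1: 1 index, 3 equally sized shards, 3 identical nodes, 1 copy each.
class BalanceSketchTest {

    @Test
    void eachNodeGetsOneShard() {
        List<Node> nodes = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            Node n = new Node();
            n.id = "node-" + i;
            n.totalDiskBytes = 100_000;
            n.usedDiskBytes = 0;
            nodes.add(n);
        }
        List<Shard> shards = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            Shard s = new Shard();
            s.id = "shard-" + i;
            s.collectionId = "index-1";
            s.sizeBytes = 1_000;
            s.primary = true; // "1 rep" read as one (primary) copy per shard
            shards.add(s);
        }
        WeightFunction wf = new WeightFunction(0.45f, 0.55f, 1.0f); // arbitrary illustrative factors
        Map<String, List<Shard>> result = new GreedyAllocatorSketch().allocateUnassigned(shards, nodes, wf);
        for (Node n : nodes) {
            assertEquals(1, result.get(n.id).size()); // each node ends up with exactly one shard
        }
    }
}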
Pros and cons
Pros:
The balance of allocation does not rely on only one factor.
You can adjust the balance factors based on your experience.
Shard count is approximately the same for each node. Shard count per index is approximately the same for each node.
Finally, the disk usage is approximately the same for each node.
This balance means each node receives approximately the same amount of traffic, which balances JVM heap, CPU, and memory consumption across nodes, reducing "hot" nodes.
Cons:
We need to monitor the performance of the nodes and adjust the balance factors accordingly.
The weight is greatly affected by the balance factors we set.
Also, we have to regulate the disk balance factor by the average disk usage.
Improvement
1. Add more validation on inputs, and add more error handling
2. If a certain index becomes hot, we should take CPU, network bandwidth, and traffic into account
3. We should also handle rebalancing the cluster, including adding a new node, removing a node, and moving shards as usage changes
4. For verification, we can implement a simulation that replays shard requests and measures the traffic to/from each node,
such that it can verify whether the traffic for each node is balanced.
Reference
https://aws.amazon.com/blogs/opensource/open-distro-elasticsearch-shard-allocation/
https://opensearch.org/blog/The-Elasticsearch-Weight-Function/#:~:text=The%20Weight%20Function,to%20each%20shard%20%2D%20node%20combination