// The project function defines how your document looks.
// It takes your content and some metadata and formats it.
// Go ahead and customize it to your liking!
#let project(title: "", abstract: [], authors: (), body) = {
  // Set the document's basic properties.
  set document(author: authors.map(a => a.name), title: title)
  set page(numbering: "1", number-align: center)
  set text(font: "New Computer Modern", lang: "en")
  show math.equation: set text(weight: 400)
  set heading(numbering: "1.1")

  // Set run-in subheadings, starting at level 4.
  show heading: it => {
    if it.level > 3 {
      parbreak()
      text(11pt, style: "italic", weight: "regular", it.body + ".")
    } else {
      it
    }
  }

  // Title row.
  align(center)[
    #block(text(weight: 700, 1.75em, title))
  ]

  // Author information.
  pad(
    top: 0.5em,
    x: 2em,
    grid(
      columns: (1fr,) * calc.min(3, authors.len()),
      gutter: 1em,
      ..authors.map(author => align(center)[
        *#author.name* \
        #author.email \
        #author.affiliation
      ]),
    ),
  )

  // Abstract.
  pad(
    x: 2em,
    top: 1em,
    bottom: 1.1em,
    align(center)[
      #heading(
        outlined: false,
        numbering: none,
        text(0.85em, smallcaps[Abstract]),
      )
      #abstract
    ],
  )

  // Main body.
  set par(justify: true)

  body
}
#show: project.with(
  title: "Benchmark geospatial query performance on NoSQL(-like) databases",
  authors: (
    (name: "LiangXiang Shen", email: "kj415j45@gmail.com", affiliation: "Guangxi GuiYunBao Tech Inc."),
    (name: "Feng Zhou", email: "", affiliation: "Guangxi GuiYunBao Tech Inc."),
  ),
  abstract: [Geospatial data has become more and more critical, not only for enterprises but also for customers. In the long run, it is important to collect and use it well in the growing IoT. Query performance can be a key factor in improving UX and background analysis. In this paper, we benchmark geospatial query performance on popular NoSQL(-like) databases, including MongoDB and Redis. We also discuss the pros and cons of each database.],
)
= Introduction
According to a recent database ranking @DB_Rank, we chose the most popular NoSQL databases as our benchmark targets: MongoDB and Redis. We benchmark geospatial query performance on them.
== MongoDB
_MongoDB_ is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. @MongoDBWiki
== Redis
_Redis_ (Remote Dictionary Server) is an open-source in-memory storage, used as a distributed, in-memory key-value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Redis offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. @RedisWiki
_Redis Stack_ extends the core features of _Redis OSS_ and provides a complete developer experience for debugging and more. @RedisStackIntro It provides JSON, time-series, and probabilistic data structures on top of raw Redis. These features are very useful for building modern applications.
= Benchmark
Generally, a benchmark should be as close to the actual use case as possible. Many metrics may reflect actual database performance, and we cover as many of them as we can.

Benchmarking comes with some challenges. These challenges expose real development issues that we may encounter in the future. We discuss them later.
== Test Data
=== iNaturalist 2017
_iNaturalist_ provides a place to record and organize nature findings, meet other nature enthusiasts, and learn about the natural world. It encourages the participation of a wide variety of nature enthusiasts, including, but not exclusive to, hikers, hunters, birders, beach combers, mushroom foragers, park rangers, ecologists, and fishermen. Through connecting these different perceptions and expertise of the natural world, iNaturalist hopes to create extensive community awareness of local biodiversity and promote further exploration of local environments. @iNaturalist
Google held a competition called _iNaturalist Challenge at FGVC 2017_ #footnote[https://www.kaggle.com/c/inaturalist-challenge-at-fgvc-2017]. They created a dataset of 5,089 species of plants and animals, consisting of 675,000 training images and 25,000 test images from iNaturalist.

For the benchmark, we used a dataset that includes the geolocation data we needed: Fine Grained Geolocation Datasets @fg_geo, created by Grace Chu, Brian Potetz, et al.
=== Generated Data
To simulate some edge cases, we generated 3 different types of datasets.
==== Random
Randomly generated points in the range of EPSG:4326.
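
A minimal sketch of such a generator is shown below. It assumes that "random" means sampling longitude and latitude independently and uniformly in degrees; the function name `random_points` and the `seed` parameter are hypothetical and are not taken from the benchmark repository.

```python
import random


def random_points(n, seed=None):
    """Return n (longitude, latitude) pairs in degrees within the EPSG:4326 bounds."""
    rng = random.Random(seed)  # a seeded generator makes a run reproducible
    return [
        (rng.uniform(-180.0, 180.0), rng.uniform(-90.0, 90.0))
        for _ in range(n)
    ]
```

Sampling uniformly in degrees places more points per unit area near the poles than near the equator, so this sketch is not uniform on the sphere.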
==== Grid
Points generated with the Fibonacci sphere algorithm @FibonacciSphereAlgo. It is a method of generating points that are evenly distributed on the surface of a sphere.
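
One common formulation of the Fibonacci sphere can be sketched as follows. This is an illustration, not the code used in the benchmark; the function name `fibonacci_sphere` and the conversion to longitude and latitude in degrees are assumptions.

```python
import math

# Angle between consecutive points, in radians (the golden angle).
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def fibonacci_sphere(n):
    """Return n (longitude, latitude) pairs in degrees, spread evenly over a sphere."""
    points = []
    for i in range(n):
        # z moves from near +1 to near -1 in equal steps, so each point
        # sits in a band of equal surface area.
        z = 1.0 - (2.0 * i + 1.0) / n
        lat = math.degrees(math.asin(z))
        # Rotate by the golden angle at each step and wrap to [-180, 180).
        lon = math.degrees((i * GOLDEN_ANGLE) % (2.0 * math.pi)) - 180.0
        points.append((lon, lat))
    return points
```

Because `z` is stepped linearly, the points are evenly spaced by area rather than by degrees of latitude.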
==== Cluster
Every 50 points form a cluster, placed around a random center on Earth within a short radius. All points are generated in the range of EPSG:4326.
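
The clustering scheme can be sketched as below. The text does not state the radius or how out-of-range points are handled, so the default `radius_deg` of 0.5 degrees, the longitude wrapping, and the latitude clamping are all assumptions, as is the function name `clustered_points`.

```python
import random


def clustered_points(n, cluster_size=50, radius_deg=0.5, seed=None):
    """Return n (longitude, latitude) pairs grouped into clusters of cluster_size."""
    rng = random.Random(seed)
    points = []
    center = (0.0, 0.0)
    for i in range(n):
        if i % cluster_size == 0:
            # Start a new cluster around a fresh random center.
            center = (rng.uniform(-180.0, 180.0), rng.uniform(-90.0, 90.0))
        lon = center[0] + rng.uniform(-radius_deg, radius_deg)
        lat = center[1] + rng.uniform(-radius_deg, radius_deg)
        # Keep the point inside the EPSG:4326 bounds:
        # wrap longitude to [-180, 180) and clamp latitude to [-90, 90].
        lon = (lon + 180.0) % 360.0 - 180.0
        lat = max(-90.0, min(90.0, lat))
        points.append((lon, lat))
    return points
```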
== Test Environment
=== Hardware
GitHub Actions #footnote[https://github.com/features/actions] is used for benchmarking. It provides a virtual machine with 2 cores and 7 GB of memory.
=== Software
- MongoDB
  - Enterprise Server \@ latest #footnote[https://hub.docker.com/r/mongodb/mongodb-enterprise-server]
- Redis \@ latest #footnote[https://hub.docker.com/_/redis]
  - Stack Server \@ latest #footnote[https://hub.docker.com/r/redis/redis-stack-server]
=== Tools
The source code used in the benchmark can be found in the guiyunbao/geospatial-benchmark #footnote[https://github.com/guiyunbao/geospatial-benchmark] repository.
=== Test cases
==== Case A
Pick a random point on Earth and find the closest location in the dataset.
This is the most common geospatial query: finding the location closest to the user.
==== Case B
Pick a random location from the dataset and find all locations within a certain distance.
This case finds all locations within a certain distance of the user. It also measures the bandwidth of the database, as the result set may be large.
==== Case C
Pick a random location from the dataset, find all locations within a certain distance, and order them by distance.
This case measures the performance of sorting by distance.
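
To make the three cases concrete, the following brute-force reference shows what each query is expected to return. It is a sketch that runs in memory and is not how the databases execute these queries; the haversine distance on a spherical Earth and the function names `case_a`, `case_b`, and `case_c` are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Guard against rounding pushing h slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def case_a(dataset, origin):
    """Case A: the single location closest to origin."""
    return min(dataset, key=lambda p: haversine_km(origin, p))


def case_b(dataset, origin, radius_km):
    """Case B: all locations within radius_km of origin, in dataset order."""
    return [p for p in dataset if haversine_km(origin, p) <= radius_km]


def case_c(dataset, origin, radius_km):
    """Case C: the Case B result, ordered by increasing distance."""
    return sorted(case_b(dataset, origin, radius_km),
                  key=lambda p: haversine_km(origin, p))
```

A reference like this scans the whole dataset for every query, so it is only useful for checking results, not for comparing speed.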
== Limitation
Because of the complexity of the setup, we found several limitations during the benchmark. All patches that we made to ensure the test suite runs correctly are listed below.
=== Coordinate System
Redis uses EPSG:3857 #footnote[https://epsg.io/3857], also known as _Pseudo-Mercator_ or _Web Mercator_, as the coordinate system @RedisGeoadd.
This coordinate system does not cover the areas near the poles, so inserting a point near the poles fails.
In this benchmark, *while inserting points into Redis, if a point is near the poles, we will move it to a valid EPSG:3857 latitude*.
It is worth mentioning that this system may not reflect actual ground distance, as its projection is not perfect for Earth. *Though some reports say it has a 0.7% error rate @ErrorInProjection, we ignore the error as it is beyond the scope of this benchmark.*
MongoDB uses EPSG:4326 #footnote[https://epsg.io/4326], also known as _WGS84_, which is used in _GPS_. It is not affected by this issue.
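
The adjustment for points near the poles can be sketched as a simple clamp. The Redis documentation for GEOADD gives the valid latitude range as -85.05112878 to 85.05112878 degrees; moving a point to the nearest valid latitude is an assumption about how the patch works, and the function name `clamp_for_redis` is hypothetical.

```python
# Latitude limit documented for Redis GEOADD, in degrees.
MAX_LAT = 85.05112878


def clamp_for_redis(lon, lat):
    """Return (lon, lat) with the latitude moved into the range Redis accepts."""
    return lon, max(-MAX_LAT, min(MAX_LAT, lat))
```

For example, `clamp_for_redis(10.0, 89.9)` keeps the longitude and returns a latitude of 85.05112878, while a point at latitude 45 is returned unchanged.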
=== Query
In Redis, querying around a geo point requires a radius as one of the parameters. If the radius is too large, the query is slow. *To simulate actual use cases, we use 100 kilometers as the maximum radius when querying Redis.*
*All queries use a valid EPSG:3857 coordinate.*
In Redis Stack, there is no way to evaluate the distance between two points, and it is not possible to order results by distance during a query. So *in Test Case C, distances for Redis Stack are calculated and ordered in the application layer.*
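
As an illustration of the radius-bounded query on raw Redis, the sketch below builds the argument list for a `GEOSEARCH` command with the 100 km cap applied. Whether the benchmark uses `GEOSEARCH` or another command is not stated here, so the command choice, the key name, and the helper `geosearch_args` are assumptions.

```python
MAX_RADIUS_KM = 100  # maximum radius used in this benchmark


def geosearch_args(key, lon, lat, radius_km, nearest_only=False, ordered=False):
    """Build the arguments of a Redis GEOSEARCH call centered on (lon, lat)."""
    radius = min(radius_km, MAX_RADIUS_KM)  # never search beyond the cap
    args = ["GEOSEARCH", key, "FROMLONLAT", lon, lat, "BYRADIUS", radius, "km"]
    if nearest_only or ordered:
        args.append("ASC")  # nearest results first
    if nearest_only:
        args += ["COUNT", 1]  # keep only the closest member
    return args
```

With `nearest_only=True` the arguments correspond to Test Case A, the defaults to Test Case B, and `ordered=True` to Test Case C. The list can be passed to any client that accepts raw commands.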
== Test Result
Due to limited time, we do not provide charts or numeric analysis here. All raw data can be found in the GitHub Actions logs #footnote[https://github.com/guiyunbao/geospatial-benchmark/actions/workflows/dispatch.yml]. You can also fork the repository and run the benchmark yourself.
*Redis is the fastest in all cases*, followed by MongoDB and Redis Stack. *Most of the time, MongoDB is better than Redis Stack.*
=== Analyze by Test Cases
In most test cases, Redis dominates the benchmark. It is at least 5x faster than MongoDB and 10x faster than Redis Stack.
Sorting by distance did not appear to affect performance in any database, even though Redis Stack sorts the results in the application layer.
=== Analyze by Datasets
The generated datasets are used to simulate some edge cases. They exposed that the Redis family cannot handle the areas around the poles properly.
==== iNaturalist 2017 and Cluster
Both datasets simulate real-world use cases: points of interest (PoI) may be clustered in some areas, and the dataset may be large. All databases suffer from a bandwidth issue in Test Cases B and C, where ops/sec decreases significantly.
==== Random
Randomly distributed points are not a common case, but they are still worth testing. Databases cannot take advantage of grouping or other optimizations, so performance is not as good as on the other datasets. At a large scale, this resembles testing on a dataset in which each city on Earth is a single point. We did not see any difference across the databases.
==== Grid
Grid is a special case: its points are evenly distributed on the surface of a sphere. All test cases return much smaller result sets, which better reflects the actual query speed. On this dataset, Redis Stack is sometimes faster than MongoDB.
= Conclusion
Raw Redis is the fastest among all databases. With a geospatial index, it runs almost 10x faster than MongoDB and 20x faster than Redis Stack. But it is not suitable for complex queries, as raw Redis supports neither secondary indexes nor querying by stored values. To achieve an experience similar to a relational database, developers have to maintain foreign keys in the application layer, which can be expensive and complex. It may also cause data inconsistency and excessive memory use.

MongoDB strikes a balance between performance and complexity. It is not as fast as raw Redis, but it is still fast enough for most use cases. Its design is closer to that of a relational database: MongoDB supports secondary indexes and querying by stored values, as well as complex queries such as aggregations. There is an In-Memory Storage Engine #footnote[https://www.mongodb.com/docs/manual/core/inmemory/#in-memory-storage-engine] in Enterprise Server, but it is not enabled by default. It also requires setting up a replica set to persist the data. That is why it is not covered in this benchmark.

Redis Stack could be a viable choice in the future, but not for now. In our opinion, it is still at an early prototype stage. Although it extends the core features of Redis, it is not as fast as raw Redis. It tries to implement a document-based database, but its ecosystem is frustrating, at least for Redis OM for JavaScript: the OM cannot provide proper type annotations when querying by stored values, which Mongoose implements well.
#pagebreak(weak: true)
#bibliography("cites.yml")