Skip to content

Performance & data compactness

Dimitri van Hees edited this page Mar 18, 2016 · 15 revisions

Complex geo features provided by base registries (like multi polygons) are often very accurate. Especially when querying feature collections, this results in very large response payloads wasting bandwidth and causing slow response times. For web applications, these payload sizes are unacceptable and will cause applications to become slow or unresponsive.

There are several solutions to overcome this issue:

Compression

Compression is an easy way to achieve reduction of the response payload size. While this will speed up the transfer of the data from the server to the client, the data still needs to be uncompressed and loaded into memory within the browser of app.

HTTP compression

HTTP compression is a capability that can be built into web servers and web clients to improve transfer speed and bandwidth utilization. The most common compression schemes include Gzip and Deflate.

Gzip is a (de-)compression utility using Lempel-Ziv coding (LZ77) and is broadly used by webservers like Apache and Nginx to compress response payloads without the risk of data loss.

In order to receive a gzip-encoded response the client must set an Accept-Encoding request header. Here is an example of a properly formed HTTP header for enabling Gzip compression:

Accept-Encoding: gzip

TopoJSON

TopoJSON is an extension of GeoJSON that encodes topology. Rather than representing geometries discretely, geometries in the TopoJSON format are stitched together from shared line segments called arcs. Arcs are sequences of points, while line strings and polygons are defined as sequences of arcs. Each arc is defined only once, but can be referenced several times by different shapes, thus reducing redundancy and decreasing the payload size. This way of storing spatial data is most efficient on polygons that share a lot of borders with neighbours, for example city-boundaries.

Besides GeoJSON, TopoJSON supports other source formats like ESRI shapefiles and CSV. It is also capable of simplifying geo features, a topic we will cover in the next section. Similar to HTTP compression, TopoJSON needs to be decompressed before the data can be used. Since TopoJSON performs compression at another level than HTTP, these two compression methods can be used in conjunction to return the smallest payload possible.

Precision

Many use cases don't require the level of accuracy the source data offers. Most geo standards use decimal degrees for expressing spatial coordinates. An easy method of degrading precision is to reduce the number of decimals. This can easily result in smaller payloads.

Another problem is the precision of geometries. It's not unusual to have (multi) polygons consisting of thousands of coordinates. In many use cases, this level of precision is unnecessary. To simplify geometries to suit the displayed resolution, various line simplification algorithms exist.

Ramer–Douglas–Peucker

The Ramer–Douglas–Peucker algorithm (RDP) is an algorithm for reducing the number of points in a curve that is approximated by a series of points. It does so by "thinking" of a line between the first and last point in a set of points that form the curve. It checks which point in between is farthest away from this line. If the point (and as follows, all other in-between points) is closer than a given distance epsilon, it removes all these in-between points. If on the other hand this outlier point is farther away from our imaginary line than epsilon, the curve is split in two parts.

Ramer–Douglas–Peucker

Visvalingam–Whyatt

While Douglas–Peucker is the most well-known, the Visvalingam–Whyatt algorithm, also used by TopoJSON, may be more effective and has a remarkably intuitive explanation: it progressively removes points with the least-perceptible change. For example, the GeoJSON file used to draw the continental United States can be reduced from 531KB to 27KB with only minor visual changes (example).

Clustering

Lots of libraries make it very easy to draw markers (geo points) on a map. However, spatial datasets often get very large and markers start to clutter. This image illustrates the problem:

Cluttered markers

Some libraries have clustering capabilities (example), but this is only a partial solution to the problem. The client (browser/app) still has to deal with large payloads and high memory/cpu consumption, what can result in slow or unresponsive behavior. Therefore, it would be better if the server handles the clustering and only serves the necessary data for a clustered map view.

The result we want looks like this:

Clustered markers

Geohash

One of the most efficient ways to handle clustering server-side is geohashing. Geohash is a hierarchical spatial data structure which subdivides space into buckets of grid shape. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. So the longer the hash, the more precise it is. Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision).

Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. Example:

Geohash

A popular search engine that supports geohash aggregation is Elasticsearch.

Filtering

Lots of use cases need to query for all spatial objects filtered by spatial conditions. To reduce the payload for these kind of use cases, advanced filtering features can be very useful. Some possible filtering types are:

  • Distance: this filters on all spatial objects within a specific distance from a geo point.
  • Bounding box: This filters on all spatial objects within a bounding box.
  • Geo Shape: This filters on all spatial objects within or interecting with a geo shape (e.g. polygon).

For extra convenience, the above filtering types could be combined with aggregation/faceting. For example, it could return all objects grouped by a distance range of 1, 3, 5 and 10 km.