| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ryandrake 4475 days ago

Spatial data structures are great, and heavily used in all kinds of mapping applications. Most of the systems I've worked with used the more constrained quadtree structure, where the map is divided into uniformly-sized tiles, and a level-of-detail hierarchy is built. This is good for raster data, where each tile is a picture and the whole world is covered by tiles. The R-tree has great advantages when the complexity is not uniformly distributed across the map, for example, vector road data, where vast areas on earth have no data, and most of the data is concentrated in small urban areas.

I believe PostGIS, which adds spatial operations to Postgres, uses an R-tree as its spatial index.

An improvement on the R-tree, the R(star)-tree, uses a different node splitting algorithm and includes re-insertions (similar to balancing a B-tree), reducing both coverage and overlap. The insertion complexity is greater, but in general, R(star)-tree query performance tends to be a bit better for mapping applications. Generally, maps don't change that often, so building the tree tends to happen far less often than querying it.

There are many more specialized spatial data structures available, for example, Kd-trees, which can be perfectly balanced and are useful for storing point data.

If you're really interested in this stuff, the holy bibles for spacial data structures (which I keep in a special place on my bookshelf) are a pair of books written by H. Samet: The Design and Analysis of Spatial Data Structures, and Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.

EDIT: Formatting, and apparently you can't write the asterisk character on HN

R-Tree paper (1984): http://postgis.org/support/rtree.pdf

R(star)-Tree (1990): http://epub.ub.uni-muenchen.de/4256/1/31.pdf

PostGIS: http://postgis.net

H.Samet textbooks: http://www.cs.umd.edu/~hjs/design.html

5 comments

jandrewrogers 4475 days ago

The spatial data structure literature is quite incomplete, and there is a lot of confusion about what to use when. Many of the issues with spatial indexing performance at companies I go into is that they are doing it wrong, not that they necessarily have a fundamental problem.

It should be pointed out that R-family data structures should only be used for data sets that are (1) small and (2) relatively static. They scale very poorly in a number of ways. The primary reason they are used in databases is that they can handle interval (i.e. non-point) spatial data types without the possibility of pathological space complexity (bad when talking about databases) and fit B-tree indexing models adequately, which that software understands.

Quad-tree variants can scale well for point data types but are often terrible for indexing non-point data types due to pathological space complexity.

Grid-file variants are the canonical structure for large-scale spatial data sets if the data is static. However, the number of companies implementing these correctly at scale is approximately zero in my experience.

For general, scalable, online/dynamic spatial indexing structures, there is another family of algorithms and data structures (adaptive spatial sieves) that are basically ignored in the literature even though they were first described in 1990, albeit in a not very useful form at the time. If you are doing petabyte-scale real-time indexing of polygons at extremely high rates, this is what you would use but little is published about them because most modern variants did not come out of academia.

And the most advanced algorithm family for indexing point-like data is not in the literature at all.

(Background: my day job involves extreme-scale, high-performance spatial indexing software and I invented a few spatial indexing algorithms back in the day that are still the state-of-the-art in their respective algorithm families.)