| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by psykotic 1539 days ago

> It's possible to build a hash on tolerant doubles by making two hashes per lookup, and even complex numbers with four per lookup, but no one's figured out how to deal with entries that contain multiple numbers yet.

I'm curious what obstacle you have in mind for "multiple numbers".

The problem is equivalent to a distance query on a data structure, i.e. enumerate everything within a distance tol of a query point x in an appropriate distance metric. There are many data structures for this problem (it's a basic building block in computational geometry). But I assume the hashing solution you have in mind is what we usually call a hash grid in gamedev and graphics, which works well when tol is fixed for the lifetime of the data structure. [1] Namely, you define a grid with cell size tol and map a point x to the index cell(x) of its containing cell, which is just a rounding operation. Then you can use a hash table keyed by cell(x) to store the contents of an unbounded, sparse grid. To perform a distance query from x you need to check not just x's cell but some of the immediately neighboring cells and then filter all of those candidates based on the pairwise distance. [2] [3]

This approach works with any distance metric (including the usual suspects L^2, L^1 and L^inf) and in any dimension d although the worst-case number of cells to check in the hash is 2^d, so the curse of dimensionality is present and in high-dimensional spaces another data structure not based on rectilinear partitioning would be preferable. But hashing does work with "multiple numbers" if by "multiple numbers" you mean a vector (with not too many components) where tolerance is defined by a distance metric.

[1] Actually, hash grids work well any time the query radius is on roughly the same scale as the cell size. But if the query radius is a lot smaller than the cell size then the grid is too coarse to perform adequate pre-filtering. And if the query radius is larger than the cell size you have to check a bigger neighborhood of cells (i.e. more hash lookups) to enumerate all the candidates.

[2] Depending on the read vs write ratio, you can actually flip this around by storing each point in multiple cells so that queries only need one hash table lookup at the expense of slightly less precise pre-filtering.

[3] Instead of doing multiple hash table lookups, you can also have each occupied cell store pointers to neighboring cells (null for unoccupied neighbors), which replaces hash table lookups for the neighbors with plain memory loads. A variation on this trick is to instead store bit flags to avoid doing hash table lookups for neighboring cells which are known to be empty; this takes up far less memory.