Hacker News new | ask | show | jobs
by mcphage 3298 days ago
As the first commenter on the site pointed out, the Hilbert Curve is probably a better choice (https://en.m.wikipedia.org/wiki/Hilbert_curve)
4 comments

Hilbert curve is only useful if the sole goal is sequential storage of point data. Z-curves are superior in almost every other case in real systems due to their unique and efficient computational properties, which many computer scientists are only vaguely aware of.

And since modern spatial database architectures don't sequentialize storage along the curve (because it doesn't make sense as a matter of engineering), the sole selling point of Hilbert curves is moot. You shouldn't design most systems in a way that could exploit the benefits of a Hilbert curve.

Could you elaborate on your comment about modern spatial databases not sequentializing storage along the curve? I would imagine parallel access across the curve, but wouldn't you want some reasonable sequential access at each cluster node to maximize IO speeds?

I've read your blog entries on SpaceCurve (http://www.jandrewrogers.com/2015/10/08/spacecurve/), found them very interesting but also just whetting the appetite. Are there no public reviews or papers covering discrete topology/sharding?

There are two design factors that most people overlook:

Most people know you can use Z/C-curve encodings to dynamically content address point data. There is a (very useful) generalization to hyper-rectangle types, perfect for content-addressing non-point geometries etc, but those types can't be meaningfully sequentialized at all in big systems. Most non-trivial spatial analytics involve non-point geometries, so being able to sequentialize points has limited utility.

Second, the computational cost of sorting along the curve, assuming you are using only points, is prohibitively high for negligible benefit. Modern storage engines use small shards, which are adaptively re-sharded as needed, and medium-sized pages. For insert, the content-addressing mechanic gets you to a single page; it would be significantly more expensive if you were sorting along the curve. For query, the typical selectivity on a shard is so high due to small adaptive shards, that you are better off treating it as an unsorted vector anyway. In short, much slower inserts and few (if any) query benefits.

As an optimization, it tends to only be applicable in cases where the architecture is significantly suboptimal anyway e.g. the use of gigantic shards. You'd get more benefit by fixing the architecture than trying to optimize poor architecture if at all possible.

The big advantage of Z-order curves is that the addressing computation is very cheap, which is why it's used a lot in computer graphics.
While I agree that Z-order curves are simpler, but it's fast to calculate Hilbert curves on modern CPUs too. Just to self plug:

https://github.com/leni536/fast_hilbert_curve

I only implemented the index->XY calculation yet. It compiles to 36 instructions without any branches and takes up 86 bytes.

https://github.com/leni536/fast_hilbert_curve/wiki/How-effic...

I think I can apply the same tricks for the inverse function too.

But using the same set of instructions, z-order encoding and decoding is 8 instructions (5 if you exclude size conversion and return):

    zorder64_inv:
        movabsq $0x5555555555555555, %rax
        pextq   %rax, %rcx, %rdx
        shrq    %rcx
        pextq   %rax, %rcx, %rcx
        shlq    $32, %rcx
        movl    %edx, %eax
        orq     %rcx, %rax
        retq

    zorder64:
        movl    %ecx, %eax
        movabsq $0x5555555555555555, %r8
        pdepq   %r8, %rax, %rcx
        movl    %edx, %eax
        pdepq   %r8, %rax, %rax
        addq    %rax, %rax
        orq     %rcx, %rax
        retq
Nice! Now I wonder when 36 vs 8 machine instructions become a bottleneck. I have seen applications of space-filling curves in quasi Monte Carlo integration, it could be potentially significant there.
Hilbert curves are used in a lot of graphics too. Heck, the old SGI Octane with Vpro graphics used a recursive Hilbert curve rasterizer. They show up a lot today in geospatial big-data since hilbert addresses make good shard keys.
I suspect that most production applications of Hilbert curve ordering would work just as well with Z order (a.k.a. Morton order), with the additional benefit of being simpler to reason about (just interleave/de-interleave the bits).

I haven’t ever seen any convincing benchmarks or other analysis where the Hilbert curve created any notable performance advantage vs. Z order; the only time you really need it is if moving along the linearized coordinate must never have jumps in the multidimensional coordinates, but I’m not convinced there are many if any real-world cases where that is important (note that in either case small movements in the multidimensional coordinates are associated with large jumps in the linearized coordinate). If the only goal is to minimize memory fetches, etc. then the Z ordering works just fine.

(If you know any good comparisons where the Hilbert curve comes out ahead, I’d be curious to read them.)

There is an error-diffusion dithering algorithm based on the Hilbert-curve called Riemersma dither.

https://www.compuphase.com/riemer.htm

I suspect that error diffusion along the long jumps of a z-order curve could create strange undesired artifacts.

H-curves are probably optimal with regards to locality, but good luck finding an implementation:

http://www.akt.tu-berlin.de/fileadmin/fg34/publications-akt/...

https://www.reddit.com/r/ProgrammerHumor/comments/4xzi9a/oh_...

It doesn't need to be that complicated at all. The book "Hacker's Delight" gives some nice, fairly simple implementations. Here's one:

http://www.hackersdelight.org/hdcodetxt/hilbert/lams.c.txt

H-curves are not Hilbert curves! I know, the name is very confusing, but read the PDF: H-curves have better locality than Hilbert curves (possibly even optimal).

And I'm sure that constructing H-curves doesn't need to be as complicated as that linked academic code makes it look, but for now I don't know of any implementation.

Thanks for the link though :)

Moore curves are also worth considering.