Hacker News new | ask | show | jobs
by hinkley 124 days ago
I can’t think of a single time I’ve needed a sorted list of only numbers. It’s always numbers and something else, like names or dates. Maybe for median calculations, but I don’t even use those that much either. Especially in telemetry, where mean is easy and median is not.
4 comments

> I can’t think of a single time I’ve needed a sorted list of only numbers.

Gosh. Let me try to convince you.

I use permutation arrays all the time: lists of indexes that can be used across multiple vectors.

This is much faster than the pattern of scanning rows, constructing tuples of (thingToSort . thingIWantInThatOrder) and making a custom sort function, and destructuring those tuples...

And really, not having to write custom sort functions is really really nice.

> Especially in telemetry, where mean is easy and median is not.

Funny. Yes median is obvious with a permutation array, and maybe mean is less so.

When your data is really big and not very variable, mean of x is roughly the same as the mean of any sufficient sample of x, and that sample can be meaningfully represented as a permutation array!

You can get such an array with reservoir sampling and some maths, and (depending on what you know of your data and variance) sometimes even simpler tricks.

That's kindof actually how the "faster than dijkstra" trick referred-to in the article works: Data sets with small variance has this same property that the min of x is roughly the same as the min of a sufficient sample of x (where the size of sufficient has to do with the variance). And so on.

Another big use-case in my code is trees: Apter trees have a flat memory layout which is convenient for permutation arrays which can simultaneously represent index, rotation, tombstones, and all sorts of other things you might need to do with a tree.

Give it a dig. There's good stuff in there.

To be pedantic, median is cheaper than sorting. O(n) with a quicksort-like algorithm.

Also, if you're taking an average of floating point numbers, you might want to sort it first and add from smallest to largest, to better preserve precision

An aside, but I recently learned -- if one is willing to use a very modest amount of memory -- summing floating-point numbers with no loss of precision is effectively a solved problem with the XSUM algorithm.

https://glizen.com/radfordneal/ftp/xsum.pdf

That paper explains some useful optimisation details, but obviously since the floats are all (either infinity or) some multiple of a known tiny fraction (their smallest non-zero number), we can definitely sum them accurately.
Not if the ratio between the largest and smallest floats is very large (2^(2^n)) where n is the number of bits in the exponent.
I think you either haven't thought about this or you did your math wrong.

You need (2^e)+m+1 bits. That is more bits than would fit in the cheap machine integer type you just have lying around, but it's not that many in real terms.

Let's do a tiny one to see though first, the "half-precision" or f16 type, 5 bits of exponent, 10 bits of fraction, 1 sign bit. We need 43 bits. This will actually fit in the 64-bit signed integer type on a modern CPU.

Now lets try f64, the big daddy, 11 exponent, 52 fraction, 1 sign bit so total 2048 + 52 + 1 = 2101 bits. As I said it doesn't fit in our machine integer types but it's much smaller than a kilobyte of RAM.

Edited: I can't count, though it doesn't make a huge difference.

You also need some extra bits at the top so that it doesn't overflow (e.g., on an adversarial input filled with copies of the max finite value, followed by just as many copies of its negation, so that it sums to 0). The exact number will depend on the maximum input length, but for arrays stored in addressible memory it will add up to no more than 64 or so.
That’s great for mean, but you don’t need to sort for mean.
You have a list of IDs, and want to make them compact for storage or transport - fast and simple way is to sort and delta encode.
Hmm. That’s fair, though I’d probably use set operations instead. What you find though is that for most other problems besides diffing, ID order is not chronological order, so you need to sort by a date stamp instead. But I’m typically letting the database do that, so I’m a consumer of sorted numbers, but not an implementor. Because what I sort is nearly always compound sorts. By field A, then field B and field C if those two still don’t cut it.
If the primary key is the number, it still works (and dates are just numbers by the way) because you can sort a heterogenous dataset by a single numeric key pretty trivially.

But sorting by arbitrary strings like names can’t avoid comparison sort.

That data structure isn’t an array of numbers, it’s an array of pointers to objects.