| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Izikiel43 1738 days ago
	> The question was basically, "Find the median of a huge data set without sorting it," Isn't this done using a min heap and a max heap in conjuction?

5 comments

ilikebits 1738 days ago

The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".

link

robotresearcher 1737 days ago

'Estimate the median of an arbitrary sized data set using a constant amount of memory'.

link

tantalor 1738 days ago

No, for 2 reasons,

1. huge data set: heaps requires storing the whole set, but "huge" means "more than you can store"

2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules

link

xyzzyz 1737 days ago

You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find median in O(n) time, you need to do something similar to magic-five (median of medians) algorithm.

link

eru 1737 days ago

Something like QuickSelect is probably better in practice than median-of-medians.

link

thaumasiotes 1737 days ago

> but "huge" means "more than you can store"

Really? Where's it coming from?

link

namrog84 1737 days ago

I think they mean having more than you can store simultaneously on a single device.

With a few exceptions this is pretty common scenario.

link

Dylan16807 1737 days ago

More than you can store.

And possibly it's live data.

link

hnfong 1737 days ago

The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.

link

5faulker 1738 days ago

It's funny that this is often left out from a data & algorithm class.

link

thaumasiotes 1737 days ago

Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.

link

Izikiel43 1737 days ago

The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I agree that if you have the whole thing, doing heapsort and pulling a[N/2] or a[1 + N/2] is simpler.

link

thaumasiotes 1737 days ago

> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"

But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.

My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.

link

lanstin 1737 days ago

Because unlike many dynamic programming algorithms, it is something anyone running a large system will need.

link

ZephyrBlu 1737 days ago

How does this get you the median?

link