Hacker News new | ask | show | jobs
by Xcelerate 701 days ago
I received a variant of this problem as an interview question a few months ago. Except the linear time approach would not have worked here, since the list contains trillions of numbers, you only have sequential read access, and the list cannot be loaded into memory. 30 minutes — go.

First I asked if anything could be assumed about the statistics on the distribution of the numbers. Nope, could be anything, except the numbers are 32-bit ints. After fiddling around for a bit I finally decided on a scheme that creates a bounding interval for the unknown median value (one variable contains the upper bound and one contains the lower bound based on 2^32 possible values) and then adjusts this interval on each successive pass through the data. The last step is to average the upper and lower bound in case there are an odd number of integers. Worst case, this approach requires O(log n) passes through the data, so even for trillions of numbers it’s fairly quick.

I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.

Well, the interview feedback arrived, and it turns out my solution was rejected for being suboptimal. Apparently there is a more efficient approach that utilizes priority heaps. After looking up and reading about the priority heap approach, all I can say is that I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

I had never used leetcode before because I never had difficulty with prior coding interviews (my last job search was many years before the 2022 layoffs), but after this interview, I immediately signed up for a subscription. And of course the “median file integer” question I received is one of the most asked questions on the list of “hard” problems.

6 comments

These are horrible interview questions. Data issues like this do pop up in the real world; heck I deal with them.

But when you run into <algorithm> or <feels like algorithm>, the correct solution is to slow down, Google, then document your work as you go. In a real application log n may be insufficient. But coding interview exercises need tight constraints to fit the nature of the interview.

You don't need to load trillions of numbers into memory, you just need to count how many of each number there are. This requires 2^32 words of memory, not trillions of words. After doing that just scan down the array of counts, summing, until you find the midpoint.
Yeah, I thought of that actually, but the interviewer said “very little memory” at one point which gave me the impression that perhaps I only had some registers available to work with. Was this an algorithm for an embedded system?

The whole problem was kind of miscommunicated, because the interviewer showed up 10 minutes late, picked a problem from a list, and the requirements for the problem were only revealed when I started going a direction the interviewer wasn’t looking for (“Oh, the file is actually read-only.” “Oh, each number in the file is an integer, not a float.”)

That “miscommunication” you mention has been used against me in several interviews, because I was expected to ask questions (and sometimes a specific question they had in mind) before making assumptions. Well, then the 30min becomes an exercise in requirements gathering and not algorithmic implementation.
Which in fairness, is a reasonable competency to test for in an interview
Indeed. But I need clarity on which skill they want to test in thirty minutes.
Speaking as an interviewer: nope, you're not being tested on a single skill.
With 256 counters, you could use the same approach with four passes: pass i bins the numbers by byte i (0 = most sig., 3 = least sig.) and then identifies the bin that contains the median.

I really want to know what a one-pass, low-memory solution looks like, lol.

Perhaps the interviewer was looking for an argument that no such solution can exist? The counterargument would look like this: divide the N numbers into two halves, each N/2 numbers. Now, suppose you don't have enough memory to represent the first N/2 numbers (ignoring changes in ordering); in that case, two different bags of numbers will have the same representation in memory. One can now construct a second half of numbers for which the algorithm will get the wrong answer for at least one of the two colliding cases.

This is assuming a deterministic algorithm; maybe a random algorithm could work with high probability and less memory?

I've heard of senior people applying for jobs like this simply turning the interview question around and demanding that the person asking it solve it in the allotted time. A surprisingly high percentage of the time they can't.
This company receives so many candidates that the interviewer would have just ended the call and moved on to the next candidate.

I get the notion of making the point out of principle, but it’s sort of like arguing on the phone with someone at a call center—it’s better to just cut your losses quickly and move on to the next option in the current market.

And we, as software engineers, should also take that advice: it's better to just cut your losses quickly and move on to the next option in the current market.
Do you mean topK rather than median, for K small? You certainly cannot build a heap with trillions of items in it.
No, I mean median. Here is an article describing a very similar problem since I can’t link to the leetcode version: https://www.geeksforgeeks.org/median-of-stream-of-running-in...
> Auxiliary Space : O(n).

> The Space required to store the elements in Heap is O(n).

I don't think this algorithm is suitable for trillions of items.

I'm wondering what heap approach can solve that problem, as I can't think of any. Hopefully OP got a link to the thesis.

The n log n approach definitely works though.

But that stores all elements into memory?
That's a bullshit question.

My own response would have been a variant on radix-sort. Keep an array of 256 counters, and make a pass counting all of the high bytes. Now you know the high byte of the median. Make another pass keeping a histogram of the second byte of all values that match the high byte. And so on.

This takes four passes and requires 256 x 8 byte counters plus incidentals.

In a single pass you can't get the exact answer.

> … I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

What a bullshit task. I’m beginning to think this kind of interviewing should be banned. Seems to me it’s just an easy escape hatch for the interviewer/hiring manager when they want to discriminate based on prejudice.

Banning stupid interview questions is a bad idea for job applicants since they are such a good indication of bullshit job culture.
Excellent point.