Hacker News new | ask | show | jobs
by timr 3465 days ago
"I think most interview questions are similar nonsense."

It's probably a good idea to be careful with your words when you admit that you don't know the answer to a question.

First off: your example (random permutations of the same four words) doesn't require much memory at all. So if you think it does, you're wrong. You might overflow your counters, but that's a different problem.

A stream of random gibberish is certainly more challenging. But the cardinality of the English language isn't infinite (the OED has about 230k words, and that's with lot of words that nobody ever uses), so even a naive solution doesn't require "unbounded memory", as long as you take the problem statement seriously and don't do something ridiculous. That would be good enough to pass an interview.

But OK, let's say you do have a stream of random latin-encoded gibberish. What then? The problem statement is that you have to determine the top-10 words (or in this case "tokens") by frequency. The cardinality of the set is infinite, but the probability of duplication per token is small, and the output set is tiny. Do you really think you need unbounded storage?

In any case, even if you think a problem is "nonsense", it's probably true that the interviewer has thought about it more than you have. The part that frustrates you is highly likely to be the bit worth probing. A bad candidate will bomb out immediately; a decent candidate will provide a solid, if not perfect solution; a great candidate will solve the problem, see the broader theoretical aspects, and investigate those as well.

2 comments

Do you really think you need unbounded storage?

This is an example of a trap that interviewers run into when they try to arbitrarily reword questions.

While I could be mistaken, an exact solution to the top-k problem requires O(N) space where N is the number of distinct items. I can trivially think of a stream of tokens that would defeat any reasonable computer in both available memory and general "storage" (ex: 1 quintillion distinct words, then the next 10 words are a duplicate of an existing word, then end of stream). Since you asked for an algorithm and not a heuristic, it's clear you are looking for an exact answer. So the answer is "Yes, I would need unbounded storage for the new question as asked". Since this is an interview, I'd expect that'd you'd want me to give you the technically correct, and accurate answer.

I've tripped up enough interviewers who have tried to slightly reword questions, but ended up changing their meaning.

Well, "algorithm" is just a way of saying "procedure", so a heuristic applies (IMO). I would accept a reasonable heuristic without much argument because the problem of random text is much harder...but honestly, if you got to this part, you've already passed the interview. A total flameout on this question is not even realizing that the storage scales with vocabulary size, instead of sequence length.

An exact solution to an unbounded sequence of purely random text probably does require unbounded memory (I say 'probably' only to hedge my bets here). But depending on the definition of "random", you can put pretty tight bounds on it and only be off by a little. I haven't done the math, but my gut says that a bloom filter, followed by incrementing counts only for positive hits from the filter, would scale well. There may be simpler approaches that make use of word length and the size of the latin alphabet (e.g. "all tokens of size N have 1/C(26,N) probability of colliding if letters are chosen from a uniform distribution, therefore...")

But again, if we actually had this conversation in an interview, there'd be no danger of not passing. Unless you were a jerk or something.

If your algorithm overflows counters and produces incorrect output as a result, then you don't have a correct solution.

There's no way to solve the unbounded stream case correctly with finite counters as you admit.

So you didn't actually present a bounded memory solution to the problem.

The number of bytes in a machine number is not the limiting factor here. You could use arbitrary precision numbers (with some constant factor of additional storage), and solve the precision problem.

Being pedantic about this point doesn't get you closer to a correct answer. Read dsp1234's responses. They are correct.

They're not the limiting factor in practice, but the limit behavior of the algorithm -- how it behaves on an arbitrary length stream -- can overflow count for a fixed length counter and if you use arbitrary precision, it's no longer bounded memory. So again, you've failed to supply a memory bounded algorithm that handles the arbitrary case (but won't admit it).

Again, I'm aware of how to solve it in practice and that the problem as stated is not what you meant, I was pointing out that as stated, it didn't have a memory bounded solution.

And that it's commom for people to self-assuredly misdefine interview problems, then not understand when the interviewee points out their implicit assumption.