Hacker News new | ask | show | jobs
by SomeStupidPoint 3464 days ago
I would like to see a solution to your problem with bounded memory.

In particular, the case where I want the top 3 words, you don't know the length of the stream, and you get random permutations of the same 4 words until I stop emitting them (where I will end by emitting 3 to break the tie).

That's not to say your problem isn't interesting -- just that while specifically constructing a problem as an example, you created one that's unbounded in required memory (in both storage per value and potentially number of values to store) and then demanded a solution that doesn't run out of memory.

I think most interview questions are similar nonsense.

2 comments

"I think most interview questions are similar nonsense."

It's probably a good idea to be careful with your words when you admit that you don't know the answer to a question.

First off: your example (random permutations of the same four words) doesn't require much memory at all. So if you think it does, you're wrong. You might overflow your counters, but that's a different problem.

A stream of random gibberish is certainly more challenging. But the cardinality of the English language isn't infinite (the OED has about 230k words, and that's with lot of words that nobody ever uses), so even a naive solution doesn't require "unbounded memory", as long as you take the problem statement seriously and don't do something ridiculous. That would be good enough to pass an interview.

But OK, let's say you do have a stream of random latin-encoded gibberish. What then? The problem statement is that you have to determine the top-10 words (or in this case "tokens") by frequency. The cardinality of the set is infinite, but the probability of duplication per token is small, and the output set is tiny. Do you really think you need unbounded storage?

In any case, even if you think a problem is "nonsense", it's probably true that the interviewer has thought about it more than you have. The part that frustrates you is highly likely to be the bit worth probing. A bad candidate will bomb out immediately; a decent candidate will provide a solid, if not perfect solution; a great candidate will solve the problem, see the broader theoretical aspects, and investigate those as well.

Do you really think you need unbounded storage?

This is an example of a trap that interviewers run into when they try to arbitrarily reword questions.

While I could be mistaken, an exact solution to the top-k problem requires O(N) space where N is the number of distinct items. I can trivially think of a stream of tokens that would defeat any reasonable computer in both available memory and general "storage" (ex: 1 quintillion distinct words, then the next 10 words are a duplicate of an existing word, then end of stream). Since you asked for an algorithm and not a heuristic, it's clear you are looking for an exact answer. So the answer is "Yes, I would need unbounded storage for the new question as asked". Since this is an interview, I'd expect that'd you'd want me to give you the technically correct, and accurate answer.

I've tripped up enough interviewers who have tried to slightly reword questions, but ended up changing their meaning.

Well, "algorithm" is just a way of saying "procedure", so a heuristic applies (IMO). I would accept a reasonable heuristic without much argument because the problem of random text is much harder...but honestly, if you got to this part, you've already passed the interview. A total flameout on this question is not even realizing that the storage scales with vocabulary size, instead of sequence length.

An exact solution to an unbounded sequence of purely random text probably does require unbounded memory (I say 'probably' only to hedge my bets here). But depending on the definition of "random", you can put pretty tight bounds on it and only be off by a little. I haven't done the math, but my gut says that a bloom filter, followed by incrementing counts only for positive hits from the filter, would scale well. There may be simpler approaches that make use of word length and the size of the latin alphabet (e.g. "all tokens of size N have 1/C(26,N) probability of colliding if letters are chosen from a uniform distribution, therefore...")

But again, if we actually had this conversation in an interview, there'd be no danger of not passing. Unless you were a jerk or something.

If your algorithm overflows counters and produces incorrect output as a result, then you don't have a correct solution.

There's no way to solve the unbounded stream case correctly with finite counters as you admit.

So you didn't actually present a bounded memory solution to the problem.

The number of bytes in a machine number is not the limiting factor here. You could use arbitrary precision numbers (with some constant factor of additional storage), and solve the precision problem.

Being pedantic about this point doesn't get you closer to a correct answer. Read dsp1234's responses. They are correct.

They're not the limiting factor in practice, but the limit behavior of the algorithm -- how it behaves on an arbitrary length stream -- can overflow count for a fixed length counter and if you use arbitrary precision, it's no longer bounded memory. So again, you've failed to supply a memory bounded algorithm that handles the arbitrary case (but won't admit it).

Again, I'm aware of how to solve it in practice and that the problem as stated is not what you meant, I was pointing out that as stated, it didn't have a memory bounded solution.

And that it's commom for people to self-assuredly misdefine interview problems, then not understand when the interviewee points out their implicit assumption.

In the given case, it's actually the "English words" limitation that saves the question. An exact answer to top-k problems is O(N) is storage space where N is the number of distinct items. So since N is ~200k, it's not a terrible problem to deal with. When N is 3 as in your given case, it's not really a problem. The general case, of when the stream is unbounded in the number of distinct items, is much harder, say the stream is 1 quintillion non-repeating words, then starts repeating. This requires an approximation algorithm, but is not likely the solution that was looked for since the person specifically said "slight less trivial", and the streaming approximation algorithm is more complex.

Though it could be still be saved by flipping the question back onto the interviewer's phrasing. They said "don't run out of memory", not "don't run out of disk space". So you could also solve this problem by, for example, writing the data into a database, then using SQL group by and top X functions to solve it once the stream has ended. But as an interviewer, I probably wouldn't be amused.

Hah! Hell, I'd even give credit to the SQL answer, if it came with the rest. ;-)

The only thing I'd add is that you can probably come up with a pretty simple heuristic or three to solve the problem for the case of truly random gibberish, without the need for elaborate data structures. But yes, this would be the answer of a great candidate.

You can't have finite counters per word, else you can't disambiguate between counter length of a single word then the pattern I described and just the pattern I described, despite the fact it impacts the answer.

Your solution is incorrect because it fails to handle arbitrary length streams as specified -- as do the other glib answers.

My point was the specification didn't seem to align with the intended problem, and that interviewers tend to make such mistakes on the description, and then get mad when you ask them to clarify if we're talking practice or theory.

Similarly, you don't account for arbitrary proper nouns, which could theoretically be unboundedly introduced to an arbitrary sized English corpus. I guess you can debate which of these are "English" (itself a proper noun!), but interviewers often treat you as dumb if you don't divine their intent in such underspecified situations.

You can't have finite counters per word

You only need one counter per word.

a quick sketch of the basic algorithm is:

  1.) Create a new list of type <string, number> into COUNTS
  2.) Read a word from the stream into WORD
  3.) if WORD exists in COUNTS, then increment the number for that entry
  4.) if WORD does not exist in COUNTS, then add WORD to COUNTS with a number of 1
  5.) if not end of stream, goto 2
  6.) traverse through COUNTS keeping a running list of the top 10 highest entries (Note that this and step 4 can often be combined to keep a running total of the top 10) into TOPCOUNTS
  7.) print TOPCOUNTS
This algorithm is O(N) space where N is the number of distinct items, as I mentioned previously.
Again, you're assuming count has a bound size per word, which isn't true for an unbounded stream.

Step 3 of your algorithm either requires unboundedness of count (ie, count can use arbitrary amounts of memory) or can overflow on arbitrary length streams of words (and hence, has cases where it produces the wrong output).

Hash tables FTW!