Hacker News new | ask | show | jobs
by dsp1234 3465 days ago
In the given case, it's actually the "English words" limitation that saves the question. An exact answer to top-k problems is O(N) is storage space where N is the number of distinct items. So since N is ~200k, it's not a terrible problem to deal with. When N is 3 as in your given case, it's not really a problem. The general case, of when the stream is unbounded in the number of distinct items, is much harder, say the stream is 1 quintillion non-repeating words, then starts repeating. This requires an approximation algorithm, but is not likely the solution that was looked for since the person specifically said "slight less trivial", and the streaming approximation algorithm is more complex.

Though it could be still be saved by flipping the question back onto the interviewer's phrasing. They said "don't run out of memory", not "don't run out of disk space". So you could also solve this problem by, for example, writing the data into a database, then using SQL group by and top X functions to solve it once the stream has ended. But as an interviewer, I probably wouldn't be amused.

2 comments

Hah! Hell, I'd even give credit to the SQL answer, if it came with the rest. ;-)

The only thing I'd add is that you can probably come up with a pretty simple heuristic or three to solve the problem for the case of truly random gibberish, without the need for elaborate data structures. But yes, this would be the answer of a great candidate.

You can't have finite counters per word, else you can't disambiguate between counter length of a single word then the pattern I described and just the pattern I described, despite the fact it impacts the answer.

Your solution is incorrect because it fails to handle arbitrary length streams as specified -- as do the other glib answers.

My point was the specification didn't seem to align with the intended problem, and that interviewers tend to make such mistakes on the description, and then get mad when you ask them to clarify if we're talking practice or theory.

Similarly, you don't account for arbitrary proper nouns, which could theoretically be unboundedly introduced to an arbitrary sized English corpus. I guess you can debate which of these are "English" (itself a proper noun!), but interviewers often treat you as dumb if you don't divine their intent in such underspecified situations.

You can't have finite counters per word

You only need one counter per word.

a quick sketch of the basic algorithm is:

  1.) Create a new list of type <string, number> into COUNTS
  2.) Read a word from the stream into WORD
  3.) if WORD exists in COUNTS, then increment the number for that entry
  4.) if WORD does not exist in COUNTS, then add WORD to COUNTS with a number of 1
  5.) if not end of stream, goto 2
  6.) traverse through COUNTS keeping a running list of the top 10 highest entries (Note that this and step 4 can often be combined to keep a running total of the top 10) into TOPCOUNTS
  7.) print TOPCOUNTS
This algorithm is O(N) space where N is the number of distinct items, as I mentioned previously.
Again, you're assuming count has a bound size per word, which isn't true for an unbounded stream.

Step 3 of your algorithm either requires unboundedness of count (ie, count can use arbitrary amounts of memory) or can overflow on arbitrary length streams of words (and hence, has cases where it produces the wrong output).

Hash tables FTW!