Hacker News new | ask | show | jobs
by dsp1234 3464 days ago
You can't have finite counters per word

You only need one counter per word.

a quick sketch of the basic algorithm is:

  1.) Create a new list of type <string, number> into COUNTS
  2.) Read a word from the stream into WORD
  3.) if WORD exists in COUNTS, then increment the number for that entry
  4.) if WORD does not exist in COUNTS, then add WORD to COUNTS with a number of 1
  5.) if not end of stream, goto 2
  6.) traverse through COUNTS keeping a running list of the top 10 highest entries (Note that this and step 4 can often be combined to keep a running total of the top 10) into TOPCOUNTS
  7.) print TOPCOUNTS
This algorithm is O(N) space where N is the number of distinct items, as I mentioned previously.
2 comments

Again, you're assuming count has a bound size per word, which isn't true for an unbounded stream.

Step 3 of your algorithm either requires unboundedness of count (ie, count can use arbitrary amounts of memory) or can overflow on arbitrary length streams of words (and hence, has cases where it produces the wrong output).

Hash tables FTW!