|
|
|
|
|
by dsp1234
3464 days ago
|
|
Do you really think you need unbounded storage? This is an example of a trap that interviewers run into when they try to arbitrarily reword questions. While I could be mistaken, an exact solution to the top-k problem requires O(N) space where N is the number of distinct items. I can trivially think of a stream of tokens that would defeat any reasonable computer in both available memory and general "storage" (ex: 1 quintillion distinct words, then the next 10 words are a duplicate of an existing word, then end of stream). Since you asked for an algorithm and not a heuristic, it's clear you are looking for an exact answer. So the answer is "Yes, I would need unbounded storage for the new question as asked". Since this is an interview, I'd expect that'd you'd want me to give you the technically correct, and accurate answer. I've tripped up enough interviewers who have tried to slightly reword questions, but ended up changing their meaning. |
|
An exact solution to an unbounded sequence of purely random text probably does require unbounded memory (I say 'probably' only to hedge my bets here). But depending on the definition of "random", you can put pretty tight bounds on it and only be off by a little. I haven't done the math, but my gut says that a bloom filter, followed by incrementing counts only for positive hits from the filter, would scale well. There may be simpler approaches that make use of word length and the size of the latin alphabet (e.g. "all tokens of size N have 1/C(26,N) probability of colliding if letters are chosen from a uniform distribution, therefore...")
But again, if we actually had this conversation in an interview, there'd be no danger of not passing. Unless you were a jerk or something.