| HN Mirror

Well, "algorithm" is just a way of saying "procedure", so a heuristic applies (IMO). I would accept a reasonable heuristic without much argument because the problem of random text is much harder...but honestly, if you got to this part, you've already passed the interview. A total flameout on this question is not even realizing that the storage scales with vocabulary size, instead of sequence length.

An exact solution to an unbounded sequence of purely random text probably does require unbounded memory (I say 'probably' only to hedge my bets here). But depending on the definition of "random", you can put pretty tight bounds on it and only be off by a little. I haven't done the math, but my gut says that a bloom filter, followed by incrementing counts only for positive hits from the filter, would scale well. There may be simpler approaches that make use of word length and the size of the latin alphabet (e.g. "all tokens of size N have 1/C(26,N) probability of colliding if letters are chosen from a uniform distribution, therefore...")

But again, if we actually had this conversation in an interview, there'd be no danger of not passing. Unless you were a jerk or something.