"find the subset in a given collection that matches this specific criteria"
So basically, a loop through a single table. That's as simple as it gets. You can make the problem more complicated, of course (e.g. "write a method to find the minimum and maximum ages of the male users"), but it's still pretty simple stuff.
A slightly less trivial "algorithm" question that should be equally easy for any decent programmer: I give you a document of english words. Write a function that counts the words, and returns the top ten words seen, by frequency. Now...solve the same problem when it's not a single document, but a stream of words of unspecified length. Don't run out of memory.
Depending on the wording it looks like his go to intro problem is actually easier than Fizz-Buzz as it is just filtering. For example say it was "Given a set return a subset where the values are less than 10." (such as set.Where(i => i < 10))
Do you have an example which correctly solves the problem but takes too long?
Behold
var AGESORT = rows.bubblesort("AGE ASCENDING")
var AGESORT1 = rows.bubblesort("AGE DESCENDING")
var MALE = "None"
var MALE1 = "None"
for row in AGESORT
if row.isMale
MALE = row
break
for row in AGESORT1
if row.isMale
MALE1 = row
break
print "Min age: " + MALE
print "Max age: " + MALE1
Stuff like this is usually the result of thoughts like "Ok, I'll break this problem into two, then combine the pieces. First I'll find the minimum age, which is easy once I sort the rows. Done. Ok, now I can see I can slightly alter that code so that it finds the max age! Now, I just put the pieces of code together, and do a little code organization to put like with like. I"
Well either I suppose. For the subset problem as you say a filter critera can be applied while looping through the collection. And that should be it. Now if the question was multiple criteria I could see the solution diverging rapidly. If the filter can be expressed as a rank we could sort the collection first and then binary search through it to find the boundaries. I suppose I'm just trying to imagine some way an interviewee may turn that into two loops, or three loops... more?
I would like to see a solution to your problem with bounded memory.
In particular, the case where I want the top 3 words, you don't know the length of the stream, and you get random permutations of the same 4 words until I stop emitting them (where I will end by emitting 3 to break the tie).
That's not to say your problem isn't interesting -- just that while specifically constructing a problem as an example, you created one that's unbounded in required memory (in both storage per value and potentially number of values to store) and then demanded a solution that doesn't run out of memory.
I think most interview questions are similar nonsense.
"I think most interview questions are similar nonsense."
It's probably a good idea to be careful with your words when you admit that you don't know the answer to a question.
First off: your example (random permutations of the same four words) doesn't require much memory at all. So if you think it does, you're wrong. You might overflow your counters, but that's a different problem.
A stream of random gibberish is certainly more challenging. But the cardinality of the English language isn't infinite (the OED has about 230k words, and that's with lot of words that nobody ever uses), so even a naive solution doesn't require "unbounded memory", as long as you take the problem statement seriously and don't do something ridiculous. That would be good enough to pass an interview.
But OK, let's say you do have a stream of random latin-encoded gibberish. What then? The problem statement is that you have to determine the top-10 words (or in this case "tokens") by frequency. The cardinality of the set is infinite, but the probability of duplication per token is small, and the output set is tiny. Do you really think you need unbounded storage?
In any case, even if you think a problem is "nonsense", it's probably true that the interviewer has thought about it more than you have. The part that frustrates you is highly likely to be the bit worth probing. A bad candidate will bomb out immediately; a decent candidate will provide a solid, if not perfect solution; a great candidate will solve the problem, see the broader theoretical aspects, and investigate those as well.
This is an example of a trap that interviewers run into when they try to arbitrarily reword questions.
While I could be mistaken, an exact solution to the top-k problem requires O(N) space where N is the number of distinct items. I can trivially think of a stream of tokens that would defeat any reasonable computer in both available memory and general "storage" (ex: 1 quintillion distinct words, then the next 10 words are a duplicate of an existing word, then end of stream). Since you asked for an algorithm and not a heuristic, it's clear you are looking for an exact answer. So the answer is "Yes, I would need unbounded storage for the new question as asked". Since this is an interview, I'd expect that'd you'd want me to give you the technically correct, and accurate answer.
I've tripped up enough interviewers who have tried to slightly reword questions, but ended up changing their meaning.
Well, "algorithm" is just a way of saying "procedure", so a heuristic applies (IMO). I would accept a reasonable heuristic without much argument because the problem of random text is much harder...but honestly, if you got to this part, you've already passed the interview. A total flameout on this question is not even realizing that the storage scales with vocabulary size, instead of sequence length.
An exact solution to an unbounded sequence of purely random text probably does require unbounded memory (I say 'probably' only to hedge my bets here). But depending on the definition of "random", you can put pretty tight bounds on it and only be off by a little. I haven't done the math, but my gut says that a bloom filter, followed by incrementing counts only for positive hits from the filter, would scale well. There may be simpler approaches that make use of word length and the size of the latin alphabet (e.g. "all tokens of size N have 1/C(26,N) probability of colliding if letters are chosen from a uniform distribution, therefore...")
But again, if we actually had this conversation in an interview, there'd be no danger of not passing. Unless you were a jerk or something.
The number of bytes in a machine number is not the limiting factor here. You could use arbitrary precision numbers (with some constant factor of additional storage), and solve the precision problem.
Being pedantic about this point doesn't get you closer to a correct answer. Read dsp1234's responses. They are correct.
They're not the limiting factor in practice, but the limit behavior of the algorithm -- how it behaves on an arbitrary length stream -- can overflow count for a fixed length counter and if you use arbitrary precision, it's no longer bounded memory. So again, you've failed to supply a memory bounded algorithm that handles the arbitrary case (but won't admit it).
Again, I'm aware of how to solve it in practice and that the problem as stated is not what you meant, I was pointing out that as stated, it didn't have a memory bounded solution.
And that it's commom for people to self-assuredly misdefine interview problems, then not understand when the interviewee points out their implicit assumption.
In the given case, it's actually the "English words" limitation that saves the question. An exact answer to top-k problems is O(N) is storage space where N is the number of distinct items. So since N is ~200k, it's not a terrible problem to deal with. When N is 3 as in your given case, it's not really a problem. The general case, of when the stream is unbounded in the number of distinct items, is much harder, say the stream is 1 quintillion non-repeating words, then starts repeating. This requires an approximation algorithm, but is not likely the solution that was looked for since the person specifically said "slight less trivial", and the streaming approximation algorithm is more complex.
Though it could be still be saved by flipping the question back onto the interviewer's phrasing. They said "don't run out of memory", not "don't run out of disk space". So you could also solve this problem by, for example, writing the data into a database, then using SQL group by and top X functions to solve it once the stream has ended. But as an interviewer, I probably wouldn't be amused.
Hah! Hell, I'd even give credit to the SQL answer, if it came with the rest. ;-)
The only thing I'd add is that you can probably come up with a pretty simple heuristic or three to solve the problem for the case of truly random gibberish, without the need for elaborate data structures. But yes, this would be the answer of a great candidate.
You can't have finite counters per word, else you can't disambiguate between counter length of a single word then the pattern I described and just the pattern I described, despite the fact it impacts the answer.
Your solution is incorrect because it fails to handle arbitrary length streams as specified -- as do the other glib answers.
My point was the specification didn't seem to align with the intended problem, and that interviewers tend to make such mistakes on the description, and then get mad when you ask them to clarify if we're talking practice or theory.
Similarly, you don't account for arbitrary proper nouns, which could theoretically be unboundedly introduced to an arbitrary sized English corpus. I guess you can debate which of these are "English" (itself a proper noun!), but interviewers often treat you as dumb if you don't divine their intent in such underspecified situations.
1.) Create a new list of type <string, number> into COUNTS
2.) Read a word from the stream into WORD
3.) if WORD exists in COUNTS, then increment the number for that entry
4.) if WORD does not exist in COUNTS, then add WORD to COUNTS with a number of 1
5.) if not end of stream, goto 2
6.) traverse through COUNTS keeping a running list of the top 10 highest entries (Note that this and step 4 can often be combined to keep a running total of the top 10) into TOPCOUNTS
7.) print TOPCOUNTS
This algorithm is O(N) space where N is the number of distinct items, as I mentioned previously.
Again, you're assuming count has a bound size per word, which isn't true for an unbounded stream.
Step 3 of your algorithm either requires unboundedness of count (ie, count can use arbitrary amounts of memory) or can overflow on arbitrary length streams of words (and hence, has cases where it produces the wrong output).
That's more advanced then the question I ask typically. Though I might use something like this if I'm interviewing someone with many years of experience (10+).
As I explained above, my question is a warm up - meant to break the ice, calm nerves. But it's surprising to me how effective even a simple list traversal is at identifying weaknesses in a candidates programming ability.
Depending on the platform, there's at least 2 ways to do that :-)
E.g. - "functional" is simply something like:
subset = filter( predicate, superset )
... but on a "procedural" platform, it's a bit more challenging, starting with how much space to allocate for the result, or whether to mutate the input, and other wading through the swamp.
But, yeah, I can see how it would get people to start up talking somewhere, or else go into "deer in the headlights" mode.
Just out of curiosity: how would you feel if I answered this question by using existing function [1] in stdlib? Would you consider that as (good) sign of knowing the tools or would you prefer a fresh implementation?
I see it as a positive if a candidate uses an existing solution to the problem. That's exactly what a good engineer would do in the real world.
But then I take away the library/method and re-state the problem. Because the point is to see if they can understand the problem, and devise a solution.
Heh, at my previous employer the only coding question we asked during interviews was: "Find the largest element in an array of integers." Some of the ways people found to not solve that simple problem were amazing.
Now, I'm generally against asking hard-core CS questions in a live interview... but a simple bozo filter is probably a good idea.
It really does surprise me how a simple programming questions like this does stump the average interviewee. If they need a bit of guidance to get started, I never hold it against the candidate, but if they need a massive hand holding to come to a solution, or if they can't explain how their solution works and how it could be improved, then I know where they stand programming wise.
It is a low bar skills wise, and it tells a lot within a very short amount of time.
"find the subset in a given collection that matches this specific criteria"
So basically, a loop through a single table. That's as simple as it gets. You can make the problem more complicated, of course (e.g. "write a method to find the minimum and maximum ages of the male users"), but it's still pretty simple stuff.
A slightly less trivial "algorithm" question that should be equally easy for any decent programmer: I give you a document of english words. Write a function that counts the words, and returns the top ten words seen, by frequency. Now...solve the same problem when it's not a single document, but a stream of words of unspecified length. Don't run out of memory.