Hacker News new | ask | show | jobs
by StavrosK 5422 days ago
Only slightly joking:

    re.findall("(your|dict|here)", "yourword")
I like the idea of constructing a state machine to do all the matching.
5 comments

I love it. You spark an (imho) much more interesting debate: that of raw smartness vs pragmatic productivity.

A colleague of mine (now hired) was asked to code a string compare during her interview for a .NET function. She said "String.Compare()". This puzzled the interviewers for a while. They asked her whether she could write it out, she said she didn't see the point.

They ended up hiring her for other reasons (notably, references from trusted colleagues who had worked with her), but I still wonder whether that attitude worked for her or against her.

There's something to be said for NOT reinventing the wheel. Standard libraries are standard for a reason - the "String.Compare()" function is likely faster for just about every case, in addition to being already there.
I keep seeing this argument on threads about interview questions.

The principle of not reinventing the wheel has nothing to do with interview questions. Sure, if you're actually working on solving an actual problem at your job, then most of the time you'll be better off using a standard library function instead of rolling your own.

In the context of the job interview, however, it doesn't matter if the solution to the problem is well-known or whether it can be found in a standard library. The problem of the interviewer is not to find out how to compare strings, it's determining if the candidate will be able to write proper code if hired. And the string-compare question might be a good starting point for evaluating the candidate's level. In any case, the fact that a solution is in a standard library is irrelevant.

If I was the interviewer I'd create a new data structure Foo and then ask the candidate to implement Foo.compare(). The question remains essentially the same, so I wonder what the candidate would reply.

> The problem of the interviewer is not to find out how to compare strings, it's determining if the candidate will be able to write proper code if hired.

The trouble with the question is that proper code to compare strings is almost certainly going to be a call to some existing library function. There are only rare cases (e.g. when you're one of the 50 people in the world who implement libc) that it makes sense to write it yourself. It's not clear from "code a string compare" exactly which set of wheels the interviewer wants reinvented: if strcmp is out of bounds, can you use strlen and memcmp? Because if strcmp was somehow buggy, that might be a reasonable thing to do. If the problem is that strcmp is too slow, should we maybe drop to assembly? Or change to a counted-string representation to avoid byte-by-byte operations? Or calculate hashes when strings are mutated, or intern them?

(Maybe in C strcmp is a bad example, since

    while (*s && *s == *t) { 
      s++; 
      t++;
    }
    return *t - *s;
is already about as simple as anything you'd do with strlen and memcmp...)

> If I was the interviewer I'd create a new data structure Foo and then ask the candidate to implement Foo.compare().

I think that's a better approach.

They would say Foo.ToString().Compare()... ;)
And not get the job, because it's likely wrong.
That's my point. At work, I'd prefer someone using String.Compare() over some nasty hand-crafted for-loop with switches and things for all kinds of collation issues. Why would I ask something else during the interview?

I love the GGP's solution for the same reason. If the regex is compiled only once, it may be only marginally slower (if at all), and it significantly improves readability and maintainability, and tremendously reduces the chances for bugs.

I'd hire the him. I have the impression most Google-interviewers (and similar dudes/dudettes) wouldn't. I wonder why.

>I have the impression most Google-interviewers (and similar dudes/dudettes) wouldn't. I wonder why

Perhaps that was rhetorical, but I'll answer anyway:

Being able to wire up existing libraries to accomplish a goal is a pretty low bar to set as far as proficiency goes. Google doesn't want code monkeys. The solution above is perfectly good from a software engineering perspective, but it doesn't show the depth of the candidates knowledge nor how strong their grasp of CS techniques is.

Google's interviews are more like IQ tests than software engineering tests, using CS as the measuring tool. When you're Google you can afford to be that selective.

I'm glad that you like my solution, but please note that, as another commenter said, it's a buggier version of the original regex I thought of ("^(dict|here)+$"), which I think should work but doesn't, at least not in Python.

I suspect it's because the match group is being replaced with the last match rather than added as another group, but it will work as a state machine, and is pretty much equivalent to the backtracking example in the article (although with much less code, and no memoization).

That said, I think that the reason interviewers ask about functions for which we have well-known implementations is to see whether or not you know how they work and/or could implement them yourself. Nobody will reasonably expect you to implement your own string comparison routine, but you could score points if, for example, you said Boyer-Moore for string searching rather than the naive iterative version.

Nobody will reasonably expect you to implement your own string comparison routine, ...

Standard string comparisons exit on the first mismatched character, which is insecure.

Insecure how?
bzzt wrong.

>>> re.findall('(a|aa|aaa|ab)', 'aaab') ['a', 'a', 'a']

The correct answer would be ['aa', 'ab'] but unfortunately findall works greedily and so will not find the optimal solution. It is possible to specify it as a regex, but common implementations might take too much time to come up with the good solution.

The mistake arises from the different semantics of findall versus what I initially wanted to do. I initially tried "^(dict|here)+$", which should return all the subwords, but doesn't (it only returns the last one).

I'm missing something, clearly, but my initial idea was a state machine that contained all the words (sorted by length, perhaps, but that's not necessary if we don't care about the solution with the "most words") and linked back to itself. It's essentially the same as the backtracking example, from a cursory glance.

Of course, findall is a dumb version that only finds disjoint substrings.

EDIT: It turns out that the group count in regexps is fixed, so it's not possible to return all matches of a single group, even if it's repeated. All my solution above does is show you whether there's a match or not, but not what it is (unless it's the single word).

Interestingly, the article explicitly states that our dictionary supports only the exact string lookup command, dict.contains(string). Strictly speaking, the full content of the dictionary isn't available to us, and we can't create the regular expression.
From the problem definition:

Q: What if there are multiple valid segmentations? A: Just return any valid segmentation if there is one.

I think ['a', 'a', 'ab'] is also valid, isn't it? So ['aa', 'ab'] would be one of the correct answers.

(of course ['a', 'a', 'a'] it's not because the 'b' is not a valid word and it's not in the solutions)

Ah right, I was a bit too quick to challenge. Dictionary: aa, aaa, aaaa, ab Word: aaaab
There's a library for constructing a deterministic regular expression from a dictionary, by the way, which would right away give you the exponential-time result. If you wrapped it in a * and applied it with a guaranteed-linear-time regular-expression engine like RE2, you'd find out whether the string was segmentable (and as a bonus you wouldn't have to construct the deterministic RE yourself) but I don't know if you'd get the actual segments.
This was actually going to be my solution as well. Regex engines are optimised for exactly this sort of problem, why duplicate the work manually?
rfumani gave one reason: it won't produce the right answer. Here's another: your regex engine will probably fall over dead if you ask it apply a million-word alternation.
grep could be fine. I wouldn't trust backtracking regex engines, though.
Let's try it:

    >>> re.compile('|'.join(open('/usr/share/dict/words').read().split()), re.I)
    OverflowError: regular expression code size limit exceeded
For smaller dicts, it should work fine, but it evidently doesn't do so well on larger ones.