Hacker News new | ask | show | jobs
by rfurmani 5423 days ago
bzzt wrong.

>>> re.findall('(a|aa|aaa|ab)', 'aaab') ['a', 'a', 'a']

The correct answer would be ['aa', 'ab'] but unfortunately findall works greedily and so will not find the optimal solution. It is possible to specify it as a regex, but common implementations might take too much time to come up with the good solution.

4 comments

The mistake arises from the different semantics of findall versus what I initially wanted to do. I initially tried "^(dict|here)+$", which should return all the subwords, but doesn't (it only returns the last one).

I'm missing something, clearly, but my initial idea was a state machine that contained all the words (sorted by length, perhaps, but that's not necessary if we don't care about the solution with the "most words") and linked back to itself. It's essentially the same as the backtracking example, from a cursory glance.

Of course, findall is a dumb version that only finds disjoint substrings.

EDIT: It turns out that the group count in regexps is fixed, so it's not possible to return all matches of a single group, even if it's repeated. All my solution above does is show you whether there's a match or not, but not what it is (unless it's the single word).

Interestingly, the article explicitly states that our dictionary supports only the exact string lookup command, dict.contains(string). Strictly speaking, the full content of the dictionary isn't available to us, and we can't create the regular expression.
From the problem definition:

Q: What if there are multiple valid segmentations? A: Just return any valid segmentation if there is one.

I think ['a', 'a', 'ab'] is also valid, isn't it? So ['aa', 'ab'] would be one of the correct answers.

(of course ['a', 'a', 'a'] it's not because the 'b' is not a valid word and it's not in the solutions)

Ah right, I was a bit too quick to challenge. Dictionary: aa, aaa, aaaa, ab Word: aaaab