To be a little bit pedantic: in NLP terms, this is a sequence labeling problem, not parsing. "Parsing" is normally only used for methods that can handle CFGs or similar nested structures. This tool just seems to be breaking up a string into a linear sequence of labeled segments.
I don't mean that as a criticism of the project itself, though. The demos alone are pretty cool, and the framework looks incredibly useful.
You're incorrect. To be more pedantic, sequence labeling is a special case of parsing. Parsing need not be hierarchical. Chunking, aka shallow parsing, is a sequence labeling task.
Parsing is commonly understood to infer a hierarchical structure, but that's not required.
Indeed, the Latin root leads us to understand that parsing is about breaking an object into parts. That these parts nest according to a particular grammar is something of an important implementation detail.
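For anyone following the terminology debate, here is a rough sketch of what "a linear sequence of labeled segments" looks like in practice. It assumes some tagger has already assigned one label per token; the label names and the `segments` helper are made up for illustration and are not the tool's actual API.

```python
from itertools import groupby


def segments(tokens, labels):
    """Collapse per-token labels into a flat list of (label, text) segments."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    pairs = zip(tokens, labels)
    # Adjacent tokens that share a label are merged into one segment.
    return [
        (label, " ".join(tok for tok, _ in group))
        for label, group in groupby(pairs, key=lambda p: p[1])
    ]


# Hypothetical tagger output for one address.
tokens = ["123", "Main", "St", "Some", "City"]
labels = ["AddressNumber", "StreetName", "StreetType", "PlaceName", "PlaceName"]
result = segments(tokens, labels)
```

The output is flat: no segment contains another, which is the property the "sequence labeling" side of the argument is pointing at. A CFG-style parser would instead return a tree.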
Didn't they announce this on Hacker News a few weeks ago?
This might work better if you had a database of almost every street name and almost every place name. Then you could take in an address, and classify words as one or more of [StreetName, PlaceName, StreetType, etc.]. Some words can appear in more than one of those categories, which is when a deterministic parser without a full database fails. Then let the learning algorithm deal with ambiguities such as "1 Park Lane", "1 Lane Park", and such. You'd have a better chance of dealing with the hard cases. Expecting this to recognize street words on its own is a reach.
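A minimal sketch of the lookup step described above, assuming tiny hand-written word lists in place of a real street and place name database. The `GAZETTEER` contents, the `Number` label, and the `candidate_labels` helper are all hypothetical.

```python
# Toy stand-ins for a real database of street names, place names, and street types.
GAZETTEER = {
    "StreetName": {"park", "lane", "main"},
    "PlaceName": {"park", "springfield"},
    "StreetType": {"lane", "park", "st", "ave"},
}


def candidate_labels(address):
    """Return (token, set of possible labels) for each whitespace-separated token."""
    out = []
    for token in address.split():
        key = token.lower().strip(".,")
        if key.isdigit():
            cands = {"Number"}
        else:
            # A word may appear in several lists; keep every label it matches.
            cands = {label for label, words in GAZETTEER.items() if key in words}
        out.append((token, cands))
    return out


park_lane = candidate_labels("1 Park Lane")
lane_park = candidate_labels("1 Lane Park")
```

With these toy lists, "Park" and "Lane" get the same candidate sets in both addresses, so the lookup alone cannot tell them apart. The candidate sets would be fed to the learning algorithm as features, and it would use word order and context to resolve the ambiguity.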
You can get about 95% successful parsing of US business addresses with a relatively simple parser that lacks a name database. (I have one running right now on 20 million addresses.) Then it gets hard. Are they doing better than that?
The commercial parsers with full address databases do much better.
It seems to be working a little differently now. (I was going to say "better" but that's debatable.) 1/2 is still in the number suffix, but now "Green Onion, Some City" is the place name, and it doesn't try to parse it further.
This seems to be insanely relevant: I'm currently working on transforming government documents into structured XML and I've had to be stopped repeatedly from implementing something like this. I have a Treetop grammar now that (mostly) works, but I'm tempted to try this.
We're building a tracker for the EU legislative process. There's an XML markup for legislative documents (Akoma Ntoso), and we need to transform the PDFs that the EU publishes into it to allow, for example, user annotation (and just a good HTML representation in general). We've built on this South African project: https://github.com/longhotsummer/slaw