Hacker News new | ask | show | jobs
Parserator – A family of probabalistic parsers (parserator.datamade.us)
54 points by vierja 4089 days ago
7 comments

To be a little bit pedantic: in NLP terms, this is a sequence labeling problem, not parsing. "Parsing" is normally only used for methods that can handle CFGs or similar nested structures. This tool just seems to be breaking up a string into a linear sequence of labeled segments.

I don't mean that as a criticism of the project itself, though. The demos alone are pretty cool, and the framework looks incredibly useful.

you're incorrect. To be more pedantic, sequence labeling is a special case of parsing. Parsing need not be hierarchical. Chunking, aka shallow parsing, is a sequence labeling task.

Parsing is commonly understood to infer an hierarchical structure, but that's not required.

Indeed, the Latin root leads us to understand that parsing is about breaking an object into parts. That these parts nest according to a particular grammar is something of an important implementation detail.

Didn't they announce this on Hacker News a few weeks ago?

This might work better if you had a database of almost every street name and almost every place name. Then you could take in an address, and classify words as one or more of [StreetName, PlaceName, StreetType, etc.]. Some words can appear in more than one of those categories, which is when a deterministic parser without a full database fails. Then let the learning algorithm deal with ambiguities such as "1 Park Lane", "1 Lane Park", and such. You'd have a better chance of dealing with the hard cases. Expecting this to recognize street words on its own is a reach.

You can get about 95% successful parsing of US business addresses with a relatively simple parser that lacks a name database. (I have one running right now on 20 million addresses.) Then it gets hard. Are they doing better than that?

The commercial parsers with full address databases do much better.

"Duzbuns Hopsit pfarmerrsc"

Suffers from all the usual problems. Try e.g.

123 1/2 Green Onion, Some City, CA

1/2 is treated as a number suffix (wrong, part of the number), Onion is treated as StreetPost, e.g. a classifier similar to Blvd. or Street.

Miss a comma, and the parser will be completely confused.

So, yay, now my address is going to be probabilistically wrong ;)

It seems to be working a little differently now. (I was going to say "better" but that's debatable.) 1/2 is still in the number suffix, but now "Green Onion, Some City" is the place name, and it doesn't try to parse it further.
This seems to be insanely relevant: I'm currently working on transforming government documents into structured xml and I've had to be stopped repeatedly from implementing something like this. I have a treetop grammar now that (mostly) works, but I'm tempted to try this.
You are? That's interesting, b/c that also something I am currently working on.

What do you need the grammatical parsing for? Identification of named entities?

We're building a tracker for EU legislative process. There's xml markup for legislative documents (akoma ntoso) and we need to transform the pdfs that the EU publishes into it to allow, for example, user annotation (and just good html representation in general. We've built on this South African project: https://github.com/longhotsummer/slaw
Are you working for an NGO, like Open Data Foundation or something like that?

I'd be interested in following your progress. You can send me a mail, so we can connect.

It's an NGO startup: https://twitter.com/aecu_eu
Frederick James Jonathan Bull-White

Is three given names followed by a hyphenated family name.

Not a corporation.

I just made it up, but it is representative of a substantial subset of the English speaking population of the world.

Have looked at Spanish naming conventions?

Without a substantial amount of context I think that such parser cannot be made to be useful.

The probablepeople parser failed to parse "Alex Chamberlain"...
They might have been having problems earlier, it works as expected when I tried:

> Alex, GivenName

> Chamberlain, Surname

I can't get any kind of US address to work, just says "error parsing that address"
The two I tried both worked. Try this one: 142 West St, Cantonville, OH 98104 (Not a real address)