by thaumaturgy 4783 days ago
I think people underestimate how explicitly-programmed human language is in humans. I'm starting to think that this might be the central problem in NLP right now.

Humans have good natural pattern-matching engines in their heads, but the entire body of syntax and vocabulary available to a person is the result of the memorization of a huge amount of text. I suspect the majority of people rarely ever develop truly novel words or phrases on their own (with the notable exception of Lewis Carroll). (Aside: in fact, this is exactly how "memes" work in the modern online sense; one person invents a novel word or phrase, and that is then parroted by a huge number of other people.)

I recently started work on an attempt to improve the classification of English vocabulary by grade level. I built a database using publicly-available sources, and the number of unique words that the average child has been exposed to by the 8th grade is mind boggling. One source cited 15,000 unique words and over a million words read annually.

Aside from the words themselves, children have also by that age memorized an even larger number of phrases, pieces of sentence structure, and full sentences.

I think that because we aren't able to enumerate everything we've memorized, we don't fully appreciate just how much data is stored in our heads. As a result, I think it's possible that computer science researchers have largely been chasing a ghost in terms of some kind of magical "understanding" of language; the answer to NLP might actually be to simply store and access a terabytes-sized data structure of vocabulary and phrases.

5 comments

The kind of "programming" that you are describing is fundamentally different than what Winograd did, and that was my point. This learning from many examples is an instance of inductive inference, and the complexity involved is why modern NLP research (and you in your project) uses machine learning techniques with massive datasets -- this more closely mimics the way we naturally acquire language. Trying to hand engineer all those rules and dependencies and exceptions is prohibitively difficult, which is why we have Siri and not SHRDLU+.
Just because we memorize a whole lot (which I agree with) does not mean that language is likely to be "pre-programmed" in the way that SHRDLU follows explicit, exhaustive rules. Formulating such rules requires planning because they are brittle, and this does not seem compatible with the way language acquisition happens.

Also, after accepting the premise that humans exploit an enormous store of data in language use, there still remain very difficult questions about what kind of representations we have available, and how powerful the search and recombination mechanisms are.

Memory-based language processing has existed for some time now, and while it is useful, it is certainly not the final answer to "the central problem in NLP" (however you define that; I'd suggest ambiguity resolution).

>> the answer to NLP might actually be to simply store and access a terabytes-sized data structure of vocabulary and phrases.

Isn't that effectively what Google Translate is doing? And its results are... varied.

I get the impression that Google Translate is strictly doing it in a Bayesian sense. For example, the recent "he praised the iPad" debacle. [1][2]

[1] http://code.google.com/p/android/issues/detail?id=38538
[2] http://techcrunch.com/2013/01/04/google-now-and-google-trans...

Based on my experience, the hilarious thing about NLP is that it is easy for humans to generate easy-to-parse sentences like "Facebook acquires Instagram.", but if you are trying to parse a naturally flowing conversation, you rarely get easy examples like that. There is so much context in our conversations.
>> I recently started work on an attempt to improve the classification of English vocabulary by grade level.

I would be interested in this. Let me know if you plan to open this. What data sources are you using?

I'll provide an API for it, won't be ready for months though. It's not a big priority yet -- part of a larger project.

The data sources aren't that interesting. After trying for a while to find something already pre-compiled, I quit and resorted to Googling for phrases like, "9th grade spelling list", and aggregating the data from the results by hand. There are a bunch of sites for teachers and home educators and the like that include tables of vocabulary for various grades. It's tedious, but it works.

Sorry if this sounds sticky, but since you have already done this, could you please share what you have? This sounds useful to me.
A bit behind schedule, but here it is:

http://www.shomisearch.com/api/vocab/grade/8/

http://www.shomisearch.com/api/vocab/grade+sources/9/

...etc. The two lists available for now are "grade" and "grade+sources"; the first list will return the vocabulary list for that grade, the second will return the vocabulary list plus the sites that the data was pulled from.

Valid grade levels are "pre-k", "k" (or "kindergarten" if you like), "1" thru "12", and "college" (or "collegiate").

Currently it just returns results as text/plain with no bells or whistles.
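
For anyone scripting against this, here is a minimal Python sketch of building the request URLs from the description above. The base URL, list names, and grade labels come from the post; the function names and the choice to fold "kindergarten" and "collegiate" into their short forms before building the URL are my own assumptions, not part of the API.

```python
BASE_URL = "http://www.shomisearch.com/api/vocab"

# The two lists named in the post.
VALID_LISTS = {"grade", "grade+sources"}

# Long-form labels from the post, mapped to the short form (an assumption
# made here for tidiness; the post says both spellings are accepted).
ALIASES = {"kindergarten": "k", "collegiate": "college"}

# "pre-k", "k", "1" through "12", and "college".
VALID_GRADES = {"pre-k", "k", "college"} | {str(n) for n in range(1, 13)}


def normalize_grade(grade):
    """Return the canonical grade label, or raise ValueError if unknown."""
    label = str(grade).strip().lower()
    label = ALIASES.get(label, label)
    if label not in VALID_GRADES:
        raise ValueError(f"unknown grade level: {grade!r}")
    return label


def vocab_url(list_name, grade):
    """Build the URL for one vocabulary list at one grade level."""
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name!r}")
    return f"{BASE_URL}/{list_name}/{normalize_grade(grade)}/"


if __name__ == "__main__":
    print(vocab_url("grade", 8))
    print(vocab_url("grade+sources", "kindergarten"))
```

Since the response is text/plain, the body can be read as-is with whatever HTTP client you prefer; the exact layout of that text is not described in the post, so no parsing is attempted here.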

On my to-do list for this: documentation at /api, JSON result formatting, more lists & list options, and the ability to POST some text to the vocab API and get back the median & mode of the grade values for the text.
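
The median & mode idea can be tried locally before the POST endpoint exists. This is only a rough sketch: the word-to-grade table below is made up for illustration, the tokenizer is deliberately crude, and treating grades as plain integers is an assumption (the post does not say how "pre-k" or "college" would be scored).

```python
import re
from statistics import median, multimode

# Hypothetical word -> grade table, for illustration only; real values
# would come from the vocabulary lists.
SAMPLE_GRADES = {"cat": 1, "river": 3, "ancient": 6, "hypothesis": 9}


def grade_stats(text, word_grades):
    """Return the median and mode(s) of grade values for known words in text.

    Words missing from word_grades are skipped. Returns None when no word
    in the text has a known grade.
    """
    words = re.findall(r"[a-z']+", text.lower())
    grades = [word_grades[w] for w in words if w in word_grades]
    if not grades:
        return None
    return {
        "median": median(grades),
        # multimode returns every value tied for most frequent.
        "modes": sorted(multimode(grades)),
        "matched": len(grades),
    }


if __name__ == "__main__":
    sample = "The cat crossed the ancient river; the cat had a hypothesis."
    print(grade_stats(sample, SAMPLE_GRADES))
```

Repeated words count once per occurrence here, so a text that leans heavily on a few easy words pulls both statistics down. Counting each distinct word once would be an equally defensible choice.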

I don't intend to do any of that right away though, since this all just started out as a planned feature for a larger project.

There are currently ~21,000 entries in the database. Lots of duplicates and disagreements on grade levels for words, as expected.
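
The post does not say how the duplicates and disagreements get resolved. One simple option, sketched here in Python as an assumption rather than a description of the actual database, is to group entries by word and keep the lower median of the grades the sources reported. The entry format (word, numeric grade) is hypothetical.

```python
from collections import defaultdict
from statistics import median_low


def consolidate(entries):
    """Collapse (word, grade) pairs into one grade per word.

    Duplicates are merged case-insensitively. When sources disagree, the
    lower median is used so the result is always a grade some source
    actually reported.
    """
    by_word = defaultdict(list)
    for word, grade in entries:
        by_word[word.strip().lower()].append(grade)
    return {word: median_low(grades) for word, grades in by_word.items()}


if __name__ == "__main__":
    raw = [
        ("Ancient", 6), ("ancient", 7), ("ancient", 6),
        ("river", 3), ("river", 4),
    ]
    print(consolidate(raw))
```

Taking the minimum instead (the earliest grade at which any source lists the word) would bias toward first exposure rather than consensus.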

If there's anything else you think is important enough for me to get on right away, let me know.

Thanks. This helps. This was a great idea. I would like to know what your larger project is when you are ready.
A natural-language search engine (hence my earlier comments on NLP). I've been using it as a news reader for a while now. Should be able to open invites for the reader part in a month or so.

The crawler collects lots of metadata from content. The vocabulary engine is part of a planned down-the-road feature that will provide additional metadata for crawled content, as well as eventually help users find other users that write comments they want to read. (It has an "anti-social" aspect planned, where user interaction will be allowed on reader content, but the software will encourage users to form loose-knit groups of around 20 or so.)

I think the number one problem of social networks right now is that they try to grow without restraint and force hundreds (or thousands) of people to all interact. But humans aren't wired like that; we don't do that well. What does seem to work well is LiveJournal-style communities, or the BB communities, where people get partitioned off into smaller groups by common interests, with other people crossing between interest groups.

I'm not super excited about the community stuff though. The NLP work has been a blast so far; the interpreter I wrote seems to be working out well. I took a somewhat naive, very clean approach to NLP, and I think it'll support up to a few thousand different types of metadata (I can't even imagine that yet) and at least as many different phrases. I have a little more mostly front-end work to do on the reader, then after that I'll start working on providing direct access to the search engine behind the reader. (The reader isn't an RSS reader, it's a search engine interface that lets you use the results from searches as news feeds -- like, right now I have a "front page of HN" news feed. Reddit content is also being crawled; I just need to edit the parser to accept a query like, "front page of HN and r/technology and r/startups".) Eventually I'll mess around with the community part.

So, for example, users with an account on the reader (and, later, the search engine) might have articles closer to their own reading level get a mild rankings boost.
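
A "mild rankings boost" could take many forms. As a purely illustrative Python sketch (the formula, parameter names, and default values are all assumptions, not the search engine's actual ranking), the boost below scales a base score by up to 10% and fades linearly to nothing once the article is three grade levels away from the reader.

```python
def boosted_score(base_score, article_grade, user_grade,
                  max_boost=0.1, window=3.0):
    """Scale base_score up slightly when the article's grade level is
    close to the reader's.

    max_boost is the largest fractional increase (0.1 means 10%).
    window is the grade distance at which the boost reaches zero.
    """
    distance = abs(article_grade - user_grade)
    closeness = max(0.0, 1.0 - distance / window)
    return base_score * (1.0 + max_boost * closeness)


def rank(articles, user_grade):
    """Sort (title, base_score, grade) tuples by boosted score, best first."""
    return sorted(
        articles,
        key=lambda a: boosted_score(a[1], a[2], user_grade),
        reverse=True,
    )


if __name__ == "__main__":
    articles = [
        ("dense paper", 10.0, 13),
        ("readable explainer", 9.5, 8),
    ]
    for title, _, _ in rank(articles, user_grade=8):
        print(title)
```

Keeping the boost multiplicative and small means it only reorders results that were already close in score, which seems to match the "mild" intent.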

OK. I'll have a basic version of the API available later today (Saturday). It will provide the word lists and sources. I need to sleep now.