Hacker News
by shmageggy 4783 days ago
SHRDLU is definitely amazing, especially given its age, but one's amazement is tempered a little bit (or maybe enhanced, depending on perspective) when you realize that it achieved what it did primarily through really great engineering rather than some fundamental insight about language. Since SHRDLU's world is so limited, Winograd was able to explicitly program every facet of its language understanding. Unsurprisingly, this approach is totally not scalable and this reveals a little about why we don't have fully human-like language programs.

I think people underestimate how explicitly-programmed human language is in humans. I'm starting to think that this might be the central problem in NLP right now.

Humans have good natural pattern-matching engines in their heads, but the entire body of syntax and vocabulary available to a person is the result of the memorization of a huge amount of text. I suspect the majority of people rarely ever develop truly novel words or phrases on their own (with the notable exception of Lewis Carroll). (Aside: in fact, this is exactly how "memes" work in the modern online sense; one person invents a novel word or phrase, and that is then parroted by a huge number of other people.)

I recently started work on an attempt to improve the classification of English vocabulary by grade level. I built a database using publicly-available sources, and the number of unique words that the average child has been exposed to by the 8th grade is mind-boggling. One source cited 15,000 unique words and over a million words read annually.

Aside from the words themselves, children have also by that age memorized an even larger number of phrases, pieces of sentence structure, and full sentences.

I think that because we aren't able to enumerate everything we've memorized, we don't fully appreciate just how much data is stored in our heads. As a result, I think it's possible that computer science researchers have largely been chasing a ghost in terms of some kind of magical "understanding" of language; the answer to NLP might actually be to simply store and access a terabytes-sized data structure of vocabulary and phrases.

The kind of "programming" that you are describing is fundamentally different from what Winograd did, and that was my point. This learning from many examples is an instance of inductive inference, and the complexity involved is why modern NLP research (and you in your project) uses machine learning techniques with massive datasets -- this more closely mimics the way we naturally acquire language. Trying to hand-engineer all those rules and dependencies and exceptions is prohibitively difficult, which is why we have Siri and not SHRDLU+.
Just because we memorize a whole lot (which I agree with) does not mean that language is likely to be "pre-programmed" in the way that SHRDLU follows explicit, exhaustive rules. Formulating such rules requires planning because they are brittle, and this does not seem compatible with the way language acquisition happens.

Also, after accepting the premise that humans exploit an enormous store of data in language use, there still remain very difficult questions about what kind of representations we have available, and how powerful the search and recombination mechanisms are.

Memory-based language processing has existed for some time now, and while it is useful, it is certainly not the final answer to "the central problem in NLP" (however you define that; I'd suggest ambiguity resolution).

>> the answer to NLP might actually be to simply store and access a terabytes-sized data structure of vocabulary and phrases.

Isn't that effectively what Google Translate is doing? And its results are... varied.

I get the impression that Google Translate is strictly doing it in a Bayesian sense. For example, the recent "he praised the iPad" debacle. [1][2]

[1] http://code.google.com/p/android/issues/detail?id=38538 [2] http://techcrunch.com/2013/01/04/google-now-and-google-trans...

Based on my experience, the hilarious thing about NLP is that it is easy for humans to generate easy-to-parse sentences like "Facebook acquires Instagram.", but if you are trying to parse a naturally flowing conversation, you rarely get easy examples like that. There is so much context in our conversations.
>> I recently started work on an attempt to improve the classification of English vocabulary by grade level.

I would be interested in this. Let me know if you plan to open this. What data sources are you using?

I'll provide an API for it, won't be ready for months though. It's not a big priority yet -- part of a larger project.

The data sources aren't that interesting. After trying for a while to find something already pre-compiled, I quit and resorted to Googling for phrases like, "9th grade spelling list", and aggregating the data from the results by hand. There are a bunch of sites for teachers and home educators and the like that include tables of vocabulary for various grades. It's tedious, but it works.

Sorry if this sounds sticky, but since you have already done this, could you please share what you have? This sounds useful to me.
A bit behind schedule, but here it is:

http://www.shomisearch.com/api/vocab/grade/8/

http://www.shomisearch.com/api/vocab/grade+sources/9/

...etc. The two lists available for now are "grade" and "grade+sources"; the first list will return the vocabulary list for that grade, the second will return the vocabulary list plus the sites that the data was pulled from.

Valid grade levels are "pre-k", "k" (or "kindergarten" if you like), "1" thru "12", and "college" (or "collegiate").

Currently it just returns results as text/plain with no bells or whistles.
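
For anyone who wants to script against the endpoints above, here is a minimal sketch of a client. It assumes only what is stated in this comment: the two list names, the listed grade levels, and a text/plain response. The helper names (`vocab_url`, `fetch_vocab`), the alias handling, and the UTF-8 decoding are my own assumptions, not part of the API.

```python
from urllib.request import urlopen

BASE_URL = "http://www.shomisearch.com/api/vocab"

# Grade levels listed in the comment; aliases are mapped to the short form.
ALIASES = {"kindergarten": "k", "collegiate": "college"}
VALID_GRADES = {"pre-k", "k", "college"} | {str(n) for n in range(1, 13)}
VALID_LISTS = {"grade", "grade+sources"}


def vocab_url(grade, list_name="grade"):
    """Build the URL for one vocabulary list, rejecting unknown inputs."""
    key = str(grade).strip().lower()
    key = ALIASES.get(key, key)
    if key not in VALID_GRADES:
        raise ValueError(f"unknown grade level: {grade!r}")
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name!r}")
    return f"{BASE_URL}/{list_name}/{key}/"


def fetch_vocab(grade, list_name="grade", timeout=10):
    """Fetch the raw text/plain body (needs network; UTF-8 is an assumption)."""
    with urlopen(vocab_url(grade, list_name), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Since the response layout isn't documented yet, `fetch_vocab` returns the body unparsed; `vocab_url(8)` and `vocab_url(9, "grade+sources")` reproduce the two example URLs above.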

On my to-do list for this is: documentation at /api, json result formatting, more lists & list options, and the ability to POST some text to the vocab api and get back the median & mode of the grade values for the text.
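
The planned "median & mode" feature could look roughly like the sketch below. This is not the service's implementation: `grade_profile` and the `SAMPLE_GRADES` mapping are hypothetical, and it assumes non-numeric levels such as "pre-k" or "college" have already been mapped to numbers.

```python
import re
from statistics import median, multimode

# Hypothetical word -> grade mapping; real values would come from the database.
SAMPLE_GRADES = {"the": 1, "cat": 1, "sat": 1, "quietly": 4, "ambiguous": 9}


def grade_profile(text, word_grades):
    """Return (median, modes) of the grade values of the known words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the mapping are skipped rather than guessed at.
    grades = [word_grades[w] for w in words if w in word_grades]
    if not grades:
        return None, []
    # multimode returns every most-common value, so ties are not hidden.
    return median(grades), multimode(grades)
```

Using `multimode` instead of `mode` is a deliberate choice here: short texts often have ties, and returning all of them avoids an arbitrary pick.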

I don't intend to do any of that right away though, since this all just started out as a planned feature for a larger project.

There are currently ~21,000 entries in the database. Lots of duplicates and disagreements on grade levels for words, as expected.

If there's anything else you think is important enough for me to get on right away, let me know.

Thanks. This helps. This was a great idea. I would like to know what your larger project is when you are ready.
OK. I'll have a basic version of the API available later today (Saturday). It will provide the word lists and sources. I need to sleep now.
>> Since SHRDLU's world is so limited, Winograd was able to explicitly program every facet of its language understanding. Unsurprisingly, this approach is totally not scalable and this reveals a little about why we don't have fully human-like language programs.

That's a good point. It does lead one to wonder, however, if techniques inspired by SHRDLU could (or do) have uses in domain-specific applications where the world is likewise restricted. Given the increases in raw horsepower available since SHRDLU was first developed, I find myself wondering if we couldn't do some pretty useful things today, using this approach.

Yes. For example, consider interlingual machine translation. Most systems today (like Google) use statistical MT that learns patterns from millions of examples. In interlingua, by contrast, you analyze the input sentence to form a language-independent representation of the sentence's meaning. Then you use that representation to generate a sentence in a new language.

As you might expect, this is basically impossible for wide-domain MT because we don't have unambiguous representations of the meaning of every sentence, and we don't necessarily know how to combine them, and there are a lot of non-compositional phrases, and on and on.

However, if we restrict ourselves to one small domain, interlingua can work. For example, the KANT system [1] is an interlingua that is built for translating technical manuals for Caterpillar products (bulldozers and so on). The input has to be written in a restricted subset of English (Caterpillar Technical English), but then you can analyze it exactly with hand-written rules, and produce exact output in the target language.

[1] http://www2.lti.cs.cmu.edu/Research/Kant/
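
To make the analyze-then-generate idea concrete, here is a toy sketch of an interlingua for a tiny controlled language. It is not how KANT works internally; the sentence pattern, the concept IDs, the lexicons, and all function names are invented for illustration.

```python
from dataclasses import dataclass

# Toy lexicons mapping surface words to language-independent concept IDs.
EN_ACTIONS = {"check": "INSPECT", "replace": "REPLACE", "clean": "CLEAN"}
EN_OBJECTS = {"oil": "OIL", "filter": "FILTER", "valve": "VALVE"}

ES_ACTIONS = {"INSPECT": "Revise", "REPLACE": "Reemplace", "CLEAN": "Limpie"}
ES_OBJECTS = {"OIL": "el aceite", "FILTER": "el filtro", "VALVE": "la válvula"}


@dataclass(frozen=True)
class Instruction:
    """Language-independent meaning of one imperative sentence."""
    action: str
    obj: str


def analyze(sentence):
    """Parse '<verb> the <noun>.' into an Instruction, or raise ValueError."""
    words = sentence.strip().rstrip(".").lower().split()
    if len(words) != 3 or words[1] != "the":
        raise ValueError(f"outside the controlled language: {sentence!r}")
    verb, _, noun = words
    if verb not in EN_ACTIONS or noun not in EN_OBJECTS:
        raise ValueError(f"unknown word in: {sentence!r}")
    return Instruction(EN_ACTIONS[verb], EN_OBJECTS[noun])


def generate_es(meaning):
    """Render an Instruction as a Spanish imperative sentence."""
    return f"{ES_ACTIONS[meaning.action]} {ES_OBJECTS[meaning.obj]}."


def translate(sentence):
    """English -> interlingua -> Spanish, with no direct word mapping."""
    return generate_es(analyze(sentence))
```

The point of the design is that `analyze` and `generate_es` never see each other's language: adding a new target language means writing one more generator, not one more translation pair. It also shows the brittleness: anything outside the pattern is rejected outright.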

Firstly, we have done similar things. For example, we have/had http://en.wikipedia.org/wiki/METEO_System for weather reports. (Search the scientific literature for "machine translation weather reports"; among other things, that turns up work being done on a Croatian version of this.) I think there have been successes in the medical field, too, but I cannot find them.

However, this 'knowledge engineering' approach to AI has fallen somewhat out of fashion in favour of statistical methods. That said, I don't think anybody does statistics 'from scratch'. For example, in NLP, you could try to statistically learn the definite articles in English, but hard-coding that 'the' is the only one will get you results faster.
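
A minimal sketch of that kind of hybrid, assuming a simple most-frequent-tag model as the "statistical" part: closed-class words are hard-coded, and everything else falls back to counts learned from tagged examples. The tag names, the `HARD_CODED` table, and the function names are illustrative, not taken from any particular system.

```python
from collections import Counter, defaultdict

# Hand-coded closed-class knowledge: no need to learn these from data.
HARD_CODED = {"the": "DET", "a": "DET", "an": "DET"}


def train(tagged_sentences):
    """Count how often each word carries each tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag(words, counts, default="NOUN"):
    """Tag each word: hard-coded rule first, then most frequent learned tag."""
    result = []
    for word in words:
        key = word.lower()
        if key in HARD_CODED:
            result.append((word, HARD_CODED[key]))
        elif key in counts:
            result.append((word, counts[key].most_common(1)[0][0]))
        else:
            # Unseen word: fall back to a default guess.
            result.append((word, default))
    return result
```

The hard-coded table wins even with zero training data, which is the "results faster" part; the counts only have to cover the open-class vocabulary.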