So, I understand this blog post is about something else completely (the
internet argument started by Yoav Goldberg on Medium, reportedly) but for me
the really interesting part is the historical information in it. I wish
Fernando Pereira could find the time to expound a bit on all those
parenthetical notes in his blog post, perhaps even write a short book on the
history of AI.
AI is kind of a strange beast like that: it's gone through a few very
different phases and it's difficult for one person to understand all of them
equally well. Which of course makes it even harder to avoid reinventing wheels
and repeating mistakes. A bit of history would do us all a world of good.
Btw, I'm getting the feeling most people here will probably hear of Fernando
Pereira for the first time but he has a very long career in AI and NLP. He was
a prominent symbolicist, with some important contributions to logic
programming (he was one of the co-founders of Quintus, the company that sold
the first commercial Prolog, along with Warren, Byrd and others). Then he
turned to statistical AI and now he's a VP at Google (a.k.a. the den of the
connectionists, if I may be so bold). He's probably one of the few computer
scientists around who understands both symbolic and statistical AI in equal
measures. If anyone is qualified to talk about their relative merits, that's
him.
(and if I sound like a bit of a fangirl- that is because I basically am.
Pereira is one of my logic programming heroes and a great teacher to me,
albeit unbeknownst to him :)
> Idea! Let's go back to toy problems where we can create the test conditions easily, like the rationalists did back then (even if we don't realize we are imitating them). After all, Atari is not real life, but it still demonstrates remarkable RL progress. Let's make the Ataris of natural language!
> But now the rationalists converted to empiricism (with the extra enthusiasm of the convert) complain bitterly. Not fair, Atari is not real life!
> Of course it is not. But neither is PTB, nor any of the standard empiricist tasks, which try strenuously to imitate wild language
My reading is that Pereira doesn't think that deep learning has quite conquered language, and in this he's in complete disagreement with both Goldberg's and LeCun's sides (who both champion deep learning for NLP and claim that it has led to great advances in the field).
For me the problem with NLP and deep learning, or indeed any empirical method, is that the evaluation metrics we have are imperfect. Take BLEU scores, from Goldberg's post, for instance. Those basically compare generated text to some arbitrary target. Originally, they were proposed as metrics of machine translation quality, so the target was some existing translation and the machine-generated translation was scored on how many of its word sequences (n-grams) also appear in this human-made translation. But of course, there is no principled way that we know of to choose one translation over another- or even say whether a translation is a good or bad translation, on its own. And that's true for translations by humans also. You give the same text to 10 professional translators, they'll give you 10 different translations. Then you give each of their translations to 10 readers and ask them for their opinion, and you get back 100 different opinions.
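To make the "arbitrary target" point concrete, here is a minimal sketch of the clipped n-gram precision that sits at the core of BLEU. It is deliberately not full BLEU (no brevity penalty, no averaging over n-gram orders, a single reference), and the example sentences and the function names `ngrams` and `modified_precision` are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against ONE reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference, then divided by the candidate n-gram count.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Hypothetical data: one reference and two candidate translations.
reference = "the swallow flew over the house".split()
candidate_a = "the swallow flew over the house".split()
candidate_b = "a swallow was flying above the house".split()

score_a = modified_precision(candidate_a, reference, 1)
score_b = modified_precision(candidate_b, reference, 1)
print(score_a, score_b)
```

Both candidates are arguably fine renderings of the same sentence, yet the second one scores far lower simply because it does not share wording with the reference that happened to be chosen.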
The translation task itself is not even particularly well defined, exactly because there may be any number of valid translations (possibly, infinitely many) of a piece of text in another language. So, with translation, we have an ill-defined task with an arbitrary metric. And that metric of course is lifted from its original task and used to evaluate language generation and so on. Then someone comes along who knows how to train a deep net but has no idea what the purpose of their chosen metric is, or what it does and has no understanding of the task itself- and claims to have solved it because they got good results on that metric.
It's a bit of a methodological mess that's not going to lead to much progress. People can keep piling on these "results" for as long as they like and pretend that they're "solving" this or that problem- but in real-world terms, nothing is really being solved at all.
OK, but this still has the same problem as BLEU- it relies on comparisons to human scores, which are entirely subjective. I'm not saying they're not the best we've got, but it's a big problem for machine translation that the only way to evaluate results is, essentially, comparing them to eyeballing.
Google translate is now based on a neural network and you can be sure they have solid metrics. By analogy Google search has a large panel of humans whose subjective feedback is used to test the quality of search algorithm variations.
This is something that needs to be repeated until everyone internalises it: for language pairs other than the "easy" ones Google translate sucks.
I am Greek and translations from and to my language are utterly ridiculous, on the level of Bozo the clown doing the translation with his underpants on his head back to front.
Typical example: I put in the Greek word for "swallow", the bird, and ask for the French translation. I get back the word "avaler" - the French word for "to swallow", the verb.
That's my little benchmark there, useful because Google translate has been doing this consistently, for a good few years, before it used neural networks, before it started claiming its setup essentially constitutes an "interlingua" etc etc.
Note that the bird and the verb sound nothing like each other in Greek, or French. They sound the same only in English, so GT goes from Greek to French through English. Because it doesn't have enough parallel texts to go directly to French. And so it sucks, because it doesn't have enough data. You can ask native users of other languages that are not English or that have relatively few speakers, perhaps Turkish or Hungarian etc. I'm pretty sure you'll find out they have similar experiences.
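The failure mode described above can be sketched with toy lexicons. This is only an illustration of the pivoting hypothesis from the comment, not a description of how Google Translate actually works; the dictionaries and function names are invented for the example.

```python
# Toy lexicons, invented for illustration only.
GREEK_TO_ENGLISH = {
    "χελιδόνι": "swallow",   # the bird
    "καταπίνω": "swallow",   # the verb
}
# Keyed on the English string alone, so the sense distinction is lost.
ENGLISH_TO_FRENCH = {"swallow": "avaler"}
# A direct lexicon keeps the two senses apart.
GREEK_TO_FRENCH = {
    "χελιδόνι": "hirondelle",
    "καταπίνω": "avaler",
}


def translate_via_pivot(word):
    """Greek -> English -> French: both Greek words collapse to 'swallow'."""
    return ENGLISH_TO_FRENCH[GREEK_TO_ENGLISH[word]]


def translate_direct(word):
    """Greek -> French with no intermediate language."""
    return GREEK_TO_FRENCH[word]


print(translate_via_pivot("χελιδόνι"))
print(translate_direct("χελιδόνι"))
```

Once the bird and the verb have both been mapped onto the same English string, nothing downstream can recover which one was meant.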
So I don't know what metric they use to evaluate their results, but it doesn't seem to be a particularly good metric of translation quality. Maybe they just care more about how many people use their system and try to optimise for that, rather than going for quality, which is much harder to measure.
I'm Polish. I google translate even from Slavic languages that are very close to Polish (Ukrainian, Slovak - it's like 50% understandable without translation) to English not to Polish, because X -> Polish google translation sucks.
Leading somewhat off-topic, but this has also sparked a rather frank debate on r/machinelearning about some of the things discussed in the review, in particular arxiv flag painting:
I kind of disagree with some of the premises in the article. I've seen an HPSG for German in the late 90s that was able to parse almost any sentence I could throw at it correctly from a syntactic perspective.
The main problem for natural language understanding is not parsing and not even the semantic and pragmatic representations per se, it has always been the understanding. This requires an adequate knowledge representation and the drawing of inferences from it, and I don't believe that any substantial advances have been made in that field. Computational ontologies have grown larger and there are more "frameworks" than you can count, but none of them offers much that is new, and promising approaches like geometric meaning theories are in their infancy. Knowledge representation and, generally speaking, the problem of how to integrate different information sources in useful ways are essentially unsolved problems.
Just my 2 cents. Note that I'm talking about the principal problems, not about specific practical applications for which you can use the statistical sledgehammer to some extent.
Coecke recently commented on Gärdenfors's geometric meaning in the context of his categorical semantics, in arXiv:1608.01402, which I'm finding interesting. What I would welcome is a computational link relating that semantics to the older semantic-network-based ideas. For instance, in arXiv:1706.00526 description-logic-based knowledge representation is cast in string-diagrammatic, categorical terms, and that at least puts the meaning realm on the same mathematical footing.
Apologies, I'm an outsider to the field, but what exactly are you referring to here? The whole vector-space semantic embedding that was popularized by works like word2vec?
References for you and the other poster who asked:
Peter Gärdenfors: Conceptual Spaces - the Geometry of Thought. MIT Press 2000 (Paperback 2004).
It is very easy reading. The problems of geometric meaning theory are compositionality and quantification - how to get the expressivity of logical representations in addition to nearness measures, fuzziness and so on. There are some interesting approaches:
Martha Lewis & Jonathan Lawry: Hierarchical conceptual spaces for concept combination. Artificial Intelligence 237 (2016): 204-227.
Diederik Aerts, Liane Gabora, Sandro Sozzo: Concepts and their dynamics: a quantum-theoretic modeling of human thought. Topics in Cognitive Science 5 (4) (2013):737-772. [and other work by Aerts]
Aerts's work fascinates me personally, but it's unfortunately above my level of mathematical maturity. This is a general problem in this literature: maybe some solutions are already there, but they also need to be sold in a way that allows linguists to understand and use the methods. Montague was lucky (well, not personally, of course), because he had scholars who were able to package his dense ideas in more verbose and easier to access textbooks.
Another short book worth reading in my opinion, though very programmatic in nature:
Jens Erik Fenstad: Grammar, Geometry, & Brain. CSLI Publications 2009.
Oh, you used the religiously verified word of "strawman"!
What you said made little to no sense and had no backing. Yours was a perfect example of layman speculation without any basis. Nothing you said made any coherent sense, nor had any backing. They don't even deserve a response.
That we artificially decompose the process of accepting a sentence as, eg, proper English into two phases: syntactic correctness and semantic correctness.
However, that distinction is arbitrary -- there is only the question of if the sentence is accepted by an agent (eg, person) as a well-formed sentence.
Any full accounting of the class of well-formed sentences must embed the semantic concerns; violating semantics is a syntax error (albeit, not usually a "first order" one). Similarly, even base syntax, such as the subject/verb/object distinction and ordering is carrying semantic information about word usage. The distinction between the two is non-existent: a full accounting of either must embed the other.
So semantics is syntax -- if you write a system of rules that only accepts valid sentences, then the rules will end up carrying the semantic structure of the language in them.
Ed:
I suppose I left out why this might matter --
In the quest to build an AI that understands semantics (ie, that "understands meaning"), we can bypass attacking that problem directly by training it (eg, a NN) on the full acceptance task (joined syntax and semantics -- classifying a sentence as proper English or not), and then truncate the network away from "low level" features (and perhaps at the other side, focused on 'yes' or 'no') to extract a network that has (most of) the (abstract) semantic structure embedded. We could then utilize these "middle level" features as a sort of rosetta stone, to train low level networks to embed content for them to understand, and high level networks to utilize their output on decisions to repurpose "understanding" across tasks.
I would argue that using things like Word2Vec (or the resulting vector space of words) is a similar idea.
You contend that the top sentence is valid English; I disagree with that. It subscribes to the "first order" rules (or an overly simplistic model), but isn't a sentence that an English speaker would use. Not being one that an English speaker would use makes it invalid English -- it's just a case where the first order approximation is wrong.
Similarly, if your syntax rules reject the second sentence, they're wrong -- since it's a sentence that English speakers can parse: the conclusion can only be that your syntax rules don't actually match the language you're trying to model.
I get the distinction that you're trying to point out with syntax/semantics, but you're ignoring my point: that divide is artificial and 'semantics' as you mean it is merely higher order syntax.
You haven't shown there's an inherent meaning to the difference (ie, that you haven't just drawn an arbitrary line in the sand), just that you can find examples that (naively) fall on different sides of it.
That's a very strange thing to say. The thing with human language is you can say anything you like, including things that make no sense at all and things that are syntactically incorrect. You can easily find examples of meaningless, syntactically correct sentences, like Jabberwocky ("All mimsy were the Borogoves and the mome raths outgrabe" etc). It's also easy to find examples of sensible sentences with incorrect structure (see twitter.com).
In fact, what is "incorrect syntax" keeps changing all the time, but we can still say the same things as we always could (plus probably infinitely many new things besides). If syntax was tied to meaning as tightly as you say, we'd probably have only one or two languages and no dialects. Language would be a static, unchanging thing and we'd need no NLP, or translators etc.
I have to wonder if English is really the best language for NLP research. Things like the Winograd schemas which have attracted a lot of attention simply aren't possibilities in other languages.
Why not start working with more structured agglutinative* languages like Japanese/Korean and the Indic family (Sanskrit esp.)?
How about other European languages? Are they better structured empirically? I hear German is very grammatical, and that Hungarian is ... erm odd?
(* Note: I know occidental tradition likes to split Indic tongues, and Indo in Indo-European is not considered agglutinative. I don't subscribe to this view. I use agglutinative in the sense of Panini: "particles" sticking to stems/roots/words - phonetic modifications are irrelevant for grammar.)
> I hear German is very grammatical, and that Hungarian is ... erm odd ?
Just want to point out that "grammatical" probably isn't the word you want here. Every language is grammatical by definition in the sense that there are rules that govern its sound system, word formation system, syntax, etc.
The concept you're getting at, though--that some languages are easier for computer programs and/or speakers of Indo-European languages to understand--is sound.
"Regular" would be the classic linguistics term, would it not? Although computer science limits the term to the use of regular languages in the Chomsky hierarchy sense (that is, more specifically to regular expressions and the languages they describe), I am under the impression linguistics as a whole treats regularity as a multivariate spectrum. Some languages have more regularity in terms of grammar productions or morphology than English.
That points to Isolating [1] and I think highly isolating may be the more useful distinction to this specific example. (Modern English is rather analytic, having dropped most, but not all, inflections in the Middle English era. Mandarin Chinese is much more isolating than Modern English.)
One reason is that the amount of training data is many many orders of magnitude smaller.
FWIW it seems the structure you're talking about exploiting is at a morphological and syntactic level, which modern language models tend to effectively handle. Semantics are a much harder problem.
> Things like the Winograd schemas which have attracted a lot of attention simply aren't possibilities in other languages.
I do not think that is correct. Anaphora exists in many languages. Check out the Anaphora article on wikipedia and click on different language versions. There are example sentences for many languages.
There are translations of the Winograd Schemas into a couple of languages. Granted, I found some of the translations a little unnatural, but they are still understandable and expose the problem.
The whole field of NLP and computational linguistics reminds me of that joke where a drunk is looking for his keys under a street lamp instead of where he actually lost them.
This is true in particular of anything that pertains to reasoning and knowledge representation. People still are trying to "infer rules" and do logical, rather than probabilistic reasoning. I get why that is. To me though, the kind of real life reasoning that humans do seems heavily probabilistic and contextual, Bayesian almost. And there's next to no notable work going on in that direction.
>> People still are trying to "infer rules" and do logical, rather than
>> probabilistic reasoning. I get why that is.
That is because it's very hard to collect statistics on something that you
can't really quantify- meaning, in this case.
There was a thread on HN a couple of days ago about a blog post where someone
was experimenting with, among other things, training an LSTM network to generate Java programs [1].
In one example, the LSTM did really well in reproducing the structure of a
Java program, with import declarations, followed by a class implementing an
interface with a few methods with structured comments and throws declarations
and everything- and even a test!
On the other hand, this program was completely useless. From a cursory glance
it would probably not even compile (e.g. it referred to undeclared variables
etc). There was one method named "numericalMean()" that took a single double
and returned an (undeclared) variable "sum". The class had a nonsensical name
- "SinoutionIntegrator". The test was testing something called "Cosise",
presumably a method- but not one defined in the class. In short- a mess.
That might sound a bit harsh, but I think it's a very good example of why
statistical NLP is really bad at doing meaning: because there is nothing, not
a shred, of meaning in examples of the data we use to train statistical models
of language, i.e. text.
Because, you see, the relation between meaning and text (and even spoken
language) is completely arbitrary. Or, to put it in another way, there are
potentially an infinite number of valid mappings between structure and
meaning, of which we, human beings, somehow by convention or some other crazy
mechanism, have agreed to use just one. And even though the various forms
language entities take (inflections etc) are used exactly to convey meaning,
right, the rules of how meaning varies with structure are, again, completely
independent from structure itself.
Now, we have done very well in modelling structure, from examples of it (which
is what text is). But it's completely unreasonable to expect our algorithms to
be able to extract meaning from it also.
And that is why people are still trying to put down the rules of meaning by
hand. Because that's the only way we can think of, currently, to process
meaning automatically.
I don't think these two things are mutually exclusive.
As far as I'm aware there is work underway to take logical constructions and integrate them with probabilistic machine learning to do things like force zero probabilities in impossible input cases. That is encoding domain knowledge into the model directly in the form of symbolic reasoning.
I mean even Bayesian nets require some encoding of causality, right? Maybe I'm reading too much of "blah symbolic reasoning is worthless" into your comment?
> We propose the Probabilistic Sentential Decision Diagram (PSDD): A complete and canonical representation of probability distributions defined over the models of a given propositional theory. Each parameter of a PSDD can be viewed as the (conditional) probability of making a decision in a corresponding Sentential Decision Diagram (SDD). The SDD itself is a recently proposed complete and canonical representation of propositional theories. We explore a number of interesting properties of PSDDs, including the independencies that underlie them. We show that the PSDD is a tractable representation. We further show how the parameters of a PSDD can be efficiently estimated, in closed form, from complete data. We empirically evaluate the quality of PSDDs learned from data, when we have knowledge, a priori, of the domain logical constraints.
Still working on my understanding but Professor Darwiche gave a lecture on the material in one of my classes. Salient bit:
> The problem we tackle here is that of developing a representation of probability distributions in the presence of massive, logical constraints. That is, given a propositional logic theory which represents domain constraints, our goal is to develop a representation that induces a unique probability distribution over the models of the given theory.
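As a rough illustration of "forcing zero probabilities in impossible cases", the sketch below enumerates truth assignments by brute force, keeps only those that satisfy a propositional constraint, and normalizes weights over what is left. This is emphatically not a PSDD (which is a compact, tractable representation); it only shows the target object, a distribution defined over the models of a theory. The variables, the constraint and the weights are all made up for the example.

```python
from itertools import product


def constrained_distribution(variables, weight, constraint):
    """Distribution over the truth assignments that satisfy `constraint`.

    Assignments violating the constraint are dropped, which is the same
    as giving them probability zero. Brute force: exponential in the
    number of variables, so suitable for toy examples only.
    """
    models = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if constraint(assignment):
            models.append(assignment)
    total = sum(weight(m) for m in models)
    if total == 0:
        raise ValueError("no satisfying assignment has positive weight")
    return [(m, weight(m) / total) for m in models]


# Hypothetical domain knowledge: rain implies wet (rain -> wet).
variables = ["rain", "wet"]


def constraint(a):
    return (not a["rain"]) or a["wet"]


# Hypothetical unnormalized weights, e.g. from some statistical model.
def weight(a):
    return (0.3 if a["rain"] else 0.7) * (0.6 if a["wet"] else 0.4)


dist = constrained_distribution(variables, weight, constraint)
for assignment, prob in dist:
    print(assignment, round(prob, 4))
```

The impossible case (rain but not wet) never appears in the output, and the remaining three assignments share the full probability mass.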
When he talks about the "computational models of language" that ruled in the 80s, is he referring perhaps to stuff like Montague semantics? https://plato.stanford.edu/entries/montague-semantics/ Or is Montague semantics merely a descriptive framework without practical applications?
What were the main "practical" approaches for natural language understanding back then?