Hacker News
by miket 2184 days ago
Hi, founder of Diffbot here. We are an AI research company spun out of Stanford that generates the world's largest knowledge graph by crawling the whole web. I didn't want to comment, but I see a lot of misunderstandings here about knowledge graphs, abstract representations of language, and the extent to which this project uses ML.

First of all, having a machine-readable database of knowledge (i.e. Wikidata) is no doubt a great thing. It's maintained by a large community of human curators and is always growing. However, generating actually useful natural language from an abstract representation, language that rivals the value you get from reading a Wikipedia page, is problematic.

If you look at the walkthrough for how this would work (https://github.com/google/abstracttext/blob/master/eneyj/doc...), this project does not use machine learning; it uses CFG-like production rules to generate natural-language sentences. That works great for generating toy sentences like "X is a Y".
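
To make "CFG-like production rules" concrete, here is a minimal sketch of rule-based generation for "X is a Y" sentences. The grammar, the symbol names, and the `expand`/`sentences` helpers are all made up for illustration; none of it is taken from the abstracttext repository.

```python
from itertools import product

# Hypothetical toy grammar (an assumption for illustration only).
# Keys are non-terminals; each maps to a list of alternative productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["is", "Det", "N"]],
    "NP": [["Berlin"], ["Helsinki"]],
    "Det": [["a"]],
    "N": [["city"], ["capital"]],
}


def expand(symbol):
    """Return every word sequence derivable from `symbol`."""
    if symbol not in GRAMMAR:  # no rule: treat the symbol as a terminal word
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each part of the production, then combine the alternatives in order.
        options = [expand(part) for part in production]
        for combo in product(*options):
            results.append([word for words in combo for word in words])
    return results


def sentences():
    """Render every derivation of the start symbol S as a sentence."""
    return [" ".join(words) + "." for words in expand("S")]


for line in sentences():
    print(line)
```

Five rules yield four sentences here. The open question in this thread is how far that scales once agreement, case, and word order enter the picture.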

However, human languages are not programming languages. Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (Those who have taken a grammar class can relate to the number of exceptions to the ruleset.)

Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format. Plenty of text is opinion, is subjective, or describes notions that don't have a proper entity. Of course there are ways to engineer around this, but they will exponentially grow the complexity of your ontology and the number of properties, and make for a terrible user experience for the annotators.

A much better and more direct approach to the stated intention of making the knowledge accessible to more readers is to advance the state of machine translation, which would capture the nuance and non-facts present in the original article. Additionally, exploring ML-based ways of NL generation from the dataset this project will produce would have academic impact.

4 comments

> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (...)

> Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format.

It doesn't seem like the goal of this project is to describe those languages, or to represent every sentence in a typical Wikipedia article? The goal doesn't seem to be to have all Wikipedia articles generated from Wikidata, but rather to have a couple of templates along the lines of "if I have this data available about this type of Subject, generate this stub article about it". That would allow the smaller Wikipedia language editions to automatically generate many baseline articles that they might not currently have.

For example, the Dutch Wikipedia is one of the largest editions mainly because a large percentage of its articles were created by bots [1] that generated a lot of articles on small towns ("x is a town in the municipality of y, founded in z. It is nearby m, n and o.") and obscure species of plants. This just seems like a more structured plan to apply that approach to many of the smaller Wikipedias that may be missing a lot of basic articles and are thus not exposing many basic facts.

[1] https://en.wikipedia.org/wiki/Dutch_Wikipedia#Internet_bots
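
The "if I have this data about this type of Subject, generate this stub" idea can be sketched as a template filled from a structured record. This is only an illustration: the record fields, the placeholder names, and the `town_stub`/`join_names` helpers are assumptions, not the actual bot code or the Wikidata schema.

```python
def join_names(names):
    """Join names as 'a', 'a and b', or 'a, b and c'."""
    names = list(names)
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def town_stub(record):
    """Build a stub article from whatever fields the record provides."""
    text = f"{record['name']} is a town in the municipality of {record['municipality']}"
    if "founded" in record:  # optional field: only mentioned when present
        text += f", founded in {record['founded']}"
    text += "."
    if record.get("nearby"):  # optional list of neighbouring places
        text += f" It is near {join_names(record['nearby'])}."
    return text


# Hypothetical record with placeholder values, not real data.
record = {
    "name": "Exampledorp",
    "municipality": "Voorbeeldgemeente",
    "founded": 1650,
    "nearby": ["Alpha", "Beta", "Gamma"],
}
print(town_stub(record))
```

Optional fields are simply skipped when missing, so sparse records still produce a valid, shorter stub.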

This is addressed in the white paper describing the project's architecture:

10.2 Machine translation

Another widely used approach (mostly for readers, much less for contributors) is the use of automatic translation services like Google Translate. A reader finds an article they are interested in and then asks the service to translate it into a language they understand. Google Translate currently supports about a hundred languages, about a third of the languages Wikipedia supports. Also the quality of these translations can vary widely, and almost never achieves the quality a reader expects from an encyclopedia [33, 86].

Unfortunately, the quality of the translations often correlates with the availability of content in the given language [1], which leads to a Matthew effect: languages that already have larger amounts of content also feature better results in translation. This is an inherent problem with the way Machine Translation is currently trained, using large corpora. Whereas further breakthroughs in Machine Translation are expected [43], these are hard to plan for.

In short, relying on Machine Translation may delay the achievement of the Wikipedia mission by a rather unpredictable time frame.

One advantage Abstract Wikipedia would lead to is that Machine Translation system can use the natural language generation system available in Wikilambda to generate high-quality and high-fidelity parallel corpora for even more languages, which can be used to train Machine Translation systems which then can resolve the brittleness a symbolic system will undoubtedly encounter. So Abstract Wikipedia will increase the speed Machine Translation will become better and cover more languages in.

https://arxiv.org/abs/2004.04733

(There's more discussion of machine learning in the paper, but I'm quoting the section on machine translation in particular.)
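
The parallel-corpus point in the quoted section can be illustrated with a small sketch: render the same abstract fact through per-language rules and pair up the outputs. The lexicon, the fact format, and the `render`/`parallel_corpus` functions are hypothetical and are not Wikilambda's actual design.

```python
# Hypothetical per-language lexicon: (indefinite article, noun) for each class.
LEXICON = {
    "city": {"en": ("a", "city"), "de": ("eine", "Stadt")},
    "capital": {"en": ("a", "capital"), "de": ("eine", "Hauptstadt")},
}
COPULA = {"en": "is", "de": "ist"}


def render(fact, lang):
    """Render one abstract 'instance of' fact as a sentence in `lang`."""
    article, noun = LEXICON[fact["class"]][lang]
    return f"{fact['subject']} {COPULA[lang]} {article} {noun}."


def parallel_corpus(facts, src="en", tgt="de"):
    """Pair the source and target renderings of the same facts."""
    return [(render(fact, src), render(fact, tgt)) for fact in facts]


facts = [
    {"subject": "Berlin", "class": "city"},
    {"subject": "Helsinki", "class": "capital"},
]
for source, target in parallel_corpus(facts):
    print(source, "|", target)
```

Because both sides come from the same abstract content, the pairs are aligned by construction, which is what makes them usable as MT training data.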

Additionally, of course, Google Translate is a proprietary service from Google, and Wikimedia projects can't integrate it in any way without abandoning their principles. It's left to readers to enter pages into Google Translate themselves, and that will only work as long as Google keeps providing the service.

What is the quality of open source translation these days?

State of the art is always open source in MT.

>> Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them.

Is that really true? If natural languages have rules, then there exists a ruleset that can describe any natural language: the set of all rules in that language. Of course, a "rule" is a compact representation of a set of strings, so if natural languages don't have such rules it's difficult to see how any automated system can represent a natural language "compactly". A system without any kind of "rules" would have to store every grammatical string in a language. That must be impossible in theory and in practice.
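
As a rough illustration of a rule being a compact representation of a set of strings, compare the size of a tiny rule ("subject, then verb, then object") with the number of strings it describes. The word lists and the `language` function are invented for the example.

```python
from itertools import product

# Hypothetical vocabulary; the single "rule" is: SUBJECT VERB OBJECT.
SUBJECTS = ["the editor", "the bot", "the reader"]
VERBS = ["writes", "translates", "reads"]
OBJECTS = ["an article", "a stub", "a sentence"]


def language():
    """Enumerate every string the rule describes."""
    return [" ".join(parts) for parts in product(SUBJECTS, VERBS, OBJECTS)]


rule_size = len(SUBJECTS) + len(VERBS) + len(OBJECTS)  # entries you must store
string_count = len(language())                         # strings you get out

print(rule_size, "stored entries describe", string_count, "strings")
```

The stored entries grow additively while the described strings grow multiplicatively, and a recursive rule would make the set unbounded, so a rule-free system could not enumerate it at all.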

If I may offer a personal perspective, I think that the goal of the plan is to produce better automated translations than is currently possible with machine translation between language pairs for which there are very few parallel texts. I'm Greek, and I am sad to report that translation from basically any language to Greek by e.g. Google Translate (which I use occasionally) is laughably, cringe-inducingly bad. From what I understand, the reason is not only the morphology of the Greek language, which is kind of a linguistic isolate (as opposed to, say, the Romance languages), but also that, because there are not many parallel texts between Greek and most languages on Google Translate, the translation goes through English, which results in completely distorted syntax and meaning. Any project that can improve on this sorry state of affairs (and not just for Greek: there are languages with far fewer speakers and no parallel texts at all, not even with English) is worth every second of its time.

To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.

> To put it plainly, if you don't have enough data to train a machine learning model, what, exactly, are your options? There is only one option: to do the work by hand. Wikipedia, with its army of volunteers, has a much better shot at getting results this way than any previous effort.

The training data for machine translation models is also human-created. Given some fixed amount of human hours, would you rather they be spent annotating text that can train a translation system that can be used for many things, or building a system that can only be used for this project? It all depends on the yield that you get per man-hour.

As the paper I quote below says, the system that would result from this project could be re-used in many other tasks, one of which is generating data for machine translation algorithms.

I think this makes sense. The project aims to create a program, basically ("a set of functions"). There are, intuitively, more uses for a program than for a set of labelled data.

> Of course there are ways to engineer around this, but they will exponentially grow the complexity of your ontology and the number of properties, and make for a terrible user experience for the annotators.

So, the obvious solution is to create robo-annotators, and that's what your company is supposedly trying to do?