|
Hi, founder of Diffbot here, we are an AI research company spinout from Stanford that generate the world's largest knowledge graph from crawling the whole web. I didn't want to comment, but I see a lot of misunderstandings here about knowledge graphs, abstract representations of language, and the extent as to which this project uses ML. First of all, having a machine-readable database of knowledge(i.e. Wikidata) is no doubt a great thing. It's maintained by a large community of human curators and always growing. However, generating actually useful natural language that rivals the value you get from reading a Wikipedia page from an abstract representation is problematic. If you look at the walkthrough for how this would work (https://github.com/google/abstracttext/blob/master/eneyj/doc...), this project does not use machine and uses CFG-like production rules to generate natural sentences. Works great for generating toy sentences like "X is a Y". However, human languages are not programming languages. Many natural languages, like German and Finnish, are so syntactically and morphologically complex that there is no compact ruleset that can describe them. (those that have taken grammar class can relate to the number of exceptions to the ruleset) Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format. Plenty of text is opinion, subjective, or describes notions that don't have an proper entity. Of course there are ways that engineer around this, however they will exponential grow the complexity of your ontology, number of properties, and make for a terrible user experience for the annotators. A much better and direct approach to the stated intention of making the knowledge accessible to more readers is to advance the state of machine translation, which would capture nuance and non-facts present in the original article. Additionally, exploring ML-based ways of NL generation from the dataset this will produce will have academic impact. |
> Additionally, not every sentence in a typical Wikipedia article can be easily represented in a machine-readable factual format.
It doesn't seem like the goal of this project is to describe those languages, or to represent ever sentence in a typical Wikipedia article? The goal doesn't seem to be to have all Wikipedia articles generated from Wikidata, but rather to have a couple of templates to the order of "if I have this data available about this type of Subject, generate this stub article about it". That would allow the smaller Wikipedia language editions to automatically generate many baseline articles that they might not currently have.
For example, the Dutch Wikipedia is one of the largest editions mainly because a large percentage of its articles were created by bots [1] that created a lot of articles on small towns ("x is a town in the municipality of y, founded in z. It is nearby m, n and o.") and obscure species of plants. This just seems like a more structured plan to apply that approach to many of the smaller Wikipedia's that may be missing a lot of basic articles and are thus not exposing many basic facts.
[1] https://en.wikipedia.org/wiki/Dutch_Wikipedia#Internet_bots