by 9nGQluzmnq3M 2170 days ago
As a long-time Wikipedian, this track record is actually worrisome.

Semantic MediaWiki (which I attempted to use at one point) is difficult to work with and far too complicated and abstract for the average wiki editor. (See also Tim Berners-Lee and the failure of the Semantic Web.)

WikiData is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.

3 comments

> WikiData is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.

Note that the internal data format used by Wikidata is _not_ RDF triples [0]. It's also highly non-relational, since every statement can be annotated with a set of property-value pairs; the full data set is available as a JSON dump. The RDF export [1] (there are actually two; I'm referring to the full dump here) maps this to RDF by reifying statements as RDF nodes. If you wanted to end up with something queryable by SQL, you would also need to resort to reification, but then SPARQL is still the better choice of query language since it lets you do path queries easily, whereas WITH RECURSIVE at the very least makes your SQL queries quite clunky.

[0] https://www.mediawiki.org/wiki/Wikibase/DataModel
[1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Fo...
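
To make the reification point concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The `statement` and `qualifier` tables, the sample rows, and the helper names are assumptions invented for illustration; this is not Wikidata's actual storage schema. Each statement gets its own row id so that qualifiers can hang off it, and following a property transitively needs WITH RECURSIVE, where SPARQL would use a property path with the `+` operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per statement; the id is what makes the statement "reified".
    CREATE TABLE statement (
        id       INTEGER PRIMARY KEY,
        subject  TEXT NOT NULL,
        property TEXT NOT NULL,
        value    TEXT NOT NULL
    );
    -- Qualifiers are property-value pairs attached to one statement.
    CREATE TABLE qualifier (
        statement_id INTEGER NOT NULL REFERENCES statement(id),
        property     TEXT NOT NULL,
        value        TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO statement (id, subject, property, value) VALUES (?, ?, ?, ?)",
    [
        (1, "Aland", "population", "100"),
        (2, "Alice", "child", "Bob"),
        (3, "Bob", "child", "Carol"),
        (4, "Carol", "child", "Dave"),
    ],
)
# Hypothetical qualifier: the population figure applies to a given year.
conn.execute("INSERT INTO qualifier VALUES (1, 'point in time', '2019')")


def qualifiers_of(conn, statement_id):
    """Return the qualifiers of one statement as a dict."""
    rows = conn.execute(
        "SELECT property, value FROM qualifier WHERE statement_id = ?",
        (statement_id,),
    ).fetchall()
    return dict(rows)


def descendants(conn, person):
    """Follow 'child' statements transitively, starting from person."""
    rows = conn.execute(
        """
        WITH RECURSIVE descendant(name) AS (
            SELECT value FROM statement
            WHERE subject = ? AND property = 'child'
            UNION
            SELECT s.value FROM statement AS s
            JOIN descendant AS d ON s.subject = d.name
            WHERE s.property = 'child'
        )
        SELECT name FROM descendant ORDER BY name
        """,
        (person,),
    ).fetchall()
    return [name for (name,) in rows]


print(qualifiers_of(conn, 1))      # {'point in time': '2019'}
print(descendants(conn, "Alice"))  # ['Bob', 'Carol', 'Dave']
```

The recursive query works, but it takes a dozen lines of SQL to express a single transitive hop over one property, which is the clunkiness being described.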

The SPARQL API is no fun. The 60-second query limit, for example, is a killer. I had to resort to downloading the full dump.
How do you dump general purpose, encyclopedic data into a relational database? What database schema would you use? The whole point of "triples" as a data format is that they're extremely general and extensible.
Most structured data in Wikipedia articles is in either infoboxes or tables, which can easily be represented as tabular data.

  Table country:

  Name,Capital,Population
  Aland,Foo,100
  Bland,Bar,200
Now you need a graph to represent connections between pages, but as long as the format is consistent (as it is in templates/infoboxes), that can be done with foreign keys.

  Table capital
  ID,Name
  123,Foo
  456,Bar

  Table country
  Name,Capital_id,Population
  Aland,123,100
  Bland,456,200
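
As a rough sketch of how the two tables above could be loaded and queried, here is the same data in an in-memory SQLite database via Python's sqlite3 module. The column types and the `countries_with_capitals` helper are assumptions added for illustration; the comment only gives the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE capital (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE country (
        name       TEXT PRIMARY KEY,
        capital_id INTEGER NOT NULL REFERENCES capital(id),
        population INTEGER NOT NULL
    );
    INSERT INTO capital VALUES (123, 'Foo'), (456, 'Bar');
    INSERT INTO country VALUES ('Aland', 123, 100), ('Bland', 456, 200);
""")


def countries_with_capitals(conn):
    """Resolve each country's capital_id to the capital's name."""
    return conn.execute(
        """
        SELECT country.name, capital.name, country.population
        FROM country
        JOIN capital ON capital.id = country.capital_id
        ORDER BY country.name
        """
    ).fetchall()


print(countries_with_capitals(conn))
# [('Aland', 'Foo', 100), ('Bland', 'Bar', 200)]
```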
> Most structured data in Wikipedia articles is in either infoboxes or tables

Most of the data in Wikidata does not end up in infoboxes or tables in any Wikipedia, however, and graph-like data such as family trees works quite poorly in a relational database, even if you don't consider qualifiers at all.

Those infoboxes get edited all the time to add new data, change data formats, etc. With a relational db, every single such edit would be a schema change. And you would have to somehow keep old schemas around for the wiki history. A triple-based format is a lot more general than that.
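
A small sketch of the difference, again with Python's sqlite3 module: adding a new field to a fixed-column table is a schema change, while in a triple layout it is just another row. The table names, the `currency` field, and its value are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fixed-column layout: one column per infobox field.
conn.execute("CREATE TABLE country (name TEXT PRIMARY KEY, population INTEGER)")
conn.execute("INSERT INTO country VALUES ('Aland', 100)")
# Triple layout: one row per (subject, predicate, object) fact.
conn.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
conn.execute("INSERT INTO triple VALUES ('Aland', 'population', '100')")


def columns(conn, table):
    """List a table's column names (table must be a trusted identifier)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# An editor adds a hypothetical 'currency' field to the infobox.
# Fixed-column layout: the schema itself has to change first.
conn.execute("ALTER TABLE country ADD COLUMN currency TEXT")
conn.execute("UPDATE country SET currency = 'Florin' WHERE name = 'Aland'")
# Triple layout: the new field is one more row, with the schema untouched.
conn.execute("INSERT INTO triple VALUES ('Aland', 'currency', 'Florin')")

print(columns(conn, "country"))  # ['name', 'population', 'currency']
print(columns(conn, "triple"))   # ['subject', 'predicate', 'object']
```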
RDF shouldn't be lumped in with SPARQL
They’re part of the same technology stack. SPARQL is used to query RDF graphs, so the two are pretty tightly coupled.