by flanked-evergl 291 days ago
RDF is great but it's somewhat inadvertently captured by academia.

The tooling is not in a state where you can use it for any commercial or mission-critical application. It is mainly maintained by academics, and their concerns run almost exactly counter to normal engineering concerns.

An engineer would rather have tooling with limited functionality that is well designed and behaves correctly without bugs.

Academics would rather have tooling with lots of niche features, and they can tolerate poor design, incorrect behavior and bugs. They care more for features, even if they are incorrect, as they need to publish something "novel".

The end result is that almost everything you find for RDF is academic quality, and a lot of it is abandoned because it was just part of the publication spam being pumped and dumped by academics who need to publish or perish.

Anyone who wants to use it commercially really has to start almost from scratch.


Yes and no.

I worked for a company that went hard into "Semantic Web" tech for libraries (as in, the places with books), using an RDF quad store (OpenLink Virtuoso) for data storage and structuring all data as triples, which is a better fit for the hierarchical MARC21 format than a relational database.

There are a few libraries (the software kind) out there that follow the W3 spec correctly, Redland being one of them.

How well did that work? Based on your experience at that company would you build a new project on the stack that they chose?
It worked very well. As I mentioned, MARC21 (the interchange format for bibliographic data) is hierarchical, not relational, so there was already a better impedance match.

Then, with URLs being the primary identifiers, it was trivial to take a large dataset like VIAF (Virtual International Authority File, a canonical representation of all authors) and query the two together seamlessly.
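To make the "query the two together" point concrete, here is a minimal sketch in plain Python, not how Virtuoso or any real RDF library does it. Every URI, the sample data, and the helpers `match` and `titles_with_author_names` are made up for illustration. The only assumption is that both datasets identify an author with the same URL, so combining them is a set union and no ID-mapping table is needed.

```python
# Hypothetical identifiers: none of these URIs refer to real records.
CREATOR = "http://example.org/vocab/creator"
TITLE = "http://example.org/vocab/title"
NAME = "http://example.org/vocab/name"
BOOK = "http://example.org/catalog/book-1"
AUTHOR = "http://example.org/authority/author-1"

# Dataset 1: a local catalog, stored as (subject, predicate, object) triples.
catalog = {
    (BOOK, TITLE, "An Example Title"),
    (BOOK, CREATOR, AUTHOR),
}

# Dataset 2: an external authority file that uses the same author URL.
authority = {
    (AUTHOR, NAME, "Example Author"),
}


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def titles_with_author_names(graph):
    """Join book titles to author names through the shared author URL."""
    results = []
    for book, _, author in match(graph, p=CREATOR):
        for _, _, title in match(graph, s=book, p=TITLE):
            for _, _, name in match(graph, s=author, p=NAME):
                results.append((title, name))
    return sorted(results)


# Merging the two datasets is just a set union of their triples.
merged = catalog | authority
print(titles_with_author_names(merged))
```

Run against the catalog alone, the same query returns nothing, because the author's name only exists in the second dataset.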

Virtuoso was a pretty good quad store, and we got away with storing tens of billions of triples on a 4-node cluster, with very fast query times (although sticking to SPARQL 1.1 and not leaning on property paths).

As to whether I would choose it again ... I don't know. I'm now a decade out of the library space and haven't seen anything in my day-to-day work (backend distributed systems) that would benefit from the RDF data model.

I'm in a similar boat. In my case, it's software for public libraries, and having the data accessible as RDF is a must. We even have our own public fork of Marc4j.
Tooling sounds like it can be fixed? If the knowledge bases are useful, why not use them with better tools?
> even if they are incorrect

Uh. Do you have a source for this? Correctness is a major need in academia.

Correct != Bug-free.

My experience working with software developed by academics is that it is focused on getting the job done for a very small user base of people who are okay with getting their hands dirty. This means lots of workarounds, one-off scripts, zero regard for maintainability or future-proofing...

“Incomplete” seems like a better word than “incorrect” for this. The code is likely correct in the narrow scope it was needed for, but is missing features (and error checking!) beyond the happy path, making it easy to use incorrectly.
This I fully agree with.
> Correctness is a major need in academia.

How so? Consider the famous result that most published research findings are false.

How so? Finding correct stuff is the whole point of research, no matter the extent to which it actually succeeds at this. So yes, regardless of the actual results, it is a major need in academia. We have nothing better anyway (which doesn't mean it can't improve; we critically need it to improve).

Now, I'll assume you are referring to "Why Most Published Research Findings Are False". This paper is 20 years old, only addresses medical research despite its title, and seems to have had a mixed reception [1].

> Biostatisticians Jager and Leek criticized the model as being based on justifiable but arbitrary assumptions rather than empirical data, and did an investigation of their own which calculated that the false positive rate in biomedical studies was estimated to be around 14%, not over 50% as Ioannidis asserted.[12] Their paper was published in a 2014 special edition of the journal Biostatistics along with extended, supporting critiques from other statisticians

14% is a huge concern, and I think nobody will disagree with this. But if it is true, we are far from "most".

[1] https://en.wikipedia.org/wiki/Why_Most_Published_Research_Fi...

I think they mean things like: a tool that has feature X, even if it crashes 50% of the time when it is used, is preferable to a tool that doesn't have feature X at all.
Ok, makes sense, I hadn't read it like this. For me, "correct" means "provides correct results".