

by ktk 3023 days ago

I'm an engineer who worked with relational databases (RDBs) for a long time. One day a customer of a friend came with an issue that was, in my opinion, impossible to solve with relational DBs: he described data that is in flux all the time, and there was no way we could come up with a schema that would fit his problem for more than a month after we finished it. Then I remembered that another friend had once mentioned a graph model called RDF and its query language SPARQL, and I started digging into it. It's all W3C standards, so it's easy to read up on, and there are competing implementations.

It was a wild ride. When I started there was little to no tooling, only a few SPARQL implementations, and SPARQL 1.1 had not been released yet. It was a PITA to use, but it stuck with me: I finally had an agile data model that allowed me and our customers to grow with the problem. I was quite sceptical whether it would ever scale, but I didn't stop using it.

Initially one can be overwhelmed by RDF: it is a very simple data model, but at the same time it's a technology stack that allows you to do a lot of crazy stuff. You can describe the semantics of the data in vocabularies and ontologies, which you should share and re-use; you can traverse the graph with its query language SPARQL; and you have additional layers like reasoning that can uncover hidden gems in your data and make life easier when you consume or validate it. Most recently, people have started integrating machine learning toolkits into the stack so you can train models directly on your RDF knowledge graph.
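To make the "very simple data model" part concrete, here is a minimal sketch in plain Python. It is not a real triple store or SPARQL engine: the `match` and `reachable` helpers and the `http://example.org/` URIs are made up for illustration, and only the two FOAF property URIs come from an existing vocabulary. It just shows that a graph is a set of (subject, predicate, object) triples and that querying boils down to pattern matching and traversal.

```python
# Toy illustration of the RDF data model: a graph is a set of
# (subject, predicate, object) triples. The example.org URIs are invented.
EX = "http://example.org/"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"    # existing FOAF vocabulary term
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"  # existing FOAF vocabulary term

graph = {
    (EX + "alice", FOAF_NAME, "Alice"),
    (EX + "bob", FOAF_NAME, "Bob"),
    (EX + "carol", FOAF_NAME, "Carol"),
    (EX + "alice", FOAF_KNOWS, EX + "bob"),
    (EX + "bob", FOAF_KNOWS, EX + "carol"),
}


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None plays the role of a variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def reachable(graph, start, predicate):
    """Follow one predicate transitively, in the spirit of a SPARQL 1.1 property path."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        for _, _, nxt in match(graph, s=node, p=predicate):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


# Everyone Alice knows, directly or indirectly.
network = reachable(graph, EX + "alice", FOAF_KNOWS)

# Growing the model needs no schema migration: just add a triple
# with a new (here hypothetical) predicate.
graph.add((EX + "alice", EX + "worksFor", EX + "acme"))
```

The last line is the point of the whole exercise: a new kind of statement is one more triple, not a changed table definition. A real SPARQL implementation adds indexing, joins over several patterns, and much more on top of this idea.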

If you want to solve a small problem, RDF might not be the most logical choice at first. But then you think about it again and realize that this is probably not the end of it. Sure, maybe you would be faster using the latest and greatest key/value DB and hacking something together in a fancy web framework. But there is a fair chance the customer will want you to add things in the future, and you can be fairly certain that at some point it will blow up because the technology can't handle it anymore.

That will not happen with RDF. You will have to invest more time at first: you will talk about things like the semantics of your customer's data, and you will spend quite some time figuring out how to create identifiers (URIs in RDF) that are still valid years from now. You will have a look at existing vocabularies and just refine the things that are really necessary for the particular use case. You will think about integrating data from relational systems, Excel files or JSON APIs by mapping them to RDF, which again is all defined in W3C standards. You will mock up some data in a text editor, written in your favourite serialization of RDF. Yes, there are many serializations available, and you should most definitely throw away any book/text that starts with RDF/XML; use Turtle or JSON-LD instead, whatever fits you best.
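As a rough sketch of what "mapping JSON to RDF" and "creating identifiers" can look like, the following plain-Python snippet turns JSON records into N-Triples, the line-based W3C serialization that is also valid Turtle. Everything specific here is an assumption for illustration: the `http://example.org/customer/` namespace, the field names, and the `to_ntriples` helper are hypothetical, and only the schema.org terms and `rdf:type` are reused from existing vocabularies. Real projects would more likely use a proper RDF library or a declarative mapping language.

```python
import json
from urllib.parse import quote

BASE = "http://example.org/customer/"  # hypothetical namespace for minted URIs
SCHEMA = "http://schema.org/"          # existing vocabulary, reused instead of reinvented
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from JSON field names to vocabulary properties.
FIELD_TO_PROPERTY = {
    "name": SCHEMA + "name",
    "email": SCHEMA + "email",
}


def literal(value):
    """Serialize a value as a plain N-Triples string literal with basic escaping."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(records):
    """Map a list of JSON-like dicts to N-Triples, one statement per line."""
    lines = []
    for rec in records:
        # Mint a stable URI from the record's own key, percent-encoding unsafe characters.
        subject = f"<{BASE}{quote(str(rec['id']), safe='')}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{SCHEMA}Organization> .")
        for field, prop in FIELD_TO_PROPERTY.items():
            if field in rec:
                lines.append(f"{subject} <{prop}> {literal(rec[field])} .")
    return "\n".join(lines)


records = json.loads('[{"id": "acme 1", "name": "ACME", "email": "info@example.org"}]')
ntriples = to_ntriples(records)
print(ntriples)
```

The design choice worth noting is that the URI is derived from an identifier the source system already guarantees, not from a display name or a row position, which is what keeps it valid when the data changes later.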

After that you start automating everything: you write some glue code that interprets the DSL you just built on top of RDF and the appropriate vocabularies, and you start adjusting everything to your customer's needs. Once you go live it will look and feel like any other solution you built before, but unlike those you can extend it easily and increase its complexity when you need to.

And at that point you realize that this is all worth it, and you will most likely not touch any other technology stack anymore. At least that's what I did.

I could go on for a long time. In fact, I teach this stack in companies and government organizations over several days, and I can only scratch the surface of what you can do with it. It does scale, I'm convinced of that by now, and the tooling is getting better and better.

If you are interested, start by having a look at the Creative Commons course/slides we started building. There is still lots of content that should be added, but I had to start somewhere: http://linked-data-training.zazuko.com/

Also have a look at Wikipedia for a list of SPARQL implementations: https://en.wikipedia.org/wiki/Comparison_of_triplestores

Would I use other graph databases? Definitely not. The great thing about RDF is that it's open, you can cross-reference data across silos/domains and profit from work others did. If I create another silo in a proprietary graph model, why would I bother?

Let me finish with a quote from Dan Brickley (Google's schema.org) and Libby Miller (BBC) in a recent book about RDF validation:

> People think RDF is a pain because it is complicated. The truth is even worse. RDF is painfully simplistic, but it allows you to work with real-world data and problems that are horribly complicated. While you can avoid RDF, it is harder to avoid complicated data and complicated computer problems.

Source: http://book.validatingrdf.com/bookHtml005.html

I could not have come up with a better conclusion.