by j-pb 1460 days ago
The paper is pretty tone deaf and arrogant, but in line with the culture that I experienced in my short time at Vrije Universiteit Brussel.

It feels like the semantic community has a fault line between the realists and the theorists. With technologies like JSON-LD and SHACL done by the realists under the linked data banner. And OWL and Description logics done by the theorists under the old semantic web label.

Tim Berners-Lee is really hurting the progress of semantic data processing and representation by opposing any web technology that could be an alternative to RDF unless it's folded under the RDF banner.


You may be right about the pre-existing "wedge" in the semantic/LD community, but this paper is an attempt to address it and provide a unifying perspective. I'm not quite sure what you're seeing as "arrogant" or tone deaf in it.
People get seduced by specifications that don't really specify anything.

Conversely, there is a lot of pushback against W3C standards because they are specific, and unfortunately people don't see that as freedom (freedom to choose tools that interoperate) and don't see the slavery in being stuck with poorly specified "standards" that are controlled by one entity.

GraphQL is improved (we almost know what the algebra is) but originally it was an asymmetric specification meant to keep power in the hands of Facebook.

That is, they didn't want to specify what the exact rules for traversing the graph are because they have commercial reasons for controlling what information you can get plus some responsibility to protect users' privacy.

Schema.org was another asymmetric standard, as it wasn't particularly good for exposing semantic metadata that people could consume as such (it took me a few years to really figure out how) but it was great for companies like Google to make a training set that would ultimately let them extract entities from documents that aren't marked up. It achieved some popularity because there is no limit to the hoops people will jump through if it improves their SERP rating from 78 to 55.

> People get seduced by specifications that don't really specify anything.

Like the RDF 1.1 Spec? [https://www.w3.org/TR/rdf11-concepts/]

The whole "abstract syntax" shenanigans that RDF pulls is one of its biggest flaws. It makes the entire ecosystem huge and unwieldy, and has little upsides besides giving everybody their favourite serialisation flavour.

It makes things like canonical representations for content addressable hashing and signing pretty much impossible, which is a huge detriment to proper authentication and provenance tracking.
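To make the hashing point concrete, here is a minimal sketch in Python. The `urn:ex:` identifiers are made up, and the "canonicalization" is a deliberately naive line sort that I'm assuming only for illustration. It shows that the same two-triple graph hashes differently depending on statement order, and that sorting only rescues the simplest case (ground triples in a line-based syntax). Blank node labels, or another syntax such as Turtle or JSON-LD, would defeat it.

```python
import hashlib

# Two serializations of the same two-triple graph; only the statement order differs.
doc_a = (
    '<urn:ex:book> <urn:ex:title> "Example" .\n'
    '<urn:ex:book> <urn:ex:year> "1884" .\n'
)
doc_b = (
    '<urn:ex:book> <urn:ex:year> "1884" .\n'
    '<urn:ex:book> <urn:ex:title> "Example" .\n'
)

def raw_hash(doc: str) -> str:
    """Hash the bytes exactly as serialized."""
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

def sorted_hash(doc: str) -> str:
    """Naive canonicalization: sort the statement lines before hashing."""
    lines = sorted(line for line in doc.splitlines() if line.strip())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

same_raw = raw_hash(doc_a) == raw_hash(doc_b)           # byte-level hashes disagree
same_sorted = sorted_hash(doc_a) == sorted_hash(doc_b)  # sorted form agrees
```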

It also pulls in all of these other open ended standards, where everything and anything is a valid subject identifier, so long as it's URI resolvable by http (which is pretty vague and random).

Subjects and predicates should have always just been 16byte random UIDs, which at the same time would have delivered us from the bane of blank nodes, endless discussions on predicate names, and broken links.

The object part should also have been limited in the amount of data it can hold, just hash anything bigger and store it in some form of content addressable blob store.
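A rough sketch of what that layout could look like, in Python. The 32-byte inline limit, the in-memory `blob_store` dict and the function names are all assumptions for illustration; the comment above doesn't specify any of them.

```python
import hashlib
import secrets

MAX_INLINE = 32  # assumed size limit for an inline object value, in bytes

# Content-addressed store: SHA-256 digest -> payload (a dict stands in for real storage).
blob_store: dict[bytes, bytes] = {}

def new_id() -> bytes:
    """A 16-byte random identifier for a subject or predicate."""
    return secrets.token_bytes(16)

def encode_object(value: bytes) -> tuple[str, bytes]:
    """Keep small values inline; replace larger ones with their hash."""
    if len(value) <= MAX_INLINE:
        return ("inline", value)
    digest = hashlib.sha256(value).digest()
    blob_store[digest] = value
    return ("blob", digest)

def make_triple(subject: bytes, predicate: bytes, value: bytes):
    """Build a (subject, predicate, object) triple with a bounded object part."""
    return (subject, predicate, encode_object(value))

person, name, bio = new_id(), new_id(), new_id()
small = make_triple(person, name, b"Ada")
large = make_triple(person, bio, b"x" * 1000)
```

Every triple then has a fixed upper bound on its size, and the large payload is still retrievable via `blob_store[large[2][1]]`.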

> so long as it's URI resolvable by http

It doesn’t have to be. URI is a pretty broad concept and URLs are just a subset. It’s perfectly fine to identify an entity with other URIs that are not URLs. If you’re talking about a book, for instance, you could use the ISBN: “urn:isbn:0-486-27557-4”.

The usefulness of using resolvable URLs as URIs is just that if you have absolutely no knowledge about the resource except its URI, and that URI happens to be a resolvable URL, then at least you know where to go looking to find out more.
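A small illustration of the distinction, using Python's standard `urllib.parse`. The `looks_resolvable` helper is a crude heuristic I'm assuming for the example (http or https scheme plus a host), not any standard rule, and the example.org URL is hypothetical.

```python
from urllib.parse import urlsplit

def looks_resolvable(uri: str) -> bool:
    """Crude heuristic (an assumption, not a standard rule): http(s) scheme plus a host."""
    parts = urlsplit(uri)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

isbn = "urn:isbn:0-486-27557-4"            # a URI that names a book but locates nothing
page = "https://example.org/books/27557"   # hypothetical URL, gives you a place to look

isbn_parts = urlsplit(isbn)  # scheme is "urn", there is no network location
```

Both strings are perfectly good identifiers; only the second one tells you where to go looking.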

The URI resolution idea is 99% crap.

That is, most of the time you don't want to publish subjects and predicates as resolvable URIs. However, people see so many examples of http:... that they don't realize it's even possible to make non-resolvable URIs.

I used random UUIDs all the time but that is a super-fraught area since some people really want them to be in temporal sequence so their database index is happy.

I've also done the content addressable blob store thing.

I've had some pretty good experiences with using the following random UID scheme.

32 bit millisecond timestamp that just rolls over, i.e. truncate(current_time_ms()), concatenated with 96 bits of crypto grade entropy.

You get both nice properties: database index locality, plus proper entropy, so you can sleep well and not worry about collisions (which matters, since the entire timestamp space gets more densely populated with every overflow, roughly every 50 days).
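A minimal sketch of that scheme in Python. I'm assuming the random part is 96 bits (12 bytes), so the whole identifier comes to 16 bytes like the UIDs suggested further up, and I'm assuming a big-endian timestamp prefix so that byte-wise sorting follows time within one rollover period. The function name and the `now_ms` parameter are made up for the example.

```python
import secrets
import time

def new_uid(now_ms=None) -> bytes:
    """32-bit truncated millisecond timestamp followed by 96 bits of entropy."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # Truncate to 32 bits so the prefix simply rolls over.
    prefix = (now_ms & 0xFFFFFFFF).to_bytes(4, "big")
    return prefix + secrets.token_bytes(12)

# How long until the 32-bit millisecond counter wraps around.
ROLLOVER_DAYS = 2**32 / (1000 * 60 * 60 * 24)
```

`ROLLOVER_DAYS` works out to a little under 50, which matches the "roughly every 50 days" figure.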

It's also what PostgreSQL uses for its index friendly UID format. :D

I think proposed UUID v7 is sortable, FWIW.

https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...

That's hilarious that they're so out of touch that the JSON-LD and SHACL people are the realists. JSON-LD is ridiculous and a desperate attempt to attach semantic web technology to something that is actually used. It was announced with the misleading post titled "JSON-LD and Why I hate the Semantic Web". Only it's completely about the semantic web and is just a serialization format for RDF expressed in JSON. So you take JSON, a format that can be described in a single page, and layer this monstrosity over it. For what? The best thing that can be said for it is you can ignore it (maybe) and just treat it as JSON. There was zero need for JSON-LD. They had a perfectly good serialization format in TURTLE. It was similarly easy to describe and understand as JSON, but nooooo. They're always riding the coattails of some other popular technology trying to get a free ride. It's like that obnoxious kid who shows up to the party and tells you how much smarter they are than you and then complains that no one will talk to them.

SHACL is almost dumber than JSON-LD, if you can believe that. At least JSON-LD is just a serialization format. If you can manage to get it to parse you can just reserialize it to a sane format like TURTLE and get rid of the stupid. With SHACL you're stuck with it. So you go and create the world's slowest database by basically normalizing the hell out of it because if a little normalization is good a lot of normalization is better. Screw knowing anything about the world. Let's throw that all out the window and allow people to express anything.

So now you've got a database that can express anything and people say, "hey, I actually know some things about the world that seem to hold and my database is becoming a ridiculous mess. Can we make it so that people can't express something like I already can with my relational database?" Well the semantic web people went off in deep thought for a decade and finally came back with SHACL. It's a constraints language, expressed in RDF, of course, and if you express it in JSON-LD that means you've got SHACL expressed in RDF expressed in JSON, joy. I guess you could implement it in a number of different ways but it ends up firing off a series of queries saying, "Is this ok? What about that? How about this?" and if they're all successful then it will allow you to run the query you actually wanted to run. So now after all that you've got the world's slowest database that is now orders of magnitude even slower so that it operates a bit like MySQL.

Semantic web databases allow you to express just about any query you'd like, but they let you express queries that you're never going to run, and they are extremely slow for the ones you are.

I'm not gonna add much, because I think we're on a similar page in terms of RDF-rantyness. I find the entire semantic web/linked data space horribly bloated and overcomplicated.

However! There is something to be said about triples and normalisation. Is the general idea of triples a really good format for databases? Maybe not. Is it a really good format for knowledge representation? Yeah I think so.

Real world knowledge is quite messy and riddled with exceptions. People from cultures without a last name. A stump is still a tree in the right context, even if it doesn't have a crown. A character in D&D might have traits that are completely uncommon for their class. The Greek gods for sleep, death or the night sky are both concepts and characters.

You can't model these things with a database, which is primarily tailored towards modelling the "inner world" of a computer.

The semantic web is still frustratingly bad at these things, because of description logics, OWL and their mind-share that _everything must be class based_ and enforced/checked at creation/load time.

In reality it's much better to throw all of that away, and just do duck typing at query time, by letting the consumer decide which entities they want to process. Sure some entities might not get processed, because they don't conform to the shape that the consumer expects, but that's a strength. A different system might consume them, they might be ignored indefinitely, or they are handled at a later time.
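A toy sketch of what duck typing at query time could look like, in Python. The entity names, predicates and the `entities_with` helper are invented for illustration, and the sketch keeps only one value per predicate for simplicity.

```python
from collections import defaultdict

# No schema is enforced at load time; any subject can carry any predicates.
triples = [
    ("alice", "firstName", "Alice"),
    ("alice", "lastName", "Liddell"),
    ("plato", "firstName", "Plato"),   # an entity with no last name
    ("stump", "species", "oak"),
]

def entities_with(triples, required):
    """Return the entities that have at least the predicates this consumer needs."""
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o  # simplification: last value wins for repeated predicates
    return {s: attrs for s, attrs in by_subject.items() if required.issubset(attrs)}

# Each consumer states its own shape; nothing is rejected globally.
mailing_list = entities_with(triples, {"firstName", "lastName"})
greeting_list = entities_with(triples, {"firstName"})
```

The entity without a last name is skipped by one consumer and picked up by the other, and the stump is left alone for some other system to handle.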

The directedness of individual facts also allows you to implicitly encode "who makes this assertion", providing a mechanism to make distributed consistency much easier.

Limiting the number of columns (to 3) also allows you to materialise all possible indices (for each ordering), which is really interesting in combination with worst case optimal join algorithms.
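As a sketch of the indexing point, in Python: with three columns there are only 3! = 6 orderings, so materialising every one of them is cheap enough to do eagerly. Sorted lists plus a binary search stand in for real index structures here, the sample triples are invented, and the join algorithm itself is not shown.

```python
from bisect import bisect_left
from itertools import permutations

triples = [
    ("x", "manages", "y"),
    ("x", "manages", "z"),
    ("y", "name", "Y"),
]

# All six column orderings: spo, sop, pso, pos, osp, ops.
ORDERS = ["".join(p) for p in permutations("spo")]

def build_indices(triples):
    """Materialise one sorted copy of the data per column ordering."""
    indices = {}
    for order in ORDERS:
        positions = ["spo".index(c) for c in order]
        indices[order] = sorted(tuple(t[i] for i in positions) for t in triples)
    return indices

def prefix_scan(index, prefix):
    """Return all rows in a sorted index that start with the given prefix."""
    start = bisect_left(index, prefix)
    rows = []
    for row in index[start:]:
        if row[:len(prefix)] != prefix:
            break
        rows.append(row)
    return rows

indices = build_indices(triples)
who_manages_z = prefix_scan(indices["pos"], ("manages", "z"))
mentions_of_y = prefix_scan(indices["ops"], ("y",))
```

Whatever combination of bound and unbound positions a query has, some ordering turns it into a contiguous range scan.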

However nothing in the RDF ecosystem makes use of these strengths. It's all rigid, classy, complex, slow and buggy, but I don't think that a heavily normalised knowledge base built on triples has to be that way.

Huh? OWL doesn't check anything when you load data. What OWL does is infer new facts based on the facts that are there.

For instance if you put in an axiom that says "a manager must manage one or more employees" then the system will infer that X is a manager once you add a fact that Y is managed by X. Classes in OWL are classes as in "classification", not classes like Java where you have to create a class simply to have a place in memory to put facts.
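To illustrate the "infer, don't validate" behaviour, here is a hand-written forward rule in Python. It is not an OWL reasoner, only a stand-in for the single inference described above, and the predicate names `managedBy` and `type` are assumptions for the example.

```python
def infer_managers(facts):
    """Add (X, type, Manager) for every (Y, managedBy, X); never reject anything."""
    inferred = set(facts)
    for s, p, o in facts:
        if p == "managedBy":
            inferred.add((o, "type", "Manager"))
    return inferred

# Nobody declared X to be a manager; the classification falls out of the data.
facts = {("Y", "managedBy", "X")}
closure = infer_managers(facts)
```

Loading the fact never fails a check. The system just ends up knowing one more thing than it was told.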

Some of the reason why people "just don't get RDF" is that it works exactly in the opposite of conventional systems and that creates so much cognitive dissonance that you can see people's brains shorting out when they encounter it.

Sorry for my imprecision. Yes, OWL can be used like that, and automatic tagging is one of the few good use cases.

But in reality A-Box completion is not the big use case for OWL. T-Box model checking is.

All those fancy bioinformatics ontologies and "databases" that get paraded around by the DL folks. All those lower ontologies. There is not a single A-Box fact describing genes, diseases, products, objects or whatever. It's all T-Box concepts.

I mean, there are papers lamenting common RDF database T-Box size and performance limitations, because they want to collect medical data, but have to shoehorn it into the ontology.

That's also something that the authors don't seem to get. SHACL popped up because people wanted to have something that operates over their A-Boxes without slowly dragging their entire modeling and data storage into the T-Box. That's why they don't want the "description logic perspective", as it automatically leads down that "no instances, just theories" rabbit hole.

As an aside: even if OWL was used for classification only, it'd be rather moot. So you've classified something as a manager. Once you act on it, e.g. by having a query that only asks for manager entities, you are stuck with the same brittle class based approach, where the query requires more constraints than it actually needs. The query already contains all the properties required; it's its own anonymous classification, so to speak.

Practically I work with SPIN. I wish somebody would make a production rules engine that was easy to live with. I want to like Drools but I can't read the error messages for complex programs I write. With Jena Rules the system is simple enough that I can figure problems out looking at the source code but it doesn't have as many features.

Unfortunately logic is a depressing subject because it starts with a bunch of theorems about what is impossible (Gödel, Tarski, Turing). There is no system of negation that is without problems (OWL takes the radical choice of no negation). Commonsense reasoning involves a lot of "Alice thinks that Jane thinks that..." and "A was true until 12:30 this afternoon, now A is false".

The theory vs interpretation split is another one of those decisions you have to make if you want to do logic: I am on a committee where I'm the guy who speaks for interpretations and the A-Box but some of the other people are serious T-Boxers.

It amazes me that this system

http://inform7.com/

creates an illusion of letting an English major write a script for an adventure game that reads like English that someone can play in what looks like a subset of English. It does it all with a very primitive production rules engine that relies heavily on defaults. Practical logic requires attention to rules and "schemes" (X macros, configuration settings on the rules engine.) I wrote an adventure game with a few rooms and objects in Drools and dreamt of making something like "Inform 7 for business rules".

> Commonsense reasoning involves a lot of "Alice thinks that Jane thinks that..." and "A was true until 12:30 this afternoon, now A is false".

These are both examples of modalities. From a formal point of view, description logics are special cases of multi-modal logics. The semantics of these can in turn be understood as computationally well-behaved restrictions of FOL, where the logical quantifiers are understood to range over so-called "possible worlds".

> However! There is something to be said about triples and normalisation. Is the general idea of triples a really good format for databases? Maybe not. Is it a really good format for knowledge representation? Yeah I think so.

There's Datomic and XTDB as practical examples of databases built on data models that are equal or similar to triples.

Right, not semantic web technologies and not very often used. Before someone jumps in with, "but it's used at....", I'm not saying people aren't using it but in the grand scheme they are extremely niche products.
That’s a backwards way to think about it.

JSON-LD lets you paint some extra structure onto a JSON document by adding a little metadata. That is, add some semantic smarts to an existing JSON document.

What people miss is that the semantic web was never an attempt to fit everybody into a straightjacket but rather an attempt to mash together all the data in the world into a huge Katamari ball.

That so beautifully captures the attitude of the semantic web community, "What people miss...". Didn't miss anything. It's this attitude, that the problem isn't with what's been done, it's that other people have failed to recognize how brilliant it all is.
It hasn't been communicated well. The work hasn't been planned well. SHACL should have been introduced before OWL. RDFS doesn't really work as a data integration language because you can't write a rule like

   ?x :lengthInInches ?t -> ?x :lengthInCentimeters ?t*2.54
Since RDF isn't fit for the purpose it was designed for, people get really confused about it.
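For comparison, the rule above is trivial to write as a plain production rule once you are allowed to compute on the matched value. A minimal Python sketch over tuples follows; the function name and the sample facts are made up, and this is not SPIN or any other real rule language.

```python
def inches_to_centimeters(triples):
    """For every ?x :lengthInInches ?t, also assert ?x :lengthInCentimeters ?t*2.54."""
    derived = {
        (s, "lengthInCentimeters", o * 2.54)
        for s, p, o in triples
        if p == "lengthInInches"
    }
    return set(triples) | derived

facts = {("board", "lengthInInches", 10.0), ("board", "color", "red")}
result = inches_to_centimeters(facts)
```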

The strange thing is that semweb work has been massively overfunded in zones that face severe multilingual problems (Europe), but it seems to be almost banned in academia in the United States. (except for a few enclaves I can enumerate on the fingers of a few hands.) I know quite a few Africans who are semweb believers because they are looking at markets fragmented by languages, but they don't have the overfunding that Europeans have. (Funny enough I am working on a standards doc for ISO 20022 which is badly wanted by Chinese authorities who are interested in semweb tech because they want to feel included in financial messaging despite language barriers.)

Pardon my ignorance, but is any of this used for anything practical?
I see JSON-LD show up in a lot of standards-driven document formats and it's made sense to me in that context. For instance, in DIDs^1 the basic document shape is established by one standard but then the key material is often identified using other standards. JSON-LD makes it possible to merge together the definitions from the multiple standards without too much trouble or loss of specificity.

How you actually use JSON-LD seems to vary in practice. Programmatically, the most useful thing it can do is normalize a JSON into a shape you expect without losing specificity^2. More often, however, it serves as a form of documentation for where the terms are coming from, because developers will usually try to maintain a nice shape with the JSON documents they produce.^3

The @context is hard for developers to create, however, so outside of the standards process I'm unsure it gets a lot of utility. I'm also unsure how reliably the @context values have consistent machine-readable documents produced, so the programmatic value may be limited.

^1 Which we'll assume are a good thing for the purpose of this conversation.

^2 You can do this by expanding the JSON into the very noisy RDF graph-triple form, then re-condensing it into an object with sane (unprefixed, non-URL) keynames and values using your own @context.

^3 Of course you can't RELY on people to do that, which is where the programmatic utility comes into play.
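A deliberately tiny sketch of the expand-then-recompact idea from footnote 2, in Python. This is hand-rolled and is not the JSON-LD algorithm: it only handles a flat object with a flat term map, whereas real processing also deals with nesting, @id, @type, typed values and remote contexts, so use a conforming JSON-LD library for anything real. The example.org vocabulary URL and the short key names are hypothetical.

```python
def expand(doc, context):
    """Replace short keys with the full IRIs the context maps them to."""
    return {context.get(k, k): v for k, v in doc.items() if k != "@context"}

def compact(expanded, context):
    """Replace full IRIs with the short keys of a context of your choosing."""
    reverse = {iri: term for term, iri in context.items()}
    return {reverse.get(k, k): v for k, v in expanded.items()}

# Someone else's document, using their own key name for the shared term.
their_doc = {
    "@context": {"nm": "https://example.org/vocab#name"},
    "nm": "Alice",
}

# Your own context: same IRI, the key name your code expects.
my_context = {"name": "https://example.org/vocab#name"}

expanded = expand(their_doc, their_doc["@context"])
normalized = compact(expanded, my_context)
```

The full IRI in the middle is what keeps the meaning pinned down while the key names change on either side.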

Firms that are getting value out of semantic web tech are often very quiet about it because they see it as a secret weapon.
I find that a bit implausible since the semantic web is designed as an interop layer, not implementation-level tech. The whole point of it is lost if you keep it a secret!
Not really.

The real ideology behind the semantic web is that you can use something like RDFS to rewrite the vocabulary somebody else uses to what you want... Except there are a number of reasons why you can't, for instance it is reasonable to write

  :Today :temperatureInCentigrade 28.8 .
but maybe you want to query

  :Today :temperatureInFahrenheit 84.0 .
The production rule to convert one to another is pretty simple, and you can implement it with SPIN, but not RDFS or OWL. Similarly there are data formats that organize a tree-like structure in an arbitrary way and to convert one to another you have to match a graph pattern to a graph pattern not a predicate to a predicate.
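A small Python sketch of the difference. The rename function mimics what a predicate-to-predicate mapping (the kind rdfs:subPropertyOf gives you) can do, which is carry the value across unchanged, while the production rule computes a new value. Function names are made up, and this is plain Python, not SPIN.

```python
data = {("Today", "temperatureInCentigrade", 28.8)}

def rename_predicate(triples, old, new):
    """What a predicate-to-predicate mapping can do: copy the value unchanged."""
    return {(s, new, o) for s, p, o in triples if p == old}

def centigrade_to_fahrenheit(triples):
    """A production rule: match a pattern, compute a new value, assert a new fact."""
    return {
        (s, "temperatureInFahrenheit", o * 9 / 5 + 32)
        for s, p, o in triples
        if p == "temperatureInCentigrade"
    }

# Renaming claims it is 28.8 degrees Fahrenheit, which is wrong.
renamed = rename_predicate(data, "temperatureInCentigrade", "temperatureInFahrenheit")

# The rule yields about 83.84, which rounds to the 84 used in the example above.
converted = centigrade_to_fahrenheit(data)
```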

The whole fun of RDF is it is a clean basis to build your own data model. Want to add some kind of reification? Just add properties to triples.

I look at the OMG standards for an example where broken standards are par for the course because a few organizations fix the standards and can build proprietary tools on them. For instance the claimed reason why EMOF exists is that you could bootstrap UML 2 from it. It's not quite possible because the standard is broken. I'm pretty sure you can fix a few small things with EMOF and get it to work but it's not accidental at all that it doesn't work out of the box.

The semweb community is absolutely sick of the standards process which is one more reason why broken things don't get fixed.