by IanCal 291 days ago
This seems to miss the other side of why all this failed before.

RDF has the same problems as SQL schemas, with information scattered. What fields mean requires documentation.

There - they have a name on a person. What name? Given? Legal? Chosen? Preferred for this use case?

You only have one ID for Apple, eh? Companies are complex to model. Do you mean Apple just as someone would talk about it? Or the legal structure of entities that underpins all major companies, and if so, which part of it is referred to?

I spent a long time building identifiers for universities and companies (which was taken for ROR later) and it was a nightmare to say what a university even was. What’s the name of Cambridge? It’s not “Cambridge University” or “The University of Cambridge” legally. But those are also the actual names as people use them. The University of Paris went from something like 13 institutes to maybe one to then a bunch more. Are companies located at their headquarters? Which headquarters?

Someone will suggest modelling to solve this but here lies the biggest problem:

The correct modelling depends on the questions you want to answer.

Our modelling had good tradeoffs for mapping academic citation tracking. It had bad modelling for legal ownership. There isn’t one modelling that solves both well.
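A minimal sketch of that tradeoff, with entirely made-up records (the `Org` fields, the IDs and the "Example University" names are hypothetical, not our actual schema). The same two records count as two things for citation tracking and as one thing for legal ownership:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Org:
    org_id: str
    display_name: str
    legal_owner: Optional[str]  # org_id of the owning legal entity, if any
    citation_group: str         # org_id that papers are counted under


# Hypothetical records, for illustration only.
ORGS = {
    "u1": Org("u1", "Example University", None, "u1"),
    "u2": Org("u2", "Example University Hospital", "u1", "u2"),
}


def citation_view(org_id: str) -> str:
    """Identity used when counting papers."""
    return ORGS[org_id].citation_group


def ownership_view(org_id: str) -> str:
    """Identity used when asking who legally owns what."""
    current = ORGS[org_id]
    while current.legal_owner is not None:
        current = ORGS[current.legal_owner]
    return current.org_id


def same_thing(a: str, b: str, view: Callable[[str], str]) -> bool:
    """Whether two records are 'one thing' under a given view."""
    return view(a) == view(b)


print(same_thing("u1", "u2", citation_view))   # two things for citations
print(same_thing("u1", "u2", ownership_view))  # one thing for ownership
```

Neither answer is wrong. The question picks the view, and a single fixed "is it one or two things?" field can't serve both.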

And this is all for the simplest of questions about an organisation - what is it called and is it one or two things?


Indeed, I often get the impression that (young) academics want to model the entire world in RDF. This can't work because the world is very ambiguous.

Using it to solve specific problems is good. A company I work with tries to do context engineering / adding guard rails to LLMs by modeling the knowledge in organizations, and that seems very promising.

The big question I still have is whether RDF offers any significant benefits for these way more limited scopes. Is it really that much faster, simpler or better to do queries on knowledge graphs rather than something like SQL?
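To make the comparison concrete at toy scale, here is a rough sketch using Python's built-in `sqlite3`. The tables, predicates and data are invented for illustration, and a plain triple table is only a stand-in for a real RDF store. The same question ("who works where?") is asked once against a conventional schema and once against subject/predicate/object rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conventional schema: meaning lives in table and column names.
    CREATE TABLE org (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, employer_id TEXT);
    INSERT INTO org VALUES ('o1', 'Example Org');
    INSERT INTO person VALUES ('p1', 'Ann', 'o1');

    -- Triple-style storage: meaning lives in the predicate values.
    CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);
    INSERT INTO triple VALUES
        ('p1', 'name', 'Ann'),
        ('p1', 'worksFor', 'o1'),
        ('o1', 'name', 'Example Org');
""")

relational = conn.execute("""
    SELECT person.name, org.name
    FROM person
    JOIN org ON org.id = person.employer_id
""").fetchall()

# The graph-style query needs one self-join per hop or attribute.
graph_style = conn.execute("""
    SELECT person_name.object, org_name.object
    FROM triple AS works
    JOIN triple AS person_name
      ON person_name.subject = works.subject AND person_name.predicate = 'name'
    JOIN triple AS org_name
      ON org_name.subject = works.object AND org_name.predicate = 'name'
    WHERE works.predicate = 'worksFor'
""").fetchall()

print(relational)
print(graph_style)
```

Both queries return the same rows here. The triple form is more flexible about adding new kinds of facts, but "name" is just as undocumented in either version.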

I think it's a journey a lot of us have gone on. It's an appealing idea until you hit a variety of really annoying cases, and where you end up depends on how you try to solve them. I'm maybe being unfair to the academic side but this is how I've seen it (exaggerated to show what I mean, hopefully).

The more academic side will add more complexity to the modelling, trying to model it all.

The more business side will add more shortcuts to simplify the modelling, trying to get just something done.

Neither is wrong as such but I prefer the tendency to focus on solving an actual problem because it forces you to make real decisions about how you do things.

I think being able to build up knowledge in a searchable way is really useful and having LLMs means we finally have technology that understands ambiguity pretty well. There's likely an excellent place for this now that we can model some parts precisely and then add more fuzzy knowledge as well.

> The big question I still have is whether RDF offers any significant benefits for these way more limited scopes. Is it really that much faster, simpler or better to do queries on knowledge graphs rather than something like SQL?

I'm very interested in this too, I think we've not figured it out yet. My guess is probably no, in that it may be easier to add the missing parts to non-RDF things. I have a rough feeling that actually having something like a well linked wiki backed by data sources for tables/etc would be great for an LLM to use (ignoring cost, which for predictions across a year or more seems pretty reasonable).

They can follow links around topics across arbitrary sites well, you only need more programmatic access for aggregations typically. Or rare links.

The academic / business divide is a great example of the correct model depending on what you want to do. The academic side wants to understand, the business side wants to take action.

For example, the Viable System Model[1] can capture a huge amount of nuance about how a team functions, but when you need to reorganize a dysfunctional team, a simple org chart and concise role descriptions are much more effective.

[1] https://en.wikipedia.org/wiki/Viable_system_model

Which company? I need to build an enterprise knowledge graph.

A small startup in the Netherlands, but they're very much searching for approaches themselves, I don't think they can help you right now.

That university example is fantastic.

I went looking and as far as I can tell "The Chancellor, Masters, and Scholars of the University of Cambridge" is the official name! https://www.cam.ac.uk/about-the-university/how-the-universit...

That's the one! It's not even that weird of a case compared to others but is an excellent example.

Here's the history of the Paris example: https://en.wikipedia.org/wiki/University_of_Paris where there was one, then many, then fewer universities. Answering a question of "what university is referred to by X" depends on why you want to know, there are multiple possible answers. Again it's not the weirdest one, but a good clear example of some issues.

There's a company called Merck, and a company called Merck. Merck is called Merck in the US but MSD outside of it. The other Merck is called Merck outside the US and EMD inside it. Technically one is Merck & Co., which used to be part of Merck but later wasn't, and the split naming is due to trademark disputes, which aren't even all resolved yet.
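As a rough sketch of what that does to a lookup: the name alone can't be the key, you need at least the region as well. The company IDs and the `resolve` helper below are hypothetical, and the table only encodes the US / non-US split described above:

```python
from typing import Optional

# (name as written, region) -> hypothetical internal company ID
NAME_TABLE = {
    ("Merck", "US"): "merck_and_co",
    ("MSD", "non-US"): "merck_and_co",
    ("Merck", "non-US"): "the_other_merck",
    ("EMD", "US"): "the_other_merck",
}


def resolve(name: str, country_code: str) -> Optional[str]:
    """Resolve a company name using where the text was written.

    Returns None when the name is not used in that region.
    """
    region = "US" if country_code == "US" else "non-US"
    return NAME_TABLE.get((name, region))


print(resolve("Merck", "US"))  # merck_and_co
print(resolve("Merck", "DE"))  # the_other_merck
print(resolve("MSD", "US"))    # None
```

And that still assumes you know where the text came from, which for something like a paper's affiliation string you often don't.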

This is an area where I think LLMs actually have space to step in. We have tried perfectly modelling everything so we can let computers, which have no ability to manage ambiguity, answer some questions. We have tried barely modelling anything and letting humans figure out the rest, but they're typically pretty poor at crafting the code, so that has issues. We ended up settling largely on spending a bunch of human time modelling some things, then other humans building tooling around them to answer specific questions by writing the code, and a third set who get to actually ask the questions.

LLMs can manage ambiguity, and they can also do more technical code based things. We haven't really historically had things that could manage ambiguity like this for arbitrary tasks without lots of expensive human time.

I am now wondering if anyone has done a graph db where the edges are embedding vectors rather than strict terms.

> I am now wondering if anyone has done a graph db where the edges are embedding vectors rather than strict terms.

Curious: how would you imagine it working if there were such a graph db?

I had the idea a few hours ago so I'm sure there are holes in this but my first idea is forming a graph where the relationship isn't a fixed label but a description that is then embedded as a vector.

First of all, consider that in a way each edge label is a one-hot binary vector, and we search using only binary methods. A consequence is that any data outside of that very narrow path is missed in a search. A simple step could be to change that to anything within an X similarity to some target vector. Could you then search "(fixed term) is a love interest of b?" and have b? filled from facts like "(fixed term) is intimate with Y" and "(fixed term) has a date with Z"?

There are probably issues, I'm sure there are, but some blend of querying but with some fuzziness feels potentially useful.
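A toy sketch of the idea, not a real graph DB: the three-dimensional vectors are hand-picked stand-ins for what an embedding model would produce from each relation description, and the names `EDGES`, `neighbours` and `LOVE_INTEREST` are made up for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# (source, relation description, toy relation vector, target)
# Vectors are invented; a real system would embed the description text.
EDGES = [
    ("a", "is intimate with", (0.9, 0.1, 0.0), "Y"),
    ("a", "has a date with", (0.8, 0.3, 0.1), "Z"),
    ("a", "works for", (0.0, 0.1, 0.9), "W"),
]


def neighbours(source: str, query: Sequence[float], min_similarity: float) -> List[str]:
    """Follow every edge from `source` whose relation is close enough to `query`."""
    return [
        target
        for edge_source, _description, vector, target in EDGES
        if edge_source == source and cosine(vector, query) >= min_similarity
    ]


# Stand-in for the embedding of "is a love interest of".
LOVE_INTEREST = (1.0, 0.2, 0.0)

print(neighbours("a", LOVE_INTEREST, 0.9))   # ['Y', 'Z']
print(neighbours("a", LOVE_INTEREST, 0.99))  # ['Y']
```

The threshold is doing the work a fixed label normally does, so picking it (and deciding whether it should differ per query) is probably where the holes are.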

Isn't this exactly what Neo4j does for GraphRAG?

Is that vectors for edges or for searching the nodes? I’m talking about encoding the edges as vectors for traversal.

Brb updating my LinkedIn

> The correct modelling depends on the questions you want to answer.

Coincidentally, my main point in any conversation about UML I've ever had

Basically it's namespacing hell, right?

To adapt the saying, an engineer is talking to another engineer about his system, saying he's having issues with names. So he's thinking of using namespaces.

Now he has two problems