Hacker News
by LAC-Tech 1255 days ago
What are people using graph databases for, and what do your queries look like?

I've read about them briefly but I have to admit my imagination fails me as to how it would look in the real world.

8 comments

I use Neo4j to create a CMDB that pulls in data from Active Directory, file shares, the CrowdStrike API, the Okta API, Windows services, processes, and TCP ports, vCenter, Cisco CDP, ARP tables, routing tables, and MAC address tables from routers and switches. PowerShell Get-Foo commands combined with ConvertTo-Json make it very easy to import data from Windows.

A possible query would be match (host:ESXihost)-[:running]->(vm:WindowsVM)-[:running]->(:Process {name:$processName}) return vm,host

I feel graph databases work very well to document the myriad dependencies in an enterprise IT stack and to integrate siloed data.

That's an interesting idea. Having done CMDB stuff in a previous life and also used Neo4j in my last job, I appreciate that one. I don't know whether you'd gain much vs using Postgres with JSON fields, but I bet the ergonomics are better, and if you do need a big relationally recursive query then it'd work well.
Graph databases make it very cheap to traverse relationships between things, but they are slower at per-item-type operations. So finding your friends of friends of friends is cheap, but finding the mean age of everyone in the database is slow.
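To make that trade-off concrete, here is a minimal Python sketch, not how any particular graph database is implemented. The names, ages, and friendships are made up for illustration. The traversal only touches the neighbourhood of the starting person, while the aggregate has to visit every record.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: person -> age, plus undirected friendships.
ages = {"ann": 30, "bob": 40, "cat": 50, "dan": 20, "eve": 60}
friendships = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]

# Adjacency list: each person points directly at their friends.
friends = defaultdict(set)
for a, b in friendships:
    friends[a].add(b)
    friends[b].add(a)


def within_hops(start, hops):
    """Return everyone reachable from `start` in at most `hops` friendship steps."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        # Expand one hop, dropping anyone we have already visited.
        frontier = {n for p in frontier for n in friends[p]} - seen
        seen |= frontier
    return seen - {start}


# Friends of friends of friends: only ann's neighbourhood is touched.
fof = within_hops("ann", 3)

# Mean age: every record has to be read, whatever the graph looks like.
average_age = mean(ages.values())
```

With this data, `within_hops("ann", 3)` never looks at "eve", but `mean(ages.values())` must.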
They are useful specifically in the intelligence field, like at the NSA (no wonder they have so much graph stuff open-sourced). Let me share one obvious use case: you have data on a lot of people, like call data records, Facebook friends lists, Twitter followers/following lists, and potentially a lot of other data as well. Now you have two targets, person A and person B. With a graph database it is a trivial one-liner to find how these two people are linked. They can be linked directly or they could have 5 people between them. Doing the same with a recursive CTE in SQL is a major pain and takes a lot of time (see degrees of Kevin Bacon using a graph database). There are very niche companies making big bucks just selling libraries/software to plot these graphs, and most of their customers are government agencies with a lot of funds.
I don't think recursive CTEs are that bad
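For anyone who hasn't seen one, here is roughly what the "how are A and B linked" query looks like as a recursive CTE. This is only a sketch using Python's built-in sqlite3 module; the `knows` table, its edges, and the `shortest_link` helper are all made up for illustration, and the `>`-delimited path string is a crude way to avoid walking in circles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
# Hypothetical directed "knows" edges.
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    [("A", "C"), ("C", "D"), ("D", "B"), ("A", "E")],
)

QUERY = """
WITH RECURSIVE walk(node, depth, path) AS (
    -- Start at the first person; path is a '>'-delimited list of visited nodes.
    SELECT :start, 0, '>' || :start || '>'
    UNION ALL
    -- Follow one more edge, skipping anyone already on the path.
    SELECT k.dst, w.depth + 1, w.path || k.dst || '>'
    FROM walk AS w
    JOIN knows AS k ON k.src = w.node
    WHERE w.depth < :max_depth
      AND instr(w.path, '>' || k.dst || '>') = 0
)
SELECT path, depth FROM walk
WHERE node = :target
ORDER BY depth
LIMIT 1
"""


def shortest_link(start, target, max_depth=6):
    """Return (list of people on the path, hop count), or None if no link is found."""
    row = conn.execute(
        QUERY, {"start": start, "target": target, "max_depth": max_depth}
    ).fetchone()
    if row is None:
        return None
    path, depth = row
    return path.strip(">").split(">"), depth


link = shortest_link("A", "B")
```

Whether that counts as "not that bad" is a matter of taste. It does enumerate every path up to `max_depth` before picking the shortest one, so it can get slow on a dense graph.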
Have you tried Cypher?
In a word: Facebook.

A more technical use case that I liked was a system that can analyse the configuration of resources across an entire network and find a "path" from a normal user account to a full admin privilege.

Something like: "Helpdesk user A can reset the password of a service account that can write to a file share that contains a script that is run on logon by every user including the full admin, allowing user A to trigger an action in the context of an admin B, making them equivalent to an admin."

You map out "things" on the network like file shares, security groups, accounts, etc... with links between them, and then ask for the shortest path from A to B.
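As a rough sketch of that last step, here is a plain breadth-first search in Python over a hand-built graph. The node names are hypothetical and simply mirror the helpdesk example above; an edge means "can influence". A real tool would build this graph from the actual network configuration.

```python
from collections import deque

# Hypothetical edges modelled on the example above: "X can influence Y".
edges = {
    "helpdesk_user_A": ["service_account"],
    "service_account": ["file_share"],
    "file_share": ["logon_script"],
    "logon_script": ["admin_B", "regular_user"],
    "regular_user": [],
    "admin_B": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


escalation = shortest_path(edges, "helpdesk_user_A", "admin_B")
```

Here `escalation` lists each hop from the helpdesk account to the admin, which is the "path" you would hand to whoever has to fix the permissions.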

> In a word: Facebook.

Which, funnily enough, uses a relational database.

I'm betting Facebook uses a lot of different types of databases.
No doubt, but the core product known for being graph-y is based on MySQL.

Indeed, there is a graph data store (TAO) built on top of that base, but as we're talking about databases...

Many graph databases are relational "under the hood". The graph part is often just a specialised index.
Just as Facebook uses MyRocks (a storage engine built on RocksDB, a KV store) underneath MySQL. It's definitely turtles all the way down.

But where do you draw the line? Is your Ruby on Rails CRUD app that exchanges JSON documents a document database? Fundamentally, what's the difference between said Rails app and TAO, aside from one being centred around documents and the other graphs?

Surely "base" is meant to be more specific?

I'll give you an example of a graph database use case.

The police have a ton of data lying around, and the consensus in the industry is that the 80/20 rule applies to criminals as well, i.e. 20% of the population takes up 80% of the police resources. You could probably also posit that 20% of that 20% are "peak criminals."

Anyway, they would like to track interactions of "things."

Say a car is involved in an incident. They normally track the make, model, plate, and color of the car - on paper. There's a lot of other info that they track: who owns that car? Who's in the car? Where is the car? Where does the owner live? Where do the occupants live? What other incidents has that car been involved in? Given the addresses of the people involved, who else is known to be around them?

All this relationship information can give someone a better understanding of the relationship between criminal elements in an area. If a car is being used in lots of crimes, it's easier to find out using a DB than some cop going "I recognize that car." If lots of people are being picked up and all live in a 2 block area, it'll be easier to see that if it's in a DB than a cop recognizing that fact from multiple incident reports.
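A toy Python sketch of the idea, with made-up incident reports, plates, and names (this is not how the actual product works): treat incidents and the "things" in them as nodes, then ask which things keep showing up and which things show up together.

```python
from collections import Counter, defaultdict

# Hypothetical incident reports: incident id -> entities involved.
incidents = {
    "inc1": {"car:ABC123", "person:smith"},
    "inc2": {"car:ABC123", "person:jones"},
    "inc3": {"car:XYZ789", "person:jones"},
    "inc4": {"car:ABC123", "person:smith"},
}

# Invert the reports: entity -> incidents it appears in.
appears_in = defaultdict(set)
for incident_id, entities in incidents.items():
    for entity in entities:
        appears_in[entity].add(incident_id)


def repeat_entities(min_incidents):
    """Entities involved in at least `min_incidents` incidents, with their counts."""
    return {e: len(i) for e, i in appears_in.items() if len(i) >= min_incidents}


def linked_to(entity):
    """Count how often other entities share an incident with `entity`."""
    counts = Counter()
    for incident_id in appears_in.get(entity, ()):
        for other in incidents[incident_id]:
            if other != entity:
                counts[other] += 1
    return counts


frequent = repeat_entities(3)
associates = linked_to("car:ABC123")
```

`frequent` picks out the car that keeps turning up, and `associates` shows who was around it, without anyone having to remember a plate.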

I actually tried doing this in SQL, and it's super slow because you have to iterate over your tables over and over. With the graph database this becomes, well, substantially easier if you model it correctly.

This product, BTW, is known as CopLink by IBM.

As an aside, fusion centers have this problem too but worse, because they're supposed to coordinate information between different police departments in a region...all of whom don't particularly give a shit.

I’ve blogged about a couple of examples:

Understanding the spider web that is AWS IAM permissions: https://eng.lyft.com/iam-whatever-you-say-iam-febce59d1e3b

Calculating whether a vuln was introduced from a parent image or from the service itself in a microservice arch: https://eng.lyft.com/vulnerability-management-at-lyft-enforc...

Fraud detection. Detecting and analyzing anomalous flows of financial transactions requires you to look at multi-hop series of transactions.
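One multi-hop pattern, as an illustration only: money that leaves an account and comes back to it through a chain of other accounts. The sketch below is plain Python with made-up account names; `round_trips` is a hypothetical helper doing a depth-limited search, not a real fraud model.

```python
from collections import defaultdict

# Hypothetical transfers: (sender, receiver).
transfers = [
    ("acct1", "acct2"),
    ("acct2", "acct3"),
    ("acct3", "acct1"),
    ("acct3", "acct4"),
]

outgoing = defaultdict(list)
for sender, receiver in transfers:
    outgoing[sender].append(receiver)


def round_trips(origin, max_hops):
    """Return every chain of transfers that leaves `origin` and returns within `max_hops`."""
    found = []
    if max_hops < 1:
        return found

    def walk(path):
        # len(path) equals the hop count after taking one more transfer.
        for nxt in outgoing.get(path[-1], []):
            if nxt == origin:
                found.append(path + [nxt])
            elif nxt not in path and len(path) < max_hops:
                walk(path + [nxt])

    walk([origin])
    return found


loops = round_trips("acct1", 3)
```

Looking at any single transfer here tells you nothing; the loop only shows up when you follow three hops in a row.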
They are a joy to use for ontologies or certain types of metadata.

At my last job we had a bunch of entity categories, and each of those had a huge number of individual entity types. When an entity was picked up through the data pipeline we'd query the graph db and convert that entity into whatever the "base" entity is for the category.

It also allowed us to easily query for strange connections or one off transformations that our customers frequently had without worrying about having a more rigorous and structured RDBMS schema for relatively uncommon queries.

Finally it made using algorithms like PageRank in our data science pipeline an absolute breeze.
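For reference, PageRank itself is small enough to sketch in plain Python. This is a generic power-iteration version with the usual 0.85 damping factor, not the code we ran and not any graph database's built-in procedure; the `links` graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of outbound targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly across its outbound links.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over everyone.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(links)
```

The appeal of having this inside the graph db is that you skip the step of exporting nodes and edges into a structure like `links` in the first place.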

I loved it, but we never used it as our primary database (Postgres & Athena in this case).