by LukeEF 1117 days ago
I am not sure of the exact statistic, but something like 95% of all production databases are less than 10GB. There seems to be a 'FAANG hacker' fascination with 'extreme-scale' which probably comes from seeing the challenges faced by the handful of organizations working at that level. Most of the time, what graph database users want (as in, why they are there at all) is a DB that lets them flexibly model their data and run complex queries. They probably also want some sort of interoperability. If you can do that well for 10GB, that is holy grail enough. We certainly found that while developing the graph database TerminusDB [1]: most users have smaller production DBs, make only light use of bells-and-whistles features, and really want things like easy schema evolution.

[1] https://github.com/terminusdb/terminusdb


This research paper is talking about performance whilst you're talking about scalability.

Those are related but are distinct from each other.

And sure, about 95% of companies would have their needs met with a simpler system, but that does leave a lot of companies who will not. And those of us in, say, finance doing customer/fraud analytics would welcome all the performance we can get.

> This research paper is talking about performance whilst you're talking about scalability. Those are related but are distinct from each other.

The paper has "Scale to Hundreds of Thousands of Cores" in the title. I have not yet read the paper but it seems unlikely it doesn't talk about scalability.

I was referring to scalability in the sense of the size of the data being stored.

You can have slow queries with 10GB of data just like you can have fast queries with 10PB of data.

If your data is small enough to easily fit in RAM, you kind of can't have that slow a query on it (or at least you are no longer talking about a database problem).
If you end up having to scan the 10 GB graph many times per query without acceleration structures helping you (like indices), it will be slow. I'd say it's still a DB problem.
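To make the point about acceleration structures concrete, here is a minimal sketch in Python. The toy edge list and the function names are hypothetical and not taken from the paper or from any particular database; the sketch only contrasts a full scan per lookup with a one-time adjacency index.

```python
from collections import defaultdict

# Hypothetical toy graph: a list of (source, target) edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]


def neighbors_by_scan(edges, node):
    """No index: every lookup walks the whole edge list, O(E) per call."""
    return [dst for src, dst in edges if src == node]


def build_adjacency_index(edges):
    """Build the index once in O(E); later lookups cost O(degree)."""
    index = defaultdict(list)
    for src, dst in edges:
        index[src].append(dst)
    return index


def neighbors_by_index(index, node):
    """Indexed lookup; .get avoids creating entries for unknown nodes."""
    return index.get(node, [])


adjacency = build_adjacency_index(edges)
```

A traversal that asks for neighbors many times pays the O(E) scan on every call in the first version, which is how even a graph that fits in RAM can end up slow.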
I'm guessing that, when the paper's author mentioned "hundreds of thousands of cores", they didn't have 10GB of data in mind. That works out to less than a typical L1 cache's worth of data per core.
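The per-core arithmetic can be checked with a short sketch. The 32 KiB L1 data cache size is an assumption (a common figure, but it varies by CPU), and the core counts are illustrative values for "hundreds of thousands".

```python
DATA_BYTES = 10 * 10**9   # 10 GB, decimal
L1D_BYTES = 32 * 1024     # assumption: 32 KiB L1 data cache per core


def bytes_per_core(total_bytes, cores):
    """Even split of the data set across all cores."""
    return total_bytes / cores


# Core count above which each core's share drops below the assumed L1 size.
break_even_cores = DATA_BYTES / L1D_BYTES

for cores in (100_000, 200_000, 500_000):
    share = bytes_per_core(DATA_BYTES, cores)
    print(f"{cores:>7} cores: {share / 1024:.1f} KiB per core, "
          f"fits in assumed L1: {share <= L1D_BYTES}")
```

Under these assumptions the share is about 100 KB per core at 100,000 cores and 20 KB at 500,000, so the comparison with L1 holds once the core count is well into the hundreds of thousands.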
> I have not yet read the paper

This is really common across article-comment platforms; is anyone interested in discussing how to incentivise comment sections that have read the paper?

This isn't a graph database like neo4j. This is a graph database like I hoped neo4j would be. It's not about having an easier time working with schemas. It's about analyzing graphs that are too big to fit in RAM. Transaction analysis for banks, traffic analysis of roads, failure resilience of utility networks, etc.

In these kinds of workloads you quickly run into performance bottlenecks. Even in-memory analyses need care to avoid complete pointer-chasing slowdowns.

I do still hope this is fast on something like a single-CPU, 32-core, 64GB system with an SSD. But if this takes a cluster to be useful, then I will still love it.

But the 5% of places where that kind of scale is needed are the ones paying the top 1% salary band, so this is the content distributed systems engineers like to read about and work on.
>There seems to be a 'FAANG hacker' fascination

Yeah, but the hacker fascination is what drives progress. You could have made the same type of argument about ML, and we would have been content with MNIST.

I think I kind of agree with this.

One of the simpler supported backends for our Modality product (https://auxon.io/products/modality) is built using SQLite. Its data model is a special case of a DAG for modeling big piles of causally correlated events from piles and piles of distributed components in “system of systems” use cases. The scaling limiter is almost always how efficiently the traces & telemetry can be exfiltrated from the systems under test/observation, long before how fast the ingest path can actually record things becomes a problem.

That said, I do love me some RDMA action. 10 years ago I was fiddling with getting Erlang clustering working via RDMA on a little 5 node Infiniband cluster. To mixed results.

Interesting that you mention the value 10 GB, as it is the size of a DynamoDB partition or an AWS Aurora cell...
I agree with your sentiment but I suppose you're considering the wrong statistics. Instead you should consider:

- how many jobs have interviews that necessitate knowing how to handle extreme scale

- proportion of jobs (not companies) requiring extreme scale - the fact that non extreme scales are the long tail doesn't mean it's a fat tail

- proportion of buyers/potential users that walk away from the inability to handle extreme scale

... and more sarcastically

- proportion of articles about extreme scale

- proportion of repos about extreme scale

Only one anecdote, but I found out a while after starting at my current job that directly questioning the extent to which scale-out was actually needed to solve a problem during a technical interview question is the thing that made me stand out from the rest of the crowd, and landed me the job. Being able to constructively challenge assumptions is an incredibly valuable job skill, and good managers know that.
Counter-anecdote: directly questioning "scale-out fantasies" has contributed to my early departure from a handful of jobs and contracts. One place was obsessed with getting everything into AWS auto-scaling groups when the problem was actually that they were running on MySQL with a godawful schema, dumbass session management, and horrific queries that we weren't allowed to fix because they were "migrating to node microservices anyway" (pretty sure that still hasn't happened years later.)

> Being able to constructively challenge assumptions is an incredibly valuable job skill

I would agree but ...

> good managers

... are few and far between.

The best people challenge bad assumptions and the worst bosses get mad.

Had one boss get mad that I reduced the database footprint by 94% - why? Because he wrote the initial implementation and refused to believe that his baby, which cost so much space because of how awesome it was, could fit into 5GB.

But challenging the status quo has gotten me to where I am, so I won't stop anytime soon :)

I get that angle but I also see orgs capturing too much data. What's the use case for it? "Not sure, but if we ever do need it we'll have it" is the typical answer.
Really? I don't quite believe that. We're a tiny company with maybe 70 customers and our DB is roughly 11 TB.
Assuming 1 KB per "record", that's roughly 150 million records per customer.

Definitely a data-heavy product, whatever it is that you're offering.

(Unless you keep large blobs in the DB. But database scale has more to do with records than raw storage.)
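The back-of-the-envelope estimate above can be reproduced as follows. The figures (11 TB, 70 customers, 1 KB per record) come from the thread; treating TB and KB as decimal units and assuming the data is spread evenly across customers are assumptions made for this sketch.

```python
TOTAL_BYTES = 11 * 10**12   # 11 TB, decimal units assumed
CUSTOMERS = 70
RECORD_BYTES = 1_000        # 1 KB per "record", decimal units assumed

# Assumes an even split across customers, which real data rarely follows.
bytes_per_customer = TOTAL_BYTES // CUSTOMERS
records_per_customer = bytes_per_customer // RECORD_BYTES

print(f"{bytes_per_customer / 10**9:.0f} GB per customer")
print(f"{records_per_customer / 10**6:.0f} million records per customer")
```

This lands at about 157 GB and 157 million records per customer, in line with the rounded figure quoted above.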

That seems a lot. What type of data?
Congratulations, you're in the 5%, along with us :)
Just have a look at the size of all of English Wikipedia. Or all of StackOverflow.

And these are seemingly huge services.

And yet…

What are the databases with easy schema evolution?