Hacker News new | ask | show | jobs
by jiggawatts 4 days ago
I recently encountered a query that deadlocked on itself because it used a parallel execution plan and updated multiple indexes in a manner that the different threads could conflict with each other.

Naively one would expect that no individual UPDATE or INSERT statement could deadlock in isolation… but there you go.

If that is possible, the possibilities across multiple concurrent data mutating queries are beyond human comprehension!

Serializable should absolutely be the default!

Similarly, all columns should be automatically indexed to at least some degree, like Postgres BRIN indexes at a minimum.

Time and experience have shown that the vast majority of developers are pathologically unable to properly define all required indexes ahead of time.

4 comments

I've definitely experienced the opposite, where indexes run rampant and are NN times the size of data (for the entire database!)

That being said also "required indexes ahead of time" is impossible - because query patterns change and things get released unless you mean something like "the required indexes for our obvious query patterns we just freaking released.

I dont think most columns need to be indexed because that is mostly nonsense, most columns never have a where clause applied to them or are sorted on themselves, so an index provides ??? value.

You are supposed to define indexes based on how you query the data, not ahead of time.
"Supposed to" is doing a lot of heavy lifting here.

As a consultant I come across a lot of CotS software, in-house or otherwise bespoke software, etc. Roughly 40% of the former has 100% of the minimum required number of indexes and approximately 5% of the latter. By "minimum", I mean the indexes required to avoid full scans of tables that will become large enough for this to be a problem in production.

"Disciple doesn't scale." is one of my favourite sayings now, for a reason!

1) Developers almost always work with toy data, and are hence insulated from poor indexing decisions. Problems turn up 'x' years from now. It is well established that humans learn poorly when consequences are delayed... by mere hours, let alone years!

2) DBAs and developers often have an adversarial relationship. A common consequence of this workplace dynamic is that developers aren't granted the required access to tune indexes, especially in production, which is where the issues manifest.

3) I've heard anecdotes, including here, along the lines of "XYZ cloud native / webscale database is so much faster than ABC traditional RDBMS!". Very often the difference is just that XYZ auto-indexes by default. CosmosDB, Google Firestore, Kusto, Elastic, Druid, and many columnar formats are in this category of "magically" faster!

I'm now 99% convinced that RDBMS needs to be reinvented for the modern fast-paced, vibe-coded, "I'm a fullstack(lol) dev" world where people simply don't have the bandwidth to pay attention to minutiae like on-disk sort order and filtered secondary indexes. A better fit for today's world would be a system that is: columnar by default like SAP HANA, compressed[1] by default, indexed by default (thanks to being columnar!), serializable by default, and "include batteries" like native queue capabilities so that nobody has to figure out cross-RDBMS complications like distributed transactions, outbox patterns, or deal with the consequences of a DBA rolling back one of two databases to a backup.

Semi related, friend works here https://www.nextdata.com/product (looks like they rebranded for AI heavily...)

AI buzz aside, unified multimodal data platform is an interesting idea.

Amazing how many technologies get adopted not because they're the best, but because they have sane defaults, verbose error messages, and an integration tutorial.
Something I neglected to mention is that you can make a valid argument for "indexing all columns by default is bad", but you'd have a much harder time explaining why no typical RDBMS engine indexes columns participating in foreign keys! There are very few scenarios where you won't need an index on a key, primary or foreign!
> compressed[1]

You missed adding the reference to [1], I'm curious now.

I was going to elaborate and say that even though typical columnar databases are already compressed with some variant of dictionary lookup compression, I'd like to see a database engine where large objects (bulk text or binary data) is stored efficiently by default. If I were to wave my hands about, I'd say something like a Merkle or Prolly tree of large ~256KB chunks stored in deduplicated external blob storage, where the individual chunks are compressed with a modern throughput-optimised algorithm.
> I recently encountered a query that deadlocked on itself because it used a parallel execution plan and updated multiple indexes in a manner that the different threads could conflict with each other.

MS SQL?

We've had to set MAXDOP 1 on some specific queries for a long time.

Worth noting that Oracle has an auto-indexing feature that builds and drops indexes based on observed query patterns. DBMS_AUTO_INDEX.

Disclosure: I work part time in the DB group.