by theptip 2024 days ago
This is a bit dumbed down, and ignores the domain terminology required to properly discuss the trade-offs here (which is puzzling given that it links to a post by Aphyr, where you can find incredibly thorough discussions around isolation levels and anomalies).

> The fundamental problem with using Kafka as your primary data store is it provides no isolation.

This is false. I can only assume the author doesn't know about the Kafka transactions feature?

To be specific, Kafka's transaction machinery offers read-committed isolation, and you get read-uncommitted by default if you don't opt in to that transaction machinery (the docs: https://kafka.apache.org/0110/javadoc/index.html?org/apache/...). Depending on your workload, read-committed might be sufficient for correctness, in which case you can absolutely use Kafka as your database.
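
To make the difference between the two read modes concrete, here is a toy model in Python. It is not Kafka's implementation and uses no Kafka API: the `Entry` type, the marker strings, and both reader functions are made up for illustration. The only behaviours it tries to mirror are that a read-committed reader skips records from aborted transactions and stops at the first record of a transaction that is still open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    txn: Optional[str]            # None means a non-transactional record
    value: Optional[str]          # payload, or None for a control marker
    marker: Optional[str] = None  # "commit" or "abort" for control markers


def read_uncommitted(log: List[Entry]) -> List[str]:
    # Every data record is visible, including open and aborted transactions.
    return [e.value for e in log if e.marker is None]


def read_committed(log: List[Entry]) -> List[str]:
    # Outcome of each finished transaction, taken from its control marker.
    outcome = {e.txn: e.marker for e in log if e.marker is not None}
    visible = []
    for e in log:
        if e.marker is not None:
            continue  # control markers are never returned to the reader
        if e.txn is not None and e.txn not in outcome:
            break     # first record of a still-open transaction: stop here
        if e.txn is None or outcome[e.txn] == "commit":
            visible.append(e.value)
    return visible


log = [
    Entry("t1", "a"),
    Entry("t2", "b"),
    Entry("t1", None, "commit"),
    Entry("t2", None, "abort"),
    Entry(None, "c"),
    Entry("t3", "d"),  # t3 never commits or aborts in this log
    Entry(None, "e"),
]

print(read_uncommitted(log))  # ['a', 'b', 'c', 'd', 'e']
print(read_committed(log))    # ['a', 'c']
```

Note that the read-committed reader does not see "e" even though it is non-transactional, because it sits behind the open transaction t3.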

Of course, proving that your application is sound with just read-committed isolation can be challenging, not to mention testing that your application continues to be sound as new features are added.

Because of that, in general I think that the underlying point of this article is probably correct, in that you probably shouldn't use Kafka as your database -- but for certain applications / use-cases it's a completely valid system design choice.

More generally, this is an area that many applications get wrong: they use the wrong isolation level because most frameworks encourage incorrect implementations through their unsafe defaults; e.g. see the classic "Feral concurrency control" paper http://www.bailis.org/papers/feral-sigmod2015.pdf. So I think the general message of "don't use Kafka as your DB unless you know enough about consistency to convince yourself that read-committed isolation is and will always be sufficient for your use case" would be more appropriate (though it's certainly a less snappy title).

1 comments

"Read-committed isolation" is not a meaningful implementation of transactions. If you can't do read, then a write, while guaranteeing the database didn't change in between, then you don't really have transactions.
Depends on your use-case; if it's meaningless, why is it implemented in all the leading SQL DBs? It's the default in Postgres...

https://www.postgresql.org/docs/9.5/transaction-iso.html

If you're arguing that in practice this isn't enough isolation, then sure, that's what I said in my post; most applications need more than the default isolation levels. I feel like you're making an absolutist point (just like the original article) where my point was that the domain is actually more nuanced, and absolutes just obscure the technical complexity.

This sounds like "serializable" which is (in my experience) rarely useful for a meaningful system.
If you read the "Feral concurrency control" paper I linked above, particularly section 7 on conclusions, they make the case that serializable is the only isolation level that's actually safe with naive coding styles on frameworks like Rails and Django which do application-level validation. If you do validation in your application and don't use serializable isolation, then you have to be careful about manually locking, OR just be sure that your usage pattern isn't vulnerable to the anomalies that you're introducing by using a weaker isolation level.
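
A rough sketch of the application-level validation race described above, in plain Python. Nothing here is Rails or Django code: `Table`, `signup_steps`, and the interleaving helpers are hypothetical names, and the generator is just a way to pause a request between its "validate" step and its "insert" step so two requests can be interleaved by hand.

```python
class Table:
    def __init__(self):
        self.rows = []

    def exists(self, email):
        return email in self.rows

    def insert(self, email):
        self.rows.append(email)


def signup_steps(table, email):
    # Application-level uniqueness validation: check first, insert later.
    if table.exists(email):
        yield "rejected"
        return
    yield "validated"      # the request can be preempted here
    table.insert(email)
    yield "inserted"


def interleaved(email):
    table = Table()
    r1, r2 = signup_steps(table, email), signup_steps(table, email)
    next(r1)               # request 1 validates
    next(r2)               # request 2 validates before request 1 inserts
    list(r1)               # request 1 inserts
    list(r2)               # request 2 inserts too: duplicate row
    return table.rows


def serial(email):
    table = Table()
    list(signup_steps(table, email))
    second = list(signup_steps(table, email))
    return table.rows, second


print(interleaved("a@example.com"))  # ['a@example.com', 'a@example.com']
print(serial("a@example.com"))       # (['a@example.com'], ['rejected'])
```

Run serially, the second request is rejected; interleaved, both pass validation and the "unique" value is stored twice. A database-level unique constraint, explicit locking, or serializable isolation are the usual ways to close that gap.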

If you're building a financial ledger, you absolutely must use serializable isolation. If you're building a Twitter clone, sure, use something weaker that will gain you some performance.

I'd make the case that we should be recommending the use of serializable by default unless you have a reason why you think it's OK to use something weaker, rather than having the default be better-performing-but-unsafe. The sort of concurrency validation errors that you get if you needed Serializable and used Read-Committed instead are really, really hard to reproduce, debug, and diagnose.

Strictly speaking, that's not serializable; what they described is a lost update anomaly, and repeatable read or snapshot isolation is sufficient to avoid it.
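
For anyone who wants the lost update spelled out, here is a minimal Python sketch. The `Store` class and its version counter are assumptions for illustration, not any database's API; the version check stands in for the "abort on concurrent write to the same row" behaviour that snapshot isolation gives you.

```python
class Store:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, value):
        self.value = value
        self.version += 1

    def write_if_unchanged(self, value, seen_version):
        # Refuse the write if someone else committed since we read.
        if self.version != seen_version:
            return False
        self.write(value)
        return True


def lost_update():
    s = Store(100)
    a, _ = s.read()   # txn A reads 100
    b, _ = s.read()   # txn B reads 100
    s.write(a + 10)   # A writes 110
    s.write(b + 20)   # B writes 120, silently discarding A's +10
    return s.value


def with_conflict_detection():
    s = Store(100)
    a, va = s.read()
    b, vb = s.read()
    s.write_if_unchanged(a + 10, va)          # A commits 110
    if not s.write_if_unchanged(b + 20, vb):  # B is rejected: row changed
        b, vb = s.read()                      # B retries from fresh state
        s.write_if_unchanged(b + 20, vb)
    return s.value


print(lost_update())              # 120
print(with_conflict_detection())  # 130
```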

Serializable (or serializable snapshot isolation) is stronger: it doesn't allow anomalies such as write skew. It is also a lot more expensive, since you need to keep track of changes in rows matched by predicates to avoid phantom reads (as compared to just keeping track of the rows returned by the query with predicates in this particular transaction).
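
Write skew is easier to see with the classic "at least one doctor on call" example. The sketch below is a toy in Python with invented names (`run`, `go_off_call`, the per-row version counters). With `check_reads=False` it validates only the rows a transaction wrote, roughly like snapshot isolation; with `check_reads=True` it also validates everything the transaction read, which is a crude stand-in for serializable behaviour and not how any particular database implements it.

```python
def run(check_reads):
    db = {"alice": True, "bob": True}   # True means "on call"
    versions = {"alice": 0, "bob": 0}

    def begin():
        # Each transaction works from its own snapshot of the data.
        return {"snapshot": dict(db), "seen": dict(versions),
                "reads": set(), "writes": {}}

    def go_off_call(txn, doctor):
        # Predicate read over all rows: who is currently on call?
        on_call = [d for d, v in txn["snapshot"].items() if v]
        txn["reads"].update(txn["snapshot"])
        if len(on_call) >= 2:           # invariant check in the application
            txn["writes"][doctor] = False

    def commit(txn):
        keys = set(txn["writes"])
        if check_reads:
            keys |= txn["reads"]
        # Abort if any validated row changed since the snapshot was taken.
        if any(versions[k] != txn["seen"][k] for k in keys):
            return False
        for k, v in txn["writes"].items():
            db[k] = v
            versions[k] += 1
        return True

    t1, t2 = begin(), begin()
    go_off_call(t1, "alice")
    go_off_call(t2, "bob")
    return db, (commit(t1), commit(t2))


print(run(check_reads=False))  # ({'alice': False, 'bob': False}, (True, True))
print(run(check_reads=True))   # ({'alice': False, 'bob': True}, (True, False))
```

In the first run the two transactions write disjoint rows, so neither sees a write-write conflict and both commit, leaving nobody on call. In the second run the later transaction is aborted because a row it read has changed.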

Also worth noting that some DBs such as Oracle actually lie about implementing serializable (in Oracle's case they only offer snapshot isolation), so it's worth keeping that in mind and using locks if necessary.