
by saurik 5084 days ago

In your last paragraph, I feel like you are mischaracterizing my overall thesis. I am not claiming the design is simple: MVCC took many lives in sacrifice to its specification and discovery, and I certainly am not claiming "anyone could have thought that up". Instead, my primary issue is that this is a talk about databases and database design that is providing motivation vs a strawman: specifically, the way Rich seems to believe "traditional databases" work, and for which we spend the first almost 20 minutes learning the negatives, roadblocks, and general downsides.

However, almost none of the things that he indicates actually are downsides of most modern database systems, and certainly not of PostgreSQL. His downsides include that the data structuring is simplistic, that you can't have efficient and atomic replication of it (not multi-master mind you, but seemingly even doing real-time replication of a master to a read-only slave while maintaining serialization semantics seems to be dismissed), and that if you attempt to make multiple queries you will get inconsistent data due to update-in-place storage.

Yes: update-in-place "storage", not "update-in-place semantics within the scope of an individual transaction". Even if he was very clear about the latter (which is again quite different from "update-in-place semantics", which MVCC definitely does not have), that would still undermine his points, as the problem of inconsistent data from multiple reads, a problem he goes into great detail about with an example involving a request for a webpage that needs to make a query first for its backend data and then for its display information, does not exist with MVCC.

During this discussion of storage, he specifically talks about how existing database storage systems work, not at the model level, but at the disk level, discussing how b-trees and indexes are implemented with their destructive semantics... and all of these details are wrong, at least for PostgreSQL and Oracle, and I believe even for MySQL InnoDB (although a lot of its MVCC semantics are in-memory-only AFAIK, so I'm happily willing to believe that it actually destroys b-tree nodes on disk).

The talk then discusses a new way of storing data, and that new way of storing data happens to share the key property he calls new with the old way of storing data. The result is that it is very difficult to see why I should be listening to this talk, as the speaker either doesn't know much about existing database design or is purposely lying to me to make this new technology sound more interesting :(. Your response that in a different talk he attempted to backpatch his argument with something that still doesn't seem to address MVCC's detectably-not-the-same-as-update-in-place-semantics doesn't help this.

Now, as I stated up front, after listening to half of this talk, I couldn't take it anymore, and I gave up: I thereby didn't hear an entire half hour of him speaking. Maybe somewhere in that second half there is something new about how some particular quirk of his model allows you to get a distributed system, but that seemed sufficiently unlikely after the first half that it really doesn't seem worth it, and based on the comments from the discussion (such as in the threads started by bsaul and sriram_malhar, which seem to indicate that writes are centralized and reads are distributed, something you can do with any off-the-shelf SQL solution these days) that seems to hold up.

2 comments

richhickey 5084 days ago

The model of consistency envisioned by Datomic is one in which consistency normally available only within a transaction is available outside of any transactions, and without any central authority. Consistent views can be reconstituted the next hour, day or week. Consistent points in time can be efficiently communicated to other processes. Nothing about MVCC gives you any of that. MVCC is an implementation detail that reduces coordination overhead in transactional systems. I used MVCC in the implementation of Clojure's STM. While you might imagine it being simple to flip a bit on an MVCC system and get point-in-time support, it is a) not efficient to do so, and b) still a coordinated transactional system.

The differences I am pointing out, and the notion of place I discuss, are not about the implementation details in the small (e.g. whether or not a db is MVCC or updates its btree nodes in place) but the model in the large. If you 'update' someone's email is the old email gone? Must you be inside a transaction to see something consistent? Is the system oriented around preserving information (the facts of events that have happened), or is the system oriented around maintaining a single logical value of a model?

The fact is with PostgreSQL et al, if you 'update' someone's email the old one is gone, and you can only get consistency within a transaction. It is a system oriented around maintaining a single logical value of a model. And there's nothing wrong with that - it's a great system with a lot of utility. But it isn't otherwise just because you say it could be.
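To make the distinction between the two models concrete, here is a minimal Python sketch of an information-preserving store in which an 'update' adds a new fact rather than replacing the old one. It is a toy illustration only: `Fact`, `FactLog`, `assert_fact`, and `as_of` are hypothetical names chosen for this example and are not Datomic's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    value: str
    t: int  # logical time at which the fact was asserted


class FactLog:
    """Append-only log of facts; nothing is ever overwritten (toy model)."""

    def __init__(self):
        self._facts = []
        self._t = 0

    def assert_fact(self, entity, attribute, value):
        # An "update" is just a newer fact about the same entity/attribute.
        self._t += 1
        self._facts.append(Fact(entity, attribute, value, self._t))
        return self._t

    def as_of(self, t):
        """Rebuild the view {(entity, attribute): value} as of time t."""
        view = {}
        for f in self._facts:  # facts are appended in increasing t
            if f.t <= t:
                view[(f.entity, f.attribute)] = f.value
        return view


log = FactLog()
t1 = log.assert_fact("alice", "email", "alice@old.example")
t2 = log.assert_fact("alice", "email", "alice@new.example")

old_view = log.as_of(t1)  # the old email is still recoverable
new_view = log.as_of(t2)
```

Because a point in time is just the number `t`, it can be handed to another process, which can rebuild the same view later without holding a transaction open.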

Also, you seem to be reacting as if I (or someone) has claimed that Datomic is revolutionary. I have never made such claims. Nothing is truly novel, everything has been tried before, and we all stand on the shoulders of giants.

I'm sorry my talk didn't convey to you my principal points, and am happy to clarify.


saurik 5084 days ago

First of all, thank you very much for the reply: you really didn't need to bother, as despite being a Clojure user who stores a lot of data, I'm probably simply not in your target market segment ;P.

For the record, I do not believe that you have explicitly stated this is revolutionary, although I believe various other people on HN in various threads on Datomic have. However, my specific reactions in the comment you are responding to are due to DanWaterworth's insistence that I believe that it is trivial: my original comment does not touch on this angle, and is entirely about "real databases aren't implemented like this".

That said, I do believe that if after 30 minutes of listening to a talk that doesn't mention "this is largely how existing systems are implemented, but we provide the ability to see all the rows at once", there is an implication "this isn't at all like anything you've ever seen or implemented before", which is why after DanWaterworth's comment, I started exploring that angle.

Yes: in the case of PostgreSQL's MVCC, the old e-mail is gone from the perspective of the model for other people not inside of a transaction viewing the contents, however the kinds of problems you were describing at the beginning of the talk did not need to avoid transactions.

However, the implementation is so close that if I were explaining this concept to someone else, I'd probably use it as a model, especially given that it even already reifies the special columns required to let you do the historical lookups (xmin and xmax).

As I mentioned in another comment on this thread (albeit in an edit a few minutes later), you can get historical lookup in PostgreSQL by just adding a transaction variable that turns off the mechanism that filters obsolete tuples: you can then use the already-existing transaction identifier mechanism and the already-existing xmin and xmax columns as the ordering.
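The idea in the paragraph above can be sketched with a toy version store in Python. This is a simplified model under stated assumptions, not PostgreSQL's actual implementation: it ignores aborted and in-progress transactions and vacuuming, and `ToyTable`, `scan`, `as_of`, and the `include_dead` flag are hypothetical names standing in for the "transaction variable" mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    key: str
    value: str
    xmin: int                   # id of the transaction that created this version
    xmax: Optional[int] = None  # id of the transaction that superseded it, if any


class ToyTable:
    def __init__(self):
        self.versions = []
        self.next_xid = 1

    def write(self, key, value):
        """Insert or update: old versions are marked obsolete, never removed."""
        xid = self.next_xid
        self.next_xid += 1
        for v in self.versions:
            if v.key == key and v.xmax is None:
                v.xmax = xid
        self.versions.append(TupleVersion(key, value, xmin=xid))
        return xid

    def scan(self, key, include_dead=False):
        """Normal scans filter obsolete tuples; include_dead turns that off."""
        rows = [v for v in self.versions if v.key == key]
        if not include_dead:
            rows = [v for v in rows if v.xmax is None]
        return sorted(rows, key=lambda v: v.xmin)

    def as_of(self, key, xid):
        """Value visible to a reader whose snapshot is transaction id xid."""
        for v in self.versions:
            if v.key == key and v.xmin <= xid and (v.xmax is None or v.xmax > xid):
                return v.value
        return None


table = ToyTable()
x1 = table.write("alice", "alice@old.example")
x2 = table.write("alice", "alice@new.example")

current = [v.value for v in table.scan("alice")]
history = [v.value for v in table.scan("alice", include_dead=True)]
```

Here `xmin` and `xmax` do double duty: they decide visibility for ordinary reads, and they give the ordering for the historical lookup.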

The result is then that I'm watching the talk wondering where the motivation is: many of the listed motivations weren't really true faults of the existing systems, and the ones that remain seem like implementation details of the database technology.

In the latter situation, when I say it "could be" I really do mean "it is": PostgreSQL can take advantage of the fact that it is built out of MVCC when it builds other parts of itself, such as its streaming master/slave replication (which is another feature of many existing systems that you seemed to discount in your motivation section).

I am thereby simply not certain what the problem is that Datomic is trying to solve for me, whether it be revolutionary or evolutionary (again: I don't really care; I'm just commenting on the motivation section), as the listed motivations seem to be fighting against a strawman design for a database solution that doesn't have transactions to get you 90% there and isn't itself implemented and taking advantage of append-only storage.


nickik 5084 days ago

Well, all you point out is that one aspect of Datomic could be implemented with some SQL systems. Datomic, however, has many other aspects that are interesting.

Other than that, the true genius is to have recognized that a system like that would be worthwhile. Just pointing out that one could theoretically do that with something else is kind of pointless if nobody has ever done it.


saurik 5084 days ago

I am not saying "Datomic is stupid" or anything so simple; I'm saying I was "disappointed" in this talk because it motivated Datomic against a strawman that mischaracterized the actual problems that people using "traditional databases" have sufficiently that it was no longer possible to determine what was actually being claimed as an advantage.

I realize that to many people it is impossible to dislike a presentation of something without disliking the thing being presented, the person making the presentation, and the entire ideology behind the presentation, but that is a horrible thing to assume and is unlikely to ever be the case to such a simple extreme.

I will even go so far as to say that watching this talk seems to be doing a disservice to many people on the road to doing them a legitimate service: some of the people commenting on this thread (or previous ones on HN about similar talks and articles about Datomic) actually do/did not realize that "traditional databases" can even do this at a transaction level, as the argument in the talk downright claims they can't.

The result is that when I bring up that you actually get even some of these advantages with off-the-shelf copies of PostgreSQL, I get comments of the form "I had no idea one could get a consistent read view across multiple queries within a transaction using most sql databases. That does poke a hole in a major benefit that I thought was unique to datomic, great to know!"; that can only happen when there is some serious misinformation (accidentally) being presented.

Now, does that mean that Datomic is something no one should use, and that it doesn't put things together in a really nice way, and that it doesn't have a single thing in it that is innovative, or that Rich is wasting his time working on it? No: certainly it does not. I did not claim that. I can't even claim that, as I gave up on the talk after the first half so I could spend my time attempting to clarify some of the things said in the first half that were confusing people.


fauigerzigerk 5084 days ago

I agree with most of what you say, but I don't think that MVCC is really what this is about. The qualities you describe are a feature of ACID. MVCC is just a way of implementing ACID so that it requires less locking.

More importantly, I think, there are issues with some data structures that are not well supported by postgres or any other DBMS (relational or otherwise). I do a lot of text analytics work and there are things I need to store about spans of text that I could model in a relational fashion but I don't because it would lead to 99% of my data being foreign keys and row metadata.

There will always be domains where you need highly specialized combinations of data structures and algorithms that are not efficient to model relationally and even less in terms of some of the other datamodels that you find in the NoSQL space.

That said, I found that even in natural language processing, RDBMS do a lot of things surprisingly more efficiently than conventional wisdom would have it. Storing lots of small files for instance, something that file systems are surprisingly bad at.

Sometimes I'm surprised how many people like to complain about premature optimization using languages that are hundreds of times slower than others but then go ahead and use horribly inflexible crap like the BigTable data model just in case they need to scale like Google.

Of course that's off topic because it's not remotely what Hickey proposes.


saurik 5084 days ago

If you implement ACID using the "normal" locking semantics (such as the ones the SQL standard authors who defined the isolation levels used in the language were assuming) you can tell the difference because old values are not preserved.

Instead, we would have had contention: in the case of my example walkthrough, to implement the repeatable read semantics that I requested, the first connection would have taken a share lock on the rows it queried, causing the second connection to block on the update until after the first connection committed.

This means that you would not have been able to have the semantics where the first connection and the second connection were seeing different things at the same time (which, to be again clear, is due to none of the data being destroyed: MVCC is providing the semantics of a snapshot).
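A minimal Python sketch of the snapshot semantics described above, as opposed to the lock-based behavior where the writer would block. It is a toy model that assumes every write commits immediately and in order; `SnapshotStore`, `Snapshot`, and the key names are hypothetical and do not correspond to any PostgreSQL interface.

```python
class SnapshotStore:
    """Keeps every committed version so readers never block writers (toy model)."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_id, value), oldest first
        self._commit_id = 0

    def commit(self, key, value):
        # A write appends a new version instead of destroying the old one.
        self._commit_id += 1
        self._versions.setdefault(key, []).append((self._commit_id, value))

    def begin(self):
        # A reader remembers the newest commit that existed when it started.
        return Snapshot(self, self._commit_id)


class Snapshot:
    def __init__(self, store, commit_id):
        self._store = store
        self._commit_id = commit_id

    def read(self, key):
        # Return the newest version committed at or before this snapshot.
        value = None
        for commit_id, v in self._store._versions.get(key, []):
            if commit_id <= self._commit_id:
                value = v
        return value


store = SnapshotStore()
store.commit("page:data", "v1")
store.commit("page:display", "v1")

first = store.begin()                 # first connection starts its transaction
before = first.read("page:data")
store.commit("page:data", "v2")       # second connection updates without blocking
store.commit("page:display", "v2")
after = first.read("page:display")    # still consistent with the earlier read

second = store.begin()                # a later transaction sees the new state
latest = second.read("page:display")
```

The first connection reads `v1` for both keys even though the update landed between its two queries, while the second connection sees `v2`; with share locks, the update would instead have waited for the first connection to commit.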

(As for your text analytics work, I am curious: are you using gin and trigrams at all? There are a bunch of things I dislike about PostgreSQL's implementation of both, but if you haven't looked at them yet you really should: if your use case fits into them they are amazing, and if not the entire point of PostgreSQL is to let you build your own data types and indexes using the ones it comes with as examples.)


fauigerzigerk 5084 days ago

I don't use gin or trigrams because I don't do much general purpose text search. I do things like named entity recognition, anaphora resolution, collecting statistics about the usage of terms over time, etc.

But you're right, it might be a good idea to look into the postgres extension mechanism. I've never seriously done that.
