

by fauigerzigerk 5084 days ago

I agree with most of what you say, but I don't think that MVCC is really what this is about. The qualities you describe are a feature of ACID. MVCC is just a way of implementing ACID so that it requires less locking.

More importantly, I think, there are issues with some data structures that are not well supported by postgres or any other DBMS (relational or otherwise). I do a lot of text analytics work and there are things I need to store about spans of text that I could model in a relational fashion but I don't because it would lead to 99% of my data being foreign keys and row metadata.

There will always be domains where you need highly specialized combinations of data structures and algorithms that are not efficient to model relationally, and even less so in terms of some of the other data models that you find in the NoSQL space.

That said, I found that even in natural language processing, RDBMSs do a lot of things surprisingly more efficiently than conventional wisdom would have it. Storing lots of small files, for instance, is something that file systems are surprisingly bad at.

Sometimes I'm surprised how many people like to complain about premature optimization while using languages that are hundreds of times slower than others, but then go ahead and use horribly inflexible crap like the BigTable data model just in case they need to scale like Google.

Of course that's off topic because it's not remotely what Hickey proposes.


saurik 5084 days ago

If you implement ACID using the "normal" locking semantics (the ones the SQL standard authors were assuming when they defined the isolation levels), you can tell the difference, because old values are not preserved.

Instead, we would have had contention: in the case of my example walkthrough, to implement the repeatable read semantics that I requested, the first connection would have taken a share lock on the rows it queried, causing the second connection to block on the update until after the first connection committed.

This means that you would not have been able to have the semantics where the first connection and the second connection were seeing different things at the same time (which, to be clear again, is due to none of the data being destroyed: MVCC is providing the semantics of a snapshot).
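To make the snapshot point concrete, here is a toy sketch in Python. It is not how PostgreSQL implements MVCC; `ToyMVCCStore`, `Snapshot`, and the `balance` key are hypothetical names made up for illustration. The only idea it shows is that a write appends a new version rather than destroying the old one, so two readers can see different values at the same time without anyone blocking.

```python
import itertools


class ToyMVCCStore:
    """Toy multi-version store (illustrative only, not PostgreSQL's design).

    Every committed write appends a new version; nothing is overwritten.
    """

    def __init__(self):
        self._versions = {}               # key -> list of (commit_ts, value), oldest first
        self._clock = itertools.count(1)  # monotonically increasing commit timestamps
        self._last_commit = 0

    def commit_write(self, key, value):
        """Append a new version of `key` and return its commit timestamp."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def snapshot(self):
        """Return a read view frozen at the latest commit so far."""
        return Snapshot(self, self._last_commit)

    def read_as_of(self, key, ts):
        """Return the newest value of `key` committed at or before `ts`."""
        visible = None
        for commit_ts, value in self._versions.get(key, []):
            if commit_ts <= ts:
                visible = value
        return visible


class Snapshot:
    """A read-only view that ignores versions committed after it was taken."""

    def __init__(self, store, ts):
        self._store = store
        self._ts = ts

    def read(self, key):
        return self._store.read_as_of(key, self._ts)


store = ToyMVCCStore()
store.commit_write("balance", 100)

first = store.snapshot()              # "first connection" starts reading
store.commit_write("balance", 250)    # "second connection" updates without blocking
second = store.snapshot()

first_view = first.read("balance")    # still sees the old version
second_view = second.read("balance")  # sees the new version
```

With share locks instead, the second write would have had to wait until the first reader committed, and there would be no old version left for the first reader to keep seeing.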

(As for your text analytics work, I am curious: are you using gin and trigrams at all? There are a bunch of things I dislike about PostgreSQL's implementation of both, but if you haven't looked at them yet you really should: if your use case fits into them they are amazing, and if not the entire point of PostgreSQL is to let you build your own data types and indexes using the ones it comes with as examples.)
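For anyone who hasn't seen trigram matching, here is a rough Python approximation of the idea. It is a sketch, not pg_trgm's actual code: the helper names `trigrams` and `similarity` are my own, and the padding and word-splitting rules are simplified assumptions about how the extension behaves.

```python
import re


def trigrams(text):
    """Return the set of 3-character grams for each word in `text`.

    Assumption: words are lowercased alphanumeric runs, each padded with
    two leading spaces and one trailing space before slicing.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

An index over these sets is what lets fuzzy and substring matches avoid scanning every row; the scoring above is just the part that is easy to show in a few lines.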

fauigerzigerk 5084 days ago

I don't use gin or trigrams because I don't do much general purpose text search. I do things like named entity recognition, anaphora resolution, collecting statistics about the usage of terms over time, etc.

But you're right, it might be a good idea to look into the postgres extension mechanism. I've never seriously done that.