
I'd personally say that one of the key things people have to realize when asking "what can I accomplish with these tools" is that you have to understand how the database technology works to get reasonable performance out of it.

So, if you are using PostgreSQL, you really need to know a little about journalling, B+-trees, multi-version concurrency (and the associated vacuum), their "heap-only tuples" update strategy, and settings like synchronous_commit.

Likewise, if you are using Cassandra, you really need to know a little about LSM-trees, inverted indices, eventual consistency, the purpose behind column-family storage, and how their read-to-write ratio affects all of consistency, durability, and performance.

Outside of the specific database technology, if you want your data to actually be there when the shit hits the fan, you absolutely have to understand write-back vs. write-through cache behavior and how and where to apply cache barriers (and what tools work correctly for them).
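To make the write-back vs. write-through point concrete, here is a minimal sketch of what "make sure it is actually on disk" looks like from application code on a POSIX system. The function name `durable_write` is made up for illustration. Note that this only gets the data out of the OS page cache: if the drive or controller has a volatile write-back cache that ignores flush requests, `fsync` returning does not mean the data is safe.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, flushing to the device before returning.

    Hypothetical helper; assumes a POSIX filesystem and hardware that
    honors cache flush requests.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # Python's userspace buffer -> kernel page cache
            os.fsync(f.fileno())  # kernel page cache -> storage device
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Don't leave a stray temp file behind if anything above failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # The rename is a directory update, so the directory needs its own fsync.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping either `fsync` call is the application-level equivalent of trusting a write-back cache: it is faster, and it is fine right up until the power goes out.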

Finally, your specific hardware is going to drastically affect how much performance you get for your specific application: random and sequential read/write I/O performance differs widely between storage technologies, and what matters is how that matches up with the read/write ratio and locality of your application.

If you don't spend the time to learn these things, you are seriously just going to get burned. We (as a civilization) simply do not have the science and theory yet to make the practicalities of setting up and maintaining a database server a totally seamless and simple process with well-understood performance characteristics unless you constrain absolutely every single variable.

Unfortunately, learning a lot about how these algorithms work is really hard. I mean, PostgreSQL HOT is almost undocumented at the user level: you have to drop to README.HOT from the source distribution to see what the performance characteristics are.

Meanwhile, there is a ton of misinformation out there. :( I haven't watched much of this video at all, but within 10 seconds of skipping around in it I saw that this person claims that relational databases force re-writes during schema updates, which is not true for the majority of updates that developers actually try to make.

Finally, half of this stuff is really only understood well at the academic level: to really get an understanding of why your particular load causes PostgreSQL to slow down you might end up reading academic papers from the 80s and 90s on the core algorithms we use to do data storage and indexing.

In the end, though, I'll say that most of my database needs are being served right now by a single server for an application that is getting a million users every day (with ten to fifteen million or so users total) with a ton of room to grow.

That said, two core things that I value in my logging are currently being stored to S3 directly (of all places, which is a ludicrous thing to use as a database, really), and while I am pretty certain they will work great on the new database server, I'm not entirely positive.

(For the record, my architecture is an m4.4xlarge running PostgreSQL with three EBS disks, one with ext2 for the WAL, two in md RAID0 with xfs for the data, where a couple of heavily updated tables are set with a 50% fillfactor, and all non-durable writes are done without synchronous commit. I have a two-level external pool, with pgbouncer running on both the application servers and on the database server.)
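For anyone wondering what the fillfactor and synchronous commit parts look like in practice, here is a rough sketch that just builds the SQL. The helper names and the table names (`sessions`, `log_events`) are hypothetical, and turning synchronous commit off per transaction with `SET LOCAL` is an assumption about one reasonable way to do it, not a description of the setup above. `ALTER TABLE ... SET (fillfactor = ...)` and `synchronous_commit` themselves are standard PostgreSQL.

```python
import re
from typing import List

# Only accept plain lowercase identifiers so the table name is safe to interpolate.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def fillfactor_sql(table: str, percent: int) -> str:
    """Build the statement that leaves free space in each heap page.

    The spare room lets an updated row version land on the same page,
    which is one of the conditions for a heap-only tuple (HOT) update.
    """
    if not _IDENT.match(table):
        raise ValueError("unexpected table name: %r" % table)
    if not 10 <= percent <= 100:  # PostgreSQL's allowed range for table fillfactor
        raise ValueError("fillfactor must be between 10 and 100")
    return "ALTER TABLE %s SET (fillfactor = %d);" % (table, percent)


def non_durable_transaction(statements: List[str]) -> List[str]:
    """Wrap statements in a transaction that does not wait for the WAL flush.

    SET LOCAL only lasts until the end of the transaction, so durable
    writes elsewhere keep the server default.
    """
    return ["BEGIN;", "SET LOCAL synchronous_commit = off;"] + list(statements) + ["COMMIT;"]


if __name__ == "__main__":
    print(fillfactor_sql("sessions", 50))
    for line in non_durable_transaction(
        ["INSERT INTO log_events (msg) VALUES ('page view');"]
    ):
        print(line)
```

Two things to keep in mind: changing fillfactor only affects pages written afterwards, so an existing table has to be rewritten before it takes full effect, and with synchronous commit off a crash can lose the last few commits that were reported as successful, though it does not corrupt the database.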

(edit: I said RAID1 when I wanted RAID0. RAID0 increases your random I/O performance on EBS, which is useful for most transactional database server loads; given snapshotting and the inherent durability of EBS, RAID1 is not necessary and will just saturate your network I/O.)