Hacker News new | ask | show | jobs
by usgroup 2624 days ago
I’d contend this personally. You can employ disk, or row compression on PG if you want. Compressed disk will actually make your queries faster. You can use cstore for ORC based column storage with PG if you want.

Presumably the cost of a few TB on EBS is the least of your worries.

Finally, the time saving of full transactional support and constraints + sql to write etl in will drastically reduce the amount of work needed to write etl.

IMO, if RDBMS is an option for you, do it whilst your data is small enough.

2 comments

> sql to write etl in will drastically reduce the amount of work needed to write etl.

:)

My experience with writing an ETL in SQL is that it is almost never, quick, easy, correct or easy to test, and also almost always denormalized, or unconstrained (dimensonal keys which aren't 'real' foreign keys, just numbers so you can parallelize the data inserts and updates without constraint errors).

So... your milage may vary with that.

It's most certainly not true that writing any kind of ETL that uses SQL saves time in all cases.

Well SQL would present the ETL declaratively for one ... whether the output is denormalised or unconstrained has nothing to do with SQL.
In benchmarks I've seen CStore is about 50% slower than Parquet on Spark.

Where is the transactional requirement? This person is working with a copy of the real data.

ETLs only need to be written once and if he decided on a PSQL approach he'd be writing ETLs to send the data there too. He's probably going to find a number of consistency problems so trying to normalise all this data again will just result in more work that won't make his team of DS' more productive.

If he's at ~1 TB of data today, where will he be in a few years time? What's the point of putting infrastructure in place that won't last for the next 10+ years?

The RDBMS advantage is that you can update your records and you can append to them without having to rewrite the dataset. That makes ETL much easier. Eg recalculate a column. It’s also that referential constraints can make sure your database is coherent for you. This saves a lot of time and a lot of mistakes. You also get well thought through scheme management and other benefits besides. Pg11 will scale happily to 10x his requirement. I don’t see why you’d want to build infrastructure for the next 10 years on Spark... since Spark is unlikely to be the thing by then anyway.

I don’t know about cstore being slower at all at 100GB. Nor do I know that it matters for the use case. Spark runs like a dog on a single machine and requires far more resource to do so. PG also has options like pgstrom for gpu acceleration if speed is even s thing.

Also EtL is rarely written once ... it’s an ongoing body of work that changes as the data does.