Hacker News
by amypinka 2624 days ago
In benchmarks I've seen CStore is about 50% slower than Parquet on Spark.

Where is the transactional requirement? This person is working with a copy of the real data.

ETLs only need to be written once, and if he decided on a PostgreSQL approach he'd be writing ETLs to send the data there too. He's probably going to find a number of consistency problems, so trying to normalise all this data again will just result in more work that won't make his team of data scientists more productive.

If he's at ~1 TB of data today, where will he be in a few years' time? What's the point of putting infrastructure in place that won't last for the next 10+ years?

1 comment

The RDBMS advantage is that you can update your records and append to them without having to rewrite the dataset. That makes ETL much easier, e.g. recalculating a column. Referential constraints can also make sure your database is coherent for you, which saves a lot of time and a lot of mistakes. You also get well-thought-through schema management and other benefits besides. PG 11 will scale happily to 10x his requirement. I don’t see why you’d want to build infrastructure for the next 10 years on Spark, since Spark is unlikely to be the thing by then anyway.
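
To make the update/append/constraint point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere; the table names, columns and the 120% markup are all made up for illustration, not anything from the original post.

```python
import sqlite3


def demo():
    """Append rows, recalculate a column in place, and let a foreign key
    reject an orphan row. All names and values here are hypothetical."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked; Postgres does so by default.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        """CREATE TABLE orders (
               id INTEGER PRIMARY KEY,
               customer_id INTEGER NOT NULL REFERENCES customers(id),
               net_cents INTEGER NOT NULL,
               gross_cents INTEGER
           )"""
    )
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'acme')")
    conn.executemany(
        "INSERT INTO orders (customer_id, net_cents) VALUES (?, ?)",
        [(1, 10000), (1, 5000)],
    )

    # Append: new rows go in without rewriting the existing ones.
    conn.execute("INSERT INTO orders (customer_id, net_cents) VALUES (1, 2000)")

    # Recalculate a column in place with a single statement.
    conn.execute("UPDATE orders SET gross_cents = net_cents * 120 / 100")

    # Referential constraint: an order for a customer that does not exist is refused.
    try:
        conn.execute("INSERT INTO orders (customer_id, net_cents) VALUES (999, 100)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    rows = conn.execute(
        "SELECT net_cents, gross_cents FROM orders ORDER BY id"
    ).fetchall()
    conn.close()
    return rows, rejected


if __name__ == "__main__":
    print(demo())
```

With immutable Parquet files the UPDATE step would typically mean rewriting the affected files, and the orphan check would be something you write and run yourself.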

I don’t know that cstore is slower at all at 100GB, nor do I know that it matters for the use case. Spark runs like a dog on a single machine and requires far more resources to do so. PG also has options like PG-Strom for GPU acceleration, if speed is even a thing.

Also, ETL is rarely written once; it’s an ongoing body of work that changes as the data does.