Hacker News new | ask | show | jobs
by amypinka 2624 days ago
I don't know your ratio of HDF5 to Parquet files but remember for every GB of parquet you have it will equate to about 10 GB of space needed in CSV or PostgreSQL's internal format. So your data set is probably closer to 1 TB than 100 GB.

Storing that data on S3 is probably 50% the price of storing it on EBS and you won't have the durability guarantees of S3 when you're using PostgreSQL on EBS volumes.

If you're both exploring data and building models then Spark is fine. Its APIs are no more complicated that anything else out there for these tasks.

Hive is doing nothing more than offering schema on read and shouldn't be something you're thinking much about.

PostgreSQL is row-oriented and won't be able to offer features like row-group statistics that allow queries to get minimum and maximum values for every 10-15K rows of data for the columns their interested in. This gives queries a huge speed up over needing to scan over rows rather than just the statistics for the columns their interested in.

Remember that you can have a single engineer run a single query on Spark and distribute it across several servers. This allows you to scale CPU and memory bandwidth in a way you won't be able to with PostgreSQL.

It sounds like your data isn't well organised. If you moved it around and put some consistent naming conventions in place that could help. You could also look to build an atlas of the data for newcomers to get an overall picture of what data you're storing and where it lives.

2 comments

None of that matters. It's a hundred gigs. You can store it in a textfile and read it in its entirety if you want. It fits in RAM.

It is perfectly reasonable to store this in a database. If and when you change your mind about the data format you can just scrap it and start over.

It's 100GB compressed. Parquet does a very good job of compressing most data so that's where the estimate of 10x (so 1TB) uncompressed was mentioned as a rule of thumb.

Parquet also supports much better access mechanisms, like being able to deserialize a single column without having to read in entire rows.

But like you mentioned, 1TB of data in a traditional database isn't that bad.

... also remembering that a traditional dB will typically not store data raw. Row compression is normal and disk compression is normal . The typical column store advantage is block compression, predicate pushdown and column order storage.
Regular databases such as SQL Server and Oracle have had columnar compression built in as an option along with the row stores for years now. I use it in SQL Server a lot and it works great.
you can run sql DB over compressed filesystem, and some DBs allow you to compress tables too

> like being able to deserialize a single column without having to read in entire rows.

and it reads filesystem's whole page anyway

Sorry for the late reply, but parquet is a columnar format so if it's big enough data, you should have multiple pages/blocks of data in a single column for a specific row group, and then be able to seek to the next row group and sequentially read the next set of blocks.
I’d contend this personally. You can employ disk, or row compression on PG if you want. Compressed disk will actually make your queries faster. You can use cstore for ORC based column storage with PG if you want.

Presumably the cost of a few TB on EBS is the least of your worries.

Finally, the time saving of full transactional support and constraints + sql to write etl in will drastically reduce the amount of work needed to write etl.

IMO, if RDBMS is an option for you, do it whilst your data is small enough.

> sql to write etl in will drastically reduce the amount of work needed to write etl.

:)

My experience with writing an ETL in SQL is that it is almost never, quick, easy, correct or easy to test, and also almost always denormalized, or unconstrained (dimensonal keys which aren't 'real' foreign keys, just numbers so you can parallelize the data inserts and updates without constraint errors).

So... your milage may vary with that.

It's most certainly not true that writing any kind of ETL that uses SQL saves time in all cases.

Well SQL would present the ETL declaratively for one ... whether the output is denormalised or unconstrained has nothing to do with SQL.
In benchmarks I've seen CStore is about 50% slower than Parquet on Spark.

Where is the transactional requirement? This person is working with a copy of the real data.

ETLs only need to be written once and if he decided on a PSQL approach he'd be writing ETLs to send the data there too. He's probably going to find a number of consistency problems so trying to normalise all this data again will just result in more work that won't make his team of DS' more productive.

If he's at ~1 TB of data today, where will he be in a few years time? What's the point of putting infrastructure in place that won't last for the next 10+ years?

The RDBMS advantage is that you can update your records and you can append to them without having to rewrite the dataset. That makes ETL much easier. Eg recalculate a column. It’s also that referential constraints can make sure your database is coherent for you. This saves a lot of time and a lot of mistakes. You also get well thought through scheme management and other benefits besides. Pg11 will scale happily to 10x his requirement. I don’t see why you’d want to build infrastructure for the next 10 years on Spark... since Spark is unlikely to be the thing by then anyway.

I don’t know about cstore being slower at all at 100GB. Nor do I know that it matters for the use case. Spark runs like a dog on a single machine and requires far more resource to do so. PG also has options like pgstrom for gpu acceleration if speed is even s thing.

Also EtL is rarely written once ... it’s an ongoing body of work that changes as the data does.