| I don't know your ratio of HDF5 to Parquet files, but remember that every GB of Parquet will need about 10 GB of space in CSV or PostgreSQL's internal format, so your data set is probably closer to 1 TB than 100 GB. Storing that data on S3 probably costs about half as much as storing it on EBS, and PostgreSQL on EBS volumes won't give you S3's durability guarantees. If you're both exploring data and building models then Spark is fine. Its APIs are no more complicated than anything else out there for these tasks. Hive does nothing more than offer schema on read and shouldn't be something you think much about. PostgreSQL is row-oriented and can't offer features like row-group statistics, which give queries the minimum and maximum values for every 10-15K rows of data in the columns they're interested in. This gives queries a huge speed-up, because they can read just the statistics for those columns and skip row groups that cannot match instead of scanning every row. Remember that a single engineer can run a single query on Spark and distribute it across several servers. This lets you scale CPU and memory bandwidth in a way you can't with PostgreSQL. It sounds like your data isn't well organised. Moving it around and putting some consistent naming conventions in place could help. You could also build an atlas of the data so newcomers get an overall picture of what data you're storing and where it lives. |
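To make the row-group statistics point concrete, here is a minimal sketch in plain Python. It is a toy model, not Parquet's actual on-disk layout or any library's API: `RowGroup`, `build_row_groups` and `count_in_range` are hypothetical names, and the group size of 10,000 rows is just an assumption in line with the 10-15K figure above. The idea is that each group keeps a minimum and maximum for a column, so a range query can discard whole groups without reading their rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus its precomputed min/max statistics."""
    values: List[int]
    minimum: int
    maximum: int


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def count_in_range(groups: List[RowGroup], low: int, high: int) -> Tuple[int, int]:
    """Count values in [low, high]; return (matches, groups actually scanned)."""
    matches = 0
    scanned = 0
    for group in groups:
        # The statistics alone prove this group cannot contain a match.
        if group.maximum < low or group.minimum > high:
            continue
        scanned += 1
        matches += sum(1 for v in group.values if low <= v <= high)
    return matches, scanned


# Assumed example data: 100,000 sorted values in groups of 10,000 rows.
values = list(range(100_000))
groups = build_row_groups(values, 10_000)
matches, scanned = count_in_range(groups, 25_000, 27_499)
print(f"{matches} matches, scanned {scanned} of {len(groups)} row groups")
```

The pruning only pays off when the values within a group are clustered, for example because the data was sorted on that column before being written. With randomly ordered data most groups span the whole value range and few can be skipped.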
It is perfectly reasonable to store this in a database. If and when you change your mind about the data format you can just scrap it and start over.