by yuanchuan 3772 days ago

I once worked on a similar project. About 5 TB of data came in each day.

If your data are event data (user activity, clicks, etc.), they are non-volatile: you should preserve them as-is and enrich them later for analysis.

You can store these flat files in S3, use EMR (Hive, Spark) to process them, and load the results into Redshift. If your files are character-delimited, you can easily create a table definition with Hive/Spark and query the files as if they were a table in an RDBMS. You can process your files in EMR using spot instances, which can cost less than a dollar per hour.
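To make the table-definition step concrete, here is a rough sketch in Python that only builds the Hive DDL string for a set of delimited files. The helper name, table name, columns, and bucket path are all made up for illustration; you would run the resulting statement in Hive (or Spark SQL with Hive support) on your own cluster.

```python
def hive_external_table_ddl(table, columns, location, delimiter="\t"):
    """Build a Hive CREATE EXTERNAL TABLE statement for delimited text files.

    Hypothetical helper: it only formats a string and does not talk to Hive.
    """
    # Render each (name, type) pair as one column definition.
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns
    )
    # Hive takes the delimiter as an escaped literal, e.g. '\t' for a tab.
    escaped = delimiter.encode("unicode_escape").decode("ascii")
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{escaped}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Assumed example: tab-delimited click events under a placeholder S3 prefix.
ddl = hive_external_table_ddl(
    table="click_events",
    columns=[("event_time", "STRING"), ("user_id", "STRING"), ("url", "STRING")],
    location="s3://example-bucket/events/clicks/",
)
print(ddl)
```

Because the table is EXTERNAL, the files stay untouched in S3, which fits the "preserve as-is" point above; dropping the table removes only the definition.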