HN Mirror


by matttah 2749 days ago

Agreed that storage is cheap, unless it's in a live cluster. Right now I keep 100% of the data in Redshift, then use window functions to unload the latest X per id. I haven't tried keeping that result up to date in real time; as it stands, unloading the full "last X per id" dataset takes multiple days on a 6-node cluster.
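
For readers less familiar with the "latest X per id" pattern, here is a minimal Python sketch of the same selection done outside the warehouse. It is only an illustration, not the Redshift query in use: the `(id, timestamp, payload)` row shape and the function name `latest_x_per_id` are assumptions made for the example.

```python
import heapq
from collections import defaultdict


def latest_x_per_id(rows, x):
    """Keep only the x newest rows for each id.

    rows: iterable of (id, timestamp, payload) tuples (assumed shape).
    Returns {id: [(timestamp, payload), ...]} with the newest row first.
    """
    heaps = defaultdict(list)
    for seq, (row_id, ts, payload) in enumerate(rows):
        heap = heaps[row_id]
        # seq breaks timestamp ties so payloads are never compared.
        item = (ts, seq, payload)
        if len(heap) < x:
            heapq.heappush(heap, item)
        else:
            # Push the new row and drop the oldest one in a single step.
            heapq.heappushpop(heap, item)
    return {
        row_id: [(ts, payload) for ts, _, payload in sorted(heap, reverse=True)]
        for row_id, heap in heaps.items()
    }
```

Memory stays bounded at x rows per id, so this works on a stream that is not sorted by time, at the cost of holding every id seen so far in memory.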

The end-of-month analysis is simply "give me 100% of the data set, chunked up by ID". All analysis is done outside of the system.

My current thinking is flat files on S3, with one prefix per partition. I'm not sure about the file format, since one requirement is being able to process each day's data and update the existing data quickly. The current plan is: load the current day's data into Redshift -> unload sorted by id -> process concurrently. With multiple prefixes on S3 I won't hit the rate limits. My main worry is that reading, loading, and parsing each file will take too long to scale at 250-500 million unique IDs per day. I wanted to check here before going down that route to see if anyone has a different recommendation.
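
One possible shape for the "prefix per partition" layout, as a hedged Python sketch. The key format, the partition count of 1024, and the function names are all assumptions for illustration, not a settled design; the code only builds key strings and does not talk to S3.

```python
import hashlib


def partition_for(record_id: str, num_partitions: int) -> int:
    """Map an id to a stable partition number.

    md5 is used only because it is stable across processes and machines,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def s3_key(record_id: str, day: str, num_partitions: int = 1024) -> str:
    """Build a hypothetical object key: partition first, then day."""
    part = partition_for(record_id, num_partitions)
    return f"part={part:04d}/day={day}/data.jsonl"
```

Putting the partition at the front of the key spreads requests across many prefixes, and every day's file for a given id lands under the same prefix, so a worker can own a partition and process it independently.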

1 comment

verdverm 2749 days ago

Some things I might try...

    1. Hadoop / HDFS / Spark on an ephemeral cluster with disk snapshots
    2. Group 1M IDs into a single file
    3. If analysis is once a month, save daily, then prep the data right before analysis
    4. Consider using Cassandra
    5. Rent a big machine where the data can fit into memory
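
Suggestion 2 could look something like the Python sketch below. It assumes the IDs are non-negative integers, which the thread does not say; string IDs would need hashing or range splitting instead. The function names and the file-name pattern are made up for the example.

```python
from collections import defaultdict

IDS_PER_FILE = 1_000_000  # "1M IDs per file", as in suggestion 2


def bucket_for(numeric_id: int, ids_per_file: int = IDS_PER_FILE) -> int:
    """Return the index of the file that holds this id (assumes integer ids)."""
    return numeric_id // ids_per_file


def group_into_files(records, ids_per_file: int = IDS_PER_FILE):
    """Group (id, payload) records by the hypothetical file they belong in."""
    files = defaultdict(list)
    for numeric_id, payload in records:
        name = f"ids-{bucket_for(numeric_id, ids_per_file):06d}.jsonl"
        files[name].append((numeric_id, payload))
    return dict(files)
```

At 250-500 million unique IDs per day this would give a few hundred files rather than hundreds of millions of tiny objects, provided the IDs are reasonably dense in their numeric range.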
