by x0x0 4284 days ago

There's good news and bad news. Good news is storing this much data isn't hard; there are plenty of people who've done it, and many systems will scale enough.

Bad news is picking a system means understanding your access patterns -- for reading, not writing. Do you only need to look within a single user? That's much easier. If you have to query across users, stuff gets much harder. (I have no idea what your problem domain is, but if it's utility usage, think average usage by zip or block; if it's wearables, activity by city, etc.) How granular do you need to be able to query, and how far back? What is the SLA on a query: are results calculated in batch mode or on demand for a website? You often have to duplicate data in order to optimize one copy for throughput and the other for minimal random query time. Can you get away with logarithmic granularity for queries, i.e. every sample is available for 1 month, every 3rd for the next month, every 10th for a couple months after that, etc.? What windowing functions do you need to run, and how frequently do they need to be updated? What is the ratio of writes to reads? If you have to access random data quickly, e.g. for a site, can you calculate everything more than 1 day back in batch mode, cache those results, and add the last 24h of data at runtime? Etc. etc. etc.
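To make two of those questions concrete, here's a rough Python sketch of logarithmic granularity and of the "batch for old data, live for the last 24h" split. Everything named here (`TIERS`, `keep_step`, `thin`, `total_usage`) is made up for illustration, the tier boundaries are just the ones from the example above, and thinning by list index rather than by time bucket is a simplifying assumption. It isn't tied to any particular datastore.

```python
from datetime import datetime, timedelta

# Hypothetical retention tiers: (maximum sample age, keep every Nth sample).
# Assumption: anything older than the last tier reuses the coarsest step.
TIERS = [
    (timedelta(days=30), 1),
    (timedelta(days=60), 3),
    (timedelta(days=120), 10),
]


def keep_step(age):
    """Return N, meaning 'keep every Nth sample', for a sample of this age."""
    for max_age, step in TIERS:
        if age <= max_age:
            return step
    return TIERS[-1][1]


def thin(samples, now):
    """Downsample (timestamp, value) pairs, sorted by timestamp, by age tier."""
    return [
        (ts, value)
        for i, (ts, value) in enumerate(samples)
        if i % keep_step(now - ts) == 0
    ]


def total_usage(cached_total, recent_samples, now):
    """Combine a batch-computed total with samples from the last 24 hours.

    Assumption: cached_total covers everything older than now - 24h,
    so no sample is counted twice.
    """
    cutoff = now - timedelta(hours=24)
    return cached_total + sum(v for ts, v in recent_samples if ts >= cutoff)
```

With one sample per day, `thin` keeps every sample from the last 30 days and progressively fewer from older tiers. `total_usage` only touches raw samples newer than the cutoff, which is the point of caching the batch result.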

You need to have some conversations with the data consumers.

Edit: and I've assumed these data are read-only; if you can update them, things get far more difficult.

1 comment

BWStearns 4284 days ago

There should be no updates, but records may be added out of order. I've seen that this is a problem for some systems and not for others.
