Hacker News
by BWStearns 4281 days ago
36-100MB per person per day, ~250 days/year, expecting ~20,000 people (an educated stupid wild-ass guess) initially when the system is actually put into production. ~100-400TB per year(?). Most of the data would only be of interest for a month or so, but we do want to preserve the data in general in some usable fashion for testing and some research stuff.
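A quick sanity check on that estimate, taking the figures above at face value. This is only a rough sketch: it assumes decimal units (1 TB = 1,000,000 MB), and the function name and defaults are made up for illustration.

```python
# Back-of-envelope yearly volume from the figures in the post.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def yearly_tb(mb_per_person_day, people=20_000, days_per_year=250):
    """Yearly data volume in TB for a given per-person daily volume in MB."""
    return mb_per_person_day * people * days_per_year / MB_PER_TB


low = yearly_tb(36)    # lower bound of 36 MB/person/day
high = yearly_tb(100)  # upper bound of 100 MB/person/day
print(f"{low:.0f}-{high:.0f} TB/year")
```

With these inputs the range comes out at roughly 180-500 TB per year, the same order of magnitude as the ~100-400TB guess.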
3 comments

In this case, I would still recommend Cassandra. It can easily handle the data sizes you mention as well as the write rates you imply further down the thread.

Cassandra has a nice and simple architecture (every node is identical, no ZooKeeper roles etc.), high write performance and scalability [1], and is fairly robust. My main piece of advice is to get the tables correctly set up. You need to know exactly what queries you want to make and design a table around each query (Cassandra only allows performant queries to be made, unless you go out of your way to set a flag). Whether a query is possible or performant depends on the key of the rows for the table, which may be a composite key. Take a look at the Cassandra documentation for more details.
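To make the "design the table around the query" point concrete, here is a toy in-memory model in Python. It is not the Cassandra driver API. `ToyTable`, the `(person_id, day)` partition key, and the timestamp ordering within a partition are all assumptions picked for illustration. The idea is that a query naming the full partition key touches one bucket, while anything else has to walk every bucket, which is the kind of query the comment says you have to opt into explicitly (in CQL, presumably the `ALLOW FILTERING` clause).

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime


class ToyTable:
    """Toy model of a table keyed by ((person_id, day), timestamp)."""

    def __init__(self):
        # (person_id, day) -> list of (timestamp, value), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, person_id, ts, value):
        # Values are plain floats here so tuples with equal timestamps still compare.
        insort(self._partitions[(person_id, ts.date())], (ts, value))

    def day_slice(self, person_id, day, start=None, end=None):
        """Cheap query: the full partition key is known, so only one bucket is read."""
        rows = self._partitions.get((person_id, day), [])
        return [(ts, v) for ts, v in rows
                if (start is None or ts >= start) and (end is None or ts < end)]

    def scan(self, predicate):
        """Expensive query: no partition key, so every bucket is visited."""
        return [(key, row)
                for key, rows in self._partitions.items()
                for row in rows
                if predicate(key, row)]


table = ToyTable()
table.insert("p1", datetime(2014, 5, 1, 9, 0), 1.0)
table.insert("p1", datetime(2014, 5, 1, 9, 5), 2.0)
table.insert("p2", datetime(2014, 5, 1, 9, 0), 3.0)

morning = table.day_slice("p1", datetime(2014, 5, 1).date())
high_values = table.scan(lambda key, row: row[1] > 1.5)
```

If the cross-person query turns out to be the common one, the usual answer is a second table keyed for that query rather than scanning this one.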

1. http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...

Thanks a ton. I am leaning towards a solution that involves Cassandra. What would you say about using something on top of it like Blueflood?
I haven't used Blueflood, so I couldn't say, but it looks like an interesting project.
You might look into partitioning. Oracle and SQL Server both support that type of operation. Additionally, being able to find support when things get "too big to handle" can be easier on a mature technology with lots of users.

On a side note, you can hook a Hadoop cluster up to SQL Server if you're into that kind of thing for storage.

When it comes to time series, reasoning in terms of byte size does not really make sense; it's better to state how many datapoints you need to handle and how many distinct time series they are distributed across.
8-16ish datapoints per sample and they'll be distributed more or less evenly during the day and then pretty much go dead at night. There may or may not be a value for every data point at every sample.
There's good news and bad news. Good news is storing this much data isn't hard; there's plenty of people who've done it and many systems will scale enough.

Bad news is picking a system means understanding access patterns -- reading, not writing. Do you only need to look within a single user? That's much easier. If you have to query across users, or aggregate (I have no idea what your problem domain is, but if it's utility usage, things like average usage by zip or block; if it's wearables, activity by city, etc.), stuff gets much harder. How granular do you need to be able to query, and how far back? What is the SLA on a query: are results calculated in batch mode or on demand for a website? You often have to duplicate data in order to optimize one set for throughput access and the other set for minimal random query time. Can you get away with logarithmic granularity for queries, i.e. every sample is available for 1 month, every 3rd for the next month, every 10th for a couple months after that, etc.? What windowing functions do you need to run, and how frequently do they need to be updated? What is the ratio of writes to reads? If you have to access random data quickly, e.g. for a site, can you calculate everything more than 1 day back in batch mode, cache those results, and add the last 24h of data at runtime? etc etc etc.
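A rough sketch of what the "logarithmic granularity" idea could look like, in Python. The tier boundaries (30/60/120 days) and the final keep-every-30th tier are assumptions filling in the "etc."; `TIERS`, `stride_for` and `thin` are made-up names, and striding by list index is just the simplest way to pick "every Nth" sample.

```python
import math
from datetime import datetime, timedelta

# (age limit in days, keep every Nth sample). The first three strides follow
# the example above; the day cut-offs and the last tier are assumptions.
TIERS = [(30, 1), (60, 3), (120, 10), (math.inf, 30)]


def stride_for(age_days):
    """Return the keep-every-Nth stride for a sample of the given age."""
    for max_age, stride in TIERS:
        if age_days < max_age:
            return stride


def thin(samples, now):
    """Thin a list of (timestamp, value) pairs sorted by timestamp.

    A sample is kept when its position in the list is a multiple of the
    stride for its age, so recent data stays dense and old data gets sparse.
    """
    kept = []
    for i, (ts, value) in enumerate(samples):
        if i % stride_for((now - ts).days) == 0:
            kept.append((ts, value))
    return kept


# Usage: one sample per day over the last 130 days.
now = datetime(2014, 6, 1)
samples = [(now - timedelta(days=d), float(d)) for d in range(129, -1, -1)]
kept = thin(samples, now)
print(len(samples), "->", len(kept))
```

In a real system the thinning would more likely be an aggregate (min/max/mean per window) than a plain drop, but the tiering logic is the same.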

You need to have some conversations with the data consumers.

Edit: and I've assumed these data are read-only; if you can update them, then there's far more difficulty.

There should be no updates but there is a possibility that records can be added out of order. I've seen that this is a problem for some systems and not for others.
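One way out-of-order records can bite, sketched in Python under the assumption that a store appends in arrival order while its range queries assume time order. The helper name and the sample timestamps are made up; this is not how any particular database works.

```python
from bisect import bisect_left, insort


def range_query(timestamps, start, end):
    """Return timestamps in [start, end); only correct if the list is sorted."""
    return timestamps[bisect_left(timestamps, start):bisect_left(timestamps, end)]


arrivals = [1, 5, 6, 3]  # the record stamped 3 arrives late

append_only = []   # stores records in arrival order
sorted_store = []  # keeps records ordered by their own timestamp
for ts in arrivals:
    append_only.append(ts)
    insort(sorted_store, ts)

missed = range_query(append_only, 3, 5)   # binary search over unsorted data
found = range_query(sorted_store, 3, 5)   # late record lands in the right place
print(missed, found)
```

Here the append-only list silently misses the late record, while the store that orders by event time returns it. Which behaviour you get depends on how the system you pick orders rows on disk.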
My guess would be you would want Cassandra, specifically to incur less overhead for empty values. I haven't built finance backtesting/monitoring infrastructure - which sounds exactly like what you're building - but in this case, I think you'll get real value from triggers, even if that's only being supported experimentally right now.
What will the sampling frequency be? How many samples per sampling interval?