Hacker News
by teddyknox 3343 days ago
When I think of exabyte-scale queries on a columnar datastore, I think of aggregations, but then I have this question: why do we need to do exabyte-scale queries in the first place? Wouldn't statistical inference via random sampling be faster and accurate enough?

(Granted, aggregations often happen after some filtering, at which point the relation being aggregated might be considerably smaller than exabyte scale.)

4 comments

Redshift is designed to fill the classic accounting data warehouse role in an organisation. Whilst I'm sure there aren't too many companies with account ledgers that large (if any), I doubt too many accountants would be happy with statistical inference of their books... ;)

This new model of processing directly on S3 is pretty much aimed specifically at eliminating the "Load" part of the ETL process. Just dump to CSV from whatever sources you originally had, and don't worry about the schema conversion/loading into a DB. The fact that it happens to scale to exabytes is just good marketing fluff.

Yeah, I think filtering is a big part of it. If you want to answer a statistical question about the entire dataset, then a random sample is probably good enough. If you want to drill down and do an analysis that only looks at a particular narrow slice of the data, then it's likely that the corresponding subset of your sample isn't big enough to be meaningful.

(You can pre-filter or pre-aggregate before sampling, but that assumes you know a priori what types of queries you'll want to do.)
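To make the point above concrete, here is a rough Python sketch. The table, the 1% "rare" slice, and the 1% sampling rate are all made-up assumptions for illustration. It compares the standard error of a sample mean over the whole sample with the standard error over the narrow slice of that same sample.

```python
import random
import statistics

def mean_and_stderr(values):
    """Return (mean, standard error of the mean) for a simple random sample."""
    n = len(values)
    if n < 2:
        # Too few rows to estimate spread at all.
        return None, float("inf")
    return statistics.fmean(values), statistics.stdev(values) / n ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical table: 200,000 rows, roughly 1% of them in a "rare" slice.
rows = [
    ("rare" if rng.random() < 0.01 else "common", rng.gauss(100, 20))
    for _ in range(200_000)
]

# Uniform 1% random sample of the whole table.
sample = rng.sample(rows, 2_000)

all_vals = [value for _, value in sample]
rare_vals = [value for key, value in sample if key == "rare"]

mean_all, se_all = mean_and_stderr(all_vals)
mean_rare, se_rare = mean_and_stderr(rare_vals)

print("whole sample:", len(all_vals), "rows, mean", mean_all, "stderr", se_all)
print("rare slice:  ", len(rare_vals), "rows, mean", mean_rare, "stderr", se_rare)
```

The standard error shrinks roughly as 1/sqrt(n), so a slice that keeps only about 1% of the sampled rows ends up with an error on the order of ten times larger than the whole-sample estimate.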

It really depends on what you are doing. A large data set shouldn't be limited to longitudinal analysis. If you're storing every log record or every stock bid/ask, there may be times that you need to understand the specifics of what exactly was going on. There may be a lot of filtering on the underlying corpus for these sorts of exact-match queries, but data set sizes continue to grow.
That said, I agree that approximate functions should be part of a modern database system. Redshift has approximate count distinct (based on HyperLogLog) and approximate percentiles (based on quantile summaries).
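For anyone curious what a HyperLogLog-style approximate count distinct looks like, below is a minimal Python sketch. It is a textbook-style illustration, not Redshift's implementation; the class name, the choice of SHA-1 as the hash, and the precision parameter `p=12` are assumptions made for the example.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog estimator using 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)             # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic-mean style raw estimate.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the registers

print(round(hll.count()))
```

With 4,096 registers the expected relative error is around 1.04/sqrt(4096), or about 1.6%, while the state stays a few kilobytes no matter how many rows are scanned. That fixed-size state is the usual reason to prefer this kind of sketch over an exact distinct count on very large tables.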
See BlinkDB.