Hacker News
by BoiledCabbage 1774 days ago
> The problem is that performance and normalization do not go well together.

> Let's say you have billions of rows of event data you wish to perform summary counts for by a few different key columns.

Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Such a terrible tradition in our industry of focusing on outlier cases, or on what FB/Google/... might need, when making technical decisions.


>Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Having built a real-time analytics solution a few years ago (we wanted one we controlled for our startup), I can say it doesn't take much to get to billions of rows of event data on the modern web.

At my last $dayjob we were a speck compared to even the smallest "webscale" consumer B2C app, and we still gathered 800 million tracing events a week.
Yup. How many variations of analytics startups/companies exist out there? How many customers do they each have with how many events happening per customer?

Analytics on large-ish data is definitely a common challenge. And sure, CRUD is even more common. But knowing what your options are for each challenge makes sense.

This relates to https://news.ycombinator.com/item?id=28047618: if the giants are generating exabytes per year, it seems reasonable to expect that your smaller application can end up with a few terabytes in the same time.
OK, well, I'm telling you based on my experience, and it wasn't at FB or Google. :) Do most people have these problems? No, you're right. Should everyone ignore them and be unaware of the options when they do have the problem? Probably not.
I had those problems several times within the last few years - having to aggregate and detect various signals from billions of data points - and keep the system flexible enough that different business people could “try out” various ideas and play with the results.

All sorts of different patterns and architectures have to be brought together to make sense of it.

But the technique described above still has its place within such a system. For example, you boil billions of data points down to a handful (several thousand) of facts, such as events that occurred or anomalies. Then those several thousand facts can be analysed standalone (as described above) without going back to the source. Etc.
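
To make the "boil down, then analyse standalone" idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `Event` fields, the `summarize` and `spikes` helpers, and the threshold are all hypothetical names, not anything from the systems described in this thread. A real pipeline would stream from storage rather than from an in-memory list.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    # Hypothetical key columns, chosen only for illustration.
    customer: str
    kind: str
    hour: int


def summarize(events: Iterable[Event]) -> Counter:
    """Single pass over raw events, keeping one count per key tuple."""
    counts: Counter = Counter()
    for e in events:
        counts[(e.customer, e.kind, e.hour)] += 1
    return counts


def spikes(counts: Counter, threshold: int) -> list:
    """Analyse the small summary only; the raw events are not touched again."""
    return sorted(key for key, n in counts.items() if n >= threshold)


# Tiny stand-in for "billions of rows".
raw = (
    [Event("acme", "click", 9)] * 5
    + [Event("acme", "error", 9)] * 2
    + [Event("initech", "click", 10)]
)

facts = summarize(raw)                 # a few facts instead of every raw row
flagged = spikes(facts, threshold=5)   # assumed, arbitrary threshold
```

Once `facts` exists, different people can try out different thresholds or groupings against it without rescanning the source data.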