| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eatonphil 1821 days ago

The problem is that performance and normalization do not (always) go well together.

Let's say you have billions of rows of event data you want to perform summary counts for by a few different key columns.

Doing this up front as the events are ingested is going to allow for much more efficient querying on an already grouped table than having to group on your billions of events in each SELECT query.

I'm not saying don't normalize. But normalizing creates its own problems too you may need to think about.

5 comments

BoiledCabbage 1821 days ago

> The problem is that performance and normalization do not go well together.

> Let's say you have billions of rows of event data you with to perform summary counts for by a few different key columns.

Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Such a terrible tradition in our industry in focusing on outlier cases, or on what FB/Google/... might need when making technical decisions.

link

bryanrasmussen 1821 days ago

>Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

having built a real time analytics solution a few years ago because we wanted one we controlled for our startup, it doesn't take much to get to billions of rows of event data in the modern web.

link

Spivak 1821 days ago

At last $dayjob we were a speck compared to even the smallest "webscale" consumer B2C app and we gathered 800mil tracing events a week.

link

eatonphil 1821 days ago

Yup. How many variations of analytics startups/companies exist out there? How many customers do they each have with how many events happening per customer?

Analytics on large-ish data is definitely a common challenge. And sure, CRUD is even more common. But knowing what your options are for each challenge makes sense.

link

bryanrasmussen 1821 days ago

this relates https://news.ycombinator.com/item?id=28047618 if the giants are generating exabytes per year it seems reasonable to expect that your smaller application can end up with a few terabytes in the same time.

link

eatonphil 1821 days ago

OK well I'm telling you based on my experience and it wasn't at FB or Google. :) Do most people have these problems, no you're right. Should everyone ignore them and be unaware of options when they do have the problem? Probably not?

link

LeonB 1821 days ago

I had those problems several times within the last few years - having to aggregate and detect various signals from billions of data points - and keep the system flexible enough that different business people could “try out” various ideas and play with the results.

All sorts of different patterns and architectures have to be brought together to make sense of it.

But the technique described above still has its place within such a system. For example - you boil billions of data points down to a handful (several thousand) facts, such as events that occurred or anomalies. Then those several thousand facts can be analysed standalone (as described above) without going back to the source. Etc.

link

vidarh 1821 days ago

Sqlite has triggers, and Sqlite has indices, and Sqlite lets you write custom functions in your host language.

While I'm not necessarily arguing in favour of using Sqlite for this, it most certainly can do this up front during ingestion.

So yes, sometimes you do end up sacrificing normalisation, but often that means keeping a normalised form, and having code that can re-generate summary data from the normalised form.

link

naasking 1821 days ago

The question to ask is whether building a cache of normalized data will be more efficient than addressing the complexities of non-normalized data, eg. duplication, renaming, integrity problems, etc.

link

Pokepokalypse 1821 days ago

I've seen the downside of this. And it's having to have an oncall staff of DBA's hand-grooming fragile databases and running deduplication processes that sometimes get so far behind that it's a mathematical impossibility to catch up with new data coming in. (never mind what happens to your maintenance schedule when you have entire teams focused on "fixing this problem" for weeks at a time).

If it were up to me (and it almost never is, because I'm not a DB expert), I would ALWAYS normalize as much as possible when designing a database. I really can't even wrap my brain around why large (and old) databases ever get into this state. But I've seen it at two different employers and it's very painful (and costly) to deal with. My take was that these databases were probably originally set up by people who had no idea what they were doing, and ended up locking the company into a shitty implementation that kept the company crippled 15 years later. But that's just me.

link

eatonphil 1821 days ago

Totally. All I mean to say is that "everything should always be normalized" doesn't necessarily make sense. You need to consider your situation.

link

Retric 1821 days ago

Premature optimization is the root of all evil. It’s unlikely normalizing data is going to be an actual problem you can’t solve via caching.

link

10000truths 1821 days ago

I hate how this quote has been perverted. “Premature optimization” was meant to apply to people making their programs an unreadable/unmaintainable mess to save a couple of CPU instructions. Figuring out how to organize your data is not premature optimization, it’s literally one of the first things you have to tackle when starting a project, because trying to change it down the line is going to be miserable, especially if you can’t afford any downtime.

link

naasking 1821 days ago

Yes, and I think the parent was pointing out that denormalization is often a premature optimization that makes the data schema an unreadable/unmaintainable mess to save a few CPU instructions. The normalized schema is answering the question of "how to organize your data". Denormalizing that schema is an attempt at optimization.

link

bob1029 1821 days ago

> The problem is that performance and normalization do not (always) go well together.

Absolutely agree. That is why we use in-memory SQLite projections of business state that are scoped per-user-session so that row counts never exceed double digits in any given table.

For us, adding indexes to these special database instances would actually make things go slower.

link

reilly3000 1821 days ago

Use a columnar database for this. Use materialized views. Don’t do this on the prod DB for the CRUD app. Simple :)

link

eatonphil 1821 days ago

That sounds good!

link