| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bob1029 1822 days ago

> accidental and essential complexity

I feel like I am turning into a bot for posting the Out of the Tar Pit paper:

http://curtclifton.net/papers/MoseleyMarks06a.pdf

This changed my understanding of computer science and our product virtually overnight. We are using a hybrid model of Functional Relational Programming (see section 9 in the paper). This is in production right now and its clearly the right answer for managing complexity in our product.

If you think "eww i have to learn some FRP language" - No. You just need to tack SQLite onto whatever preferred language you use today and model all your business logic as SQL queries over properly-normalized tables. If you can achieve this with 6NF, you have an infinitely-extensible domain model. If you don't know where to start, 3NF is the safest place. Most humans tend to think in terms of 3NF when referring to the business entities.

3 comments

eatonphil 1822 days ago

The problem is that performance and normalization do not (always) go well together.

Let's say you have billions of rows of event data you want to perform summary counts for by a few different key columns.

Doing this up front as the events are ingested is going to allow for much more efficient querying on an already grouped table than having to group on your billions of events in each SELECT query.

I'm not saying don't normalize. But normalizing creates its own problems too you may need to think about.

BoiledCabbage 1822 days ago

> The problem is that performance and normalization do not go well together.

> Let's say you have billions of rows of event data you with to perform summary counts for by a few different key columns.

Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Such a terrible tradition in our industry in focusing on outlier cases, or on what FB/Google/... might need when making technical decisions.

bryanrasmussen 1821 days ago

>Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

having built a real time analytics solution a few years ago because we wanted one we controlled for our startup, it doesn't take much to get to billions of rows of event data in the modern web.

Spivak 1821 days ago

At last $dayjob we were a speck compared to even the smallest "webscale" consumer B2C app and we gathered 800mil tracing events a week.

eatonphil 1821 days ago

Yup. How many variations of analytics startups/companies exist out there? How many customers do they each have with how many events happening per customer?

Analytics on large-ish data is definitely a common challenge. And sure, CRUD is even more common. But knowing what your options are for each challenge makes sense.

bryanrasmussen 1821 days ago

this relates https://news.ycombinator.com/item?id=28047618 if the giants are generating exabytes per year it seems reasonable to expect that your smaller application can end up with a few terabytes in the same time.

eatonphil 1822 days ago

OK well I'm telling you based on my experience and it wasn't at FB or Google. :) Do most people have these problems, no you're right. Should everyone ignore them and be unaware of options when they do have the problem? Probably not?

LeonB 1822 days ago

I had those problems several times within the last few years - having to aggregate and detect various signals from billions of data points - and keep the system flexible enough that different business people could “try out” various ideas and play with the results.

All sorts of different patterns and architectures have to be brought together to make sense of it.

But the technique described above still has its place within such a system. For example - you boil billions of data points down to a handful (several thousand) facts, such as events that occurred or anomalies. Then those several thousand facts can be analysed standalone (as described above) without going back to the source. Etc.

vidarh 1822 days ago

Sqlite has triggers, and Sqlite has indices, and Sqlite lets you write custom functions in your host language.

While I'm not necessarily arguing in favour of using Sqlite for this, it most certainly can do this up front during ingestion.

So yes, sometimes you do end up sacrificing normalisation, but often that means keeping a normalised form, and having code that can re-generate summary data from the normalised form.

naasking 1822 days ago

The question to ask is whether building a cache of normalized data will be more efficient than addressing the complexities of non-normalized data, eg. duplication, renaming, integrity problems, etc.

Pokepokalypse 1822 days ago

I've seen the downside of this. And it's having to have an oncall staff of DBA's hand-grooming fragile databases and running deduplication processes that sometimes get so far behind that it's a mathematical impossibility to catch up with new data coming in. (never mind what happens to your maintenance schedule when you have entire teams focused on "fixing this problem" for weeks at a time).

If it were up to me (and it almost never is, because I'm not a DB expert), I would ALWAYS normalize as much as possible when designing a database. I really can't even wrap my brain around why large (and old) databases ever get into this state. But I've seen it at two different employers and it's very painful (and costly) to deal with. My take was that these databases were probably originally set up by people who had no idea what they were doing, and ended up locking the company into a shitty implementation that kept the company crippled 15 years later. But that's just me.

eatonphil 1822 days ago

Totally. All I mean to say is that "everything should always be normalized" doesn't necessarily make sense. You need to consider your situation.

Retric 1822 days ago

Premature optimization is the root of all evil. It’s unlikely normalizing data is going to be an actual problem you can’t solve via caching.

10000truths 1822 days ago

I hate how this quote has been perverted. “Premature optimization” was meant to apply to people making their programs an unreadable/unmaintainable mess to save a couple of CPU instructions. Figuring out how to organize your data is not premature optimization, it’s literally one of the first things you have to tackle when starting a project, because trying to change it down the line is going to be miserable, especially if you can’t afford any downtime.

naasking 1822 days ago

Yes, and I think the parent was pointing out that denormalization is often a premature optimization that makes the data schema an unreadable/unmaintainable mess to save a few CPU instructions. The normalized schema is answering the question of "how to organize your data". Denormalizing that schema is an attempt at optimization.

bob1029 1821 days ago

> The problem is that performance and normalization do not (always) go well together.

Absolutely agree. That is why we use in-memory SQLite projections of business state that are scoped per-user-session so that row counts never exceed double digits in any given table.

For us, adding indexes to these special database instances would actually make things go slower.

reilly3000 1822 days ago

Use a columnar database for this. Use materialized views. Don’t do this on the prod DB for the CRUD app. Simple :)

eatonphil 1822 days ago

That sounds good!

scythmic_waves 1822 days ago

I'm curious, can you recommend an open source example that implements these ideas? My first reaction when I read

> You just need to tack SQLite onto whatever preferred language you use today and model all your business logic as SQL queries over properly-normalized tables.

was to think "well that can't work for _all_ my business logic". But I'd like to see how this idea works in practice before I jump to that conclusion.

Verdex 1821 days ago

I want to second this.

My first thought was the same as yours (no way that works for everything).

It seems like a neat idea. But I can't imagine that anyone has ever done this for anything non-trivial. I want some sort of compelling argument that this works at all before I'm willing to accept this as anything else than a pipe dream.

bob1029 1821 days ago

I am curious what edge cases (i.e. "non-trivial" things) you have in mind that this would not work for.

For us, the more complex the business gets, the more justified this path becomes. Our business fails to scale if we have to write custom code for every customer. Writing custom SQL per customer (against a common schema) is far more tenable.

Are you thinking of some specific logical determination that would be infeasible in SQL vs in code? Or, is this more about webscale arguments?

Verdex 1821 days ago

It's the same sort of disbelief that I would have towards being introduced to a skyscraper made out of cardboard. At first I would think, "oh, this is just like concept art." ~No, we actually built it.~ "Err, so, like nobody can actually go in there." ~It supports people just fine.~ "Um, for what, 20 minutes." ~Well, Nick has lived in it for 20 years now.~

Finally, I'm not going to venture in (let alone actually live in it myself) without someone sitting me down and carefully explaining the structural integrity principles (and other logistics) that make the cardboard skyscraper possible.

Hearing business rules in sql works for some unknown entity doesn't really give any information about the viability for anything else. Maybe the initial modeling is really hard, but all your customers have near identical concerns, so you only pay that cost once. Maybe having multiple customers is actually the key because if one customer's needs violates some assumption that makes the whole thing work, you can tell them go to someplace else. Maybe your customers only have five business logic rules apiece that are all well spec'ed.

What's the domain? The industry? What problems are actually being solved? Can we have an example (even a contrived one)? Is there an open source example? How about a mathematical proof or even principle that argues for why this can be expected to always (or even nearly always) work? How about a youtube video with someone describing the approach?

EDIT: How about this. Show me a partial parser using this approach (grammar rules can be considered business logic, right). Just parse function declarations in C or something. I'll be able to project such an example onto what the full solution would look like to see if I believe it could actually work in general.

scythmic_waves 1819 days ago

Your cardboard skyscraper analogy is fantastic. It does a good job capturing the tone of my disbelief.

I'm disappointed to see there wasn't a response because I was pretty willing to entertain the notion of business-logic-as-normalized-tables. But I needed, like, any evidence.

karmicthreat 1822 days ago

Going to high levels of normalization also makes the code harder to understand. It's very extensible, but every system that I have dealt with that had high levels of normalization was a huge pain to figure out.