Hacker News new | ask | show | jobs
by bob1029 1774 days ago
> accidental and essential complexity

I feel like I am turning into a bot for posting the Out of the Tar Pit paper:

http://curtclifton.net/papers/MoseleyMarks06a.pdf

This changed my understanding of computer science and our product virtually overnight. We are using a hybrid model of Functional Relational Programming (see section 9 in the paper). This is in production right now and its clearly the right answer for managing complexity in our product.

If you think "eww i have to learn some FRP language" - No. You just need to tack SQLite onto whatever preferred language you use today and model all your business logic as SQL queries over properly-normalized tables. If you can achieve this with 6NF, you have an infinitely-extensible domain model. If you don't know where to start, 3NF is the safest place. Most humans tend to think in terms of 3NF when referring to the business entities.

3 comments

The problem is that performance and normalization do not (always) go well together.

Let's say you have billions of rows of event data you want to perform summary counts for by a few different key columns.

Doing this up front as the events are ingested is going to allow for much more efficient querying on an already grouped table than having to group on your billions of events in each SELECT query.

I'm not saying don't normalize. But normalizing creates its own problems too you may need to think about.

> The problem is that performance and normalization do not go well together.

> Let's say you have billions of rows of event data you with to perform summary counts for by a few different key columns.

Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Such a terrible tradition in our industry in focusing on outlier cases, or on what FB/Google/... might need when making technical decisions.

>Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

having built a real time analytics solution a few years ago because we wanted one we controlled for our startup, it doesn't take much to get to billions of rows of event data in the modern web.

At last $dayjob we were a speck compared to even the smallest "webscale" consumer B2C app and we gathered 800mil tracing events a week.
Yup. How many variations of analytics startups/companies exist out there? How many customers do they each have with how many events happening per customer?

Analytics on large-ish data is definitely a common challenge. And sure, CRUD is even more common. But knowing what your options are for each challenge makes sense.

this relates https://news.ycombinator.com/item?id=28047618 if the giants are generating exabytes per year it seems reasonable to expect that your smaller application can end up with a few terabytes in the same time.
OK well I'm telling you based on my experience and it wasn't at FB or Google. :) Do most people have these problems, no you're right. Should everyone ignore them and be unaware of options when they do have the problem? Probably not?
I had those problems several times within the last few years - having to aggregate and detect various signals from billions of data points - and keep the system flexible enough that different business people could “try out” various ideas and play with the results.

All sorts of different patterns and architectures have to be brought together to make sense of it.

But the technique described above still has its place within such a system. For example - you boil billions of data points down to a handful (several thousand) facts, such as events that occurred or anomalies. Then those several thousand facts can be analysed standalone (as described above) without going back to the source. Etc.

Sqlite has triggers, and Sqlite has indices, and Sqlite lets you write custom functions in your host language.

While I'm not necessarily arguing in favour of using Sqlite for this, it most certainly can do this up front during ingestion.

So yes, sometimes you do end up sacrificing normalisation, but often that means keeping a normalised form, and having code that can re-generate summary data from the normalised form.

The question to ask is whether building a cache of normalized data will be more efficient than addressing the complexities of non-normalized data, eg. duplication, renaming, integrity problems, etc.
I've seen the downside of this. And it's having to have an oncall staff of DBA's hand-grooming fragile databases and running deduplication processes that sometimes get so far behind that it's a mathematical impossibility to catch up with new data coming in. (never mind what happens to your maintenance schedule when you have entire teams focused on "fixing this problem" for weeks at a time).

If it were up to me (and it almost never is, because I'm not a DB expert), I would ALWAYS normalize as much as possible when designing a database. I really can't even wrap my brain around why large (and old) databases ever get into this state. But I've seen it at two different employers and it's very painful (and costly) to deal with. My take was that these databases were probably originally set up by people who had no idea what they were doing, and ended up locking the company into a shitty implementation that kept the company crippled 15 years later. But that's just me.

Totally. All I mean to say is that "everything should always be normalized" doesn't necessarily make sense. You need to consider your situation.
Premature optimization is the root of all evil. It’s unlikely normalizing data is going to be an actual problem you can’t solve via caching.
I hate how this quote has been perverted. “Premature optimization” was meant to apply to people making their programs an unreadable/unmaintainable mess to save a couple of CPU instructions. Figuring out how to organize your data is not premature optimization, it’s literally one of the first things you have to tackle when starting a project, because trying to change it down the line is going to be miserable, especially if you can’t afford any downtime.
Yes, and I think the parent was pointing out that denormalization is often a premature optimization that makes the data schema an unreadable/unmaintainable mess to save a few CPU instructions. The normalized schema is answering the question of "how to organize your data". Denormalizing that schema is an attempt at optimization.
> The problem is that performance and normalization do not (always) go well together.

Absolutely agree. That is why we use in-memory SQLite projections of business state that are scoped per-user-session so that row counts never exceed double digits in any given table.

For us, adding indexes to these special database instances would actually make things go slower.

Use a columnar database for this. Use materialized views. Don’t do this on the prod DB for the CRUD app. Simple :)
That sounds good!
I'm curious, can you recommend an open source example that implements these ideas? My first reaction when I read

> You just need to tack SQLite onto whatever preferred language you use today and model all your business logic as SQL queries over properly-normalized tables.

was to think "well that can't work for _all_ my business logic". But I'd like to see how this idea works in practice before I jump to that conclusion.

I want to second this.

My first thought was the same as yours (no way that works for everything).

It seems like a neat idea. But I can't imagine that anyone has ever done this for anything non-trivial. I want some sort of compelling argument that this works at all before I'm willing to accept this as anything else than a pipe dream.

I am curious what edge cases (i.e. "non-trivial" things) you have in mind that this would not work for.

For us, the more complex the business gets, the more justified this path becomes. Our business fails to scale if we have to write custom code for every customer. Writing custom SQL per customer (against a common schema) is far more tenable.

Are you thinking of some specific logical determination that would be infeasible in SQL vs in code? Or, is this more about webscale arguments?

It's the same sort of disbelief that I would have towards being introduced to a skyscraper made out of cardboard. At first I would think, "oh, this is just like concept art." ~No, we actually built it.~ "Err, so, like nobody can actually go in there." ~It supports people just fine.~ "Um, for what, 20 minutes." ~Well, Nick has lived in it for 20 years now.~

Finally, I'm not going to venture in (let alone actually live in it myself) without someone sitting me down and carefully explaining the structural integrity principles (and other logistics) that make the cardboard skyscraper possible.

Hearing business rules in sql works for some unknown entity doesn't really give any information about the viability for anything else. Maybe the initial modeling is really hard, but all your customers have near identical concerns, so you only pay that cost once. Maybe having multiple customers is actually the key because if one customer's needs violates some assumption that makes the whole thing work, you can tell them go to someplace else. Maybe your customers only have five business logic rules apiece that are all well spec'ed.

What's the domain? The industry? What problems are actually being solved? Can we have an example (even a contrived one)? Is there an open source example? How about a mathematical proof or even principle that argues for why this can be expected to always (or even nearly always) work? How about a youtube video with someone describing the approach?

EDIT: How about this. Show me a partial parser using this approach (grammar rules can be considered business logic, right). Just parse function declarations in C or something. I'll be able to project such an example onto what the full solution would look like to see if I believe it could actually work in general.

Your cardboard skyscraper analogy is fantastic. It does a good job capturing the tone of my disbelief.

I'm disappointed to see there wasn't a response because I was pretty willing to entertain the notion of business-logic-as-normalized-tables. But I needed, like, any evidence.

Going to high levels of normalization also makes the code harder to understand. It's very extensible, but every system that I have dealt with that had high levels of normalization was a huge pain to figure out.