Hacker News new | ask | show | jobs
by bob1029 1747 days ago
We've built a proper data-driven business application - all the logic is effectively SQL over tables.

The challenge has nothing to do with the particular language or type system. Having a strongly-typed language helps a ton with refactor, but its not a prerequisite for data-driven to work correctly.

The most important part of doing this right is stepping away from the computer and listening to how the business refers to the things and their relationships. Getting the schema/normalization correct is foundational to long term success.

For instance, understanding that circular dependencies exist in the real world, and then actually being able to model them accurately is super important. If you have a Customer & Account that need to talk to each other both ways, invent a 3rd (or more) type(s) to relate them together. Model the nature of the relationship itself inside this new type. Very rarely is the relationship between 2 business types just an ID map (e.g. when was the relationship established? By whom?). Also, absolutely don't do any sort of bullshit where you arbitrarily decide which type "wins" in a hierarchy, unless the business has expressly informed you that one of those types is king always.

Closely related to this is nesting of complex types. Just don't. Make arrays (tables) of things and map between them via relations. You can query into JSON blobs with most database engines (we use SQLite), but it's a shitty way to author your BL in my experience.

7 comments

This, completely.

Data modelling is the skill that is missing the most throughout the process. As you say that's developing a business domain model and mapping that to an appropriate persistence model. Both layers are important and quite often non trivial.

Working with a BA or technical product owner can help a lot in business domain and if you can use as much immutability as possible with a relational model you'll be in a reasonable position for quality development.

Agile killed proper data modeling.

Tackling complexity at the type level is tackling it when it's too late. Same goes for TDD.

It's like appliying a thousand band aids to a shotgun wound.

What? No, careless devs and businesses "killed" data modeling. People care as much about their data structures now as they used to before agile, which means they cared "not a lot".

Agile would push more focus on proper data modeling, as agile promotes changing requirements, devs and stakeholders working together closely and maintaining a constant pace, all things that gets better by proper data modeling and they also help you focus on data modeling.

But then what the Agile Manifesto said and what "Agile/Scrum folks" actually promoted tends to be too different things.

What usually happens now is researching and understanding the business domain is rushed, flawed assumptions are then mapped to a flaky, misaligned database model/prototype and then a mountain of code is furiously piled on top of that.

When you finally understand the domain, it's too late. Because you rushed to deliver "something".

Nobody dares to rework the DB model as it's too risky and expensive.

So you remedy it with transformer layers, refactorings, abstractions, advanced types, tools and tests.

You treat the symptoms.

Teams that aligned their DB model with their business domain early on have it easier in every possible way. Performance, correctness, stability, velocity.

Also, I can't remember the last time I saw data modeling mentioned in a job post. And I've seen thousands of them. From SWE to CTO roles.

My guess: business analysts, data modeling, diagrams, long research cycles sounded too waterfall-y so it had to be killed by agile.

I will say this about agile - we didn't get this right the first time. If we didn't plan for a little bit of failure and iteration, we wouldn't have found this solution.
>Same goes for TDD.

How so?

Please check my answer to "capableweb" user in this thread.
> Getting the schema/normalization correct is foundational to long term success.

Great points; I think things get werider when we try to blur the line between schema and type (conceptually)

I wonder what it would look like to have a nice single abstraction that enjoys the practicality (BL-wise) of schema and the pragmatism (programming-wise) of types

Or maybe we can demonstrate that such isomorphism can't exist?

> I wonder what it would look like to have a nice single abstraction that enjoys the practicality (BL-wise) of schema and the pragmatism (programming-wise) of types

This already exists in my view. With enough discipline, you can accurately encode any type system into the relational model itself. Something to settle up front is the notion of basic types vs domain types. Basic types are required anywhere and you just need to target the simplest subset that is guaranteed to work everywhere - string, int, guid, date, bool, etc.

Domain types are the most important part of solving complex problems. If it would help, this is an example of one of my schemas. Each line item below is a table:

  /* Basic supporting types */
  String (yes, all strings in the domain live in 1 table/column)
  Date (same for date[time]s)
  /* Everything below this line is a pure domain type */
  Customer 
  Customer_CellPhone 
  Customer_Name
  CellPhone_PhoneNumber
  CellPhone_WirelessCarrier
  WirelessCarrier
  WirelessCarrier_Name
  PhoneNumber
  PhoneNumber_String
  PhoneNumber_LastModified
  LastModified
  LastModified_Date
  LastModified_User
  User
  User_Name
  Name
  Name_First
  Name_First_String
  Name_First_LastModified
  Name_Middle_String
  Name_Middle_LastModified
  Name_Last_String
  Name_Last_LastModified
  Name_Business_String
  Name_Business_LastModified
  
The above encodes a fairly detailed view of the domain in my opinion. If you look carefully, you will see that Null is not possible here and that complex data types are strictly enforced.
I do this too in some apps. Very near exactly how you show it.

How do you enforce required fields in the data model? By eliminating the possibility of null values, you eliminate the possibility of not-null constraints in the data model. I end up having to enforce these types of (non-complex) constraints outside the relational model. A missing row is not detectable in the same way as a null column value.

This is really interesting; if we hook this up with hasura we can still get some nice graphql endpoints
I can imagine this might end up looking like a triple store fairly quickly. Which is fine if total flexibility is what you want but just moves the modelling problem upwards.

Did (or Do) you have any heuristics, or rules of thumb, for deciding how much to break relationships down into separate entities? I could see getting that wrong could lead to tricky db migrations.

Good comment, though one minor semantic nitpick:

> Make arrays (tables) of things and map between them via relations.

In relational algebra, the "table" is the relation, defining a set of related attributes (column headers) and fields (corresponding column values in each row, or tuple). The mappings between multiple tables/relations are operations. Operations (select, union, etc) take one or more relations as inputs, and return a new relation as the output.

While it does appear that the foreign-key links between tables/relations are the "relations" there, since they seem to define the way different sets of data relate to each other, but the original relation is the set of attributes within each table/relation.

What is your client-facing representation of the model?

How do you manage model changes, from getting new verbal requirements from the client, to communicating your understanding back to them, to tasking it out (user stories?), to schema updates?

Thank you, very good points!
Another saying is that all data relationships become many-to-many over time. First you think that each customer has a phone number, but then you find that they often have several phone numbers, and some phone numbers will belong to multiple customers, and so on.
This is the most important realization for me.

I am starting to chase a form of normalization even more strict than 6NF. Everything gets a synthetic key, there is no such thing as a domain key anymore. Facts are in tables 2 columns wide, no exceptions. Tuples of id & fact. The only other things I permit are a "root" type and a "relation" type, being 1 and 3 columns respectively.

If someone told me I need to start tracking when a phone number was last modified, or create a new approval loop for that change, I want this to be a concern entirely contained under the phone number tables. Not something that needs to be added to everything that has string column called PhoneNumber.

I strongly believe this degree of normalization to be perfect from a domain modeling perspective. It almost feels stupid to not do it this way looking back at all the unexpected changes we've endured.

SQL is also the perfect tool for putting Humpty Dumpty back together again. You can build views to present your change-proof 6NF+ schema in something the business can consume. I.e. wider tables that make some assumptions. If we get our assumptions wrong on the views, we just iterate so we don't break old SQL. VCustomers2, VCustomers3, etc. The cost of iterating your schema by way of views is negligible compared to fucking with it elsewhere.

> I strongly believe this degree of normalization to be perfect from a domain modeling perspective.

How's performance with several tens if not hundreds of joins in a query?

Performance is just fine for us. It's all about scoping for the right things. You are never looking for the whole schema all at once. No one ever said it had to be running on disk either.
> You are never looking for the whole schema all at once.

Well we have to. Our system exports the data to other systems, which would involve generating a file with all the data (think order or invoice but with 100+ fields). For many of our customers this job is time sensitive.

Besides that we have overview grids (think "all active orders") for our customers, for many of these our customer will want to have 20+ fields visible. Using your method that can be 50-100 joins, and this would be an interactive case so again time sensitive.

That's why I was curious how it scaled.

> No one ever said it had to be running on disk either.

Well our data absolutely has to be persisted to disk on commit, and a stale cache is not tolerable. But apart from that, sure.

Is that anchor modelling?
Very nearly, yes. The biggest distinction is that all keys are synthetic.