Hacker News new | ask | show | jobs
by beachy 1860 days ago
At the other end of the scale is a data model designed for extreme extensibility.

If you ever hear anyone bragging that their data model is entirely metadata driven, and can be used to model anything - without changing the database - that's a huge red flag, as is looking in and seeing tables called "element", "business object" and the like.

Unfortunately, for most serious Enterprise systems, a degree of flexibility is essential. It's being able to pick the right balance between hard coding first class domain objects into the database and allowing for extensibility that IMO marks the truly expert system designer.

2 comments

One underlying reason for this is that DBMS systems have an unnecessary source of complexity: They have a separate Data Manipulation Language (DML) and a Data Description Language (DDL). They really ought to be unified, but few (any?) mainstream SQL databases are homoiconic in this way.

E.g.: It should be possible to take a query definition, request its columns ("schema only" execution), and then insert or merge the columns into a table definition somewhere. Something like:

    SELECT SCHEMA( SELECT * FROM "blah" ) 
    INTO "tablename"
When a black market is formed, it's a sign that there is an unmet demand. When you see the exact same "wrong" design pattern turn up over and over, it's a sign that the underlying system isn't meeting the needs of the developers.
it sounds like you're making an interesting point but I'm afraid I don't follow, could you elaborate?
The "bad" design that keeps cropping up over and over is the second system effect: Tables, relationships, and columns defined as data in a few simple tables, instead of being defined explicitly in the SQL schema as expected. This is less than optimal for lots of reasons: duplication of metadata, inefficient query plans, no foreign keys, inability to use most kinds of indexes effectively, etc...

However, the need is real: the ability to easily extend or generate table schemas without having to switch languages. You can argue that this is "not that hard", but you'd be absolutely wrong. It's obscenely difficult. I've tried, failed, and have given up. One use-case I had was automatically generating tables for importing data from PowerShell. My goal was to be able to write something like this:

    Get-Process | Export-Sql -ConnectionString '...' -TableName 'Processes'
And have the "Export-Sql" command automatically generate the table schema on the fly based on the input columns it sees. I even wanted to be able to represent object hierarchies as sets of related tables with parent-child foreign key relationships automatically put in. I got a proof-of-concept working, but there were just too many edge-cases. Things like maximum lengths for key or index columns, maximum row size, inability to switch column types on the fly, etc...

So what to do other people do? They also give up and resort to defining a single table that has the columns: "TableName, RowId, ColumnName, ColumnValue" and call it a day. I mean... what other options are there? Bang your head against the wall for months trying to deal with idiotic things like identifier length limits? Correctly escaping arbitrary input strings? Generating hundreds of distinct commands that cannot be parametrised and hope you don't have a SQL injection vulnerability lurking in there somewhere? It's nuts!

Have you seen just how much code it takes to take an arbitrary table, one with dozens of foreign key references, several filtered multi-column indexes, and views that depend on it, and then insert a column in a specific position? You have to drop everything, copy the table, rename, and then recreate everything. Doing this in code would be... I dunno... a few hundred thousand lines? That's absurd. You're not doing it. I'm not doing it. Microsoft did it once for SQL Server Management Studio, and I bet they're not rewriting that code in a hurry!

But go back to the "bad" schema example before: Is it really that bad? What if the database engine had the ability to store common prefixes just once, instead of repeating them for each row? What if the schema looked like this:

    Database, Owner, Table, Column, Value
Starting to look familiar? A bit like [server].[database].[dbo].[Table] perhaps, familiar to every user of Microsoft SQL Server?

There's a natural hierarchy for describing the data, that lends itself well to being represented as a B-Tree, along with the data itself! That's what the "bad" schema is doing: it's representing the data with the most natural representation! It's not wrong at all!

The problem is that database engines have all sorts of features and optimisations that we want that isn't directly compatible with the naive implementation of the bad schema. Constraints, foreign keys, efficient storage, indexing, etc...

However, none of these features are fundamentally incompatible with an API that merely "pretends" that the data is stored in a single flat list that can have its schema updated with a simple insert. The database engine could simply factor out the schema part and the data part into different physical storage layouts, with all the usual efficiencies such as not having to repeat column names for every row.

To summarise: we could have our cake and eat it too. Database engines could be developed that use ordinary select/insert/delete/update statements to modify the schema, and have it perform just as efficiently as a database that uses clumsy statements like "alter table add column". In this world, a database schema upgrade could be as simple as a single literal "MERGE" statement!

While schema creation SQL can be a be a bit unwieldy, I'm not sure I appreciate which part is the problem or what you're trying to achieve.

Obviously you can SELECT * INTO FROM .. if you're just temporarily inserting data.

I'm not sure I see the value in automatically importing arbitrary data into a schemad database object. I think it's too complicated to be carried out by the database and should probably be done by some other piece of software and under human supervision/guidance.

Are you suggesting that INSERT/UPDATE statements also have the ability to modify objects? That sounds like it would add complexity without much gain as opposed to just running and alter table query.

I'm not saying that the process of writing schema modifying queries is painless but I'm not sure I'm convinced that we can have the cake and eat it

For your example I personally would've AWKED into a predefined schema or inserted the data as JSON

> Obviously you can SELECT * INTO FROM .. if you're just temporarily inserting data

Okay, bad example. I should have specified: Insert into an existing table, add the missing columns automatically. There are some special-cases where most database engines can automatically generate schemas, and "SELECT... INTO" is one of those few. Usually only allowed with an empty destination table.

> I'm not sure I see the value in automatically importing arbitrary data into a schemad database object.

There are lots of use-cases for this. You might not have them, but other people do all the time. Just about any large-scale enterprise software needs to be extensible in the field, for example. Think the likes of SAP or Siebel. I've seen similar problems crop up in CMDBs, job and ticket management software, etc... Famously, Jira is slow precisely because it is so bad at handling this kind of extensibility efficiently.

> Are you suggesting that INSERT/UPDATE statements also have the ability to modify objects?

Not necessarily, although that could be an option. What I mean is that changes to the table schema should be made using insert/update/delete statements, where the only "data" is things such as the column names, types, constraints, etc...

> That sounds like it would add complexity without much gain as opposed to just running and alter table query.

It would dramatically reduce complexity, removing an entire language from database engines, along with all the associated vendor-specific syntax, quirks, and limitations.

In fact, the same "schema" that is used in the wire protocol could be directly equal to the actual schema, and its update language. So if you get back a query result (with data), and want to create a copy of that schema elsewhere (e.g.: a local cache database), then you just take the "header" from the result set and "insert" it into the destination database schema. No conversion, no escaping, nothing.

Seriously: Try this as an exercise. Use Java or C#, write a select statement from a query (that has join, views, etc...), and then write the code that generates a table to cache this data in a separate, local database instance.

Do it. Sit down and give it a good go. You won't appreciate how hard this is until you do!

Now write the code to update the cache table dynamically if the source table changes. Assume that table has a petabyte of data. (That's why it's a table, it's a cache for something that's far too big to fit in memory!)

Now write the code to do all of the above with foreign constraints.

Good luck!

> I'm not sure I'm convinced that we can have the cake and eat it

There is a reason NoSQL databases exist and are wildly popular. It's not that "schemaless" is truly better so often, it's more that modifying the schema in most DBMS offerings is so fiddly that it's essentially impossible to do programmatically.

> or inserted the data as JSON

NoSQL in a nutshell! You haven't solved the problem, you've given up and resorted to schemaless tables instead, exactly the "bad" example above with all of its limitations and issues.

I've seen some good attempts at solving this problem, but they barely scratch the surface.

> Okay, bad example. I should have specified: Insert into an existing table, add the missing columns automatically. There are some special-cases where most database engines can automatically generate schemas, and "SELECT... INTO" is one of those few. Usually only allowed with an empty destination table.

See, I don't think I would want SQL that would modify the schema programatically. And imho, that sounds like an invitation for trouble. It can be done of-course, but, to me, the value of the relational data model is to enforce consistency through constraint. If I don't care about consistency, I may as well just dump the data as schemaless JSON and then periodically turn that semi-structured data into actual relational data.

Note that I said schemad database object.

Normalizing the data model is too complex to do programmatically and sounds ill advised. If you want the advantage of painless schemaless data, dump it as JSON.

> NoSQL in a nutshell! You haven't solved the problem, you've given up and resorted to schemaless tables instead, exactly the "bad" example above with all of its limitations and issues.

Note that I was meant this as a stop-gap solution, which to build on and turn into schema-controlled data. Preferably iteratively and hopefully not in production.

I've done this when I've had semi-structured data that I wanted to dump into the RDBMS at which point there was no schema. But it's a band aid. That data is in my experience best turned into schema controlled data ASAP, otherwise inconsistencies start to creep in. Relations that were not immediately obvious pop-up and I don't think there's hope to let the SQL database carry out such a task programmatically.

There are already tools do a good job in analyzing data, which aid the construction of schemas. There are also tools that make updating them relatively painless, especially if you've already inserted the semi-structured data, as most RDBMS' have decent support for handling JSON these days.

Then again, I'm not sure I'm able to appreciate the language you dream of. To me, it sounds like SQL but where ALTER TABLE and UPDATE/INSERT are conjoined into one query. And I guess it would also deal with simple relations (ie, the ones already present by virtue of arrays, associative arrays etc.) Which, admittedly, I see little point besides having same lines which were in two queries, now in one. Which I imagine would mostly just slow down the query parser and make things hard for the optimizer. Not that I don't appreciate that it might be convenient, it doesn't sound like the added complexity has enough trade-offs. Besides, with dumb relations, you end up with deformalized data, ripe for inconsistencies. Ie, the whole problem with NOSQL data modelling in the first place.

I wonder if such a job is not better put on something that is not the RDBMs, and in my experience, such tooling exists and already does a decent job without putting an extraneous responsibility on the already very complicated RDBMs.

I think I would feel comfortable putting schema generation in the hands of Java/C# applications. Instead I would go the route of dumping the semi-structured data straight into the RDBMs as JSON, if I was unable to to predefine the schema. I would do this in a development environment, then I would analyze the data, define the schema to structure it and then move it into production where the data would be queried in regular SQL.

speaking entirely from my own experience, schemaless data is nothing but a pain that results in much more work in the long run. For me it has been the data modelling equivalent of peeing your pants. First, it's nice and warm but pretty soon it get's cold and unpleasant.

Much like there's a reason we do not make computer program themselves, giving them the responsibility of designing the data model seems even unwiser. It's a task that requires intelligence.

Secret hacker pro-tip: inner platforms and key/value pair representations do not actually improve extensibility. You always have a schema -- you get to decide whether it's explicit or implicit. Their problem isn't that their data model is too extensible: it's that it's plain old bad.