by VLM 5026 days ago
Quick summary, assuming you know what MVC stands for:

MySQL - Model is in your code. PostgreSQL - Model is at least partially in your database.

There is a HUGE mistake in the article in the assumption that WRT the model design, that the database always knows best. It's possible to come up with weird situations where you just want the DB to store stuff and not nanny you. Consider a database of actual, real world, gravestone inscriptions. If someone's gravestone has "1890-02-30" inscribed on it, I know that's wrong but I don't care, I need to store it exactly as is for historical purposes. I don't want a DB crash or need to recompile postgres to accept it, I don't want to force the users to falsify gravesite records, I don't want to have to store as a CHAR or VARCHAR and have to write my own date handling routines in my app... The correct way to handle data modeling/integrity is to allow the app designer to decide how flexible he wants to be WRT reality, and let him decide exactly how to shoot himself in his foot.

On a bigger scale, if I made a database table and related CRUD app to store philosophical positions, if I wanted an AI to only accept "truth" then I'd put the AI in my app, I would not want the DB model and the app model to have to fight over Marx being right or wrong before the data could be stored. What if filesystems needed to verify "truth" before allowing a file to be saved? Weird.

Is it persistent storage or is it a Turing-complete theorem prover, and why must both be in the same executable? Note I'm not claiming a "middleware" of a model is a bad idea, in fact it's a great idea, it just doesn't belong in the persistent DB store any more than it belongs in the filesystem layer.


I think you missed the _reason_ why that matters.

If it's _just_ your app/code that's sending and retrieving data from the database, you can pretty much do as you please.

If other code, especially other code written by other people needs to interact with that data, then explicit rules and agreements need to be made about exactly what "1890-02-30" means.

The argument in the article is that Postgres (and Oracle) have features that help when multiple applications interface with the same database, compared to the MySQL and NoSQL end of the database spectrum.

It's not so much "the model" that's moved into the database, but the validation of the values stored by your model.

I think it's making a better argument than you imply. If you want to be able to store 30 Feb in your model, you'd better consider what might happen if you try and store that in a date column in your database. I'm pretty sure at least some versions of MySQL will happily let you insert that date, and "magically" return 02 (or 01) March when you query it. Is that the "expected behavior" of your Gravestone app?

Author here.

> MySQL - Model is in your code. PostgreSQL - Model is at least partially in your database.

Also your code can be at least partly in your database which is what makes this possible.

> There is a HUGE mistake in the article in the assumption that WRT the model design, that the database always knows best.

I didn't say that. However if you read the entire O/R modelling series you will see in PostgreSQL it is possible to fully define your model in an OO-like way in your database, and if you do that then that model can be re-used across applications written in different development environments. We now have proof of concept PHP classes for integrating with LedgerSMB because of the fact that our model is in our db. This makes it very easy to write classes which interop across different languages.

The fact is you can decide where you want the line to be. PostgreSQL allows you to build interfaces which give you much more intelligent data models at every line.

These tools have complexity costs though. Use where appropriate.

> Is it persistent storage or is it a Turing-complete theorem prover, and why must both be in the same executable? Note I'm not claiming a "middleware" of a model is a bad idea, in fact it's a great idea, it just doesn't belong in the persistent DB store any more than it belongs in the filesystem layer.

Mike Stonebraker's example was: Create a db query to tell you what images (in your database) are pictures of sunsets taken within 20 miles of Sacramento.

His argument for code being in the database is that the last thing you want to do is select several thousand images and hand them over to the middleware or client for processing. Instead you need some way of having the database answer this and only send you back the ones you want. He suggests:

    select id
    from slides P, landmarks L, landmarks S
    where sunset (P.picture) and
    contains (P.caption, L.name) and
    L.location |20| S.location and
    S.name = 'Sacramento';
The point here is that you have two good examples of why this approach can be important: spatial queries, and filtering out images by content using image recognition algorithms. In this way, you aren't burdening your least scalable tier with transferring MB and MB of information back to a middleware so it can perform the processing and return only a few records to the client.
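
A minimal sketch of the "let the database do the filtering" idea, using Python's sqlite3 module as a stand-in. Everything here is an assumption for illustration: the `slides` table layout, the sample rows, and especially `is_sunset`, which is a hypothetical placeholder where a real image classifier would go. SQLite runs in-process, so this only shows the shape of the query (the predicate is evaluated by the database engine and only matching ids come back), not the network savings described above.

```python
import sqlite3


def is_sunset(picture: bytes) -> int:
    # Hypothetical stand-in for image recognition: a real classifier would go here.
    return 1 if picture.startswith(b"SUNSET") else 0


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as sunset(picture).
conn.create_function("sunset", 1, is_sunset)

conn.execute(
    "CREATE TABLE slides (id INTEGER PRIMARY KEY, caption TEXT, picture BLOB)"
)
conn.executemany(
    "INSERT INTO slides (id, caption, picture) VALUES (?, ?, ?)",
    [
        (1, "Tower Bridge at dusk", b"SUNSET:pixels"),
        (2, "State Capitol at noon", b"NOON:pixels"),
        (3, "Old Sacramento riverfront", b"SUNSET:pixels"),
    ],
)

# The engine applies the predicate; the client only ever sees the ids.
sunset_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM slides WHERE sunset(picture) ORDER BY id")
]
print(sunset_ids)
```

The picture blobs never leave the engine in this version; the alternative would be `SELECT id, picture FROM slides` followed by a filter loop in the client.
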
"There is a HUGE mistake in the article in the assumption that WRT the model design, that the database always knows best.

I didn't say that. However if you read the entire O/R modelling series you will see in PostgreSQL it is possible to fully define your model in an OO-like way in your database"

Perhaps the area of disagreement in our interpretations is that I'm thinking "the database knows best" as in the DBA gets the last word on what can be stored vs the DEV whereas I think you're defining the data definition in the DB as a DEV task, or maybe all DEVs should be both DBA and DEV, which I don't think will work very often but when it does work it's great.

"and if you do that then that model can be re-used across applications written in different development environments."

Again, would be great if it's possible. Probably one very important part of the workflow would be not to allow the DEVs to code in a MVC framework, essentially VC only, or just vestigial M like anything goes and rely solely on the DB for all data modeling. Otherwise each environment will have a different, probably incompatible, model.

"PostgreSQL allows you to build interfaces which give you much more intelligent data models at every line." "In this way, you aren't burdening your least scalable tier" There's no free lunch, only tradeoffs. In your case at least in one example, it works great and I'd glad for you, there is no better proof than working code / working system. However in general for most situations I don't think it would work very well at all.

> Perhaps the area of disagreement in our interpretations is that I'm thinking "the database knows best" as in the DBA gets the last word on what can be stored vs the DEV whereas I think you're defining the data definition in the DB as a DEV task, or maybe all DEVs should be both DBA and DEV, which I don't think will work very often but when it does work it's great.

Maybe, but I don't think I passed judgement on that issue. What I think I was saying was that if you have multiple applications writing to the same relation, you have to assume lax data controls on the part of every other writing app. I suspect, as I put in the article, that your view is that the API level should be app-level only, with web services instead of db queries.

> Again, would be great if it's possible. Probably one very important part of the workflow would be not to allow the DEVs to code in a MVC framework, essentially VC only, or just vestigial M like anything goes and rely solely on the DB for all data modeling. Otherwise each environment will have a different, probably incompatible, model.

Ok, let's look carefully at the role an ORDBMS plays in this, it is as an information model not a behavior model. The former is more or less a proper subset of the latter.

So things we can model are storage and retrieval stuff:

1) Save a GL transaction. Is it balanced? Throw error if not.

2) What is the balance of the checking account?

3) Store the info assuming we dispose of asset '12345-56665' by selling it for $100.

Things we should not do:

1) Presentation layer stuff

2) i18n stuff

3) Anything non-transactional (emails etc).

But the point is that the former category provides a save API for integration with other apps. The latter category is less important for integration. If the tools are there, however you can decide when and where they are appropriate. If they aren't there you don't have that choice.
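
A rough sketch of the first two "things we can model" (save a GL transaction only if it balances, and report an account balance). This is not LedgerSMB's schema or API: the `journal_line` table, the function names, and the integer-cents convention are all assumptions, and SQLite driven from Python stands in for what would be a stored function inside PostgreSQL. The point it tries to show is that the balance check and the insert live behind one call, so every caller gets the same rule.

```python
import sqlite3


class UnbalancedTransaction(ValueError):
    """Raised when debits and credits do not sum to zero."""


def save_gl_transaction(conn, reference, lines):
    """Insert all lines atomically; reject the whole entry if it is unbalanced.

    `lines` is a list of (account, amount_cents) pairs,
    with debits positive and credits negative.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO journal_line (reference, account, amount_cents) "
            "VALUES (?, ?, ?)",
            [(reference, account, cents) for account, cents in lines],
        )
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(amount_cents), 0) FROM journal_line "
            "WHERE reference = ?",
            (reference,),
        ).fetchone()
        if total != 0:
            raise UnbalancedTransaction(f"{reference} is off by {total} cents")


def account_balance(conn, account):
    """Sum every posted line for one account."""
    (balance,) = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM journal_line WHERE account = ?",
        (account,),
    ).fetchone()
    return balance


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal_line ("
    " reference TEXT NOT NULL,"
    " account TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)

save_gl_transaction(conn, "GL-1", [("checking", 10000), ("sales", -10000)])
try:
    # Off by 1000 cents: nothing from this entry should be stored.
    save_gl_transaction(conn, "GL-2", [("checking", 5000), ("sales", -4000)])
except UnbalancedTransaction as exc:
    print("rejected:", exc)

print(account_balance(conn, "checking"))
```

Presentation, i18n, and email stay out of it, which matches the "things we should not do" list.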

One huge tradeoff though is that as soon as you go this direction you give up on portability and get really truly locked into one ORDBMS.

I've read through all your articles regarding "object relational modelling" and am still having a problem with the notion of, "in order to do a complex relational query, we need code in the database". Stonebraker isn't entirely impartial here as he's trying to sell his own product in this area (VoltDB) which is highly dependent on the "database-side logic" approach.

There's an important tradeoff being discussed here, which is, "can we get directly the data we want from the query", versus, "do we need to load all the data into our app first and filter it there". This is of course the critical thing that a lot more people need to learn, and the work I do with SQLAlchemy is all about this. But in the SQLA approach, we use Python constructs on the app side which expand into SQL functions when rendered in a query. The effect is very similar to that which I see in most of the examples in your posts.

While I think advanced data models and rich SQL-side functionality are essential, the usage of stored procedures is IMHO not the only way to get there. In practice I often use a mix of both, depending on how verbose the function needs to be.

Keeping SQL functions as app-side constructs has the advantage of source code management. It's easier to support multiple kinds of backends (I run against PG and SQL Server a lot) since you aren't tied to a stored procedure language. There's no need to emit new stored procedure definitions to the database in order to support new features of the application. You don't have the issue of updating a stored procedure on the database side such that multiple application versions, targeted to different versions of the database function, still continue to function. I think there are ways to approach these problems in favor of SPs, but they require some thought on how the source code is maintained, managed, and deployed. For now I've just stuck with keeping most SQL functions on the app side.

The big namespacing problems I see are, what if two different kinds of "classes" want to have the same method name? The definition of a PG function here creates a name that's global to the whole schema - this suggests we may want names that are qualified with a "class name". And what if you do in fact need two versions of the same function present to support different application versions? In that case maybe we want to qualify the names of the functions with version ids as well. This actually sets up a great opportunity to use an application side system of rendering class/version qualified SQL names in response to plain names on the app side.

I guess my point is that the "app logic in stored procedures" approach is interesting, it has some management/deployment issues that also might be interesting to solve, but app-rendered SQL when using an effective enough app-side toolkit can solve the problem just as well in most cases.
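
One way the "class/version qualified names rendered on the app side" idea could look, as a sketch only. The `class__method__vN` naming scheme, both helper functions, and the `%s` placeholder style (as used by some PostgreSQL drivers) are assumptions made up for this example. This is not SQLAlchemy's API and not something proposed in the articles.

```python
import re

# Only allow simple lowercase identifiers, since the name is spliced into SQL text.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def qualified_function_name(class_name: str, method: str, version: int) -> str:
    """Render a plain method name as a class- and version-qualified SQL function name."""
    for part in (class_name, method):
        if not _IDENT.match(part):
            raise ValueError(f"not a safe SQL identifier: {part!r}")
    if version < 1:
        raise ValueError("version must be a positive integer")
    return f"{class_name}__{method}__v{version}"


def render_call(class_name: str, method: str, version: int, arg_count: int) -> str:
    """Build a parameterized SELECT that calls the qualified function."""
    placeholders = ", ".join(["%s"] * arg_count)
    name = qualified_function_name(class_name, method, version)
    return f"SELECT * FROM {name}({placeholders})"


print(render_call("asset_item", "save", 2, 3))
print(render_call("journal_entry", "save", 1, 2))
```

Application code keeps saying `save`; two application versions can target `__v1` and `__v2` of the same function side by side.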

> Stonebraker isn't entirely impartial here as he's trying to sell his own product in this area (VoltDB) which is highly dependent on the "database-side logic" approach.

Well, the quote is old, and the db he was trying to sell at the time was Informix, but I suppose there's a fair bit of truth to that. It is worth noting, however, that he suggests in that paper that RDBMS and ORDBMS engines operate in different markets.

> There's an important tradeoff being discussed here, which is, "can we get directly the data we want from the query", versus, "do we need to load all the data into our app first and filter it there". This is of course the critical thing that a lot more people need to learn, and the work I do with SQLAlchemy is all about this. But in the SQLA approach, we use Python constructs on the app side which expand into SQL functions when rendered in a query. The effect is very similar to that which I see in most of the examples in your posts.

The only reason we do what we do in Postgres is because we want to support multiple programming languages with minimal work. It is a matter of having this be an API accessible to multiple tools where some may be written in Perl, some in Python, and some in Java. If you are just writing a single app and don't want that portability, yeah, it is the wrong approach.

> Keeping SQL functions as app-side constructs has the advantage of source code management. It's easier to support multiple kinds of backends (I run against PG and SQL Server a lot) since you aren't tied to a stored procedure language.

Right. There's a huge tradeoff here between "one database with logic centralized for many apps" and "one app that runs on many databases." I am not convinced you can do both gracefully.

> The big namespacing problems I see are, what if two different kinds of "classes" want to have the same method name?

Yeah, we struggled with that, which is one reason why we are using input types to construct classes. Function overloading then solves the problem.

save(asset_item) and save(journal_entry) then both work and can be discovered as needed from the system catalogs.

I am not saying this is the right approach always. I am saying it is an approach which trades away the ideal of "one app on multiple databases" for the ideal of "one database for many apps."

Choose the right tool based on what you are doing.

"Consider a database of actual, real world, gravestone inscriptions. If someone's gravestone stone has "1890-02-30" inscribed on it, I know thats wrong but I don't care, I need to store it exactly as is for historical purposes, I don't want a DB crash or need to recompile postgres to accept it, I don't want to force the users to falsify gravesite records, I don't want to have to store as a CHAR or VARCHAR and have to write my own date handling routines in my app..."

I think this is a crucial point of distinction between the two philosophies: structure defined in the query, and structure defined before data is loaded.

Structure defined in the query is the more obvious approach. You collect all of the data, and write queries over that data that handle all the cases. The queries often become quite complex and error-prone. Even if the query is slightly wrong, the result generally looks about right. Queries may take a long time to develop and get right, and may react badly to new data that is loaded (e.g. "I thought that was a number field, but now it has letters"). This approach is useful when you are trying to interpret the input data in several different ways -- in other words, when the query is helping you determine the nature of the data you have.

Defining structure before loading is generally more robust and less error-prone, but requires planning that may be frustrating to people just trying to get their hands on the data. The queries generally don't have branches or special cases, so usually if the query runs at all, it will give the right answer. If someone is trying to file an expense, and the receipt says Feb 30th, the accountants still don't want to see the expense as happening on Feb 30th. If they let it in, it could (potentially) break all of their other queries by creating inconsistencies (e.g. it happens after one month is closed and before the next is opened, and it causes the accounts to be out of balance somewhere). So, the person filing the expense has some extra work to do -- maybe they need to look at their bank statement to see what day it really happened, and add a note saying the receipt has the wrong date, in case anyone does an audit.

Broadly speaking, the first philosophy is easier for writers of data, and people writing the applications that help people input data (in part, because you never have to tell the user that the data is wrong, and they need to reexamine their records). The second philosophy is easier and more reliable for the people querying the data, but harder for the people trying to input data and the people trying to write applications to help input data (because they have to handle more error cases and try to provide context so the user can correct them).

In your example, it all depends on what you are trying to ultimately do with the data you collect. The easiest thing to do is to take the data in an even more raw form: just have people take pictures and automatically upload them. But it's awfully difficult to query pictures, so you have to demand a little more structure at load time if you want to query the data at all. I'm not sure what the right balance is for you -- maybe they have a 13th month or a 32nd day, so you should just ask for 3 integers. Or maybe people put question marks or ranges (e.g. born sometime between X and Y), and you want to represent those as well. But the more of that you do, the more burden you put on query writers, and the higher the chance that you get wrong results.

In one of my other posts in the object-relational series I noted that select * has very different implications in an object-relational vs a strictly relational model. In a strictly relational model you want your SQL query to define your data structures on output. In an object-relational model often times you want your data structures to be formed properly so the db can do other object-relational stuff with them later. So there select * becomes very useful as a way of ensuring that the data structures on output can be simply re-used later.

Of course if you are doing pure physical storage queries, select * is probably not what you want but if you have a logical model built up, you may want to do select * from it in order to ensure that your output matches some specific set of rules.

How is MySQL going to store "1890-02-30" as a date? What internal format does it use to allow storing a date like that?
A varchar named something like "DateAsEnscribed". With a related date field that you can search on with some well defined policy about what happens when gravestones have invalid dates inscribed on them.

The problem is, you probably don't work out you need that until you've got a million rows stored in a date column, and when you discover it, you then start asking yourself "I wonder how many of our dates have been auto-magically 'corrected' from accurate-but-invalid inscribed dates into valid-but-not-as-inscribed ones."

I guess it uses a mixed-radix number with radixes 10-10-10-10-12-31 or 10000-12-31 (or, maybe, 10000-13-32 to allow for zero months and days) if the config flag ALLOW_INVALID_DATES (http://dev.mysql.com/doc/refman/5.5/en/server-sql-mode.html#...) is set.
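
To make the arithmetic of that guess concrete, here is a small sketch of the 10000-13-32 mixed-radix layout. It only illustrates the guess in the comment above; it is not a description of how MySQL actually stores dates, and the function names are made up for the example.

```python
MONTH_RADIX = 13  # months 0..12, leaving room for a zero month
DAY_RADIX = 32    # days 0..31, leaving room for a zero day


def pack_date(year: int, month: int, day: int) -> int:
    """Pack year/month/day into one integer without checking calendar validity."""
    if not (0 <= year <= 9999 and 0 <= month < MONTH_RADIX and 0 <= day < DAY_RADIX):
        raise ValueError("component out of range for the 10000-13-32 layout")
    return (year * MONTH_RADIX + month) * DAY_RADIX + day


def unpack_date(packed: int) -> tuple:
    """Reverse pack_date by peeling off the day, then the month."""
    rest, day = divmod(packed, DAY_RADIX)
    year, month = divmod(rest, MONTH_RADIX)
    return year, month, day


packed = pack_date(1890, 2, 30)  # (1890 * 13 + 2) * 32 + 30
print(packed, unpack_date(packed))
```

A layout like this round-trips 1890-02-30 unchanged and still sorts in year, month, day order, because nothing in it knows how long February is.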

I still fail to see why anybody would want that or even the default 'if you cannot figure it out, use 0000-00-00' mode, though. That flag makes a broken system more broken, and if someone wants more flexibility in storing dates, he could always use char(8) or so.

In the context of this article: if you use your database as a dumb store and put all logic in your application, why would you let MySQL decide for you that, e.g., 2000-12-34 becomes 0000-00-00 and not, for instance, 2001-01-03?

>if you use your database as a dumb store and put all logic in your application, why would you let MySQL decide for you that, e.g., 2000-12-34 becomes 0000-00-00 and not, for instance, 2001-01-03?

I'm not letting MySQL decide for me intentionally. My application should be checking my dates; if I ever get as far as attempting to store 2000-12-34 in the database, it's because I made a mistake in my code.

So when live customer data discovers some untested path, what do you want to happen? In my experience in real applications, silently storing "corrupt" data (which I can fix by hand as soon as I discover the bug) is better than throwing an error back to the end user, and those are pretty much the only options.

> silently storing "corrupt" data (which I can fix by hand as soon as I discover the bug) is better than throwing an error back to the end user,

I think this statement might be a good test if you want to predict which camp they will fall into.

I frequently store enough data that fixing anything by hand is a large task and my experience with these types of errors is that this silently corrupted data (no need for quotes, that's what corrupted means) is sometimes corrupted in a lossy way, so you can't fix it by hand or in any other way.

Even if you can fix it by hand and it's not lossy I still find the fail fast philosophy is right most of the time, I want an error logged so I get notified and can fix it even if that means that an end user sees an error (there was one after all).

I might be biased having had the experience of exactly this type of mysql error destroying months of data that was the result of very expensive marketing because no one noticed until they tried to analyze it. Mysql was silent and our testing had missed it (if it had thrown an error our testing would have easily found it).

CHAR(10) worst case, or probably something a lot more like rowname.year INT, rowname.month INT, etc. Yes, you could do your own homemade date type like that in postgres and your query would look like "SELECT *" and then you'd write your own date DBMS routines, but it would be icky. Compare the execution time of "Select * from blah order by somedate limit 10" on each design, especially if the DB and webserver are on separate boxes.

It comes down to the fundamental question of who defines bad data, the DEV in his model or the DBA in his table design. Worst case is both, with no coordination; second worst case is both with coordination (wasted effort).

I suspect, given a large enough sample of gravestones, you'd find inscriptions like "Christmas Day 1832" or "The last day of Winter 1906". I suspect the argument for keeping the "30-02-1890" data intact would apply equally to my made-up examples. I'd design this with a "date as inscribed" varchar column, and an "interpreted date for search/sorting purposes" date column.
I'd do varchar with a table method and a check constraint. Not hard, not a lot of effort. Still allows for conversion.
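
A minimal sketch of the two-column design suggested above, with SQLite and Python standing in for whatever database and app are actually used. The table and column names are hypothetical, and the interpretation policy here (store NULL whenever the inscription is not a valid ISO calendar date) is just one assumption; a real project would pick its own rule.

```python
import sqlite3
from datetime import date
from typing import Optional


def interpret(inscribed: str) -> Optional[str]:
    """Return an ISO date string if the inscription is a valid calendar date, else None."""
    try:
        return date.fromisoformat(inscribed).isoformat()
    except ValueError:
        # Covers impossible dates like 1890-02-30 and free text alike.
        return None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gravestone ("
    " id INTEGER PRIMARY KEY,"
    " date_as_inscribed TEXT NOT NULL,"   # exactly what is carved on the stone
    " interpreted_date TEXT)"             # NULL when no valid date can be derived
)
for text in ["1890-02-30", "1832-12-25", "Christmas Day 1832"]:
    conn.execute(
        "INSERT INTO gravestone (date_as_inscribed, interpreted_date) VALUES (?, ?)",
        (text, interpret(text)),
    )

inscriptions = [
    row[0]
    for row in conn.execute("SELECT date_as_inscribed FROM gravestone ORDER BY id")
]
searchable = [
    row[0]
    for row in conn.execute(
        "SELECT interpreted_date FROM gravestone "
        "WHERE interpreted_date IS NOT NULL ORDER BY interpreted_date"
    )
]
print(inscriptions)
print(searchable)
```

The inscription is never altered, and sorting or range queries only ever touch the interpreted column.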

A more interesting question becomes what happens when you have to store local calendar values which are non-Gregorian, like '1712-02-30', which was a date that existed in Sweden (due to a rare reversion from a partly converted calendar back to the Julian calendar). PostgreSQL treats all dates as Gregorian and so Julian dates and weird pseudo-Julian dates (the double leap day to abort the failed conversion to the Gregorian calendar) have to be handled by conversion.

This is good and consistent. If you are recording dates and you need to know what date they represented you need a consistent calendar. If you want to convert Gregorian to Julian that can be done, but you'd have to code that no matter what db you are working with.

Otherwise you run into weird issues like determining the length of an interval across two calendars where you may not know that because calendars changed at different times in different countries.
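
As a sketch of the kind of conversion code being described, the following maps a Julian-calendar date onto Python's proleptic Gregorian `date` by going through the Julian Day Number, using the standard integer formula. The function names are made up for the example, and handling Sweden's one-off 1712 calendar would need an extra mapping step that is not shown here.

```python
from datetime import date


def julian_calendar_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a date given in the Julian calendar."""
    a = (14 - month) // 12          # 1 for January and February, else 0
    y = year + 4800 - a
    m = month + 12 * a - 3          # March = 0 ... February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn: int) -> date:
    # Python ordinals count days from 0001-01-01 in the proleptic Gregorian
    # calendar, and ordinal = JDN - 1721425.
    return date.fromordinal(jdn - 1721425)


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    return jdn_to_gregorian(julian_calendar_to_jdn(year, month, day))


leap_day = julian_to_gregorian(1712, 2, 29)
print(leap_day)

# Interval between a Julian-calendar date and a Gregorian-calendar date:
# convert both to the same calendar first, then subtract.
gap = date(1712, 3, 12) - leap_day
print(gap.days)
```

Once both values are on one calendar, interval arithmetic is ordinary date subtraction, which is the reason for normalizing before storing.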

Wasn't it you who said "I don't want to have to store as a CHAR or VARCHAR and have to write my own date handling routines in my app"? rowname.year and rowname.month sounds a lot like writing your own date handling routines.
And I am confused as to why you wouldn't use varchar or char to record, you know, inscribed writings. I mean if it says 1890-03-300 I assume you'd want the extra zero recorded, right?
+1 for that

That's also my impression and a reason why MySQL is a great DB for ORM-driven Apps.