Hacker News new | ask | show | jobs
by heyadayo 4488 days ago
This was a fun blog post, and was shockingly prescient: SQLAlchemy rose right about this time to become one of my favorite libraries and an impressive feat of solving the ORM issue with finality.

They have a sort of low-level python SQL api, a higher level declarative ORM layer, and a custom glue that let's you wire objects and sql together in arbitrary but useful & maintainable ways.

I think this was possible because the SQLAlchemy devs had this insight: "SQL databases behave less and less like object collections the more size and performance start to matter; object collections behave less and less like tables and rows the more abstraction starts to matter."

3 comments

Yeah, I think sqlalchemy broke the mold. It starts from the idea that you already know and understand SQL and relational constructs well, then adds a layer of abstraction that lets you skip tons of boilerplate and write really elegant code. Personally, while there are lots of good decisions in sqlalchemy like the unit of work pattern (in contrast to the active record pattern), the one that pays off time and again is representing the SQL ast as python objects. It eliminates and entire class of text munging issues that occur when you write raw SQL and just completely makes you wonder "where's the mismatch they keep talking about?"
Do you know of any Java-OR mappers in the same spirit?
Java: http://www.jooq.org/

Scala: http://slick.typesafe.com/

They also have the benefit of being fully type-safe across query projections.

I don't think Slick can be considered a very SQL-centric API. While the Slick folks embrace the relational model, they certainly do not embrace the SQL language. Just as with LINQ, this can be desirable if you want to reason about collections in a more general sense. On the other hand, the SQL standard has gone far beyond the "occasional" OUTER JOIN that can turn out to be a true challenge to Slick.

In other words, SQL has never been purely relational. It is a beast of its own.

I've been there with SQLAlchemy in late 2006, and I vowed never to return.

Let me state this, SQLAlchemys pool/engine/SQL-Api is great, and it's got nothing whatsoever todo with ORM.

But SQLAlchemy does contain an ORM, and it does become unmaintainable just like every other ORM. It roughly goes like this:

1) You use the ORM

2) Things get slow

3) You write a bit of custom mappers

4) Soon all your logic can't be operated without custom mappers

5) Congratulations, you've now written all the handwired code you would have written without an ORM, plus all the ORM code on top, it's now unmaintainable, full of bugs and oversights AND slow.

And that about sums up everybodies experience with ORMs unless they do trivially small things. It's at this junction people begin thinking about solving this ORM problem. And they learn it's bloody hard.

Obligatory XKCD reference: There are N ORMs on the market, they all suck. Let's solve all these problem and do a new ORM to rule them all. There are now N+1 ORMs that all suck on the market.

Second this experience. A single row pk lookup using SQLAlchemy adds 30ms overhead to a query. That is huge. To avoid this, last time I touched it, I ended up just creating my own DB access library that spit out simple objects and used raw SQL underneath. I believe for any large system that is going to be the answer.

Having said that, do not underestimate the number of apps doings "trivial" things. There are lots. The problem with SQLAlchemy I think is that it is actually advanced enough to have you use it right up to the point where you must rewite all your code because it suddenly is just dead slow. In contrast the Django ORM is too simplistic to get too far with it.

Sure for simple things, of which there are many, ORMs can work sufficiently up to a point.

But, the problem is that, you rarely get to go to your boss and say, well, this thing we've been using to do, you know, our data stuff, it's completely wrong. We need to throw everything away and start from scratch.

The only situation in which you get to say that, is when you're explicitly make it clear that you're writing a raw prototype that'll not scale, and that nothing short of a complete rewrite is gonna be required as soon as you finish (and even then, it can turn out tricky to actually do that).

Unfortunately, most people who get themselves into the ORM mess don't realize that this coming for them in the future. So they don't go their boss and make the prototype proposal. This all but cements the entire failure of the project right then.

I don't know my main experience is with the Django ORM. Others have mentioned that some ORMs seem to start from the relational model, and transform those into objects (which sounds like how the Django ORM works to me).

I find this works well. It speeds up development of all the simple stuff , and by the time it gets complicated, you can drop down to SQL.

I don't find the two ways of thinking incompatible, in fact the "extra" method on Djangos ORM is an intermediate where you let the ORM do its normal simple style of mapping, but allows you to add extra clauses to the SQL manually. Once you do this you see how the ORM works, and can choose quite easily when it is appropriate to use it.

You loose a bit of the magic that comes with Django by the time your queries get more complex (like auto generated admin pages), but the benefits outweigh the costs immensely in my case.

I am in the same boat as you currently. I work with Django almost exclusively these days and use the ORM in every project. The difference is that I am working with mostly normalized databases and datasets in the 100's of megabytes. This is a great place to use an ORM. You get all the simplicity and quick prototyping and pay very few penalties. When you have a slow query you can usually address it by adding some select_related() or prefetch_related() and you are done.

What this thread is talking about is when you have datasets that are several gigabytes in size and up, when you have many dozens of tables, all interrelated and when you need to squeeze as much performance out of your hardware as you can. When your code gets complex enough where you cannot just insert an object, but have to do all manner of verification through half the DB first. Or where you need to operate on 100,000 rows in an intelligent manner very fast (think using stored procedures and complex query parameters). This is a place that you might end up and if your entire codebase depends on an ORM, you are often out of luck.

In the long run, I think there several things to do: (1) know SQL well, (2) use an ORM when you know the project will be small, and (3) anticipate that if you are growing and using an ORM, you will need a re-write.

I agree, and my application is pretty much what you describe in your last paragraph.

80% of the queries are simple and use the ORM directly. A few need an extra clause for stuff that isn't handled by the ORM, and I have a couple that need to be high performance (and are abstract enough that they don't fit well with an ORM), which are done via the raw SQL.

I completely agree with your premise, don't get me wrong. I think the ORM is a great way to do a small project that you know won't grow big (there are plenty more of these than big projects and most of the time you can tell. For example, a small-ish e-commerce site I was involved in last year).

My experience was somewhat different in terms of ditching SQLAlchemy: the performance started to suffer as we kept adding clients. Also, developers started tripping over their own feet to do simple things as relationships got much more complex and the ORM did not allow us to impose certain rules easily (data did not always flow through the ORM as it was due to historical and performance reasons). Instead, a co-worker and I sat down and wrote a small library that did the common things for you. It was broken up into modules and we standardized on some specific methods for each "object": get, find, insert, delete, update. Each object also had non-standard more interesting methods that allowed us to manipulate relationships in a controlled fashion. The management did not mind this as it took very little time, was a gradual transition (we converted performance critical code first, then anytime we touched anything else).

Some things I learned from this:

1. Creating and maintaining a library like this is actually very little work. Who cares if it's raw SQL if it's super simple and you only write it once. Something like "SELECT * FROM vehicle WHERE id = ?" does not need unit testing, etc. and almost never changes. On the other hand the crazy complex queries we had were actually simpler in raw SQL than trying to express them in the ORM. To use an ORM efficiently you must know SQL already, except you can usually write better, clearer SQL than an ORM can.

2. The benefits in performance were huge. We could do all manner of dirty hacks inside the library functions since the library was self-contained.

3. Migrations were a pain, until we created a simple text file where we would append the latest migration. I used a tool that compared schemas to verify that the live database was in line with what a blank initialization of the DB looked like. This could definitely use some better tooling but practically this was a small pain.

4. The benefits from code that used this library were even bigger than the performance benefits. Other developers on the team were locked into what that could do with the data. If they needed to create a new method for a given table/object, the went ahead, but anything complex typically got reviewed by at least one other person and bugs were caught early. Moreover, all the data manipulation now happened in one single library so it was much harder to introduce inconsistencies.

5. We got to actually use our RDBMS. ORM's try to give you the lowest common denominator between the ORM's they support. They don't typically let you do views, sharding/partitioning (in the RDBMS sense, not in the application), SQL functions/procedures, etc. We got to do all of this and it often got to be the better solution.

I played around with sqlalchemy a bit this week, it seems you still have to dive down to the bare relational manual mapping (by creating a table by hand) to do something simple, like many-to-many mappings. Is there a better way to do it?