Hacker News new | ask | show | jobs
by metamicah 3361 days ago
Some of these functions (like SIGN, STUFF, and PARSENAME) set off little alarms in my head that sound vaguely like a full-stack developer furiously yelling "DON'T HANDLE THIS IN YOUR DATABASE LAYER."

Of course we don't always have that option, but I feel like I have to acknowledge those alarms if I'm going to use them.

2 comments

> DON'T HANDLE THIS IN YOUR DATABASE LAYER.

I implore you to reconsider. I used to hold this opinion very strongly, have since changed my mind, and began to question why I ever held it in the first place. Now I realize it was rather simple: the first few teams I worked on had a database administrator that wasn't one of the engineers. We would avoid doing anything in the database because the iteration cycle would take months, involve many emails, and we'd end up with something different than what we wanted. In other words, it was a social problem, not a technical one.

If you empower engineers to change the database, these problems go away. At MixRank, we deploy a handful of schema migrations every day— they're a normal part of our development flow. In particular, we embrace constraints (who doesn't like avoiding invalid data?), and triggers extensively. Once we started using triggers, we noticed all sorts of awkward and inefficient patterns in our code that were contortions to avoid adding a single function into the database.

I'm currently really feeling the limitations of MySQL's auto-increment and of InnoDB's inability to deal with constraints that include multiple tables for a complex data source I am trying to efficiently represent in a local schema. Having a trigger which calls a function seems so much more powerful (if admittedly more boilerplate for the simple case).

I've always shied away from UDF, but that's more from not wanting another language to deal with and since there's fairly poor support for them in MySQL (which I'm stuck with). For example, the default mysqldump options for all databases don't include UDF unless you specifically request them with--routines. :/

STUFF is actually needed in the database layer to not kill performance of your app. SQL server has no built in convenience function the equivalent of mysql's GROUP_CONCAT. Instead, you do a subquery in the selected field, interpret it as XML, pull out the specific XML elements you are looking for as a list, and then use STUFF to join them.

As much as you might recoil in horror at this (and I still do), it's actually fairly performant because the optimizer recognizes the subquery is dependent on the main query, and does the equivalent of a join under the covers or something.

E.g.

    # MSSQL
    SELECT
      movie.id,
      movie.name,
      STUFF( (SELECT ','+producer.last_name FROM actor WHERE producer.movie_id = movie.id FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(32)'),
    FROM movie;

    # MySQL
    SELECT
      movie.id,
      movie.name,
      GROUP_CONCAT(producer.last_name),
    FROM movie
      LEFT JOIN producer ON movie.id = producer.movie_id
Although SQL Server vNext (14.x) will get string_agg: https://msdn.microsoft.com/en-us/library/mt790580.aspx

In the latest version of the SQL standard (SQL:2016) the LISTAGG function was added for this purpose. I find it "disturbing" that the next SQL Server release gets a function for this, but the new function doesn't follow the new standard.

Even worse: the name "string_agg" was apperently borrowed from PostgreSQL. But if you think that SQL Server will stick to the syntax of PostgreSQL's string_agg — nope. They use the same function name, but a different syntax.

Microsoft double fail, I'd say.

I've just written an article about LISTAGG, btw: http://modern-sql.com/feature/listagg

That doesn't exactly surprise me. MSSQL feels like MySQL's old MyISAM table type technology that's been hacked and extended for decades rather than a shift to something more technologically competent. I still get errors on a regular basis about the deadlock manager killing queries because they became deadlocked. Because of course the way to deal with your system easily running into deadlocks is just to implement a checker that runs once a second and kills those queries that deadlock, and not to actually make the system work better....

This is even more ludicrous when you consider all the extremely cool stuff the team responsible for porting MSSQL to other platforms accomplished in order to get it functioning on Linux[1].

1: https://arstechnica.com/information-technology/2016/12/how-a...

> Because of course the way to deal with your system easily running into deadlocks is just to implement a checker that runs once a second and kills those queries that deadlock, and not to actually make the system work better....

If your database design and code frequently deadlocks, I'd argue it is not a sign of the database engine being incompetent. After all, it's doing what you asked it to do. The comparison to MySQL is simply not true and it ignores the years of new features and enhancements to the product.

I admit it's entirely possible most of my bad experiences with SQL Server is due to a horrible schema. I don't control the schema in this case, nor the majority of different types of queries that run against that schema.

What I do know is that I routinely get single, non-transactional select queried deadlocked and killed. That seems like something that shouldn't happen. If I had to guess, it's likely because of the subselects to approximate group_concat, as outlined above, and they have different lock times than the main query, allowing for this single query to attampt to get locks at different times. Unfortunately, this is an ORM generated query, so sticking NOLOCK all over the place is easier said than done, and since I don't control the DB, I can't change default transactional levels. Perhaps that would solve it too.

It just seems odd that what appears to be (from my admittedly non SQL server expert) eyes to be a fairly standard database schema is so easily susceptible to this problem.

I find their locking issues interesting: they are supporting snapshot isolation for quite some time now, but it's still not default and the community even hardly knows about it. Hint's like NOLOCK are more often suggested as "fix" as considering snapshot isolation.
Unfortunately, my experience with SQL Server is almost entirely in interfacing with a third party applications SQL Server back-end. As such, I have zero control over the many queries that are run from the application clients and the schema in use. I just get to play in their sandbox. Unfortunately, like many sandboxes, it's also a litter box. :/
Every time I've used STUFF that way I've had hopelessly atrocious performance. Simply dire. In the end I've found that re-implementing GROUP_CONCAT using SQLCLR is far more performant.

But ultimatley, the fact that queries are so tied to the flat resultset is infuriating, and that sucks for every SQL engine. I want a graph of results, not a goddamned glorified spreadsheet.

There are ORMs that mitigate that limitation, but ultimately they're getting compiled down to SQL so in the middle there's a big flat resultset coming back or a bunch of redundant queries.

> There are ORMs that mitigate that limitation, but ultimately they're getting compiled down to SQL so in the middle there's a big flat resultset coming back or a bunch of redundant queries.

Yes. It's important to be very aware of this. Letting the ORM do this for you automatically for a few chained joins can result in a massive resultset that is literally 90%+ redundant information depending on the strategy the ORM uses.

Which, imho, is why we need to stop making excuses for SQL. It's better than the terrible "NoSQL" alternatives for a lot of reasons, but it's still far outstayed its welcome.

We're past due for something new. Something that natively works with the typed-graph thinking of the client code instead of forcing everything into the flat resultset, but otherwise keeps the ACID guarantees and the relational model.

I think PostgreSQL makes this the clearest because it supports arrays:

    SELECT
      movie.id,
      movie.name,
      array_remove(array_agg(producer.last_name), NULL) AS last_names
    FROM movie
    LEFT JOIN producer ON movie.id = producer.movie_id
    GROUP BY
      movie.id,
      movie.name;
Just FYI: starting with 9.4, PostgreSQL supports the FILTER clause. This would change this:

> array_remove(array_agg(producer.last_name), NULL) AS last_names

to that

> array_agg(producer.last_name) FILTER(WHERE producer.last_name IS NOT NULL) AS last_names

I wrote about FILTER here: http://modern-sql.com/feature/filter

And how to use ARRAYs rather than concatenated strings here: http://modern-sql.com/feature/listagg#alternative-array

I wasn't aware of that feature. Now I really need to push for upgrading PostgreSQL to a newer version.
I used STUFF all the time until I found and implementation [https://groupconcat.codeplex.com/] of MySQL's GROUP_CONCAT as a SQLCLR function (well, actually they had to create multiple variants to allow sorting and custom delimiters).

It's certainly more ergonomic than STUFF, and I've found its performance to be comparable or better.