Hacker News new | ask | show | jobs
by kbenson 3365 days ago
STUFF is actually needed in the database layer to not kill performance of your app. SQL server has no built in convenience function the equivalent of mysql's GROUP_CONCAT. Instead, you do a subquery in the selected field, interpret it as XML, pull out the specific XML elements you are looking for as a list, and then use STUFF to join them.

As much as you might recoil in horror at this (and I still do), it's actually fairly performant because the optimizer recognizes the subquery is dependent on the main query, and does the equivalent of a join under the covers or something.

E.g.

    # MSSQL
    SELECT
      movie.id,
      movie.name,
      STUFF( (SELECT ','+producer.last_name FROM actor WHERE producer.movie_id = movie.id FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(32)'),
    FROM movie;

    # MySQL
    SELECT
      movie.id,
      movie.name,
      GROUP_CONCAT(producer.last_name),
    FROM movie
      LEFT JOIN producer ON movie.id = producer.movie_id
4 comments

Although SQL Server vNext (14.x) will get string_agg: https://msdn.microsoft.com/en-us/library/mt790580.aspx

In the latest version of the SQL standard (SQL:2016) the LISTAGG function was added for this purpose. I find it "disturbing" that the next SQL Server release gets a function for this, but the new function doesn't follow the new standard.

Even worse: the name "string_agg" was apperently borrowed from PostgreSQL. But if you think that SQL Server will stick to the syntax of PostgreSQL's string_agg — nope. They use the same function name, but a different syntax.

Microsoft double fail, I'd say.

I've just written an article about LISTAGG, btw: http://modern-sql.com/feature/listagg

That doesn't exactly surprise me. MSSQL feels like MySQL's old MyISAM table type technology that's been hacked and extended for decades rather than a shift to something more technologically competent. I still get errors on a regular basis about the deadlock manager killing queries because they became deadlocked. Because of course the way to deal with your system easily running into deadlocks is just to implement a checker that runs once a second and kills those queries that deadlock, and not to actually make the system work better....

This is even more ludicrous when you consider all the extremely cool stuff the team responsible for porting MSSQL to other platforms accomplished in order to get it functioning on Linux[1].

1: https://arstechnica.com/information-technology/2016/12/how-a...

> Because of course the way to deal with your system easily running into deadlocks is just to implement a checker that runs once a second and kills those queries that deadlock, and not to actually make the system work better....

If your database design and code frequently deadlocks, I'd argue it is not a sign of the database engine being incompetent. After all, it's doing what you asked it to do. The comparison to MySQL is simply not true and it ignores the years of new features and enhancements to the product.

I admit it's entirely possible most of my bad experiences with SQL Server is due to a horrible schema. I don't control the schema in this case, nor the majority of different types of queries that run against that schema.

What I do know is that I routinely get single, non-transactional select queried deadlocked and killed. That seems like something that shouldn't happen. If I had to guess, it's likely because of the subselects to approximate group_concat, as outlined above, and they have different lock times than the main query, allowing for this single query to attampt to get locks at different times. Unfortunately, this is an ORM generated query, so sticking NOLOCK all over the place is easier said than done, and since I don't control the DB, I can't change default transactional levels. Perhaps that would solve it too.

It just seems odd that what appears to be (from my admittedly non SQL server expert) eyes to be a fairly standard database schema is so easily susceptible to this problem.

I find their locking issues interesting: they are supporting snapshot isolation for quite some time now, but it's still not default and the community even hardly knows about it. Hint's like NOLOCK are more often suggested as "fix" as considering snapshot isolation.
Unfortunately, my experience with SQL Server is almost entirely in interfacing with a third party applications SQL Server back-end. As such, I have zero control over the many queries that are run from the application clients and the schema in use. I just get to play in their sandbox. Unfortunately, like many sandboxes, it's also a litter box. :/
Every time I've used STUFF that way I've had hopelessly atrocious performance. Simply dire. In the end I've found that re-implementing GROUP_CONCAT using SQLCLR is far more performant.

But ultimatley, the fact that queries are so tied to the flat resultset is infuriating, and that sucks for every SQL engine. I want a graph of results, not a goddamned glorified spreadsheet.

There are ORMs that mitigate that limitation, but ultimately they're getting compiled down to SQL so in the middle there's a big flat resultset coming back or a bunch of redundant queries.

> There are ORMs that mitigate that limitation, but ultimately they're getting compiled down to SQL so in the middle there's a big flat resultset coming back or a bunch of redundant queries.

Yes. It's important to be very aware of this. Letting the ORM do this for you automatically for a few chained joins can result in a massive resultset that is literally 90%+ redundant information depending on the strategy the ORM uses.

Which, imho, is why we need to stop making excuses for SQL. It's better than the terrible "NoSQL" alternatives for a lot of reasons, but it's still far outstayed its welcome.

We're past due for something new. Something that natively works with the typed-graph thinking of the client code instead of forcing everything into the flat resultset, but otherwise keeps the ACID guarantees and the relational model.

I think PostgreSQL makes this the clearest because it supports arrays:

    SELECT
      movie.id,
      movie.name,
      array_remove(array_agg(producer.last_name), NULL) AS last_names
    FROM movie
    LEFT JOIN producer ON movie.id = producer.movie_id
    GROUP BY
      movie.id,
      movie.name;
Just FYI: starting with 9.4, PostgreSQL supports the FILTER clause. This would change this:

> array_remove(array_agg(producer.last_name), NULL) AS last_names

to that

> array_agg(producer.last_name) FILTER(WHERE producer.last_name IS NOT NULL) AS last_names

I wrote about FILTER here: http://modern-sql.com/feature/filter

And how to use ARRAYs rather than concatenated strings here: http://modern-sql.com/feature/listagg#alternative-array

I wasn't aware of that feature. Now I really need to push for upgrading PostgreSQL to a newer version.
I used STUFF all the time until I found and implementation [https://groupconcat.codeplex.com/] of MySQL's GROUP_CONCAT as a SQLCLR function (well, actually they had to create multiple variants to allow sorting and custom delimiters).

It's certainly more ergonomic than STUFF, and I've found its performance to be comparable or better.