Hacker News new | ask | show | jobs
by dspillett 1751 days ago
> SQL has a steep learning curve.

Overall I don't think it is that steep, though maybe I'm blinded by having worked with various implementations of it for more than two decades. The key sticking point is jumping to thinking in a set based manner to get best results. The rest of the difficult parts are when you need to think about implementation details because the query planners are no perfect (index hints and such) or being aware of limitations (like postgres before the latest major version having optimisation fences around CTEs).

> It expose almost zero insights into the underlying query execution.

That is pretty much by design. It is intended that you say what you want and let the query planner worry about implementation details. Of course how you tell it what you want involves learning to express those intentions in SQL. It does fall apart a bit when implementation limitations become an issue, at which point you are forced to think about the underlying implementation and how you might prod this more imperative code so that it interprets and process your relational descriptions most efficiently.

> As a result, increasingly amount of inefficient queries are being written by ML engineers

That isn't specific to ML. I see a lot of inefficient data interaction from code written by other devs. This seems to be for two reasons:

1. People seem to have taken to heart “make it work, make it work correctly, only then worry about making it work fast” to heart but tend to skip that last part and assume because all is well with their test sets of data at hundreds or thousands of rows (or sometimes tens and singles) that it'll scale just find to the hundreds of thousands or more that the clients datasets will eventually contain.

2. People using further abstractions without much care for how they implement their directives (again, in an ideal world they shouldn't have to), resulting in massively overcomplex queries as the framework tries to be clever and helpful and preempt what might be needed, getting everything whether needed or not (effectively `SELECT `) meaning the query planner can't apply families of its internal tricks for better performance, or getting many rows individually instead of as a set which sometimes means a lot of extra work for each row.

There is a definite “we'll worry about that when it happens attitude in both cases which is dangerous. While a live system has practically ground to a halt and the client needs their report by EOP or someone will get it in the neck (and be sure: they will pass that on to you!) is not a good time to be optimising data access, or worse finding out the structure just doesn't support efficient generation of the required data. Another common failing is applying what would idealy be UI or BLL concerns (timezone conversions etc) in the SQL statements in a way that blocks index use.

> Should we think about an alternative, at least for ML ETL workloads?*

I don't work with ML so that is a little outside my day-to-day wexpertise, but I'd wager ETL there has the same problem as everywhere: the basics are all well known and very well optimised for already. The rest differ so much between applications that no one abstraction would be optimal for more than a small portion of real world needs.

I'd be wary of a separate team for optimising queries. I suggest a reasonable understanding in the whole dev team with a data expert embedded who is involved in design work and code reviews so issues are caught early and junior devs can be tutored as needed so by the time they are seniors they don't need the data expert except for really gnarly problems or long-term planning.