Hacker News new | ask | show | jobs
by fifilura 1756 days ago
I think one tip here is to always work with CTEs "WITH", and not nest queries. That way you can always go back and check "what is it I join with what", by querying the individual steps with LIMIT 10.

The other tip is to sketch the problem in excel/google sheets when it gets hairy. Not the actual code (I don't have a clue how to do that, others have), just the values in the different steps. In the end it is only about rows and columns.

But that said, these days a lot of it happens intuitively for me, I pretty much know the solution before I can spell it out. It certainly was not like that when I started.

When you begin, "programming without for loops" feels like programming with your right hand tied behind your back. But in hindsight you get a lot of exercise in the immutable paradigms of functional programming, working with comprehensions, sets, maps folds comes very natural.

6 comments

Be careful with blanket CTE recommendations. They tend to deoptimize with writes in really surprising ways. If you’re just reading data, they’re great.
I would still say they are not great as most engines just... write your sql out, instead of saving/memoizing anything.
All of my experience here is with Postgres, where (in recent versions) read-only CTEs are basically temporary views with all the built in optimizations that come from that.
This is what I thought would be a good tip when I first learned about CTEs. Turns out I use them sparingly and much prefer to create views to represent steps of data refinement. The benefit of doing it this way is that I can inspect intermediate results. Often I use *.sql files for this so I can keep the SQL formulations for later; also, as long as stuff is still experimental / under development, I start out with `begin transaction;` because then I'm guaranteed I do not permanently change anything in the DB I'm working on and also I can repeat all steps without having to care about `create or replace view ...`.

Oh and lest I forget you can't just re-use a CTE in another query. But you can of course re-use a view. Also given what another user here remarked, Postgres might internally treat (and optimize) CTEs like views, so to me that makes views superior to CTEs in more or less all respects.

Have some links on using CTEs to replace nested queries and how it helps ?

I've used SQL enough to have to write nested queries, haven't dove further than that.

Here's an article with some examples:

https://learnsql.com/blog/sql-subquery-cte-difference/

You can think of CTEs as a way to save an intermediate query as a sort of temp variable/table that can be used in the final statement following `WITH something AS (...)`. They are great for flattening your queries and giving descriptive names to subqueries, and you can chain multiple WITHs together as well.

This is good advice but sometimes you will need to nest things depending on database and optimizer and how it handles CTEs, especially if you ever mess with recursive things. An additional problem here is that there’s no standard way to represent EXPLAIN queries across systems so that makes an additional barrier to entry unfortunately.
> An additional problem here is that there’s no standard way to represent EXPLAIN queries across systems so that makes an additional barrier to entry unfortunately.

What so you mean by this?

The outputs from each DBMS are often specific to their implementation, even when there is a lot of overlap. The good news is they are usually similar enough and user friendly enough that an advanced user will have no issue but for beginners it can be daunting.
Anyway beginners learn by just using single RDBMS, isn't it?
Not necessarily. I often see people learn starting with SQLite because it's easy/installed (and quite a bit of tutorials start with it, especially for Python), and then move to MySQL/Postgres, but that is N=2 so it's not too bad.
Yep, CTEs are a huge boon to structure your SQL - use them where you can.
I've always found it hard to articulate the problems that i have with SQL and with the "WITH" CTEs, but let me try anyways.

For starters, i can never actually tests parts of those queries without rewriting the query up to the part that i want to test, for example:

  WITH 
    query_one AS (SELECT ...),
    query_two AS (SELECT ...),
    query_three AS (SELECT ...)
    SELECT ... /* main query */
If i want to test the second query, i need to take the first and second ones, copy them into a new worksheet and then rewrite the second one not to have the alias but instead be the main query. This is annoying when you have 5-10 CTEs and you need to test something in the middle.

Then, working with SQL and CTEs feels like going back from a language where functions are first class citizens to one where no such thing exists, just in regards to querying data. It would be nice if i could store parts of queries under packages, to be able to write dynamic SQL more easily, instead of having to use tools like myBatis for this purpose: https://mybatis.org/mybatis-3/sqlmap-xml.html (see the bit about SQL fragments)

So i'd like to do the following:

  PACKAGE my_snippets BODY IS
    SNIPPET query_one
      SELECT ... /* probably 500 lines long but often used snippet */
    END query_one;
  END my_snippets;
  
  /* and then, somewhere in code */
  WITH
    query_one AS my_snippets.query_one,
    query_two AS (SELECT ...),
    query_three AS (SELECT ...)
    SELECT ... /* main query */
Now, you might suggest that using views works for this intent, but what about most DBMSes out there having silly naming rules and restrictions? I don't want to work with v_mtz_wg_priv_prod_attr because someone thought that having just a few dozen characters makes sense as a restriction. Furthermore, you really can't group views into logical packages based on their intent, now can you? So, with views you end up with something that's very much like your cluttered list of tables, which gets really hard to get a good overview of when you have about 300 of them.

Next up, debugging in databases is just really bad. How am i supposed to put logging in the queries, without mixing the logging code with the other triggers and tables? What about debugging long running processes? What about adding breakpoints that i can trigger when a particular view or table is accessed? What about doing this on the server while i have a local app instance connected to the DB, or maybe even another app server? Why can't i step through the query execution and see how the filtered record count changes with each "step"?

Apart from that, my problems are largely with the tooling around databases. There are relatively few universal (cross language) DB migration solutions out there, for example dbmate, every framework seems to have its own approach. There seems to be this odd division between procedural SQL and regular SQL statements, where what you can do differs based on context, which is inconsistent. Procedural languages as a whole vary wildly in what they can do - you won't be doing complex logic with custom types on MySQL/MariaDB anytime soon, whereas Oracle or PostgreSQL will suffice. But even those two have different dialects, it's never "just SQL". There are oddities with selecting certain kinds of data, only pgAdmin seems to work nicely with geospatial data, but apart from that i've also seen problems with using lower level JDBC logic which you can't really test outside of the app, in something like SQL Developer. But even apart from that, as much as we like ER diagrams, MySQL Workbench is the only tool that i've seen which allows you to actually do model driven development properly and synchronize schemas and do forward/backward engineering - even pgAdmin fails at doing this. Oh, and the tools themselves are really inconsistent - you'll see a world of difference between MySQL Workbench, pgAdmin, SQL Developer, JetBrains DataGrip and others.

And now those DBMSes are attempting to add more functionality, such as exposing REST interfaces, instead of fixing the underlying and dated problems, because people out there are relying on those and therefore the logic is set into stone. It's no wonder that every year there's a new product or two that attempt to improve upon these, even if most of the time those products die out.

Perhaps the above is a stream of consciousness with some annoying things that i've dealt with over the years, but personally, relational databases are something that i use because they're often the least horrible tool for the job, even if they are not pleasant or easy to use, at least as easy as they should be. That's where i think the main problem lies - tools should be good for solving the problems on which they'll be used, these ones aren't.

Someone with 20 years of experience might have a different outlook, but personally i'd suggest that you utilize DBMSes for what they're good for - storing, retrieving and manipulating data and don't get too carried away with in database processing otherwise, since doing certain things within the app code seems to scale horizontally far more easier in some situations, has better auditability, debugging etc.

Long reply for a workday, but a quick reaction from your first example (with a slight risk that I misunderstood your problem).

What you do here is to also wrap the main query into a CTE and then end everything with SELECT * from main_query.

Then you can easily change that last clause to do SELECT FROM query_two while you keep everything as is, even the CTE you called main_query.

That's indeed a useful trick, but wouldn't it be better to be able to execute pieces of SQL without altering the actual code?

For example, selecting the CTEs up to a certain point and using a particular button or keyboard shortcut within a development tool, much like we can already do with executing selection?

Of course, adding "clever" functionality like that might as well create some risks and inconsistencies, so i'm not entirely sure about that.

Yes, this is only a quick trick and does not solve the general question about unit/integration testing that you are used to in other types of programming.

This is an entirely different subject and deserves a long discussion, and one which SQL is not ideal for.

As you said, breaking CTEs into separate views is the start, and then you can use a tool like getdbt.com to make your references into parameters. And then you also need to create the mock data (which you can do for example by writing it into csv:s)

From the role of an analyst I must say though, that once you have done all that work, the risk is that you forgot to worry about to the biggest risk here. What is actually inside that data you are querying? Maybe your biggest problem is not the logic of the query, but rather how dirty your input data is? Or for that matter, that you made completely incorrect assumptions on your input data, like that column X contains distinct values when it contains duplicates. That type of error wreaks havoc on your end result, with a big chance you'll never notice.

I have to agree with most of what you're saying. Another commenter also voiced concerns about how SQL is sometimes a bit difficult to work with because the underlying domains and concerns that are handled by it are also inherently complex in many ways.

In regards to the correctness of the data and the quality, there are at least constraints that you can put into the DB, but i've personally seen plenty of cases where that isn't even considered and is forgotten about, without even getting into the OTLT and EAV anti-patterns and the implications that they may have: https://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-bi...

In every case where i've seen someone attempt to introduce polymorphic foreign keys, a lot of those consistency guarantees and control mechanisms have gone out the window, since you can't really have conditional constraints or complex logic in there either. Though thankfully not everyone has to deal with things like that in their projects.

One cannot overstate how important modelling your data is, to the point where schema-first is a strong and reliable approach most of the time.

Postgres gives you named views, named schemas, and transactions that undo even table and view creations, so that's what I use to implement an iterative dev cycle.