Hacker News new | ask | show | jobs
by sdevonoes 1751 days ago
I takes days/weeks to pick up the core of SQL:

- create tables, update the schema, insert rows, add an index

- select, filters, joins, order by, limit, inner queries

It takes forever to be comfortable with:

- anything that involves summarizing, grouping, having, min, max, windows

7 comments

I think one tip here is to always work with CTEs "WITH", and not nest queries. That way you can always go back and check "what is it I join with what", by querying the individual steps with LIMIT 10.

The other tip is to sketch the problem in excel/google sheets when it gets hairy. Not the actual code (I don't have a clue how to do that, others have), just the values in the different steps. In the end it is only about rows and columns.

But that said, these days a lot of it happens intuitively for me, I pretty much know the solution before I can spell it out. It certainly was not like that when I started.

When you begin, "programming without for loops" feels like programming with your right hand tied behind your back. But in hindsight you get a lot of exercise in the immutable paradigms of functional programming, working with comprehensions, sets, maps folds comes very natural.

Be careful with blanket CTE recommendations. They tend to deoptimize with writes in really surprising ways. If you’re just reading data, they’re great.
I would still say they are not great as most engines just... write your sql out, instead of saving/memoizing anything.
All of my experience here is with Postgres, where (in recent versions) read-only CTEs are basically temporary views with all the built in optimizations that come from that.
This is what I thought would be a good tip when I first learned about CTEs. Turns out I use them sparingly and much prefer to create views to represent steps of data refinement. The benefit of doing it this way is that I can inspect intermediate results. Often I use *.sql files for this so I can keep the SQL formulations for later; also, as long as stuff is still experimental / under development, I start out with `begin transaction;` because then I'm guaranteed I do not permanently change anything in the DB I'm working on and also I can repeat all steps without having to care about `create or replace view ...`.

Oh and lest I forget you can't just re-use a CTE in another query. But you can of course re-use a view. Also given what another user here remarked, Postgres might internally treat (and optimize) CTEs like views, so to me that makes views superior to CTEs in more or less all respects.

Have some links on using CTEs to replace nested queries and how it helps ?

I've used SQL enough to have to write nested queries, haven't dove further than that.

Here's an article with some examples:

https://learnsql.com/blog/sql-subquery-cte-difference/

You can think of CTEs as a way to save an intermediate query as a sort of temp variable/table that can be used in the final statement following `WITH something AS (...)`. They are great for flattening your queries and giving descriptive names to subqueries, and you can chain multiple WITHs together as well.

This is good advice but sometimes you will need to nest things depending on database and optimizer and how it handles CTEs, especially if you ever mess with recursive things. An additional problem here is that there’s no standard way to represent EXPLAIN queries across systems so that makes an additional barrier to entry unfortunately.
> An additional problem here is that there’s no standard way to represent EXPLAIN queries across systems so that makes an additional barrier to entry unfortunately.

What so you mean by this?

The outputs from each DBMS are often specific to their implementation, even when there is a lot of overlap. The good news is they are usually similar enough and user friendly enough that an advanced user will have no issue but for beginners it can be daunting.
Anyway beginners learn by just using single RDBMS, isn't it?
Not necessarily. I often see people learn starting with SQLite because it's easy/installed (and quite a bit of tutorials start with it, especially for Python), and then move to MySQL/Postgres, but that is N=2 so it's not too bad.
Yep, CTEs are a huge boon to structure your SQL - use them where you can.
I've always found it hard to articulate the problems that i have with SQL and with the "WITH" CTEs, but let me try anyways.

For starters, i can never actually tests parts of those queries without rewriting the query up to the part that i want to test, for example:

  WITH 
    query_one AS (SELECT ...),
    query_two AS (SELECT ...),
    query_three AS (SELECT ...)
    SELECT ... /* main query */
If i want to test the second query, i need to take the first and second ones, copy them into a new worksheet and then rewrite the second one not to have the alias but instead be the main query. This is annoying when you have 5-10 CTEs and you need to test something in the middle.

Then, working with SQL and CTEs feels like going back from a language where functions are first class citizens to one where no such thing exists, just in regards to querying data. It would be nice if i could store parts of queries under packages, to be able to write dynamic SQL more easily, instead of having to use tools like myBatis for this purpose: https://mybatis.org/mybatis-3/sqlmap-xml.html (see the bit about SQL fragments)

So i'd like to do the following:

  PACKAGE my_snippets BODY IS
    SNIPPET query_one
      SELECT ... /* probably 500 lines long but often used snippet */
    END query_one;
  END my_snippets;
  
  /* and then, somewhere in code */
  WITH
    query_one AS my_snippets.query_one,
    query_two AS (SELECT ...),
    query_three AS (SELECT ...)
    SELECT ... /* main query */
Now, you might suggest that using views works for this intent, but what about most DBMSes out there having silly naming rules and restrictions? I don't want to work with v_mtz_wg_priv_prod_attr because someone thought that having just a few dozen characters makes sense as a restriction. Furthermore, you really can't group views into logical packages based on their intent, now can you? So, with views you end up with something that's very much like your cluttered list of tables, which gets really hard to get a good overview of when you have about 300 of them.

Next up, debugging in databases is just really bad. How am i supposed to put logging in the queries, without mixing the logging code with the other triggers and tables? What about debugging long running processes? What about adding breakpoints that i can trigger when a particular view or table is accessed? What about doing this on the server while i have a local app instance connected to the DB, or maybe even another app server? Why can't i step through the query execution and see how the filtered record count changes with each "step"?

Apart from that, my problems are largely with the tooling around databases. There are relatively few universal (cross language) DB migration solutions out there, for example dbmate, every framework seems to have its own approach. There seems to be this odd division between procedural SQL and regular SQL statements, where what you can do differs based on context, which is inconsistent. Procedural languages as a whole vary wildly in what they can do - you won't be doing complex logic with custom types on MySQL/MariaDB anytime soon, whereas Oracle or PostgreSQL will suffice. But even those two have different dialects, it's never "just SQL". There are oddities with selecting certain kinds of data, only pgAdmin seems to work nicely with geospatial data, but apart from that i've also seen problems with using lower level JDBC logic which you can't really test outside of the app, in something like SQL Developer. But even apart from that, as much as we like ER diagrams, MySQL Workbench is the only tool that i've seen which allows you to actually do model driven development properly and synchronize schemas and do forward/backward engineering - even pgAdmin fails at doing this. Oh, and the tools themselves are really inconsistent - you'll see a world of difference between MySQL Workbench, pgAdmin, SQL Developer, JetBrains DataGrip and others.

And now those DBMSes are attempting to add more functionality, such as exposing REST interfaces, instead of fixing the underlying and dated problems, because people out there are relying on those and therefore the logic is set into stone. It's no wonder that every year there's a new product or two that attempt to improve upon these, even if most of the time those products die out.

Perhaps the above is a stream of consciousness with some annoying things that i've dealt with over the years, but personally, relational databases are something that i use because they're often the least horrible tool for the job, even if they are not pleasant or easy to use, at least as easy as they should be. That's where i think the main problem lies - tools should be good for solving the problems on which they'll be used, these ones aren't.

Someone with 20 years of experience might have a different outlook, but personally i'd suggest that you utilize DBMSes for what they're good for - storing, retrieving and manipulating data and don't get too carried away with in database processing otherwise, since doing certain things within the app code seems to scale horizontally far more easier in some situations, has better auditability, debugging etc.

Long reply for a workday, but a quick reaction from your first example (with a slight risk that I misunderstood your problem).

What you do here is to also wrap the main query into a CTE and then end everything with SELECT * from main_query.

Then you can easily change that last clause to do SELECT FROM query_two while you keep everything as is, even the CTE you called main_query.

That's indeed a useful trick, but wouldn't it be better to be able to execute pieces of SQL without altering the actual code?

For example, selecting the CTEs up to a certain point and using a particular button or keyboard shortcut within a development tool, much like we can already do with executing selection?

Of course, adding "clever" functionality like that might as well create some risks and inconsistencies, so i'm not entirely sure about that.

Yes, this is only a quick trick and does not solve the general question about unit/integration testing that you are used to in other types of programming.

This is an entirely different subject and deserves a long discussion, and one which SQL is not ideal for.

As you said, breaking CTEs into separate views is the start, and then you can use a tool like getdbt.com to make your references into parameters. And then you also need to create the mock data (which you can do for example by writing it into csv:s)

From the role of an analyst I must say though, that once you have done all that work, the risk is that you forgot to worry about to the biggest risk here. What is actually inside that data you are querying? Maybe your biggest problem is not the logic of the query, but rather how dirty your input data is? Or for that matter, that you made completely incorrect assumptions on your input data, like that column X contains distinct values when it contains duplicates. That type of error wreaks havoc on your end result, with a big chance you'll never notice.

I have to agree with most of what you're saying. Another commenter also voiced concerns about how SQL is sometimes a bit difficult to work with because the underlying domains and concerns that are handled by it are also inherently complex in many ways.

In regards to the correctness of the data and the quality, there are at least constraints that you can put into the DB, but i've personally seen plenty of cases where that isn't even considered and is forgotten about, without even getting into the OTLT and EAV anti-patterns and the implications that they may have: https://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-bi...

In every case where i've seen someone attempt to introduce polymorphic foreign keys, a lot of those consistency guarantees and control mechanisms have gone out the window, since you can't really have conditional constraints or complex logic in there either. Though thankfully not everyone has to deal with things like that in their projects.

One cannot overstate how important modelling your data is, to the point where schema-first is a strong and reliable approach most of the time.

Postgres gives you named views, named schemas, and transactions that undo even table and view creations, so that's what I use to implement an iterative dev cycle.
Not so sure about aggregation, but what definitely takes forever is performance.

To make a non-trivial SQL query scale to non-trivial amounts of data, you have to understand the physical data organization and how query optimizer is likely to use it, which is kind of contradictory to the idea of the SQL as a "declarative" language where you just say what you want, and let the query optimizer figure out how to get it.

Instead, you have to design your indexes carefully to coax the optimizer into choosing a reasonable access path for your particular query. And do the same for all queries where performance is important.

Indexes are fundamentally not about data, but about access patterns. Which is what the developers are responsible for. That's why physical database design is a development task, not database administration task.

I think RDB is one of the most leaky abstraction. Despite that, it's still very useful.
Something other which is hard is writing performant queries. Using statements with subqueries/in syntax for example.

And I always forget which join does what.

Remembering which join does what is easy: inner joins strictly joins the tables, left takes all on left (first table), right takes all on right (second table), outer (or cross) join are so rarely used you don't need to memorize.
> outer (or cross) join are so rarely used you don't need to memorize.

I’d say they are more important to know than right joins.

same with the joins. on paper it makes sense, in practice it does not. if its more than a "select * from dbo.whatever where column abc = 'thing'" i have to refer back to notes and play with it.
Basically inner/left/right is a choice on how you want your NULLs: do you want all the rows that match, all the rows on the left, or all the rows on the right?
Meh, summarizing, grouping, etc aren't that hard.

However WINDOW queries definitely have a learning curve. Not the least because useful examples almost always require you to use a nested query.

I learned SQL on a need to know basis. For me, recursive queries were the ones that needed the most time to click.

Another one that caught me by surprise was NULL vs unknown[1]. That bit me in a couple of queries.

[1]: https://learnsql.com/blog/understanding-use-null-sql/

The problem with NULL in SQL is that the semantics are inconsistent in complex ways. See for example https://vettabase.com/blog/what-does-null-mean-in-sql/.
Recursive queries are definitely top of the complexity pile IMO. However I usually discount them because I am yet to actually need them in a production environment. Window functions are super useful on the other hand.
Heh, perhaps it's more accurate to say I've abused recursive queries.

One instance was to turn a column containing comma separated values into rows[1], so I could join on them. Wasn't for a query that needed performance of course.

[1]: https://news.ycombinator.com/item?id=28020321

I had this same issue in Redshift and ended up populating a table with values 1 to the maximum number of commas found (e.g. using max(regexp_count(...)) or something), then cross joining on the table with the csv column and calling split_part on the corresponding column and index (with the index coming from the numbers table). The cross join ensures that you index every value of the csv column.
What I think you'd do here (in modern implementations of SQL at least) is to split the values into an array using some convenient split function and self join on that array to pivot the arrays into rows.
> some convenient split function

If only the DB we're using had one of those :)

A lot of people try to do stuff that doesn't actually make sense when it comes to groups. Like selecting a column which isn't contained in the group by. And they are confused by the error but when you talk them through it "What did you actually want to see? There are multiple values for this column now" it starts to become clear to them.
Not really. I used to teach SQL not a long time ago and about 1/4 of the trainees were getting up to speed fast, about half in a reasonable time, the rest were there only because they were sent there by their managers.

I found that the most important success factors in learning SQL is the analytical thinking of the trainee and the way the trainer is explaining the concepts, in what order and what examples are used (the best examples are the ones the trainees meet in their regular work).

The functions are simple, the only difficulty is to remember the ones that are not used often enough (ex: some window functions). Even in that case, a quick check in the documentation is enough to get up to speed. The major difficulty with SQL is to write efficient queries on large data volumes, covered by the right indexes. This is very specific to each RDBMS, especially because of the tools helping with the work are specific (ex: SSMS, SQL Sentry Plan Explorer, statistics parser etc).

If all you do is SQL, anybody can learn it quick. If you do full stack, you're only gonna care about what gives you the data you need at that time.
No offence, but a full stack's job is not to write good SQL, it's to write enough SQL to get what is needed, then the development DBA's job is to make it fast and efficient. You don't need to be a great car mechanic to drive to the office and back.
I have never worked somewhere that has a dedicated database person. The SQL goes in the application code so it is the developers job to commit stuff which is performant and safe.

There is no point having developers write sql at all if someone else has to come in and redo it after.

I have a team of 4 people doing part time exactly this: improving performance of queries from various apps. My team does not know the business logic (many apps, not enough people), the developers don't have the knowledge to build good SQL code for databases with tables of hundreds of GB each. If you write small web apps it is not a problem, but if you have hundreds of servers with that size of databases, development DBAs are a must.
You assume a large company with lots of specialised roles. But lot's of full-stack developers work in small companies where they are the only DBAs. I for example have never worked at a company that employed DBAs.
At most places I've worked, there are no DBAs, and the devs are responsible for optimising all the SQL.

On the occasion that there have been DBAs at the company I worked for, they always refused to help with any SQL, on the grounds that all SQL is "application level", and insisting that they were only responsible for configuring / deploying / monitoring the DB infrastructure.

The configuring/deploying/monitoring the DB infrastructure DBAs are production DBAs. The ones tuning query performance are developer DBAs. If you put the large queries in stored procedures instead of the application, they can be easily touched and improved by DBAs.

No need of DBAs if your database is up to a few GB, you cannot live without DBAs if you exceed 100GB. I have several hundred SQL servers with databases exceeding 1 TB, on average several hundreds of GB each. This is where performance tuning is essential.

Personally I feel that it took me a while to get really comfortable with more complex joins. There’s an problem they used in the study that required joining a table with itself, and honestly I would probably take a while to come up with that answer, if at all.
A basic approach(probably what they are going for in a basics study) would be something like this off the top of my head

    select c.cid, c2.cid
    from customer as c
    inner join customer as c2 on c.street = c2.street
    where c.city <> c2.city
though that has reflective duplicates say (1, 5) would also have (5, 1) in the output. So I'm not sure if that's "allowed"
There are lots of cases where a join with yourself is applicable, although they are mostly superseded by window functions these days.

For example normalisation (join with a groupby/sum of yourself) or rank (join each row with all rows that have lower value than yourself and count those rows).

But as I mentioned above. A good start is to sketch that out in excel. You will realize that what you need is another column (e.g. total sum for this id). And from that you can work yourself backwards to figure out what is the table you need to join with to create that column.

I'm not sure if you are replying to the wrong person but the question has nothing to do with a total sum of ids...

the question was: "List all pairs of customer IDs who live on a street with the same name but in a different city." listed under self-join

that said i haven't wrangled with raw sql in a spell so the reading on window functions is interesting.

1. GP mentioned hairy self joins 2. You replied with an example 3. I pointed out that there are other examples.

That said, I have a feeling your duplicates can be fixed by adding the requirement that c.cid < c2.cid

Not sure a window function would help in this particular situation, but they are there to help in more mundane examples.

I see how it can be read that way yeah. they mentioned the study used a self join problem that they couldn't likely figure out so I looked at the paper and quickly did it off the top of my head. the less than predicate does indeed do the trick.
and worst of all, anything that involves vendor-specific keywords…