| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by m16ghost 1878 days ago

>Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job.

SQL is very useful, but there are some data manipulations which are much easier to perform in pandas/dplyr/data.table than in SQL. For example, the article discusses how to perform a pivot table, which takes data in a "long" format, and makes it "wider".

In the article, the pandas version is:

>pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')

Compared to the SQL version:

>SELECT role, SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D", SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales" FROM emp GROUP BY role;

Not only does the SQL code require you to know up front how many distinct columns you are creating, it requires you to write a line out for each new column. This is okay in simple cases, but is untenable when you are pivoting on a column with hundreds or more distinct values, such as dates or zip codes.

There are some SQL dialects which provide pivot functions like in pandas, but they are not universal.

There are other examples in the article where the SQL code is much longer and less flexible, such as binning, where the bins are hardcoded into the query.

6 comments

kbelder 1878 days ago

I've been doing a lot of data analysis in Pandas recently. I started off thinking that for efficiency's sake, I should do as much initial processing in the DB as possible, and use Pandas just for the higher level functions that were difficult to do in SQL.

But after some trial and error, I find it much faster to pull relatively large, unprocessed datasets and do everything in Pandas on the local client. Faster both in total analysis time, and faster in DB cycles.

It seems like a couple of simple "select * from cars" and "select * from drivers where age < 30", and doing all the joining, filtering, and summarizing on my machine, is often less burdensome on the db than doing it up-front in SQL.

Of course, this can change depending on the specific dataset, how big it is, how you're indexed, and all that jazz. Just wanted to mention how my initial intuition was misguided.

link

ziml77 1878 days ago

I've always been disappointed by the SQL pivot. It's hardly useful for me if I have to know up-front all of the columns it's going to pivot out into. The solution would be to use another SQL query to generate a dynamic SQL query, but at that point I would rather just use Pandas

link

mr_toad 1878 days ago

> There are some SQL dialects which provide pivot functions like in pandas, but they are not universal.

And you need to define the columns in advance anyway, because query planners can’t handle columns that change at runtime.

link

hantusk 1878 days ago

Agreed. https://ibis-project.org/ and https://dbplyr.tidyverse.org/ can compile dataframe-like input to SQL, which might bridge the gap in tooling (although there still are small differences to the pure dataframe syntax)

link

michelpp 1878 days ago

Does `tablefunc.crosstab()` do what you want?

https://www.postgresql.org/docs/13/tablefunc.html

link

m16ghost 1878 days ago

It is not much better than the canonical example given in the article. It still has the following usability issues:

-You still need to enumerate and label each new column and their types. This particular problem is fixed by crosstabN().

-You need to know upfront how many columns are created before performing the pivot. In the context of data analysis, this is often dynamic or unknown.

-The input to the function is not a dataframe, but a text string that generates the pre-pivot results. This means your analysis up to this point needs to be converted into a string. Not only does this disrupt the flow of an analysis, you also have to worry about escape characters in your string.

-It is not standard across SQL dialects. This function is specific to Postgres, and other dialects have their own version of this function with their own limitations.

The article contains several examples like this where SQL is much more verbose and brittle than the equivalent pandas code.

link

papercrane 1878 days ago

That's one of the non-standard ways to do it. MSSQL and Oracle also have a pivot function to do this. Unfortunately there is no standard way to do this.

link

ellis-bell 1878 days ago

agreed that pivoting can be a pain.

a pattern that i converged on --- at least in postgres --- is to aggregate your data into json objects and then go from there. you don't need to know how many attributes (columns) should be in the result of your pivot. you can also do this in reverse (pivot from wide to long) with the same technique.

so for example if you have the schema `(obj_id, key, value)` in a long-formatted table, where an `obj_id` will have data spanning multiple rows, then you can issue a query like

``` SELECT obj_id, jsonb_object_agg(key, value) FROM table GROUP BY obj_id; ```

up to actual syntax...it's been awhile since i've had to do a task requiring this, so details are fuzzy but pattern's there.

so each row in your query result would look like a json document: `(obj_id, `{"key1": "value", "key2": "value", ...})`

see https://www.postgresql.org/docs/current/functions-json.html for more goodies.

link