| >Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job. SQL is very useful, but there are some data manipulations which are much easier to perform in pandas/dplyr/data.table than in SQL. For example, the article discusses how to perform a pivot table, which takes data in a "long" format, and makes it "wider". In the article, the pandas version is: >pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count') Compared to the SQL version: >SELECT
role,
SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D",
SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales"
FROM
emp
GROUP BY
role; Not only does the SQL code require you to know up front how many distinct columns you are creating, it requires you to write a line out for each new column. This is okay in simple cases, but is untenable when you are pivoting on a column with hundreds or more distinct values, such as dates or zip codes. There are some SQL dialects which provide pivot functions like in pandas, but they are not universal. There are other examples in the article where the SQL code is much longer and less flexible, such as binning, where the bins are hardcoded into the query. |
But after some trial and error, I find it much faster to pull relatively large, unprocessed datasets and do everything in Pandas on the local client. Faster both in total analysis time, and faster in DB cycles.
It seems like a couple of simple "select * from cars" and "select * from drivers where age < 30", and doing all the joining, filtering, and summarizing on my machine, is often less burdensome on the db than doing it up-front in SQL.
Of course, this can change depending on the specific dataset, how big it is, how you're indexed, and all that jazz. Just wanted to mention how my initial intuition was misguided.