|
I've gained a few years of experience with SQL-based OLAP systems at my current job. In that time I developed a strong appreciation for SQL, especially for its composability. Recently, I started a project in Google Colab, gluing together queries from several systems with Pandas DataFrames. I can honestly say that I've never been more frustrated learning an API than I have with Pandas.

Need some window function like LAG() or LEAD()? Too bad, I hope you like writing Python "for i in range(...):" loops. My notebook is littered with ".reset_index()" calls, ".replace(np.nan, None)", "axis='columns'", and "foo.assign(bar=lambda df: df.apply(lambda row: ...))". groupby is especially confusing to me, as a Pandas GroupBy is difficult to compose with a normal DataFrame until you call .reset_index(). Compare this to SQL, where a subquery is a subquery, whether or not it has a GROUP BY clause.

The Pandas documentation also leaves a lot to be desired. Take the documentation of pandas.NaT[1], for example: "pandas.NaT: alias of NaT". Ok? That still doesn't tell me what NaT is, nor does it link to the thing that it aliases. The groupby documentation[2] also caused me some headaches, as it covers only the simplest aggregation use-cases.

Pandas is clearly better for some use-cases, but mostly for simple operations that are well-supported by the API (perhaps numeric operations that are implemented with native numpy routines). But if I'm doing some interactive OLAP stuff, I'll reach for SQL. Perhaps the problem is that I'm trying to use Pandas like it's SQL, when it's not. But for manipulating data, I'd rather use a language than a library.

[1] https://pandas.pydata.org/docs/reference/api/pandas.NaT.html
[2] https://pandas.pydata.org/docs/user_guide/groupby.html

edit: half a sentence |
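For reference, here is a minimal sketch of the two patterns mentioned above: LAG()/LEAD()-style columns and a grouped aggregate that can be joined back like any other DataFrame. The `sales` table and its column names are hypothetical, invented for illustration. `groupby(...).shift()` and `as_index=False` are existing pandas features, shown here as one possible approach rather than the only or recommended one.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["east", "east", "east", "west", "west"],
    "day": [1, 2, 3, 1, 2],
    "amount": [10, 20, 30, 5, 15],
})

# Rough analogue of LAG(amount) / LEAD(amount)
# OVER (PARTITION BY region ORDER BY day).
sales = sales.sort_values(["region", "day"])
by_region = sales.groupby("region")["amount"]
windowed = sales.assign(
    prev_amount=by_region.shift(1),   # LAG: previous row within the region
    next_amount=by_region.shift(-1),  # LEAD: next row within the region
)

# as_index=False keeps "region" as an ordinary column, so the result is a
# plain DataFrame and no .reset_index() call is needed before merging.
totals = sales.groupby("region", as_index=False).agg(total=("amount", "sum"))

# Roughly: JOIN (SELECT region, SUM(amount) AS total ... GROUP BY region)
joined = windowed.merge(totals, on="region")
```

The first and last row of each region get NaN in `prev_amount` and `next_amount` respectively, which mirrors the NULLs that LAG() and LEAD() produce at partition boundaries.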
Pandas is easily the worst dataframe API.
I'll never go back to SQL from polars; it's far superior in both composability and readability imo.
Not to mention complex transforms can be version controlled and unit tested, and then you can compose these together.
It also maps to/from SQL quite naturally.