It probably depends on who you ask, but personally I can write pandas much faster than I can load data into a DB and process it there. The reason is that the pandas defaults on the read_*/from_* and to_* functions are very sane, so you don’t need to think about things like escaping strings. It’s also easy to deal with nulls quickly in pandas and to get some EDA graphs out rapidly, like in R.
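To make the "sane defaults" point concrete, here is a minimal sketch. The inline CSV and its column names are made up for illustration; only the pandas calls themselves are real.

```python
import io
import pandas as pd

# Hypothetical sample data: one quoted field with an embedded comma, two missing values.
raw = io.StringIO(
    'name,city,amount\n'
    '"Smith, Jane",Boston,10.5\n'
    'Lee,,\n'
    'Ng,Austin,7\n'
)

# The defaults already handle the quoting and turn empty fields into NaN.
df = pd.read_csv(raw)

# Quick look at how many nulls each column has.
missing_per_column = df.isna().sum()

# Fill the gaps: a placeholder for text, the column median for numbers.
clean = df.fillna({"city": "unknown", "amount": df["amount"].median()})

# Writing back out re-quotes the embedded comma with no manual escaping.
csv_text = clean.to_csv(index=False)
```

Doing the same through a database would typically mean defining a table, picking column types, and deciding how empty strings map to NULL before you can look at anything.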
The other benefit of pandas is that it’s in Python, so you can use your other data analysis libraries, whereas with SQL you need to marshal data back and forth between Python and SQL.
My usual workflow is:
Explore the data in pandas/datasette (if it’s big data, I explore just a sample, which I pull out with bash tools) -> write my notebook in pandas -> scale it up in Spark/Dask/Polars depending on the use case.
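I use bash tools for the sampling step, but if you would rather stay in Python, something like the following should work. This is a rough sketch, not the exact thing I run: `sample_csv` is a hypothetical helper, and the in-memory data stands in for a real large file.

```python
import io
import random
import pandas as pd

def sample_csv(source, fraction, seed=0):
    """Read roughly `fraction` of the data rows, always keeping the header."""
    rng = random.Random(seed)
    # skiprows is called with each row index; index 0 is the header, so never skip it.
    return pd.read_csv(
        source,
        skiprows=lambda i: i > 0 and rng.random() > fraction,
    )

# Hypothetical stand-in for a large file: 10,000 rows generated in memory.
big = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10_000)))

# Keep about 1% of the rows for exploration.
sample = sample_csv(big, fraction=0.01)
```

Rows are dropped while the file is being parsed, so the full dataset never has to fit in memory.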
This works well because ChatGPT understands pandas, PySpark, and SQL really well, so you can easily ask it to translate scripts or give you code for different things.
On scalability: if you need scale, there are many options today for processing large datasets with a dataframe API, e.g. Koalas, Polars, Dask, Modin, etc.
>>Only if you 1) don't know SQL and 2) working with tiny datasets that are around 5% of your total RAM.
This is only true for newbie Python devs who learned about pandas from blogs on medium.com. I have pipelines that process terabytes per day in a serverless data lake, and they require none of the DBA work that usually comes with anything *SQL.
I've processed TBs of CSV files with pandas. You can always read files in chunks, and in the end SQL also has to read its data from a disk somewhere.
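Reading in chunks looks roughly like this. It is a minimal sketch: the in-memory data and the `user`/`bytes` columns are made up, and a real job would point `read_csv` at a file path instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file too large to load at once.
big = io.StringIO("user,bytes\n" + "".join(f"u{i % 3},{i}\n" for i in range(1_000)))

partials = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
for chunk in pd.read_csv(big, chunksize=100):
    # Aggregate each chunk on its own so only 100 rows are in memory at a time.
    partials.append(chunk.groupby("user")["bytes"].sum())

# Combine the per-chunk sums into one total per user.
totals = pd.concat(partials).groupby(level=0).sum()
```

This works for any aggregation that can be computed per chunk and then merged (sums, counts, min/max). Things like exact medians need a different approach.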