
I guess it depends on who you ask, but personally I can write pandas much faster than I can load data into a DB and then process it. The reason is that the defaults on pandas’ read_*/from_* and to_* functions are very sane, and you don’t need to think about things like escaping strings. It’s also easy to deal with nulls quickly in pandas and to get some quick EDA graphs, like in R.
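
To make that concrete, here is a minimal sketch of what I mean, assuming a tiny made-up CSV (the column names and values are hypothetical). It only shows that the default read/write settings cope with quoted commas and empty cells without any extra arguments:

```python
import io

import pandas as pd

# Hypothetical sample data: note the quoted comma and the empty cells.
raw = io.StringIO(
    "name,city,score\n"
    '"Smith, Jane",Berlin,8.5\n'
    "Lee,,7.0\n"
    "Okafor,Lagos,\n"
)

# Default settings handle the quoting and turn empty cells into NaN.
df = pd.read_csv(raw)

# Count nulls per column, then fill them in one line each.
null_counts = df.isna().sum()
df["city"] = df["city"].fillna("unknown")
df["score"] = df["score"].fillna(df["score"].mean())

# Writing back out quotes the embedded comma automatically.
round_trip = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(null_counts)
print(df.describe())
```

If matplotlib is installed, `df.hist()` on the same frame gives the quick EDA plots.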

The other benefit of pandas is that it’s in Python, so you can use your other data analysis libraries, whereas with SQL you need to marshal back and forth between Python and SQL.
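
For illustration, this is roughly the round trip I mean, sketched with an in-memory SQLite database (the `payments` table and its columns are made up for the example). The same aggregation in pandas never leaves Python:

```python
import sqlite3

import pandas as pd

# Hypothetical data for the example.
df = pd.DataFrame(
    {"customer": ["a", "b", "a"], "amount": [10.0, 5.0, 2.5]}
)

# SQL route: push the frame into the database, query it, pull the result back.
conn = sqlite3.connect(":memory:")
df.to_sql("payments", conn, index=False)
totals_sql = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS amount "
    "FROM payments GROUP BY customer ORDER BY customer",
    conn,
)
conn.close()

# pandas route: one call, and the result is already a DataFrame.
totals_pd = df.groupby("customer", as_index=False)["amount"].sum()

print(totals_sql)
print(totals_pd)
```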

My usual workflow is: explore the data in pandas/Datasette (if it’s big data, I explore just a sample, pulled out with bash tools) -> write my notebook in pandas -> scale it up in Spark/Dask/Polars, depending on the use case.
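
As a rough sketch of the sampling step: I use bash tools for this, but the same idea can be done in pandas itself by reading in chunks, which is what the example below does instead. The data is a made-up stand-in for a large file, and the 1% fraction and chunk size are arbitrary choices:

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk (hypothetical data).
big_csv = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))
)

# Read in chunks so the whole file never sits in memory at once,
# keeping about 1% of the rows from each chunk.
pieces = [
    chunk.sample(frac=0.01, random_state=i)
    for i, chunk in enumerate(pd.read_csv(big_csv, chunksize=1_000))
]
sample = pd.concat(pieces, ignore_index=True)

print(len(sample))
print(sample.head())
```

With a real file you would pass its path to `pd.read_csv` in place of the `io.StringIO` object.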

This works pretty well because ChatGPT understands pandas, PySpark, and SQL really well, so you can easily ask it to translate scripts or give you code for different things.

On scalability: if you need scale, there are many options today for processing large datasets with a dataframe API, e.g. Koalas, Polars, Dask, Modin, etc.