| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by faizshah 1201 days ago

If you asked this question 6 or 8 years ago the answer would be it depends on the volume of data (10s of gb, 100s of gb etc.) and I could give you just a single tool that would help you in most cases.

Today honestly most tools are pretty capable, pandas is a great choice and if you have really high volumes of data you might try koalas (spark) or polars.

Honestly the biggest design considerations for data science today are things things external to your project: what do you and others on your team know, what tools does your company already have setup, what volume of data are you processing, what are your SLAs, who or what else needs to run this script/workflow, what softwares do you need to integrate with, how often does it need to be processed, how are you going to assure the quality of your data and what tools are you using for reporting?

I tend to use pandas and SQLite for most use cases cause I can cook up a script in 2 hours and be done, I just code it interactively in a notebook and most people are able to work on a pandas or SQLite script productively if it needs to be maintained even if they don't know python. If its a large volume of data or a rapid schedule (minutes, seconds) or tight SLAs on quality or processing time, then I start to consider whether pyspark, Apache beam, dask or bigquery might be a good fit.

So it really just depends but for most people who are processing < 100 GB on a 1+ day schedule or ad hoc I would recommend just using pandas or tidyverse in R and getting really good at writing those scripts fast. Today you’ll get the most mileage out of those two tools.