| HN Mirror



	by bakuninsbart 1201 days ago
	A bit off topic, but what would you use for data "mangling"? Like joining CSVs on complex conditions, cleaning tables, etc. Pandas seems to be the wrong tool for this, but I still often find myself using it because, in contrast to something like Excel, my steps are at least clearly documented for future use or verification.

2 comments

faizshah 1201 days ago

If you had asked this question 6 or 8 years ago, the answer would have depended on the volume of data (10s of GB, 100s of GB, etc.), and I could have given you a single tool that would help in most cases.

Today, honestly, most tools are pretty capable. pandas is a great choice, and if you have really high volumes of data you might try Koalas (Spark) or Polars.

Honestly, the biggest design considerations for data science today are things external to your project: what do you and others on your team know, what tools does your company already have set up, what volume of data are you processing, what are your SLAs, who or what else needs to run this script/workflow, what software do you need to integrate with, how often does it need to be processed, how are you going to assure the quality of your data, and what tools are you using for reporting?

I tend to use pandas and SQLite for most use cases because I can cook up a script in 2 hours and be done. I just code it interactively in a notebook, and most people can work productively on a pandas or SQLite script if it needs to be maintained, even if they don't know Python. If it's a large volume of data, a rapid schedule (minutes, seconds), or tight SLAs on quality or processing time, then I start to consider whether PySpark, Apache Beam, Dask, or BigQuery might be a good fit.

So it really just depends, but for most people who are processing < 100 GB on a 1+ day schedule or ad hoc, I would recommend just using pandas or the tidyverse in R and getting really good at writing those scripts fast. Today you'll get the most mileage out of those two tools.
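
For anyone who wants to see what the pandas-plus-SQLite combination mentioned above can look like, here is a minimal sketch. Everything in it is made up for illustration: the inline CSV data, the column names, and the helper names `load_and_clean` and `orders_with_promos` are assumptions, not anything from this thread. It cleans one table in pandas, then joins it to another on a range condition (a date window plus a minimum amount), which is awkward in a plain pandas merge but natural in SQL.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical inline CSVs standing in for real files on disk.
ORDERS_CSV = """order_id,customer,amount,order_date
1, alice ,120.50,2023-01-05
2,BOB,80.00,2023-01-20
3,alice,,2023-02-11
4,carol,15.25,2023-02-15
"""

PROMOS_CSV = """promo,start_date,end_date,min_amount
new_year,2023-01-01,2023-01-15,100
february,2023-02-01,2023-02-28,10
"""


def load_and_clean(orders_csv, promos_csv):
    """Read both CSVs and apply a few explicit, documented cleaning steps."""
    orders = pd.read_csv(io.StringIO(orders_csv))
    promos = pd.read_csv(io.StringIO(promos_csv))
    # Normalise customer names: trim whitespace, lower-case.
    orders["customer"] = orders["customer"].str.strip().str.lower()
    # Drop orders that have no amount recorded.
    orders = orders.dropna(subset=["amount"])
    return orders, promos


def orders_with_promos(orders, promos):
    """Left-join orders to promos on a date window and a minimum amount."""
    con = sqlite3.connect(":memory:")
    try:
        orders.to_sql("orders", con, index=False)
        promos.to_sql("promos", con, index=False)
        query = """
            SELECT o.order_id, o.customer, o.amount, p.promo
            FROM orders AS o
            LEFT JOIN promos AS p
              ON o.order_date BETWEEN p.start_date AND p.end_date
             AND o.amount >= p.min_amount
            ORDER BY o.order_id
        """
        return pd.read_sql_query(query, con)
    finally:
        con.close()


if __name__ == "__main__":
    cleaned_orders, promo_table = load_and_clean(ORDERS_CSV, PROMOS_CSV)
    print(orders_with_promos(cleaned_orders, promo_table))
```

The dates are left as ISO-formatted text on purpose: SQLite compares them as strings, and ISO dates sort correctly that way. With real files you would swap the `io.StringIO(...)` wrappers for file paths.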

bombcar 1201 days ago

I still use perl for some of that stuff, or even awk, but those are barely reusable or readable.

faizshah 1201 days ago

This is a letter to the general community: please stop writing these scripts in perl and bash one-liners. That one-off script you thought would only be used once or twice at this nonprofit has been in continuous use for 12 years, and every year a biologist or journalist runs your script having no idea how it actually works. Eventually the script breaks after 8 years, and some poor college student interning there has to figure out how perl works and what your spaghetti is doing, and is eventually tasked with rewriting it in python as an intern project (true story).

JohnFen 1201 days ago

I think your complaint isn't really about perl and bash. It's about knowing your audience.

When writing code that will be used by a particular sort of user base, the code should be written in whatever way best suits that user base. If your users are academics, researchers, journalists, etc. -- yes, avoid anything with complex or obscure semantics like perl or bash.

But if your code is going to be used by programmers or people who are already comfortable with perl/bash/whatever, those tools may be just the ticket.

tejtm 1201 days ago

one line spaghetti ... I remain unsympathetic.

JohnFen 1201 days ago

He has a valid point, though. I've seen (and written!) one-liners that were so complex that nobody, even devs, can deal with them without decoding them first.

They aren't technically "spaghetti", but they are technically impenetrable.

I argue that one-liners like that aren't good for anybody, dev or otherwise.
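
To make the contrast concrete, here is a rough sketch of the kind of job that often ends up as a dense one-liner (summing a column per key in a delimited file), written instead as a short script with named steps. The function name `total_by_key`, the field names, and the sample data are all hypothetical, and it uses only the Python standard library.

```python
import csv
import io
from collections import defaultdict


def total_by_key(lines, key_field, value_field, delimiter=","):
    """Sum value_field per key_field, skipping rows whose value is not numeric."""
    totals = defaultdict(float)
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        key = (row.get(key_field) or "").strip()
        raw_value = (row.get(value_field) or "").strip()
        try:
            value = float(raw_value)
        except ValueError:
            # Non-numeric or empty values (e.g. "n/a") are ignored, not fatal.
            continue
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample data; a real script would pass an open file instead.
    sample = io.StringIO("site,count\nnorth,3\nsouth,n/a\nnorth,4.5\nsouth,2\n")
    print(total_by_key(sample, "site", "count"))
```

It is longer than the one-liner would be, but each decision (which rows are skipped, how keys are normalised) is visible to whoever inherits it.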
