by cybersol 2663 days ago
'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command line, 'join' and 'comm' can help you merge data from multiple files.
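
To make that concrete, here is a minimal sketch of the sort/uniq and comm combination. The file names and sample data are made up for illustration, and LC_ALL=C is only there to pin the sort order to plain byte order so the tools agree with each other.

```shell
# Work in a scratch directory so nothing existing is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data: one word per line.
printf '%s\n' apple banana apple cherry banana apple > fruits.txt
printf '%s\n' banana cherry durian > other.txt

# uniq only collapses adjacent duplicates, so sort first.
# The trailing sort -rn orders the result by count, highest first.
counts=$(LC_ALL=C sort fruits.txt | uniq -c | LC_ALL=C sort -rn)
echo "$counts"

# comm also expects sorted input; -12 suppresses columns 1 and 2,
# leaving only the lines that appear in both files.
LC_ALL=C sort -u fruits.txt > a.sorted
LC_ALL=C sort -u other.txt > b.sorted
common=$(LC_ALL=C comm -12 a.sorted b.sorted)
echo "$common"
```

Swapping -12 for -23 would instead list the lines that appear only in the first file, which is handy for "what is missing from the other list" questions.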
3 comments

Amen.

I'm guessing regexes are beyond a Data Scientist [0], but throw sed and vim into the mix and there are very few one-off problems manageable on a single CPU that you can't solve, and what's more, solve more efficiently than with any other tool chain. The overhead of loading the data into a SQL database or whatever is so big that these simple tools simply blow it away if you are doing it just once.

[0] I'm guessing a "Data Scientist" is someone who knows a lot about the data and the scientific domain that created it, and to whom a computer is just another hammer you hit the data with. A hammer that someone deliberately made insanely and unnecessarily complex for job security, or something.

I can't tell you how many times a combination of sort and join, with a bit of awk, has saved my bacon. It seems to be a rather rare skill among the various Unix admins I've worked with in the past.

One thing to note: set LANG=C before doing operations with sort and join. I'm not sure if it is a bug, or if it is in all versions, but if you have, for example, LANG=en_US.utf8, then sort will use one order ("_" comes before "-"), but join uses ASCII order. Note that you don't have to export LANG=C; just put it in front of the command you are launching to set it for just that one command.

  LANG=C sort ...
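
A fuller sketch of that sort, join, and awk workflow might look like the following. The file names and data are hypothetical. It uses LC_ALL=C rather than LANG=C, which is a swap from the line above: LC_ALL takes precedence over LANG, so the prefix still works when LC_ALL is already set in the environment.

```shell
# Work in a scratch directory so nothing existing is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical data: "key value" pairs, deliberately unsorted,
# with keys that differ only in "_" versus "-".
printf '%s\n' 'a_b 10' 'a-b 20' 'abc 30' > left.txt
printf '%s\n' 'abc x' 'a_b y' 'a-b z' > right.txt

# The VAR=value prefix applies to that single command only; nothing is exported.
LC_ALL=C sort -k1,1 left.txt  > left.sorted
LC_ALL=C sort -k1,1 right.txt > right.sorted

# Join on the first field (the default), then let awk reorder the columns.
merged=$(LC_ALL=C join left.sorted right.sorted | awk '{ print $1 "," $3 "," $2 }')
echo "$merged"
# a-b,z,20
# a_b,y,10
# abc,x,30
```

In byte order "-" sorts before "_", so both files end up in the same order that join expects.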
I know this is not an "official" Unix tool, but for join/comm specifically on CSV files I love this tool: https://github.com/BurntSushi/xsv/releases/tag/0.13.0

xsv help: https://i.imgur.com/yS8cen7.png