'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command line, 'join' and 'comm' can help you merge data from multiple files.
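For anyone who hasn't used these, here is a rough sketch of the usual patterns. The file names and fruit data are made up for illustration; the point is only that uniq and comm both expect sorted input, so sort comes first.

```shell
# Hypothetical sample data in a scratch directory (names and contents are invented).
tmp=$(mktemp -d)
printf '%s\n' apple pear apple fig pear apple > "$tmp/a.txt"
printf '%s\n' fig kiwi pear > "$tmp/b.txt"

# Frequency count: uniq only collapses adjacent duplicates, so sort first,
# then order by count (descending) and print "value count".
counts=$(LC_ALL=C sort "$tmp/a.txt" | uniq -c | sort -rn | awk '{print $2, $1}')
echo "$counts"

# comm compares two sorted files line by line.
LC_ALL=C sort -u "$tmp/a.txt" > "$tmp/a.sorted"
LC_ALL=C sort -u "$tmp/b.txt" > "$tmp/b.sorted"

# -12 suppresses columns 1 and 2, leaving only lines present in both files.
common=$(LC_ALL=C comm -12 "$tmp/a.sorted" "$tmp/b.sorted")
# -23 suppresses columns 2 and 3, leaving lines found only in the first file.
only_a=$(LC_ALL=C comm -23 "$tmp/a.sorted" "$tmp/b.sorted")
echo "$common"
echo "$only_a"

rm -rf "$tmp"
```

With this sample data, the count list starts with apple, the common lines are fig and pear, and apple is the only line unique to the first file.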
I'm guessing regexes are beyond a Data Scientist [0], but throw sed and vim into the mix and there are very few one-off problems manageable on a single CPU that you can't solve, and what's more, solve more efficiently than with any other tool chain. The overhead of loading the data into a SQL database or whatever is so big that these simple tools blow it away if you are doing the job just once.
[0] I'm guessing a "Data Scientist" is someone who knows a lot about the data and the scientific domain that created it, and to whom a computer is just another hammer you hit the data with. A hammer that someone deliberately made insanely and unnecessarily complex for job security, or something.
I can't tell you how many times a combination of sort and join, with a bit of awk, has saved my bacon. It seems to be a rather rare skill among the various Unix admins I've worked with in the past.
One thing to note: set LANG=C before doing operations with sort and join. I'm not sure if it is a bug, or if it is in all versions, but if you have, for example, LANG=en_US.utf8, then sort will use one order ("_" comes before "-"), but join uses ASCII order. Note that you don't have to export LANG=C; just put it before the command you are launching and it applies to that one command only.
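A minimal sketch of the sort, join, and awk combination with the locale pinned per command. The CSV files and their contents are hypothetical. It uses LC_ALL=C rather than LANG=C because LC_ALL takes precedence over LANG, so it still works if LC_ALL or LC_COLLATE is already set in your environment.

```shell
# Hypothetical inputs: "id,name" and "id,score", with keys containing "_" and "-".
tmp=$(mktemp -d)
printf '%s\n' 'a_1,alice' 'a-1,bob' 'b2,carol' > "$tmp/names.csv"
printf '%s\n' 'b2,70' 'a-1,90' 'a_1,80' > "$tmp/scores.csv"

# The VAR=value prefix applies to that single command only; nothing is exported
# to the rest of the shell session.
LC_ALL=C sort -t, -k1,1 "$tmp/names.csv"  > "$tmp/names.sorted"
LC_ALL=C sort -t, -k1,1 "$tmp/scores.csv" > "$tmp/scores.sorted"

# join requires both inputs sorted on the join field under the same collation,
# so it gets the same locale as the sort commands above.
joined=$(LC_ALL=C join -t, "$tmp/names.sorted" "$tmp/scores.sorted")
echo "$joined"

# A bit of awk on the joined rows: keep scores of 80 or more, print "name: score".
report=$(echo "$joined" | awk -F, '$3 >= 80 {print $2 ": " $3}')
echo "$report"

rm -rf "$tmp"
```

In byte order "-" sorts before "_", so the joined rows come out as a-1, a_1, b2. If sort and join disagree about the order, join can silently drop rows or complain that the input is not sorted, which is why both get the same prefix here.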