* If it's simple transforms, use CLI tools.
* If it requires aggregation and it's small, use CLI tools.
* If this is data you're using over and over again, load it into the database and then do the cleaning (ELT).
* If it's 2 TB of data or under, still use bzip2, get splittable streams, and pass them to GNU parallel.
* If it requires massive aggregations or windows, use Spark|Flink|Beam.
* If you need to repeatedly process the same giant dataset, use Spark|Flink|Beam.
* If the data is highly structured and you mainly need aggregations and filtering on a few columns, use columnar DBs.

I've been using Dlang with ldc a lot because of how fast its compile-time regex is, and for its built-in JSON support. Python3+pandas is also a good choice if you don't want to use awk.
Sort is good for aggregations that fit on disk (TBs these days, I guess).
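To make the sort-based approach concrete, here is a rough sketch in Python. It assumes tab-separated `key<TAB>value` lines with integer values, which is my own made-up layout, and the function name `aggregate_sorted` is hypothetical. The point is that once the input is sorted by key, the aggregation only ever holds one group in memory, so the data just has to fit on disk for the sort step.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(lines):
    """Sum values per key from 'key<TAB>value' lines already sorted by key.

    Only one group is held in memory at a time, so the input can be far
    larger than RAM as long as something (e.g. an external sort) has
    ordered it by key first.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(int(value) for _, value in group)


# Assumed sample data. In practice the lines would be streamed from the
# output of an external sort; sorted() stands in for that step here.
sample = ["b\t2\n", "a\t1\n", "b\t5\n", "a\t3\n"]
totals = list(aggregate_sorted(sorted(sample)))
```

With real data you would probably pipe the output of `sort` into a script like this and iterate over `sys.stdin` instead of a list.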
Perl does well too if the output fits in a hashtable in DRAM, so tens (or maybe hundreds?) of GBs.
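The hashtable approach looks roughly like this. It is sketched in Python rather than Perl, using the same made-up `key<TAB>value` layout with integer values; `aggregate_hashed` is a hypothetical name. The input does not need to be sorted, but every distinct key lives in memory at once, which is why the output size is the limit.

```python
from collections import defaultdict


def aggregate_hashed(lines):
    """Sum values per key from unsorted 'key<TAB>value' lines.

    A single pass over the input; memory use grows with the number of
    distinct keys, not with the number of input lines.
    """
    totals = defaultdict(int)
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        totals[key] += int(value)
    return dict(totals)


# Assumed sample data, deliberately unsorted.
sample = ["b\t2\n", "a\t1\n", "b\t5\n", "a\t3\n"]
result = aggregate_hashed(sample)
```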