by gerner 3903 days ago
In my experience (and I suppose it depends on the data), grep is often the bottleneck in data pipeline tasks like the one you describe. The silver searcher (https://github.com/ggreer/the_silver_searcher) has been about 10x faster than grep for me on tasks like pulling fields out of JSON files. It's changed my life.

pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy for measuring this kind of thing. top shows which process is using how much CPU, and pv shows the throughput of data moving through the pipe.
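
If you want a rough number for a single filter stage without setting up the whole pipeline, something like the following might do. It's only a sketch: `measure_filter` and the sample JSON lines are made up for illustration, and it times a Python regex rather than grep or the silver searcher, so treat the MB/s as a ballpark for comparing stages, not as a benchmark of either tool.

```python
import re
import time


def measure_filter(lines, pattern):
    """Run a grep-like regex filter over lines and report rough throughput.

    Hypothetical helper: returns (matching lines, bytes scanned, MB/s).
    """
    regex = re.compile(pattern)
    total_bytes = 0
    matches = []
    start = time.perf_counter()
    for line in lines:
        # Count bytes scanned, which is roughly what pv reports for a pipe.
        total_bytes += len(line.encode("utf-8"))
        if regex.search(line):
            matches.append(line)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    mb_per_s = (total_bytes / 1e6) / elapsed if elapsed > 0 else float("inf")
    return matches, total_bytes, mb_per_s


if __name__ == "__main__":
    # Made-up sample data standing in for a real JSON-lines file.
    sample = ['{"user": "a", "id": 1}\n', '{"name": "b"}\n'] * 100_000
    matches, nbytes, rate = measure_filter(sample, r'"user"')
    print(f"{len(matches)} matches, {nbytes} bytes, {rate:.1f} MB/s")
```

If the number you get here is far below what pv reports for the stages upstream of it, that stage is a likely candidate for the bottleneck.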