| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by NyxWulf 4530 days ago

wc -l is faster however it only works on an uncompressed file. The pipeline in the article was doing much more complicated work on the stream of output. I will often separate my pipeline logic from my "read this data and feed it into the pipeline" logic, since the logic around what to read in and where to write it can often change even when using the same logical pipeline.

As a simple example: cat somefiles.txt | pipeline_script > stored_output.txt

Other times I'm reading gzipped files or remote log files, I don't want that data mixed in with the pipeline logic. If I want to move the output files to a different directory on one set of servers, that may not impact my generalized pipeline.

I work with trillions of lines of log files and there are many ways to scale up pipelines. I wouldn't start optimizing the difference between 1.6s and 1.275s unless it made an economic impact on the problem. How often is it run? Can you process more lines in an economic unit of time with the faster version? If this is a job that runs once an hour or even once a minute how many lines are collected during that time?

Intermixing the grep or wc -l logic into the pipeline logic can have adverse support and maintenance costs and many times saves machine time while spending programmer time. Which one dominates in this scenario?

It's like the old saying...there is more than one way to skin a "cat". :)