by rosser
4530 days ago
The shell command

    cat file | grep 'expression'

is the shell analogue of the SQL query

    select * from (select * from table) as q where column like '%expression%'

...except that a decent query optimizer will collapse the extraneous inner query. Bash doesn't know how to do that.

It's also more expensive:

    # wc -l /var/log/secure
    97845 /var/log/secure
    # time cat /var/log/secure | grep root > /dev/null
    real 0m1.600s
    user 0m1.517s
    sys 0m0.294s
    # time grep root /var/log/secure > /dev/null
    real 0m1.275s
    user 0m1.237s
    sys 0m0.036s

That's an extra .3-and-change seconds, or over 25% longer, on a 98k-line file. Scale that up to a multi-million-line nginx log file, and I'd actually say it's a worse-than-useless use of cat.
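
For anyone who likes the file-first reading order that cat gives, the shell also accepts a redirection placed before the command, so the file can still be named first without the extra process. A minimal sketch, using a throwaway sample file as a stand-in for the real log (the sample contents are made up for illustration):

```shell
# Build a small sample file (hypothetical stand-in for /var/log/secure).
sample=$(mktemp)
printf '%s\n' \
    'sshd: session opened for user root' \
    'sshd: session opened for user alice' \
    'sudo: root : TTY=pts/0' > "$sample"

# Three ways to count the matching lines; all should agree.
with_cat=$(cat "$sample" | grep -c root)     # extra cat process plus a pipe
with_arg=$(grep -c root "$sample")           # grep opens the file itself
with_redirect=$(< "$sample" grep -c root)    # shell opens the file, no cat

echo "$with_cat $with_arg $with_redirect"
rm -f "$sample"
```

The redirect form only helps when the input is a single local file; it is not a replacement for cat when concatenating several files.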
As a simple example:

    cat somefiles.txt | pipeline_script > stored_output.txt

Other times I'm reading gzipped files or remote log files, and I don't want the details of where the data comes from mixed in with the pipeline logic. Likewise, if I want to move the output files to a different directory on one set of servers, that shouldn't have to touch my generalized pipeline.
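
A rough sketch of that separation, assuming a helper that decides how to read the input so the pipeline itself never changes. The names read_source and pipeline_script, and the convention that gzipped files end in .gz, are hypothetical and only here for illustration:

```shell
# Hypothetical helper: write the lines of a plain or gzipped file to stdout.
read_source() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # decompress to stdout
        *)    cat "$1" ;;        # plain file, passed through as-is
    esac
}

# Hypothetical stand-in for the real pipeline: count lines mentioning root.
pipeline_script() {
    grep -c root
}
```

With this shape, the example above becomes `read_source somefiles.txt | pipeline_script > stored_output.txt`, and a remote source would be one more branch in read_source rather than a change to the pipeline.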
I work with trillions of lines of log files, and there are many ways to scale up pipelines. I wouldn't start optimizing the difference between 1.6s and 1.275s unless it made an economic impact on the problem. How often is it run? Can you process more lines in an economic unit of time with the faster version? If this is a job that runs once an hour, or even once a minute, how many lines are collected during that time?
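
One rough way to put numbers on those questions is to turn the measured difference into a per-line overhead and scale it by job size and frequency. The sketch below uses the timings quoted in this thread (1.600s vs 1.275s on 97,845 lines). The function name, the 10-million-line job size, and the hourly schedule are made-up figures for illustration, and it assumes the overhead grows linearly with line count, which may not hold in practice:

```shell
# Estimate extra seconds per day spent on the cat version.
# Args: lines per run, runs per day.
extra_seconds_per_day() {
    awk -v lines="$1" -v runs="$2" 'BEGIN {
        # Overhead per line, from the timings quoted in the thread.
        per_line = (1.600 - 1.275) / 97845
        printf "%.1f\n", per_line * lines * runs
    }'
}

# Hypothetical job: 10 million lines per run, 24 runs per day.
extra_seconds_per_day 10000000 24
```

Whether the resulting number matters depends on what else the machine and the programmer could be doing with that time.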
Intermixing the grep or wc -l logic into the pipeline logic can carry support and maintenance costs, and it often saves machine time at the expense of programmer time. Which one dominates in this scenario?
It's like the old saying...there is more than one way to skin a "cat". :)