| HN Mirror

by rosser 4530 days ago

The shell command

  cat file | grep 'expression'

is the shell analogue of the SQL query

  select * from (select * from table) as q where column like '%expression%'

...except that a decent query optimizer will collapse the extraneous inner query. Bash doesn't know how to do that.
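To make the comparison concrete, here is a minimal sketch showing that the piped, direct, and redirected forms produce the same matches. The temp file and its contents are made up for illustration and are not from the original post.

```shell
# Build a small throwaway file so the comparison is self-contained.
tmp=$(mktemp)
printf 'root login\nguest login\nroot logout\n' > "$tmp"

# Piped form: cat runs as a second process and copies the file through a pipe.
with_cat=$(cat "$tmp" | grep 'root')

# Direct form: grep opens the file itself.
direct=$(grep 'root' "$tmp")

# Redirected form: the shell opens the file and grep reads stdin, with no cat.
redirected=$(grep 'root' < "$tmp")

rm -f "$tmp"
```

All three variables end up holding the same two lines; the only difference is how many processes and copies it took to get there.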

It's also more expensive:

  # wc -l /var/log/secure
  97845 /var/log/secure
  # time cat /var/log/secure | grep root > /dev/null

  real    0m1.600s
  user    0m1.517s
  sys     0m0.294s  
  # time grep root /var/log/secure > /dev/null

  real    0m1.275s
  user    0m1.237s
  sys     0m0.036s

That's an extra 0.3-and-change seconds, or over 25% longer, on a 98k-line file. Scale that up to a multi-million-line nginx log file, and I'd actually say it's a worse-than-useless use of cat.

3 comments

NyxWulf 4530 days ago

wc -l is faster, but it only works on an uncompressed file. The pipeline in the article was doing much more complicated work on the stream of output. I will often separate my pipeline logic from my "read this data and feed it into the pipeline" logic, since the logic around what to read in and where to write it can often change even when using the same logical pipeline.

As a simple example: cat somefiles.txt | pipeline_script > stored_output.txt

Other times I'm reading gzipped files or remote log files, and I don't want that data mixed in with the pipeline logic. If I want to move the output files to a different directory on one set of servers, that need not impact my generalized pipeline.
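A rough sketch of that separation, assuming a hypothetical `read_input` helper and a stand-in body for `pipeline_script` (neither is from the article): the reader decides how to open the data, and the pipeline only ever sees stdin.

```shell
# Stand-in pipeline stage (hypothetical body): reads stdin, writes stdout,
# and knows nothing about where the data came from.
pipeline_script() {
    grep 'root' | sort | uniq -c
}

# Hypothetical reader: the "what to read and how" logic lives here and can
# change without touching the pipeline.
read_input() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # compressed logs
        *)    cat "$1" ;;        # plain files
    esac
}
```

Usage would look like `read_input somefiles.txt | pipeline_script > stored_output.txt`, and swapping in a `.gz` file changes nothing downstream.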

I work with trillions of lines of log files, and there are many ways to scale up pipelines. I wouldn't start optimizing the difference between 1.6s and 1.275s unless it made an economic impact on the problem. How often is it run? Can you process more lines in an economic unit of time with the faster version? If this is a job that runs once an hour or even once a minute, how many lines are collected during that time?

Intermixing the grep or wc -l logic into the pipeline logic can carry support and maintenance costs, and it often saves machine time at the expense of programmer time. Which one dominates in this scenario?

It's like the old saying...there is more than one way to skin a "cat". :)

dredmorbius 4530 days ago

> over 25% longer

You're assuming a constant factor here. That might be the case. Or it could be one-time start-up overhead as the linker finds and opens library files.

Unless you profile the process over a range of input sizes, you don't know which you're observing. And you know what they say about premature optimization.
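One way to check is to time both variants at several input sizes. The sketch below assumes bash (for the `time` keyword and `TIMEFORMAT`); `make_log` and `profile_sizes` are hypothetical helpers and the log lines are synthetic. If the gap stays roughly constant as the size grows, it is start-up overhead; if it grows with the input, it is a per-line cost.

```shell
# Hypothetical generator: print N synthetic log lines, every tenth for root.
make_log() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++) {
            user = (i % 10 == 0) ? "root" : "guest"
            print "session opened for user " user, i
        }
    }'
}

# Time cat|grep against plain grep at each requested size.
# Assumes bash: `time` here is the shell keyword, which reports on stderr.
profile_sizes() {
    local TIMEFORMAT='%R' n f piped direct
    f=$(mktemp)
    for n in "$@"; do
        make_log "$n" > "$f"
        piped=$( { time cat "$f" | grep root > /dev/null; } 2>&1 )
        direct=$( { time grep root "$f" > /dev/null; } 2>&1 )
        printf '%s lines: cat|grep %ss, grep %ss\n' "$n" "$piped" "$direct"
    done
    rm -f "$f"
}
```

Something like `profile_sizes 1000 100000 10000000` gives one line of wall-clock times per size; a single run per size is noisy, so repeat it a few times before drawing conclusions.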

matdes 4530 days ago

> is anyone using this particular blog's interpretation as production code??

No. And if you actually go to the link to the original question, you'll see that cat doesn't exist.

Last, I would disagree that `cat file` is equivalent to `select * from table`. His argument treats `cat file` as equivalent to `table` itself, or to `load data into table`, which needs to happen before any relational operation can be performed against it.