Hacker News
by j_z_reeves 2246 days ago
Nice, I finally took the time to read the man pages for awk, and whipped up a script to count the number of errors per day in a postgres log file.

   cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort
I just needed to know how awk programs are structured, the rest is just simple programming!

EDIT: I'm not sure if it's actually correct however...

3 comments

j_z_reeves 9 hours ago

> cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort

Great first program! A slightly less verbose version could be:

> awk '/ERROR:/ {counts[$1]++}END{...}' logfile

There are also ways of sorting the output within (g)awk (asort & asorti), but sorting externally as you have is more flexible and engages another core, which can be faster on large input.
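For anyone curious what the in-awk sort might look like, here is a rough sketch using gawk's asorti. The sample log lines are made up (the real log format isn't shown in this thread), and the whole thing is skipped if gawk isn't installed, since asorti is a gawk extension and not part of POSIX awk.

```shell
# Hypothetical sample log; the lines and their format are assumptions for illustration.
log=$(mktemp)
printf '%s\n' \
  '2018-01-02 09:00:00 UTC ERROR:  syntax error at or near "FORM"' \
  '2018-01-01 10:00:00 UTC ERROR:  relation "foo" does not exist' \
  '2018-01-01 10:05:00 UTC LOG:  checkpoint starting' \
  '2018-01-01 11:00:00 UTC ERROR:  deadlock detected' > "$log"

sorted=""
if command -v gawk >/dev/null 2>&1; then
  sorted=$(gawk '/ERROR:/ {counts[$1]++}
    END {
      n = asorti(counts, days)   # days[1..n] holds the keys of counts, sorted as strings
      for (i = 1; i <= n; i++) print days[i] " : " counts[days[i]]
    }' "$log")
  echo "$sorted"
fi
rm -f "$log"
```

This keeps everything in one process, at the cost of portability; piping to sort works with any awk.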

Without knowing the format of logfile, that still seems obviously correct to me.
I was surprised by the counts[$1] = counts[$1] + 1, since I didn't think it would correctly coerce a nonexistent value to 0.
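A quick way to check the coercion yourself: in awk, an uninitialized variable or array element acts as 0 in a numeric context, so the explicit form and the ++ form should count the same. A small sketch, with made-up log lines since the real format isn't shown here:

```shell
# Hypothetical sample log; the lines and their format are assumptions for illustration.
log=$(mktemp)
printf '%s\n' \
  '2018-01-01 10:00:00 UTC ERROR:  relation "foo" does not exist' \
  '2018-01-01 10:05:00 UTC LOG:  checkpoint starting' \
  '2018-01-02 09:00:00 UTC ERROR:  syntax error at or near "FORM"' \
  '2018-01-01 11:00:00 UTC ERROR:  deadlock detected' > "$log"

# Explicit addition: the first time a day is seen, counts[$1] is unset and reads as 0.
explicit=$(awk '/ERROR:/ {counts[$1] = counts[$1] + 1} END {for (day in counts) print day " : " counts[day]}' "$log" | sort)

# Increment operator: same coercion, shorter to write.
increment=$(awk '/ERROR:/ {counts[$1]++} END {for (day in counts) print day " : " counts[day]}' "$log" | sort)

echo "$explicit"
echo "$increment"
rm -f "$log"
```

Both commands should print the same per-day counts for this sample.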
Apart from the useless use of cat, since sort does the work here, something like the following would probably suffice:

grep ERROR logfile | cut -f 1 -d ' ' | sort | uniq -c
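To compare the two approaches side by side, here is a sketch on a made-up sample log (the real format is an assumption). It greps for 'ERROR:' with the colon to match the awk pattern exactly, and the trailing awk only reorders the uniq -c output into "day : count" so the two results can be compared. Note that cut splits on every single space while awk splits on runs of whitespace, so the two could differ on lines with leading or doubled spaces.

```shell
# Hypothetical sample log; the lines and their format are assumptions for illustration.
log=$(mktemp)
printf '%s\n' \
  '2018-01-01 10:00:00 UTC ERROR:  relation "foo" does not exist' \
  '2018-01-01 10:05:00 UTC LOG:  checkpoint starting' \
  '2018-01-02 09:00:00 UTC ERROR:  syntax error at or near "FORM"' \
  '2018-01-01 11:00:00 UTC ERROR:  deadlock detected' > "$log"

# grep/cut/sort/uniq version; uniq -c prefixes each distinct line with its count.
via_uniq=$(grep 'ERROR:' "$log" | cut -f 1 -d ' ' | sort | uniq -c | awk '{print $2 " : " $1}')

# awk version from the original post, minus the cat.
via_awk=$(awk '/ERROR:/ {counts[$1]++} END {for (day in counts) print day " : " counts[day]}' "$log" | sort)

echo "$via_uniq"
echo "$via_awk"
rm -f "$log"
```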

There's really nothing useless about that use of cat: it makes the pipeline compose better from left to right. It's not like you have to pay 25 cents for each process you spawn.
So does the pipeline above.

It's not detrimental to performance since an empty cat is a no-op in a pipeline. You can have any number of them. But commands should be written for humans to understand, and inserting no-ops is a distraction to the reader.

In the trivial example, "grep needle haystack" reads better than "cat haystack | grep needle".

Yes, that would also work! I forgot about the `-c` argument for uniq.