by vlovich123 239 days ago
Why does this test against sort | uniq | sort? It’s kind of weird to sort twice, no?
4 comments

The first "sort" sorts the input lines lexicographically (which is required for "uniq" to work); the second "sort" sorts the output of "uniq" numerically (so that lines are ordered from most-frequent to least-frequent):

  $ echo c a b c | tr ' ' '\n'
  c
  a
  b
  c
  
  $ echo c a b c | tr ' ' '\n' | sort
  a
  b
  c
  c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c
        1 a
        1 b
        2 c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c | sort -rn
        2 c
        1 b
        1 a
`uniq -c` puts a count at the beginning of each line, so the second sort orders the output by the frequency of the unique terms rather than sorting the unique terms again (which would indeed be kind of nonsensical).

  sort | uniq -c | sort -n
The second sort is sorting by frequency (the count output by `uniq -c`).
I often add `head` after `sort -rn` because I'm only interested in the largest counts.
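A minimal sketch of that `head` variant, for anyone who wants to try it. The `top_n` function name, the sample input, and the cutoff of 2 are made up for illustration; only `sort`, `uniq -c`, and `head -n` are real.

```shell
# Hypothetical helper: print the n most frequent lines read from stdin.
top_n() {
  sort | uniq -c | sort -rn | head -n "$1"
}

# "c" appears three times and "a" twice, so those two lines are printed,
# each prefixed with its count (padding width depends on the uniq implementation).
printf '%s\n' c a b c a c | top_n 2
```

Putting `head` last means the full count list is still built and sorted; it only trims what gets printed.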
It's something I've done myself in the past. The first sort is there because the input needs to be sorted for uniq -c to count it properly; the second sort is there because uniq doesn't always give the output in the right order.
More precisely, uniq produces output in the same order as its input, just collapsing runs (in effect, run-length encoding it).
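A quick way to see the run-collapsing behavior is to feed `uniq -c` unsorted input. This is just a sketch with arbitrary sample letters:

```shell
# Without a preceding sort, uniq only merges adjacent duplicates,
# so "a" is reported twice: once per run (2, then 1).
printf '%s\n' a a b a | uniq -c

# With the sort, all "a" lines become adjacent and are counted together (3).
printf '%s\n' a a b a | sort | uniq -c
```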