by decisiveness 3463 days ago

The parallel example will definitely not work as you expect, and will likely leave most duplicate lines in place. When using --pipe this way, unless you set --block to the size of the file (in which case there's no benefit to using parallel), each job runs on a separate 1MB chunk of the file (the default --block size). Each job dedupes only its own chunk, and the per-job results are then simply concatenated on stdout into the output file, so a line that appears in more than one chunk still shows up more than once.

If you're looking to spread work across CPUs and correctly get the desired output, I'd do something like:

    parallel -a input.txt --pipepart mawk \'\!a[\$0]++\' | mawk '!a[$0]++' > output.txt

I used mawk because it is typically much faster than other awk implementations on large files.
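
To see why the final dedupe pass matters, here's a minimal sketch that simulates the chunking with split rather than parallel, so it runs without GNU parallel or mawk installed. The temp file names and the 3-line chunk size are made up for illustration, and plain awk stands in for mawk; treat it as a toy model of the behavior, not of how parallel splits input internally.

```shell
#!/bin/sh
# Toy model (assumption): split + a loop stand in for parallel's chunked jobs.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Six lines, three distinct values; the duplicates straddle the chunk boundary.
printf '%s\n' a b c a b c > "$tmp/input.txt"

# Stand-in for --block: cut the input into 3-line chunks.
split -l 3 "$tmp/input.txt" "$tmp/chunk_"

# First pass: dedupe each chunk on its own, as each parallel job would.
for f in "$tmp"/chunk_*; do
    awk '!a[$0]++' "$f"
done > "$tmp/per_chunk.txt"

# Second pass: dedupe the concatenated per-chunk output.
awk '!a[$0]++' "$tmp/per_chunk.txt" > "$tmp/final.txt"

per_chunk_lines=$(wc -l < "$tmp/per_chunk.txt" | tr -d ' ')
final_lines=$(wc -l < "$tmp/final.txt" | tr -d ' ')
echo "after per-chunk pass: $per_chunk_lines lines"
echo "after final pass: $final_lines lines"
```

The per-chunk pass still leaves all 6 lines, because neither chunk contains a duplicate within itself; only the final pass gets it down to 3. That final pass is what the trailing mawk in the pipeline above is doing.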