by decisiveness
3463 days ago
The parallel example will definitely not work as you expect, and will likely leave most duplicate lines in place. When you use --pipe this way, unless you set --block to the size of the whole file (in which case there's no benefit to using parallel), each job runs on a separate chunk of the file (1MB, the default --block size) and deduplicates only within that chunk. The per-chunk results are then written together to stdout and on to the output file, so lines duplicated across chunks survive. If you're looking to spread work across CPUs and still get the desired output, I'd do something like:

  parallel -a input.txt --pipepart mawk \'\!a[\$0]++\' | mawk '!a[$0]++' > output.txt

The first mawk removes duplicates within each chunk in parallel, and the final mawk removes the duplicates that remain across chunks.
I used mawk because it is typically much more performant on large files.
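To make the chunking problem concrete, here is a small sketch that imitates it with plain awk, no parallel needed. The helper names dedupe_per_chunk and dedupe_global are made up for illustration, and chunks are counted in lines rather than bytes to keep the example readable, so this only approximates what --pipe actually does.

```shell
# Hypothetical helper: dedupe each chunk of N lines independently,
# roughly what happens when every parallel job sees only its own block.
dedupe_per_chunk() {
  awk -v n="$1" '
    (NR - 1) % n == 0 { split("", seen) }  # new chunk: forget earlier lines
    !seen[$0]++                            # print first occurrence within chunk
  '
}

# Hypothetical helper: a single pass over everything, like the final mawk.
dedupe_global() {
  awk '!seen[$0]++'
}

# Chunked pass alone: "a" and "b" repeat across the two chunks and survive.
printf 'a\nb\na\nb\n' | dedupe_per_chunk 2

# Chunked pass followed by a global pass: only one "a" and one "b" remain.
printf 'a\nb\na\nb\n' | dedupe_per_chunk 2 | dedupe_global
```

The second pipeline has the same shape as the command above: the chunked stage shrinks the input cheaply, and the single final stage is what guarantees correctness.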