by MontyCarloHall
238 days ago
> I use nucgen to generate a random 100M line FASTQ file and pipe it into different tools to compare their throughput with hyperfine.

This is a strange benchmark [0]. Here is what this random FASTQ looks like:

  $ nucgen -n 100000000 -l 20 | head -n8
  >seq.0
  TGGGGTAAATTGACAGTTGG
  >seq.1
  CTTCTGCTTATCGCCATGGC
  >seq.2
  AGCCATCGATTATATAGACA
  >seq.3
  ATACCCTAGGAGCTTGCGCA
There are going to be very few [*] repeated strings in this 100M line file, since each >seq.X header is unique and there are roughly a trillion (4^20 ≈ 1.1 × 10^12) possible strings of length 20 over the 4-letter alphabet ACGT. So this is really assessing how well a hash table copes with repeated reallocation as it fills up.

I did not have enough RAM to run a 100M line benchmark, but the following simple `awk` command performed ~15x faster on a 10M line benchmark (using the same hyperfine setup) than the naïve `sort | uniq -c`, which isn't bad for something that comes standard with every *nix system:

  awk '{ x[$0]++ } END { for(y in x) { print y, x[y] }}' <file> | sort -k2,2nr
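
To make it concrete what that one-liner computes, here is a minimal sketch on a made-up six-line input (the sample strings and the variable names `sample` and `histogram` are just for illustration, not part of the benchmark). Each distinct line becomes a key in an awk associative array, and the final `sort` orders the output by count, descending:

```shell
# Hypothetical sample: three distinct strings occurring 3, 2 and 1 times.
sample=$(printf '%s\n' ACGT TTGA ACGT ACGT TTGA CCCC)

# Count each distinct line in an associative array, print "line count",
# then sort numerically on the second field in descending order.
histogram=$(printf '%s\n' "$sample" |
  awk '{ count[$0]++ } END { for (line in count) print line, count[line] }' |
  sort -k2,2nr)

printf '%s\n' "$histogram"
# prints:
# ACGT 3
# TTGA 2
# CCCC 1
```

With almost entirely unique input, like the generated file above, the array ends up with nearly one key per line, which is why memory use and table growth dominate.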
[0] https://github.com/noamteyssier/hist-rs/blob/main/justfile

[*] Birthday problem math says about 1,100 colliding pairs, for 50M strings sampled from a pool of ~1.1T.
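
The birthday estimate in [*] can be checked with awk too. This is a rough sketch that assumes the sequences are drawn uniformly at random, with n = 50M draws (half of the 100M lines are headers) from N = 4^20 possible 20-mers; the expected number of colliding pairs is then approximately n(n-1)/(2N). The variable name `expected_pairs` is only for illustration:

```shell
# n = number of random sequences, N = number of possible 20-mers over ACGT.
# Expected number of pairs of identical sequences is about n*(n-1)/(2*N).
expected_pairs=$(awk 'BEGIN { n = 5e7; N = 4^20; printf "%.0f", n * (n - 1) / (2 * N) }')

echo "$expected_pairs"
# prints 1137
```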