by MontyCarloHall 238 days ago
>I use nucgen to generate a random 100M line FASTQ file and pipe it into different tools to compare their throughput with hyperfine.

This is a strange benchmark [0] -- here is what this random FASTQ looks like:

  $ nucgen -n 100000000 -l 20 | head -n8
  
  >seq.0
  TGGGGTAAATTGACAGTTGG
  >seq.1
  CTTCTGCTTATCGCCATGGC
  >seq.2
  AGCCATCGATTATATAGACA
  >seq.3
  ATACCCTAGGAGCTTGCGCA
There are going to be very few [*] repeated strings in this 100M line file, since each >seq.X header is unique and there are roughly a trillion (4^20 ≈ 1.1 × 10^12) possible length-20 strings over the 4-letter ACGT alphabet. So this is really assessing how well a hashtable copes with reallocating after being overloaded.

I did not have enough RAM to run a 100M line benchmark, but the following simple `awk` command performed ~15x faster on a 10M line benchmark (using the same hyperfine setup) versus the naïve `sort | uniq -c`, which isn't bad for something that comes standard with every *nix system.

  awk '{ x[$0]++ } END { for(y in x) { print y, x[y] }}' <file> | sort -k2,2nr
[0] https://github.com/noamteyssier/hist-rs/blob/main/justfile

[*] Birthday problem math says about 1,100 repeated pairs (n^2 / 2N), for n = 50M strings sampled from a pool of N = 4^20 ≈ 1.1T.
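
For anyone who wants to check the birthday-problem figure, here is a rough sketch in awk. It assumes the sequences are drawn uniformly at random and uses the standard approximation n(n-1)/(2N) for the expected number of colliding pairs; `expected_pairs` is just a helper name made up for this example.

```shell
# expected_pairs N_SEQS LENGTH  (hypothetical helper, not part of any tool)
# Prints the expected number of colliding pairs among N_SEQS strings drawn
# uniformly at random from the 4^LENGTH possible ACGT strings of that length.
expected_pairs() {
  awk -v n="$1" -v len="$2" 'BEGIN {
    pool = 4 ^ len                            # size of the sequence space
    printf "%.0f\n", n * (n - 1) / (2 * pool) # birthday-problem approximation
  }'
}

# 50M sequences of length 20, as in the benchmark input
expected_pairs 50000000 20
```

Under these assumptions the call prints 1137, which is negligible next to 50M strings, so nearly every line ends up as its own hashtable entry.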

2 comments

The awk script is probably the fastest way to do this still, and it's faster if you use gawk or something similar rather than default awk. Most people also don't need ordering, so you can get away with only the awk part and you don't need the sort.
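
A small sanity check that the awk-only version gives the same tallies as `sort | uniq -c`, sketched on a toy input. `count_awk` and `count_sort_uniq` are hypothetical helper names, and the sample lines are made up. The `for (y in x)` loop has no defined order, so both outputs are sorted here only to make them comparable.

```shell
# Tally each distinct line with an awk associative array (no sort needed).
count_awk() {
  awk '{ x[$0]++ } END { for (y in x) print x[y], y }'
}

# Same tally with the classic pipeline; the trailing awk strips uniq's
# left padding. Assumes lines contain no whitespace (true for sequences).
count_sort_uniq() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

# Made-up sample input: three distinct lines with different counts
sample='ACGT
TTGA
ACGT
ACGT
TTGA
GGCC'

printf '%s\n' "$sample" | count_awk | sort
printf '%s\n' "$sample" | count_sort_uniq | sort
```

Both pipelines should print the same three count/line pairs. The difference is that the awk version only holds the distinct lines in memory, while `sort` has to order the entire input first.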
Totally agree it's a bit of a weird benchmark - it was just the first thing I thought of to generate a huge number of lines to test throughput.

There are definitely other benchmarks we could try to test other characteristics as well.

I've actually just added the `awk` implementation you provided to the benchmarks as well.

Cheers!