| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ashvardanian 238 days ago

Storage, strings, sorting, counting, bioinformatics... I got nerd-sniped! Can't resist a shameless plug here :)

Looking at the code, there are a few things I would consider optimizing. I'd start by trying (my) StringZilla for hashing and sorting.

HashBrown collections under the hood use aHash, which is an excellent hash function, but on both short and long inputs, on new CPUs, StringZilla seems faster [0]:

                               short               long
  aHash::hash_one         1.23 GiB/s         8.61 GiB/s
  stringzilla::hash       1.84 GiB/s        11.38 GiB/s

A similar story with sorting strings. Inner loops of arbitrary length string comparisons often dominate such workloads. Doing it in a more Radix-style fashion can 4x your performance [1]:

                                                    short                  long
  std::sort_unstable_by_key           ~54.35 M compares/s    57.70 M compares/s
  stringzilla::argsort_permutation   ~213.73 M compares/s    74.64 M compares/s

Bear in mind that "compares/s" is a made-up metric here; in reality, I'm comparing from the duration.

[0] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...

[1] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...

1 comments

noamteyssier 237 days ago

Cool suggestions! I definitely would be interested in exploring other hash functions for this (and other binf works) so I'll definitely take a look at your stringzilla lib.

link