| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by emcq 792 days ago

The 1.4s is _after_ having the file loaded into RAM by the kernel. Because this is mostly I/O bound, it's not a fair comparison to skip the read time. If you were running on a M3 mac you'd might get less than 100ms if the dataset was stored in RAM.

If you account for time loading from disk, the C implementation would be more like ~5s as reported in the blog post [1]. Speculating that their laptop's SSD may be in the 3GB/s range, perhaps there is another second or so of optimization left there (which would roughly work out to the 1.4s in-memory time).

Because you have a lot of variable width row reads this will be more difficult on a GPU than CPU.

[1] https://www.dannyvankooten.com/blog/2024/1brc/

2 comments

pama 792 days ago

The performance report followed the initial request: run 6 times and remove the best and worst outliers, so the mmap optimization is fair game. Agreed that the C code has room left for some additional optimization.

link

emcq 792 days ago

If we are going to consider using prior runs of the program having the file loaded in RAM by the kernel fair, why stop there?

Let's say I create a "cache" where I store the min/mean/max output for each city, mmap it, and read it at least once to make sure it is in RAM. If the cache is available I simply write it to standard out. I use whatever method to compute the first run, and I persist it to disk and then mmap it. The first run could take 20 hours and gets discarded.

By technicality it might fit the rules of the original request but it isn't an interesting solution. Feel free to submit it :)

link

gunnarmorling 792 days ago

This actually doesn't fit the rules. I've designed the challenge so that disk I/O is not part of the measured runtime (initially, by relying on the fact that the first run which would pull the file into the page cache will be the slowest and thus discarded, later on by loading the file from a RAM disk). But keeping state between runs is explicitly ruled out as per the README, for the reason you describe.

link

pama 792 days ago

Having write access to storage or spawning persistent daemons is an extra requirement and that is often not available in practice when evaluating contest code :-)

This is a fun project for learning CUDA and I enjoyed reading about it——I just wanted to point out that the performance tuning in this case is really on the parsing, hashing, memory transfers, and IO. Taking IO out of the picture using specialized hardware or Linux kernel caching still leaves an interesting problem to solve and the focus should be on minimizing the memory transfers, parsing, and hashing.

link

ww520 792 days ago

Also this uses 16 threads while the contest restricts to running in 8 cores. Needs to compare the benchmarks in the same environment to make a fair comparison.

link

pama 792 days ago

The AMD Ryzen 4800U has 8 cores total so the author follows the contest restriction. This CPU supports hyperthreading. (I’d be very interested in seeing hyperoptimized CUDA code using unlimited GPU cores FWIW.)

link

ww520 791 days ago

Good to know. I didn’t know the contest has no limit on hyperthread.

link

fulafel 791 days ago

1brc in the contest had SMT disabled [0]. (hyperthreading is an intel marketing name and trademark for their implementation of smt, but the benchmark was run on an amd cpu)

[0] https://github.com/gunnarmorling/1brc/issues/189#issuecommen...

link

ww520 790 days ago

Ok. So it's indeed restricted to 8 core (1 thread per core). Then the benchmark above using 16 threads was not really a fair comparison.

link