Hacker News new | ask | show | jobs
by bshlgrs 3351 days ago
"At around 1,000,000 elements our unsorted vector takes around 32 minutes to lookup a string."

I don't at all believe that C++ can only iterate through 2000 strings a second. Python takes an imperceptibly small time to iterate through an array of a million strings when I test it out in my repl, and I imagine C++ would either be as fast or somewhat faster. This number is so ridiculous that it makes me very skeptical of the rest of the article.

1 comments

The x-axis is the number of iterations, i.e. the number of lookups. The collection is always size 1,000,000. Also, each of those lookups is for the exact same entry, so after the first time, it'll be all cache hits. Well, except for the std::vector, which is what you're talking about... :)
Strings are length 10 to 100. A cache line is 64 bytes. A cache miss is < 1000 cycles < 10^3ns. That makes < 1s for 10^6 strings.

Furthermore adjacent strings are likely to be adjacent in memory, from which I would naively expect cache prefetching to succeed.

Additionally, the curve is quadratic (if at all), not linear. I bet the author looked up all n strings from the vector of n strings.

I think the curves are linear because the x-axis is log scale but the y-axis is not.
Gotcha. It's probably still quadratic, just judging from the values. You can't need 32 minutes for 10^6 100-byte comparisons (which typically abort after the first character compares unequal. Not that that matters).
It's definitely a strange set of graphs:

Mixed (log/linear) scales on the axes

Only two columns of points on each graph matter (the last two) because the left hand side is all zeros (an artifact of the y axis not being log scale)

The y-axis is sometimes in thousands-of-mega-nanoseconds (with two significant digits) or sometimes in giga-nanoseconds (with one).

Giga-nanoseconds are seconds!

The colors for each line sometimes changes between graphs.