| HN Mirror



	by harrid 942 days ago
	Yes... But you don't need to memorize all this, and it leaves out a simple rule: vectors (or arrays) outperform everything else if the dataset is small. Small is usually on the order of 100-300 elements, but can vary wildly. Also note that most <algorithm>s have built-in, fully automatic parallelism via <execution> policies, a massively underused feature. In typical C++ fashion though, their newer views:: counterparts lack those overloads for the moment.

8 comments

csjh 942 days ago

Never heard of `execution`, great tip. Is there some big caveat or just generally unknown?


harrid 942 days ago

You need to be aware that these invocations are going to blast your cores full throttle, as you obviously don't have fine-grained control. But as long as your data is easily parallelized on a vector with computations that don't depend on each other, it's a game changer. I use it all the time to multithread things with literally a single line of code.
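
For readers who have not seen it, here is a minimal sketch of what "a single line of code" can look like, assuming C++17 and a standard library that ships the parallel overloads. The helper name `add_elementwise` is hypothetical and only serves the illustration. As noted further down the thread, GCC's implementation may additionally need to be linked against TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <functional>
#include <vector>

// Hypothetical helper: element-wise sum of two equally sized vectors.
// Every output element depends only on its own pair of inputs, which is
// the "computations that don't depend on each other" case.
std::vector<double> add_elementwise(const std::vector<double>& a,
                                    const std::vector<double>& b) {
    assert(a.size() == b.size());  // the binary transform reads b in lockstep with a
    std::vector<double> out(a.size());
    // The only difference from the sequential call is the first argument.
    std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                   out.begin(), std::plus<>{});
    return out;
}
```

Swapping `std::execution::par` for `std::execution::seq` gives the sequential behaviour back, which makes it easy to compare both on real data.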


lionkor 942 days ago

I looked into it (my video, probably too long https://youtu.be/9oh66SF91LA?si=azDCSOAJKA9Gpzim), and the general result was that they make sense for non-small datasets and are a solid way to parallelize something without having to pull in OpenMP or something.


coffeeaddict1 942 days ago

They're only supported in MSVC and GCC (for the latter you need to link against Intel's TBB to make it work). Support in libc++ (Clang) is work in progress.


pca006132 942 days ago

Clang does support parallel stl already (requires either TBB or OpenMP). Our project https://github.com/elalish/manifold made use of this to speed up mesh processing algorithms a lot.


coffeeaddict1 942 days ago

Does it? You mean if you link against libstdc++ instead of libc++?


pca006132 942 days ago

I remember it works for libc++ (partially, see https://libcxx.llvm.org/Status/PSTL.html), but forgot when I linked against libc++ last time...


cyber_kinetist 942 days ago

Incomplete support from Clang’s STL (especially in Apple Clang).


Arelius 942 days ago

Yes, but actually small can often be much larger than 100-300 depending on the specifics. Programmers often vastly underestimate how fast cache and prefetch can be compared to complicated data structures.


jcelerier 942 days ago

... Or much smaller. I remember a benchmark for a case I had a few years ago: the cutoff I had for a map being faster than std::vector/array & linear probing was closer to N=10 than 100.

(Not std::map, at the time it must have been something like tsl::hopscotch_map).

Note also that nowadays, for instance, Boost comes with state-of-the-art flat_map and unordered_flat_map, which give both the cache locality for small sizes and the algorithmic characteristics of various kinds of maps.
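
To make the "flat" idea concrete, here is a minimal hand-rolled sketch of a sorted-vector map. It is not Boost's implementation and the class name `FlatIntMap` is hypothetical. It only shows the layout: all pairs live in one contiguous vector, lookups use binary search, and inserts pay for shifting elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical minimal flat map: key/value pairs kept sorted in one vector.
class FlatIntMap {
public:
    // Inserts or overwrites. O(n) in the worst case because later
    // elements may have to shift to make room.
    void insert_or_assign(int key, int value) {
        auto it = std::lower_bound(items_.begin(), items_.end(), key, key_less);
        if (it != items_.end() && it->first == key) {
            it->second = value;
        } else {
            items_.insert(it, std::make_pair(key, value));
        }
        // Invariant: the vector stays sorted by key (keys are unique).
        assert(std::is_sorted(items_.begin(), items_.end()));
    }

    // Binary search over contiguous memory, O(log n).
    std::optional<int> find(int key) const {
        auto it = std::lower_bound(items_.begin(), items_.end(), key, key_less);
        if (it != items_.end() && it->first == key) return it->second;
        return std::nullopt;
    }

    std::size_t size() const { return items_.size(); }

private:
    // Comparator in the (element, value) form that std::lower_bound expects.
    static bool key_less(const std::pair<int, int>& item, int key) {
        return item.first < key;
    }

    std::vector<std::pair<int, int>> items_;
};
```

The trade-off is visible in the two methods: iteration and lookup touch contiguous memory, while insertion in the middle costs a shift, so this shape suits read-heavy, smallish maps.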


gpderetta 942 days ago

Exactly. I measured vector vs an efficient linear probing hash map very recently and the cutoff was single digit. Even against unordered_map or plain std::map the cutoff was surprisingly low (although in this case I would trust a synthetic benchmark significantly less).


Arelius 942 days ago

Sure, the point here is that with all the context-specific details, you're most likely comparing apples to oranges. So simple complexity analysis, or a general rule without benchmarks and a good understanding of the system and how it interacts with your details, is not going to solve your problem.


menaerus 942 days ago

So you're saying that if I had to store 100 elements in memory, I would be better off using a hash map instead of a vector/array? What type of elements did you use in your experiment, how large were they, and what was your access pattern?


Someone 942 days ago

A successful search in a vector will do, on average, 50 comparisons, while the hash map version would hash the key, look up the bucket, typically find a single item in that bucket (with only 100 items in the hash table, hash collisions will be highly unlikely), and do a single comparison.

For an unsuccessful search, the vector version would do 100 key comparisons, and the hash table would do a single hash, look up the bucket, and almost certainly find it empty.

So, if you make the comparison function relatively expensive, I can see the hash map being faster at search.

Even relatively short string keys might be sufficient here, if the string data isn’t stored inline. Then, the key comparisons are likely to cause more cache misses than accessing a single bucket.

Of course, the moment you start iterating over all items often, the picture will change.
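
The comparison counts above can be checked with a small sketch. The function names are hypothetical, and it only counts key comparisons in an unsorted linear scan. It says nothing about wall-clock time, which is what the rest of the thread argues about.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns how many key comparisons a linear scan performs before it stops,
// either at the first match or at the end of the vector.
std::size_t linear_search_comparisons(const std::vector<int>& keys, int target) {
    std::size_t comparisons = 0;
    for (int k : keys) {
        ++comparisons;
        if (k == target) break;
    }
    return comparisons;
}

// Average comparison count over one successful search per stored key.
double average_successful_comparisons(const std::vector<int>& keys) {
    assert(!keys.empty());  // an average over zero searches is undefined
    std::size_t total = 0;
    for (int k : keys) total += linear_search_comparisons(keys, k);
    return static_cast<double>(total) / static_cast<double>(keys.size());
}
```

For 100 distinct keys the successful average comes out at (n + 1) / 2 = 50.5, and a miss costs all 100 comparisons, which matches the rough figures in the comment.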


menaerus 941 days ago

Searching the vector is literally incrementing a pointer over the data. The number of instructions needed to do the search is very small - e.g. ~15. This means that it can very easily fit into the CPU uOp cache but also makes it a candidate for the LSD cache. Both of those will be a major factor in hiding the latencies or getting rid of them in the CPU frontend fetch-decode pipeline, effectively making all the for-loop iterations only left to be bound by the CPU backend, or more specifically, branch-mispredictions (aka ROB flushing) and memory latencies.

Given the predictable access nature of vectors and their contiguous layout in the memory, the CPU backend will be able to take advantage of those facts and will be able to hide the memory latency (even within the L1+L2+L3 cache) by pre-fetching the data on consecutive cache-lines just as you go through the loop. Accessing the data that resides in L1 cache is ~4 cycles.

The "non-branchiness" of such code will make it predictable and as such will make it a good use of BTB buffers. Predictability will prevent the CPU from having to flush the ROB and hence flushing the whole pipeline and starting all over again. The cost of this is one of the largest there are within the CPU and it is ~15 cycles.

OTOH searching the open-addressing hashmap is a superset of that - e.g. almost as if you're searching over an array of vectors. So the search code alone is: (1) several times larger, (2) much more branchy, (3) less predictable and (4) less cache-friendly.

Algorithmically speaking, yes, what you're saying makes sense, but I think the whole picture can only be made once the hardware details are also taken into account. Vector approach will literally be only bound by the number of cycles it takes to fetch the data from L1 cache and I don't see that happening for a hash-map.


maccard 942 days ago

No, benchmark it for your particular type, and decide based on that.
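
A rough starting point for such a benchmark might look like the sketch below. Everything in it is an assumption for illustration: the names `LookupTiming` and `time_lookups` are hypothetical, `int` keys stand in for your real type, and `std::unordered_set` stands in for whichever hash container you are evaluating. A serious measurement would also need warm-up, realistic data, and a proper benchmarking harness.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hit counts are returned alongside timings so that the lookups cannot be
// optimized away and so both containers can be checked for agreement.
struct LookupTiming {
    std::size_t vector_hits;
    std::size_t hash_hits;
    std::chrono::nanoseconds vector_time;
    std::chrono::nanoseconds hash_time;
};

LookupTiming time_lookups(const std::vector<int>& keys,
                          const std::vector<int>& queries, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    const std::unordered_set<int> set(keys.begin(), keys.end());
    LookupTiming r{0, 0, {}, {}};

    const auto t0 = clock::now();
    for (int i = 0; i < repeats; ++i)
        for (int q : queries)  // linear scan of the contiguous vector
            r.vector_hits += (std::find(keys.begin(), keys.end(), q) != keys.end());
    const auto t1 = clock::now();
    for (int i = 0; i < repeats; ++i)
        for (int q : queries)  // hash, bucket lookup, key comparison
            r.hash_hits += set.count(q);
    const auto t2 = clock::now();

    r.vector_time = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    r.hash_time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
    return r;
}
```

Running it with your own key type, element count, and hit/miss ratio is the point. The crossover reported in this thread ranges from single digits to several hundred, so numbers from someone else's setup are unlikely to transfer.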


menaerus 942 days ago

I think you're missing my point. I'm highly suspicious, or let's say intrigued, about the conditions under which one can come to such a conclusion. Therefore I asked for clarification.


CyberDildonics 941 days ago

vectors (or arrays) outperform everything else if the dataset is small. Small is usually in the order of 100-300, but can vary wildly.

This is a very poor way to choose a data structure. How many items you want to store is not what someone should be thinking about.

How you are going to access it is what is important. Looping through it - vector. Random access - hash map. These two data structures are what people need 90% of the time.

If you are putting data on the heap it is already because you don't know how many items you want to store.


maldev 942 days ago

Isn't that because a view performs the procedure and access when the data is accessed? It's mainly meant so that you don't have to load all this memory in or stall when accessing the view, and if you don't need that capability, the regular STL algorithms are better.


papichulo2023 942 days ago

iirc, on MSVC, the parallel execution policy delegates to the OS (Windows) to decide how many threads to create, and it is usually more than the total number of vCPUs, contrary to the usual recommendation.


harrid 942 days ago

It's very much implementation-defined, yes. I'm currently using this for something that runs for about 10 seconds, and even music playback and mouse cursor movement get affected.

(But I'm about to move it to GPU)


varjag 942 days ago

Not quite. Many other data structures can be shoehorned into contiguous, cache-efficient representations.


01100011 942 days ago

I rarely meet a CS major who will accept that small things will always stay small. They will frequently talk you into more complex data structures with the assurance that you are just too stupid to realize your problem will suddenly need to scale several orders of magnitude. Do they teach you in CS-101 to interpret '+' as the exponential operator? It often feels that way.


pca006132 942 days ago

Are they fresh graduates? It is very important to understand the workload distribution for any optimization. Even if small things can sometimes get large, optimizing for the small case can often yield large gains, as small cases may occur frequently. And complex data structures are usually worse in the small case...


baq 942 days ago

There’s computer science and there’s software engineering. The best developers are good at both.

…but in order for this to really matter, communication is required, since even the best developers don’t scale.


elromulous 942 days ago

It's even more true for larger collections. Stroustrup gave a talk on this back at GoingNative 2012. TL;DR: "Use a Vectah" -Bjarne

https://youtu.be/YQs6IC-vgmo


CyberDildonics 941 days ago

A vector should be people's default data structure, but this presentation is bizarre, because it is based on looping through every element of a vector or a list to find the element you want.

This is never a scenario that should happen, because if you are going to retrieve an arbitrary element it should be in a hash map or sorted map.
