Hacker News new | ask | show | jobs
by seanmcq 5154 days ago
Please profile tricks like this, as they may actually be significantly slower than their naive counterparts on modern hardware. The prefetcher knows what a linked list looks like, and it knows how to get it somewhere closer than main memory before the nodes are needed.
5 comments

> The prefetcher knows what a linked list looks like

That's some pretty advanced magic.

Even the compiler has very limited insight into what your code is actually doing without simulating it, the prefetcher might be able to look a bit ahead in the execution stream and do branch prediction but absolutely no way does that extend to knowing stuff about your data structures.

Unless I have just been transported by a time warp I really think this is fiction.

If you're thinking of cache pre-fetching that actually has a really hard time dealing with stuff like linked lists because it has absolutely no idea at all about the data structure it is looking at. The 'next' and 'previous' pointers in the linked list might actually simply be values without any significance at all. And if they are dereferenced as pointers then that memory could be just about anywhere within the valid address range.

For arrays on the other hand such pre-fetching can be useful.

> Unless I have just been transported by a time warp I really think this is fiction.

Nope, the Intel guys and gals do this kinda magic day in, day out. Or at least the chips they manufacture do. I'm not familiar with the internals of the prefetcher of any CPU at this level, but let me wave hands here. This is what the prefetcher could do:

All it takes is for the prefetcher to get a cache line when requested, and then observe what is inside and look at pointer-sized and aligned values that look like pointers and are pretty close to the original cache line's (virtual) address. Now if these values happen to be sane virtual addresses in the current process, the prefetcher might as well fetch them one cache closer to the CPU. If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity.

All modern CPUs do dirty little tricks like this if it helps them outshine their competitors.

Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.

"If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity."

Caches have limited size. If it misses, it also evicts something else from the cache. If that is what is actually needed, this costs performance.

"and are pretty close to the original cache line's (virtual) address"

Why does it have to be 'pretty close'?

"Now if these values happen to be sane virtual addresses in the current process"

That sanity check would involve visiting the paging tables, so it would require at least two indirections (http://lwn.net/Articles/253361/). If a cache line is 16 bytes, you would have at least 4 positions where a 32-bit pointer could be present. So, at least four times two memory lookups would be needed. I think all of them would go through the same cache, but even assuming that the CPU has ways of signalling that it should not recurse, I do not think it is practical to do what you describe (disclaimer: I am not an expert on CPU design)

What is possible is to guess at where data is to be found. That allows CPUs to read and speculatively execute instructions from the physical memory that they think backs the virtual address of the PC while they, in parallel, do the lookup to verify that. See http://dl.acm.org/citation.cfm?id=2000101&dl=ACM&col.... I do not know whether this has made it into actual CPUs, though.

Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?

I know a bit about prefetch, and it's variants, but haven't directly used in 10 years (I was on a software project that had one of the first Katmai's, back in 1999 to do some bilinear/bicubic texture filtering for a media/drawing application, and had to code extensively in mmx assembly back then).

> Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?

edit: try to look for performance tuning guides for your CPU architecture, there might be some details about systems like this.

Nope. Some might exist but this stuff is generally considered trade secrets. A patent search might yield something. You can find some material about speculative execution, branch prediction, prefetching, etc but the real beef is hidden somewhere in Intel's (and their competitors') vaults.

It's all supposed to be transparent to the programmer so there's no need to write and release detailed public documentation about it.

As I said earlier, I have no idea how they work (or do they even exist), but I'll give a handwaving example of how they _could_ work.

> I'm not familiar with the internals of the prefetcher of any CPU at this level

So better not speculate, what you think could be done when you are not familiar with the details typically tends to be a lot harder when you are familiar with the details and are tasked with the implementation. The devil is in the details and there is a very appreciable gap between theory and practice.

Further reading: http://www.futurechips.org/chip-design-for-all/prefetching.h...

Note the caution against using lists and trees and an advice to stick to arrays, which are typically accessed sequentially.

> Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.

And that is exactly the point, if you don't supply the knowledge neither the processor nor the compiler can do this for you. If the processor or the compiler could do this then those __builtin_prefetch calls would not be required.

hardware prefetching mechanism is in core i7 that i know for sure, it automatically identify memory referencing pattern and attempts to fetch those blocks into cache before they are accessed, the detailed algorithm of prefetching are not documented...
The prefetcher knows what a linked list looks like

That seems unlikely to me. Which processors do this?

Prefetching when a register contains what looks like a memory address at least seem possible
You can't tell whether a register contains a memory address or not after it has just been fetched right up until the moment that it is used. If the instruction stream contains instructions that reference that register that are within the look-ahead window of the instruction pipeline then maybe this might work, but if there is any conditional code in there or something else that causes the value to be used in a way that is not 'as a pointer' and to access the memory (and not for for instance pointer arithmetic prior to utilization) then such a pre-fetch would likely cause more harm than good by blowing good data out of the cache and replacing it with useless data (at a significant penalty).

There has been some research on this subject with respect to compiler optimizations but as far as I know this is not at all done at the hardware level.

Pretty much everything looks like a memory address. There is currently a memory address "hole" on x86_64 which doesn't, but any number between -140737488355328 and 140737488355327 can represent a virtual memory address. Prefetching any occurrance of any of those would be insane.

The prefetcher detects patterns. If you start accessing memory at N byte intervals, it will start prefetching N, 2N, 3N etc. ahead. So if you allocate your linked list elements from contiguous (virtual) memory, it will be able to prefetch elements. But once you start jumbling the list up, that will fall apart.

Yes, but imagine the bus traffic if you tried that: every commited instruction would require a TLB read to see if it "looked like" an address. The poor TLB is overcommited as it is (Intel CPUs can do three memory operations in a clock). All prefetch implementation I'm aware of simply do sequential access (positive or negative) prediction, sometimes with stride detection, and that's it.
Note: the CPU knows your virtual address spaces and a whole lot more. So "looking like a pointer" is not just a 64 bit hexadecimal value, but the CPU has the possibility of checking whether it's a sane virtual address and other heuristics.

Also, if you make an incorrect prefetch decision, it's not a big deal. Just one wasted cache line.

As best I understand it, while instruction prefetch is a complex affair, data prefetch is limited to detecting incrementing sequential address access, unless directed through a PREFETCH op. I can't a find a good reference, but I believe at best the CPU can only optimize by detecting stride.

In no circumstance will the CPU magically understand the semantics of some data structure.

Pre-fetchers rely on two metrics to determine what to cache - temporal locality, meaning it will cache things that it judges to be frequently accessed, and spatial locality, meaning that when you access something at memory address X, the cache will fill one cache line with the contents of consecutive memory locations i.e X,X+1,X+2,..,X+s.

Hence, they have no capability to understand a structure like a linked list and will not cache it intelligently. All you can rely on is that they will cache arrays and traverse them in a way that efficiently uses the cache.

Note though that speed isn't all that matter, memory usage does too, especially in constrained environments.