Hacker News new | ask | show | jobs
by jacquesm 5159 days ago
> The prefetcher knows what a linked list looks like

That's some pretty advanced magic.

Even the compiler has very limited insight into what your code is actually doing without simulating it, the prefetcher might be able to look a bit ahead in the execution stream and do branch prediction but absolutely no way does that extend to knowing stuff about your data structures.

Unless I have just been transported by a time warp I really think this is fiction.

If you're thinking of cache pre-fetching that actually has a really hard time dealing with stuff like linked lists because it has absolutely no idea at all about the data structure it is looking at. The 'next' and 'previous' pointers in the linked list might actually simply be values without any significance at all. And if they are dereferenced as pointers then that memory could be just about anywhere within the valid address range.

For arrays on the other hand such pre-fetching can be useful.

1 comments

> Unless I have just been transported by a time warp I really think this is fiction.

Nope, the Intel guys and gals do this kinda magic day in, day out. Or at least the chips they manufacture do. I'm not familiar with the internals of the prefetcher of any CPU at this level, but let me wave hands here. This is what the prefetcher could do:

All it takes is for the prefetcher to get a cache line when requested, and then observe what is inside and look at pointer-sized and aligned values that look like pointers and are pretty close to the original cache line's (virtual) address. Now if these values happen to be sane virtual addresses in the current process, the prefetcher might as well fetch them one cache closer to the CPU. If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity.

All modern CPUs do dirty little tricks like this if it helps them outshine their competitors.

Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.

"If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity."

Caches have limited size. If it misses, it also evicts something else from the cache. If that is what is actually needed, this costs performance.

"and are pretty close to the original cache line's (virtual) address"

Why does it have to be 'pretty close'?

"Now if these values happen to be sane virtual addresses in the current process"

That sanity check would involve visiting the paging tables, so it would require at least two indirections (http://lwn.net/Articles/253361/). If a cache line is 16 bytes, you would have at least 4 positions where a 32-bit pointer could be present. So, at least four times two memory lookups would be needed. I think all of them would go through the same cache, but even assuming that the CPU has ways of signalling that it should not recurse, I do not think it is practical to do what you describe (disclaimer: I am not an expert on CPU design)

What is possible is to guess at where data is to be found. That allows CPUs to read and speculatively execute instructions from the physical memory that they think backs the virtual address of the PC while they, in parallel, do the lookup to verify that. See http://dl.acm.org/citation.cfm?id=2000101&dl=ACM&col.... I do not know whether this has made it into actual CPUs, though.

Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?

I know a bit about prefetch, and it's variants, but haven't directly used in 10 years (I was on a software project that had one of the first Katmai's, back in 1999 to do some bilinear/bicubic texture filtering for a media/drawing application, and had to code extensively in mmx assembly back then).

> Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?

edit: try to look for performance tuning guides for your CPU architecture, there might be some details about systems like this.

Nope. Some might exist but this stuff is generally considered trade secrets. A patent search might yield something. You can find some material about speculative execution, branch prediction, prefetching, etc but the real beef is hidden somewhere in Intel's (and their competitors') vaults.

It's all supposed to be transparent to the programmer so there's no need to write and release detailed public documentation about it.

As I said earlier, I have no idea how they work (or do they even exist), but I'll give a handwaving example of how they _could_ work.

> I'm not familiar with the internals of the prefetcher of any CPU at this level

So better not speculate, what you think could be done when you are not familiar with the details typically tends to be a lot harder when you are familiar with the details and are tasked with the implementation. The devil is in the details and there is a very appreciable gap between theory and practice.

Further reading: http://www.futurechips.org/chip-design-for-all/prefetching.h...

Note the caution against using lists and trees and an advice to stick to arrays, which are typically accessed sequentially.

> Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.

And that is exactly the point, if you don't supply the knowledge neither the processor nor the compiler can do this for you. If the processor or the compiler could do this then those __builtin_prefetch calls would not be required.

hardware prefetching mechanism is in core i7 that i know for sure, it automatically identify memory referencing pattern and attempts to fetch those blocks into cache before they are accessed, the detailed algorithm of prefetching are not documented...