by twmb 1650 days ago
Are there performance worries about passing around two pointers for anything that needs to allocate, as well as storing these pointers in a struct? AFAICT this basically means two registers are eaten, and a lot of types have effectively 16 bytes of overhead. It seems like this could quickly change the calculus on what fits within cache lines and what doesn't, which people often care about for very high performance code.

I wonder if it's possible to change the compiler to detect that, if what is being passed in the arguments is the global default allocator, the first argument can be stripped and all references inside the function can be replaced with the global pointer. Potentially the same concept could apply to allocators that use thread-local storage. (Perhaps these optimizations already exist?)
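To put rough numbers on the overhead concern, here is a minimal sketch in C. The `Allocator`, `IntList`, and `ManagedIntList` types are hypothetical stand-ins, not taken from any real library, and the 64-byte cache line is an assumption that holds for many but not all CPUs.

```c
#include <stddef.h>

/* Hypothetical two-word allocator handle: a context pointer plus a vtable pointer. */
typedef struct Allocator {
    void *ctx;
    const void *vtable;
} Allocator;

/* A growable list that does not remember its allocator. */
typedef struct IntList {
    int *items;
    size_t len;
    size_t cap;
} IntList;

/* The same list, storing the allocator handle alongside its data. */
typedef struct ManagedIntList {
    int *items;
    size_t len;
    size_t cap;
    Allocator allocator;
} ManagedIntList;

/* Extra bytes paid per list for carrying the handle. */
size_t allocator_overhead(void) {
    return sizeof(ManagedIntList) - sizeof(IntList);
}

/* How many whole structs of the given size fit in an assumed 64-byte cache line. */
size_t lists_per_cache_line(size_t list_size) {
    return 64 / list_size;
}
```

On a typical 64-bit target the handle adds two pointers' worth of bytes, which is enough to drop this particular struct from two per cache line to one.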

2 comments

This is a worry, but it's not as bad as you might initially think. The first thing to notice is that even though the "interface pointer" got fatter, the implementation got much leaner, as it no longer contains the vtable. Vtables are now shared between instances of the same type, so total memory use has gone down, and implementations can be packed more densely. If you're worried about overall cache usage, this is a net positive. The first load, from the fat pointer, is very likely to be in cache. The second load, from the vtable, will be in cache if you have used the same type recently, which is likely: even if you have thousands of objects, you probably do not have thousands of implementations.
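For anyone who wants to see the two layouts side by side, here is a rough sketch in C. Every name in it (`AllocVTable`, `Allocator`, `CountingAlloc`, `EmbeddedCountingAlloc`) is made up for illustration, and the "embedded" struct is only my guess at what "contains the vtable" looks like, not the actual old definition.

```c
#include <stddef.h>
#include <stdlib.h>

/* One table of function pointers per implementation type, shared by all instances. */
typedef struct AllocVTable {
    void *(*alloc)(void *ctx, size_t n);
    void  (*free)(void *ctx, void *p);
} AllocVTable;

/* The fat "interface pointer": two words, instance state plus shared vtable. */
typedef struct Allocator {
    void *ctx;
    const AllocVTable *vtable;
} Allocator;

/* Lean implementation: holds only its own state, no function pointers. */
typedef struct CountingAlloc {
    size_t live; /* number of outstanding allocations */
} CountingAlloc;

static void *counting_alloc(void *ctx, size_t n) {
    CountingAlloc *self = ctx;
    void *p = malloc(n);
    if (p) self->live++;
    return p;
}

static void counting_free(void *ctx, void *p) {
    CountingAlloc *self = ctx;
    if (p) {
        self->live--;
        free(p);
    }
}

/* A single vtable for every CountingAlloc instance. */
static const AllocVTable counting_vtable = { counting_alloc, counting_free };

/* Build the fat pointer for a given instance. */
Allocator counting_allocator(CountingAlloc *a) {
    Allocator out = { a, &counting_vtable };
    return out;
}

/* For comparison only: a layout where each instance carries its own function pointers. */
typedef struct EmbeddedCountingAlloc {
    void *(*alloc)(void *ctx, size_t n);
    void  (*free)(void *ctx, void *p);
    size_t live;
} EmbeddedCountingAlloc;
```

A call goes through `a.vtable->alloc(a.ctx, n)`: one load from the fat pointer, one from the shared vtable. The per-instance struct shrinks by the size of the function pointers it no longer carries.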

There is some additional latency because the virtual function load is now two pointer dereferences instead of one. However, C++ and Go both use double-dereference models like this, and it seems to be working fine for them. Additionally, if virtual calls like this are on your critical fast path, you have bigger problems :P

I believe that "devirtualization", the optimization mentioned in the OP, will do exactly what you're describing: it rewrites the virtual function call as a static call to the allocator when the vtable can be determined from the callsite at compile time.
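As a sketch of what that rewrite amounts to, here is a small C example. The names (`MiniVTable`, `MiniAllocator`, `heap_alloc`, `make_ints`) are hypothetical, and whether a given compiler actually performs this transformation depends on inlining and optimization settings, so treat the "direct" version as what an optimizer may produce, not a guarantee.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct MiniVTable {
    void *(*alloc)(void *ctx, size_t n);
} MiniVTable;

typedef struct MiniAllocator {
    void *ctx;
    const MiniVTable *vtable;
} MiniAllocator;

/* A stateless "global default" allocator backed by malloc. */
static void *heap_alloc(void *ctx, size_t n) {
    (void)ctx; /* no per-instance state */
    return malloc(n);
}

static const MiniVTable heap_vtable = { heap_alloc };

/* Generic code: the call target is only known through the vtable. */
static int *make_ints(MiniAllocator a, size_t count) {
    return a.vtable->alloc(a.ctx, count * sizeof(int));
}

/* Callsite where the vtable is a compile-time constant. */
int *make_ints_default(size_t count) {
    MiniAllocator a = { NULL, &heap_vtable };
    return make_ints(a, count);
}

/* What the callsite above can be reduced to once make_ints is inlined
   and the constant vtable is folded: a plain static call, no indirection. */
int *make_ints_direct(size_t count) {
    return heap_alloc(NULL, count * sizeof(int));
}
```

Both functions behave the same; the difference is only whether the call goes through two loads or straight to `heap_alloc`.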