| Actually, the author is building on my suggestion https://github.com/soasis/idk/pull/17/files , which I created several years ago but adapted a bit too hastily when submitting. And yes, it looks very similar to GNU nested functions, since I started tinkering with those first. I am really not sure that all the observations in the article are 100% correct, though. First: the code seems to be compiled with clang. On Linux with gcc the native function one is way faster than the clang one. Second: the author runs the code on ARM64/macOS. At least on my Ryzen CPU on Linux with gcc, the "normal C code" is way faster than anything else. That is not to say we do not need to think about "closure"-type functionality, but one should be careful about extrapolating from one compiler on one platform to the rest of the pack. Regarding N3654, I am not sure how to benchmark it here: C could potentially use __builtin_call_with_static_chain , but I am not sure how to write the function so that it uses the chain to access the variables. I tried to estimate N3654 by using "tinygo", which AFAIK uses the usual calling ABI, but it was a factor of two slower than clang. Even "go" with its very specific ABI is still much slower. I then discovered that this isn't representative, since the runtime calling costs were completely overshadowed by the cost of allocations. Even the Rust example I usually use, http://www.reddit.com/r/rust/comments/2t80mw/the_man_or_boy_... , is much slower than anything else, presumably because of the "Cell" it needs. TL;DR: this micro benchmark might be misleading. |
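For reference, here is roughly what a "normal C code" variant can look like. It is only a sketch: it assumes the benchmark in question is Knuth's man-or-boy test (as the Rust link suggests), and the names `thunk`, `eval` and `man_or_boy` are made up for illustration, not taken from the article or the PR. The captured variables are passed explicitly through a struct, so no trampoline or allocation is involved.

```c
#include <stdio.h>

/* Hand-written "closure": a code pointer plus the variables it captures.
   All names here are hypothetical, chosen only for this sketch. */
typedef struct thunk thunk;
struct thunk {
    int (*fn)(thunk *self);      /* code to run when the thunk is evaluated */
    int *k;                      /* counter of the enclosing A() invocation */
    thunk *x1, *x2, *x3, *x4;    /* captured arguments of that invocation */
};

static int eval(thunk *t) { return t->fn(t); }

/* Constant thunks: they ignore their environment. */
static int one(thunk *self)       { (void)self; return 1; }
static int minus_one(thunk *self) { (void)self; return -1; }
static int zero(thunk *self)      { (void)self; return 0; }

static int A(int k, thunk *x1, thunk *x2, thunk *x3, thunk *x4, thunk *x5);

/* B decrements the k of the A() call that created it, then recurses. */
static int B(thunk *self)
{
    *self->k -= 1;
    return A(*self->k, self, self->x1, self->x2, self->x3, self->x4);
}

static int A(int k, thunk *x1, thunk *x2, thunk *x3, thunk *x4, thunk *x5)
{
    if (k <= 0) {
        /* Evaluate left to right explicitly; C leaves the order of
           the operands of + unspecified. */
        int a = eval(x4);
        int b = eval(x5);
        return a + b;
    }
    /* The environment lives on the stack of this call. */
    thunk b = { B, &k, x1, x2, x3, x4 };
    return eval(&b);
}

int man_or_boy(int k)
{
    thunk p1 = { .fn = one };
    thunk m1 = { .fn = minus_one };
    thunk z0 = { .fn = zero };
    return A(k, &p1, &m1, &m1, &p1, &z0);
}
```

Calling `man_or_boy(10)` returns -67, the usual reference value for k = 10. Since every environment sits on the stack, the timing is dominated by the indirect calls themselves, which is probably why this variant is hard to beat.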
A trick one can use is to let it create the trampoline and then read the two pointers from the position in the trampoline code where they are stored. This is not portable and you still have the overhead of creating the trampoline, but you no longer need an executable stack.