| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by oflebbe 162 days ago

Actually the author is building onto my suggestion https://github.com/soasis/idk/pull/17/files which I created several years ago, but adapted a bit too fast while submitting. And yes it looks very similar to GNU nested functions, since I started tinkering with these first.

I am really not sure if all these observations mentioned in the article are 100% correct, though.

First : Code seems to be compiled with clang. On Linux with gcc the native function one is way faster than the clang one.

Second: The author does run the code on ARM64/MacOS .

At least on my ryzen CPU on Linux with gcc the "normal C code" is way faster than anything else. Not that we do not need to thing about "closure" type functionality, but one should be careful to extrapolate implementations from one compiler on one platform to the rest of the pack.

Regarding N3654 I am not sure how to benchmark it here, since C could potentially use __builtin_call_with_static_chain , but I am not sure how to write the function to use the chain for accessing the variables.

I tried to estimate N3654 it by using "tinygo" which is AFAIK using the usual Calling ABI, but it was a factor of two slower than clang. Even "go" with its very specific ABI is still much slower. I discovered this isn't representative since runtime calling costs had been totally shadowed by costs of allocations.

Even the rust example I am usually using http://www.reddit.com/r/rust/comments/2t80mw/the_man_or_boy_... is much slower than anything else, presumably because of the "Cell" needed

TLDR: This micro benchmark might be misleading

1 comments

uecker 162 days ago

See my other comment in this thread with my preliminary benchmarking results. I have a patched GCC with two new builtins __builtin_static_chain and __builtin_nested_code (needs a better name), that give you the static chain pointer and the code pointer without creating a trampoline. Then I put both into a structure to simulate a wide pointer. Later I call it with __builtin_call_with_static_chain.

A trick one can do is to let it create the trampoline and then read off the two pointer from the position in the code where it is stored. Not portable and you still have the overhead for creating the trampoline, but you do not need the executable stack anymore.

link