by mraleph 5085 days ago
Optimized code can't deal with a uint32 that is out of the int32 range, so it deoptimizes, and you effectively end up running entirely unoptimized code, which is obviously much slower than optimized code. Of course this penalizes V8 even more, because V8 can keep unboxed doubles and int32s only in optimized code.

It is true that if V8 were using NaN-tagging it would suffer less from running entirely unoptimized code. But my point is: it doesn't have to use NaN-tagging to run this code efficiently, it just needs to ensure it doesn't deoptimize for nothing.
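To make the uint32 point concrete, here is a minimal sketch of the kind of value involved. It is not the code from Issue 2097, and it assumes "out of range" means "outside the int32 range". The helper names are hypothetical. In JavaScript, `x >>> 0` produces a uint32, and anything at or above 2^31 no longer fits in an int32, even though it is still exactly representable as a double.

```typescript
// Illustrative helpers only; the names are hypothetical and nothing here is V8 API.
const INT32_MAX_VALUE = 2147483647; // 2^31 - 1

// `>>> 0` reinterprets the low 32 bits of a number as an unsigned integer.
function asUint32(x: number): number {
  return x >>> 0;
}

// `| 0` truncates to a signed 32-bit integer, so a value survives the
// round trip only if it already lies in the int32 range.
function fitsInt32(x: number): boolean {
  return (x | 0) === x;
}

const smallValue = asUint32(123); // 123, fits in int32
const largeValue = asUint32(-1);  // 4294967295, a valid uint32 outside the int32 range

console.log(smallValue, fitsInt32(smallValue));
console.log(largeValue, fitsInt32(largeValue), largeValue > INT32_MAX_VALUE);
```

Code specialized on the assumption that a value is an int32 has no way to hold `largeValue`, so it has to either fall back to a wider representation or bail out.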

Ok, I see.

So, out of the 10-15x slowdown, how much is due to deoptimizing and how much is due to not NaNboxing, if it's possible to estimate that? The distinction should be important in the case of a very large codebase whose performance is not concentrated in a few small loops, so presumably most of the time you will be running unoptimized code (and then whether you NaNbox or not becomes important).

I am not sure I entirely understand. If you are running cold code, then performance does not matter and you can tolerate quickly allocating a small number of boxes, which will be reclaimed just as quickly by the scavenger once you are done with them. If you are running hot code, then it should be optimized in a way that minimizes the number of boxes produced.

In other words: ideally an application should be running unoptimized code if and only if the code is either cold or cannot be improved by optimization; all other cases are bugs.

I can't split the 10-15x between deoptimization and boxing because for V8 the cost of an "erroneous" deoptimization includes the cost of boxing, as you can't have unboxed numbers in unoptimized code.

As I said earlier, it is true that non-optimized code heavily manipulating doubles could become faster if V8 used NaN-tagging (or another technique that would allow it to maintain unboxed doubles on unoptimized frames). But the speed of unoptimized code should not matter (see above).
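For readers unfamiliar with NaN-tagging, here is a rough sketch of the idea. The layout is hypothetical: it is not V8's (which, per this thread, does not use NaN-tagging) and is not copied from any particular engine. Every slot is 8 bytes. A real double is stored as-is, and non-double values hide inside bit patterns that would otherwise be NaNs, so a double never needs a separate heap box.

```typescript
// Hypothetical layout for illustration only; tag values and names are assumptions.
const SLOT_BYTES = 8;
const TAG_INT32 = 0xfff80001;        // high word inside the NaN space, reserved as a tag
const CANONICAL_NAN_HI = 0x7ff80000; // the only NaN pattern stored for real doubles

type SlotValue =
  | { kind: "double"; value: number }
  | { kind: "int32"; value: number };

function newSlot(): DataView {
  return new DataView(new ArrayBuffer(SLOT_BYTES));
}

// DataView reads and writes big-endian by default, so offset 0 is the high word.
function storeDouble(slot: DataView, value: number): void {
  if (isNaN(value)) {
    // Canonicalize NaN so a real NaN can never collide with a tag.
    slot.setUint32(0, CANONICAL_NAN_HI);
    slot.setUint32(4, 0);
  } else {
    slot.setFloat64(0, value);
  }
}

function storeInt32(slot: DataView, value: number): void {
  slot.setUint32(0, TAG_INT32);
  slot.setInt32(4, value | 0);
}

function loadSlot(slot: DataView): SlotValue {
  if (slot.getUint32(0) === TAG_INT32) {
    return { kind: "int32", value: slot.getInt32(4) };
  }
  return { kind: "double", value: slot.getFloat64(0) };
}

const demoSlot = newSlot();
storeDouble(demoSlot, 1.5);
console.log(loadSlot(demoSlot)); // a double, stored inline with no box
storeInt32(demoSlot, -7);
console.log(loadSlot(demoSlot)); // an int32 hidden in the NaN space
```

The sketch works because no non-NaN double has a high word equal to `TAG_INT32`, and NaNs are canonicalized on the way in.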

Another thing to keep in mind is that for NaN-tagging on ia32 you pay with memory overhead: every object slot that can contain a primitive number becomes twice as large on ia32. This is not nice if you don't have a lot of numbers floating around.
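A back-of-the-envelope sketch of that trade-off follows. The 4-byte and 8-byte slot sizes follow from the comment above. The 12-byte size for a boxed double (a 4-byte header plus an 8-byte payload) is an assumption made for illustration, and the function names are hypothetical.

```typescript
// Sizes in bytes on a 32-bit target. ASSUMED_BOX_BYTES is an assumption, not a measured value.
const POINTER_SLOT_BYTES = 4;    // tagged pointer or small integer
const NAN_TAGGED_SLOT_BYTES = 8; // every slot must be able to hold a full double
const ASSUMED_BOX_BYTES = 12;    // assumed: 4-byte header + 8-byte double payload

// Pointer tagging: small slots, but each double lives in its own heap box.
function taggedPointerBytes(slotCount: number, boxedDoubles: number): number {
  return slotCount * POINTER_SLOT_BYTES + boxedDoubles * ASSUMED_BOX_BYTES;
}

// NaN-tagging: no boxes, but every slot is twice as wide.
function nanTaggedBytes(slotCount: number): number {
  return slotCount * NAN_TAGGED_SLOT_BYTES;
}

console.log(taggedPointerBytes(12, 0), nanTaggedBytes(12));  // 48 96
console.log(taggedPointerBytes(12, 4), nanTaggedBytes(12));  // 96 96
console.log(taggedPointerBytes(12, 12), nanTaggedBytes(12)); // 192 96
```

Under these assumed sizes, NaN-tagging only breaks even once about a third of the slots hold doubles; with few doubles, the wider slots are pure overhead.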

Overall, let me reiterate: I am not arguing against NaN-tagging. I am just clarifying that Issue 2097 is caused by a wrong decision in the Hydrogen pipeline, not by the fact that V8 does not use NaN-tagging.

I see now what you are saying about that issue: the lack of NaNboxing makes it worse, but at its core it is a deoptimization issue. Which is good, I hope this is fixed soon (so emscripten-compiled code runs more consistently across browsers).

> I am not sure I entirely understand. If you are running cold code, then performance does not matter and you can tolerate quickly allocating a small number of boxes, which will be reclaimed just as quickly by the scavenger once you are done with them. If you are running hot code, then it should be optimized in a way that minimizes the number of boxes produced.

Let's say that performance matters in the application, but it is huge in code size and all the code matters, not a few small parts. Would you call all the code hot, and would v8 optimize the entire application? (i.e., how is 'hot' defined in v8?)

Hot is currently defined as "function called more than X times" and "function contains a loop that took a backedge more than Y times". So it is defined on a per-function basis.

I can hardly speculate about how V8 will behave on some abstract application. That really depends heavily on what the code looks like. But ultimately V8 will try to optimize everything that falls under the criteria outlined above.
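The per-function criteria can be sketched roughly as follows. This is a toy model, not V8's profiler. The thresholds stand in for the unspecified X and Y, all names are hypothetical, and it assumes that meeting either criterion is enough to mark a function hot.

```typescript
// Placeholder thresholds; the thread does not give the real values of X and Y.
const CALL_THRESHOLD = 3;       // stands in for X
const BACKEDGE_THRESHOLD = 100; // stands in for Y

interface FunctionProfile {
  calls: number;
  backedges: number;
}

function newProfile(): FunctionProfile {
  return { calls: 0, backedges: 0 };
}

// Assumption: either criterion alone is sufficient.
function isHot(profile: FunctionProfile): boolean {
  return profile.calls > CALL_THRESHOLD || profile.backedges > BACKEDGE_THRESHOLD;
}

// Simulates one invocation of a function whose body is a single loop,
// counting one backedge per completed iteration.
function simulateInvocation(profile: FunctionProfile, iterations: number): void {
  profile.calls += 1;
  for (let i = 0; i < iterations; i++) {
    profile.backedges += 1;
  }
}

const loopHeavy = newProfile();
simulateInvocation(loopHeavy, 1000); // one call, long loop

const calledOften = newProfile();
for (let i = 0; i < 5; i++) simulateInvocation(calledOften, 0); // many calls, no loop

const rarelyUsed = newProfile();
simulateInvocation(rarelyUsed, 10);

console.log(isHot(loopHeavy), isHot(calledOften), isHot(rarelyUsed)); // true true false
```

Because the counters are per function, a large application with many moderately used functions can keep a lot of code below both thresholds.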

I see, thanks.

I ask because performance of small benchmarks has been quite good, except for the uint32 issue mentioned above, often around 3x slower than native code - but on very large codebases it is often much slower, and I do not know why.

Unfortunately (as you probably know yourself) I don't have a magical answer that would speed everything up. As for V8, there are quite a number of limits that you might be hitting with generated code (e.g. number of locals, size of the function, etc.) and there can be some bugs or not-done-yet thingies affecting performance. Profiling and looking at the generated code (and filing bugs) is the only suggestion I can give here.