Hacker News
by franciscop 583 days ago
I've flamegraph-debugged JS code from time to time, and it usually feels a lot more like a craft of "educated guesses" than the vast majority of the programming I do. I usually only get down to it when there's an actual perf problem, so YMMV, but I'm curious: am I doing JS flamegraph debugging wrong, or is it like this for everyone?

- 20% of the time you get lucky and find a very easy win that speeds things up 90%+. Similar to this post, usually when a single method/call takes a huge chunk of the work.

- 50% of the time you grind at it and can get a 30-50% speedup. I usually try many things, and only some of them make a difference.

- 30% of the time absolutely no luck! Many small calls where each is unavoidable, no repeated code, etc.

2 comments

Keep in mind that there are many layers of complex systems between your JS code and what'll end up happening on your system when it's run.

The code defines what the state should look like after it's done executing. It expresses your intent. But that code gets transformed several times on the way to being executed, and then the hardware can apply many different possible approaches to executing it when the time comes.

More so every year, many of those software transformations, as well as the hardware's execution techniques, are quite aggressive about revisiting your program's intent with optimizations (of some kind) that make sense within that context.

The upshot is that the farther you are from your hardware, the more of these layers there are between your code and its execution, the less influence and insight you have over what actually "physically" happens during execution.

When it comes to profiling and optimization of high-level programs like those written in JavaScript, this means that it can be somewhere between hard and impossible to predict how your code changes will actually impact performance.

Radical algorithm redesign can often yield salient differences that feel largely predictable, but smaller "precision" changes are often going to be a crap shoot. All those layers between you and the hardware were making optimizations already anyway, and your "precision" change may just as easily confound those existing optimizations as trigger some other. The results are tricky.
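To make the "radical algorithm redesign" case concrete, here is a minimal sketch. The task (intersecting two arrays) and the function names are made up for illustration; the point is only that swapping a quadratic scan for a hash lookup is the kind of change whose effect tends to be predictable.

```javascript
// Hypothetical example: finding the values two arrays have in common.

// Quadratic: for every element of `a`, `includes` scans `b` from the start.
function intersectSlow(a, b) {
  return a.filter((x) => b.includes(x));
}

// Linear on average: build a Set once, then each lookup is O(1) on average.
function intersectFast(a, b) {
  const seen = new Set(b);
  return a.filter((x) => seen.has(x));
}

const evens = Array.from({ length: 2000 }, (_, i) => i * 2); // 0, 2, 4, ...
const threes = Array.from({ length: 2000 }, (_, i) => i * 3); // 0, 3, 6, ...

// Both versions agree on the result; only the amount of work differs.
console.log(intersectFast(evens, threes).length === intersectSlow(evens, threes).length); // true
```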

This is even true in lower-level code, where we're encouraged to do things like inspect compiler output (on godbolt or from our own builds) and always confirm our expectations with a profiler (which often proves our guesses wrong). But it's all that much more pronounced in high-level languages.

So ultimately, yes, assuming your prevailing algorithms are generally optimal, profiling and optimization is almost always going to feel like a guess-and-test process. But that's okay, because you can test and those tests are usually (not always) telling you if you've made a meaningful difference or not.
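A rough sketch of what the "test" half of guess-and-test can look like. The helper name `medianTimeMs`, the run counts, and the two candidate functions are all assumptions for illustration, not a recommended benchmarking methodology; it relies only on the standard `performance.now()` timer.

```javascript
// Times `fn` several times and returns the median duration in milliseconds.
// `runs` and `warmup` are arbitrary illustrative defaults.
function medianTimeMs(fn, { runs = 15, warmup = 5 } = {}) {
  for (let i = 0; i < warmup; i++) fn(); // give the JIT a chance to optimize first
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((x, y) => x - y);
  return samples[Math.floor(samples.length / 2)];
}

// Hypothetical "before" and "after" versions of the same computation.
const data = Array.from({ length: 100000 }, (_, i) => i);
const before = () => data.map((x) => x * 2).reduce((sum, x) => sum + x, 0);
const after = () => {
  let sum = 0;
  for (let i = 0; i < data.length; i++) sum += data[i] * 2;
  return sum;
};

console.log("before:", medianTimeMs(before).toFixed(3), "ms");
console.log("after: ", medianTimeMs(after).toFixed(3), "ms");
```

The median is used instead of the mean so that a single GC pause or scheduler hiccup does not dominate the number. The timings themselves will vary by machine and runtime.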

I do low level systems programming, so pretty different from JS-land, but I feel the techniques you should apply when doing optimization generally apply at any level/language.

0) algorithmic improvement. Obvious shit like do a quick sort instead of bubble sort (assuming N > 64, or whatever), not doing unnecessary work in a hot loop, etc

1) reduce memory footprint. The slowest part of your program is almost always just waiting for memory, unless you're doing something that's heavily CPU bound. Web applications are probably always memory bound. Reducing the amount of memory the function you're optimizing operates on reduces DCache misses, which are expensive.

2) Do batch operations. Once I've got something to a point where it's not completely braindead (which, honestly, is where I stop most of the time), I look to start batching things. I usually look to do 8 or 16 at a time in the hopes the compiler/runtime can make some use of SIMD. Use STATIC LOOPS, i.e. (for 0..8), so the compiler can unroll the loop. That's extremely important.

3) probably unavailable (unless you want to/can drop into WASM), but the next step is usually SIMD. This is a rabbit hole, but if you want/need another ~8x perf improvement, this is how to get it

4) once all that's done, it's probably close to optimal in terms of cycles per element (unless I did something boneheaded, which is common). Last step is to multithread it if it needs even more juice. This can range from trivial to completely impossible depending on the algorithm. In JS land, you need to make sure you operate on SharedArrayBuffers when doing multithreading for performance, because web workers copy the input/output values by default.
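As a sketch of steps 1 and 2 in JS terms: a typed array keeps the numbers densely packed (8 bytes each for `Float64Array`), and the inner loop has a constant trip count. The function name and the batch width of 8 are assumptions for illustration. Whether a given JS engine actually unrolls or vectorizes this is not guaranteed, so treat it as something to measure, not a promise.

```javascript
const LANES = 8; // fixed batch width, picked to match the "8 or 16 at a time" suggestion

// Sums a Float64Array in fixed-width batches, then handles the leftover tail.
function sumBatched(values) {
  // Eight independent accumulators, one per position in the batch.
  const acc = new Float64Array(LANES);
  const full = values.length - (values.length % LANES);

  for (let i = 0; i < full; i += LANES) {
    // Constant trip count: a candidate for unrolling by the runtime.
    for (let lane = 0; lane < LANES; lane++) {
      acc[lane] += values[i + lane];
    }
  }

  let total = 0;
  for (let lane = 0; lane < LANES; lane++) total += acc[lane];
  for (let i = full; i < values.length; i++) total += values[i]; // leftover elements
  return total;
}

// 1003 elements: 125 full batches of 8 plus a tail of 3.
const values = Float64Array.from({ length: 1003 }, (_, i) => i);
console.log(sumBatched(values)); // 502503, the sum of 0..1002
```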

Anywhoo.. maybe that helps.

When I try to optimize something lightly, it's not uncommon for me to get a 10x improvement fairly easily. When I optimize something to within an inch of its life, I can sometimes get three or even four orders of magnitude faster.

EDIT: I forgot to mention that for tight performance, avoid branches. This means ifs, switches, loops, gotos, etc. Sometimes you need branches, but mispredicted branches can be extremely costly, causing pipeline stalls and flushes. This is why using a static loop is important; so the compiler can unroll it and not use a branch.
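A small sketch of what trading a data-dependent `if` for arithmetic can look like in JS. The function names are invented for the example. Note the loop condition is still a branch (a well-predicted one), and a JIT may compile either version to the same machine code, so this only shows the shape of the idea and needs measuring.

```javascript
// Branchy version: one data-dependent conditional per element.
function countAboveBranchy(values, threshold) {
  let count = 0;
  for (let i = 0; i < values.length; i++) {
    if (values[i] > threshold) count++;
  }
  return count;
}

// Arithmetic version: the comparison result (true/false) is coerced to 1/0
// with `| 0` and added unconditionally.
function countAboveBranchless(values, threshold) {
  let count = 0;
  for (let i = 0; i < values.length; i++) {
    count += (values[i] > threshold) | 0;
  }
  return count;
}

const samples = Int32Array.of(1, 5, 3, 9);
console.log(countAboveBranchy(samples, 4), countAboveBranchless(samples, 4)); // 2 2
```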

I also should mention that I hate flamegraphs. They only give you a bare minimum amount of information for doing performance work. I'm not sure of a good JS profiler, but what you want to be able to do is mark up the sections of code you want profiled, instead of the profiler taking random samples and squashing them all together. Look at the Tracy profiler for an example.
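For what it's worth, the closest built-in thing in JS to marking up sections by hand is probably the User Timing API (`performance.mark` and `performance.measure`). This is a minimal sketch, not a Tracy equivalent; the `profiled` helper and the `parse-config` label are made up for the example, and it assumes a runtime where `performance.getEntriesByName` is available (browsers and recent Node versions).

```javascript
// Wraps a synchronous function call in User Timing marks and returns its result.
// `label` is a hypothetical section name chosen by the caller.
function profiled(label, fn) {
  performance.mark(`${label}:start`);
  try {
    return fn();
  } finally {
    performance.mark(`${label}:end`);
    // Records a named measure spanning the two marks.
    performance.measure(label, `${label}:start`, `${label}:end`);
  }
}

const parsed = profiled("parse-config", () => JSON.parse('{"retries": 3}'));

// Read back the recorded section; `duration` is in milliseconds.
const [entry] = performance.getEntriesByName("parse-config", "measure");
console.log(parsed.retries, entry.name, entry.duration >= 0);
```

In a browser, measures recorded this way generally also show up in the devtools performance timeline, which gets you named sections instead of anonymous samples.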

> Web applications are probably always memory bound

IO bound and particularly network bound code is common too. The first fix I'd try with network bound code is to either eliminate the network call (local cache? turn a microservice into a library?) or to batch operations.
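Here's a rough sketch of both fixes combined: a local cache so repeat lookups never hit the network, plus coalescing lookups made in the same tick into one batched call. `createBatchedLoader` and `fetchBatch` are hypothetical names; `fetchBatch` stands in for whatever batched endpoint you have and is assumed to return results in the same order as the ids it was given.

```javascript
// Coalesces individual `load(id)` calls made in the same tick into a single
// `fetchBatch(ids)` call, and caches results locally.
// `fetchBatch` is a caller-supplied (hypothetical) function: ids in, results out, same order.
function createBatchedLoader(fetchBatch) {
  const cache = new Map(); // id -> Promise of the result
  let pending = []; // entries of the form { id, resolve, reject }

  function flush() {
    const batch = pending;
    pending = [];
    Promise.resolve()
      .then(() => fetchBatch(batch.map((p) => p.id)))
      .then(
        (results) => batch.forEach((p, i) => p.resolve(results[i])),
        (error) =>
          batch.forEach((p) => {
            cache.delete(p.id); // do not keep failed lookups in the cache
            p.reject(error);
          }),
      );
  }

  return function load(id) {
    if (cache.has(id)) return cache.get(id); // served locally, no network call
    const promise = new Promise((resolve, reject) => {
      if (pending.length === 0) queueMicrotask(flush); // one flush per batch
      pending.push({ id, resolve, reject });
    });
    cache.set(id, promise);
    return promise;
  };
}

// Usage with a fake backend that counts round trips.
let roundTrips = 0;
const fakeFetchBatch = async (ids) => {
  roundTrips++;
  return ids.map((id) => ({ id, name: `user-${id}` }));
};
const loadUser = createBatchedLoader(fakeFetchBatch);

Promise.all([loadUser(1), loadUser(2), loadUser(1)]).then((users) => {
  console.log(users.length, roundTrips); // 3 results from 1 round trip
});
```

A real version would also need cache invalidation and a cap on batch size, both left out here to keep the sketch short.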

> Last step is to multithread it if it needs even more juice

In web app land, this is fraught with peril if you're doing it on the server, because it means your code is now competing for n times the resources. Often it's better for one request to take a long time if it means it's using a more predictable amount of memory, not causing other requests to time out, not exhausting your DB connection pool, etc.

I imagine that systems programming is similar in some ways and that's why multithreading is the last resort. I'm just mentioning it because it's easy to shoot oneself in the foot with parallelism.

Some good points, thanks for the insight :)

Yeah, when I multithread something I pretty much assume that I can hog the whole machine for the time slice the job is going to be running. Said another way, I assume there will be a small number of large jobs running on the machine at any given time, which attempt to saturate the machine. Typically dispatched to a core-locked threadpool of some sort.

Definitely agree with the sentiment that multithreading is hard. Especially when trying to get every last drop..