by cthalupa 3341 days ago
I'm not a developer, so my answers are different, but I've got a handful:

- While consulting for a customer who was deploying to new hardware with a new processor architecture, I received a report that an application was running slower on the new servers than on the old ones. I started with strace and ltrace, then had to go deeper with perf and systemtap, and found that memory access appeared to be slower than on the old hardware. After researching the processor, I traced this to the 'Intel Scalable Memory Buffers'. Since memory first had to be loaded into the buffer before the CPU could access it, anything not already in the buffer had higher latency, while anything already in the buffer was accessed much more quickly than it would have been previously. I worked with the developers to make up for this performance decrease in other ways. Their application was well suited to hugepages, but they were not using them, and TLB pressure was causing performance bottlenecks in other areas. Switching to hugepages relieved the TLB pressure, and the application ended up performing even better on the new platform, because the larger amount of available memory allowed for a large number of hugepage allocations.

- I was consulting for a customer running instances on a Xen platform. They were having performance issues compared to their old bare metal deployment and had already done some analysis. They gave me a perf report showing a massive amount of time being spent in a specific Xen hypercall. I had to dig into the Xen source code to figure out exactly what that hypercall was doing, as the public documentation about it was somewhat vague. I determined that it bundled up a bunch of different operations, so the report alone wasn't conclusive, but it did narrow down the possibilities and was enough to point me in the right direction. With a little trial and error and some tweaking, I determined that the problem was ultimately related to decisions the NUMA logic was making. It turned out that the customer thought they were doing NUMA node pinning but actually weren't. Interestingly, even with pinning in place we still saw some of this, and completely disabling NUMA (all the way, not just balancing) was needed to fully reclaim the lost performance. I also learned an important lesson about trusting customers: even the ones who know what they're doing aren't always right, and while I should trust them in general, verifying their answers is important. I had discounted NUMA because early on they told me their applications were pinned to nodes; otherwise I would have investigated it sooner and probably solved the issue in less time.
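
For the hugepages item above, a rough way to see why larger pages relieve TLB pressure is to compare how much memory a fixed number of TLB entries can cover at each page size. This is only a sketch: the entry count of 1024 is a made-up figure for illustration (real TLB sizes vary by CPU), and the function names are my own. The parser reads the standard `HugePages_*` and `Hugepagesize` lines from `/proc/meminfo`.

```python
# Sketch: TLB "reach" at different page sizes, plus a parser for the
# hugepage counters in /proc/meminfo. Function names are hypothetical.

KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory that can be mapped without a TLB miss: entries * page size."""
    return entries * page_size_bytes


def parse_hugepage_counters(meminfo_text: str) -> dict:
    """Extract HugePages_* counters and Hugepagesize (kB) from meminfo text."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key.startswith("HugePages_") or key == "Hugepagesize":
            # The first token after the colon is the numeric value.
            counters[key] = int(rest.split()[0])
    return counters


if __name__ == "__main__":
    entries = 1024  # ASSUMPTION: illustrative TLB entry count, not measured
    for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        reach = tlb_reach_bytes(entries, size)
        print(f"{label} pages: {reach // MIB} MiB of TLB reach")
    try:
        with open("/proc/meminfo") as f:
            print(parse_hugepage_counters(f.read()))
    except OSError:
        print("/proc/meminfo is not available on this system")
```

With the same number of entries, 2 MiB pages cover 512 times as much memory as 4 KiB pages, which is the effect the application benefited from once it had enough memory to back the allocations.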
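
For the NUMA item above, the "verify what the customer tells you" step could look something like the following. It is a minimal sketch under assumptions: it only checks CPU affinity as seen from inside a Linux guest, using `Cpus_allowed_list` from `/proc/<pid>/status` and the per-node `cpulist` files under `/sys/devices/system/node/`. It says nothing about pinning done at the hypervisor level, and the helper names are hypothetical, not what I actually ran at the time.

```python
# Sketch: check whether a process's allowed CPUs stay within one NUMA node.
# Helper names are hypothetical; the procfs/sysfs paths are standard Linux.
import glob
import os
import re


def parse_cpulist(text: str) -> set:
    """Parse a kernel list string such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nodes_covering(allowed: set, node_cpus: dict) -> list:
    """Return the sorted node ids whose CPUs overlap the allowed set."""
    return sorted(n for n, cpus in node_cpus.items() if cpus & allowed)


def is_pinned_to_one_node(allowed: set, node_cpus: dict) -> bool:
    """True only if every allowed CPU belongs to the same single node."""
    return len(nodes_covering(allowed, node_cpus)) == 1


def read_allowed_cpus(pid="self") -> set:
    """Read Cpus_allowed_list for a pid from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpulist(line.split(":", 1)[1])
    return set()


def read_node_cpus() -> dict:
    """Map NUMA node id -> set of CPUs, read from sysfs."""
    node_cpus = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node_id = int(re.search(r"node(\d+)", os.path.dirname(path)).group(1))
        with open(path) as f:
            node_cpus[node_id] = parse_cpulist(f.read())
    return node_cpus


if __name__ == "__main__":
    try:
        allowed = read_allowed_cpus()
        topology = read_node_cpus()
        print("allowed CPUs:", sorted(allowed))
        print("nodes touched:", nodes_covering(allowed, topology))
        print("pinned to one node:", is_pinned_to_one_node(allowed, topology))
    except OSError:
        print("procfs/sysfs NUMA information is not available here")
```

Had something like this been run early on, a process that was supposedly pinned but still allowed on CPUs from several nodes would have stood out immediately.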