by MrCroxx 4 days ago
Author here. This post is a write-up of a performance-debugging rabbit hole I hit while trying to saturate NICs with NVMe reads using io_uring and RDMA.

The short version: READ_FIXED fixed the obvious per-I/O GUP overhead in a small demo, but the larger deployment still got stuck at roughly half of line rate. After ruling out io-wq backlog, request splitting, fd lookup, and CRC arithmetic, the actual wall turned out to be dTLB misses from scanning 1,028 KiB buffers backed by 4 KiB pages. Moving the read arena to hugepages brought the system close to NIC saturation.
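To make the page arithmetic concrete, here is a rough back-of-the-envelope sketch of how many distinct pages (and so how many translations) one scan of a buffer covers. The 2 MiB hugepage size is an assumption on my part (the common x86-64 default, not something stated above), and `pages_touched` is just an illustrative helper, not code from the actual system.

```python
KIB = 1024
BUFFER_BYTES = 1028 * KIB      # buffer size quoted above
BASE_PAGE = 4 * KIB            # regular page size
HUGE_PAGE = 2 * KIB * KIB      # assumption: 2 MiB hugepages (common on x86-64)


def pages_touched(buffer_bytes, page_bytes, offset=0):
    """Count distinct pages covered by a buffer that starts `offset` bytes into a page."""
    first = offset // page_bytes
    last = (offset + buffer_bytes - 1) // page_bytes
    return last - first + 1


# Page-aligned buffer: one translation per page touched during the scan.
small = pages_touched(BUFFER_BYTES, BASE_PAGE)
huge = pages_touched(BUFFER_BYTES, HUGE_PAGE)

# Worst case for hugepages: the buffer straddles a hugepage boundary.
huge_straddle = pages_touched(BUFFER_BYTES, HUGE_PAGE, offset=HUGE_PAGE - BASE_PAGE)

print(f"4 KiB pages per buffer: {small}")
print(f"2 MiB pages per buffer: {huge} (aligned), {huge_straddle} (straddling)")
```

Under those assumptions a single buffer scan goes from 257 translations to one or two, which is at least consistent with the dTLB-miss explanation, though it says nothing about the actual miss rates on any particular CPU.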

The funny part is that an AI agent suggested hugepages early and got the optimization right, but its explanation was wrong. This post is mostly about reconstructing the evidence for why it worked.

I’d be very interested in feedback from people who have used AI to debug performance issues in a complex system.