by j42 3880 days ago
Can anyone comment on whether it would be a good or bad idea to try to implement this in production ASAP?

I'm a bit too far removed from the Linux kernel to be comfortable auditing that myself, but I run some high-volume/low-latency exchange clusters, and the bottleneck on requests/box has always been SYN/ACK negotiation.

I currently work around that by running hundreds of smaller servers, which distributes the network load quite nicely but doesn't come close to maximal CPU utilization. Realistically, since moving from PHP to Haskell we're seeing about 1k req/s/box, but without the network slowing things down we'd be looking at an order-of-magnitude increase on the current hardware.

Just FYI, I've already handled the obvious, such as intelligent caching, nginx split upstreams, TIME_WAIT and reuse adjustments (1s), et al. Concretely, we're looking for an assurance of <= 100ms TTFB at the 99th percentile, in a way that lets us make the most of our hardware via green threads.

2 comments

Consider the UDP-based QUIC protocol that Chrome now supports.

From http://blog.chromium.org/2015/04/a-quic-update-on-googles-ex...:

"For latency-sensitive services like web search, the largest gains come from zero-round-trip connection establishment. The standard way to do secure web browsing involves communicating over TCP + TLS, which requires 2 to 3 round trips with a server to establish a secure connection before the browser can request the actual web page. QUIC is designed so that if a client has talked to a given server before, it can start sending data without any round trips, which makes web pages load faster. The data shows that 75% of connections can take advantage of QUIC’s zero-round-trip feature. Even on a well-optimized site like Google Search, where connections are often pre-established, we still see a 3% improvement in mean page load time with QUIC."

From https://www.chromium.org/quic:

"Key features of QUIC over existing TCP+TLS+SPDY include

* Dramatically reduced connection establishment time

* Improved congestion control

* Multiplexing without head of line blocking

* Forward error correction

* Connection migration"

https://en.wikipedia.org/wiki/QUIC

Since it sounds like your workload is nicely distributed, why not try it on a few nodes and compare?
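
For the comparison itself, something like the following might be enough as a starting point. It is only a rough sketch: it assumes you can export per-request TTFB samples (in milliseconds) from an untouched group of nodes and from the trial group, and the names `percentile`, `compare_groups`, and `budget_ms` are made up for illustration. The 100ms budget and the 99th percentile come from the target mentioned upthread.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank index (1-based), clamped so pct=0 still returns a sample.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_groups(control_ms, trial_ms, budget_ms=100.0, pct=99):
    """Compare tail TTFB between untouched nodes and trial nodes.

    control_ms / trial_ms are hypothetical lists of TTFB samples in ms.
    """
    control_p = percentile(control_ms, pct)
    trial_p = percentile(trial_ms, pct)
    return {
        "control_p": control_p,
        "trial_p": trial_p,
        "trial_within_budget": trial_p <= budget_ms,
        "trial_not_worse": trial_p <= control_p,
    }


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape.
    control = [50.0] * 98 + [120.0, 130.0]
    trial = [40.0] * 98 + [90.0, 95.0]
    print(compare_groups(control, trial))
```

Running the same comparison repeatedly over several days of sustained load, rather than once after the first evening, would give a better chance of catching problems that only show up late.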
I was actually planning on testing a 33% deployment this evening. What I'm really wondering is: because this is a low-level networking adjustment that modifies locking behavior, are there 'gotchas' I should be aware of beforehand?

I'll test and report back regardless; I'm only afraid of the scenario where the test goes well and the entire network goes down a week later (e.g., a wonderfully fun time we had previously chasing down rogue epoll queues and zombie processes that only occurred after a threshold of sustained load).

As with any new code, it may have bugs. The kernel could deadlock or crash due to insufficient locking.

You may want to start small and report any bugs that you find to help improve the code.

Did you get anything working?