by IgorPartola 4030 days ago
The first issue mentioned is the disparity between maxconn as a global setting and as a backend setting, and it needs some clarification. The article mentions that once the global maxconn is reached, new connections remain queued in the socket's listen queue. The length of this queue is finite and is set by the backlog parameter of the listen() syscall. It appears that on recent versions of Linux this value defaults to 128, though in the past I have seen it much lower (on the order of 5). This means that if you set maxconn to, say, 256 and don't change the default backlog, you'll get 256 + 128 = 384 connections connected or queued before the next client is refused.
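
To make the arithmetic concrete, here is a minimal Python sketch. The helper names (connections_before_refusal, open_listener) are made up for illustration and are not part of HAProxy. It assumes the simple model from the comment, where capacity is just maxconn plus backlog; the OS treats the backlog as a hint and may cap or adjust it, so real numbers can differ somewhat.

```python
import socket


def connections_before_refusal(maxconn: int, backlog: int) -> int:
    """Simple model: connections the proxy has accepted (up to maxconn)
    plus connections waiting in the kernel listen queue (up to backlog)."""
    return maxconn + backlog


def open_listener(backlog: int = 128) -> socket.socket:
    """Open a TCP listener on an ephemeral local port with the given backlog.

    The backlog argument is what ends up as the listen() syscall's
    backlog parameter; the kernel may cap it at a system-wide limit.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    # maxconn of 256 with the default backlog of 128 from the comment above
    print(connections_before_refusal(256, 128))  # prints 384
```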

To me it makes sense to set the global maxconn fairly high. Holding a connection open does not cost much (typically a few dozen bytes), and this way your service can simply run slower and catch up once the number of requests decreases. Say you set maxconn to 4096 globally and the backend can process only 32 requests at a time. In this case you essentially buffer the client connections in the HAProxy (or whatever you use on the front-end) queue instead of outright refusing them. You still get all the benefits of the backend's maxconn, so the backend doesn't start thrashing, but you accept a lot more connections before your users start seeing "Server refused connection" errors. Of course, if you routinely need to process more than 32 concurrent requests your backend will never catch up, so in that case you want to increase its performance or add more backends.
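
A rough way to see the buffering effect is a toy simulation in Python. This is only a sketch under loud assumptions: time moves in discrete ticks, the backend finishes up to backend_slots requests per tick, and simulate and its parameters are hypothetical names, not HAProxy's actual scheduling.

```python
def simulate(arrivals, backend_slots=32, global_maxconn=4096):
    """Toy model of a front-end queue feeding a limited backend.

    arrivals: new connections arriving in each tick.
    Returns (queue length after each tick, total refused connections).
    """
    queued = 0
    refused = 0
    history = []
    for new in arrivals:
        # Accept new connections only while there is room under the global limit.
        room = global_maxconn - queued
        accepted = min(new, room)
        refused += new - accepted
        queued += accepted
        # The backend completes at most backend_slots requests per tick.
        queued -= min(queued, backend_slots)
        history.append(queued)
    return history, refused


if __name__ == "__main__":
    burst = [100, 100, 10, 10, 10, 10, 10, 10, 10]
    # High global limit: the burst is buffered and drains, nothing is refused.
    print(simulate(burst, global_maxconn=4096))
    # Low global limit: the same burst causes refusals.
    print(simulate(burst, global_maxconn=64))
    # Sustained load above 32 per tick: the queue only grows.
    print(simulate([40] * 5))
```

With the high limit the queue peaks at 136 and drains to 0 with no refusals; with a limit of 64, 104 connections are refused; and under a sustained 40 per tick the queue grows by 8 every tick, which is the "never catch up" case.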

Additionally, HAProxy is, AFAIK, the only HTTP/TCP server that has the option to log when a connection is first established rather than when it is finalized, which makes it a lot easier to debug certain types of problems and to detect Slowloris attacks.

1 comments

Large queues can have problems. Perhaps your queue is effectively 30 seconds deep at times, but your clients time out after around 10 seconds. Now every request in the queue is useless. If your clients retry, you now have a feedback loop generating uselessness. You'd be better off with a short queue and rejecting requests much faster.
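
As a back-of-the-envelope check of the point above, queue depth can be converted to waiting time in Python. The figure of 32 completions per second is an assumption picked for illustration, and both function names are hypothetical.

```python
def queue_wait_seconds(queue_depth: int, completions_per_second: float) -> float:
    """Approximate wait for the last request in a FIFO queue."""
    return queue_depth / completions_per_second


def max_useful_depth(client_timeout_s: float, completions_per_second: float) -> int:
    """Deepest queue position that is still served before the client gives up."""
    return int(client_timeout_s * completions_per_second)


if __name__ == "__main__":
    rate = 32  # assumed backend throughput, requests completed per second
    print(queue_wait_seconds(960, rate))  # 30.0 seconds deep
    print(max_useful_depth(10, rate))     # 320 requests fit in a 10 s timeout
```

Under these assumptions, anything queued beyond position 320 would be answered after the client has already timed out.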
Browsers don't automatically retry AFAIK, and users are not likely to hit refresh after seeing a hard error like connection refused. If you reach a queue that gets to be 30 seconds long, you most likely need more/more powerful backends.