| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by 10000truths 1512 days ago

A bit of a digression, but I’d love to see how much further one could go with a memory-optimized userland TCP stack, and storing the send and receive buffers on disk.

A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.

A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.

2 comments

mike_hearn 1512 days ago

Presumably at 100M simultaneous connections the machine CPU would be saturated with setting up and closing them, without getting much actual work done. TCP connections seem too fragile to make it worth trying to keep them open for really long periods.

It's interesting to think about though, I agree. What are the next scaling bottlenecks now (for JVM compatible languages) threading is nearly solved?

There are some obvious ones. Others in the thread have pointed out network bandwidth. Some use cases don't need much bandwidth but do need intense routability of data between connections, like chat apps, and it seems ideal for those. Still, you're going to face other problems:

1. If that process is restarted for any reason that's a lot of clients that get disrupted. JVMs are quite good at hot-reloading code on the fly, so it's not inherently the case that this is problematic because you could make restarts very rare. But it's still a problem.

2. Your CPU may be sufficient for the steady state but on restart the clients will all try to reconnect at once. Adding jitter doesn't really solve the issue, as users will still have to wait. Handling 5M connections is great unless it takes a long time to reach that level of connectivity and you are depending on it.

3. TCP is rarely used alone now, it usually comes with SSL. Doing SSL handshakes is more expensive than setting up a TCP connection (probably!). Do you need to use something like QUIC instead? Or can you offload that to the NIC making this a non-issue? I don't know. BTW the Java SSL stack is written in Java itself so it's fully Loom compatible.

link

toast0 1512 days ago

You're totally spot on that connection establishment is much more challenging than steady state; with TLS or just TCP.

I don't think QUIC helps with that at all. Afaik, QUIC is all userland, so you'd skip kernel processing, but that doesn't really make establishment cheaper. And TCP+TLS establishes the connection before doing crypto, so that saves effort on spoofing (otoh, it increases the round trips, so pick your tradeoffs).

One nice thing about TCP though is it's trivial to determine if packets are establishing or connected; you can easily drop incoming SYNs when CPU is saturated to put back pressure on clients. That will work enough when crypto setup is the issue as well. Operating systems will essentially do this for you if you get behind on accepting on your listen sockets. (Edit) syncookies help somewhat if your system gets overwelmed and can't keep state for all of them half-established connections, although not without tradeoffs.

In the before times, accelerator cards for TLS handshakes were common (or at least available), but I think current NIC acceleration is mainly the bulk ciphering which IMHO is more useful for sending files than sending small data that I'd expect in a large connection count machine. With file sending, having the CPU do bulk ciphers is a RAM bottleneck: the CPU needs to read the data, cipher it, and write to RAM then tell the NIC to send it; if the NIC can do the bulk cipher that's a read and write omitted. If it's chat data, the CPU probably was already processing it, so a few cycles with AES instructions to cipher it before sending it to send buffers is not very expensive.

link

Matthias247 1511 days ago

QUIC will help with some things, and make others worse. With QUIC you don't need a file descriptor per connection anymore. A single file descriptor for one UDP socket will be sufficient to handle an arbitrary amount of connections (although you might want more to actually exploit concurrency). That fact will help limiting resources that the kernel uses. However the state that needs to be tracked per established connection is likely way larger than for TCP, due to being a more complex and featureful protocol. E.g. QUIC needs state for tracking sub-streams on a connection, while TCP does not. And of course there's all the mandatory crypto state. I am fairly familiar with QUIC implementations, and made a multitude of changes in various libraries (e.g. Quinn and s2n-quic). I wouldn't be surprised if the baseline memory usage of a QUIC connection in most libraries is > 10x of what the Linux TCP stack requires for tracking a connection

link

adra 1512 days ago

I'm pretty sure the exercise was to show the absolute extremes that could be achieved in a toy application and possibly how easy one could achieve some level of IO blocking scaling that has been harder than most other tasks in java of late. More and more, heap allocations are cheaper, often with sub-milli collector locks, CPU scaling has more to do with what you're doing instead of the platform, but java have enough tools to make your application fast.

For extremely IO wait bound workloads though, there was always a LOT if hoops to jump through to make performance strong since OS threads always have a notable stack memory footprint that just doesn't scale well when you could have thousands of OS threads waiting around just taking up RAM.

link

natdempk 1511 days ago

It depends on what you do, but I think GC/memory pressure can become an issue rather quickly with the default programming models Java leads you towards. I end up seeing this a lot in somewhat high throughput services/workers I own where fetching a lot of data to handle requests and discarding it afterwards leads to a lot of GC time. Curious if anyone has any sage advice on this front.

link

charcircuit 1512 days ago

I think you meant to say TLS. Not SSL.

link

toast0 1512 days ago

It's easy to just get 4TB of ram if that's what you need; I haven't scoped out what you can shove into a cheap off the shelf server these days, but I'd guess around 16TB before you need to get fancy servers (Edit: maybe 8TB is more realistic after looking at SuperMicro's 'Ultra' servers). I think you'd need a very specialized applicatjon for 100M connections per server to make sense, but if you've got one, that sounds like a fun challenge; my email is in my profile.

Moving 100M connections for maintenance will be a giant pain though. You would want to spend a good amount of time on a test suite so you can have confidence in the new deploys when you make them. Also, the client side of testing will probably be harder to scale than the server side... but you can do things like run 1000 test clients with 100k outgoing connections each to help with that.

link