Hacker News
by asd4 987 days ago
"What they mean by IO bound is actually that their system doesn’t use enough work to saturate a single core when written in Rust: if that’s the case, of course write a single threaded system."

Many of the applications I write are like this, a daemon sitting in the background reacting to events. Making them single threaded means I can get rid of all the Arc and Mutex overhead (which is mostly syntactic at that point, but makes debugging and maintenance easier). Being able to do this is one of the things I love about Rust: only pay for what you need.

The article that this one is responding to calls out tokio and other async libraries for making it harder to get back to a simple single threaded architecture. Sure there is some hyperbole but I generally agree with the criticism.

Making everything more complex by default because it's better for high throughput applications seems to be the opposite of Rust's ideals.

5 comments

I’ve written services like this, and I would never have called them IO bound. They’re not throughput-bound at all. They mostly sit idle, then they do work and try to get it done quickly to minimize use of system resources. Unless they sometimes get huge bursts of work and something else cares quite a lot about latency during those bursts, using more than one thread adds complexity and overhead for no gain.
A lot of people on the internet are confused about what "IO bound" means, and use it in this incorrect way.
In an era of 10Gb NICs in every server very few things are really IO bound.
The NIC does not really have a lot to do with being IO bound.

IO bound means you spend most of your time waiting on an IO operation to complete. Usually writes are bound by the hardware (how fast your NIC is, how fast your storage is, ...), while reads are bound partly by the hardware but mostly by the "thing" that sends the data. So it's great you have a 10Gbps NIC, but if your database takes 10ms to run your query, you'll still be sitting for 10ms on your arse to read 1KB of data.
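To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the queries run strictly one after another, takes 1KB as 1000 bytes, and uses a made-up function name; it is not a measurement of anything.

```rust
/// Bytes per second achieved when each query must finish before the next starts.
/// Hypothetical helper; the inputs come from the 10ms / 1KB example above.
fn sequential_throughput_bytes_per_sec(bytes_per_query: u64, latency_ms: u64) -> u64 {
    // Number of back-to-back queries that fit into one second.
    let queries_per_sec = 1000 / latency_ms;
    bytes_per_query * queries_per_sec
}

fn main() {
    let effective = sequential_throughput_bytes_per_sec(1_000, 10);
    // A 10 Gb/s link expressed in bytes per second.
    let link_bytes_per_sec: u64 = 10_000_000_000 / 8;
    let utilisation = effective as f64 / link_bytes_per_sec as f64 * 100.0;
    println!("effective throughput: {} bytes/s", effective);
    println!("share of a 10Gbps link: {:.4}%", utilisation);
}
```

Under those assumptions the link sits almost entirely idle, which is the point: the wait is on the database, not on the NIC.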

In this context, we're talking about things for which the throughput is IO-bound. You're talking about the latency of an individual request.

Throughput being IO-bound is indeed about the hardware, and the truth is that at the high end it's increasingly uncommon for things to be IO-bound, because our NICs and disks continue to improve while our CPU cycles have stagnated.

In purely practical terms the old system interfaces are sufficiently problematic that for any workload with necessarily smaller buffers than tens of kb, most implementations will get stuck being syscall bound first. Spectre really didn’t help here either.
I think this is where we have to really move towards the io_uring/FlexSC approach.
The speed of your NIC doesn't matter when you are waiting for an INSERT on a DB with a bad schema. Heck, your DB could be on localhost and you are not even hitting the NIC card. Still the same.
Although NVMe/SSD drives have changed things a lot, any media creation software is still IO bound in the sense that:

a. you cannot plan to read data from disk on demand, because it will take too long (still!), and it will almost certainly block

b. you cannot plan to write data to disk on demand, because it will take too long (still!) and it will almost certainly block

c. the bandwidth is still a limit on the amount of data that can be handled. It is much higher than it was with spinners, but there is still a limit.

There are plenty of applications that do not run on servers. Lots of IO bound stuff in mobile or desktop apps - waiting for network responses, reading data files on startup, etc.
> In an era of 10Gb NICs in every server very few things are really IO bound.

for my data crunching project, one core processes about 500MB/s = 4Gb/s, and I have 64 cores..
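Spelling that arithmetic out as a quick sketch (decimal units, and assuming throughput scales linearly across cores, which real workloads may not do):

```rust
/// Convert megabytes per second to gigabits per second, using decimal units.
fn mb_per_s_to_gbit_per_s(mb_per_s: f64) -> f64 {
    mb_per_s * 8.0 / 1000.0
}

fn main() {
    let per_core = mb_per_s_to_gbit_per_s(500.0);
    let cores = 64.0;
    // Assumes perfect linear scaling across all cores.
    let total = per_core * cores;
    let nic = 10.0; // a 10 Gb/s NIC
    println!("per core: {} Gb/s", per_core);
    println!("all cores: {} Gb/s", total);
    println!("10Gb NICs needed to keep up: {}", total / nic);
}
```

So with these figures a 10Gb NIC would be the bottleneck well before the CPU is.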

10gb nics and their respective connections are quite expensive. Not many servers have these at all.
As a person with a sysadmin + HPC background having built several clusters recently, this is not true (anymore). 10G NICs are almost as common as Gigabit NICs (both in availability and cost). To give you an idea, we commonly use 10G NICs on all compute nodes, and they connect to a 10G top of the rack switch which connects to services like file servers via 100G connections. The 10G connections are all 10GBase-T simple Ethernet connections. The 100G connections are DACs that are more expensive but not prohibitively so.

What cloud providers give you for VMs is not the norm in the datacenters anymore.

Everything is relative. If you are a cloud provider it’s one thing. I’m speaking from the perspective of the small medium business that rents these physical or virtual servers.
my $700 Mac Mini has a 10gb NIC. 2.5gb and 5gb NICs are very common on modern PC motherboards. Modern servers from Dell and HP are shipping with 25gb or even 100gb NICs.
The cost of 10g is much higher than a single computer. The entire networking stack must be upgraded to 10g. At the very least the Internet device, and possibly the Internet connection as well. It will be cheaper in the cloud than on site.
Well, it depends on what your use case for "10g" is. If all you care about is fast file transfers between your PC and your NAS, you can get a small 5-8 port 10gb switch for under $300 that will easily handle line-rate traffic (at least for large packet sizes)

If you want 10g line-rate bandwidth between hundreds or thousands of servers? Yeah, I used to help build those fabrics at Google. It's not cheap or easy.

10g to the internet is more about aggregate bandwidth for a bunch of clients than throughput to any single client. Except for very specialized use cases you're going to have a hard time pushing anywhere close to 10g over the internet with a single client.

10Gb ethernet is 20+ year old tech and is used these days in applications that don't have high bandwidth demands. 100 Gb (and 40 Gb for mid range) NICs came around 2014. People were building affordable home 40 Gb setups in 2019 or so[1]. But I can believe you that the low-end makes up a lot of the volume in the server market.

[1] https://forums.servethehome.com/index.php?threads/cheap-40gb...

In my experience, 40gb and 100gb are still mostly used for interconnects (switch/switch links, peering connections, etc.). Mostly due to the cost of NICs and optics. 25gb or Nx10gb seems to be the sweet spot for server/ToR uplinks, both for cost, but also because it's non-trivial to push even a 10gb NIC to line rate (which is ultimately what this entire thread is about).

There's some interesting reading in the Maglev paper from Google about the work they did to push 10gb line rate on commodity Linux hardware.

I guess it'll also depend a lot on what size of server you have. You'd pick a different NIC for a 384-vCPU EPYC box running a zillion VMs in an on-prem server room than a small business $500 1u colo rack web server.

The 2016 Maglev paper was an interesting read, but note that the 10G line rate was with tiny packets and without stuff like TCP send offload (because it's a software router that handles each packet on CPU). Generally if you browse around, there isn't an issue with saturating a 100G NIC when using multiple concurrent TCP connections.

Yes exactly. Not everything seeking concurrency is a web server. In an OS, every single system service must concurrently serve IPC requests, but the vast majority of them do so single threaded to reduce overall CPU consumption. Making dozens of services thread per core on a four core device would be a waste of CPU and RAM.
> Not everything seeking concurrency is a web server.

Web servers should be overwhelmingly synchronous.

They are the easiest kind of application to just launch a lot more of. Even on different machines. There are some limits on how many you can run, but they are nowhere near low. (And when you finally reach them, you are much better off rearchitecting your system than squeezing out a marginal improvement with asynchronous code.)

There's a lot to gain from non-blocking IO, so you can serve lots and lots of idle clients. But not much from asynchronous code. Honestly, I feel like the world has gone crazy.

tokio supports a single threaded executor when you really need it, and it's not even hard. It's called a LocalSet in tokio's API:

https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html...

This is true but the rest of the ecosystem is not built for it.

If you try to use axum in this way you'd still need to use Send and Sync all over the place.

I was going to comment on the same quote.

The problem is that one may still want concurrency even when a single thread on a single CPU is enough.

Instead of Arc and Mutex you'd be using Rc and RefCell. Wouldn't it be just as complex and verbose code-wise?

I understand that it is less efficient but in the case you describe wouldn't paying for a few extra atomics be negligible anyway?
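For comparison, a minimal side-by-side sketch of the two styles using only the standard library. The function names are made up for illustration; the shape of the code is nearly identical, and the difference is in the atomics and the lock.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Thread-safe version: atomic reference count plus a lock.
fn bump_shared(counter: &Arc<Mutex<u64>>) -> u64 {
    let mut guard = counter.lock().unwrap();
    *guard += 1;
    *guard
}

// Single-threaded version: plain reference count plus a runtime borrow check.
fn bump_local(counter: &Rc<RefCell<u64>>) -> u64 {
    let mut value = counter.borrow_mut();
    *value += 1;
    *value
}

fn main() {
    let shared = Arc::new(Mutex::new(0));
    let shared_again = Arc::clone(&shared);
    bump_shared(&shared);
    println!("Arc<Mutex>: {}", bump_shared(&shared_again));

    let local = Rc::new(RefCell::new(0));
    let local_again = Rc::clone(&local);
    bump_local(&local);
    println!("Rc<RefCell>: {}", bump_local(&local_again));
}
```

Written this way the two are about equally verbose, which is the concern raised here.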

I've found that practically I'm more likely to simply use Box, Vec, and just regular data on the stack rather than Rc and RefCell when I eschew Arc and Mutex by using a single context. The data modeling is different enough that you generally don't have to share multiple references to the same data in the first place. That's where the real efficiencies come into play.
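A rough sketch of what I mean by a single context, with a made-up Event type and Context struct standing in for a real daemon's state. All the state is owned in one place and handed to handlers as a plain &mut, so there is nothing to reference-count or lock.

```rust
// Hypothetical events for a small background daemon.
enum Event {
    Connected(String),
    Disconnected(String),
}

// All mutable state lives in one owned struct: no Rc, RefCell, Arc or Mutex.
#[derive(Default)]
struct Context {
    clients: Vec<String>,
    events_seen: u64,
}

// Handlers borrow the context mutably for the duration of one event.
fn handle(ctx: &mut Context, event: Event) {
    ctx.events_seen += 1;
    match event {
        Event::Connected(name) => ctx.clients.push(name),
        Event::Disconnected(name) => ctx.clients.retain(|c| c != &name),
    }
}

// Single-threaded loop: process events one at a time, then return the state.
fn run(events: Vec<Event>) -> Context {
    let mut ctx = Context::default();
    for event in events {
        handle(&mut ctx, event);
    }
    ctx
}

fn main() {
    let ctx = run(vec![
        Event::Connected("a".to_string()),
        Event::Connected("b".to_string()),
        Event::Disconnected("a".to_string()),
    ]);
    println!("{} events, clients: {:?}", ctx.events_seen, ctx.clients);
}
```

In a real daemon the Vec of events would be whatever source you block on (a socket, a channel, a timer), but the ownership story stays the same.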