> The problem is that threads just don’t work in practice for massive concurrency.
That's an assumption that is repeated very often recently, and measured very rarely. Truth is that the number of applications for which they don't work is surprisingly low. I'm working at a well known cloud provider, and lots of people would really be surprised which applications at the largest scale are working fine with a thread-per-request model. 50k OS threads are not really an issue on modern server hardware. While it might not be the most efficient [1], it will not perform so badly that it causes an availability impact either.
There are obviously some exceptions to that [2] - but I encourage people to measure instead of making assumptions. Unless one finds themselves in a weekly meeting about server efficiency or scaling cliffs, both models probably work.
[1] it really depends on the workload, but people might find an efficiency degradation (e.g. measured as BYTES_TRANSFERRED/CPU_CORES_USED) of 20% at a concurrency level of 1000, or maybe only at a concurrency level of 10k. Coarse-grained work items (e.g. send a large file to a socket) will show a lower degradation.
[2] Load balancers, CDN services, and e.g. chat applications which maintain a massive amount of mostly idle client connections can be such environments. They have a high amount of concurrency that needs to be managed, but much less "active concurrency". If all clients were active at the same time, those environments would run out of disk IO or network bandwidth far before CPU or memory became an issue.
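For readers who want to measure rather than assume, here is a minimal sketch of the thread-per-request model discussed above, using only the Rust standard library. The function names `handle_connection` and `spawn_echo_server` are made up for illustration, and a real service would add timeouts and a cap on the number of threads.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Echo everything received on one connection back to the peer,
/// using plain blocking I/O on the calling thread.
fn handle_connection(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 4096];
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // peer closed the connection
        }
        stream.write_all(&buf[..n])?;
    }
}

/// Bind to `addr` and serve each accepted connection on its own OS thread.
/// Returns the bound address, so callers can pass port 0 and discover the port.
fn spawn_echo_server(addr: &str) -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind(addr)?;
    let local = listener.local_addr()?;
    thread::spawn(move || {
        for stream in listener.incoming() {
            match stream {
                // One thread per connection: all request state lives on that thread's stack.
                Ok(stream) => {
                    thread::spawn(move || {
                        let _ = handle_connection(stream);
                    });
                }
                Err(_) => continue, // skip connections that failed to be accepted
            }
        }
    });
    Ok(local)
}
```

Calling `spawn_echo_server("127.0.0.1:0")` and pointing a load generator at the returned address gives a baseline to compare against an async version of the same server.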
> While it might not be the most efficient, it will not perform so badly that it causes an availability impact either.
Performance is important, but the biggest performance gain happens when a program goes from not working to working correctly.
Debugging is another area where async makes it intolerably hard to get a backtrace and make sense of what is going on.
It's not like debugging threads is easy, but in a low contention environment which is entirely "1 thread holds state of one request" and there are few interlocking threads in it, threading is a fair bit better than async execution. Plus the logs which indicate thread-names make it possible to draw out something like a post-processed Catapult timing diagram (open chrome://tracing and look at an example, it is a great UI for dropping in your own multi-threaded event log as JSON).
I'm a big fan of executor thread-groups and work queues, but damn does it make it hard to mentally walk through a bug when the stack traces are scattered across multiple places.
> > The problem is that threads just don’t work in practice for massive concurrency.
> That's an assumption that is repeated very often recently, and measured very rarely.
I would go further--there is a whole infrastructure that needs to appear when massive concurrency is involved and very few times is that taken into account.
For those people interested in genuine massive concurrency, I encourage people to investigate Erlang. In my opinion, the language itself is just "meh", but OTP, the infrastructure around managing, upgrading, restarting, etc. processes/threads, is extremely on point.
I really wish the Rust people would pick something like the Erlang bit syntax up and integrate it with their pattern matching (probably necessitating some pattern matching language fixes) rather than the amount of effort they continue to piddle on async/await.
Erlang pattern matching is awesome. Matching on binaries makes it very easy to parse protocols.
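As a point of comparison, the closest thing stable Rust offers today is slice patterns, which work at byte granularity only and have no equivalent of Erlang's bit-level field widths. The sketch below parses a made-up wire format (1 byte version, 2 bytes big-endian payload length, then the payload). The `Frame` type and `parse_frame` function are hypothetical names for illustration.

```rust
/// A parsed frame of a made-up wire format (hypothetical, for illustration):
/// 1 byte version, 2 bytes big-endian payload length, then the payload.
#[derive(Debug, PartialEq)]
struct Frame<'a> {
    version: u8,
    payload: &'a [u8],
}

fn parse_frame(input: &[u8]) -> Option<Frame<'_>> {
    match input {
        // Bind the fixed-size header fields and the remaining bytes in one pattern.
        [version, len_hi, len_lo, rest @ ..] => {
            let len = u16::from_be_bytes([*len_hi, *len_lo]) as usize;
            let payload = rest.get(..len)?; // None if the buffer is truncated
            Some(Frame { version: *version, payload })
        }
        // Fewer than 3 bytes: the header itself is incomplete.
        _ => None,
    }
}
```

Anything narrower than a byte (flags, 4-bit fields) still needs manual shifting and masking, which is the gap the comment above is pointing at.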
Re concurrency. I learned Erlang before Akka. It took me a bit but I find Akka more ergonomic. Akka will easily handle millions of actors on a single machine, too. But I always miss matching on binaries.
Another good one is protoactor for golang. That will also do a million actors no problem. Comes really close to Erlang in terms of how concise the syntax is. But again, no binary matching.
Why would you assume that all software is written for servers in datacenters? Rust tends to be used in embedded devices, WASM, and other weird contexts where there might not be as many resources available.
If you're writing a CRUD app, sure, do it in PHP and spin up a thread per request.
Not necessarily. A lot of embedded projects use realtime operating systems (RTOS). And those make use of preemptive schedulers in order to actually provide realtime guarantees.
There's obviously also some projects which just use a bare-metal loop to do everything - that probably counts as cooperative.
I agree and this article seems pretty misinformed. Creating and managing threads on Linux is extremely cheap, especially when a lot of them are idle, and a lot of big companies (Google, Facebook, Amazon) have tons of huge C++ applications that have thousands of threads and it's fine. I also think a lot of people who don't work on these problems at these kinds of companies assume that it must be incredibly difficult to write code like this and debug it, but that's not really true. For one thing, generally the tricky parts to write are abstracted away so that regular engineers don't have to think much about threading concurrency issues. And when they come up, tsan and lock annotations[1] will catch 99.9% of these problems in testing and make it easy to understand why things are breaking.
In the real world here are the kinds of problems that people at Google etc. care about when it comes to performance or scalability issues with hugely concurrent programs:
- Noisy neighbor problems from other threads messing with your TLB and L1 cache
- High cost of context switches
- Unpredictable scheduling/priority inversion in the scheduler
The first problem isn't actually made any better by using async coroutines or green threads/fibers: if you switch to another coroutine or fiber and it does something naughty (e.g. munmaps memory, which will cause a TLB shootdown), it's going to degrade performance for your unrelated coroutine/fiber.
The second and third problems can be solved in some cases by things like fibers and userspace scheduling, but this is a fairly advanced topic and "just use async" is definitely not the solution. If you're interested in learning more about how these problems are actually solved at Google for example I recommend [2] and [3].
> - Noisy neighbor problems from other threads messing with your TLB and L1 cache
Switching between threads within the same process doesn't require a TLB or L1 cache flush. Not sure if you were implying this, just wanted to point that out.
> - High cost of context switches
Userspace schedulers (like rust's tokio) do make context switching cheaper, however, most of the context switching in the case of a web server is due to blocking I/O and the most expensive part of the switch, entering the kernel, is already accounted for by the I/O request. Kernel context switching is unlikely to be your bottleneck.
> Unpredictable scheduling/priority inversion in the scheduler
This can definitely be an issue at scale, but a general purpose async scheduler like most use is unlikely to be any better.
Anything using the green/lightweight or OS thread model is usually easier to use at the cost of some runtime performance. Whether the runtime performance matters for your use case can only be determined by measuring stuff.
The perception that async rust is where you should start for concurrent rust because it's built in and everyone uses it perhaps should be revisited. I would argue that the other options are worth consideration first and dropping down to low level async code might be warranted when you need the performance it gives and that justifies the increase in development costs.
Rust used to have green threads before 1.0 (libgreen). Early Rust was meant to be more like Erlang[1]. The problem with them wasn't only the overhead, but also interoperability and how they affect every interaction of the language with the OS and other libraries. It made the whole language dependent on its own custom runtime.
Rust isn't meant to be a language for CRUD apps (despite making inroads in this space). It's meant to be a C/C++ alternative that can work every difficult niche where these two can, including processes that already have their own runtimes, kernel space, microcontrollers, and other situations where any overhead or bringing custom threads with magic I/O and special stack handling is unacceptable.
Rust's async is designed to be separate from the core language, and work on top of arbitrary runtimes. Most people use tokio, but it can also work with your custom loop on microcontrollers, or on top of another runtime, e.g. WASM + browser's event loop, or gtk-rs that can work on top of GTK's event loop.
I'm aware of the history there. I think the decision not to ship a builtin async runtime was probably correct. I also think shipping async syntax sugar and allowing people to build their own custom runtimes is just fine.
I just think that the cultural decision in the wider ecosystem to make, practically speaking, everything io related, async is possibly a mistake.
Well, I think it happened because a large number of Rust committers and core devs doubled down on the multi-year Rust async effort. What would the larger ecosystem take away from this?
IMO the message was Async is the future so everyone better hop on this train.
I didn't get that message at all. The length of time it took to add async sugar made sense given what they were trying to do. It was not a statement regarding the suitability of it for every use case, nor should it have been.
The problem with async is that it makes easy things much more complex if you don't need the performance. Granted, it should be easy for library/API designers to provide sync versions of all async calls, but I don't know if this happens.
Too many major packages in the ecosystem only support an async model now. It's pretty frustrating if you are just writing a synchronous program, or one with a straightforward OS threading model.
That will only allow running futures which have no IO dependency. Others typically expect a certain runtime to be running, because they e.g. use the epoll loop of that runtime to make progress.
This "async virality" syndrome is the main reason why async is harmful imho. _Some_ async can be very useful in certain constrained circumstances, I believe. However forcing the async execution model on all code is a terrible idea.
Yes. I've been saying this for some time. I call it "async contamination".
The async model assumes you spend most of your time waiting for your slow users to do something. (Why a web site, which is inherently stateless, should be doing that routinely is another issue.) I'm writing a metaverse client that has about 10-20 threads, many of them compute bound, running at different priorities. Works fine, but is totally different from the async model. Trying to keep async out of the networking has been difficult. I don't use "hyper" any more. I look at builds to see if "tokio" somehow got pulled in.
> Why a web site, which is inherently stateless, should be doing that routinely is another issue.
Because most web sites that would be doing this are not stateless? Any dynamic site will need to access a database, which means that there will be blocking IO, which means that given enough traffic the server will run out of available threads before being able to service the IO operations for all of these users. And because different parts of the website will likely have different DB load, you could easily cause a DoS by hitting an expensive endpoint repeatedly.
Sorry, offtopic, but what do you mean by "metaverse client"? I've seen you mention this in a couple comments now and I'm intrigued. I don't imagine you mean something to do with Facebook, right?
A metaverse client is the program you run on your machine to talk to a metaverse server. There are several clients for Second Life, a client for VRchat, a client for SineSpace, and so forth. There are web-based clients running in a web browser in WebAssembly, such as the one for Decentraland. All of these are 3D graphics programs.
They're halfway between MMO game clients and web browsers. They have to do most of the things a game client does, but they don't have built-in assets or game logic. Rather than a giant download at install (the biggest AAA titles have passed 100GB), all content is coming from the servers as needed, as with a web browser. The client's job is to present a good-looking 3D world while busily downloading content as the user moves round the world. Hopefully before the user gets close enough to see it in detail. So they have the performance problems of a 3D game with the content-handling problems of a web browser.
An existing open source metaverse client is Firestorm, a viewer for Second Life and Open Simulator.[1] Here's the source code.[2] It's mostly single-thread and OpenGL based. I've made some small contributions to that.
I am working on a replacement, in Rust, with more concurrency. About 20-30 threads, not thousands.
Thread priority matters. Top priority is refresh, keeping the frame rate up. Next is servicing the network and user inputs. Then comes content decompression and preprocessing for adding to the scene. Much of this is compute-bound. Rust is a huge help in keeping the concurrency straight. This would be a much harder job in C++.
As the metaverse moves from hype to implementation, this will be a bigger area of activity. Right now, it's a niche.
A great example of this would be in javascript testing frameworks. There must be dozens of frontend test frameworks that shoehorn inherently synchronous, procedural tasks into awkward syntax of sugared promise chains.
I'll use an embedded analogy. I'm not as familiar with concurrency on GPOS, but consider this:
I have an I/O task that might take long, compared to CPU operations:
- Start the task, but don't wait for its result.
- Your program continues as normal
- When the IO task is complete, its hardware sends an interrupt (at a specific priority) to the CPU. The CPU stops what it's doing (assuming there isn't a higher priority task in progress). Here, you can read the now-ready IO data, and do something with it. Or maybe queue another task.
You could also examine the case of DMA. Ie, your peripheral (Maybe your network chip in the case of a desktop PC?) commands an IO task. It runs in the background on your network hardware. You then read from, or write to the buffer that's associated with the DMA transfer as required. (Sometimes using DMA-related interrupts)
Could you apply this model to GPOS networking? Of note, some people are trying to do the opposite: Use Async on embedded, to wrap interrupts and DMA.
I have no idea what GPOS stands for, but the analogy isn't really necessary.
The high level algorithm you describe is basically how async programs work. Glossing over the low level details, you usually implement things in terms of polling. Interrupts and their analogs are far too slow at scale (switching async tasks is in the nanoseconds, these days).
The problem is when there is logic downstream of the task that needs its results and mixed with the results of some synchronous code in between. This is the "function coloring" problem.
Async semantics are designed to insert the logic for handling this (merging of async task results) seamlessly. There are two issues with this, the first is that synchronous code has no way of knowing what to do with asynchronous results (meaningfully), and the second that there has to exist some executor program that handles the merging and scheduling logic.
The thing that makes async "hard" in a language like Rust is that dealing with this problem is extremely difficult when you have no GC, lifetimes, call-by-move, closures that capture by move, and ownership semantics - it makes it verbose to write sound, non-trivial async code. For example, you're forced to introduce the notion of "pinned" data in memory to prevent it from being moved while tasks are switched. Lifetimes become a lot less clear. "Async destructors" don't really exist (what other languages would call finalizers that don't run at the end of lexical scope).
As for the mixing of sync/async code, that's not actually an issue if everything is async. It's trivial to write an executor that makes async calls blocking anyway.
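To make the "trivial executor" claim concrete, here is a minimal sketch of a blocking executor built only from the standard library (it needs Rust 1.68 or newer for `std::pin::pin!`). The names `ThreadWaker` and `block_on` are chosen for illustration. It only drives the future it is given, so futures that rely on a specific runtime's I/O driver or timers will still need that runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread blocked in `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drive `fut` to completion on the current thread, parking between polls.
fn block_on<F: Future>(fut: F) -> F::Output {
    // Pin the future on the stack so it can be polled.
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until a clone of the waker calls `unpark`;
            // a spurious wake-up simply polls again.
            Poll::Pending => thread::park(),
        }
    }
}
```

For example, `block_on(async { 1 + 2 })` returns `3` from ordinary synchronous code.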
I started writing rust ~6mo ago and while I agree with your sentiment, the issue I've run into is that so many packages I need to use, because there isn't an alternative and I don't want to build it myself, already uses async. I then have to either heavily wall off that part of my code or at a certain threshold realize I may have to adopt async myself because keeping two concurrency models going is really a lot of overhead.
Async has really taken over anything networking-related because, well, it offers much better scaling and performance. If you're a package author you're going to get more people asking for async than people that don't want it. There is no sane way to make async optional in a library and reuse code.
Myth. Performance won't be better. Scaling arguably is better, but usually the use-case doesn't require the level of scaling where async is superior to OS threads.
I suspect you might be arguing semantics, but in practice, for certain types of applications, async will in all likelihood offer better performance. Scale and performance are linked: when scaling up, once you start to hit limits, async can make it easier to get more out of your compute than otherwise, which is a performance consideration. Calling his statement a myth ignores the context it was made in.
The point of the parent was that better performance is not guaranteed, and it's totally true.
E.g. go ahead and implement an RPC server which only has to deal with 10 concurrent requests - then measure latencies. The synchronous version might be faster, due to not requiring any epoll calls. The difference might get even bigger if e.g. the server is serving static files, and you are measuring throughput - the synchronous version will likely provide higher performance since no extra context-switch from the async-runtime-of-your-choice to a threadpool-for-file-io thread and back is required.
You are also right in that once one moves beyond a certain scale the async version might offer better performance. But the scale that is required would be different per application, and not every application requires the scale.
You will absolutely get "more" performance out of async. I'm not sure I could call it much more. It's hard to get an exact number because there isn't exactly a whole lot of pairs of "async" vs "greenthreaded" options out there, but I'd guess you're looking at 20%-30% tops. For most people, and even most people writing async code, this is irrelevant. They are never going to write code that absolutely needs that last 20-30% and that alone is the difference between the problem being solved and not solved.
It certainly isn't like you use a green thread model and you unconditionally throw away a 5x performance factor or something.
There are absolutely cases where that does matter. To name just one, a game engine would not want to throw away that level of performance out of the box. (That's the game engine user's job, to "spend" the quality of the game engine on their task.) But I think there's a lot more programmers who have, without analysis, assumed they're in that class and made a lot of decisions based on that, when in fact they are plural orders of magnitude away from it. To pick a number out of thin air, 4 full CPU cores running Rust code that someone has at least glanced at and spent a bit of time optimizing is a loooooot of power.
(The closest current comparison is Rust vs. Go, but Rust works much harder at compile-time optimization and doesn't have GC, and I expect those two things account for the majority of the delta between them, with Go being greenthreaded being non-trivial, but in the clear minority. Stay tuned for Java with Project Loom versus Rust, which has its own rather major differences but will at least be another relevant data point.)
Interesting, but there are other issues. A big one is resource exhaustion attacks. A thread per connection means that someone can trivially exhaust system memory, while async pseudo-threads (tiny bits of state) take up virtually no space.
Edit: also this only tests 500, not 500000.
Also when doing threaded I/O as soon as you want to support bidirectional traffic you will have to implement select/poll/etc. since you can't do a blocking read and a blocking write at the same time on one thread. At that point you're already giving up a lot of the advantages of threads.
> There is no sane way to make async optional in a library and reuse code.
FWIW, there's an effort to do exactly that, but because it will require language level changes and it is just on the drawing board phase, it will likely be a while before it can be widely used.
The "optionality" of `async` while sharing code also applies for `const` and mutability (why do we need `Deref` and `DerefMut`?). Finding a solution that can work for these three (and maybe others?) parts of the language will be a welcome improvement.
Rust async code can be a bit challenging until you get it, but I can't think of a way to make it that much simpler without sacrificing the whole "systems programming language" concept or support for embedded. The only good alternative is Go-like fibers and that requires a fat runtime.
We use both Rust and Go at ZeroTier and find that they both have their own niches. (We are slowly moving ZeroTier from C++ to Rust to use a more modern and more importantly safe language.)
Personally, once I grokked async rust, I found it much easier to use and reason about than threads. Things just seem to map better without any messy stuff to think about.
Yes, async is hard. It adds lots of complexity, both to the code and in your mental model. That slows development. I'd rather have faster development most times. It's why I prefer to use Go over Rust whenever possible. That's why I'm really interested in what lunatic is doing here. It might narrow the gap a little.
> Yes, async is hard. It adds lots of complexity, both to the code and in your mental model. That slows development.
Nodejs devs seem to be doing fine? And I would say their development is faster than most devs working on other stacks. Nodejs is also a top 3 server stack and growing.
Imo, 99% of the time, ergonomics should take precedence over power. Power can always be added later with clever hacks, without ruining an ergonomic interface. But adding ergonomics to power is a much more broken process.
> Power can always be added later with clever hacks, without ruining an ergonomic interface.
This puts limits on what can be accomplished. Starting with a more restricted set of code allowed, and then expanding it over time can be more successful in many cases, without locking you into a perhaps more ergonomic looking interface that needs to be coddled with no tooling support to avoid the "slow path". For examples in Rust: `impl Trait` used not to exist, which meant you had to use `Box<dyn Trait>` instead, which can be slower and certainly adds some verbosity. Then `impl Trait` was added and a bunch of code was now representable, and soon `type Alias = impl Trait;` will be stabilized which will allow even more code to be representable, in a way that is both performant and easier to use. A language that instead says "just use `-> Trait` and the compiler will figure out what to do" would have increased the user's perf without intervention, but anyone who really cares about FFI stability or wants to keep on top of heap allocations would be out in the cold.
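A small sketch of the `Box<dyn Trait>` versus `impl Trait` difference mentioned above. The function names `evens_boxed` and `evens_impl` are made up for illustration.

```rust
/// Dynamic dispatch: the iterator is heap-allocated and called through a vtable.
fn evens_boxed(limit: u32) -> Box<dyn Iterator<Item = u32>> {
    Box::new((0..limit).filter(|n| n % 2 == 0))
}

/// Static dispatch: the concrete (unnameable) iterator type is returned by value,
/// with no heap allocation.
fn evens_impl(limit: u32) -> impl Iterator<Item = u32> {
    (0..limit).filter(|n| n % 2 == 0)
}
```

Both functions yield the same items. The difference is that the boxed version makes the allocation and indirection visible in the signature, which is the kind of explicitness the comment argues for.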
It is the same reason that you can complain about the complexity of the String/&str distinction in Rust[1], but avoiding lingering references to big strings in JS (effectively a memory leak) becomes much harder.
That's a reasonable choice of priorities to have, but it's the opposite of Rust's. Rust prioritizes (1) safety, (2) performance, (3) ergonomics, in that order. There are other languages that put ergonomics before performance but they are generally unsuitable for Rust's niche.
I use Rust for the amazing types, map/filter/reduce, and, even if I never write macros myself, beautiful libraries like serde and clap. I do need to often use async to wait for multiple network requests at once, although I'm not quite comfortable with it.
Some time ago I was comparing go, python and Rust for making some GET requests asynchronously.
At first, I noticed that the go version was actually faster than the Rust one, and then I saw that the `reqwest` docs recommend that, if you're making multiple GET requests, you create a `Client` and reuse it to get better performance[1]. After changing my code, the Rust version was effectively a bit faster (not by much, to be honest, which was a bit disappointing considering go's version was way easier to write, and I say this as a generally rust shill).
reqwest::get() is even worse than not having a connection pool. It will also reload the full content of system certificate stores on each invocation - since it creates a new reqwest client. On some hosts that can take 10-100ms alone.
Always create a client explicitly. And also always add a timeout.
The Go http.Get() function uses a shared global client, so making a request doesn't have high initialization costs, and requests can make use of a shared connection pool.
In heavily IO bound workloads for a compiled language like Rust and Go the bulk of the time will be spent waiting for IO. In that world the optimizations of the compiler for CPU bound operations will fade into the background, so it's not surprising that Go is competitive with Rust for that kind of workload. If your workload is this type and Go is as well supported as Rust, then Go may be a better choice.
Yes - `DefaultClient` in `net/http` is what the various package level methods operate on. This is constitutionally bad as global state that dependencies can mutate at will during init (or any other time), hence go-cleanhttp [1].
> However, if you are doing web apps or any networking stuff, massive concurrency benefits are almost always too important to ignore
My problem is more that even if I don't need massive concurrency (say in a client that only talks to a single server, in a serial manner), I'm still more or less forced into async code because that's what the ecosystem switched to. No matter if you benefit from async or not, not using it is going against the grain and generally makes your life harder, despite threads being much better from a language-ergonomics point of view
As much as I agree, and it's a mess: you can very much use the tokio runtime's block_on function to do as little async as possible. Rust is in general a much nicer language, with lots of good tooling, when you pretend async stuff is blocking like that.
There are still good synchronous alternatives, e.g. tiny_http for serving, and just binding libcurl for requests, but I agree it is becoming harder to avoid async.
Why isn’t imperative event loop programming more widely used? It’s a reasonably common pattern for games networking libraries like Enet, and has the added bonus that you get to design exactly how you lay out the memory of all your in flight work and therefore have it be easily debuggable.
For me async is about ergonomics first of all. When you perform parallel tasks on multiple threads it is hard and ugly (in cross platform Rust at least) to implement any sort of intricate cross communication, as communication between threads is asynchronous by nature. And it is very much impossible to stop a thread externally.
Async rust lets you implement different combinators on async tasks and cancel them effortlessly.
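To illustrate the point about threads not being stoppable from the outside: with only the standard library, the usual workaround is a cooperative flag that the worker itself has to check. This is a minimal sketch under that assumption. `spawn_worker` and `stop_worker` are hypothetical names, and the sleep stands in for real work.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawn a worker that counts iterations until `stop` is set.
/// The worker has to check the flag itself; nothing outside can interrupt it.
fn spawn_worker(stop: Arc<AtomicBool>) -> JoinHandle<u64> {
    thread::spawn(move || {
        let mut iterations = 0u64;
        loop {
            iterations += 1; // stand-in for one unit of work
            if stop.load(Ordering::Relaxed) {
                return iterations;
            }
            thread::sleep(Duration::from_millis(1));
        }
    })
}

/// Ask the worker to stop and wait for it to finish.
fn stop_worker(stop: &AtomicBool, handle: JoinHandle<u64>) -> u64 {
    stop.store(true, Ordering::Relaxed);
    handle.join().expect("worker panicked")
}
```

If the worker is stuck in a blocking call, the flag is not seen until that call returns. An async task, by contrast, is cancelled simply by dropping its future at the next await point.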
As for performance, tokio is not exactly a zero cost abstraction. Just run perf on a tokio program to see how big of overhead it introduces. It has claimed to be zero cost from the start, and since then it has done at least two major performance overhauls, to prove the point. That being said I love tokio and its ecosystem, but it is ergonomics, not speed that I love. That being said async-std was much slower for the networking use case that I had, so overall tokio is as good as it gets.
Well, at least it was slower a while ago, for the networking code I was working on. Could be outperforming for other use cases. As for perf, when you run it you can see that a lot of cpu is spent on work stealing and other bookkeeping. Which is fine, and the single threaded runtime doesn’t have some of that, so I use that a lot.
I've done some beginner Rust and Go programming (read "the books" on both, written small programs) and I'm wondering which one to spend more time on or try to get a job with in the future. When I see discussions like this one about Rust, I start to worry that it's unnecessarily complicated and difficult to work with and that this will only get worse in the future to the point that it won't be a good fit for many of the use cases that it's pitched for. Am I wrong to think this?
If you're trying to break into the industry you're not going to be working on problems where the language really matters. Pick a popular language, learn enough to be dangerous and specialize once you find categories of problems that interest you. Go and rust are only compared a lot because they're sexy buzzwords - they do not target the same problems and they aren't competing languages. Learning both at some point could prove valuable, but personally I'd never recommend go to anyone for anything anyways.
> they do not target the same problems and they aren't competing languages.
Is this really true? All the problems that are solvable in Go should be solvable in Rust too right (but not vice versa because Go is GCed)? They might not compete on every front but there definitely should be overlap in the use cases.
There is a ton of overlap with every general purpose programming language. So it's correct to say you can solve the vast majority of problems with the majority of programming languages, but languages differentiate themselves in many different ways. One of the main differences between rust and go is that go is a garbage collected language. That feature alone typically creates a large divide in what languages are trying to achieve.
I feel like you’re ignoring my point. The GC prevents Go from competing with Rust on some use cases. But that still leaves a lot of overlap of potential use cases and if your work falls in that region of use cases, you get to pick between the two languages (and a bunch of others too).
I feel like I covered the fact that there's a large overlap between most languages, but to clarify more - when a software team chooses a language they aren't choosing a language based on the overlap, they are choosing it based on the language specific features they think will benefit them the most across the problems they most often try to solve. Go and Rust's unique features do not compete with each other like rust/c++ would, or like go/elixir (which I feel are much more comparable in the problems they focus on solving).
Go generally targets small microservices and can be quite limiting (imho) to larger projects because the language semantics are extremely simple and are not geared towards projects with numerous business domains.
Rust on the other hand has extremely powerful generics which enable sophisticated code sharing and composition to enable large projects. Rust also very purposefully targets embedded systems and low level systems programming. You can do these things in go, but it is not something the language is designed to do as a first class priority.
Go is very easy to learn. You can be up and running with Go very quickly and it's fantastic for simple applications. The time investment to be reasonably good at writing Go is low. There's little reason not to learn Go.
Rust is difficult to learn, unless you already have a lot of experience with existing low level languages. Getting complex programs up and running with Rust is cumbersome. But the performance is excellent, you can have a high degree of confidence that your program is rock solid, and there are entire classes of security issues that don't happen in Rust. For the types of applications where Rust does well, it does very well indeed. The time investment to become a decent Rust programmer is high, but this higher barrier to entry can make your programming skills even more valuable since there's less supply to meet the demand.
Rust is hard without a doubt. I'm suspicious it's really only hard because it front loads the effort over, say, c++, but it is hard. Expect maybe a year to be proficient, but there's a point long before that it becomes a delight.
Async is hard again, taking more months to feel proficient. I've again a suspicion that much of the resistance to async is due to people who have done the first effort to feel comfortable in rust and expect async to fit right in, but it doesn't, because it's hard too.
Threads are also hard, but under rust they better map to existing thread models, so pre-existing skills are useful and so someone skilled in threads and rust will be skilled in threads with rust.
For sure, there are missing pieces of the async world like async traits, but they will come.
Does anyone else get the feeling that we (as a field) are missing something basic about concurrency? Like there's a really elegant solution just around the corner, that has the low overhead of async/await without the complexity. Or otherwise put, the ease of goroutines but without GC.
I know it sounds crazy. I recently dove into the area, and was pretty surprised at how many interesting building blocks there are out there. It feels like if we just combine them in the right way, we'll discover something that works a lot better.
Off the top of my head:
Google discovered a way to switch between OS threads without the syscall overhead. All it needs is to solve the memory overhead. [0]
Zig discovered a way to use monomorphization to enable colorless async/await. If someone could figure out how to make it work through polymorphism / virtual dispatch, that would be amazing. [1]
Vale discovered a possible way to make structured concurrency in a memory safe way that's easier than existing methods. [2]
Go [3] and Loom [4] show us that we can move stacks around. Loom is particularly interesting as it shows we can move the stack to its original location, a unique mechanism that could solve some other approaches' problems with pointer invalidation.
Cone is designing a unique blend of actors and async await, to enable simpler architectures. [5]
We're close to solving the problem, I can feel it.
[0] No public docs on it, but TL;DR: we tell the OS the thread is blocked, and manually switch over to it by saving/manipulating registers.
I just want to say there are mountains of research on this, and recent development is exciting, but some of the techniques (like stack switching and moving) are very old. Project Loom is very intriguing because of how it solves the practical problems of introducing old concurrency techniques into existing language implementations that were not designed around them.
A lot of this stuff is intriguing from the implementation side, but where we're really lacking is in the syntax and semantic side to make concurrency "make sense" to programmers. I don't think we're close to solving that problem (for example, call/cc isn't the answer, it's the problem).
imho the issue isn't function coloring, threads, whatever. It's a compiler that defaults to async code in the calling convention and then optimization passes to de-async-ify (remove unnecessary yield points) the code at compile time. The result would be code that looks synchronous but is async where it matters (i/o).
A lot of the symptoms of the sync/async problem are caused by the explicit decoupling of sync/async APIs in source code. If you remove that and force it to be implicit internal to the language implementation, the issue goes away. It would take a lot of work to determine if that was worth it.
Basically as we've now accepted garbage collection to be an acceptable part of language implementation, one day I think we'll accept async executors to be a part of that too. We're halfway there on the impl side (Go, Java through Loom, NodeJS, etc). The other half is removing the explicit syntax for it.
> imho the issue isn't function coloring, threads, whatever. It's a compiler that defaults to async code in the calling convention and then optimization passes to de-async-ify (remove unnecessary yield points) the code at compile time. The result would be code that looks synchronous but is async where it matters (i/o).
Safepoints for garbage collection are somewhat similar, but for preemption one wants to interrupt threads on a timer, rather than before the collector takes over. Despite occurring very frequently (at around 100 _million_ checks per second), the time overhead is only about 2.5% or so, according to a study by Blackburn et al [0]. It appears, I think, that as long as the fast not-interrupting path is fast enough, eliminating safepoints isn't too important.
> imho the issue isn't function coloring, threads, whatever. It's a compiler that defaults to async code in the calling convention and then optimization passes to de-async-ify (remove unnecessary yield points) the code at compile time. The result would be code that looks synchronous but is async where it matters (i/o).
Sounds like Erlang and single assignment languages.
Jokes aside, part of the problem seems to be the computer model and cpu architectures themselves.
We need something that is designed from scratch to run things concurrently.
Concurrency is mostly a higher level abstraction than the ISA; CPUs don't care what the stack pointer is pointing to or what the return address is. Actually implementing concurrency efficiently is a solved problem, both in the trivial (stackless) and more complex (stackful) cases.
And that's sort of my point, concurrency primitives are really easy to define and implement but pretty hard to use by programmers up the stack.
> Does anyone else get the feeling that we (as a field) are missing something basic about concurrency? Like there's a really elegant solution just around the corner, that has the low overhead of async/await without the complexity. Or otherwise put, the ease of goroutines but without GC.
Algebraic Effects promise a return to non-colored functions, as AE can abstract over exceptions, continuations, async and other control-flow mechanisms.
A decade ago the simple thing we were missing about threaded concurrency was Rust's ownership and borrowing model and Send/Sync. Before that, the simple thing was to use early Java, which had a mandatory garbage collector and monitor objects. If you didn't have or use those, then you were subject to memory safety problems. And moving from heap-scanning GC to ownership and borrowing gave a genuine performance advantage.
Now, we want to remove threading from the concurrency story, in the hopes of getting another performance boost. This itself is the problem, because threads were giving us automatic preemption, akin to how GCs were giving us automatic memory safety. Now we have to statically determine a "good time" for the program to yield. I/O yielding is the easy part, and the reason why people are flocking to async; but we also need to support yielding for fairness reasons. Kernels can do this because they have interrupt timers; but there's no lower-overhead equivalent for userspace code that I'm aware of.
The other problems mentioned with async Rust are particular to Rust itself. The language has a policy that heap allocations only ever happen in `std`, because they want to support embedding Rust into applications where heaps don't exist. This means that futures need to be structs. Rust does support structs of indeterminate size, but barely; and there's no support for structs that can grow. Such a thing is likely unsound without a way for the compiler to check growth limits, and the memory is pinned, so we can't grow beyond a preset limit set at the start of the future[0].
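To make the "futures need to be structs" point concrete, here is a minimal sketch using only the standard library. `CountDown`, `NoopWaker` and `block_on` are hypothetical names invented for this illustration, and the busy-polling executor is a toy, not how a real runtime works. The point is that all of the future's state sits in a struct whose size is fixed at compile time.

```rust
use std::future::Future;
use std::mem::size_of;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical future that reports `Pending` a fixed number of times.
/// All of its state lives in this struct, so its size is known at compile time.
struct CountDown {
    remaining: u32,
}

impl Future for CountDown {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining == 0 {
            Poll::Ready("done")
        } else {
            self.remaining -= 1;
            // Ask to be polled again; a real future would register with an I/O source.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing, good enough for a busy-polling toy executor.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls the future in a loop and returns (output, number of polls).
/// Note that it is the executor, not the future, that allocates here.
fn block_on<F: Future>(fut: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}

fn main() {
    let (out, polls) = block_on(CountDown { remaining: 3 });
    println!("{out} after {polls} polls, future is {} bytes", size_of::<CountDown>());
}
```

An `async fn` compiles to a similar state machine, with one variant per suspension point, which is why its size has to be known up front.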
Async infects everything it touches because it's a total pain to write networking library code that's preemption-agnostic. Monad<T> would fix that, but higher-kinded traits aren't a thing in Rust yet and we would need lots of language tooling (akin to `?`) to make this ergonomic to use.
There's also just the possibility that we've been engineering the wrong fix, and we should be trying to get OS threads to be as lightweight as possible rather than trying to move the entire threading system into userspace. There's no particular reason why we need 8MB stacks, other than the fact that compilers don't check stack growth themselves. (Which, BTW, is also a soundness hole in Rust as far as I know.)
[0] Go gets around this with a linked list of stacks, which adds its own overhead.
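On the stack size point: Rust already lets you request a smaller stack per thread through `std::thread::Builder::stack_size`. A rough sketch, where the 64 KiB figure and the helper name `sum_on_small_stacks` are arbitrary assumptions for illustration rather than a recommendation:

```rust
use std::thread;

/// Spawn `n` threads with a small, explicitly requested stack and sum their results.
/// The 64 KiB figure is an arbitrary assumption for illustration.
fn sum_on_small_stacks(n: u64) -> u64 {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            thread::Builder::new()
                .stack_size(64 * 1024) // requested size; the platform may round it up
                .spawn(move || i * 2)
                .expect("failed to spawn thread")
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("thread panicked"))
        .sum()
}

fn main() {
    println!("{}", sum_on_small_stacks(256));
}
```

The catch is the one described above: nothing checks at compile time that the chosen size is enough, so a too-small stack only shows up as an overflow at run time.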
I'm betting on structured concurrency. I think it will be the same sort of revolution for concurrent programming that structured programming was for single-threaded programming.
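For threads, Rust's standard library already has a small version of this idea in `std::thread::scope`: every thread spawned inside the scope is joined before the scope returns, so workers can borrow from the caller's stack. A minimal sketch, where `parallel_sum` and `chunk_size` are hypothetical names for illustration:

```rust
use std::thread;

/// Sum a slice by splitting it into chunks, one scoped thread per chunk.
/// `chunk_size` is a hypothetical tuning parameter and must be non-zero.
fn parallel_sum(data: &[u64], chunk_size: usize) -> u64 {
    thread::scope(|s| {
        // Each worker borrows its chunk directly; no Arc or cloning is needed
        // because the scope guarantees the workers finish before `data` goes away.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("{}", parallel_sum(&data, 7));
}
```

The lifetime of the concurrent work is tied to a lexical block, the same way structured programming ties control flow to blocks.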
This sounds like what I'm looking for to build a set of networking/pentest tools, i.e. being able to spawn an arbitrary number of IO-bound processes without the overhead of OS threads or the contagion and fracturing of async.
There may still be some fracturing here, i.e. in the first example (but not the others, inexplicably?) `lunatic::net` versus `std::net`.
Hi, author here. All examples should have used `lunatic::net`, I fixed it now.
The reason why we provide `lunatic::net` and you can't just use `std::net` is that WASI (system interface for WebAssembly) still doesn't have support for sockets[0]. `lunatic::net::TcpStream` is for now just a drop in replacement for `std::net::TcpStream` and once sockets get standardised you will be able to use the standard library types instead.
Has anyone seen any recent solid benchmarks of thread-per-connection web applications? What is actually the break-point load where their perf starts to regress and async really becomes useful?
What does "stream.write_all(&number_as_bytes).unwrap();" do if the socket buffer is full? Does it block the virtual thread running this function? Or does the stream keep buffering? Or is it sending the message to some other process which is accumulating those messages? What if I don't want this thread to block and instead want it to do something else?
I believe all of these are handled. I just cannot find sufficient documentation to understand the details of how this works.
Same as the synchronous version: it will wait until more data can be written, then write as much as possible with another async .write() call, and repeat until everything has been written. It's the same as:
    let mut offset = 0;
    while offset != number_as_bytes.len() {
        // write() may accept only part of the buffer, so advance by what was written
        let written = stream.write(&number_as_bytes[offset..]).await.unwrap();
        offset += written;
    }
The synchronous version would be the same without the .await, and offers stronger guarantees that either all bytes are written to the socket or the socket errored and is dead. The async version could be cancelled in the middle of the invocation after some segments have already been written.
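For reference, a runnable sketch of the synchronous variant using only `std::io::Write`. `write_fully` is a hypothetical stand-in for what `write_all` does, and `Trickle` is an invented writer that accepts at most 3 bytes per call, playing the role of a socket whose send buffer is nearly full:

```rust
use std::io::{self, Write};

/// Hand-rolled equivalent of `Write::write_all`: keep calling `write` until
/// every byte has been accepted or an error occurs.
fn write_fully<W: Write>(writer: &mut W, buf: &[u8]) -> io::Result<()> {
    let mut offset = 0;
    while offset < buf.len() {
        match writer.write(&buf[offset..]) {
            // A zero-length write means the writer cannot make progress.
            Ok(0) => return Err(io::ErrorKind::WriteZero.into()),
            Ok(written) => offset += written,
            // Interrupted writes are retried, everything else is fatal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

/// Hypothetical writer that accepts at most 3 bytes per call.
struct Trickle {
    received: Vec<u8>,
    calls: usize,
}

impl Write for Trickle {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = buf.len().min(3);
        self.received.extend_from_slice(&buf[..n]);
        self.calls += 1;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut sink = Trickle { received: Vec::new(), calls: 0 };
    write_fully(&mut sink, b"0123456789").unwrap();
    println!("{} bytes in {} calls", sink.received.len(), sink.calls);
}
```

With a real blocking socket, each `write` call would park the calling thread until the kernel can take more bytes; nothing can cancel the loop halfway short of an error.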
Yes, it does matter. It has excellent throughput and latency for certain classes of systems, while others are impossible to build. Rust may not impose this constraint while meeting its goals.
Seems like Rust is better in every way right? I can't help but wonder why is it that Go is so much more popular when it comes to language of choice for networked and multi-threaded applications.
Yes, you have to manually insert yield points. Exactly the same as with every cooperative threading system, including Lunatic and Rust's native async/await.
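A toy illustration of why yield points matter in any cooperative system. This is a hypothetical round-robin scheduler written for this comment, not how Lunatic or Rust's async executors are implemented; `Step`, `Task`, `run` and `counting_task` are invented names. A task keeps the CPU until it returns, so the interleaving depends entirely on where tasks choose to give it up:

```rust
use std::collections::VecDeque;

/// What a task reports each time the scheduler runs it.
enum Step {
    Yield, // "I have more work, but let others run first"
    Done,
}

/// A task is a closure that does one slice of work per call.
type Task = Box<dyn FnMut() -> Step>;

/// Round-robin cooperative scheduler: a task keeps the CPU until it returns.
/// Returns the order in which tasks were run.
fn run(mut queue: VecDeque<(&'static str, Task)>) -> Vec<&'static str> {
    let mut trace = Vec::new();
    while let Some((name, mut task)) = queue.pop_front() {
        trace.push(name);
        if let Step::Yield = task() {
            // The task asked to continue later, so it goes to the back of the line.
            queue.push_back((name, task));
        }
    }
    trace
}

/// Build a task that needs `slices` calls to finish (must be at least 1),
/// yielding between them.
fn counting_task(mut slices: u32) -> Task {
    Box::new(move || {
        slices -= 1;
        if slices == 0 { Step::Done } else { Step::Yield }
    })
}

fn main() {
    let mut queue = VecDeque::new();
    queue.push_back(("a", counting_task(2)));
    queue.push_back(("b", counting_task(3)));
    println!("{:?}", run(queue));
}
```

If a task did all of its slices in one call instead of returning `Step::Yield`, every other task in the queue would simply wait, and the scheduler has no way to intervene.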
Kind of feels like we need user space preemptive threading somehow.
> However, if you are doing web apps or any networking stuff, massive concurrency benefits are almost always too important to ignore
No, you will benefit from parallelism/multithreading. Why only use 1 core? Multitasking as it was once called, or "async" as it is now, is fundamentally _synchronous_ because everything still happens on one core. Just that the order of execution may be a bit wonky, which technically all code already suffers from at the microscopic level with instruction reordering and out of order execution. You almost certainly don't need multitasking unless you are writing an OS for embedded.
Even if you use N cores, you still get a massive benefit from being able to let >N threads wait on IO events simultaneously using concurrency/multitasking/async.
There's only so much an "apache" or "nginx" can do in between IO operations though, right? And there's only so much IO per second a whole system can do. Basically: from disk to memory, maybe run a language interpreter if the site is not static, then from memory to the internet. Maybe if your pages are very dynamic and involve a lot of scripts it could be worthwhile. Do you have any numbers to back up your claim?
I don’t have a reference offhand to hard numbers, but I’ve definitely run webservers which have a significantly higher number of concurrent in-flight requests than number of cores.
Even for a static site, what you’re basically doing is
    page = readFile("foo.txt")
    response.write(page)
That’s no CPU usage at all. ~Zero time spent in process. All the time is spent waiting on the data to be loaded into memory from disk, and then copied from memory out onto the network. If you use concurrency for those two functions, then you can handle ~100s of in-flight requests at the same time.
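A rough way to see this on your own machine, sketched with plain OS threads. `thread::sleep` is an assumption standing in for the disk and network waits, and `fake_request` and `serve_concurrently` are hypothetical names; this is an illustration, not a benchmark:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for a request that spends all of its time waiting on I/O.
/// `thread::sleep` is used here in place of real disk or network waits.
fn fake_request(wait: Duration) {
    thread::sleep(wait);
}

/// Run `n` fake requests, one OS thread each, and return the wall-clock time.
fn serve_concurrently(n: usize, wait: Duration) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| thread::spawn(move || fake_request(wait)))
        .collect();
    for h in handles {
        h.join().expect("request thread panicked");
    }
    start.elapsed()
}

fn main() {
    let elapsed = serve_concurrently(200, Duration::from_millis(50));
    println!("200 requests waited 50 ms each, total {:?}", elapsed);
}
```

Because the waits overlap, the total should land far closer to one wait than to 200 of them, regardless of how many cores the machine has.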