Hacker News new | ask | show | jobs
by kentonv 1333 days ago
> Direct read/write from memory for all intents and purposes doesn't block.

Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is). But if there isn't another hyperthread ready, the CPU sits idle, wasting resources.

It's not a form of blocking implemented by the OS scheduler, but it's pretty similar conceptually.

> The reason memory blocks is because it needs to page in or out from secondary storage

Nope, that's not what I was referring to (other than in the line mentioning swap).

3 comments

> Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is).

Cache is a memory. And which cache, by the way? Even L1 cache on modern processors doesn't have 0 latency. And this is a rather poor way of describing hyperthreading - the CPU doesn't really "switch" - the context for the alternate process is already available and the resource stealing can occur for any kind of stall (including cache loads), not just memory. Calling this a "switch" suggesting it is like a context switch is very misleading. It's not similar conceptually.

In any event, by this definition even a mispredicted branch or a divide becomes "blocking" - which sort of tortures any meaningful definition of blocking.

The essential difference is - memory accesses to paged in memory (and branch mispredictions, cache misses) are not something you typically or reasonably trap outside of debugging. mmaps, swaps, disk I/O, network accesses are all something delegated to an OS - and at that point are orders of magnitude more expensive than even most NUMA memory accesses. I sort of see where you're coming from - but I don't think it's a useful point.

None of this seems to contradict my point?

My argument is that disk I/O is more like memory I/O than it is like network I/O, and so for concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues).

> My argument is that disk I/O is more like memory I/O than it is like network I/O

It depends on your network and disk - and yes SSD and "slow" ethernet are the common case, but there is enough variation (say an relatively sluggish embedded eMMC on one end and 100 GbE for the networking case), that there's no point in making some distinction about disk vs network latency - for a general concurrency abstraction they're both slow IO and you might as well have a common abstraction like IOCP or io_uring.

> concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues).

No, case in point, Windows had IOCP for years such that you could use the same common abstraction for network and disk. The fact that the POSIX/UNIX world was far behind the times in getting its shit together doesn't mean much.

And why, fundamentally, can you not use blocking APIs with threads for networking?

> distinction about disk vs network latency

Ah but that's not the distinction I'm making at all. This has nothing to do with the physical latency of the network, which can easily be less than that of a disk. It's about the application-level semantics of your communication.

When you are waiting for a connection to arrive on a listen socket, you have no idea how long you might wait. It might never happen. Someone elsewhere on the network has to take the initiative to connect. Depending on the protocol, the same can be true of waiting for a read or even a write.

> Windows had IOCP

I think Windows is over-engineered in this respect. It doesn't seem to have given Windows any recognized fundamental advantage in running databases.

> And why, fundamentally, can you not use blocking APIs with threads for networking?

Well, you can, especially if you're only talking to a single peer. But when you are talking to multiple parties (especially as part of the same logical task, e.g. talking to multiple back-ends to handle one incoming request) it gets unwieldy to manage a thread for each connection.

I think the important practical distinction is whether or not these stalls imply a context switch. Software that avoids blocking calls is largely trying to minimize context switches, with measurable adverse effects being increasingly common due to improvements in hardware. Stalls that do not imply context switches, such as filling a cache line, are not "blocking" as a matter of practical semantics because there is no context switch that has to be accounted for. Of course, this gets into a gray area with things like hyper-threading that have some of the side-effects of a context switch without an actual context switch.
With the utmost respect, I’ve never heard “blocking” described as “takes some measurable amount of time” (which is how I’m reading your above statement); by that definition, async blocks to a degree too.

You’re throwing traditional blocking/non-blocking distinctions on their ear.

Blocking in this case is referring to the CPU thread sitting idle whilst the operation is performed. This is what it means when your blocked on a network request, blocked on a disk operation, or blocked on a memory request. It's all blocking.

A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening.

EDIT: to maybe make this a little clearer for those who might not be aware the CPU doesn't go fetch something from RAM. It puts in a request to the memory controller (handwaving modern architecture a bit here) then has to wait ~100-1000 CPU cycles before the controller gets back to it with the data. Depending on the circumstances the kernel may let that core sit idle, or it may do a context switch to another thread. The only difference between this process and say a network request is how many CPU cycles before you get your results. In the meantime the thread isn't progressing and is blocked.

> A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening.

Yes, this is the line being discussed, and I guess (historically) I’ve just considered “a cost” without dragging “blocking” into the equation. We know that not accessing memory is cheaper than accessing it, and we can tune (pack our structs, mind thrashing the cache), but calling that blocking is still new to me. I’ll have to consider what it means. And also, does it imply the existence of non-blocking memory (I don’t think DMA is typically in a developers toolkit, but…)?

> And also, does it imply the existence of non-blocking memory

Prefetching instructions, to tell the processor to load before you use it!

The first google hit [0] even calls it non-blocking memory access!

In [1] you can see some of the available prefetching instructions, and in [2] some analysis on how they deal with TLB misses (another extremely expensive way memory access can be blocking short of a page fault).

Another thing not mentioned above is that accessing a page of newly allocated memory often causes a page fault, since allocation is often delayed until use of each page, for overcommitting behavior - same for writing to memory that is copy-on-write from a fork!

[0] https://www.sciencedirect.com/topics/computer-science/prefet....

[1] https://docs.oracle.com/cd/E36784_01/html/E36859/epmpw.html

[2] https://stackoverflow.com/a/52377359/435796

> And also, does it imply the existence of non-blocking memory

Yes actually! If you know your going to need a block of memory before you actually need it, you can put in a request to the memory controller before you need it, then proceed to do some other work and check back in when your ready for the data or when the memory controller signals you it's done. It's just that this kind of thing is usually the scope of compiler optimizations or hyper optimized software like Varnish cache rather than something your average web developer thinks about. It's again conceptually the same as an async network request, but you bother with one while considering the other just "a cost" because of the different timescales.

Is that the same thing as a prefetch?
Yep!
OK but whether or not this is a proper definition of "blocking" doesn't really change my point.

Alternatively, maybe my point is that disk I/O is not "blocking" either, it just "takes some measurable amount of time". :)

EDIT: What this says: https://news.ycombinator.com/item?id=33302587

Historically I/O or peripherals were generally distinct from different types of memory.

Technically the CPU also "blocks" when executing an instruction, let's say adding two numbers, because obviously the sum isn't available in 0 cycles. One might imagine semantics where you explicitly ask some CPU unit to add some numbers and then wait for the result and having blocking or non-blocking wrappers around it but it sort of becomes moot. Generally async/non-blocking operations are abstractions over I/O peripherals that have an async nature, e.g. you submit something and you get some reply later, or you wait for a network packet to come in etc. Reply is often an interrupt.

I agree the lines have blurred a little with modern CPUs (and arguably were always blurry with some peripherals being memory mapped) but something like waiting on a packet to come in or a disk operation to complete is outside the box that you'd draw around the CPU and its more tightly integrated components.