Hacker News new | ask | show | jobs
by abiloe 1333 days ago
> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block.

This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block?

The reason memory blocks is because it needs to page in or out from secondary storage - which makes this statement "the only real difference between main memory blocking and disk blocking is the amount of time they may block." not really true

5 comments

> Direct read/write from memory for all intents and purposes doesn't block.

Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is). But if there isn't another hyperthread ready, the CPU sits idle, wasting resources.

It's not a form of blocking implemented by the OS scheduler, but it's pretty similar conceptually.

> The reason memory blocks is because it needs to page in or out from secondary storage

Nope, that's not what I was referring to (other than in the line mentioning swap).

> Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is).

Cache is a memory. And which cache, by the way? Even L1 cache on modern processors doesn't have 0 latency. And this is a rather poor way of describing hyperthreading - the CPU doesn't really "switch" - the context for the alternate process is already available and the resource stealing can occur for any kind of stall (including cache loads), not just memory. Calling this a "switch" suggesting it is like a context switch is very misleading. It's not similar conceptually.

In any event, by this definition even a mispredicted branch or a divide becomes "blocking" - which sort of tortures any meaningful definition of blocking.

The essential difference is - memory accesses to paged in memory (and branch mispredictions, cache misses) are not something you typically or reasonably trap outside of debugging. mmaps, swaps, disk I/O, network accesses are all something delegated to an OS - and at that point are orders of magnitude more expensive than even most NUMA memory accesses. I sort of see where you're coming from - but I don't think it's a useful point.

None of this seems to contradict my point?

My argument is that disk I/O is more like memory I/O than it is like network I/O, and so for concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues).

> My argument is that disk I/O is more like memory I/O than it is like network I/O

It depends on your network and disk - and yes SSD and "slow" ethernet are the common case, but there is enough variation (say an relatively sluggish embedded eMMC on one end and 100 GbE for the networking case), that there's no point in making some distinction about disk vs network latency - for a general concurrency abstraction they're both slow IO and you might as well have a common abstraction like IOCP or io_uring.

> concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues).

No, case in point, Windows had IOCP for years such that you could use the same common abstraction for network and disk. The fact that the POSIX/UNIX world was far behind the times in getting its shit together doesn't mean much.

And why, fundamentally, can you not use blocking APIs with threads for networking?

> distinction about disk vs network latency

Ah but that's not the distinction I'm making at all. This has nothing to do with the physical latency of the network, which can easily be less than that of a disk. It's about the application-level semantics of your communication.

When you are waiting for a connection to arrive on a listen socket, you have no idea how long you might wait. It might never happen. Someone elsewhere on the network has to take the initiative to connect. Depending on the protocol, the same can be true of waiting for a read or even a write.

> Windows had IOCP

I think Windows is over-engineered in this respect. It doesn't seem to have given Windows any recognized fundamental advantage in running databases.

> And why, fundamentally, can you not use blocking APIs with threads for networking?

Well, you can, especially if you're only talking to a single peer. But when you are talking to multiple parties (especially as part of the same logical task, e.g. talking to multiple back-ends to handle one incoming request) it gets unwieldy to manage a thread for each connection.

I think the important practical distinction is whether or not these stalls imply a context switch. Software that avoids blocking calls is largely trying to minimize context switches, with measurable adverse effects being increasingly common due to improvements in hardware. Stalls that do not imply context switches, such as filling a cache line, are not "blocking" as a matter of practical semantics because there is no context switch that has to be accounted for. Of course, this gets into a gray area with things like hyper-threading that have some of the side-effects of a context switch without an actual context switch.
With the utmost respect, I’ve never heard “blocking” described as “takes some measurable amount of time” (which is how I’m reading your above statement); by that definition, async blocks to a degree too.

You’re throwing traditional blocking/non-blocking distinctions on their ear.

Blocking in this case is referring to the CPU thread sitting idle whilst the operation is performed. This is what it means when your blocked on a network request, blocked on a disk operation, or blocked on a memory request. It's all blocking.

A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening.

EDIT: to maybe make this a little clearer for those who might not be aware the CPU doesn't go fetch something from RAM. It puts in a request to the memory controller (handwaving modern architecture a bit here) then has to wait ~100-1000 CPU cycles before the controller gets back to it with the data. Depending on the circumstances the kernel may let that core sit idle, or it may do a context switch to another thread. The only difference between this process and say a network request is how many CPU cycles before you get your results. In the meantime the thread isn't progressing and is blocked.

> A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening.

Yes, this is the line being discussed, and I guess (historically) I’ve just considered “a cost” without dragging “blocking” into the equation. We know that not accessing memory is cheaper than accessing it, and we can tune (pack our structs, mind thrashing the cache), but calling that blocking is still new to me. I’ll have to consider what it means. And also, does it imply the existence of non-blocking memory (I don’t think DMA is typically in a developers toolkit, but…)?

> And also, does it imply the existence of non-blocking memory

Prefetching instructions, to tell the processor to load before you use it!

The first google hit [0] even calls it non-blocking memory access!

In [1] you can see some of the available prefetching instructions, and in [2] some analysis on how they deal with TLB misses (another extremely expensive way memory access can be blocking short of a page fault).

Another thing not mentioned above is that accessing a page of newly allocated memory often causes a page fault, since allocation is often delayed until use of each page, for overcommitting behavior - same for writing to memory that is copy-on-write from a fork!

[0] https://www.sciencedirect.com/topics/computer-science/prefet....

[1] https://docs.oracle.com/cd/E36784_01/html/E36859/epmpw.html

[2] https://stackoverflow.com/a/52377359/435796

> And also, does it imply the existence of non-blocking memory

Yes actually! If you know your going to need a block of memory before you actually need it, you can put in a request to the memory controller before you need it, then proceed to do some other work and check back in when your ready for the data or when the memory controller signals you it's done. It's just that this kind of thing is usually the scope of compiler optimizations or hyper optimized software like Varnish cache rather than something your average web developer thinks about. It's again conceptually the same as an async network request, but you bother with one while considering the other just "a cost" because of the different timescales.

Is that the same thing as a prefetch?
OK but whether or not this is a proper definition of "blocking" doesn't really change my point.

Alternatively, maybe my point is that disk I/O is not "blocking" either, it just "takes some measurable amount of time". :)

EDIT: What this says: https://news.ycombinator.com/item?id=33302587

Historically I/O or peripherals were generally distinct from different types of memory.

Technically the CPU also "blocks" when executing an instruction, let's say adding two numbers, because obviously the sum isn't available in 0 cycles. One might imagine semantics where you explicitly ask some CPU unit to add some numbers and then wait for the result and having blocking or non-blocking wrappers around it but it sort of becomes moot. Generally async/non-blocking operations are abstractions over I/O peripherals that have an async nature, e.g. you submit something and you get some reply later, or you wait for a network packet to come in etc. Reply is often an interrupt.

I agree the lines have blurred a little with modern CPUs (and arguably were always blurry with some peripherals being memory mapped) but something like waiting on a packet to come in or a disk operation to complete is outside the box that you'd draw around the CPU and its more tightly integrated components.

Why do you say that reads and writes may also block?

Let's define "may block" first, perhaps? What do we mean when we say "network I/O may block"? Usually, this means that the kernel may see your network request and raise you a context switch while it waits for the network response on your behalf. In your last sentence you appear to argue that the reason why the kernel performs a context switch is relevant in determining if an operation "may block", and the GP is arguing that that's a distinction without a difference.

If the definition of "may block" is really just "the kernel may decide to context-switch away from your program", then yes, the GP's assertion that file I/O, memory I/O (mmap) and memory access (swap) are all operations that may block is correct -- the only difference is in degree: from microsecond delays for nvm-backed swap to multi-second delays for network transfers.

Or, of course, I may have misunderstood the GP's train of thought.

>> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block. > > This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block?

Reads and writes from actual, physical, hardware memory might block, depending on how you define "block", in the sense that some reads may miss CPU cache. But once you get to that point, you could argue that every branch might block if the branch misprediction causes a pipeline stall. This is not a useful definition of "block".

The thing is, most programs are almost never low-level enough to be dealing with memory in that sense: they read and write virtual memory. And virtual memory can block for any number of reasons, including some pretty non-obvious ones like. For example:

- the system is under memory pressure and that page is no longer in RAM because it got written to a swap file

- the system is under memory pressure and that page is no longer in RAM because it was a read-only mapping from a file and could be purged

-- e.g. it's part of your executable's code

- this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now

- you're in a VM and the VMM can do whatever it wants

- the page is COW from another process

> This is not a useful definition of "block".

I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".

> this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now

And said allocation could block on all sorts of things you might not expect. Once upon a time I helped debug a problem where memory allocation would block waiting for the XFS filesystem driver to flush dirty inodes to disk. Our system generated lots of dirty inodes, and we were seeing programs randomly hang on allocation for minutes at a time.

> I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".

In addition to the point elsewhere made that you're sort of implicitly denying the magnitude of the differences here - the latency differences are on the order of 1000s.

The other way of separating is if the OS (or some kind of software trap handler more generally) has to get involved. A main memory read to a non-faulting address doesn't involve the OS - ie it doesn't ever block. However faulting reads, calls to "disk" IO, and networking IO (ie just I/O in general) involving the OS/monitor/what have you are all potentially blocking operations.

It does not matter whether the OS is involved. Consider a spinlock; if it is spinning, waiting on the lock to be released, then it is blocking.

What matters is whether control returns to the process before the operation is complete. If the process waits, it is blocking (aka synchronous); if the process does not wait, it is non-blocking (and possibly also asynchronous if it checks later to see if the operation succeeded).

> Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond".

The difference, conservatively, is a factor of 1000.

There are plenty of times in software engineering where scaling 1000x will force you to reconsider your architecture.

Sure, fair enough.

To be clear I do not believe that async disk I/O is never useful, I just think that it's not as useful as people at first imagine when they learn about async I/O.

Yes, it may be 1000 times slower than memory. But there's a fundamental paradigm difference from network events, in that with network events you are waiting for some other entity to take action, with no implicit expectation that they will do so in any particular timeframe. Like, if you're waiting for connections on a listen socket, there's no telling how long you will be waiting.

Disk I/O is fundamentally different in that once you submit an operation, you expect it to complete within a reasonable, finite time period.

Async disk I/O is primarily useful for implementing read-ahead / write-behind scheduling behaviors. While databases tend to be the obvious use case, the OS is often so poor at this that there are large performance improvements even for much simpler use cases that are otherwise disk I/O intensive.
I'm not sure that's the primary use case any more. Fast SSDs require high queue depths to use their full throughput, so async IO is desirable to use any time an application knows it has several IO requests to issue in parallel—one thread per request has too much overhead.
The reason memory access can block is because it can cause the page fault handler to be invoked (https://www.kernel.org/doc/gorman/html/understand/understand...). There are many reasons the page fault handler might cause the process to block. The kernel made need to swap the page in from disk, or it might be a copy-on-write page that was just written to.

To copy the page, the kernel needs to first find a free page frame. If there are no free page frames, the kernel will attempt to reclaim pages that are in use (https://www.kernel.org/doc/gorman/html/understand/understand...). This may cause in process-mapped pages being swapped to disk, but it also may result in disk writeback activity (https://lwn.net/Articles/396561/). In either case, control cannot be returned to the process until there is a free page to map into the process.

It's complicated, memory accesses can really block for relatively long periods of time.

Consider that regular memory access via cache takes around 1 nanosecond.

If the data is not in top-level cache, then we're looking at roughly 10 nanoseconds access latency.

If the data is not in cache at all, we are looking into 50-150 nanoseconds access latency.

If the data is in memory, but that memory is attached to another CPU socket, it's even more latency.

Finally, if the data access is via atomic instruction and there are many other CPUs accessing the same memory location, then the latency can be as high as 3000 nanoseconds.

It's not very hard to find NVMe attached storage that has latencies of tens of microseconds, which is not very far off memory access speeds.

I just want to add to your explanation, that even in the absence of hard paging from disk, you can have soft page faults where the kernel modifies the page table entries or assigns a memory page, or copies a copy on write page, etc.

In addition to the cache misses you mention there's also TLB misses.

Memory is not actually random access, locality matters a lot. SSDs reads, on the other hand, are much closer to random access, but much more expensive.