Hacker News new | ask | show | jobs
by jim90 1093 days ago
There is a cool trick using virtual memory allowing fast and gapless ring buffers.

See this page (about half way down) https://ruby0x1.github.io/machinery_blog_archive/post/virtua...

4 comments

It is a cool trick and even without virtual memory if the maximum read size has a low upper bound just copying a the first few bytes from the beginning to the end can be worth it to avoid the complexity and overhead of indexing into the ring buffer modulo its size.

This clever ring buffer has hidden costs that don't show up in microbenchmarks testing only the push() and pop() operations on an initialized ring buffer. These can be very high and impact other software as well.

The initialisation requires multiple system calls to setup the aliased memory mapping and it has to be a multiple of the page size. This can be very expensive on big systems e.g. shooting down the TLBs on all other CPUs with interprocessor interrupts to invalidate any potentially conflicting mappings. On other end small systems often using variations of virtual indexed data caches and may be forced to set the aliased ring buffer pages as uncacheable unless you can get the cache coloring correct. And even for very small ring buffers you also use at least two TLB entries. A "dumber" ring buffer implementation can share part of a large page with other data massively reducing the TLB pressure.

Abusing the MMU to implement an otherwise missing ring addressing mode is tempting and can be worth it for some use-cases on several common platforms, but be careful and measure the impact on the whole system.

It sprang from old mainframe implementations - there's a 2002 (??) Virtual Ring Buffer in Phil Howard's LibH collection (2007 snapshots of 2002 copyright code based on circa 1980 implementations IIRC):

* https://web.archive.org/web/20140625200016/http://libh.slash...

VRB:- These functions implement a basic virtual ring buffer system which allows the caller to put data in, and take data out, of a ring buffer, and always access the data directly in the buffer contiguously, without the caller or the functions doing any data copying to accomplish it.

* The "Hero" code

http://libh.slashusr.org/source/vrb/src/lib/h/vrb_init.c

* Front & back end source directories:

https://web.archive.org/web/20100706121709/http://libh.slash...

If anyone is considering to use this, I think the code has a couple of concurrency bugs but I do not know if this was ever intended to be used in a multithreaded setting. vrb_get() contains the following code.

    //-- Limit request to available data.
    if ( arg_size > vrb_data_len( arg_vrb ) ) {
       arg_size = vrb_data_len( arg_vrb );
    }

    //-- If nothing to get, then just return now.
    if ( arg_size == 0 ) return 0;

    //-- Copy data to caller space.
    memcpy( arg_data, arg_vrb->first_ptr, arg_size );
If at the time of the check for the amount of available data there is less data available than requested but additional data gets added to the buffer before arg_size gets updated, then this might get more data than requested and overflow the target buffer. At least vrb_read() and vrb_write() have the same bug.
Sadly I can't ask the author anymore, I suspect this was a least feature possible stripped down port of a more battle tested cross platform library that dated to before the times of modern multi multi multi core chips.

Concurrency issues existed in the days of yore .. but they arose in different ways with different timings.

It's an odd bit of code archeology recalling a concept ( mmap ring buffers ) from decades past and then hunting to find the best remaining example - much of LibH was 'clean' rewrites of code from the authors past work.

Are the memory semantics programmer-friendly on all (or most) platforms when using aliasing mappings like this? Eg if two CPUs concurrnetly update the memory through different VM address aliases, is the observed semantics identical vs accessing through only one address or not?
I think POSIX requires it to work. On all sane architectures that use physically indexed caches (or VIPT) that's easy. On other architectures the OS has to bend over backward to preserve the fiction.
VIPT caches only allow this if all virtual aliases for the physical memory backing them map to the same cache set. Let's look at an example: Writing to the first cache line in both aliases of such a double mapped ring buffer. The data cache indexes the cache by the virtual addresses while the MMU translates them. If the virtual addresses have the correct cache coloring to always hit the same set of cache to compare the physical address to the tag it works, but if virtual addresses don't index to the same set of cache lines you get data corruption because there will be two dirty cache lines with conflicting data for the same physical address waiting to be written back in an unknowable order. Good luck debugging this should your system allow you to create such a memory mapping. At least Linux and FreeBSD used to allow you to setup this time bomb on ARMv5 and you have only two bad options to choose from as kernel: break software relying on this or make it unbearable slow by making aliased memory uncacheable.
Nobody (not even IBM AFAIK) uses coloring anymore so this is moot. ARMv5 isn't even supported by Linux anymore.

All modern caches PIPT, VIPT, or even VIVT must work with this scheme as it's semantically transparent to virtual memory. The performance of line-crossing accesses is a totally different issue and I would never used this "trick".

You mean virtually indexed caches. VIPT is a transparent improvement on physically indexed caches, which is used by most architectures today.

Otherwise, yes, that's what I mean by the OS bending over backward.

Linus has choice words for such architectures.

AKA the Magic Ring Buffer.
Always worth linking to ryg / Fabian Giesen: https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff...
Incidentally the non-portable remap_file_page(2) was great for doing this sort of page manipulations, but it has been deprecated and this days it is equivalent of doing the separate mmaps.