Hacker News new | ask | show | jobs
by nickpsecurity 3949 days ago
+1 on Azul. They've pretty much solved it by improving on past methods combining hardware and software. Go could do the same thing. I keep wondering about putting a dedicated FPGA on the memory bus that does nothing but concurrent GC. Have a mechanism to keep the processor (s) and it from stepping on each others' toes. Might work wonders.
4 comments

Two factors require any GC support to be behind the CPU's MMU:

1) GC itself operates on virtual addresses.

2) If you want concurrent collection you're probably going to need a read barrier, and that will require some GC / MMU interaction.

The Azul Vega had a lot of interesting features to support GC (and other Java constructs), but the most important by far is the HW read barrier.

1. I expect, if we use standard OS, that a modification might be needed where the FPGA will know the real address. The FPGA might be given enough info to figure it out with DMA or it might be something like a custom, syscall that gives FPGA details.

2. Barriers are one method. They're among the most expensive. An IBM prototype used careful lock management (supported by lock registers) and scheduling to avoid barriers. There's actually quite a few ways in the literature to avoid barriers in concurrency. I figure a team would have to leverage them plus asynchronous I/O from CPU to FPGA for optimal performance. I see it as a series of hand-offs to FPGA which, once it has necessary information, acts on those hand-offs with CPU assuming it completed after certain time, seeing a part of memory saying so, or receiving an interrupt.

That's a rough sketch.

The virtual addressing part might be fine now that the GPUs are doing it, and hypervisors support programmable passthrough using the x86 IOMMU (VT-d etc) features.

Though I'm convinced custom hardware is doomed here for the usual custom hardware reasons. Maybe GPUs have gotten good enough at pointer chasing to be usable here?

Custom hardware is actually flourishing in datacenters. My preferred architecture is Cavium Octeon III style of many-core RISC, accelerators for everything, plenty I/O, and hypervisor support. Selling like hotcakes. Adapteva's stuff outperforms CPU's & GPU's at performance-per-watt-per-area with sales to HPC people. There's similarly at least a few custom hardware companies in each segment doing something that's hard or not cost-effective with existing hardware or software.

I agree that the risk is high, though, to the point that one shouldn't depend on it. So, I'd advice selling system w/ services that's profitable which just happens to use such custom hardware. A high-performance, easy-to-manage, easy-to-integrate... already worth buying... platform that also has hardware-supported GC and/or memory safety. The sales of the system & licensing of the software subsidize hardware costs, which are structured to be cheap anyway. Start with FPGA's, then S-ASIC's, then advanced S-ASIC's or finally ASIC's. The NRE stays as low as volume can support.

Relevant example of this model (and evidence for my GC idea) is Azul Systems Vega machines. Those are custom hardware for Java supporting native bytecodes, a bunch of RAM, a pauseless GC, and easy enterprise integration. So, while we're all speculating, they're selling custom hardware w/ pauseless GC's. I'm just trying to work out a different, cheaper design hopefully integrating with Intel/AMD.

http://www.azulsystems.com/products/vega/overview

Note that they support a whole range of hardware, software, and services to diversify income. Any one thing shouldn't sink them, esp unfavorable hardware. That's the model to copy.

> Maybe GPUs have gotten good enough at pointer chasing to be usable here?

They haven't.

I enjoy comments such as this because I enjoy putting a thought, an idea in my head and rolling it around, seeing where it goes and what problems it has. This reply may be a bit rambling as a result.

If the FPGA on the bus is doing concurrent GC, we'd need a way to mark pointers and integers reliably. It means breaking all C code that does pointer manipulation, and breaking custom tagged pointer implementations. Niche languages and applications wouldn't have the buy in, and justifiably so, to set global policy on memory. On a developer machine you could experiment, but on a consumer device you would need to use whatever the OS and the bulk of applications have decided on.

Maybe that's not a bad thing though? The OS would need to provide implementations of malloc and free, and other primitives, that are tied to the hardware. But I suppose that's not unreasonable.

Except when it comes to virtualization. The FPGA isn't fixed function, but on the timescales that modern VMMs provide slices of CPU time to virtual machines, reprogramming an FPGA is an eternity. So we would gain in potential performance but lose in flexibility, substantially.

The biggest problem would be legacy C code. Code that treats "void*" and "long" as interchangeable values. And that includes a lot of hand-written JITs, where coercing values between pointers and word-sized integers is prevalent.

Rambling aside on potential upsides and downsides: The world would probably be a better place if pointers and integers were separated and strongly typed down to the hardware level. It would be a pain, but if the null pointer is a billion dollar mistake, placing all bets on the Princeton architecture may yet be a trillion dollar mistake this century.

This isn't working with just any C or OS code. I'd assume that. It would take a modified or clean-slate runtime whose language can work with it. It might not tolerate virtualization, either. As I was heading for bed, I decided to at least Google to see if anyone did anything with FPGA's and GC's to not leave your head with nothing to think on tonight. ;) Here's what I found:

FPGA-aware garbage collection in Java (2005) https://buytaert.net/files/fpl05-paper.pdf

(Modified Jikes VM to use FPGA coprocessor for collection. Result was good performance with around 2.32% overhead on memory-intensive benchmarks.)

Fine-Grained Parallel Compacting Garbage Collection through Hardware-Supported Synchronization (2010) http://www.ikr.uni-stuttgart.de/Content/Publications/Archive...

Stall-free, real-time collector for FPGA's (2012) http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

(This is on one chip with tiny resources for GC around 1% and 4-17% overhead.)

Maybe I can get something better for you after some rest. Note that my comment to the other person has more details of where I was going with this.

https://news.ycombinator.com/item?id=10150480

Azul have a similar concept with their Java co-processors. I think they originally went the co-processor route because up until recently (i.e. last few years) the CPU instructions they take advantage of didn't exist on commodity chips.

http://www.azulsystems.com/products/vega/overview

I'm aware of it. My main recommendation, partly to eliminate x86 risks, was to switch to better, multi-core RISC processors while adding onboard garbage collection to them. One of best was a Scheme machine that had it in the hardware memory subsystem transparent to the CPU or application. I wanted to see about building something like that with Java compatibility and accidentally found the awesome Vega systems in process.

The advantage of implementing it in raw hardware are many: easy to build into a state machines with high clock-rate (even HLS should work); won't waste CPU time or cache; can be setup to analyse segments of memory in parallel because it's hardware; latency with right algorithm can be lower. That combined with multicore RISC and some concurrency extensions I know of would give plenty performance while knocking out tons of errors. Can't exactly do that with Intel/AMD chips unless we buy their semi-custom designs for untold millions, now can we?

So, next idea (for Intel/AMD) customers is to put it on the memory bus with some synchronization mechanism between CPU task and FPGA. This will use far less CPU time than a CPU that's doing GC stuff non-stop. That's on top of above benefits. Plus, if a workload doesn't need low-latency GC, it has a whole FPGA to use for acceleration. :)

Why would an FPGA be a better solution than software running on another core?
.. while keeping in mind that for some workloads GC is a large chunk of the app's work. For example, 10 Xeon cores worth of GC throughput (out of say 24) would be a pretty tall order for a FPGA, and as a fixed resource it easily becomes an Amdahl's law bottleneck.

It would be a cool thing to try still, and maybe doable with COTS hw: https://www-ssl.intel.com/content/www/us/en/embedded/technol... + http://www.hotchips.org/wp-content/uploads/hc_archives/hc21/...

It's a tall order because you set it up to be. Real system design would call for a balancing act, as usual. Remember that you can put a bunch of GC's on one FPGA that all run concurrently with access to shitloads of I/O and/or fast memory bus. Amdahl's law shouldn't kick in any more than with concurrent GC's in general. The parallelism, simplicity, and tech like in your link should make it faster than an on-board collector. The concept isn't speculation as it's already been done in two different ways:

Fine-Grained Parallel Compacting Garbage Collection through Hardware-Supported Synchronization (2010) http://www.ikr.uni-stuttgart.de/Content/Publications/Archive...

Stall-free, real-time collector for FPGA's (2012) http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

The question is, "Can modern CPU's and off-chip FPGA's keep in sync without performance getting dragged down?" The FPGA's have gotten faster. The CPU's I/O have gotten faster. So, I'm sure it can be done but it might be difficult enough to be someone's Master thesis. ;)

Besides, I call for replacing current chips with open ones easy to modify for acceleration and security. Gaisler LEON4 SPARC, Rocket RISC-V, Cambrige's BERI/CHERI MIPS64... these all come to mind. Plan was to put them onto a high-end FPGA w/ concurrent GC's to test the scheme. Once it worked, ASIC conversion time baby. S-ASIC's are $200-500k on average with resulting production & packaging being way cheaper after that. Just hoping there's a few companies that would split the cost to eliminate most memory and control flow issues forever. ;)

I'm guessing since it's sitting on the memory bus it could intercept pointer modifications and synchronously update it's graph?
Memory bus is just for speed. You don't want it doing stuff like that lol. See these two LISP machines for where my inspiration of putting it on memory bus came from:

http://diyhpl.us/~bryan/papers2/paperbot/Design%20of%20a%20L...

(See section 7 for a radical... err realy old... way to do concurrent GC. Full paper available at ACM/IEEE or if you Google LISP processors Guy Steele enough.)

Scheme machine by Burger 1995

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=569...

(See the main graphic and specifics in storage section later. Once again, GC-like stuff is handled in memory management part of the processor. This processor knows that, though, to assist GC a bit. Also different in that it was specified and then synthesized to heterogenous hardware with DDD "correct-by-construction" toolkit.)

So, have fun with those. Plus, Google hardware-assisted or hardware garbage collection to get lots of interesting results already done.

See my reply to anarcticpuffin.