by ZuLuuuuuu 726 days ago
As somebody who is not familiar with how garbage collection is implemented at the low level, can somebody explain why WasmGC is needed on top of Wasm?

For example, isn't CPython a C program and hence can just be compiled to Wasm, including its garbage collection part? Does garbage collection usually depend on OS specific calls, which are not part of C standard?

5 comments

WasmGC allows you to reuse the native V8 garbage collector instead of having to bundle and run a virtualized garbage collector.
Is Wasm performance that far off of native that the difference between bundled GC and native GC is noticeable?
There are several problems with bringing your own GC. Some that come to mind:

* Significantly increased binary size

* No easy way to trace heap objects shared amongst many modules

* Efficient GC needs parallelism, currently limited in Wasm

For a more thorough explanation, see https://wingolog.org/archives/2023/03/20/a-world-to-win-weba...

Performance is less of a concern than binary size. Without WasmGC, you need to ship a compiled garbage collector to the user along with every WASM module written in a GC'd language. That's a lot of wasted bandwidth to duplicate built-in functionality, so avoiding it is a big win! And performance will always be a bit better with a native GC, plus you can share its scheduling with the rest of the browser process.
I think it’s also that the V8 garbage collector already sets such a stupidly high bar for optimizations that shipping your own, even setting aside Wasm performance entirely, would be a step backwards for most languages running on the web.
To quote the summary of the WasmGC feature from Chrome:

> Managed languages do not fit the model of "linear" Wasm with memories residing in one ArrayBuffer. In order to enable such functionality, they need to ship their own runtime which has several drawbacks: (1) it substantially increases binary size of these modules and (2) it's unable to properly deal with links crossing the boundary between the module, other modules and the embedder that all implement their own GC who is not aware of the others.

> WasmGC aims at providing a managed heap that Wasm modules can use to store their own data models while all garbage collection is handled by the embedder.

It's a big help in interop if everyone uses the same GC. Otherwise it becomes a huge headache to do memory management across every module boundary, with different custom strategies in each.
Definitely! Vanessa Freudenberg's SqueakJS Smalltalk VM written in JavaScript took a hybrid approach of using the JavaScript GC instead of the pure Smalltalk GC. WasmGC should make it easier to implement Smalltalk and other VMs in WebAssembly, without resorting to such tricky hybrid garbage collection schemes.

https://news.ycombinator.com/item?id=29019992

One thing that's amazing about SqueakJS (and one reason this VM inside another VM runs so fast) is the way Vanessa Freudenberg elegantly and efficiently created a hybrid Smalltalk garbage collector that works with the JavaScript garbage collector.

SqueakJS: A Modern and Practical Smalltalk That Runs in Any Browser

http://www.freudenbergs.de/bert/publications/Freudenberg-201...

>The fact that SqueakJS represents Squeak objects as plain JavaScript objects and integrates with the JavaScript garbage collection (GC) allows existing JavaScript code to interact with Squeak objects. This has proven useful during development as we could re-use existing JavaScript tools to inspect and manipulate Squeak objects as they appear in the VM. This means that SqueakJS is not only a “Squeak in the browser”, but also that it provides practical support for using Smalltalk in a JavaScript environment.

>[...] a hybrid garbage collection scheme to allow Squeak object enumeration without a dedicated object table, while delegating as much work as possible to the JavaScript GC, [...]

>2.3 Cleaning up Garbage

>Many core functions in Squeak depend on the ability to enumerate objects of a specific class using the firstInstance and nextInstance primitive methods. In Squeak, this is easily implemented since all objects are contiguous in memory, so one can simply scan from the beginning and return the next available instance. This is not possible in a hosted implementation where the host does not provide enumeration, as is the case for Java and JavaScript. Potato used a weak-key object table to keep track of objects to enumerate them. Other implementations, like the R/SqueakVM, use the host garbage collector to trigger a full GC and yield all objects of a certain type. These are then temporarily kept in a list for enumeration. In JavaScript, neither weak references, nor access to the GC is generally available, so neither option was possible for SqueakJS. Instead, we designed a hybrid GC scheme that provides enumeration while not requiring weak pointer support, and still retaining the benefit of the native host GC.

>SqueakJS manages objects in an old and new space, akin to a semi-space GC. When an image is loaded, all objects are created in the old space. Because an image is just a snapshot of the object memory when it was saved, all objects are consecutive in the image. When we convert them into JavaScript objects, we create a linked list of all objects. This means, that as long as an object is in the SqueakJS old-space, it cannot be garbage collected by the JavaScript VM. New objects are created in a virtual new space. However, this space does not really exist for the SqueakJS VM, because it simply consists of Squeak objects that are not part of the old-space linked list. New objects that are dereferenced are simply collected by the JavaScript GC.

>When full GC is triggered in SqueakJS (for example because the nextInstance primitive has been called on an object that does not have a next link) a two-phase collection is started. In the first pass, any new objects that are referenced from surviving objects are added to the end of the linked list, and thus become part of the old space. In a second pass, any objects that are already in the linked list, but were not referenced from surviving objects are removed from the list, and thus become eligible for ordinary JavaScript GC. Note also, that we append objects to the old list in the order of their creation, simply by ordering them by their object identifiers (IDs). In Squeak, these are the memory offsets of the object. To be able to save images that can again be opened with the standard Squeak VM, we generate object IDs that correspond to the offset the object would have in an image. This way, we can serialize our old object space and thus save binary compatible Squeak images from SqueakJS.

>To implement Squeak’s weak references, a similar scheme can be employed: any weak container is simply added to a special list of root objects that do not let their references survive. If, during a full GC, a Squeak object is found to be only referenced from one of those weak roots, that reference is removed, and the Squeak object is again garbage collected by the JavaScript GC.
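The two-pass scheme quoted above can be sketched roughly as follows. This is only an illustration under loud assumptions, not SqueakJS code: the names `SqueakObject`, `HybridHeap`, `full_gc`, and `enumerate_names` are hypothetical, a plain Python list stands in for the paper's linked list, and Python's own collector plays the role of the JavaScript GC.

```python
class SqueakObject:
    """Hypothetical stand-in for a Squeak object held as a plain host object."""
    next_id = 0

    def __init__(self, name):
        self.name = name
        self.refs = []                      # outgoing references
        self.oid = SqueakObject.next_id     # creation-ordered object ID
        SqueakObject.next_id += 1


class HybridHeap:
    """Old space is an explicit list; "new space" is just everything else."""

    def __init__(self, image_objects, roots):
        # Objects loaded from the image start in old space. Being in this
        # list keeps them alive as far as the host GC is concerned.
        self.old_space = list(image_objects)
        self.roots = list(roots)

    def full_gc(self):
        # Find the surviving objects by tracing from the roots.
        survivors = {}
        pending = list(self.roots)
        while pending:
            obj = pending.pop()
            if obj.oid not in survivors:
                survivors[obj.oid] = obj
                pending.extend(obj.refs)

        old_ids = {obj.oid for obj in self.old_space}

        # Pass 1: surviving new objects join old space, ordered by ID.
        new_survivors = sorted(
            (o for o in survivors.values() if o.oid not in old_ids),
            key=lambda o: o.oid,
        )
        self.old_space.extend(new_survivors)

        # Pass 2: unreferenced old objects leave the list, so the host GC
        # is free to reclaim them.
        self.old_space = [o for o in self.old_space if o.oid in survivors]

    def enumerate_names(self):
        # Enumeration only walks old space, hence the full GC beforehand.
        return [o.name for o in self.old_space]


a = SqueakObject("a")
b = SqueakObject("b")
b.refs.append(a)
heap = HybridHeap([a, b], roots=[b])

c = SqueakObject("c")    # new object, referenced from a survivor
b.refs.append(c)
d = SqueakObject("d")    # new object, never referenced
b.refs.remove(a)         # old object becomes unreachable

heap.full_gc()
names = heap.enumerate_names()
```

After the collection, `names` holds only `b` and `c`: `c` was promoted into old space, `a` was dropped from the list, and `d` never entered it, so the host collector alone decides when `a` and `d` go away.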

Yes, you can compile CPython to Wasm and use its GC.

The idea of WasmGC is to make objects which are available in the browser environment (like window) available in the Wasm module. It is a spec which allows you to pass objects which are actually managed by the browser's GC to your Wasm code.

Python has two garbage collection mechanisms: reference counting and tracing.

Reference counting does not handle circular references; tracing does.
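A small sketch of that difference, assuming CPython (other Python implementations manage memory differently). The `Node` class and `make_cycle` helper are made up for illustration; `gc` and `weakref` are standard library modules.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.other = None


def make_cycle():
    # Two objects that point at each other: a reference cycle.
    a = Node("a")
    b = Node("b")
    a.other = b
    b.other = a
    # Return only a weak reference, so nothing outside the cycle keeps it alive.
    return weakref.ref(a)


gc.disable()                 # stop the cycle collector from running on its own
ref = make_cycle()

# Reference counting alone cannot free the pair: each still has a count of 1.
alive_after_refcount = ref() is not None

gc.collect()                 # run the tracing (cycle-detecting) collector
alive_after_tracing = ref() is not None
gc.enable()
```

Under CPython, `alive_after_refcount` is `True` and `alive_after_tracing` is `False`: the cycle survives until the tracing pass runs.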

Tracing collectors have to be able to read all the references between objects to work. The difficulty is that some of those objects, or references to them, live on the stack; in simpler terms, they are objects and references that only one function can read and understand. In C there are some non-standard extensions that let you do this, with varying degrees of success. In WASM, this is prohibited by design, because it’s a safety issue.
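A toy illustration of why the stack matters, with everything in it invented for the example (`Obj`, `mark`, `sweep`, and the `shadow_stack` list are hypothetical names). It assumes the collector can only trace from the roots it is handed, and uses an explicit list to stand in for references that would otherwise live only in a function's stack frame.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []


def mark(roots):
    """Return the names of all objects reachable from the given roots."""
    seen = set()
    live = set()
    pending = list(roots)
    while pending:
        obj = pending.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        live.add(obj.name)
        pending.extend(obj.refs)
    return live


def sweep(heap, live):
    """Keep only the objects that were marked live."""
    return [obj for obj in heap if obj.name in live]


global_obj = Obj("global")
child = Obj("child")
global_obj.refs.append(child)
local_obj = Obj("local")     # pretend this is referenced only from a stack frame

heap = [global_obj, child, local_obj]

# A collector that cannot see the stack treats only globals as roots,
# so it wrongly frees the object a running function is still using.
without_stack = sorted(o.name for o in sweep(heap, mark([global_obj])))

# If stack references are recorded explicitly, the object survives.
shadow_stack = [local_obj]
with_stack = sorted(o.name for o in sweep(heap, mark([global_obj] + shadow_stack)))
```

Here `without_stack` loses `local` while `with_stack` keeps it, which is the gap a bundled tracing collector has to close when it cannot scan the stack itself.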