Hacker News
by wlll 1677 days ago
Related, and impressive: https://github.com/elfshaker/manyclangs

> manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.

> It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.


The clever idea that makes manyclangs compress well is to store object files before they are linked, with each function and each variable in its own elf section so that changes are mostly local; addresses will indirect through sections and a change to one item won't cascade into moving every address.
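
To illustrate why that helps, here is a toy model in Python (a rough sketch, not how clang, a real linker, or elfshaker actually represent anything). The function names, sizes, and the `link`/`unlinked` helpers are all made up for the example. It compares which functions look "changed" between two builds when calls are resolved to absolute addresses versus when they stay symbolic, as they would in unlinked per-section object files.

```python
def link(functions, order):
    """Toy 'linker': lay functions out back to back and resolve calls to absolute addresses."""
    addresses, cursor = {}, 0
    for name in order:
        addresses[name] = cursor
        cursor += functions[name]["size"]
    # Each function's stored form is its code plus the resolved callee addresses.
    return {
        name: (functions[name]["code"],
               tuple(addresses[callee] for callee in functions[name]["calls"]))
        for name in order
    }


def unlinked(functions):
    """Toy 'object file' view: calls stay symbolic, so no addresses are baked in."""
    return {name: (f["code"], tuple(f["calls"])) for name, f in functions.items()}


def changed(old, new):
    """Names of functions whose stored form differs between two builds."""
    return sorted(name for name in old if old[name] != new[name])


# Hypothetical build 1: four functions laid out in this order.
order = ["a", "b", "c", "main"]
v1 = {
    "a": {"code": "a-v1", "size": 16, "calls": []},
    "b": {"code": "b-v1", "size": 32, "calls": []},
    "c": {"code": "c-v1", "size": 24, "calls": ["b"]},
    "main": {"code": "main-v1", "size": 40, "calls": ["a", "b", "c"]},
}
# Hypothetical build 2: only function "a" was edited, and it grew by 4 bytes.
v2 = dict(v1, a={"code": "a-v2", "size": 20, "calls": []})

print("linked:  ", changed(link(v1, order), link(v2, order)))
print("unlinked:", changed(unlinked(v1), unlinked(v2)))
```

In the linked view, every function that calls something located after `a` picks up a new address and so looks different, even though its source did not change. In the unlinked view only `a` differs, which is the property that lets consecutive builds share almost all of their stored data.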

I'm not sure whether the linking step they provide is deterministic/hermetic. If it is, that would be a decent way to compress the final binaries while shaving off most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries if so?
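
A minimal Python sketch of what "store hashes of the linked binaries" could look like, assuming the link step really is reproducible. Nothing here is part of manyclangs; `record_hashes` and `verify_hashes` are hypothetical helper names, and the manifest is just a dict from file name to SHA-256 digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large binaries do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_hashes(paths):
    """Build a manifest {file name: sha256} for freshly linked binaries."""
    return {Path(p).name: sha256_of(p) for p in paths}


def verify_hashes(manifest, paths):
    """Return the names of relinked binaries whose hash differs from the manifest."""
    return sorted(
        Path(p).name for p in paths if manifest.get(Path(p).name) != sha256_of(p)
    )
```

If relinking the same objects on another machine yields an empty list from `verify_hashes`, that is decent evidence the link is deterministic for that build. A non-empty list means some input (timestamps, paths, linker version) is leaking into the output.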

I'm not seeing any particular tricks in elfshaker itself to enable this. The packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size also seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics
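
Roughly, the heuristic described above could be sketched like this in Python. This is only an illustration under stated assumptions: it is not elfshaker's pack format, `pack` and `unpack` are hypothetical names, and `zlib` from the standard library stands in for zstd.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def pack(objects, streams=4):
    """Sort objects by size, split them into contiguous chunks, compress each chunk separately.

    objects: dict mapping name -> bytes.
    Returns a list of (index, blob) pairs, where index is [(name, size), ...].
    """
    # Size ordering puts objects of similar size (often near-identical versions) next to each other.
    ordered = sorted(objects.items(), key=lambda kv: (len(kv[1]), kv[0]))
    per_stream = max(1, -(-len(ordered) // streams))  # ceiling division
    chunks = [ordered[i:i + per_stream] for i in range(0, len(ordered), per_stream)]

    def compress(chunk):
        index = [(name, len(data)) for name, data in chunk]
        blob = zlib.compress(b"".join(data for _, data in chunk), 9)
        return index, blob

    # Each chunk is an independent stream, so the streams can be compressed concurrently.
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(compress, chunks))


def unpack(packed):
    """Inverse of pack: decompress each stream and slice the objects back out."""
    objects = {}
    for index, blob in packed:
        raw = zlib.decompress(blob)
        offset = 0
        for name, size in index:
            objects[name] = raw[offset:offset + size]
            offset += size
    return objects


# Made-up sample objects, just to exercise the round trip.
sample = {
    "a.o": b"x" * 100,
    "a_v2.o": b"x" * 101,
    "b.o": b"y" * 3000,
    "b_v2.o": b"y" * 3004,
    "c.o": b"z" * 50,
}
packed = pack(sample)
```

The trade-off is that independent streams cannot share redundancy with each other, so more streams means more parallelism but potentially a larger pack.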

I'd like to see a comparison with Git and others listed here (same unlinked clang artifacts, compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...

Author here. I'd like to see such a comparison too, actually, but I'm not in a position to do the work at the moment. We did some preliminary experiments at the beginning, but a lot changed over the course of the project and I don't know how well elfshaker ultimately fares against all the options out there. Some basic tests against git found that git is quite a bit slower (10s vs 100ms) during 'git add' and 'git checkout'. Maybe that can be fixed with some tuning or by finding appropriate options.
It would be interesting to compare to gitoxide tweaked to use zstd compression for packs.
Reminds me of how Microsoft packages the Windows installer, actually. If you've ever unpacked Microsoft's install.esd, it's remarkable how heavily it's compressed. I assume it's full of semi-redundant binaries that provide compatibility with a lot of different systems, because the unpacked esd container goes from a few GiB to, I think, around 40-50 GiB, iirc.
The emulation community also has "ROMsets" — collections of game ROM images, where the ROM images for a given game title are all grouped together into an archive. So you'd have one archive for e.g. "every release, dump, and ROMhack of Super Mario Bros 1."

These ROM-set archives — especially when using more modern compression algorithms, like LZMA/7zip — end up about 1.1x the size of a single one of the contained game ROM images, despite sometimes containing literally hundreds of variant images.
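
A small Python experiment shows the same effect on synthetic data. Everything here is an assumption made for illustration: the "ROM" is 64 KiB of seeded pseudo-random bytes, each variant differs from it in about 20 single-byte patches, and the standard-library `lzma` module stands in for 7zip.

```python
import lzma
import random

rng = random.Random(0)  # fixed seed so runs are repeatable
ROM_SIZE = 64 * 1024

# Pseudo-random bytes are essentially incompressible on their own,
# so any savings below come purely from redundancy between variants.
base = bytes(rng.getrandbits(8) for _ in range(ROM_SIZE))


def make_variant(image, patches=20):
    """Copy the image and overwrite a handful of random bytes (a stand-in for a regional release or bugfix)."""
    data = bytearray(image)
    for _ in range(patches):
        data[rng.randrange(len(data))] = rng.getrandbits(8)
    return bytes(data)


variants = [base] + [make_variant(base) for _ in range(9)]

single = len(lzma.compress(base))                       # one image alone
solid = len(lzma.compress(b"".join(variants)))          # all ten images in one stream
separate = sum(len(lzma.compress(v)) for v in variants)  # ten images compressed one by one

print(f"one image: {single} bytes")
print(f"ten images, one stream: {solid} bytes ({solid / single:.2f}x one image)")
print(f"ten images, compressed separately: {separate} bytes")
```

Compressing the images one by one gains nothing, while the single-stream archive ends up only slightly larger than one image, because every later variant is encoded mostly as back-references into the earlier ones. This only works while the whole set fits inside the compressor's dictionary window.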

How does this work? Do all the game series use the same engine code and assets?
I think you're slightly misinterpreting what the parent said. Take the game Super Mario World for the console Super Nintendo. It was released in Japan. It was released in the US. It was released in Europe. It was released in Korea. It was released in Australia. It was probably released in various minor regions and given unique translations. There are almost certainly re-releases of the game on Super Nintendo that issued new ROM files to correct minor bugs. Maybe there's a Greatest Hits version which might be the same game, but with an updated copyright date to reflect the re-release. This might amount to 10-12 versions of the same game, but 99.99% of what's in the ROM file is the same across all of them, so they can be represented compressed very well.

A copy of Super Mario Advance 2 for Game Boy Advance, which is also a re-release of Super Mario World, almost surely uses its own engine and would not be part of the same rom set. Likewise, other Mario games (like Mario 64, Super Mario Bros, etc.) would not be part of the same rom set. So it's nothing about the series using the same engine code or assets.

We're talking bugfixes and different regions for the same game on the same console. But this still has the effect of dropping the size for complete console collections by 50% or more, because most consoles have 2-3 regions per game for most games.

You're generally correct. But there are interesting exceptions!

Sometimes, ROM-image-based game titles were based on the same "engine" (i.e. the same core set of assembler source-files with fixed address-space target locations, and so fixed locations in a generated ROM image), but with a few engine modifications, and entirely different assets.

In a sense, this makes these different games effectively into mutual "full conversion ROMhacks" of one-another.

You'll usually find these different game titles compressed together into the same ROMset (with one game title — usually the one with the oldest official release — being considered the prototype for the others, and so naming the ROMset), because they do compress together very well — not near-totally, the way bugfix patches do, but adding only the total amount to the archive size that you'd expect for the additional new assets.

Well-known examples of this are Doki Doki Panic vs. Super Mario Bros 2; Panel de Pon vs. Tetris Attack; Gradius III vs. Parodius; and any game with editions, e.g. Pokemon or Megaman Battle Network.

But there are more "complete" examples as well, where you'd never even suspect the two titles are related, with the games perhaps existing in entirely-different genres. (I don't have a ROMset library on-hand to dig out examples, but if you dig through one, you'll find some amazing examples of engine reuse.)

Sort of. ROMHacks are modified ROM images of a certain game.

If you knew where in the ROM image the level data was contained, you could modify it. As long as you didn't violate any constraints, the game would run fine.

You could also potentially influence game behavior as well.

The Game Genie and Gameshark were kind of based on this concept. Except, being further along the chain, they could rewrite values on their way into and out of memory, so other effects were possible.

So, in the case of Super Mario Bros. ROMHacks, they all use Super Mario Bros. as a base ROM. Then from there, all you need to do is store the diff from the base.
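
As a rough Python sketch of "store the diff from the base": the patch below is just a list of (offset, replacement bytes) runs. `make_patch` and `apply_patch` are hypothetical names, the example data is made up, and the sketch assumes the hack has the same length as the base image.

```python
def make_patch(base, modified):
    """Return [(offset, replacement_bytes), ...] covering every run where modified differs from base."""
    if len(base) != len(modified):
        raise ValueError("this sketch only handles images of equal length")
    patch, run_start = [], None
    for i, (old, new) in enumerate(zip(base, modified)):
        if old != new and run_start is None:
            run_start = i  # a differing run begins here
        elif old == new and run_start is not None:
            patch.append((run_start, modified[run_start:i]))  # the run just ended
            run_start = None
    if run_start is not None:  # a run that reaches the end of the image
        patch.append((run_start, modified[run_start:]))
    return patch


def apply_patch(base, patch):
    """Rebuild the modified image from the base image plus the patch."""
    image = bytearray(base)
    for offset, data in patch:
        image[offset:offset + len(data)] = data
    return bytes(image)


# Made-up 1 KiB "base ROM" and a "hack" that edits two small regions.
base_rom = bytes(range(256)) * 4
hack = bytearray(base_rom)
hack[10:13] = b"\xff\xfe\xfd"
hack[500] = 0
hack = bytes(hack)

patch = make_patch(base_rom, hack)
print(patch)
```

The patch holds only the four bytes that changed plus their offsets, and `apply_patch(base_rom, patch)` reproduces the hack exactly, so a collection of hacks can be kept as one base image plus many small patches.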

Ooh, neat. I was wondering why anybody would make a binary-specific VCS. And why "elf" was in the name. This answers both questions. Thanks!