Hacker News new | ask | show | jobs
by moring 18 days ago
The article shows nicely how "every byte matters" is false. First, it starts off by talking about the cost of a new field, when the actual topic is array-of-structs vs. struct-of-arrays. Then, this:

> How much of an impact can this have? > Reading is:alive (1 byte) Across 1M Monsters

You aren't reading one byte here, you are reading 1M bytes! Of course, optimizing the access to 1M bytes is something to consider. Optimizing the access to one byte isn't.

The article is definitely worth reading IMHO, but it really needs a better headline!

2 comments

Even more so, it shows that SoA data structure means you can add fields to your 1M monsters with little impact.
This is valid for sequential scanning of the data. The CPU will fill whole cache lines at once with the arrays that do get used and the algorithm touches all the field instances in the array.

Now think about random access to single struct instances instead: the CPU loads a cache line worth of data for each field and uses only one element out of the whole cache line. This is much worse than a compact structure representation of the same data.

SoA is not universally better.

This sounds similar to relational databases vs document oriented databases, at least when I briefly looked into database like MongoDB when such things were all the rage 15-20 years ago.

For the internal web site that customer support people used a document oriented database would be great because that wants to load everything about one customer and pretty much doesn't need anything else until the user is done supporting that customer.

For the dozens or periodic reports that needed to be generated relational was way better. A given report generally only wanted a small amount of per customer data but wanted that for all customers.

A little bit of searching and LLM querying suggests that nowadays there are databases that are good at both kind of tasks, in particular Postgress with JSONB, at least at the scale we were looking at (maybe 30k or so customers), but maybe really big operations would need more specialized software.

The Array-of-Struct vs Struct-vs-Array organization is actually more similar to row-major ordering vs column-major ordering, i.e. the data structure that analysis databases use to optimize for aggregate calculations. Document databases are not really comparable because they don't impose structure on the data; with document databases you just have a tree of JSON elements, which is neither AoS nor SoA.
Or, another name for the same thing columnar: storage vs. row storage.
No it's not always better and I didn't mean to imply it was. I was simply saying that the article argues against its title.

In both cases you want to think about locality of the next read and structure the data accordingly.

> SoA is not universally better.

This is an important part of Data-oriented Design: the representation of the data should be pragmatically tied to its access patterns, not dogma.

Richard Fabian's DoD book gives the example that (x,y,z) is almost always better as a classic array-of-structs rather than a struct-of-arrays, because if you're accessing one dimension, you probably are want to process the other two dimensions at the same time:

https://www.dataorienteddesign.com/dodbook/node9.html#SECTIO...

> you can add fields to your 1M monsters with little impact.

Great for this access pattern, but I wouldn't make a general statement like that. This is the same thing as row-oriented vs column-oriented databases, OLTP vs OLAP. SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.

> SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.

Why is that? Genuinely curious. Does "weak" mean that it performs worse than AoS, or that the gains aren't as significant versus AoS?

It's because removing a monster with 20 fields from an SoA structure means resizing 20 arrays. Removing the same monster from an AoS array involves resizing a single array, which you're going to process in a very cache friendly way.
I'm not sure why anybody would at the same time be implementing SoA AND resizing 20 arrays for a single delete, those things seem to be on either ends of the "I care about performance" spectrum.
The point is that a simple SoA implementation requires this - each field in the monster struct is an item in 20 different arrays. So, removing one monster means removing that item from those 20 arrays.

Now, as others have suggested, you can have a more complex implementation, where instead of removing the monster's fields from those arrays, you just mark them as "dead" or whatever and then skip them when consuming the relevant arrays, with some relatively small extra bookkeeping overhead. Of course, this comes with its own drawbacks, especially if the number of monsters is very dynamic and you are memory constrained.

The point is not to say that SoA is never good for performance, it obviously and certainly is, probably even in most cases. It's just not always best for performance, this was all.

Assuming ordering isn't a concern, can't you just have a field called "removed" and skip those when iterating?

Or swap it with the last monster, and keeping an index for the last monster alive.

Sure, but these schemes might have their own drawbacks depending on the exact use case - especially if you have a very dynamic number of monsters and constantly add and remove them (say, some kind of bullet hell style game).
Then you have to read the "removed" field on every field read on every operation.

SoA is only useful when you don't read multiple fields for most operations.

Presumably they're referring to resizing the arrays.
Array resizing is avoidable with an embedded free list if ordering is of no concern.
If you take out ordering, then lookups on your SoA are now a search, and n-field lookup on an entity is now a JOIN operation.

The smarter you get about it, the closer you get to an OLAP db

Which leads to my theory… I feel like Bevy could be implemented on top of an in-memory DuckDB and get away with it

Yes. I think one of the big advantages of SoA is that you only pay for the fields you're currently using. If you need a field somewhere, you can add it and only pay the cost of iterating it where you need it.
Every Struct Matters