by vlovich123 1033 days ago

While the sentiment is correct as to why compilers make alignment assumptions, I think a lot of the details here are not quite right.

> For starters, not all hardware platforms allow unaligned accesses at all

If you're dealing with very simple CPUs like the ARM M0, sure. But even the M3/M4 allows unaligned access.

> Even on x86 where it's supported, you want to avoid doing unaligned reads at all costs because they're up to 2x slower than aligned accesses

I believe that information hasn't been true for a long time (since 1995). Unless you're talking about unaligned accesses that also cross a cache line boundary being slower [1]. But I imagine that aligned accesses crossing a cache line boundary are also similarly slower because the slowness is the cache line boundary.
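To make the cache-line point concrete, here is a minimal sketch in C of how one might test whether a given access straddles a line. The 64-byte line size and the helper names `crosses_cache_line` and `is_naturally_aligned` are assumptions for illustration, not something taken from the linked thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, which is typical but not universal. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: true if an access of `size` bytes starting at
 * `addr` touches two different cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/* Hypothetical helper: true if `addr` is a multiple of `size`.
 * Only meaningful when `size` is a power of two. */
static bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}
```

Under these assumptions, a 4-byte access at offset 61 straddles two lines while one at offset 1 is unaligned but stays inside a single line. A naturally aligned power-of-two access no larger than the line never straddles, so in practice the line-crossing cost applies to unaligned accesses.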

> God forbid you try to use unaligned atomics, because while technically supported by x86 they're 200x slower than using the LOCK prefix with an aligned read

What you're referring to is atomic unaligned access that's also across cache line boundaries. I don't know what it is within a cache line, but I imagine it's not as bad as you make it out to be. Unaligned atomics across cache line boundaries also don't work on ARM and have much spottier support than unaligned access in general.

TLDR: People cargo cult advice about unaligned access, but mostly because "always align" is a simpler rule of thumb, and there's typically very little benefit to packing things as tightly as possible, which is where unaligned accesses generally come up.

[1] https://news.ycombinator.com/item?id=10529947

3 comments

AshamedCaptain 1033 days ago

Your message is more misleading than the GP.

Many architectures sold today still claim unaligned accesses are optional (e.g. all ARM pre-v7, which includes the popular Raspberry Pi Zero). Not to mention that even if they are supported, not all instructions support it (which is the case today on all ARM cores and even on x86).

Among the architectures and instructions that do support it, there may be a performance penalty ranging from "somewhat slower" (e.g. Intel still recommends stack alignment, because otherwise many internal store optimizations start giving up) to "ridiculously slower" (e.g. I once had to write a trap handler that software-emulated unaligned accesses on ARM -- on all 32-bit ARMs Linux still does this for all instructions except plain undecorated LDR/STR when the special unaligned ABI is enabled).
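For context, a common way portable C code sidesteps the question of which instructions tolerate misalignment is to copy the bytes with `memcpy` and let the compiler pick the load sequence for the target. This is a minimal sketch; the helper name `load_u32_unaligned` is hypothetical, and how cheap the generated code is depends on the compiler and architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value (host byte order) from an
 * address that may not be 4-byte aligned. Dereferencing a misaligned
 * uint32_t* directly is undefined behavior in C; memcpy is not. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

The design choice here is to leave the decision to the compiler: it can emit a single load where the target permits it and fall back to byte-wise loads where it does not, instead of relying on a trap handler at run time.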

And finally, even if the architecture supports it with decent enough performance, it may do it with relaxed atomicity. E.g. even as of today aarch64 makes zero guarantees regarding atomicity of even atomic instructions on unaligned addresses (yes, really). Put simply, that is because it is a _pain in the ass_ to implement correctly (say the programmer does atomic loads/stores on overlapping addresses with different alignments). This is whether they cross cache lines or not.

i.e. it's as bad as the GP is saying. You can't just give one example of one processor handling each case correctly to dismiss this claim, because the point is that most processors don't bother, and those that do bother still have severe, crippling limitations that make it infeasible to use in a general-purpose compiler.

And there is still a lot of benefit to packing things up... but it does require way too much care and programmer effort.


torusle 1033 days ago

> If you're dealing with very simple CPUs like the ARM M0, sure. But even the M3/M4 allows unaligned access.

On ARM M3/M4 you have the same issue with the LDRD and STRD instructions, which do not allow unaligned access. Even the normal loads/stores don't allow unaligned access in all cases. Try this in the peripheral memory region for starters. And things get even more complicated when the memory protection unit shakes things up.


macjohnmcc 1033 days ago

Yeah even Microsoft's compiler aligns values on appropriate boundaries for performance reasons. DWORDs on DWORD boundaries etc. And if you want to pack the data structure to avoid the gaps in structures there are methods to do so via #pragma options. I think their complaining about what was done for performance reasons shows a great lack of overall understanding. More time researching and less time griping would have served them better.
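As an illustration of the #pragma route, here is a minimal sketch using `#pragma pack(push, 1)`, which MSVC, GCC and Clang all accept. The struct names are made up for the example, and the padded layout mentioned in the comment is what mainstream ABIs produce, not something the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler inserts padding after `tag` so that
 * `value` lands on its natural boundary (offset 4 on mainstream ABIs). */
struct padded {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout: no padding, so `value` sits at offset 1 and every
 * access to it is potentially unaligned. */
#pragma pack(push, 1)
struct packed {
    uint8_t  tag;
    uint32_t value;
};
#pragma pack(pop)

static_assert(offsetof(struct packed, value) == 1, "value follows tag directly");
static_assert(sizeof(struct packed) == 5, "no padding in packed struct");
```

The packed struct saves space at the cost of making `value` misaligned, which is exactly the trade-off this thread is arguing about.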
