by Someone 1024 days ago
Nitpick: this isn’t standard C (it uses void main, not int main)

Nitpick 2: why ldx #0 txa rts? I would think lda #0 rts is shorter and faster

Back to my question: if it can’t, the claim “an assembler programmer couldn't do better” isn’t correct.

I think an assembler programmer for the 6502 would consider doing a jmp at the end, even if it makes the function return an incorrect, possibly even unpredictable value. If that value isn’t used, why spend time setting it?

An assembly programmer would also:

- check whether the routine at 0xFBFD accidentally guarantees to set A to zero, or returns with the X or Y register set to zero, and shamelessly exploit that.

- check whether the code at 0xFBFD preserves the value of the accumulator (unlikely, I would guess, but if it does, the two consecutive ‘l’s need only one LDA#)

- consider replacing the code that outputs the space inside “hello world” with a call to 0xFBF4 (move cursor right). That has the same effect if there already is a space there when the code is called.

- call 0xFBF0 to output a printable character, not 0xFBFD (reading https://6502disassembly.com/a2-rom/APPLE2.ROM.html, I notice that is faster for letters and punctuation)

On a 6502, that’s how you get your code fit into memory and make it faster. To write good code for a 6502, you can’t be concerned about calling conventions or having individually testable functions.


I bet that sizeof(int)==2 - which immediately tells you everything you need to know - and the return value from a function has 8 bits in X and 8 bits in A. So ldx#0:txa is how you load a return value of (int)0.
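
To make the convention guessed at above concrete, here is a minimal C sketch. It assumes, as the comment does, that the high byte of a 16-bit int comes back in X and the low byte in A. The names `RegPair` and `split_return` are hypothetical and only model the idea; they are not taken from any compiler's ABI documentation.

```c
#include <stdint.h>

/* Hypothetical model of a 16-bit return value carried in two 8-bit registers. */
typedef struct {
    uint8_t a; /* low byte, assumed to travel in the accumulator */
    uint8_t x; /* high byte, assumed to travel in the X register */
} RegPair;

/* Split a 16-bit value the way the assumed convention would return it. */
static RegPair split_return(int16_t value) {
    uint16_t bits = (uint16_t)value; /* work on the raw bit pattern */
    RegPair r;
    r.a = (uint8_t)(bits & 0xFF);
    r.x = (uint8_t)(bits >> 8);
    return r;
}
```

Under that assumption, returning (int)0 means zeroing both registers. `ldx #0` followed by `txa` does it in 3 bytes, where `lda #0` plus `ldx #0` would take 4.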

Regarding this specific unrolled loop, I would expect a 6502 programmer would just write the obvious loop, because they're clearly optimizing for space rather than speed when calling the ROM routine. They'll be content with the string printing taking about as long as it takes, which clearly isn't too long, as they wouldn't have done it that way otherwise. And the loop "overhead" won't be meaningful. (Looks like it'll be something like 7 cycles per character? I'm not familiar with the Apple II. Looks like $fbfd preserves X though.)
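
The "about 7 cycles per character" estimate above can be checked with a little arithmetic. The sketch below is only a rough model under stated assumptions: a hypothetical count-down loop of the form `LDA str-1,X / JSR / DEX / BNE`, no page crossings, a ROM routine that preserves X, and the routine's own run time left out because it is identical in both versions. The function names are made up for illustration.

```c
/* Documented NMOS 6502 cycle counts, assuming no page crossings. */
enum {
    CYC_LDA_IMM   = 2, /* LDA #imm              */
    CYC_LDA_ABS_X = 4, /* LDA abs,X             */
    CYC_DEX       = 2, /* DEX                   */
    CYC_BNE_TAKEN = 3, /* BNE, branch taken     */
    CYC_JSR       = 6  /* JSR abs               */
};

/* Unrolled, per character: LDA #c ; JSR out */
static int unrolled_cycles_per_char(void) {
    return CYC_LDA_IMM + CYC_JSR;
}

/* Looped, per character: LDA str-1,X ; JSR out ; DEX ; BNE loop */
static int looped_cycles_per_char(void) {
    return CYC_LDA_ABS_X + CYC_JSR + CYC_DEX + CYC_BNE_TAKEN;
}

/* Extra cycles the loop costs for each character printed. */
static int loop_overhead_per_char(void) {
    return looped_cycles_per_char() - unrolled_cycles_per_char();
}

/* Code size in bytes: unrolled is (2-byte LDA # + 3-byte JSR) per character. */
static int unrolled_bytes(int n) {
    return 5 * n;
}

/* Looped: LDX #n (2) + LDA abs,X (3) + JSR (3) + DEX (1) + BNE (2), plus n string bytes. */
static int looped_bytes(int n) {
    return 2 + 3 + 3 + 1 + 2 + n;
}
```

With these assumptions the overhead comes out to 7 cycles per character, and for a 15-character string the loop needs about 26 bytes against 75 for the unrolled version.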

We did a lot of loop unrolling and self modifying code back in the day, when making demos for the C64. The branch is really expensive. For example, clearing the screen you might use 16 STA adr,x and then add 16 to X before you branch to the loop.

Indeed, in some cases you want the unrolls. The 6502 is good in the twisties, but if you're trying to do any kind of copy or fill then the percentage of meaningful cycles is disappointingly low, and the unroll may be necessary. Also, if you're trying to keep in sync with some other piece of hardware, then just doing it step at a time can be much easier.
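
The unrolled fill described above can be sketched in C. This only illustrates the shape of the technique, not actual C64 code: `fill_unrolled16` is a hypothetical name, and the sketch assumes the length is a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill `len` bytes with `value`, doing 16 stores per loop branch.
   Assumes `len` is a multiple of 16 (illustration only). */
static void fill_unrolled16(uint8_t *dst, size_t len, uint8_t value) {
    for (size_t x = 0; x < len; x += 16) {
        /* Sixteen stores, mirroring 16 x "STA adr,x" at fixed offsets. */
        dst[x + 0]  = value; dst[x + 1]  = value;
        dst[x + 2]  = value; dst[x + 3]  = value;
        dst[x + 4]  = value; dst[x + 5]  = value;
        dst[x + 6]  = value; dst[x + 7]  = value;
        dst[x + 8]  = value; dst[x + 9]  = value;
        dst[x + 10] = value; dst[x + 11] = value;
        dst[x + 12] = value; dst[x + 13] = value;
        dst[x + 14] = value; dst[x + 15] = value;
    }
}
```

On the 6502 itself, STA abs,X takes 5 cycles, while INX plus a taken BNE adds another 5, so a one-store-per-iteration loop spends roughly half its time on overhead. Spreading the index update and branch over 16 stores is what brings that ratio down.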

I have done a lot of all of this sort of code and I am quite familiar with the 6502 tradeoffs. But for printing 15 chars by calling a ROM routine, I stand by my comments.

Yes, I compiled with -O3 for maximum speed. That would be an unusual flag choice in most cases.

I just wanted to show 6502 code (so many people seem to be able to read it!) side by side with C. x86 would have worked as well, and there the fastest answer would also be the same construct, given the dependency on an external routine.

> Nitpick: this isn’t standard C (it uses void main, not int main)

You know what, I'm gonna nitpick that nitpick: void main() is fully allowed on a freestanding target, which is still standard C.

Given the C standard's historically generous interpretation of undefined behaviour and other miscellany, I think it's a reasonable interpretation of the standard to pretend that a target that allows something other than int main(...) is freestanding rather than hosted, and therefore fully conforming.

Yep, llvm-mos-sdk is explicitly freestanding; the libc functions in the SDK follow the hosted C standard, but they don't add up to a hosted implementation. The only known C99 non-compliance is the lack of floating point support, which is actively being worked on.
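
For anyone who wants to check which kind of implementation they are compiling for, C99 and later define the macro `__STDC_HOSTED__` as 1 for a hosted implementation and 0 for a freestanding one. A minimal sketch follows; the function name `is_hosted_implementation` is hypothetical, and this makes no claim about what any particular toolchain reports.

```c
/* Returns 1 when compiled as a hosted implementation, 0 otherwise.
   __STDC_HOSTED__ is a standard predefined macro since C99. */
static int is_hosted_implementation(void) {
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__
    return 1;
#else
    return 0; /* freestanding, or a pre-C99 compiler that lacks the macro */
#endif
}
```

On a freestanding implementation the name and type of the startup function are implementation-defined, which is why void main() can be conforming there.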