by masswerk 989 days ago

This is how the program is actually stored:

  10SAVE"4",8:PRINT4

  0801  0F 08               link to next line at $080F
  0803  0A 00               line number (16-bit binary): 10
  0805  94                  token SAVE
  0806  22 34 22 2C 38 3A   ascii «"4",8:»
  080C  99                  token PRINT
  080D  34                  ascii «4»
  080E  00                  -EOL-
  080F  00 00               -EOP- (link = null)

As we can see, "SAVE" has already been compressed to a single byte (0x94), as has "PRINT" (0x99). Moreover, the line number is stored as a 16-bit binary integer, so the number of decimal digits in the listing has no effect on the in-memory format.
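To make the layout concrete, here is a minimal Python sketch that walks the bytes from the dump above the way a lister would: follow the link, read the line number, expand the tokens. The names (`list_program`, `TOKENS`) are made up for illustration, the token map only holds the two tokens seen in the dump, and quote mode is ignored (inside strings, bytes >= 0x80 would be plain characters), so this is a sketch and not what LIST does in ROM.

```python
import struct

LOAD_ADDR = 0x0801  # start address, taken from the dump above
PROGRAM = bytes.fromhex("0F08 0A00 94 22 34 22 2C 38 3A 99 34 00 0000")

# Only the two tokens that occur in the dump; the real table is much longer.
TOKENS = {0x94: "SAVE", 0x99: "PRINT"}


def list_program(data, load_addr=LOAD_ADDR):
    """Return [(line_number, text), ...] for a tokenized program image."""
    lines = []
    pos = 0
    while True:
        # Link to the next line, little-endian; a null link ends the program.
        (link,) = struct.unpack_from("<H", data, pos)
        if link == 0:
            break
        # The line number is a 16-bit binary value, not decimal digits.
        (number,) = struct.unpack_from("<H", data, pos + 2)
        end = data.index(0, pos + 4)  # the line text runs up to the 0 byte
        text = "".join(TOKENS.get(b, chr(b)) for b in data[pos + 4:end])
        lines.append((number, text))
        pos = link - load_addr  # turn the absolute link into an offset
    return lines


print(list_program(PROGRAM))
```

For the bytes above this prints `[(10, 'SAVE"4",8:PRINT4')]`. A line number like 10000 would still occupy the same two bytes.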

BTW, abbreviations of BASIC keywords work because of how upper-case/shifted letters are encoded in the PETSCII character set: they have their sign-bit set. (So normal letters are all smaller than 0x80, and shifted characters are >= 0x80. We may also note that, outside of quoted strings, codes >= 0x80 are used exclusively for tokens in the stored BASIC text, which discriminates them from any other text.) Now, the tokenizing routine uses a table of keywords, which also makes use of the sign-bit: it is set on the last character of each keyword as an end marker. The routine computes the difference between each letter of an input word and the corresponding letter of a table entry. If the difference is zero, the letters match and it moves on to the next letter. If the difference is exactly 0x80 (the sign-bit), this means (a) we arrived at the end of the word stored in the table, and (b) all the letters up to here did match (otherwise, we would have already exited the loop in order to test the next keyword). We have a match! The routine then adds 0x80 to the table index of that keyword, and voila, there is your BASIC token.
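The matching loop described above can be sketched in Python roughly like this. It is an illustration under assumptions, not the ROM routine: the keyword list is the first 26 entries of the BASIC V2 table as I remember them (SAVE and PRINT do land on 0x94 and 0x99 from the dump), the helper names `petscii`, `match_keyword` and `crunch` are invented, each table entry is kept as a separate byte string, and the special cases of the real routine (REM, DATA, "?", spaces) are left out.

```python
# First 26 keywords in table order (assumed, from memory of BASIC V2).
# Table index + 0x80 is the token, so SAVE -> 0x94 and PRINT -> 0x99.
KEYWORDS = [
    "END", "FOR", "NEXT", "DATA", "INPUT#", "INPUT", "DIM", "READ",
    "LET", "GOTO", "RUN", "IF", "RESTORE", "GOSUB", "RETURN", "REM",
    "STOP", "ON", "WAIT", "LOAD", "SAVE", "VERIFY", "DEF", "POKE",
    "PRINT#", "PRINT",
]

# Table form: every keyword with bit 7 set on its last character.
TABLE = [kw[:-1].encode("ascii") + bytes([ord(kw[-1]) | 0x80]) for kw in KEYWORDS]


def petscii(s):
    """Lower-case = unshifted key ($41-$5A), upper-case = shifted ($C1-$DA)."""
    out = bytearray()
    for ch in s:
        if ch.islower():
            out.append(ord(ch.upper()))
        elif ch.isupper():
            out.append(ord(ch) | 0x80)
        else:
            out.append(ord(ch))
    return bytes(out)


def match_keyword(text, pos):
    """Return (token, next_pos) if a keyword matches at text[pos:], else None."""
    for index, entry in enumerate(TABLE):
        for offset, k in enumerate(entry):
            if pos + offset >= len(text):
                break  # input ran out: try the next keyword
            diff = (text[pos + offset] - k) & 0xFF  # 8-bit subtraction
            if diff == 0x80:  # only the sign-bit differs: match
                return 0x80 + index, pos + offset + 1
            if diff != 0:  # mismatch: try the next keyword
                break
    return None


def crunch(text):
    """Tokenize one line of input; quoted strings are copied verbatim."""
    out = bytearray()
    pos, in_quotes = 0, False
    while pos < len(text):
        if text[pos] == 0x22:  # a double quote toggles string mode
            in_quotes = not in_quotes
        hit = None if in_quotes else match_keyword(text, pos)
        if hit:
            token, pos = hit
            out.append(token)
        else:
            out.append(text[pos])
            pos += 1
    return bytes(out)


print(crunch(petscii('save"4",8:print4')).hex(" "))
```

This prints `94 22 34 22 2c 38 3a 99 34`, the same bytes as $0805 to $080D in the dump.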

Notably, when we're dealing with single-byte values, a difference of 0x80 doesn't tell us which of the two bytes holds the bigger value: the subtraction wraps around at 8 bits, so the result is the same either way. For our tokenizing routine, this means it only "knows" that one character has the sign-bit set while the other has not (but is otherwise the same); it doesn't "know" which of the two this is. Therefore, setting the sign-bit on an input character fools the routine into assuming it has already gone over the entire keyword and hit the sign-bit set on the last character of the table entry. We achieve this by shifting the character in the input text. And, voila, there is your abbreviated BASIC keyword.
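The wrap-around argument is easy to check. A small Python sketch, where `diff8` is just a made-up helper that mimics an 8-bit subtraction:

```python
def diff8(a, b):
    """Subtract as an 8-bit register would: the result wraps modulo 256."""
    return (a - b) & 0xFF


# Regular end of keyword: a typed, unshifted 'e' ($45) against the marked
# last byte of a table entry ($C5, i.e. 'E' with the sign-bit set).
regular = diff8(0x45, 0xC5)

# Abbreviation: a typed, shifted 'A' ($C1) against an unmarked 'A' ($41)
# somewhere in the middle of a table entry.
abbreviated = diff8(0xC1, 0x41)

# Both cases look identical to the routine.
assert regular == abbreviated == 0x80


def symmetric_everywhere():
    """A difference of 0x80 holds in both directions, for all byte pairs,
    and exactly when the two bytes differ in the sign-bit only."""
    return all(
        (diff8(a, b) == 0x80) == (diff8(b, a) == 0x80) == ((a ^ b) == 0x80)
        for a in range(256)
        for b in range(256)
    )


assert symmetric_everywhere()
```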

(We can also see that the length of the input keyword doesn't contribute to the storage format, as it will be compressed to a token, which is 0x80 + the table index of the keyword, anyway. We may also see why "iN" matches "input#" but not "input": the longer version has to come first in the table in order to match at all, and so it is also the first to be recognized by the erroneous match.)
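The ordering effect can be tried out with a two-entry table. Again a hedged Python sketch: `mark` and `first_match` are invented names, and the loop is a simplified stand-in for the ROM routine that only reports which table entry wins.

```python
def mark(kw):
    """Table form of a keyword: the last character carries the sign-bit."""
    return kw[:-1].encode("ascii") + bytes([ord(kw[-1]) | 0x80])


def first_match(word, keywords):
    """Name of the first keyword accepted by the difference test, or None."""
    for kw in keywords:
        for c, k in zip(word, mark(kw)):
            diff = (c - k) & 0xFF
            if diff == 0x80:  # sign-bit is the only difference: match
                return kw
            if diff != 0:  # mismatch: try the next keyword
                break
    return None


ROM_ORDER = ["INPUT#", "INPUT"]  # longer keyword first, as described above
SWAPPED = ["INPUT", "INPUT#"]    # hypothetical order, for comparison

I_SHIFT_N = bytes([0x49, 0xCE])  # "iN": unshifted 'i', shifted 'n'

print(first_match(b"INPUT#", ROM_ORDER))   # INPUT#
print(first_match(b"INPUT", ROM_ORDER))    # INPUT
print(first_match(I_SHIFT_N, ROM_ORDER))   # INPUT#
print(first_match(b"INPUT#", SWAPPED))     # INPUT
```

With the swapped order, "input#" is caught by the shorter entry at its "t", so the longer keyword could never be matched.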