Which means Commodore 64, Apple ][, Atari 2600, Acorn Atom, NES…inspired, of course, by this topic.
I’m working on the 6502 Å-machine interpreter, and my inexperience with 6502 assembly is really showing. I’m currently working on this block of code, which is run when entering a div:
ldx divsp ; Div stack pointer
lda opermsb+0 ; Most significant byte of operand 0
sta divstk,x ; Div stack
lda operlsb+0 ; Least significant byte of operand 0
sta divstk+1,x
inx
inx
stx divsp
asl
rol opermsb+0
asl
rol opermsb+0
asl
rol opermsb+0
;clc
adc stybase
sta phydata
lda opermsb+0
adc stybase+1
sta phydata+1
I can’t figure out what this ASL - ROL - ASL - ROL - ASL - ROL sequence is doing. I know it’s messing with the bits of the operand, but what happens when you interleave them like this?
This treats the accumulator as the low byte and opermsb+0 as the high byte of a 16-bit number, and shifts that number three steps to the left, i.e. multiplies it by 8.
ASL shifts a 0-bit in at the bottom of a value and shifts the top bit of the value out into the carry. ROL shifts the carry in at the bottom of a value and shifts the top bit out into the carry.
At the end, if we’re sure the original number was small enough that there can’t be an overflow (i.e. the result still fits in 16 bits), we know that the carry is clear, so we can skip the clc before the addition.
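To make the interleaving concrete, here’s a trace with a made-up starting value (oper0 = $0123 is my example, not from the interpreter):

asl           ; A: $23 -> $46, C = 0   (value now $0246)
rol opermsb+0 ; MSB: $01 -> $02, C = 0
asl           ; A: $46 -> $8C, C = 0   (value now $048C)
rol opermsb+0 ; MSB: $02 -> $04, C = 0
asl           ; A: $8C -> $18, C = 1   (top bit of the low byte goes to carry...)
rol opermsb+0 ; MSB: $04 -> $09, C = 0 (...and on into the MSB: $0918 = $0123 * 8)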
Since multiplication isn’t supported in hardware on the 6502, one typically uses shift operations instead when speed is important and one of the factors is known in advance. E.g. an entry in the object table is 14 bytes long in z5. Instead of calling a generic multiplication routine with the value and 14 as parameters, you can:
shift the value left three times (now we have the original value ×8)
subtract the original value from this (now we have the original value ×7)
shift the value left one more time (now we have the original value ×14)
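A sketch of that ×14 sequence for a 16-bit value, using made-up labels (lo/hi hold the value, tmplo/tmphi a scratch copy; this assumes the result fits in 16 bits):

lda lo     ; Save a copy of the original value
sta tmplo
lda hi
sta tmphi
asl lo     ; value * 2
rol hi
asl lo     ; value * 4
rol hi
asl lo     ; value * 8
rol hi
sec        ; 16-bit subtract: value * 8 - value = value * 7
lda lo
sbc tmplo
sta lo
lda hi
sbc tmphi
sta hi
asl lo     ; value * 14
rol hi

Four shifts and one subtraction, with no loops and no generic multiply routine.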
Ahh, that makes sense! This interpreter stores style class information in structs seven bytes long; I’m guessing those structs are padded out to eight bytes for simplicity, so this is taking the index of a struct, multiplying it by eight (2³), then adding it to stybase (start of the struct array?) and putting the 16-bit result in phydata (start of the current struct?).
Yes, it performs: phydata = stybase + 8 * oper0
(where oper0 consists of an LSB (least significant byte) and MSB (most significant byte), which are stored in two different arrays, because it’s faster to access them that way)
Not much to add to what Fredrik wrote, but regarding the structure padding: a not uncommon way to optimize things on the 6502 is to unpack “arrays of structs” into “structs of arrays”.
Say you have an array with 50 elements that each contain a 16-bit pointer and two 8-bit variables. Instead of storing that as 50 times 4 bytes, you could have 4 arrays of 50 elements:
Low byte of the pointer
High byte of the pointer
First 8-bit variable
Second 8-bit variable
This way you can access any element with just X- or Y-indexed addressing, without multiplying.
Obviously that’s only useful if you are going to have no more than 256 elements, and if all the sizes are known at compile time.
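With that layout, reading one element’s fields is just a handful of indexed loads (ptrlo, ptrhi, var1, var2 are hypothetical table names, 50 bytes each):

ldx index    ; Element number, 0..49
lda ptrlo,x  ; Low byte of this element's pointer
sta dest
lda ptrhi,x  ; High byte of this element's pointer
sta dest+1
lda var1,x   ; First 8-bit variable
sta field1
lda var2,x   ; Second 8-bit variable
sta field2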
I think once I finally wrap my head around all the opcode mnemonics, I’ll get a lot faster at all of this. I have the advantage of experience with assembly on other systems (though generally ones with a lot of registers, like MIPS or the Z-machine); having only A, X, and Y is a bit of a change! Thankfully Linus commented his 6502 assembly a lot more than his C.
I felt very smart when I finally figured out the purpose of BIT yesterday. The C64 Å-machine interpreter has a “register” (is there a better short name for “byte in the zero page”?) called stflag that records whether it’s currently in a status area: $00 for nothing, $FF for the top status bar (supported), $80 for the inline status bar (not supported). It took me ages to figure out why these values specifically—it’s because they can both be created without any immediates (decrement $00 to get $FF, rotate $00 with the carry flag set to get $80), and can then be tested with BIT. N will be set if any status bar is active, and V will be set if it’s a supported one.
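A sketch of how that might look in practice (stflag is from the actual source; the rest of the labels are my guesses):

lda #0
sta stflag    ; $00: not in a status area
dec stflag    ; $00 -> $FF: top status bar (no immediate operand needed)
; or, starting again from $00:
sec
ror stflag    ; $00 -> $80: inline status bar
; ...
bit stflag    ; Copies bit 7 into N and bit 6 into V
bpl notstatus ; N clear: not in any status area
bvc inlinebar ; V clear: the unsupported $80 inline bar
; Fall through: $FF, the supported top status bar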
Size optimization on the 6502 is a compounding thing: since relative branches are encoded in a single byte, they can only reach about 128 bytes forward or backward, so as soon as a routine grows you start having to use JMP instead of Bcc… which in turn makes the code larger, so you end up doing inverted tests to branch over a JMP that can actually reach where you want to go.
So every time you can shave one byte here, one byte there, you increase the chance that you can turn a JMP into a Bcc (after ensuring the status flags are stable, of course), which in turn can be a large enough change that you can convert some more, remove the inverted branches, etc…
You also often see things like the RTS instruction sitting in the middle of a routine, so the exit can be reached with a relative branch from any point in the routine.
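The inverted-test pattern looks like this (hypothetical labels):

; "beq target" won't assemble once target drifts outside the
; roughly ±128-byte branch range, so invert the condition:
bne skip    ; Inverted test: branch over the jump
jmp target  ; 3-byte absolute jump, reaches anywhere
skip:
; ...execution continues here when the original condition was false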