Anybody here with Apple ][ experience?

I realize I’m probably better off on a different forum site dedicated to retrocomputing, but that feels like starting over.

I’m working on a new interpreter, written from scratch, for Apple 2. I know there are others out there, and Ozmoo exists. I wanted to write one myself for the same reason I wrote an interpreter in C++ from scratch, and an IF development language from scratch. They’re interesting problems, and I enjoy programming.

Anyway, this interpreter has been designed from the ground up to be as fast as possible, and as such, I have it filling memory at startup (Apple 2 disks are 140k which is enough to hold any V3 story and the interpreter) and using extended memory.

On the Apple ][+, the most common memory expansion is a Saturn memory board. It essentially looks like eight language cards, and is large enough I can efficiently support any V3 game up to 128k. It’s pretty cool, because I only need to recompute virtual mappings any time the program counter crosses the $BFFF or $FFFF boundaries. (Well, and branches and jumps).

The problem is the Apple ][e. It has a really bizarre memory setup with two broad classes of memory, each of which can independently be “main” or “aux” memory:

  • Everything between $200 and $BFFF
  • Everything between $000 and $1FF, plus everything between $D000 and $FFFF

$C000-$CFFF on Apple 2 is reserved for I/O cards etc and is never RAM. Apple 2e does emulate the language card from Apple 2+, but not the same way Saturn memory does. Essentially the language card can be “main” memory from $D000-$FFFF, or “aux” memory from $D000-$FFFF, which falls into the second case above.

It would have been really nice if the memory between $000-$1FF was managed separately, but that’s not the world we live in.

For each of the ranges above, we can control whether memory reads come from main or aux memory, and we can control whether memory writes go to main or aux memory.
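
(For reference, these are the soft switches involved, as best I remember them; the names and addresses are worth double-checking against the IIe technical reference:)

RDMAINRAM = $C002   ; store here: reads of $0200-$BFFF come from main memory
RDCARDRAM = $C003   ; store here: reads of $0200-$BFFF come from aux memory
WRMAINRAM = $C004   ; store here: writes to $0200-$BFFF go to main memory
WRCARDRAM = $C005   ; store here: writes to $0200-$BFFF go to aux memory
SETSTDZP  = $C008   ; store here: $000-$1FF and $D000-$FFFF (ZP, stack, LC) are main
SETALTZP  = $C009   ; store here: $000-$1FF and $D000-$FFFF (ZP, stack, LC) are aux

        sta RDCARDRAM   ; any store flips the switch; the value in A doesn't matter
        ; ... reads of $0200-$BFFF now come from aux ...
        sta RDMAINRAM   ; and back to main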

Finally, on Apple 2e, the standard way to extend memory past 128k is to support additional “aux” memory banks. You can control which of them is active, versus the standard base aux bank.

Currently, my interpreter code lives between $800 and $1FFF, and $2000 to $BFFF is meant for 40k of dynamic memory (and, currently, static memory too, but that may change if I find V3 stories that need more than 40k dynamic+static memory).

On Apple 2+, the high memory ends up being whatever fits in the first 40k, along with enough 12k banks of “language card” memory to hold the rest of the story. Any time the instruction pointer changes (due to a call, jump, or branch) I do some quick comparisons and make the appropriate language card 16k bank resident. This works fine and is extremely fast.

The problem is how to do this on the Apple 2e.

Fortunately I only ever need to read high memory, not write it (except during boot, of course). My interpreter uses mostly zero page for variables, for obvious reasons, but there are reads and writes to $800-$1FFF sometimes as well (stack, shadowed globals, etc).

So I feel like the correct way forward here is to only bank $000-$1FF and $D000-$FFFF.

But… zero page and the stack are included there, which is a real problem.

A simple solution would be to just treat high memory like a RAM disk. The story would always be loaded into aux banks 1+, and then any time I have a page miss, I copy pages from aux memory into main high memory. I’m not crazy about that because it’s a lot more complexity I didn’t need on the Apple 2+.

On the other hand, if I kept the aux memory active all the time, I wouldn’t have to copy high memory around, but then I’d have to do something about zero page and the stack being out of sync. Any time I switched to a different main/aux bank, I’d need to copy zero page variables over, which sounds really fragile. (And, of course, be really careful with the stack).

We need to read high memory in only a few situations though:

  • When decoding the current instruction (including branches)
  • When dealing with inline print and print_ret, or print_paddr

So maybe it wouldn’t be too bad to bank in aux memory only long enough to do the read? Right now the code to read the next instruction is a macro which loads the data through ZP, then increments a zero-page location and, if that overflows, calls a helper to do more work. But I can’t really use zero page here, because once aux memory is banked in it’s no longer the correct zero page.

On 2e, I guess I could make the macro always call a function. Then the next address could be self-modifying code (lda $1234), which would avoid the whole zero page mess. If the instruction pointer is in the first 40k, the function can just be that load, plus wrapping the address properly. If the pointer is past that, the entire function can be rewritten to bank in the correct page first, then do the load, then bank it back out again (which is just a few instructions). Code that does inline strings could be special-cased to avoid constantly banking.
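
Roughly this shape, I think (just a sketch: it assumes the language card RAM is already switched in for reading, that nothing can interrupt while the aux zero page and stack are banked in, and read_high_byte/zaddr are made-up names):

read_high_byte
        sta $C009          ; SETALTZP: $000-$1FF and $D000-$FFFF switch to aux
        !byte $ad          ; lda absolute
zaddr   !byte $00, $D0     ; operand, rewritten whenever the Z-PC moves
        sta $C008          ; SETSTDZP: back to the main zero page and stack
        rts                ; the jsr/rts happen on the main stack, so this is safe

The only rule is: no zero page and no stack between the two stores.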

So… maybe that’s the way forward.

Thank you for listening to my TED talk, I guess.

-Dave

2 Likes

I think your analysis here is essentially correct, but a few comments.

First, yay! A proper Apple II interpreter! We have the Infocom terps, and Vezza if you have a Z80 softcard, but there has been a longstanding lack of free options for people wanting to legally ship PunyInform games (etc.) for stock Apple II. The Ozmoo team has so far not been interested in doing an Apple II port.

The usual approach for accessing the regions of aux memory that also affect the zero page is to either write small aux-resident copy/access routines, or only access the bank from short routines that do not require zero-page access. This is not a simple “bank-switch and read/write in-place” option, but the bank-switch requires only a few cycles per access, so it is not terribly impractical to have a separate “read byte and increment PC” routine that does the bank-switching on the fly when needed. This logic can easily be repurposed for print routines by temporarily repurposing the PC as the pointer for the ZSCII decoding.

However, if you do want to go down the rabbit hole of a full cache system, it would also open up the possibility of supporting Z5/Z8 games in the future by loading cache misses from disk. (Storing the first half of the game in RAM, then loading cache misses from a second disk with the remainder of the game [1].) I would love to have a non-proprietary Z5 terp on Apple II.


[1] qkumba has written a tiny fastloader called qboot/gumboot that can be used to directly load individual disk tracks, which would be ideal for fast disk caching.

1 Like

FWIW my goal is to support V5 once I get V3 stable. But I’m using the $C600 routines to read the story data right now – no DOS of any kind – so any V5 support would need to be spread across multiple disks and/or require a lot of memory.

But I think I see what you’re saying … a V5 game at least could be two disks. And I suppose you could then have an option to either read the entire story up front, or on demand.

The bigger issue is that V5 games are that much more likely to need more than 40k of dynamic+static or even just dynamic memory. Plus, I’m getting close to overflowing the 6k I have allocated for my interpreter code already, which means it will probably end up becoming an 8-10k interpreter and max dynamic memory will go down by that much.

However, I wouldn’t be surprised if there were a lot of V5 games from the 90’s and early 2000’s that were actually smaller than 128k but were written during the time when V3 support was broken on Inform. Those should be easy enough to support – just a few more opcodes, and obviously larger object structures.

Another question… the last time I touched a physical Apple machine (well, actually it was a Franklin Ace 1000) was probably 1989 or 1990. I’ve just been on emulators since then. Do you have a feel for how many people out there are:

  1. Still running an Apple 2+ (which would need a Saturn 128k memory card for my current implementation, or at the very least, probably a language card)
  2. Running a 2e with only 128k
  3. Running a 2e with more than 128k
  4. How many 2e enhanced are out there? I noticed that enabling 65C02 instructions made the interpreter a few percent faster (mostly because of jmp ($1234,x) and lda ($12))

Attending the odd Apple II event, I’d say IIe’s are the most common model. Older models like the Apple II+ blow up or need repair more often by now, and IIc’s are harder to repair due to their design.

I rarely hear about people putting extra memory in a 128k Apple IIe. They probably do it in their tinkering – I am not a tinkerer – but it doesn’t have widespread practical application, except maybe being able to host a big AppleWorks document or something. So I’d assume the typical 128k.

Enhanced IIe’s… I’m not sure. Odds suggest the majority are enhanced. The IIe was new in 1983, in 1985 you could get it enhanced, and from 1987 every new IIe was enhanced. The computer sold until 1993. But when I go to these events, people are running a whole range of models. I feel like they’re the kind of people you might be appealing to, and so it’s probably best to stick with instructions that will run on unenhanced machines.

The place people experiment with adding tons (megabytes) of memory is with the Apple IIGS’s 16-bit persona. But anyone who has a IIGS can use it as an enhanced IIe as well in its 8-bit persona.

-Wade

1 Like

Out of curiosity, I did a (somewhat slapdash) analysis of the Z-code directory in IFarchive a while back. About 39% of Z-code games (across all versions, extracting blorbs) are <128 KB; 87% use <40 KB of dynamic memory; 78% use <40 KB of static memory.

1 Like

Interesting. Maybe attempting to fit the entire story in memory is a lost cause.

On a standard 2e, I could use 40k of aux memory from $2000-$BFFF, plus 16k each of the main and aux language cards, which together with the 40k of main memory would give me a 112k max story size… so close.

And if there aren’t really many Apple 2+'s around, having a version that worked with a Saturn 128k card is probably pretty pointless.

So I’ll probably need to page high story memory from the disk.

On a 2e I could run my “Zieve” in 43 jiffies (60 Hz ticks, measured by polling $C019 before every Z instruction, which has its own overhead). Once I changed the memory model as discussed above, it went up to 52 jiffies, mostly, I think, because it’s no longer using as many zero-page instructions and there’s subroutine call overhead at every call site.

Which maybe means I’m better off using the 12k of readily visible language card memory as “virtual memory” and then backing it from either aux memory (via AUXMOVE in the 80 column card rom, which is always visible) or the disk drive.

So, any time there’s a page miss, find the oldest page and evict it to aux memory. Then request the new page either from aux memory (if it’s there) or the disk (if it’s not).
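
The eviction itself could be something like this (a sketch, not real code: page_hi and auxpage_hi are made-up variables, and the A1/A2/A4 pseudo-register addresses and the carry convention are from memory, so double-check them against the IIe reference):

evict_page
        lda #0
        sta $3C            ; A1L: source = start of the 256-byte page being evicted
        sta $42            ; A4L: destination in aux memory
        lda #$FF
        sta $3E            ; A2L: source end, inclusive
        lda page_hi        ; high byte of the page being evicted
        sta $3D            ; A1H
        sta $3F            ; A2H
        lda auxpage_hi     ; high byte of where its backing copy lives in aux memory
        sta $43            ; A4H
        sec                ; carry set: copy main -> aux (clear goes the other way)
        jsr $C311          ; AUXMOVE in the 80-column firmware
        rts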

I feel like you’re also working on a 6502 interpreter. Didn’t I see you pop into one of my other threads a month or so ago asking about shadowing globals past the 48k mark?

Z80 on SymbOS - almost ready to release. (But FWIW, my Apple IIe is a 128 KB unenhanced machine.)

1 Like

FWIW, I’ve found that shadowing and splitting the globals works pretty well. (Use one 240-byte blob to store all the low halves, and another one to store the high halves.)

(Ozmoo solves this differently, in a pretty cool way - they maintain one ZP for the first half of globals, and one ZP for the second half; this has the advantage of working on the “real” version of the globals, but it’s definitely slower than the solution I adopted; it’s several more instructions to compare, double A, etc)

That way, you can access the upper and lower parts with the same index register.

Even sneakier, you can save some logic when decoding variable operand types (0=TOS, 1-15=local, 16+=global) by keeping your current routine’s locals in that same array, and then the only special case is TOS.

(The tradeoff of course is that then you need to copy the locals to the real stack every time you make a subroutine call; in some really rough experiments it was like 4-5% faster to avoid the extra branch when decoding operands, but I had a feeling the extra stack maintenance might offset it)
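
Concretely, the variable fetch ends up shaped roughly like this (not my exact code: the names are made up, the tables are padded to 256 bytes so the variable number can index them directly, and pop_stack stands in for the real stack handling):

get_variable               ; variable number arrives in X, result in A (low) / Y (high)
        cpx #0
        beq +              ; variable 0 means the top of the Z-machine stack
        lda varlo,x        ; low half of local 1-15 or global 16-255
        ldy varhi,x        ; high half, fetched with the same index register
        rts
+       jmp pop_stack      ; stand-in for popping the evaluation stack

varlo   !fill 256          ; slot 0 unused; 1-15 shadow the current locals, 16-255 the globals
varhi   !fill 256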

1 Like

Another oddly-specific question - do people still use real disks, or do they tend to hook up fake drives that read nibblized disk images from USB sticks or SD cards?

(Basically, I’m asking whether I need to bother with ever turning the drive motor off if I end up paging from the disk. I know it’s not very much code to do it properly, but there’s no point if nobody actually uses magnetic media any longer)

Both are common: virtual drives for the convenience, physical drives for the experience.

In practice I’m guessing the decoding algorithm is more important for performance than motor spin-up. Gumboot (mentioned above) is the best solution I’ve found in the specific category of “small routines that read a specific track into memory as fast as possible,” but I haven’t looked into this recently.

Cool project!

Reading instructions at PC is a pretty big part of what takes up time when running a Z-code program. E.g. on C64, Ozmoo uses 15 cycles for a typical byte read. If each Z-code instruction, excluding printing instructions, takes up an average of 3.5 bytes (my guess), Ozmoo uses about 52 cycles just reading the bytes (a lot more when we pass a page boundary, possibly needing to switch to another vmem page). On average, executing a non-printing instruction takes about 550 cycles. If we were to put this code in a subroutine rather than a macro, we’d increase the cost of reading a byte from 15 to 27 cycles - quite a big difference. The slower the CPU, the more important this gets.

On Plus/4, the code to read a byte gets so big, due to almost all of RAM being obscured by ROM, that we use a subroutine anyway.

On Commander X16, we have a situation that reminds me of what you describe on Apple 2. It has 512 KB of extra RAM, and we can bank 8 KB of this into our 16-bit address space. This extra RAM can contain dynamic, static and high memory. Any time we want to read a byte, we call a subroutine, which (a) sets the last calculated bank, (b) reads the byte value, (c) increases the pointer, (d) checks if the pointer has entered a new page, if so it changes the bank if needed, and (e) returns the value read. With an 8 MHz clock, speed isn’t much of an issue here.
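
In rough outline it looks something like this (not our actual code, just the shape of it, assuming the RAM bank register at zero-page $00 and the 8 KB banked window at $A000-$BFFF; cur_bank and zpc_ptr are made-up names):

zpc_ptr  = $22             ; made-up zero-page pointer into the banked window
cur_bank = $24             ; made-up zero-page byte holding the current RAM bank

read_next_byte
        ldy cur_bank
        sty $00            ; (a) bank in the last calculated 8 KB bank
        ldy #0
        lda (zpc_ptr),y    ; (b) read the byte through the $A000-$BFFF window
        inc zpc_ptr        ; (c) advance the pointer
        bne done
        inc zpc_ptr+1      ; (d) entered a new page...
        ldy zpc_ptr+1
        cpy #$C0           ; ...and ran off the end of the window?
        bne done
        ldy #$A0
        sty zpc_ptr+1      ; wrap back to the start of the window
        inc cur_bank       ; and move on to the next bank
done    rts                ; (e) return with the byte in A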

2 Likes

I went off into the weeds for a bit trying to narrow down some major breakage I’d missed in Cloak of Darkness.

But some comparisons:

Apple 2e Enhanced: 43 jiffies for Zieve
This uses the 65C02 instructions jmp ($1234,X) and lda ($12) to speed up dispatch and instruction fetch in general. Instruction fetch uses a macro that only does a jsr if the page changed.

Apple 2e Enhanced: 53 jiffies for Zieve
This uses a subroutine for all fetches, where the subroutine does an absolute address load (code that messes with the instruction pointer actually modifies the second and third bytes of the instruction).

That’s a huge speed hit, unfortunately.

I also did an Apple 2e “standard” version which doesn’t use any 65C02 instructions. It was 55 jiffies.

The only other major speedup I haven’t tried is making a 256 element dispatch table. That is a significant amount of code and data for what I expect is a relatively minor speedup. The current code works in a pretty standard way — dispatches once on the upper 4 bits of the opcode to decode all of the operands and the general instruction class, then dispatches on the lower 4/5 bits to the actual handler after the operands have been decoded.

Maybe you’ve checked, but in case you haven’t: Ozmoo stores the lowbyte of all opcode routines in one table, and the highbyte in another. Or actually, it stores these values minus one. Then to jump to an opcode routine, it pushes both values onto the stack, and returns. I think that’s the fastest way, using vanilla 6502 assembly.

For the non-65C02 version, I would do instruction fetches with lda (zp),y and just keep Y at zero.

In some loops, like string printing, I would increment Y instead, and if it wrapped I’d increment zp+1.
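
For instance, a routine that skips past an inline Z-string comes out roughly like this (sketch only: zp is a made-up zero-page pointer with Y preloaded by the caller, and the end-of-string marker is the high bit of the first byte of each 2-byte word):

zp = $20                   ; made-up zero-page pointer
skip_zstring
        lda (zp),y         ; first byte of the next word
        pha                ; keep it so we can test the end bit after advancing
        iny
        bne +
        inc zp+1           ; Y wrapped: next page
+       iny                ; second byte of the word
        bne +
        inc zp+1
+       pla
        bpl skip_zstring   ; high bit clear: more words follow
        rts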

I’m a little shocked at how much overhead there is making it a jsr instead of a macro, but I guess it all adds up.

lda (zptr)
inc zptr ; zero page
bne +
jsr increment_zpc_mid
+

versus

jsr next_byte, where that function is:

next_byte
        !byte $ad          ; opcode for lda $1234 (absolute)
zptr    !byte 0, 0         ; address operand, patched whenever the Z-PC moves
        inc zptr           ; not zero page!
        bne +
        ... body of increment_zpc_mid here ...
+       rts

The fast path for the first one is 5+5+3 cycles, or 13 cycles.

The second one is 6 cycles for the jsr, then 4+6+3+6 cycles, or 25 cycles, nearly twice as slow per fetch! Ugh.

1 Like

Yes, I use the “separate low and high address tables” trick too. I did borrow that wholesale from Ozmoo 🙂

On 65C02, the jmp (table,x) is several cycles faster but that’s obviously not an option on C64.

If Ozmoo supports any 65C02 targets, that might be an easy tweak. I made a few ugly macros (that take either 16 or 32 parameters) to hide the difference.

!macro dispatch16 label {
	asl
	tax
	jmp (label,x)
}

versus

!macro dispatch16 label {
	tay
	lda label+16,Y
	pha
	lda label,Y
	pha
	rts
}

For the initial decode, where I need to shift the opcode right four, on 65C02 I just shift it right three and mask it to save a few instructions.
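
In other words, something like this (a sketch; decode_tbl is a made-up name for the 16-entry table of handler addresses):

        lsr                ; opcode in A: three shifts instead of four...
        lsr
        lsr                ; A = opcode >> 3
        and #$1E           ; ...then clear bit 0, leaving (opcode >> 4) * 2
        tax
        jmp (decode_tbl,x) ; 65C02 indexed indirect jump into the decode table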

-Dave

1 Like

Ozmoo supports Commander X16, but then as I noted, speed isn’t much of an issue on that machine.

We’re certainly not using the MEGA65 to its full potential either, but we are using certain features of the CPU in code that would have to be written specifically for MEGA65 anyway, e.g. we use lda [pointer],z (flat 32-bit addressing with Z register offset) to read story data in Attic RAM, we use the hardware single-cycle 32-bit multiplication, and we use instructions that treat all four registers together as a 32-bit register, e.g. ldq and adcq.

While we want Ozmoo to be fast, we’re also valuing our own sanity.

1 Like

With a real drive, I would be extremely weirded out by a program that never stopped the drive motor 🙂 I’m not sure I ever encountered one in my whole life.

Like @prevtenet said, both alternatives are common: solutions like USB sticks in cards, and real drives.

I don’t know how people continue to get by with the latter. As soon as a disk decays, it screws up the drive heads when you put it in. I’ve found they can be very hard to clean if you’re just using a cleaning disk and head cleaner. I got sick of the Russian roulette every single time I put an old disk in. Maybe I had particularly mouldy disks or something! Anyway, that’s why I don’t use any real drives on my IIGS today, just a card-USB-stick combo.

-Wade

Spinning all the time would be murder on the floppy itself. They wear out eventually.