Anybody here with Apple ][ experience?

Yeah I think I read something where Woz said in retrospect he wished he’d realised the head didn’t even need to actually touch the floppy. I mean, waaaay in retrospect :slight_smile:

-Wade

I’m overthinking the problem. When you turn the drive off it still runs for another second (a trick apparently used by some copy protection to make it harder to debug). And when you turn it back on after it’s been off long enough to stop, it just takes a while for valid data to start showing up under the drive head. If you don’t see any data for a while, it’s still safe to start trying to move the drive head (in case the disk track under the head is somehow blank).

Anyway, that’s a problem for future me. I’m still trying to figure out if I can use ZP even in auxiliary memory and just be careful about shadowing the important bits.

Hey Fredrik, unless I’ve screwed up my math, using something like

lda table,y
sta .jump+2
lda table+16,y
sta .jump+1
.jump jmp $1234 ; overwritten

is one cycle faster than pha/pha/rts (4+4+3=11 versus 3+3+6=12 cycles). Using zero page variables is 2 cycles faster, but using an indirect jmp is 2 cycles slower so that’s a wash.
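For anyone who wants to check the arithmetic, here is a small sketch that adds up the standard NMOS 6502 timings for the three dispatch styles. The two `lda table,y` loads cost the same in every variant, so they are left out, as in the figures above. The dictionary and the `cost` helper are just illustrative names.

```python
# Published NMOS 6502 cycle counts for the instructions being compared.
CYCLES = {
    "sta abs": 4,    # store into the operand bytes of the jmp
    "sta zp": 3,     # store into a zero page vector instead
    "jmp abs": 3,    # the self-modified jump
    "jmp (ind)": 5,  # indirect jump through a vector
    "pha": 3,
    "rts": 6,
}

def cost(*ops):
    """Total cycles for a sequence of instructions."""
    return sum(CYCLES[op] for op in ops)

patched_jmp = cost("sta abs", "sta abs", "jmp abs")     # self-modifying jmp
push_rts = cost("pha", "pha", "rts")                    # pha/pha/rts dispatch
zp_indirect = cost("sta zp", "sta zp", "jmp (ind)")     # zero page vector + jmp (ind)

print(patched_jmp, push_rts, zp_indirect)
```

The zero page variant saves one cycle per store but pays two extra on the jump, which is why it ends up level with the patched `jmp`.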

That wasn’t enough to make a measurable difference for me (1/60th of a second is pretty coarse) but if you can guarantee at least some of the time (like the initial dispatch based on the upper opcode bits) all your handlers are on the same page, you can eliminate a load and a store entirely.

I tried forcing the initial dispatch to all be on the same page and the benchmark dropped from 49 to 47 jiffies (versus 43 for 65C02 targets). Plus it saved 16 bytes of addresses I no longer need.

-Dave

I think that, instead of a dummy address, it is somewhat better to point to $FF69 (aka -151, the monitor entry), so that a (weak) debugging fallback is there by default.

Best regards from Italy,
dott. Piergiorgio.

I am genuinely curious. I can understand the retro-thing if you have the hardware but when using an emulator… why not play these games under some other emulated computer. Is it just the challenge of the constraints (is this possible) and like a puzzle? Or is there more to it that I am missing?

Pretty much just the challenge of the constraints, and the fact that some of my earliest computing memories are playing and clumsily attempting to write interactive fiction. So any interactive fiction I write today, I would like to have been theoretically able to have enjoyed in 1983.

For whatever reason, the Z machine is it for me. I doubt I personally could write an interesting text adventure game that needed more than it could provide.

(Of course, even though I should be actually writing interactive fiction, I keep hitting roadblocks for that so I stick to what I’m good at… programming)

-Dave

Did some more brainstorming walking home from my day job.

If I put the interpreter in the upper 8k of language card memory, I can then bank main memory (from $800-$BFFF, or 46k). That’s 92k total right there. There are also two 4k banks at $D000 below the interpreter, which gets us up to 100k.

That’s enough to run many of the Infocom v3 stories.

That seems like way less fuss than trying to bank the language card memory (which is only 16k anyway).

All of the v3 stories I looked at seemed to have high memory start at about the 16-20k mark, so 46k is plenty there.
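The memory budget above can be checked with a few lines of arithmetic. This is only a sketch of the figures quoted in this post (a $0800-$BFFF window in both main and auxiliary memory, plus two 4k banks at $D000); the variable names are made up for illustration.

```python
KB = 1024

# Bankable window in main memory: $0800 up to and including $BFFF.
window = 0xC000 - 0x0800
window_kb = window // KB            # size of one window in KB

# The same window exists in main and auxiliary memory.
banked_kb = 2 * window_kb

# Two 4k banks at $D000, below the interpreter in the language card.
lc_banks_kb = 2 * 4

total_kb = banked_kb + lc_banks_kb
print(window_kb, banked_kb, total_kb)
```

With high memory starting around the 16-20k mark, dynamic plus static memory fits comfortably inside one 46k window.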

Looking further down the line, to support v5 on a standard Apple 2e, I’ll definitely need to implement virtual memory, but honestly, it’s not that complicated. 100k should be enough to play any V5 story reasonably well. The limiting factor is probably needing more than 46k for dynamic+static at some point. But static memory at least can be done entirely with VM as well. For performance reasons, we probably want the dictionary to be in a contiguous block of memory that is always resident though.

-Dave

I got this working last night. The one hiccup is it’s easier to read an entire track at a time, so I “gave up” 2k and use the 44k from $1000-$BFFF. However, I can include it in VM trivially since that will be reading 512 bytes at a time for V3. I use the $C600 ROM to read the first track of the disk, then switch to my own routines so I don’t have to worry about the conversion table being in the “wrong” kind of memory. Amazingly, the disk read routines worked correctly first try. Well, not really that amazingly since I had the ROM disassembly open in another window as I was implementing it.

Bonus! Zork is only 82k so it still fits.

Okay that makes perfect sense. I’ve noticed how often creativity arises from constraints, and those sound like good constraints to pick :slight_smile:

Impressive. :laughing:

FYI, if you’re willing to work directly with raw track data, qkumba’s qboot algorithm (the descendant of gumboot, mentioned above) can read a track in 1 revolution (200 ms), about 2x faster than the ProDOS RWTS and 12x faster than the DOS3.3 RWTS.

Damn! I never could find gumboot, just the long article about cracking Gumball.

Right now I’m using “dos order” dsk files and I feel like if I switched over to .nib files I’d have more control over the interleave.

The normal read routine reads all the 2’s, then all the 6’s, at which point the read is done, and then it runs 256 iterations of a probably 8 instruction loop. I’m not sure what that is in terms of rotational speed, but figured I’d try either 1 or 2 sectors “between” increasing sectors to see if I could speed things up.

I assume qboot wins by just reading the entire track into memory and then doing the 6+2 decode in one pass after the fact? You’d need an extra 86 bytes of storage per 256 bytes read (342 disk bytes per sector).

--
 ldx #86
- dex ; 2 cycles
 bmi -- ; 2 cycles if not taken
 lda (data_ptr),y ; 5 cycles
 lsr twos_buffer,x ; 7 cycles
 rol ; 2 cycles
 lsr twos_buffer,x ; 7 cycles
 rol ; 2 cycles
 sta (data_ptr),y ; 6 cycles
 iny ; 2 cycles
 bne - ; 3 cycles (usually taken)
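To make the data flow of that loop easier to follow, here is a rough Python model of it. It assumes the layout the loop implies: 256 six-bit values plus an 86-entry twos buffer, with X counting down from 85 while Y counts up, and each `lsr`/`rol` pair moving one bit across. The function names are hypothetical, and the matching encoder exists only so the decoder can be round-tripped.

```python
def encode_6and2(data):
    """Split 256 bytes into 256 six-bit values and 86 packed two-bit groups."""
    assert len(data) == 256
    sixes = [b >> 2 for b in data]
    twos = [0] * 86
    for y, b in enumerate(data):
        x = 85 - (y % 86)        # X counts down while Y counts up
        shift = 2 * (y // 86)    # each pass over the buffer uses the next two bits
        # The decoder shifts bits out low-first, so the pair is stored reversed.
        reversed_pair = ((b & 1) << 1) | ((b >> 1) & 1)
        twos[x] |= reversed_pair << shift
    return sixes, twos

def decode_6and2(sixes, twos):
    """Model of the post-pass loop: two lsr/rol pairs per output byte."""
    twos = list(twos)            # the loop consumes its buffer, so copy it
    out = []
    x = 86
    for y in range(256):
        x -= 1
        if x < 0:                # bmi taken: reload X with 86, then dex
            x = 85
        a = sixes[y]
        for _ in range(2):       # lsr twos_buffer,x / rol
            carry = twos[x] & 1
            twos[x] >>= 1
            a = ((a << 1) | carry) & 0xFF
        out.append(a)
    return bytes(out)

data = bytes(range(256))
sixes, twos = encode_6and2(data)
restored = decode_6and2(sixes, twos)
print(restored == data, len(sixes) + len(twos))
```

The storage total (256 + 86 values) is where the 342 disk bytes per sector come from.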

An absolute load/store is one cycle faster than a ZP indirect, so I could use self-modifying code there. I don’t see an easy way to make the lsr/rol/lsr/rol faster without some huge lookup tables.

One LUT would, given N=0…63, return N shifted left twice.

- lda data_ptr,y
 tay
 lda shift_left_two,y
 ldy twos_buffer,x
 ora twos_buffer_lut_1,y
 sta data_ptr,y
 iny
 dex
 bne -

Repeat above loop two more times (with a different twos_buffer_lut each time). But that’s another 256 bytes of lookups to squeeze out a few more cycles (and I’m not sure what I’ve written there is even significantly faster)

Actually if the conversion table (that maps $96 to $00 etc) had the shift baked into it, we wouldn’t need a separate LUT for that part. Or there might be enough time in the read loop to shift it left twice without a separate LUT. I’d need to look up how many cycles I have – I think it’s 32 cycles per byte on the disk.

So the original loop looks to be about 40 cycles x 256 iterations, so you’d definitely need at least two sectors between subsequent sectors. IIRC ProDOS only needs one sector, so it probably has a tighter loop.

Yeah, I think there’s enough time to put two asl’s in the read_sixes loop so that it’s shifted in place already. The bits are reversed (we LSR twos_buffer and rol the bit in from the left) so we would need three 64 byte LUTs to manage the shifting and reversing of bits, but I’m pretty sure that loop can run in under 32 cycles per byte, meaning we only need one sector of interleave.

Strictly speaking, the three sub loops don’t have to be identical, they just need to average out to less than 32 cycles per byte. One could be a little faster, one could be a little slower.
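Here is a hedged sketch of the three-LUT idea in Python, just to show that a table lookup plus an OR can replace the lsr/rol pairs. It assumes the sixes are already shifted left twice and that the twos buffer uses the same layout as the lsr/rol loop quoted earlier (index counting down, bit pairs stored reversed). All names are made up for illustration and this is not qboot's actual table layout.

```python
def build_luts():
    """Three 64-entry tables; luts[k][t] is the low two bits that bit pair k
    of twos value t contributes, with the reversal already undone."""
    luts = []
    for k in range(3):
        table = []
        for t in range(64):
            pair = (t >> (2 * k)) & 3
            table.append(((pair & 1) << 1) | (pair >> 1))
        luts.append(table)
    return luts

def split_6and2(data):
    """Produce pre-shifted sixes and the 86-entry twos buffer for 256 bytes."""
    shifted_sixes = [b & 0xFC for b in data]
    twos = [0] * 86
    for y, b in enumerate(data):
        k, i = divmod(y, 86)
        reversed_pair = ((b & 1) << 1) | ((b >> 1) & 1)
        twos[85 - i] |= reversed_pair << (2 * k)
    return shifted_sixes, twos

def merge_with_luts(shifted_sixes, twos, luts):
    """One lookup and one OR per byte; sub-loop k uses table k."""
    out = []
    for y, six in enumerate(shifted_sixes):
        k, i = divmod(y, 86)
        out.append(six | luts[k][twos[85 - i]])
    return bytes(out)

data = bytes(range(256))
shifted_sixes, twos = split_6and2(data)
merged = merge_with_luts(shifted_sixes, twos, build_luts())
print(merged == data)
```

Each of the three sub-loops only ever touches one table, which is why they do not have to be identical as long as the average stays under the per-byte budget.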

Edit: just took a closer look at qboot and one of the first things they do is shift the 6’s left twice before storing them too :slight_smile:

Took a stab at it, couldn’t get it to work offhand, realized qboot was rewriting more inline code than I’d considered - like patching the disk drive read address so we don’t need to keep the slot in X, which is a neat trick.

Looking closer, it’s reading the twos, and then the sixes, but as it reads the sixes it’s merging in the last two bits each time using some tables. I’d thought of using the tables, but I was still in the mindset of doing the final merge as a separate pass, which is always going to take at least one sector’s time.

Nice code, thank you for sharing it with me!

Oh man, it’s even more brilliant than I thought. If I’m understanding it correctly, it’s pre-shifting the nibblized data twice, which saves shifting it later, but also, the two’s are shifted as well, which means it can use an interleaved lookup table to pick the right bits out.

As I was (trying to) go to sleep last night I realized why the scatter read is important - it can read an entire track regardless of the interleave - which is also important when trying to read multiple tracks at once, since you could potentially waste nearly an entire revolution if you just missed your desired sector after seeking there.
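A toy model of why the scatter read is insensitive to interleave, counting time in whole sector-times and ignoring gaps, seeks, and address fields. The function names and the `busy` parameter (sector-times the CPU spends decoding after each read) are assumptions for illustration, not measurements of any real RWTS.

```python
def scatter_read_time(track, start):
    """Accept every sector as it passes under the head and file it by number."""
    n = len(track)
    filled = set()
    t = 0
    while len(filled) < n:
        filled.add(track[(start + t) % n])
        t += 1
    return t

def in_order_read_time(track, start, busy):
    """Read logical sectors 0..n-1 in order; after each read the CPU is busy
    for `busy` sector-times and misses whatever passes under the head."""
    n = len(track)
    pos = start
    t = 0
    for wanted in range(n):
        while track[pos % n] != wanted:   # wait for the wanted sector
            pos += 1
            t += 1
        pos += 1                          # read it
        t += 1
        pos += busy                       # decode pass
        t += busy
    return t

track = list(range(16))                   # physical order == logical order
scatter_times = [scatter_read_time(track, s) for s in range(16)]
in_order_busy = in_order_read_time(track, 0, busy=1)
in_order_free = in_order_read_time(track, 0, busy=0)
print(scatter_times[0], in_order_busy, in_order_free)
```

In this model the scatter read always finishes in one revolution wherever the head lands, while the in-order reader with a one-sector decode pass and no interleave just misses every following sector and pays nearly a full revolution each time.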

-Dave

I got my interleave routine down to 29 cycles and things still didn’t load any faster, regardless of whether I used “DOS 3.3” order or “ProDOS” order, which surprised me. I’m not sure if the emulator (Virtual ][) is normalizing sector order behind my back or something.

However, I kept at it, and realized that qboot only needed a single counter because it stored the two’s table in reverse order, which let me get it down to 22 cycles:

	ldy #$55
-	ldx twos_buffer,y	; 4
	lda interleave,x	; 4
patch1
	ora $ff00,y			; 4
	sta $ff00,y			; 5
	dey					; 2
	bpl -				; 3 - 22 cycles

The loop is repeated two more times, because the $ff00 lower byte needs a different offset, and the interleave table needs +1 or +2 applied to it.

The next step is to integrate the actual sixes read loop like qboot does, but at least it’s at a good check point.

-Dave

…and two head-spinning hours later, I got the single pass version working. My solution ended up a bit different from what qboot did, although there are certainly many things in common.

	; the following code is extremely timing sensitive
	; we can't afford any branches crossing page boundaries here.
	; normally we'd start at $55 and count downward, but we need the bytes
	; in the correct order and we also can't afford the compare instruction.
	; but we also have to watch out for page crossings.
	ldy #$2A
-	ldx RDBYTE6			; 4 - read next 6's byte
	bpl -				; 2 - when not taken
	eor conv_tab-$96,X	; 4 - convert to 6 bit (pre-shifted) value
	ldx twos_buffer-$2A,Y; 4 - get matching 2's entry (no crossing)
	ora interleave,X	; 4 - merge them
patch1
	sta $ff00-$2A,Y		; 5+1 - store the result (page crossing)
	and #$fc			; 2 - clear the bits so next ora works.
	iny					; 2 - stops at 128 (need this to have fewer page crossings)
	bpl -				; 3 - 31 cycles per byte (30 cycles on last iteration)

	; note that after this point we are in perfect lockstep; there are two
	; cycles available for the reload of Y, then the next byte is latched (32 cycles)

	ldy #$2A
-	ldx RDBYTE6
	bpl -
	eor conv_tab-$96,X
	ldx twos_buffer-$2A,Y
	ora interleave+1,X
patch2
	sta $ff56-$2A,Y			; no page crossing this time!
	and #$fc
	iny
	bpl -					; 30 cycles per byte, 29 on last iteration

Basically, I had exactly enough cycles to get this to work without any crazy extra hacks. The first loop runs in 31 cycles per byte because there’s an unavoidable page crossing at patch1. The twos_buffer was originally at $100 (I don’t use much stack) but I had to move it up $2C bytes to avoid a page crossing there as well.

On the very last iteration, the branch isn’t taken, so it’s only 30 cycles. That leaves 2 cycles to reset Y prior to the next block of code, so the disk data latch read will be ready exactly in time.

I was worried I was going to have to go into Atari 2600 mode and cycle count the entire loop but that ended up not being necessary.

The third iteration is pretty much the same as the second, except Y starts two higher at $2C because we need two fewer iterations on the last part.
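As a sanity check on the loop bounds, this sketch counts the iterations of the three passes, assuming each one counts Y upward from its start value until the `bpl` falls through at $80. The helper name is made up.

```python
def iterations(start, stop=0x80):
    """Number of times a loop runs when Y counts up from start and
    the bpl stops being taken once Y reaches $80."""
    return stop - start

passes = [iterations(0x2A), iterations(0x2A), iterations(0x2C)]
total = sum(passes)
print(passes, total)
```

Starting Y at $2A instead of counting down from $55 gives the same 86 iterations, and the sign flip at $80 terminates the loop without a compare instruction.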

Got the scatter code working. Advent used to take (this morning, before I started replacing the code) about 53 seconds to load. Now it takes 5 seconds. Now I just need to get advent working :slight_smile:

If anybody has real Apple 2e hardware I’d love to see if this works.

Requirements: Apple 2e with 80 column card. (Shouldn’t need 128k memory if you have the 1k version).

I can make an Apple 2+ version or an Apple 2e Enhanced version (although the latter is like 5% faster, you’ll probably never see the difference outside of a benchmark)

Yes, it turns the drive motor off after booting :slight_smile:

-Dave

(If you’re interested in the code, click on my profile picture for the GitHub link to the repo; only external requirement at present is it needs the latest acme from SourceForge, the version on GitHub is too old; it needs the same version Ozmoo uses)

applez-cloak.zip (8.0 KB)

Seems to work:

The game’s parser seems to have a number of problems (DROP CLOAK in the cloakroom erroneously gives “You’re not carrying that”, for example), but the behavior is the same in an emulator, so I’m not sure if this is the fault of the interpreter or of the game file.

Runs very fast! Compliments to the compiler for that, too.

It’s a bug in the 6502 interpreter. Dropping the cloak works fine on my C++ interpreter.

I’ve fixed several major bugs in the interpreter today and Zork and Adventure actually make it to a command prompt now, although there are myriad output bugs.

I was going to start running it through Czech and Etude but they’re Z5-based and I only have Z3 working so far. Still a lot more work to do.

Thank you for taking the time to test this on real hardware!

Fixed the bug in @jin, it was pretty dumb (note none of this code supports V4+ objects yet)

	; is operand+0's parent operand+1?
z_jin
	jsr get_object_addr
	ldy #4 ; parent
	lda (obj_ptr),y ; This was missing. Helps if you load the parent before comparing it
	cmp operands_lo+1
	beq +
	jmp branch_failed
+	jmp branch_passed
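For reference, a minimal Python model of what `jin` has to do for V1-3 objects, following the object table layout in the Z-Machine Standard (31 property default words, then 9-byte entries with the parent at offset 4). The function names and the test memory image are hypothetical.

```python
PARENT = 4       # bytes 0-3 are attribute flags; parent, sibling, child follow
ENTRY_SIZE = 9   # V1-3 object entry size
DEFAULTS = 31 * 2  # property defaults table precedes the entries

def object_addr(object_table, obj):
    """Address of a V1-3 object entry; objects are numbered from 1."""
    return object_table + DEFAULTS + (obj - 1) * ENTRY_SIZE

def jin(memory, object_table, obj, candidate_parent):
    """True if obj's parent is candidate_parent (the branch condition for jin)."""
    parent = memory[object_addr(object_table, obj) + PARENT]  # the load that was missing
    return parent == candidate_parent

# Tiny fake story image: object table at $0100, object 2 sits inside object 1.
memory = bytearray(1024)
memory[object_addr(0x100, 2) + PARENT] = 1

print(jin(memory, 0x100, 2, 1), jin(memory, 0x100, 2, 3))
```

V4+ stories use 14-byte entries with word-sized object numbers, so both the entry size and the comparison would need to change there.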

-Dave

(Still fixing tons of bugs. Smartest thing I did today was track down a copy of Czech I could build from source and target v3. First thing it spotted is branches to 0/1 don’t work correctly, which was likely breaking zillions of things. Should have done this sooner, sigh)

(Offsets worked fine; I only had a 32 character buffer for buffered output and it was right before the Z-machine stack, so the long lines Czech would print corrupted the Z stack. Derp.)

How big is your screen? There’s no real reason to have an output buffer bigger than one screen line + 1 character.