I recently released Zym II, a Z-machine interpreter for SymbOS. Zym II has some unique acceleration features which speed up Inform 6 games by 30-60%, making many otherwise-annoyingly-slow games playable on 8-bit hardware. This post briefly documents what it does, for the interest of any other 8-bit terp developers.
(Wall of text incoming.)
Philosophy
I’m mostly targeting Inform 6 games here, as (1) there are a lot of them, and (2) they often sit right on the edge of playability on 8-bit hardware. ZIL/PunyInform games don’t need help, while Inform 7 games are generally out of reach for other reasons (stack usage). Dialog games would ideally benefit from similar work, but there are fewer of them and they would require a slightly different toolchain to analyze.
I’m also focusing on optimizations that can be implemented compactly, as code space is limited in 8-bit terps. In theory the simplest approach is to inline every Inform 6 veneer routine (as Glulx optionally does), but some optimizations have a better gain-to-size ratio than others.
Telemetry was conducted by cruftily patching bocfel.
Nothing here is close to the final word, but hopefully it’s useful to someone.
Trivial optimizations
Most-used opcodes
There are a few sources documenting the most-used opcodes, but fewer breaking them down by exact form. The first column shows usage as a percentage of all opcodes actually executed in a typical run (not merely present in the file).
Tables
Curses r16 (earlier Inform 6)
| Percent | Byte | Opcode |
|---|---|---|
| 20.5% | 41 | je V,S |
| 8.7% | 8C | jump L |
| 8.2% | 95 | inc S |
| 6.9% | 63 | jg V,V |
| 5.9% | 52 | get_prop_addr V,S |
| 5.1% | C1 | je (var) |
| 4.6% | 42 | jl V,S |
| 4.0% | 2D | store S,V |
| 2.4% | 0D | store S,S |
| 2.1% | D0 | loadb (var) |
Savoir-Faire (later Inform 6)
| Percent | Byte | Opcode |
|---|---|---|
| 9.3% | A0 | jz V |
| 9.1% | 8C | jump L |
| 7.0% | 95 | inc S |
| 6.6% | 2D | store S,V |
| 6.2% | 41 | je V,S |
| 4.8% | 62 | jl V,V |
| 4.5% | 61 | je V,V |
| 4.4% | 57 | div V,S |
| 4.1% | 42 | jl V,S |
| 4.1% | 52 | get_prop_addr V,S |
Bronze (earlier Inform 7)
| Percent | Byte | Opcode |
|---|---|---|
| 10.3% | A0 | jz V |
| 8.3% | 8C | jump L |
| 7.0% | 42 | jl V,S |
| 6.0% | CF | loadw (var) |
| 5.3% | 61 | je V,V |
| 4.7% | 62 | jl V,V |
| 4.4% | 41 | je V,S |
| 4.2% | 95 | inc S |
| 3.9% | E0 | call_vs (var) |
| 3.3% | 54 | add V,S |
Key takeaways:
- Most of the most-used opcodes are branch instructions; optimize the branch logic aggressively.
- Most opcodes are disproportionately used in certain forms (e.g., `jz V`), so unrolling the decoding logic does make a difference.
- This is particularly true for `je` (`EQUAL?`), which includes a heavy VAR form with an arbitrary number of operands. In practice, however, it is mostly the `je V,S` form that is used, so it is very much worth unrolling the `je` implementation to minimize the code path for simple (non-VAR) cases.
- `div V,S` is weirdly common; see below.
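To illustrate the unrolling idea, here is a sketch in Python (the real terp is Z80 assembly; `step`, `variables`, and the tuple return value are all inventions of the sketch):

```python
def step(mem, variables, pc):
    """Fetch-decode with an unrolled fast path for opcode byte 0x41.

    0x41 is the long-form je with a variable first operand and a
    small-constant second operand -- the most common je encoding.
    Returns (condition, branch_operand_pc) for the fast path only;
    every other byte would fall through to a generic (slower) decoder,
    elided here."""
    op = mem[pc]
    if op == 0x41:                   # je V,S: no operand-type parsing needed
        a = variables[mem[pc + 1]]   # first operand: a variable number
        b = mem[pc + 2]              # second operand: a small constant
        return (a == b, pc + 3)      # pc+3 points at the branch operand
    raise NotImplementedError("generic decoder elided from this sketch")
```

The point is that for a known byte like 0x41 the operand types are fixed, so the per-opcode path skips all the general operand-type machinery.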
Divide-by-2
The Inform 6 library has a bad habit of using integer division by 2 (and to a lesser extent, division by 256 or multiplication by 2) to perform trivial bit-shift operations. The worst offender is the ofclass veneer routine (see below), which in many versions spends most of its time in the following loop:
n = obj.#2;
for (j=0: j<n/2: j++)
{ if (a-->j == cla) rtrue;
}
This iterates over the object’s class property, performing a useless division by 2 on every iteration to test the for condition. This results in div V,S being up to 5% of all opcodes executed in a typical run! Since integer division is one of the slowest opcodes on 8-bit CPUs, the time impact is even worse. Interpreters can therefore get a measurable speed boost just by handling the case div positive_number,2 as a bit-shift operation instead of a true divide.
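A sketch of the fast path (Python standing in for what the real interpreter does in assembly; helper names are invented). The positivity check matters because Z-machine div truncates toward zero on signed 16-bit values, and a plain right-shift only matches that for non-negative dividends:

```python
def z_div(a, b):
    """Z-machine div with a fast path for division by 2 (a sketch)."""
    def to_signed(v):
        return v - 0x10000 if v & 0x8000 else v
    sa, sb = to_signed(a), to_signed(b)
    if sb == 2 and sa >= 0:
        return sa >> 1              # cheap shift instead of a true divide
    if sb == 0:
        raise ZeroDivisionError("div by 0 is a Z-machine runtime error")
    q = abs(sa) // abs(sb)          # truncate toward zero, per the spec
    if (sa < 0) != (sb < 0):
        q = -q
    return q & 0xFFFF
```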
Patch optimizations
The Glulx standard accelerates Inform code by providing alternate native versions of the Inform veneer routines. The same approach is fruitful here—by redirecting common veneer calls to native opcodes, we can save a tremendous amount of execution time. But how to do this with pre-existing story files?
The obvious approach is to pattern-match the story file on load to determine the locations of veneer functions, but this proved impractical on 8-bit hardware. It’s possible, but doing it properly increases load times by minutes. I finally gave up on this and wrote an offline script that patches story files in advance. I do not really like this solution (the whole point of a Z-code file is that it’s cross-platform, so creating a patched version that only runs on one interpreter feels wrong), but it’s already common to distribute 8-bit releases as platform-specific disk images, so I suppose this is not really different.
My patch script uses txd to disassemble Z-code files and pattern-matches against the disassembly output to locate known heavyweight routines. It then replaces calls to those routines in situ by converting the call opcodes to custom EXT opcodes, e.g.:
E0 2B nn nn 02 03 00 = call_vs nnnnn (L01,L02) -> -(SP)
(VAR form: long, var, var) might become:
BE 06 6B nn 02 03 00 = custom_call nn (L01,L02) -> -(SP)
(EXT form: short, var, var). The change from “long” to “short” in the first operand reflects the fact that the original routine address has been trashed by adding an extra BE byte to convert this to EXT—but we no longer need this address, so this new “short” operand can just be discarded or used for something by the new EXT opcode.
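A sketch of that in-place rewrite (Python; only the opcode bytes come from the example above, the function itself and the EXT number 0x06 are illustrative):

```python
def patch_call_vs(code, addr, ext_num=0x06):
    """Rewrite call_vs (VAR form: long,var,var) in place as a custom
    EXT opcode, per the byte patterns above:

        E0 2B hi lo 02 03 00  ->  BE ext 6B lo 02 03 00

    The BE prefix costs one byte, paid for by shrinking the now-unneeded
    two-byte routine address (type "long") down to a discardable one-byte
    small constant (type "short"), so the patch is length-preserving."""
    assert code[addr] == 0xE0 and code[addr + 1] == 0x2B
    code[addr:addr + 3] = bytes([0xBE, ext_num, 0x6B])
```

Because the patch never changes instruction lengths, no branch targets elsewhere in the story file need fixing up.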
Optimizable inline code blocks are likewise overwritten in-place with new EXT opcodes, with any extra space left over padded with NOPs.
For some routines it is necessary to extract and save certain dynamically generated information (like the total object count); where this could not be added as an operand, I saved it to unused (or Z6-specific) bytes in the story file’s header.
The Inform 6 veneer code can be found in veneer.c in the Inform 6 compiler source, although it may also be helpful to look at old versions on IFArchive, as some details have changed since the 1990s.
OC__Cl
By far the heaviest veneer routine is OC__Cl (ofclass), which accounts for 30-50% (!!!) of all opcodes executed in many Inform 6 games. The majority of this is usually spent iterating over an object’s class property (property 2), as described above. The pattern for the start of OC__Cl is as follows (in txd disassembly with the -d -n options):
Code
?????: 42 01 01 cc JL L00,#01 [TRUE] ?????
?????: d5 1f ?? ?? ff 00 SUB #????,#ff -> -(SP)
?????: 63 01 00 57 JG L00,(SP)+ [FALSE] ?????
?????: c1 97 02 03 04 40 JE L01,#03,#04 [FALSE] RFALSE
?????: 55 02 01 00 SUB L01,#01 -> -(SP)
?????: d9 2f ?? ?? 01 00 CALL_2S ????? (L00) -> -(SP)
?????: 61 00 00 c1 JE (SP)+,(SP)+ [TRUE] RTRUE
?????: b1 RFALSE
# ...routine continues, with slight library variations...
The full logic for OC__Cl is complex, but it is necessary to implement it all exactly, as different games require different aspects of it.
RA__Pr
RA__Pr (object.&property) accounts for 5-20% of all opcodes executed in a typical Inform 6 game. It comes in two main forms, depending on library version:
Code
“Short” form:
?????: 42 02 40 4b JL L01,#40 [FALSE] ?????
?????: 43 02 00 47 JG L01,#00 [FALSE] ?????
?????: 72 01 02 00 GET_PROP_ADDR L00,L01 -> -(SP)
?????: b8 RET_POPPED
?????: c9 8f 02 80 00 00 AND L01,#8000 -> -(SP)
?????: a0 00 80 47 JZ (SP)+ [TRUE] ?????
?????: 49 02 ff 00 AND L01,#ff -> -(SP)
?????: cf 2f ?? ?? 00 05 LOADW #????,(SP)+ -> L04
?????: 52 05 03 00 GET_PROP_ADDR L04,#03 -> -(SP)
# ...routine continues...
“Long” form:
?????: a0 01 c0 JZ L00 [TRUE] RFALSE
?????: 42 02 40 4e JL L01,#40 [FALSE] ?????
?????: 43 02 00 4a JG L01,#00 [FALSE] ?????
?????: 72 01 02 ff GET_PROP_ADDR L00,L01 -> Gef
?????: e8 bf ff PUSH Gef
?????: b8 RET_POPPED
?????: c9 8f 02 80 00 00 AND L01,#8000 -> -(SP)
?????: a0 00 80 4d JZ (SP)+ [TRUE] ?????
?????: 49 02 ff 00 AND L01,#ff -> -(SP)
?????: cf 2f ?? ?? 00 05 LOADW #????,(SP)+ -> L04
?????: 52 05 03 ff GET_PROP_ADDR L04,#03 -> Gef
# ...routine continues...
The Inform 6 definition of this routine is complicated, but in practice I have never found a game that needed anything besides the following parts of the code:
Code
if (obj==0) rfalse;
if (identifier<64 && identifier>0) return obj.&identifier;
if (obj.&3 == 0) rfalse;
if (self == obj)
otherid = identifier | $8000;
i = obj.3;
while (i-->0 ~= 0)
{ if (i-->0 == identifier or otherid)
return i+3;
i = i + i->2 + 3;
}
rfalse;
which is comparatively easy to implement. Everything after “i = obj.3” is a trivial traversal of the object’s individual properties table. (If any experts know of cases where other features of this routine are needed, let me know.)
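To make the traversal concrete, here is a Python sketch of the loop after "i = obj.3": the individual property table is a list of (2-byte identifier, 1-byte length, data...) records terminated by a zero word, exactly as the i-->0 / i->2 accesses imply (the `word` helper and flat memory image are assumptions of the sketch):

```python
def ra_pr_search(mem, table, identifier, otherid):
    """Walk an Inform 6 individual-property table as in the veneer
    excerpt above; returns the property data address, or 0 if absent."""
    def word(a):                        # big-endian, like i-->0
        return (mem[a] << 8) | mem[a + 1]
    i = table
    while word(i) != 0:                 # zero word terminates the table
        if word(i) in (identifier, otherid):
            return i + 3                # skip id word + length byte
        i += mem[i + 2] + 3             # skip id, length byte, and data
    return 0
```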
Z__Region
Z__Region accounts for 1-2% of opcodes executed in most games, but mainly because it is called by OC__Cl. However, it is required to implement OC__Cl correctly, so we get it for free:
Code
Common “long” form (whole routine):
?????: c1 93 01 00 ff ff c0 JE L00,#00,#ffff [TRUE] RFALSE
?????: 2d 02 01 STORE L01,L00
?????: 0f 1a 00 00 LOADW #1a,#00 -> -(SP)
?????: e0 2b ?? ?? 02 00 00 CALL_VS ????? (L01,(SP)+) -> -(SP)
?????: 42 00 00 40 JL (SP)+,#00 [FALSE] RFALSE
?????: 42 01 01 cc JL L00,#01 [TRUE] ?????
?????: d5 1f ?? ?? ff 00 SUB #????,#ff -> -(SP)
?????: 63 01 00 41 JG L00,(SP)+ [FALSE] RTRUE
?????: e0 23 ?? ?? 01 ?? ?? 00 CALL_VS ????? (L00,S001) -> -(SP)
?????: 42 00 00 c4 JL (SP)+,#00 [TRUE] ?????
?????: 9b 03 RET #03
?????: e0 23 ?? ?? 01 ?? ?? 00 CALL_VS ????? (L00,#????) -> -(SP)
?????: 42 00 00 c4 JL (SP)+,#00 [TRUE] ?????
?????: 9b 02 RET #02
?????: b1 RFALSE
Common “short” form (whole routine):
?????: a0 01 c0 JZ L00 [TRUE] RFALSE
?????: 42 01 01 cc JL L00,#01 [TRUE] ?????
?????: d5 1f ?? ?? ff 00 SUB #????,#ff -> -(SP)
?????: 63 01 00 41 JG L00,(SP)+ [FALSE] RTRUE
?????: e0 23 ?? ?? 01 ?? ?? 00 CALL_VS ????? (L00,S001) -> -(SP)
?????: 42 00 00 c4 JL (SP)+,#00 [TRUE] ?????
?????: 9b 03 RET #03
?????: e0 23 ?? ?? 01 ?? ?? 00 CALL_VS ????? (L00,#????) -> -(SP)
?????: 42 00 00 c4 JL (SP)+,#00 [TRUE] ?????
?????: 9b 02 RET #02
?????: b1 RFALSE
Unsigned__Compare
Unsigned__Compare is used to work around the fact that Z-machine arithmetic is signed, so values >32767 are treated as negative numbers, which creates problems when comparing absolute memory addresses. Unsigned__Compare accounts for 2-3% of opcodes executed in most games, but mainly because it is called several times by Z__Region, so if we accelerate OC__Cl and Z__Region there is less point in accelerating Unsigned__Compare. But it’s so trivial that we might as well.
Code
Whole routine:
?????: 61 01 02 ?? JE L00,L01 [FALSE] ?????
?????: b1 RFALSE
?????: 42 01 00 ?? JL L00,#00 [FALSE] ?????
?????: 42 02 00 ?? JL L01,#00 [TRUE] ?????
?????: b0 RTRUE
?????: 42 01 00 ?? JL L00,#00 [TRUE] ?????
?????: 42 02 00 ?? JL L01,#00 [FALSE] ?????
?????: 8b ff ff RET #ffff
?????: c9 8f 01 7f ff 03 AND L00,#7fff -> L02
?????: c9 8f 02 7f ff 04 AND L01,#7fff -> L03
?????: 63 03 04 ?? JG L02,L03 [FALSE] ?????
?????: b0 RTRUE
?????: 8b ff ff RET #ffff
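Natively, all of the above collapses to a single unsigned 16-bit comparison. A sketch (Python; return values chosen to match the veneer's 1 / 0 / $ffff convention):

```python
def unsigned_compare(a, b):
    """Compare two Z-machine values as unsigned 16-bit quantities,
    returning 1 (a > b), 0 (equal), or 0xFFFF (a < b), matching the
    veneer routine's sign-juggling logic above."""
    a &= 0xFFFF
    b &= 0xFFFF
    if a == b:
        return 0
    return 1 if a > b else 0xFFFF
```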
objectloop
The Inform 6 objectloop syntax, which iterates over all objects in the game, can add up to a large amount of processing time. Most of this time is spent iterating over objects that do not meet the loop condition.
The form objectloop (x in y) is easiest to accelerate, and looks schematically like this:
Code
STORE local,#01
start:
JIN local,y [FALSE] next
... inner loop code ...
next:
INC local
JG local,maxobjects [TRUE] end
JUMP start
end:
Zym’s approach is to replace the first jin opcode with a custom opcode that performs the whole check-increment-repeat cycle in one operation, starting from the current value of local and advancing through the object table until it hits something with the parent y. Once found, it writes the object ID back to local and exits without branching, falling into the loop. On reaching the end of the object table, it branches, letting the code at next handle the exit condition. This reduces the whole check-increment-repeat cycle (which must otherwise repeatedly branch and recalculate the object address) into a simple forward search that checks the parent entry of each object (every 7th word, or 9th byte in Z3) against y. This also automatically accelerates compound conditions like objectloop (x in y && some_other_condition); only the first condition will actually be accelerated, but it is typically the heaviest.
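A sketch of what the replacement opcode does (Python; the `parents` mapping stands in for reading the parent field out of each object-table entry, and the return convention mirrors the fall-through/branch behavior described above):

```python
def find_next_child(parents, start, y, max_objects):
    """Forward search standing in for the custom opcode: from object
    number `start`, find the first object whose parent is `y`.

    Returns (obj, branched). On success, obj would be written back to
    the loop variable and execution falls through into the loop body;
    past the end of the object table we branch, so the original code
    at `next:` handles the exit condition."""
    for obj in range(start, max_objects + 1):
        if parents[obj] == y:
            return obj, False        # fall into the loop body
    return max_objects + 1, True     # branch: let the loop terminate
```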
Impact depends on the game; at minimum the library routine NoteObjectAcquisitions benefits from this (accounting for 1-3% of opcodes in many games, up to 15% in, e.g., Anchorhead).
The form objectloop (x ofclass y) is also common, but difficult to accelerate because any practical implementation would still involve calling OC__Cl on every object. The form objectloop (x has y) is easy to accelerate in theory but is relatively rare in practice.
Manual copy_table
Inform 6 generally does not use the copy_table (COPYT) opcode, instead performing table copies manually. While not typically a major bottleneck, the inline copy pattern can be trivially converted to a copy_table opcode and a jump (to skip the remaining bytes in the original code).
Code
Form when copying a static address:
?????: 0d ?? ?? STORE ???,#??
?????: 42 ?? ?? 53 JL ???,#?? [FALSE] ?????
?????: d0 2f ?? ?? ?? 00 LOADB #????,??? -> -(SP)
?????: e2 2b ?? ?? ?? 00 STOREB #????,???,(SP)+
?????: 95 ?? INC ???
?????: 8c ff ed JUMP ?????
Form when the table address is in a variable:
?????: 0d ?? 00 STORE ???,#00
?????: 42 ?? ?? 51 JL ???,#?? [FALSE] ?????
?????: 70 ?? ?? 00 LOADB ???,??? -> -(SP)
?????: e2 2b ?? ?? ?? 00 STOREB #????,???,(SP)+
?????: 95 ?? INC ???
?????: 8c ff ef JUMP ?????
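The substituted copy_table must follow the Z-Machine Standard's semantics, which are slightly richer than the inline loops it replaces: a zero second operand zeroes the first table, and a negative size forces a forward copy even for overlapping regions. A Python sketch of those semantics:

```python
def copy_table(mem, first, second, size):
    """copy_table per the Z-Machine Standard (a sketch over a bytearray).

    If second is 0, zero abs(size) bytes at first. If size is negative,
    copy forwards regardless of overlap. Otherwise copy so that an
    overlapping destination is not corrupted (i.e., like memmove)."""
    def to_signed(v):
        return v - 0x10000 if v & 0x8000 else v
    n = to_signed(size)
    if second == 0:
        for i in range(abs(n)):
            mem[first + i] = 0
        return
    if n < 0:                            # forced forward copy
        for i in range(-n):
            mem[second + i] = mem[first + i]
    else:                                # overlap-safe copy
        mem[second:second + n] = bytes(mem[first:first + n])
```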
The Mulldoon Legacy class search
This one is very CISC, but scratches a personal itch.
I really like The Mulldoon Legacy (as do a lot of other people, judging by its regular placement in “top n of all time” lists), but it runs too slowly to be playable on 8-bit hardware. It turns out this is because it spends 50% of its time evaluating OC__Cl and another 24% evaluating the following function:
Code
Routine 22030, 2 locals
22031: 0d 02 01 STORE L01,#01
22034: e0 27 5d ae 02 01 00 CALL_VS 2ed70 (L01,#01) -> -(SP)
2203b: a0 00 d4 JZ (SP)+ [TRUE] 22050
2203e: c1 97 02 02 01 ce JE L01,#02,#01 [TRUE] 22050
22044: e0 2b 5d ae 01 02 00 CALL_VS 2ed70 (L00,L01) -> -(SP)
2204b: a0 00 c4 JZ (SP)+ [TRUE] 22050
2204e: ab 02 RET L01
22050: 95 02 INC L01
22052: c3 8f 02 03 3c c5 JG L01,#033c [TRUE] 2205b
22058: 8c ff db JUMP 22034
2205b: b0 RTRUE
This is effectively a way of retrieving the first class of an object by the completely bonkers method of iterating over all objects in the game, checking if that object is a class, and then (if so) checking if the passed object belongs to that class:
Code
get_first_class(x) {
for (i = 1; i <= #total_obj-255; ++i) {
if (!(i ofclass 1)) continue;
if (i == 1 || i == 2) continue;
if (x ofclass i)
return i;
}
return true;
}
I am not sure exactly what the intent was in doing it this way, especially since no object in The Mulldoon Legacy has more than two slots available in its class list. (Possibly a rudimentary form of class precedence?) But in any case, accelerating this function can be done with only a few dozen bytes and single-handedly makes The Mulldoon Legacy playable on Zym. I have not found any other game that uses this exact method.
Other potential targets
NounWord and NextWord are logically trivial but surprisingly heavy, accounting for a few percent in some games.
MoveFloatingObjects (from the library file verblib.h) accounts for 1-2% of opcodes executed in most games, but is very heavy in a few (e.g., 20% in Curses). The logic is too complex to fully inline in a space-constrained interpreter, but accelerating the main objectloop might be helpful.
Cache optimization
On SymbOS, copying from banked memory is generally cheaper than repeated byte-by-byte access, so Zym II stores the first 48 KB of the story file in directly accessible RAM and uses a write-through LRU cache to access parts of the story file above this waterline. I ran a lot of simulations in bocfel to determine the best cache strategy (what happens if I allocate more of the address space to cache and less to direct access? what’s the best aging strategy? what happens if direct-access doesn’t start at address 0? what happens if loadb/storeb bypass the cache? can I detect and prevent thrashing from large sequential reads? etc.). In the end I mostly arrived where I started:
- Direct-access the lowest 48 KB
- 4 KB of write-through LRU cache
- 256-byte cache pages, page-aligned
- Start page ages at 1 when first loaded
- Reset a page’s age to 0 on cache hit
- Age all pages by 1 on cache miss
The trick of starting page ages at 1 rather than 0 reduced cache misses by about 2% in the average game.
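The aging rules above can be sketched as follows (Python; the class structure and dict are simplified stand-ins for the real 16-page, 256-bytes-per-page cache):

```python
class PageCache:
    """Write-through LRU-style page cache with the aging rules above:
    pages start at age 1, a hit resets a page's age to 0, every miss
    ages all resident pages by 1, and the oldest page is evicted when
    the cache is full."""
    def __init__(self, capacity=16):          # 16 x 256-byte pages = 4 KB
        self.capacity = capacity
        self.ages = {}                        # page number -> age

    def access(self, page):
        """Return True on a cache hit, False on a miss (page loaded)."""
        if page in self.ages:
            self.ages[page] = 0               # hit: reset age to 0
            return True
        for p in self.ages:                   # miss: age every page by 1
            self.ages[p] += 1
        if len(self.ages) >= self.capacity:   # full: evict the oldest
            oldest = max(self.ages, key=self.ages.get)
            del self.ages[oldest]
        self.ages[page] = 1                   # new pages start at age 1
        return False
```

Starting at age 1 means a freshly loaded page that is never touched again ranks as older than a same-generation page that took a hit, so speculative loads get evicted first.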