Accelerating Inform 6 games on 8-bit interpreters

I recently released Zym II, a Z-machine interpreter for SymbOS. Zym II has some unique acceleration features which speed up Inform 6 games by 30-60%, making many otherwise-annoyingly-slow games playable on 8-bit hardware. This post briefly documents what it does, for the interest of any other 8-bit terp developers.

(Wall of text incoming.)

Philosophy

I’m mostly targeting Inform 6 games here, as (1) there are a lot of them, and (2) they often sit right on the edge of playability on 8-bit hardware. ZIL/PunyInform games don’t need help, while Inform 7 games are generally out of reach for other reasons (stack usage). Dialog games would ideally benefit from similar work, but there are fewer of them and they would require a slightly different toolchain to analyze.

I’m also focusing on optimizations that can be implemented compactly, as code space is limited in 8-bit terps. In theory the simplest approach is to inline every Inform 6 veneer routine (as Glulx optionally does), but some optimizations have a better gain-to-size ratio than others.

Telemetry was conducted by cruftily patching bocfel.

Nothing here is close to the final word, but hopefully it’s useful to someone.

Trivial optimizations

Most-used opcodes

There are a few sources documenting the most-used opcodes, but fewer breaking them down by exact form. The first column shows usage as a percentage of all opcodes actually executed in a typical run (not merely present in the file).

Tables

Curses r16 (earlier Inform 6)

Percent Byte Opcode
20.5% 41 je V,S
8.7% 8C jump L
8.2% 95 inc S
6.9% 63 jg V,V
5.9% 52 get_prop_addr V,S
5.1% C1 je (var)
4.6% 42 jl V,S
4.0% 2D store S,V
2.4% 0D store S,S
2.1% D0 storeb (var)

Savoir-Faire (later Inform 6)

Percent Byte Opcode
9.3% A0 jz V
9.1% 8C jump L
7.0% 95 inc S
6.6% 2D store S,V
6.2% 41 je V,S
4.8% 62 jl V,V
4.5% 61 jl S,V
4.4% 57 div V,S
4.1% 42 jl V,S
4.1% 52 get_prop_addr V,S

Bronze (earlier Inform 7)

Percent Byte Opcode
10.3% A0 jz V
8.3% 8C jump L
7.0% 42 jl V,S
6.0% CF loadw (var)
5.3% 61 jl S,V
4.7% 62 jl V,V
4.4% 41 je V,S
4.2% 95 inc S
3.9% E0 call_vs (var)
3.3% 54 add V,S

Key takeaways:

  • Most of the most-used opcodes are branch instructions; optimize the branch logic aggressively.
  • Most opcodes are disproportionately used in certain forms (e.g., jz V), so unrolling the decoding logic does make a difference.
  • This is particularly true for je (EQUAL?), which includes a heavy VAR form with an arbitrary number of operands. In practice, however, it is mostly the je V,S form that is used, so it is very much worth unrolling the je implementation to minimize the code path for simple (non-VAR) cases.
  • div V,S is weirdly common; see below.
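
To make the last two points concrete, here is a hedged Python sketch (not Zym's actual code; the memory layout and helper names are illustrative) contrasting an unrolled je V,S path with the generic VAR-form operand loop:

```python
# Hedged sketch (not Zym's actual code): the unrolled fast path for the
# "je V,S" form versus the generic operand-type loop that the VAR form
# needs. `mem` is story memory as a byte sequence; `read_var` stands in
# for the interpreter's variable accessor.

def decode_je_vs_fast(mem, pc, read_var):
    """Opcode 0x41 = je <variable>, <small constant>: three bytes, one compare."""
    a = read_var(mem[pc + 1])          # first operand: variable number
    b = mem[pc + 2]                    # second operand: small constant
    return a == b, pc + 3              # (condition, address of branch byte)

def decode_je_var_generic(mem, pc, read_var):
    """Opcode 0xC1 = VAR-form je: a type byte, then 1-4 operands."""
    types = mem[pc + 1]
    ops, off = [], pc + 2
    for shift in (6, 4, 2, 0):
        t = (types >> shift) & 3
        if t == 3:                     # 0b11: no more operands
            break
        if t == 0:                     # large constant (two bytes)
            ops.append((mem[off] << 8) | mem[off + 1]); off += 2
        elif t == 1:                   # small constant (one byte)
            ops.append(mem[off]); off += 1
        else:                          # variable
            ops.append(read_var(mem[off])); off += 1
    return any(o == ops[0] for o in ops[1:]), off
```

The fast path touches three bytes and does one comparison; the generic path grinds through the type byte even when only two operands are present.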

Divide-by-2

The Inform 6 library has a bad habit of using integer division by 2 (and to a lesser extent, division by 256 or multiplication by 2) to perform trivial bit-shift operations. The worst offender is the ofclass veneer routine (see below), which in many versions spends most of its time in the following loop:

n = obj.#2;
for (j=0: j<n/2: j++)
    {   if (a-->j == cla) rtrue;
    }

This iterates over the object’s class property, performing a useless division by 2 on every iteration to test the for condition. This results in div V,S being up to 5% of all opcodes executed in a typical run! Since integer division is one of the slowest opcodes on 8-bit CPUs, the time impact is even worse. Interpreters can therefore get a measurable speed boost just by handling the case div positive_number,2 as a bit-shift operation instead of a true divide.
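
A hedged sketch of such a div handler (illustrative, not Zym's actual code). The guard matters because Z-machine division is signed 16-bit and truncates toward zero, so the shift is only safe for non-negative dividends:

```python
# Hedged sketch of a @div handler with a fast path for "positive_number / 2".
# The shift is only taken when the dividend's high bit is clear.

def to_signed(w):
    """Interpret a 16-bit word as a signed value."""
    return w - 0x10000 if w >= 0x8000 else w

def z_div(a, b):
    """16-bit result of @div for word operands a, b (b != 0)."""
    if b == 2 and a < 0x8000:          # fast path: non-negative dividend / 2
        return a >> 1                  # plain shift, no signed divide needed
    sa, sb = to_signed(a), to_signed(b)
    q = abs(sa) // abs(sb)             # truncate toward zero
    if (sa < 0) != (sb < 0):
        q = -q
    return q & 0xFFFF
```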

Patch optimizations

The Glulx standard accelerates Inform code by providing alternate native versions of the Inform veneer routines. The same approach is fruitful here—by redirecting common veneer calls to native opcodes, we can save a tremendous amount of execution time. But how to do this with pre-existing story files?

The obvious approach is to pattern-match the story file on load to determine the locations of veneer functions, but this proved impractical on 8-bit hardware. It’s possible, but doing it properly increases load times by minutes. I finally gave up on this and wrote an offline script that patches story files in advance. I do not really like this solution (the whole point of a Z-code file is that it’s cross-platform, so creating a patched version that only runs on one interpreter feels wrong), but it’s already common to distribute 8-bit releases as platform-specific disk images, so I suppose this is not really different.

My patch script uses txd to disassemble Z-code files and pattern-matches against the disassembly output to locate known heavyweight routines. It then replaces calls to those routines in situ by converting the call opcodes to custom EXT opcodes, e.g.:

E0 2B nn nn 02 03 00 = call_vs nnnnn (L01,L02) -> -(SP)

(VAR form: long, var, var) might become:

BE 06 6B nn 02 03 00 = custom_call nn (L01,L02) -> -(SP)

(EXT form: short, var, var). The change from “long” to “short” in the first operand reflects the fact that the original routine address has been trashed by adding an extra BE byte to convert this to EXT—but we no longer need this address, so this new “short” operand can just be discarded or used for something by the new EXT opcode.

Optimizable inline code blocks are likewise overwritten in-place with new EXT opcodes, with any extra space left over padded with NOPs.
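
A hedged sketch of the byte-level patch on the call_vs example above (the EXT opcode number 0x06 is illustrative, not Zym's actual encoding):

```python
# Hedged sketch of the in-place call_vs -> EXT rewrite described above.
# The EXT opcode number passed in is illustrative.

def patch_call_to_ext(story, at, ext_opnum):
    """Rewrite 'E0 2B hh ll ...' (call_vs with long, var, var operands).

    The patch keeps the instruction 7 bytes long: the extra BE byte
    effectively shifts everything right by one, trashing the high byte
    of the packed routine address -- which no longer matters, since the
    custom opcode doesn't need the address.
    """
    assert story[at] == 0xE0 and story[at + 1] == 0x2B
    story[at]     = 0xBE               # EXT prefix
    story[at + 1] = ext_opnum          # custom opcode number
    story[at + 2] = 0x6B               # new types: small-const, var, var
    # story[at + 3] still holds the old low address byte; it is now read
    # as a discardable small constant. Remaining operands are untouched.
    return story
```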

For some routines it is necessary to extract and save certain dynamically-generated information (like the total object count); where this could not be added as an operand, I saved it to unused (or Z6-specific) bytes in the story file’s header.

The Inform 6 veneer code can be found in veneer.c in the Inform 6 compiler source, although it may also be helpful to look at old versions on IFArchive, as some details have changed since the 1990s.

OC__Cl

By far the heaviest veneer routine is OC__Cl (ofclass), which accounts for 30-50% (!!!) of all opcodes executed in many Inform 6 games. The majority of this is usually spent iterating over an object’s class property (property 2), as described above. The pattern for the start of OC__Cl is as follows (in txd disassembly with the -d -n options):

Code
?????:  42 01 01 cc             JL              L00,#01 [TRUE] ?????
?????:  d5 1f ?? ?? ff 00       SUB             #????,#ff -> -(SP)
?????:  63 01 00 57             JG              L00,(SP)+ [FALSE] ?????
?????:  c1 97 02 03 04 40       JE              L01,#03,#04 [FALSE] RFALSE
?????:  55 02 01 00             SUB             L01,#01 -> -(SP)
?????:  d9 2f ?? ?? 01 00       CALL_2S         ????? (L00) -> -(SP)
?????:  61 00 00 c1             JE              (SP)+,(SP)+ [TRUE] RTRUE
?????:  b1                      RFALSE          
# ...routine continues, with slight library variations...

The full logic for OC__Cl is complex, but it is necessary to implement it all exactly, as different games require different aspects of it.
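
The hot inner loop of a native replacement is just a scan over that class property. A hedged Python sketch (the helper names stand in for the terp's own property accessors; this is not Zym's actual code):

```python
# Hedged sketch of the hot part of a native OC__Cl: scanning an object's
# class property (property 2) for a given class.

def ofclass_scan(get_prop_addr, get_prop_len, read_word, obj, cla):
    """True if cla appears in obj's property-2 class list."""
    addr = get_prop_addr(obj, 2)
    if addr == 0:
        return False
    n = get_prop_len(addr) // 2        # entries are 16-bit words
    for j in range(n):
        if read_word(addr + 2 * j) == cla:
            return True
    return False
```

Note that the n/2 from the veneer loop happens once here, not on every iteration.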

RA__Pr

RA__Pr (object.&property) accounts for 5-20% of all opcodes executed in a typical Inform 6 game. It comes in two main forms, depending on library version:

Code

“Short” form:

?????:  42 02 40 4b             JL              L01,#40 [FALSE] ?????
?????:  43 02 00 47             JG              L01,#00 [FALSE] ?????
?????:  72 01 02 00             GET_PROP_ADDR   L00,L01 -> -(SP)
?????:  b8                      RET_POPPED      
?????:  c9 8f 02 80 00 00       AND             L01,#8000 -> -(SP)
?????:  a0 00 80 47             JZ              (SP)+ [TRUE] ?????
?????:  49 02 ff 00             AND             L01,#ff -> -(SP)
?????:  cf 2f ?? ?? 00 05       LOADW           #????,(SP)+ -> L04
?????:  52 05 03 00             GET_PROP_ADDR   L04,#03 -> -(SP)
# ...routine continues...

“Long” form:

?????:  a0 01 c0                JZ              L00 [TRUE] RFALSE
?????:  42 02 40 4e             JL              L01,#40 [FALSE] ?????
?????:  43 02 00 4a             JG              L01,#00 [FALSE] ?????
?????:  72 01 02 ff             GET_PROP_ADDR   L00,L01 -> Gef
?????:  e8 bf ff                PUSH            Gef
?????:  b8                      RET_POPPED      
?????:  c9 8f 02 80 00 00       AND             L01,#8000 -> -(SP)
?????:  a0 00 80 4d             JZ              (SP)+ [TRUE] ?????
?????:  49 02 ff 00             AND             L01,#ff -> -(SP)
?????:  cf 2f ?? ?? 00 05       LOADW           #????,(SP)+ -> L04
?????:  52 05 03 ff             GET_PROP_ADDR   L04,#03 -> Gef
# ...routine continues...

The Inform 6 definition of this routine is complicated, but in practice I have never found a game that needed anything besides the following parts of the code:

Code
if (obj == 0) rfalse;
if (identifier < 64 && identifier > 0) return obj.&identifier;
if (obj.&3 == 0) rfalse;
if (self == obj)
    otherid = identifier | $8000;
i = obj.3;
while (i-->0 ~= 0)
{   if (i-->0 == identifier or otherid)
        return i+3;
    i = i + i->2 + 3;
}
rfalse;

which is comparatively easy to implement. Everything after “i = obj.3” is a trivial traversal of the object’s individual properties table. (If any experts know of cases where other features of this routine are needed, let me know.)
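
For reference, here is that traversal as a hedged Python sketch (read_word/read_byte are stand-ins for the interpreter's memory accessors; `table` is the address from obj.3):

```python
# Hedged sketch of the RA__Pr subset above: walking the individual-
# property table until a zero terminator word is found.

def ra_pr_individual(read_word, read_byte, table, identifier, otherid):
    """Return the data address of identifier (or otherid) in the table, or 0."""
    i = table
    while read_word(i) != 0:           # table ends with a zero word
        if read_word(i) in (identifier, otherid):
            return i + 3               # data follows id word + length byte
        i = i + read_byte(i + 2) + 3   # skip id (2), length (1), data (len)
    return 0
```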

Z__Region

Z__Region accounts for 1-2% of opcodes executed in most games, mainly because it is called by OC__Cl. It is required to implement OC__Cl correctly anyway, so we get it for free:

Code

Common “long” form (whole routine):

?????:  c1 93 01 00 ff ff c0    JE              L00,#00,#ffff [TRUE] RFALSE
?????:  2d 02 01                STORE           L01,L00
?????:  0f 1a 00 00             LOADW           #1a,#00 -> -(SP)
?????:  e0 2b ?? ?? 02 00 00    CALL_VS         ????? (L01,(SP)+) -> -(SP)
?????:  42 00 00 40             JL              (SP)+,#00 [FALSE] RFALSE
?????:  42 01 01 cc             JL              L00,#01 [TRUE] ?????
?????:  d5 1f ?? ?? ff 00       SUB             #????,#ff -> -(SP)
?????:  63 01 00 41             JG              L00,(SP)+ [FALSE] RTRUE
?????:  e0 23 ?? ?? 01 ?? ?? 00 CALL_VS         ????? (L00,S001) -> -(SP)
?????:  42 00 00 c4             JL              (SP)+,#00 [TRUE] ?????
?????:  9b 03                   RET             #03
?????:  e0 23 ?? ?? 01 ?? ?? 00 CALL_VS         ????? (L00,#????) -> -(SP)
?????:  42 00 00 c4             JL              (SP)+,#00 [TRUE] ?????
?????:  9b 02                   RET             #02
?????:  b1                      RFALSE          

Common “short” form (whole routine):

?????:  a0 01 c0                JZ              L00 [TRUE] RFALSE
?????:  42 01 01 cc             JL              L00,#01 [TRUE] ?????
?????:  d5 1f ?? ?? ff 00       SUB             #????,#ff -> -(SP)
?????:  63 01 00 41             JG              L00,(SP)+ [FALSE] RTRUE
?????:  e0 23 ?? ?? 01 ?? ?? 00 CALL_VS         ????? (L00,S001) -> -(SP)
?????:  42 00 00 c4             JL              (SP)+,#00 [TRUE] ?????
?????:  9b 03                   RET             #03
?????:  e0 23 ?? ?? 01 ?? ?? 00 CALL_VS         ????? (L00,#????) -> -(SP)
?????:  42 00 00 c4             JL              (SP)+,#00 [TRUE] ?????
?????:  9b 02                   RET             #02
?????:  b1                      RFALSE          
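
Reconstructing the branch structure above, the native logic amounts to a few range checks. A hedged sketch (the boundary parameters are assumptions standing in for the #???? constants in the disassembly):

```python
# Hedged sketch of the Z__Region logic (region codes: 1 = object number,
# 2 = routine address, 3 = string address; 0 = none of these).

def z_region(addr, max_object, strings_start, code_start, file_top):
    if addr in (0, 0xFFFF):            # the JE ...,#00,#ffff -> RFALSE check
        return 0
    if addr >= file_top:               # unsigned compare against memory top
        return 0
    if 1 <= addr <= max_object:
        return 1                       # valid object number
    if addr >= strings_start:
        return 3                       # packed string address
    if addr >= code_start:
        return 2                       # packed routine address
    return 0
```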

Unsigned__Compare

Unsigned__Compare is used to work around the fact that Z-machine arithmetic is signed, so values >32767 are treated as negative numbers, which creates problems when comparing absolute memory addresses. Unsigned__Compare accounts for 2-3% of opcodes executed in most games, but mainly because it is called several times by Z__Region, so if we accelerate OC__Cl and Z__Region there is less point in accelerating Unsigned__Compare. But it’s so trivial that we might as well.

Code

Whole routine:

?????:  61 01 02 ??             JE              L00,L01 [FALSE] ?????
?????:  b1                      RFALSE          
?????:  42 01 00 ??             JL              L00,#00 [FALSE] ?????
?????:  42 02 00 ??             JL              L01,#00 [TRUE] ?????
?????:  b0                      RTRUE           
?????:  42 01 00 ??             JL              L00,#00 [TRUE] ?????
?????:  42 02 00 ??             JL              L01,#00 [FALSE] ?????
?????:  8b ff ff                RET             #ffff
?????:  c9 8f 01 7f ff 03       AND             L00,#7fff -> L02
?????:  c9 8f 02 7f ff 04       AND             L01,#7fff -> L03
?????:  63 03 04 ??             JG              L02,L03 [FALSE] ?????
?????:  b0                      RTRUE           
?????:  8b ff ff                RET             #ffff
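
As a native primitive, the whole routine collapses to a single comparison on the raw word values (hedged sketch):

```python
# Hedged sketch: Unsigned__Compare as an interpreter primitive. Comparing
# the raw 16-bit words directly replaces all the sign-fiddling above.

def unsigned_compare(a, b):
    """Return 1, 0, or -1 (as the word 0xFFFF), comparing a and b unsigned."""
    a &= 0xFFFF
    b &= 0xFFFF
    if a == b:
        return 0
    return 1 if a > b else 0xFFFF
```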

objectloop

The Inform 6 objectloop syntax, which iterates over all objects in the game, can add up to a large amount of processing time. Most of this time is spent iterating over objects that do not meet the loop condition.

The form objectloop (x in y) is easiest to accelerate, and looks schematically like this:

Code
  STORE local,#01
start:
  JIN local,y [FALSE] next
  ... inner loop code ...
next:
  INC local
  JG local,maxobjects [TRUE] end
  JUMP start
end:

Zym’s approach is to replace the first jin opcode with a custom opcode that performs the whole check-increment-repeat cycle in one operation, starting from the current value of local and advancing through the object table until it hits something with the parent y. Once found, it writes the object ID back to local and exits without branching, falling into the loop. On reaching the end of the object table, it branches, letting the code at next handle the exit condition.

This reduces the whole check-increment-repeat cycle (which must otherwise repeatedly branch and recalculate the object address) into a simple forward search that checks the parent entry of each object (every 7th word, or 9th byte in Z3) against y. This also automatically accelerates compound conditions like objectloop (x in y && some_other_condition); only the first condition will actually be accelerated, but it is typically the heaviest.
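
A hedged sketch of that forward search (assuming the v4+ object-table layout of 14-byte entries with a word-sized parent at offset 6; v3 uses 9-byte entries with byte-sized fields):

```python
# Hedged sketch of the custom "scan for next child" opcode: a forward
# search through the object table's parent fields (v4+ layout assumed).

OBJ_SIZE, PARENT_OFF = 14, 6

def scan_in(read_word, obj_table, start, maxobj, parent):
    """First object >= start whose parent is `parent`, or 0 if none."""
    for obj in range(start, maxobj + 1):
        entry = obj_table + (obj - 1) * OBJ_SIZE
        if read_word(entry + PARENT_OFF) == parent:
            return obj                 # write back to local, fall into loop
    return 0                           # table exhausted: take the branch
```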

Impact depends on the game; at minimum the library routine NoteObjectAcquisitions benefits from this (accounting for 1-3% of opcodes in many games, up to 15% in, e.g., Anchorhead).

The form objectloop (x ofclass y) is also common, but difficult to accelerate because any practical implementation would still involve calling OC__Cl on every object. The form objectloop (x has y) is easy to accelerate in theory but is relatively rare in practice.

Manual copy_table

Inform 6 generally does not use the copy_table (COPYT) opcode, instead performing table copies manually. While not typically a major bottleneck, the inline copy pattern can be trivially converted to a copy_table opcode and a jump (to skip the remaining bytes in the original code).

Code

Form when copying a static address:

?????:  0d ?? ??                STORE           ???,#??
?????:  42 ?? ?? 53             JL              ???,#?? [FALSE] ?????
?????:  d0 2f ?? ?? ?? 00       LOADB           #????,??? -> -(SP)
?????:  e2 2b ?? ?? ?? 00       STOREB          #????,???,(SP)+
?????:  95 ??                   INC             ???
?????:  8c ff ed                JUMP            ?????

Form when the table address is in a variable:

?????:  0d ?? 00                STORE           ???,#00
?????:  42 ?? ?? 51             JL              ???,#?? [FALSE] ?????
?????:  70 ?? ?? 00             LOADB           ???,??? -> -(SP)
?????:  e2 2b ?? ?? ?? 00       STOREB          #????,???,(SP)+
?????:  95 ??                   INC             ???
?????:  8c ff ef                JUMP            ?????

The Mulldoon Legacy class search

This one is very CISC, but scratches a personal itch.

I really like The Mulldoon Legacy (as do a lot of other people, judging by its regular placement in “top n of all time” lists), but it runs too slowly to be playable on 8-bit hardware. It turns out this is because it spends 50% of its time evaluating OC__Cl and another 24% evaluating the following function:

Code
Routine 22030, 2 locals

22031:  0d 02 01                STORE           L01,#01
22034:  e0 27 5d ae 02 01 00    CALL_VS         2ed70 (L01,#01) -> -(SP)
2203b:  a0 00 d4                JZ              (SP)+ [TRUE] 22050
2203e:  c1 97 02 02 01 ce       JE              L01,#02,#01 [TRUE] 22050
22044:  e0 2b 5d ae 01 02 00    CALL_VS         2ed70 (L00,L01) -> -(SP)
2204b:  a0 00 c4                JZ              (SP)+ [TRUE] 22050
2204e:  ab 02                   RET             L01
22050:  95 02                   INC             L01
22052:  c3 8f 02 03 3c c5       JG              L01,#033c [TRUE] 2205b
22058:  8c ff db                JUMP            22034
2205b:  b0                      RTRUE           

This is effectively a way of retrieving the first class of an object by the completely bonkers method of iterating over all objects in the game, checking if that object is a class, and then (if so) checking if the passed object belongs to that class:

Code
get_first_class(x) {
    for (i = 1; i <= #total_obj-255; ++i) {
        if (!(i ofclass 1)) continue;
        if (i == 1 || i == 2) continue;
        if (x ofclass i)
            return i;
    }
    return true;
}

I am not sure exactly what the intent was in doing it this way, especially since no object in The Mulldoon Legacy has more than two slots available in its class list. (Possibly a rudimentary form of class precedence?) But in any case, accelerating this function can be done with only a few dozen bytes and single-handedly makes The Mulldoon Legacy playable on Zym. I have not found any other game that uses this exact method.

Other potential targets

NounWord and NextWord are logically trivial but surprisingly heavy, accounting for a few percent in some games.

MoveFloatingObjects (from the library file verblib.h) accounts for 1-2% of opcodes executed in most games, but is very heavy in a few (e.g., 20% in Curses). The logic is too complex to fully inline in a space-constrained interpreter, but accelerating the main objectloop might be helpful.

Cache optimization

On SymbOS, copying from banked memory is generally cheaper than repeated byte-by-byte access, so Zym II stores the first 48 KB of the story file in directly accessible RAM and uses a write-through LRU cache to access parts of the story file above this waterline. I ran a lot of simulations in bocfel to determine the best cache strategy (what happens if I allocate more of the address space to cache and less to direct access? what’s the best aging strategy? what happens if direct-access doesn’t start at address 0? what happens if loadb/storeb bypass the cache? can I detect and prevent thrashing from large sequential reads? etc.). In the end I mostly arrived where I started:

  • Direct-access the lowest 48 KB
  • 4 KB of write-through LRU cache
  • 256-byte cache pages, page-aligned
  • Start page ages at 1 when first loaded
  • Reset a page’s age to 0 on cache hit
  • Age all pages by 1 on cache miss

The trick of starting page ages at 1 rather than 0 reduced cache misses by about 2% in the average game.
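
For the curious, the aging policy above can be expressed as a tiny simulator (a hedged sketch; the page count and eviction tie-breaking are illustrative, not Zym's exact code):

```python
# Hedged sketch of the page-aging policy in the list above.

PAGE = 256

class PageCache:
    def __init__(self, pages=16):      # 16 x 256-byte pages = 4 KB
        self.limit = pages
        self.ages = {}                 # page number -> age

    def access(self, addr):
        """Record an access; return True on a cache hit, False on a miss."""
        page = addr // PAGE
        if page in self.ages:
            self.ages[page] = 0        # reset age on hit
            return True
        for p in self.ages:            # age every resident page on a miss
            self.ages[p] += 1
        if len(self.ages) >= self.limit:
            oldest = max(self.ages, key=self.ages.get)
            del self.ages[oldest]      # evict the oldest page
        self.ages[page] = 1            # start at 1, not 0
        return False
```

Starting at 1 means a page that is loaded but never touched again loses ties against pages that have had at least one hit since loading.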

12 Likes

Fantastic work!

This is a mild tangent, but:

Dialog’s veneer is much heavier than Inform’s, and I imagine a similarly huge percentage of the time is spent in veneer routines. I’ve mostly assumed Dialog Z5 games will always be too big and slow for retro platforms, so the 6502 Å-machine terp should be the focus for retro compatibility. But if you think Dialog Z-code could feasibly work on 8-bit platforms, and have a way of profiling which routines the game spends most of its time in, I’d love to find out which parts of the veneer are the bottleneck in terms of performance.

My (very janky) toolchain consisted of (1) disassembling the source with txd, (2) running the game in a patched version of bocfel that records a heatmap of all addresses read as the first byte of an opcode, and (3) merging the heatmap with the source to identify hotspots. The main difficulty is that txd tends to choke on Dialog games, but if I had a complete disassembly with memory addresses profiling would be straightforward.

1 Like

I know unz specifically works with Dialog games, and its default output is in txd format. I’ll hook you up with a disassembly of Miss Gosling (my most “standard” Dialog game), the Z-code it was generated from, and an unz decompilation of that file.

Unfortunately, Dialog does not record the addresses it assembles each routine at, so going from decompilation back to veneer routine will be moderately annoying. But since the veneer is written in raw Z-machine assembly, it shouldn’t be too awful.

2 Likes

Funny. This reminds me inversely of something a contemporary audio programmer (Chris at Airwindows) came up with for audio producers.

Every time you change the gain of audio in a digital audio workstation, you’re putting it through more floating point operations. Speed is of no consequence. But audio is all math operations, zapping the waveform with hundreds, thousands, millions of floating point transformations over the course of a project.

He released a plugin called BitShiftGain. It lets you turn a signal up or down without subjecting it to floating point… but only in increments of 6db. It’s 6db because that’s a doubling or halving of the volume, which the plugin achieves by just shifting the bits of the sample one way or the other.

-Wade

2 Likes

I know the latest versions of the Inform compiler simplify certain arithmetic operations with constants (e.g. + 1 gets turned into a @inc instruction). Does that include turning / 2 into a shift instruction? If not, that should be a relatively straightforward change (though it wouldn’t help older games that have already been compiled).

3 Likes

Thank you for the exhaustive writeup! The overhead of Inform veneer routines has also been annoying to me, and I took a much different approach - write a new language that targets v3 specifically.

I realize this means that my target audience is now me, and only me.

I’ve been writing every bit of my library code with performance in mind. For example, I don’t need to compare unsigned addresses. Any return value between 2 and $num_objects is considered to be an object reference; anything else is a routine address. If you wanted a string, that’s just a routine address of a single print_ret instruction (so two bytes of overhead).

(On the interpreter side, I’ve heavily optimized my inner loops to limit instruction overhead, but as everybody reading this thread knows, that only gets you so far)

Noting that many divisions are by two is a huge but simple optimization, thank you for pointing it out. My hope is to avoid those in my own code, or at least hoist the computation out of the for loop manually.

I agree that having a separate program scan the story is a bit ugly, but totally understandable, and as you already mentioned, you generally need a bespoke process to generate a playable disk image anyway.

And honestly, there is a lot of great stuff out there that doesn’t run particularly well on old platforms, so I think what you’re doing is really cool, thank you for sharing it!

-Dave

2 Likes

Don’t forget to add padding before the first routine address if it threatens to collide with $num_objects. This is guaranteed to happen in at least one test case. :)

1 Like

Objects are at least nine bytes each (not counting properties), and will always appear before any routines?

I guess if you had a V8 story file that contained a single object with no description and massively stripped down globals?

-Dave

I’m just saying, you’ll spend less time checking for that case now than you will spend debugging it when it happens unexpectedly.

3 Likes

I don’t think that’s been done yet.

1 Like

Does that include turning / 2 into a shift instruction?

That’s not valid unless the argument is known to be non-negative. Not sure the compiler can be smart enough about loop structure to verify that.

Optimizing the veneer routine by hand might be a prudent choice.

1 Like

Can’t that be handled by using an arithmetic shift rather than a logical one?

No, because integer division truncates towards zero (or at least, most of the Z-machine unit tests assume it does). -32095 / 2 = -16047, 0x82A1 >> 1 = -16048.
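
The disagreement is easy to demonstrate (a sketch with hypothetical helper names):

```python
# Truncating signed division versus an arithmetic right shift: they
# disagree on odd negative values, as in the example above.

def z_div_trunc(a, b):
    """Signed division truncating toward zero, as @div behaves."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def asr16(w):
    """Arithmetic shift right by one; Python's >> floors, like a hardware ASR."""
    return w >> 1
```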

1 Like

a dumb question from the peanut gallery:

i’ve heard the term ‘veneer routine’ used here and in lots of other places. what is a ‘veneer routine’?

i guess i’ve always assumed it was the routines the interpreter is running as it handles the z-code opcodes, but maybe that’s wrong or naive.

1 Like

Veneer routines are functions that the I6 compiler inserts to handle common operations which are part of the Inform 6 language, but not built into the Z-machine. For example, veneer routines are needed to handle I6 classes and individual properties.

They’re normal Z-code routines as far as the interpreter is concerned, but they’re compiled from Inform source supplied by the compiler – not by the library or game. (The library can replace a veneer routine, but this is rarely done.)

The Glulx VM has fewer operations built in, so it needs more veneer routines. (E.g. “parent of this object” is a Z-machine opcode, but in Glulx it’s a veneer routine.)

3 Likes

The more usual term in compiler design is “runtime”, which is what Dialog calls it, but I tend to use “veneer” on this forum because people are used to I6. Dialog frequently needs operations like “can these two lists be unified?” that the Z-machine doesn’t natively supply, so there’s a routine for that written in Z-machine assembly embedded in the Dialog compiler, and it gets inserted into the story file if needed.

Since the implementation of this routine should be the same for every single program compiled by Dialog, it’s a good opportunity for a specialized interpreter to speed things up. In Dialog, “can these two lists be unified?” (R_WOULD_UNIFY) accounts for over 15% of the Z-machine cycles when playing a typical game(!), so an implementation in interpreter code could speed up execution quite a lot. Glulx actually has this built into the spec on a basic level (“accelfuncs”).

4 Likes

It would be nice if more/all of the I6 veneer code resulted in library calls, because those are a LOT easier to patch even on a tiny eight bit interpreter.

For example, it’s pretty simple to scan high memory for the “bread crumbs” of Unsigned__Compare. Since branches and jumps in Z-machine are always relative, I wouldn’t be surprised if a simple binary compare would be sufficient.

Well… except that if it’s a larger story file that you can’t fit fully into memory, then you’d have to identify it every time you loaded the routine back in, which would get expensive.

So we’re back to… it only makes sense to do this when creating the disk image.

Maybe it’s worth… standardizing this somehow? There are already defined extension opcodes, and there could be a standard tool that would “rewrite” a story file suitable for any “retro accelerated compatible” interpreter. That way, the hard work of identifying the hot spots could be centralized and gradually improved.

But for example in the case of The Mulldoon Legacy… it might be simpler to ask the author to fix their code from nearly 30 years ago instead.

Another crazy-ass idea if we’re allowed to rewrite story files during mastering is… transcode the entire Z-machine story file into (say) 6502/Z80 assembly. But it would end up looking a lot like FORTH threaded code, and you’d make your VM problems even worse because even 6502 code is heavier than Z-code, so that’s a terrible idea.

Separately, if objectloop doesn’t use @inc_chk internally it really should, that would compress three instructions down to one.

-Dave

This has been suggested at Code generation optimizations · Issue #87 · DavidKinder/Inform6 · GitHub but nobody has tackled it.

1 Like

I expect the main problem with standardizing I6 accelerations is that not all interpreters are likely to handle all possible accelerations (especially if the standard continues to add more). Glulx handles this by allowing the story file to voluntarily request specific accelerations, but that solution isn’t applicable here.

One possibility might be to standardize the function-call accelerations around a single opcode that uses a lookup table of addresses pre-attached to the story file; so if the terp does not accelerate a specific function, it could fall back on the Z-code implementation at the original address.

But this would be slower than bespoke native opcodes, and standardizing an alternate “accelerated” story file format wouldn’t be all that much more convenient than running a terp’s own acceleration tool. As it stands, running the tool is just another step in the (usually already belabored) process of installing a game on an 8-bit terp.

1 Like