Edit: I moved this from the “Interpreters” group, rethinking my original idea of where this should go…
Given some operation (don’t pick on the syntax or even the specific opcode below; it’s just the concept I’m questioning). Assume this is I6 and the result is running on one of the more popular 'terps. Browser-based, if it makes a difference.
#ifdef CONDITION;
@bitxor a b -> c; ! line a
#ifnot;
c = (a | b) - (a & b); ! line b
#endif;
I’d expect line a to outperform line b.
But what if line a was wrapped in a routine…
[ myXOr a b c; @bitxor a b -> c; return c; ];
Would the routine call overhead outweigh the performance gains from the inline version?
c = myXOr(val1, val2);
vs
c = (a | b) - (a & b);
Again, this code is NOT tested and probably has syntax errors, but the question is about the cost from the routine call overhead.
Well, in the best case, (a|b)-(a&b) can be done in three instructions (OR, AND, SUB), and so can the routine (CALLVS, XOR, RET). The routine call will take more work on the interpreter’s side, since it has to make a call frame and then destroy it again. So on retro hardware, I’d go with the inline arithmetic. Anywhere else, it doesn’t matter.
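To make the instruction count and the call-frame cost concrete, here is a rough sketch in Python (not I6). It is a toy VM: the opcode names, the tuple encoding and the `run` function are all invented for illustration and do not match real Z-machine or Glulx bytecode. It only counts dispatched instructions and frames created; it says nothing about real timings.

```python
import operator

MASK = 0xFFFF  # keep results in 16 bits, as on the Z-machine

ALU = {"or": operator.or_, "and": operator.and_,
       "sub": operator.sub, "xor": operator.xor}

# Hypothetical programs: each instruction is (opcode, operands...).
INLINE = {"main": [("or", "a", "b", "t1"),
                   ("and", "a", "b", "t2"),
                   ("sub", "t1", "t2", "c")]}
CALLED = {"main": [("call", "my_xor", "a", "b", "c")],
          "my_xor": [("xor", "a", "b", "c"), ("ret", "c")]}


def run(routines, a, b):
    """Run 'main' and return (result, instructions executed, frames created)."""
    local_vars = {"a": a, "b": b}
    call_stack = []  # one saved (code, pc, locals, dst) entry per active call
    code, pc = routines["main"], 0
    executed = frames = 0
    while pc < len(code):
        op, *args = code[pc]
        pc += 1
        executed += 1
        if op in ALU:
            x, y, dst = args
            local_vars[dst] = ALU[op](local_vars[x], local_vars[y]) & MASK
        elif op == "call":
            name, x, y, dst = args
            call_stack.append((code, pc, local_vars, dst))  # build a frame
            frames += 1
            local_vars = {"a": local_vars[x], "b": local_vars[y]}
            code, pc = routines[name], 0
        elif op == "ret":
            value = local_vars[args[0]]
            code, pc, local_vars, dst = call_stack.pop()  # destroy the frame
            local_vars[dst] = value
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return local_vars["c"], executed, frames


print(run(INLINE, 0x1234, 0x00FF))
print(run(CALLED, 0x1234, 0x00FF))
```

Both versions dispatch three instructions and produce the same value; the only difference the toy shows is that the routine version pushes and pops one frame.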
I am curious, since the topic is adjacent… Once upon a time on old hardware (not IF interpreters), I remember working in assembly and consulting a table which listed how long each opcode took (in some unit of measure I don’t recall) and making optimization decisions in my code based on that. So BNE (branch-if-not-equal) would take, say, 20 units, while SHL (shift-left) might only take 2.
Is there an equivalent table someplace for the Z/Glulx VMs? I know this is not quite the same as measuring CPU level performance, since each interpreter could translate opcodes differently, but if there’s a general estimate someplace, I’d be interested in getting my fingers on it.
Afraid not. The Z-machine doesn’t really have a notion of “cycles” the way most actual hardware does; what’s blazing fast on one machine might be very tedious on another. (For example, 16-bit multiplication will be very fast on any machine that supplies it natively, and very slow if you have to synthesize it from 8-bit operations.)
The closest thing we’ve got is counting the number of instructions, which is 3 for both versions here. Beyond that, you have to profile it on a specific interpreter.
When I was profiling Secret Letter + FyreVM, I came to the conclusion that number of instructions was essentially all that mattered. That was Glulx, but the same idea applies - in a straightforward interpreter, the actual work of the instruction is dwarfed by the overhead of fetching/decoding/dispatching it.
The difference between instructions might be more noticeable in an interpreter like ZLR or Git that does the decoding up front.
Decoding instructions and arguments takes a lot of time on 8-bit hardware, as does virtual memory handling, if the interpreter uses virtual memory.
Let’s take an example:
@add x y -> z;
(encoded in long form) on Ozmoo for C64, where x, y and z are all local variables, typically uses 347 clock cycles (~352 microseconds), 16 of which are spent performing the addition.
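The arithmetic behind those figures can be checked with a few lines of Python. This assumes a PAL C64, whose CPU clock is about 985,248 Hz; the quoted ~352 microseconds is consistent with that clock, but the clock value is my assumption, not something stated above.

```python
PAL_CLOCK_HZ = 985_248  # assumption: PAL C64 CPU clock
TOTAL_CYCLES = 347      # whole @add instruction, from the post
ADD_CYCLES = 16         # cycles spent on the addition itself, from the post


def cycles_to_us(cycles, clock_hz=PAL_CLOCK_HZ):
    """Convert a CPU cycle count to microseconds."""
    return cycles * 1_000_000 / clock_hz


total_us = cycles_to_us(TOTAL_CYCLES)
overhead_share = (TOTAL_CYCLES - ADD_CYCLES) / TOTAL_CYCLES

print(f"total: {total_us:.1f} us")
print(f"fetch/decode/store overhead: {overhead_share:.1%} of the cycles")
```

In other words, only 16 of the 347 cycles do the actual adding; the remaining 331 go to everything around it.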
Here’s how the time is spent for this operation:
This is all for the cheapest scenario - if the instruction happens to straddle a page border, we’d spend time figuring out if the next Z-machine memory byte is in fact the next byte in RAM, or if we need to find a different page in virtual memory.
Of course, some instructions take a significant time to perform as well, like multiplication and division, as we basically need to perform them the way humans do on paper, one digit at a time. (The C64 CPU lacks instructions for multiplication and division.) Fetching or changing a property value and performing scan_table, copy_table and tokenize all take time as well.
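For anyone who hasn't seen "one digit at a time" multiplication, here is a generic shift-and-add sketch in Python, working one binary digit per step. It is the textbook technique, not Ozmoo's actual routine, and `mul16` is a name made up for this example.

```python
def mul16(x, y):
    """16-bit multiply by shift-and-add, one bit of y per loop iteration."""
    x &= 0xFFFF
    y &= 0xFFFF
    result = 0
    for _ in range(16):           # always 16 steps, like a fixed loop on an 8-bit CPU
        if y & 1:                 # lowest remaining bit of the multiplier set?
            result = (result + x) & 0xFFFF
        x = (x << 1) & 0xFFFF     # next place value of the multiplicand
        y >>= 1                   # move on to the next bit
    return result


print(mul16(123, 45))
```

On a CPU with only 8-bit registers, each of those 16-bit adds and shifts is itself several instructions, which is where the time goes.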
Printing is a very heavy operation (about 600 cycles per character, plus time for scrolling the screen as needed): we unpack the text from 5-bit codes to 8-bit ZSCII, send it to the streams module to figure out where it is to be sent, convert ZSCII to PETSCII (Commodore’s version of ASCII), and send the characters to the screen handling module, which converts PETSCII to screen codes, stores these codes in screen RAM, and updates the colour RAM.
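The first of those steps, unpacking 5-bit codes, looks roughly like this in Python. Z-machine text packs three 5-bit Z-characters into each 16-bit word, with the top bit marking the last word. This sketch only handles the lowercase alphabet (Z-characters 6 to 31) and space (0); it skips Z-characters 1 to 5 (abbreviations and shifts in version 3 and later) and does nothing about the ZSCII-to-PETSCII steps. The function names are made up for the example.

```python
def unpack_zchars(words):
    """Split 16-bit words into 5-bit Z-characters, stopping at the end bit."""
    zchars = []
    for word in words:
        zchars.extend(((word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F))
        if word & 0x8000:  # top bit set: this was the last word of the string
            break
    return zchars


def decode_a0(zchars):
    """Decode Z-characters using only the lowercase alphabet A0 (simplified)."""
    out = []
    for z in zchars:
        if z == 0:
            out.append(" ")
        elif 6 <= z <= 31:
            out.append(chr(ord("a") + z - 6))
        # Z-characters 1..5 are skipped in this simplified sketch
    return "".join(out)


packed = [0x3551, 0xC685]  # "hello" plus one padding Z-character
print(decode_a0(unpack_zchars(packed)))
```

Even this stripped-down version does three shifts and masks per word plus a table lookup per character, before any of the stream or screen handling starts.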
In my observations, function calls are among the most expensive operations for a VM. But it’s often much clearer to write the code with sensible function calls. Inlining can make a big performance difference, but you should only consider it in the hottest code.