Benchmarking Z-code

I am thinking of creating some z-code files designed for benchmarking interpreter speed. Perhaps a suite of individual files for testing specific types of opcodes: call instructions, catch/throw, object manipulation, etc., and maybe a more comprehensive test that combines many types of instructions into something more representative of an actual game. Ideally all should run to completion without any player input, to make it easier to benchmark.
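
To make this concrete, here is the sort of thing I have in mind for, say, the object-manipulation file. This is just a minimal Inform 6 sketch (compiled without the library); the object names and iteration count are placeholders:

    ! Hypothetical sketch of a single-purpose benchmark file:
    ! hammer the object-tree opcodes (insert_obj/remove_obj)
    ! and little else. Runs to completion with no input.

    Object box1 "box1";
    Object box2 "box2";
    Object marble "marble";

    [ Main i;
        print "Object-tree benchmark: start.^";
        for (i = 0 : i < 30000 : i++) {
            move marble to box1;   ! insert_obj
            move marble to box2;   ! insert_obj (reparenting)
            remove marble;         ! remove_obj
        }
        print "Object-tree benchmark: done.^";
    ];

As far as I know the Z-machine has no opcode for reading a clock, so the actual timing would happen outside the interpreter, e.g. with the shell's time command.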

Would anyone be interested in this?

4 Likes

I'm certainly interested.

I have made a sorting benchmark which I and a few others have been using. I'd be happy to share it with you if you want.
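
To give the general idea, the core of a sorting benchmark can be as small as the following Inform 6 sketch. (This is illustrative only, not the actual file.)

    ! Illustrative sketch: fill a word array in reverse order
    ! (bubble sort's worst case), then sort it. No input needed.

    Constant SIZE 200;
    Array data --> SIZE;

    [ Main i j tmp;
        for (i = 0 : i < SIZE : i++) data-->i = SIZE - i;
        for (i = 0 : i < SIZE - 1 : i++)
            for (j = 0 : j < SIZE - 1 - i : j++)
                if (data-->j > data-->(j+1)) {
                    tmp = data-->j;
                    data-->j = data-->(j+1);
                    data-->(j+1) = tmp;
                }
        print "Sorted ", SIZE, " entries.^";
    ];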

1 Like

I don't have a stake in this since I don't work with Z-code or Z-code interpreters, but from my experience with Glulx interpreters and lots of non-IF performance engineering, I must advise against the approach of creating synthetic benchmarks from scratch.

I totally agree that it's important to have benchmarks and to automate them so they don't have a human in the loop. But in the end, the scenario that matters is a real human playing a real game on a real machine. The more a benchmark resembles that, the more useful it is for measuring, understanding, and improving the things that actually matter. Empirically, it's a lot easier to achieve that by starting with a "real workload" and simplifying it until it's a good benchmark than by starting from a blank slate and adding code that only exists to be benchmarked. Even writing an artificial little game in Inform 6 or ZIL or something and compiling it to z-code will likely be more representative than focusing on individual opcodes.

Of course, there is a place for synthetic benchmarks. For example, if you've identified a specific performance issue in an interpreter and want to zoom in on it, crafting a test case that will exercise only that one thing may be useful. But you have to know what you're looking for first and remember to zoom back out afterwards. Asking questions you don't already know the answer to is a good way to get sent on a wild goose chase.

1 Like

I'd normally agree, but the issue with Z-code is that it spends 99.9999% of its time doing absolutely nothing while it waits for user input. Scripting input to a game is likely to just end up testing how fast you can read from a file and feed it to the game.

I see advantages both in benchmarking separate aspects (e.g. stack operations, memory reads/writes, object-tree changes, and display updates) and in game-like comprehensive tests. A comprehensive test can show you whether a problem exists, and specific tests can narrow it down. While the desired end result is playing games as smoothly and efficiently as possible, it can be useful to know exactly why and where any slowdowns are happening.
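
For the stack and call-instruction side, I'm picturing something as trivial as naive recursive Fibonacci, which is almost nothing but call, return, and stack traffic. A hypothetical Inform 6 sketch:

    ! Hypothetical call/stack-focused test. fib(22) = 17711,
    ! which still fits in the Z-machine's 16-bit signed words.

    [ Fib n;
        if (n < 2) return n;
        return Fib(n - 1) + Fib(n - 2);
    ];

    [ Main;
        print "fib(22) = ", Fib(22), ".^";
    ];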

Have you actually measured that? If that's true, why worry about interpreter performance at all?

Yes, with a human in the loop, most wall clock time is ideally spent waiting for the human to think of what to do next. Once they've picked something, there's going to be a delay between pressing enter and getting the response, which you can approximate by scripting input. If this delay is large enough to be perceptible for the human, reducing it may be useful. On the other hand, if it's small enough to be drowned out by the mere act of reading a short line of input text, then the game already reacts instantaneously as far as any human can tell (or text I/O is so catastrophically slow that it needs to be improved first). This is wonderful news, but it also means there's no reason to try and make it any faster.

Alternatively, there may be more performance work to be done in other contexts:

  • Perhaps the game you tried first is as fast as it'll get, but a more complex one still has some room to go.
  • Perhaps it's fast enough on your desktop computer, but not in the browser on a cheap phone used by millions of people around the world, or on various retro computers.
  • Perhaps further performance improvements would enable doing something more ambitious and computationally intense with Z-code than the average parser game.
  • Perhaps there is some other metric than raw interpreter throughput that matters, e.g., overall memory usage or additional human-perceptible latency in the display layer.

Each of these points suggests changes to the benchmark setup, but all of them still benefit from large, realistic benchmarks.

1 Like

For Ozmoo, we have a benchmarking option, which builds a self-playing version of a game. When played, it prints the time elapsed at the end. It can currently handle two different games. We've used this a lot to figure out how valuable an optimization is, to decide if it's worth its cost in bytes.

Of course, Ozmoo runs on 8-bit computers, where it's a challenge to get anything running on a virtual machine to run at a reasonable speed. Your goals for optimization may look quite different for an interpreter running on more modern hardware.

2 Likes

If you can improve a 30ms turn response by 10ms, that's not perceptible. But the same optimization will probably improve a 600ms turn response by something like 200ms, and that is perceptible. So you need to be able to measure it.

In order to find out how good your optimization is, you need very precise timing. So it's worth isolating the text-input stage from the code-execution stage – even when the numbers are tiny. Especially when the numbers are tiny!

2 Likes

Thinking further: a very simple way to do this is to build an Inform 7 game with a line

Test me with "e / get all / w / s / s / s / unlock grate with key / d / turn on lamp"

Inform's TEST ME feature runs all the listed commands in a single game input. It's a good way to run through a lot of VM execution while amortizing out the cost of the I/O system.

1 Like

I don't really disagree with anything you wrote. It's absolutely worth sweating the milliseconds. But a millisecond – millions of clock cycles – is practically a whole century for any reasonably modern CPU! Even if every turn only takes one millisecond when scripted, the vast majority of that time should still be spent on interpreter work and not on reading a line of input. If it isn't, that's also a performance issue worth addressing. Yes, file I/O from a physical disk can also be measured in milliseconds, but we're talking about very small files here, so even a benchmark that repeatedly reads the file "from disk" will get it from the OS page cache after the first run. So I struggle to imagine a scenario where you have so little actual work in the interpreter that you need to jump through hoops to isolate it from the cost of some tiny amount of I/O, but enough work that you can draw any meaningful conclusions from putting it under the microscope.

(This is not to say I am opposed to isolating one part of the workload when there's a good reason to. I have done more than my fair share of counting CPU cycles consumed by some inner loop and I don't intend to stop doing it. But in this specific context, I've not yet encountered the need.)

Players may think about input for minutes or longer, so yes.

What matters for player perception is what happens between inputs: between pressing enter and receiving the next prompt. That depends on how much code is executed each turn and on the nature of that code, both of which vary wildly between real-world games. An interpreter that works well for most games, only to have noticeable slowdowns in more computationally intensive ones, is not ideal. Large amounts of object-tree manipulation are a common culprit for slowdowns.

Perhaps scripting input isn't enough to affect testing results, or maybe it is – but why bother, if you can just avoid the issue entirely? Like you said: "…the scenario that matters is a real human playing a real game on a real machine." So optimizing an interpreter's ability to switch in and out of input mode and receive input quickly is less important than the other things the interpreter does, since in the real world input will arrive orders of magnitude more slowly. It's not like it's difficult to remove the need for input from a test file.

I just wanted to add that I do appreciate your input, and I don't mean to be argumentative. It's just that I do see some utility in some 'synthetic' benchmark files (duh, I probably wouldn't have suggested the topic in the first place otherwise) :slight_smile:

I like to think of isolated tests as akin to unit tests, and game-like comprehensive ones more like integration tests. There's utility in both.

1 Like

To be clear, I was not referring to 99.9999% of the time being spent waiting for input, but to the latter part: that scripting input would mostly test file I/O speed rather than the thing that matters.

Fully agreed. But to improve those slow games and slow turns, you have to find them and measure them. Otherwise, how can you know what to focus your effort on, and how can you confirm that you've actually improved them? Of course, it's never a great idea to keep a glaring inefficiency untouched just because you don't have a benchmark that implicates it, but few things are as unreliable as programmers' gut feeling about how to make a program go faster. Even if you know, for example, that it's about object tree manipulation: the size and shape of the tree, how exactly it's being manipulated, and in which order, can all matter enormously, as well as other work that's interspersed with the object tree manipulation.
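
As a toy illustration of the shape-and-order point, assuming the common first-child/next-sibling tree representation (where unlinking a child means first finding its previous sibling), the two loops in this hypothetical Inform 6 sketch issue exactly the same opcodes in the same quantities, yet the second one walks the sibling chain on every removal:

    ! Toy illustration: identical opcode counts, different costs.
    ! "move" makes an object the *first* child of its new parent,
    ! so after Fill() the sibling order is c8, c7, ..., c1.

    Object shelf "shelf";
    Object c1 "c1"; Object c2 "c2"; Object c3 "c3"; Object c4 "c4";
    Object c5 "c5"; Object c6 "c6"; Object c7 "c7"; Object c8 "c8";

    [ Fill;
        move c1 to shelf; move c2 to shelf; move c3 to shelf;
        move c4 to shelf; move c5 to shelf; move c6 to shelf;
        move c7 to shelf; move c8 to shelf;
    ];

    [ Main i;
        for (i = 0 : i < 10000 : i++) {
            Fill();
            ! Cheap: always unlink the current first child.
            remove c8; remove c7; remove c6; remove c5;
            remove c4; remove c3; remove c2; remove c1;
        }
        for (i = 0 : i < 10000 : i++) {
            Fill();
            ! Costly: always unlink the current last child,
            ! walking the remaining chain each time.
            remove c1; remove c2; remove c3; remove c4;
            remove c5; remove c6; remove c7; remove c8;
        }
        print "Done.^";
    ];

With a fan-out of only eight the difference is negligible, of course; the point is just that raw opcode counts don't tell you much on their own.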

If we're just talking about something like "take the inputs from a buffer in memory, instead of reading them from a file" – sure, that's easy, go ahead if it makes you less worried. I am only urging caution about wholly (or mostly) "artificial" benchmarks. Yes, they're an easy way to ensure that your benchmark spends as much time as you want executing whatever mix of opcodes you want. That's precisely why it's so hard to know whether they meaningfully reflect reality. In those cases where a human waits too long for the game to react, what's that time being spent on? Even if you already know the answer precisely, any optimization aimed at addressing it should ideally also be validated against the real game, not (just) against a minimal test case.

1 Like

One thing that can cause notable slowdowns even on modern computers is heavy regular expression usage. That's perhaps one reason why smarter-parser-type extensions haven't been more widely used. So that's one potential source of realistic benchmarks.

The good news is that the I7 implementation of texts and regexes can be significantly improved, but that doesn't matter for using them in benchmarks now.

2 Likes

I was thinking more along the lines of not having read opcodes in the test file at all (unless the thing you want to test is in fact the read opcode).

You are definitely correct that eventually you need to test against actual games to verify, but having a quick and automated suite of test files that could point to potential problems seems worth it to me.

As I said in my first post: I'm super on board with building a benchmark suite and automating it. Of course, it should also be fast enough that you can get quick feedback from it. I just don't think that's at odds with using real games as a major or perhaps only component. For the Glulx interpreter I'm working on, my regular performance sanity check is timing a full walkthrough (ca. 500 turns) of Counterfeit Monkey, which takes around 12 or 13 seconds per run these days. I should probably add more diversity (CM is so damn convenient though), make each run take more like a second, and automate the whole thing – but it's already been tremendously helpful as-is.

1 Like

I use complete game run-throughs as part of the integration tests for my interpreter. I have a whole set of games together with scripted input and expected output. I would probably notice slowdowns there first, but having benchmarks broken down by opcode or in other ways would allow me to find performance regressions quickly.

1 Like