I am thinking of creating some Z-code files designed for benchmarking interpreter speed. Perhaps a suite of individual files for testing specific types of opcodes: call instructions, catch/throw, object manipulation, etc., and maybe a more comprehensive test that combines many types of instructions into something more representative of an actual game. Ideally all should run to completion without any player input, to make it easier to benchmark.
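To make the "run to completion without any player input" part concrete, here is a rough sketch of the sort of file I have in mind, written in Inform 7 purely for brevity (the title, objects, and loop counts are placeholders, and the indentation must be tabs in real source); a hand-written Inform 6 file could target individual opcodes much more directly:

"Opcode Shuffle" by A Tester

The Lab is a room. The Vault is a room.
A pebble is a kind of thing. Twenty pebbles are in the Lab.

[Skip the final RESTART/RESTORE/QUIT question so the run needs no keypress at all.]
The ask the final question rule is not listed in any rulebook.

When play begins:
    let total be 0;
    repeat with N running from 1 to 1000:
        repeat with P running through pebbles:
            now P is in the Vault; [object tree removal and insertion]
            now P is in the Lab;
            increase total by 1;
    say "Shuffled [total] times.";
    end the story.

Separate files could weight the loop body towards whatever is under test (routine calls, catch/throw, property access, and so on), while an external script simply times each run.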
I don't have a stake in this since I don't work with Z-code or Z-code interpreters, but from my experience with Glulx interpreters and lots of non-IF performance engineering I must advise against the approach of creating synthetic benchmarks from scratch.
I totally agree that it's important to have benchmarks and to automate them so they don't have a human in the loop. But in the end, the scenario that matters is a real human playing a real game on a real machine. The more a benchmark resembles that, the more useful it is for measuring, understanding, and improving the things that actually matter. Empirically, it's a lot easier to achieve that by starting with a "real workload" and simplifying it until it's a good benchmark than by starting from a blank slate and adding code that only exists to be benchmarked. Even writing an artificial little game in Inform 6 or ZIL or something and compiling it to Z-code will likely be more representative than focusing on individual opcodes.
Of course, there is a place for synthetic benchmarks. For example, if you've identified a specific performance issue in an interpreter and want to zoom in on it, crafting a test case that will exercise only that one thing may be useful. But you have to know what you're looking for first and remember to zoom back out afterwards. Asking questions you don't already know the answer to is a good way to get sent on a wild goose chase.
I'd normally agree, but the issue with Z-code is that it spends 99.9999% of its time doing absolutely nothing while it waits for user input. Scripting input to a game is likely to just end up testing how fast you can read from a file and feed it to the game.
I see advantages both in benchmarking separate aspects (stack, memory reads/writes, object tree changes, display updates) and in game-like comprehensive tests. A comprehensive test can show you whether a problem exists, and specific tests can narrow it down. While the desired end result is playing games as smoothly and efficiently as possible, it can be useful to know exactly why and where any slowdowns are happening.
Have you actually measured that? If that's true, why worry about interpreter performance at all?
Yes, with a human in the loop, most wall clock time is ideally spent waiting for the human to think of what to do next. Once they've picked something, there's going to be a delay between pressing enter and getting the response, which you can approximate by scripting input. If this delay is large enough to be perceptible for the human, reducing it may be useful. On the other hand, if it's small enough to be drowned out by the mere act of reading a short line of input text, then the game already reacts instantaneously as far as any human can tell (or text I/O is so catastrophically slow that it needs to be improved first). This is wonderful news, but it also means there's no reason to try and make it any faster.
Alternatively, there may be more performance work to be done in other contexts:
Perhaps the game you tried first is as fast as it'll get, but a more complex one still has some room to go.
Perhaps it's fast enough on your desktop computer, but not in the browser on a cheap phone used by millions of people around the world, or on various retro computers.
Perhaps further performance improvements would enable doing something more ambitious and computationally intense with Z-code than the average parser game.
Perhaps there is some other metric than raw interpreter throughput that matters, e.g., overall memory usage or additional human-perceptible latency in the display layer.
Each of these points suggests changes to the benchmark setup, but all of them still benefit from large, realistic benchmarks.
For Ozmoo, we have a benchmarking option, which builds a self-playing version of a game. When played, it prints the time elapsed at the end. It can currently handle two different games. We've used this a lot to figure out how valuable an optimization is, to decide if it's worth its cost in bytes.
Of course, Ozmoo runs on 8-bit computers, where it's a challenge to get anything running on a virtual machine to run at a reasonable speed. Your goals for optimization may look quite different for an interpreter running on more modern hardware.
If you can improve a 30ms turn response by 10ms, that's not perceptible. But the same optimization will probably improve a 600ms turn response by something like 200ms, and that is perceptible. So you need to be able to measure it.
In order to find out how good your optimization is, you need very precise timing. So it's worth isolating the text-input stage from the code-execution stage, even when the numbers are tiny. Especially when the numbers are tiny!
Thinking further: a very simple way to do this is to build an Inform 7 game with a line
Test me with "e / get all / w / s / s / s / unlock grate with key / d / turn on lamp"
Inform's TEST ME feature runs all the listed commands in a single game input. It's a good way to run through a lot of VM execution while amortizing out the cost of the I/O system.
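For example, the entire source for such a benchmark game can be as small as this (the room, the lamp, and the command list are placeholders; a real benchmark would chain together far more commands):

"Test Bench" by A Tester

The Cellar is a room. The brass lamp is a device in the Cellar.

Test me with "take lamp / turn on lamp / turn off lamp / drop lamp / look / take lamp".

Typing TEST ME, or feeding it as the single scripted input, then executes the whole command list in one go.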
I don't really disagree with anything you wrote. It's absolutely worth sweating the milliseconds. But a millisecond is practically a whole century for any reasonably modern CPU! Even if every turn takes only one millisecond when scripted, the vast majority of that time should still be spent on interpreter work and not on reading a line of input. If it isn't, that's also a performance issue worth addressing. Yes, file I/O from a physical disk can also be measured in milliseconds, but we're talking about very small files here, so even a benchmark that repeatedly reads the file "from disk" will get it from the OS page cache after the first run. So I struggle to imagine a scenario where you have so little actual work in the interpreter that you need to jump through hoops to isolate it from the cost of some tiny amount of I/O, but enough work that you can draw any meaningful conclusions from putting it under the microscope.
(This is not to say I am opposed to isolating one part of the workload when there's a good reason to. I have done more than my fair share of counting CPU cycles consumed by some inner loop and I don't intend to stop doing it. But in this specific context, I've not yet encountered the need.)
Players may think about input for minutes or longer, so yes.
What matters for player perception is what happens between inputs, between pressing enter and receiving the next prompt. That depends on how much code is executed each turn and on the nature of that code, which varies wildly in real-world games. An interpreter that works well for most games only to have noticeable slowdowns for more computationally intensive games is not ideal. Large amounts of object tree manipulation are a common culprit for slowdowns.
Perhaps scripting input isn't enough to affect testing results, or maybe it is, but why bother if you can just avoid the issue entirely? Like you said: "...the scenario that matters is a real human playing a real game on a real machine." So optimizing an interpreter's ability to switch in and out of input mode and receive input quickly is less important than other things the interpreter does, since in the real world input will arrive orders of magnitude more slowly. It's not like it's difficult to remove the need for input from a test file.
I just wanted to add that I do appreciate your input and I don't mean to be argumentative. It's just that I do see some utility in some "synthetic" benchmark files (duh, I probably wouldn't have suggested the topic in the first place otherwise).
I like to think of isolated tests as akin to unit tests and game-like comprehensive ones more like integration tests. There's utility in both.
To be clear, I was not referring to 99.999% of time being spent waiting for input, but to the latter part: that scripting input would mostly test file I/O speed rather than the thing that matters.
Fully agreed. But to improve those slow games and slow turns, you have to find them and measure them. Otherwise, how can you know what to focus your effort on, and how can you confirm that you've actually improved them? Of course, it's never a great idea to keep a glaring inefficiency untouched just because you don't have a benchmark that implicates it, but few things are as unreliable as programmers' gut feeling about how to make a program go faster. Even if you know, for example, that it's about object tree manipulation: the size and shape of the tree, how exactly it's being manipulated, and in which order, can all matter enormously, as well as other work that's interspersed with the object tree manipulation.
If we're just talking about something like "take the inputs from a buffer in memory, instead of reading them from a file": sure, that's easy, go ahead if it makes you less worried. I am only urging caution about wholly (or mostly) "artificial" benchmarks. Yes, they're an easy way to ensure that your benchmark spends as much time as you want executing whatever mix of opcodes you want. That's precisely why it's so hard to know whether they meaningfully reflect reality. In those cases where a human waits too long for the game to react, what's that time being spent on? Even if you already know the answer precisely, any optimization aimed to address it should ideally also be validated against the real game, not (just) against a minimal test case.
One thing that can cause notable slowdowns even on modern computers is heavy regular expression usage. That's perhaps one reason why "smarter parser"-type extensions haven't been more widely used. So that's one potential source of realistic benchmarks.
The good news is that the I7 implementation of texts and regexes can be significantly improved, but that doesn't matter for using them in benchmarks now.
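As a sketch of what such a regex-heavy benchmark might look like in Inform 7 (the text, the pattern, and the loop count are arbitrary placeholders, and the indentation must be tabs in real source):

"Regex Bench" by A Tester

The Lab is a room.

[Skip the final question so the run needs no keypress.]
The ask the final question rule is not listed in any rulebook.

When play begins:
    let hits be 0;
    repeat with N running from 1 to 500:
        if "the quick brown fox jumps over the lazy dog" matches the regular expression "q.+k.*l.zy", increment hits;
    say "[hits] matches found.";
    end the story.

Each iteration goes through I7's text and regex machinery, so nearly all of the run time should land in exactly the code paths being discussed.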
I was thinking more along the lines of not having any read opcodes in the test file at all (unless the thing you want to test is in fact the read opcode).
You are definitely correct that eventually you need to test against actual games to verify, but having a quick and automated suite of test files that could point to potential problems seems worth it to me.
As I said in my first post: I'm super on board with building a benchmark suite and automating it. Of course, it should also be fast enough that you can get quick feedback from it. I just don't think that's at odds with using real games as a major or perhaps only component. For the Glulx interpreter I'm working on, my regular performance sanity check is timing a full walkthrough (ca. 500 turns) of Counterfeit Monkey, which takes around 12 or 13 seconds per run these days. I should probably add more diversity (CM is so damn convenient though), make each run take more like a second, and automate the whole thing, but it's already been tremendously helpful as-is.
I use complete game run-throughs as part of the integration tests for my interpreter. I have a whole set of games together with scripted input and expected output. I would probably notice slowdowns there first, but having benchmarks broken down by opcode or in other ways would allow me to find performance regressions quickly.