Glulx to WebAssembly

This may be a bit different for Z-code, because it has more high-level opcodes, but when I profile Glulxe, Git, or my own Glulx interpreter, I see very little that’s both actionable and specific to one opcode:

  1. If the interpreter decodes every Glulx bytecode format from scratch every time an instruction is executed (Glulxe and mine), then that takes 40-60% of the total time. Any improvement or regression here will affect most opcodes equally. (Git shows you can do much better, but that requires a deep rewrite of the whole interpreter loop.)
  2. Git has an interesting failure mode for Counterfeit Monkey (and presumably other games of similar size). The hard-coded cache size for translating bytecode to its internal format is apparently too small, so it ends up spending about 40% of the time translating the same bytecode again and again (though it’s still much faster than the other interpreters). Increasing the cache size from 256 KiB to 8 MiB drops this to 0.4% and makes the whole thing go 1.7x faster! (It still spends more time than I’d like looking up compiled code in its cache.) Again, great to know, but not related to any one opcode or mix of opcodes. (There’s a sketch of the translate-once-and-cache idea after this list.)
  3. The next most expensive thing in Glulxe running CM is saveundo. I’ll admit, this is a straightforward case of “just make this one opcode faster”. But here the cost depends entirely on the game’s RAM size (and to a lesser extent, on how compressible it is).
  4. The other thing that sticks out in profiles of all three interpreters is the accelerated functions, especially those related to property access, and especially the binary search of the property table. It seems like a nice, well-isolated optimization problem, and Git has a leg up on the other interpreters here. But Git still spends at least 15% of its time on these operations, and I don’t see any way to do better without looking very deeply at patterns in how these instructions are used.
  5. After that, the profile starts getting pretty flat. At best you get a small single-digit percentage spent on one thing, and even those are rarely opcode-specific; they’re mostly interpreter overhead that gets smeared all over the place, such as Glulx stack manipulation, local variable accesses, reading and writing memory, and so on.
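To make the caching idea in item 2 concrete, here’s a minimal sketch in TypeScript. The `Decoded` type and `translateBytecode` are illustrative placeholders, not Git’s actual internal format:

```typescript
// Translate each chunk of bytecode once and reuse the decoded form.
type Decoded = { op: number; operands: number[] };

// Placeholder for the expensive decode/translate step.
declare function translateBytecode(addr: number): Decoded[];

const cache = new Map<number, Decoded[]>();

function decodedBlockAt(addr: number): Decoded[] {
  let block = cache.get(addr);
  if (block === undefined) {
    block = translateBytecode(addr); // the expensive part, done once
    cache.set(addr, block);          // a real cache must also bound its size
  }
  return block;
}
```

The failure mode described above is exactly what happens when the bound on the cache is too tight for the game: entries get evicted and retranslated over and over.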

And then there’s the other source of interpretation overhead that’s smeared all over the place: the struggle of the CPU’s branch predictor to learn something about the program being interpreted and steer the CPU to the right places in the interpreter loop in the right order. It’s difficult to assign blame for this with a CPU profiler, and it’s also very hardware-specific, but it probably does matter for Git, where all of the “trivial” opcodes (integer math, branches, loading and storing operands) are implemented very efficiently, so bytecode dispatch is a much bigger fraction of the CPU’s job.

5 Likes

I’ve also done some profiling at the Glulx bytecode level (pretty much only on CM). I don’t entirely trust the results yet, and I don’t know to what extent they would generalize, but here’s a profile of a full CM walkthrough, cm-release-speedscope.json (136.8 KB), which can be visualized online at https://www.speedscope.app/

One thing that stands out to me (in the “sandwich” view) is the 31% attributed to GetEitherOrProperty and the functions it calls. It all seems like a very thin layer on top of the already-accelerated property accesses, but those end up taking only a small fraction of the time. The remaining time is not spent in any heavy-duty computations, but in a soup of trivial checks repeated a million times.

I guess that means that just mapping individual Glulx functions and opcodes straight to wasm will (1) cut out a lot of interpretation overhead, but (2) still leave a lot of performance on the table. There seems to be a lot of wasteful abstraction that a more sophisticated JIT compiler could potentially identify and cut out with inlining and a laundry list of other optimizations.

1 Like

I have thought about using Cranelift to make a true JIT for Glulx, but you know, other things are a bigger priority. :laughing:

Even without that kind of JIT, adding a function-safety system would allow functions like GetEitherOrProperty to be accelerated too. By which I mean allowing safe functions to be called directly, skipping the Glulx stack. Such a system is actually ideal for JS, whereas it might be harder to get working if part of the VM is WASM and part is JS.

2 Likes

The way I see it, there are two dimensions along which a Glulx JIT compiler can get more sophisticated.

A lot of what we’re talking about is stuff that an AOT compiler could also do (except that it would always have to include an interpreter as fallback to be safe): only optimizing code in ROM, inlining direct calls, cleaning up local inefficiencies that the Inform 6 compiler didn’t catch because it has other priorities, and proving when it’s safe to skip a Glulx stack frame entirely (I think that’s the heart of what you call accelerating functions). All of that works perfectly fine within the limitations of emitting wasm in the end.

Other kinds of cleverness seem to require that the JIT have more control over the final machine code, including the ability to remove or rewrite code after it has already executed once (essentially impossible in wasm). That includes inline caching, generating code incrementally at very fine granularity rather than in big batches (e.g., lazy basic block versioning as done in YJIT for Ruby), and doing highly speculative optimizations with efficient bail-out when they’re invalidated. I’m not sure any of those are all that useful for something as low-level as Glulx. And Cranelift specifically doesn’t support most of them either.

Besides, compiling to wasm has the overwhelming advantage of actually working on all platforms that people care about. In browsers and iOS apps, JIT-to-native simply isn’t an option.

I don’t think that sort of model makes much sense. The display layer will be JS, of course, and it may be more convenient to write the JIT compiler itself in JS rather than in a language compiled to wasm. But I don’t see any reason for “runtime support functions” that would frequently call or be called from JIT’d code to be written in JS. Even if raw wasm-JS calls are reasonably fast these days, unlike JS-JS calls they can’t be inlined, and wasm-wasm calls typically aren’t inlined either. So if someone were to compile Glulx to wasm, they’d have to reconsider what should be in a separate function to begin with. For example, memory accesses should definitely be inline code (not helper functions as they are in Quixe). A few things are both performance-sensitive and can’t reasonably be inlined into JIT’d code, but none of them should be so complex that they can’t be in wasm rather than JS.
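As a hedged illustration of the inlining point (all names here are mine, not Quixe’s actual code generator): a compiler emitting JS or wasm can splice the memory read directly into the generated code instead of routing every access through a helper:

```typescript
// Two ways a Glulx-to-JS compiler could read a big-endian 32-bit word.
declare const mem: Uint8Array;

// Helper-call style: every memory access pays a call boundary.
function readWord(addr: number): number {
  return (((mem[addr] << 24) | (mem[addr + 1] << 16) |
           (mem[addr + 2] << 8) | mem[addr + 3]) >>> 0);
}

// Inline style: the code generator splices this expression into the
// emitted code, so there is no call for the host JIT to (fail to) inline.
// Assumes `addrVar` names a variable with no side effects.
function emitReadWord(addrVar: string): string {
  return `(((mem[${addrVar}] << 24) | (mem[${addrVar} + 1] << 16) | ` +
         `(mem[${addrVar} + 2] << 8) | mem[${addrVar} + 3]) >>> 0)`;
}
```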

That’s actually a very good point: Android (and I’m sure iOS, too) has policies that forbid downloading and running unsandboxed code, so an IF runtime on those systems has to either use a battle-tested sandboxed solution like WASM, interpret everything, or just use HTML output and a WebView.

Another idea I had: maybe the other way around would be better. Have WASM as the main target for modern development systems, and provide a compiler from WASM to Glulx or even the Z-machine as a secondary target. Again, I’m not familiar with Glulx or Z-machine instructions, but the problems posed earlier shouldn’t apply in the other direction, right? The main problem would probably be optimizing the output, since the main appeal of Z-machine output would be running on retro hardware with tighter memory and CPU constraints.

1 Like

Wasm (the MVP version, at least) is reasonably easy to compile to any target that’s vaguely like a CPU with byte-addressable memory, except possibly the floating-point math. But yeah, it’s not gonna work (well) if the wasm module needs more RAM than the target machine has. And the detour through Glulx or Z-code is probably at least as difficult as just implementing wasm directly for the target machine, even if you don’t care about efficiency, and even more so if you do.

A much more fundamental question is how to do I/O. Glulx has Glk; the Z-machine has I/O built in. An IF system that emits wasm has to pick or make up some way of doing I/O, and that will determine whether it’s practical to lower the wasm to Z-code, Glulx, or directly to some retro platform (imagine a Vorple-style “the I/O system is arbitrary HTML + JS”). It will also heavily affect what sort of user experience is possible. All of this is more important than how the arithmetic and memory accesses end up being executed.

Fixing C mode to have working save/restore would probably be better than trying to directly emit WASM.

(I believe Tarek’s idea was about future development systems, not about Inform.)

1 Like

Well sure, but any such plan proposing a universal WASM IF target would no doubt include Inform, and it would probably be one of the first dev systems to be adapted as most of the others aren’t well known or well maintained.

1 Like

I don’t know, apparently there are machines out there with a Z-code interpreter but without a C compiler. At least, I wouldn’t want to make a WASM interpreter for those, especially if a Z-code interpreter already exists. But compilation to Z-code or Glulx would probably be a secondary goal, since WASM already works on all modern platforms.

I’d probably adapt Glk into a WASM interface, or maybe use the Component Model in a few years, when it’s more robust and widely supported. Because the function names are predetermined as imports, you can provide them via the WASM runtime/JS, or insert the appropriate calls when lowering to Glulx or Z-code (though AFAIK that would only provide limited support for Glk features on Z-code).
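A minimal sketch of what the fixed-import idea could look like on the JS side. The `glk` namespace and both function names are hypothetical placeholders, not a real proposed interface:

```typescript
declare const wasmBytes: BufferSource;
let memory: WebAssembly.Memory; // assigned after instantiation

// Because the import names are fixed, the same wasm module can be linked
// against this JS implementation on the web or against host functions in
// wasmtime.
const imports = {
  glk: {
    put_string: (ptr: number, len: number): void => {
      const bytes = new Uint8Array(memory.buffer, ptr, len);
      // hand the decoded text to the display layer
      console.log(new TextDecoder().decode(bytes));
    },
    window_open: (kind: number): number => 0, // stub: no real windows here
  },
};

const { instance } = await WebAssembly.instantiate(wasmBytes, imports);
memory = instance.exports.memory as WebAssembly.Memory;
```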

I also think that’s the way to go: have save/restore functionality in the code directly instead of trying to provide it automatically by copying the WASM stack and memory. Though if that relies on setjmp/longjmp, it would still need Asyncify, but IIRC that should be easy to set up for C with Emscripten.

For now, anything based on Glk would need Asyncify, as the Glk API uses blocking functions. Unless maybe you do it with threads instead, which is arguably just as hacky as Asyncify. There will probably eventually be better options, but Asyncify isn’t a problem either.

1 Like

Just to add: I wrestled with these sorts of problems. WASM threads require changes to the hosting server. I didn’t want this restriction. I was in a pickle until I discovered Emscripten supports cooperative fibers. Fibers require Asyncify, though. But it does all work. That’s what I did to create the illusion of the game and the UI running at the same time.

1 Like

Threads would be another option for that; I don’t think they’re hacky, and they’re supported on the web and in wasmtime, but on the web they require isolation headers from the server, so I’d like to avoid that. I have managed to make a workaround for that, though: you can register a service worker that adds the appropriate headers, so that you don’t need server support. The final output would then be three files: a simple HTML container with setup JS, the service worker, and the wasm file. To avoid subjecting the main page to the isolation restrictions, you’d put the game in an iframe.
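A minimal sketch of that service worker (the two header values are the standard cross-origin-isolation ones; everything else is illustrative, and error handling is omitted):

```typescript
// Re-serve every response with the cross-origin isolation headers that
// SharedArrayBuffer/threads require, so the hosting server itself needs
// no configuration.
self.addEventListener("fetch", (event: any) => {
  event.respondWith((async () => {
    const response = await fetch(event.request);
    const headers = new Headers(response.headers);
    headers.set("Cross-Origin-Embedder-Policy", "require-corp");
    headers.set("Cross-Origin-Opener-Policy", "same-origin");
    return new Response(response.body, {
      status: response.status,
      statusText: response.statusText,
      headers,
    });
  })());
});
```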

I think that’s what I just described and provided a possible solution for.

Asyncify would probably be a good option, at least initially, until a threads solution is there, since Asyncify also creates some overhead.

The most efficient way would be if wasm engines had native support for letting wasm code make apparently-synchronous calls to async JavaScript (there’s already experimental support for this in some browsers, but it’s not yet standardized). Otherwise, the wasm code needs to be written in a style where it can suspend itself instead of making a blocking call, and resume execution when the call is done. That’s very similar to what save/restore needs! Async calls don’t need the ability to resume one year later on an entirely different machine, but both need to reify all of the program’s current state, including what code to run next, into a data structure.

For an interpreter, both save/restore and suspend/resume around a Glk call are straightforward, because the interpreter already stores 99% of the relevant state in data structures. My Glulx interpreter handles all Glk calls this way, except the basic ones needed for the @stream[...] opcodes. Every @glk instruction suspends the VM and yields control to the host, which performs the call and passes the result back into the VM when it’s done. In the wasm + JS setting, this allows the Glk call to be asynchronous JavaScript (callbacks and/or promises). I have this working with glkote-term in node.js, but I haven’t wired it up to asyncglk yet.
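In JS/TS terms, the shape of that suspend/resume loop is roughly this (illustrative names, not my interpreter’s actual code):

```typescript
// The VM runs as a generator: each @glk opcode yields the pending call,
// and the generator is resumed with its result.
type GlkCall = { selector: number; args: number[] };

declare function executeUntilGlkCall(): GlkCall | null; // run ordinary opcodes
declare function pushGlkResult(value: number): void;

function* runVM(): Generator<GlkCall, void, number> {
  for (;;) {
    const call = executeUntilGlkCall();
    if (call === null) return;  // the game quit
    const result = yield call;  // suspend; the host performs the call
    pushGlkResult(result);      // resume with the result and keep going
  }
}

// The host side is then free to be fully asynchronous.
async function hostLoop(performGlk: (c: GlkCall) => Promise<number>) {
  const vm = runVM();
  let step = vm.next();
  while (!step.done) {
    step = vm.next(await performGlk(step.value));
  }
}
```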

When Inform is compiled to C, or Glulx is compiled to wasm, the same idea applies at a larger scale. You could throw asyncify at it as the last step, but it’s also possible to directly generate code that supports suspending itself and resuming later. If you solve this for save/restore, the same mechanism should apply to Glk calls, even if there’s more of them. I don’t know if that will be more efficient than other approaches, but at least for a JIT compiler it would avoid the need to run Binaryen in every end user’s browser.

I just checked: wasmtime actually supports async host calls, and with the JS Promise integration feature (which should work in basically all browsers nowadays), async host calls are also possible on the web. That’s why I’d leave the undo functionality to the generated code. I mean that when using wasm as the main target and JIT-compiling Glulx to wasm, appropriate instrumentation would have to be inserted for save/load operations, which would probably amount to the same thing Asyncify does, but slower, since you don’t run the equivalent of wasm-opt afterwards. I’d probably actually include Binaryen, since it also has a JS API. I think going from high-level systems like Inform to WASM directly would be better (or with a detour through C).
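For reference, the JS Promise integration surface (as currently shipped in some engines; not yet in TypeScript’s built-in types, hence the cast) looks roughly like this. The `glk_select` import and the module layout are hypothetical examples:

```typescript
declare const wasmBytes: BufferSource;
declare function waitForGlkEvent(eventPtr: number): Promise<number>;

const WA = WebAssembly as any; // JSPI isn't in TS's lib types yet

// Suspending wraps an async import so the wasm code sees an
// apparently-synchronous call.
const suspendingGlkSelect = new WA.Suspending(waitForGlkEvent);

const { instance } = await WebAssembly.instantiate(wasmBytes, {
  glk: { glk_select: suspendingGlkSelect },
});

// promising wraps an export into an async JS function.
const main = WA.promising(instance.exports.main);
await main(); // the wasm main loop, free to "block" on glk_select
```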

1 Like

Funny that this thread came up right while I was hard at work sending things in the opposite direction, developing Wasm2Glulx. I think what’s being proposed here, translating Glulx into Wasm, is a much harder problem, for reasons that others have already been reaching toward. WebAssembly mandates structured control flow so that the validator can statically prove control-flow integrity; this is what makes it possible to JIT compile it while maintaining a secure and low-overhead sandbox. That Glulx allows jumping from one function to another isn’t a problem: as long as you can identify all the call/jump instructions and their destinations, you can collect them into a control flow graph and then generate properly structured WASM from that. The problem is that you can’t always identify jump destinations, because Glulx permits indirect and computed branches. That is, the argument to a call or jump instruction doesn’t have to be an immediate; it can be the result of an arbitrary computation. This makes the control flow graph uncomputable in general.

2 Likes

You could make a JIT, but each time an indirect jump is encountered with a previously unseen destination, you’d have to JIT-compile the relevant section.
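A sketch of that lazy strategy (all names are illustrative):

```typescript
// Keep a table of entry points that have already been compiled; when an
// indirect jump lands on an address we haven't seen, compile from there
// (or fall back to an interpreter).
interface VMState { pc: number; running: boolean }

// Compiles the code reachable from `addr` and returns a function that runs
// it until the next indirect jump, returning the new target address.
declare function compileFrom(addr: number): (vm: VMState) => number;

const compiled = new Map<number, (vm: VMState) => number>();

function run(vm: VMState): void {
  while (vm.running) {
    let block = compiled.get(vm.pc);
    if (block === undefined) {
      block = compileFrom(vm.pc); // pay the JIT cost once per new target
      compiled.set(vm.pc, block);
    }
    vm.pc = block(vm);
  }
}
```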

I’m working on the other direction: The next IF thing I’ll probably make is a standard draft for a glk-like WASM interface, with reference implementations for web and wasmtime.

That would be terrific; I was already thinking about the same thing and would like to collaborate on that so Bedquilt can support it as a target. A future in which IF interpreters support WebAssembly natively, and Wasm2Glulx is no longer required, would be really nice.

The API imported by the WASM module should look as much as possible like GlkOte.