What can and should we expect from the Glulx heap?

Continuing the discussion from If we malloc a memory range on glulx, can we dynamically check its extent?:

I understand where this is coming from: if you have a lot of objects of the same size, and can efficiently organize your available memory around these size classes, you need less metadata, less complexity, less computation per (de-)allocation. In contrast, Glulx interpreters have to be reasonably “general purpose”. They can’t even rely on common simplifications made by other general-purpose allocators (e.g., size class rounding or alignment rules) because they have to work with existing heap layouts from saved games created by other interpreters that didn’t necessarily apply the same rules.

However, that doesn’t automatically make it a good idea to layer another allocator on top of it from Glulx code. Besides more complexity and more code that has to be interpreted, there’s a risk of greatly amplifying memory fragmentation. And it’s not a given that this buys you anything. In the world of boring, non-IF software, custom allocators were popular and useful in the dark ages when general-purpose allocators often plainly sucked. However, they stopped making much sense, except possibly for very restricted (de-)allocation patterns, once general-purpose allocators got their act together. This started in the 90s with Doug Lea’s malloc and Wilson et al.’s important “survey and critical review”.

I know that all currently existing Glulx interpreters have very simplistic heap implementations, but some updates will probably be necessary anyway if IE-0008 is ever implemented in some form. I hope that they can become good enough at managing many small heap blocks that layering more allocators over them won’t be worth the trouble. For example, I have a sketch for a design that uses as little as 4 bytes of metadata per allocated block while supporting (de-)allocation in O(log(number of blocks)) time with, hopefully, quite decent constant factors. The only drawback is that the space efficiency hinges on carefully implementing a somewhat fiddly textbook data structure (some sort of B-Tree), so I’ve put off working on that.

It’s a bit of a chicken-and-egg problem: as long as Glulx programs don’t stress malloc/mfree at all, it’s hard to justify putting effort into it, and difficult to get a realistic benchmark for guiding the effort. But as long as malloc/mfree implementations remain as they are, it would be ill-advised to use them heavily. So I look forward to seeing what mad science @Zed will do.

Have people measured/observed an actual issue?

Even a very naive implementation would still be a lot faster than what you could do in Glulx code. Except perhaps for an arena allocator, where all you do is check/set a bit.
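
To illustrate what “check/set a bit” means here, a minimal arena-allocator sketch in C (all names are made up, and a single fixed slot size is assumed): every block is one slot, and the only bookkeeping is a bitmap.

#include <stdint.h>
#include <stddef.h>

#define SLOT_SIZE  32          /* every allocation occupies one 32-byte slot */
#define NUM_SLOTS  1024        /* arena capacity: 32 KiB */

static uint8_t  arena[NUM_SLOTS * SLOT_SIZE];
static uint32_t used[NUM_SLOTS / 32];   /* one bit per slot */

/* Allocate one slot: find a clear bit, set it, return that slot's memory. */
static void *arena_alloc(void) {
    for (int w = 0; w < NUM_SLOTS / 32; w++) {
        if (used[w] != 0xFFFFFFFFu) {
            int b = 0;
            while (used[w] & (1u << b)) b++;
            used[w] |= (1u << b);
            return arena + (size_t)(w * 32 + b) * SLOT_SIZE;
        }
    }
    return NULL;               /* arena full */
}

/* Free a slot: just clear its bit again. */
static void arena_free(void *p) {
    size_t slot = ((uint8_t *)p - arena) / SLOT_SIZE;
    used[slot / 32] &= ~(1u << (slot % 32));
}

The trade-off is obvious: this only works when everything fits the one slot size, which is exactly the “very restricted (de-)allocation pattern” case.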

Glulx implementations could improve their algorithms while still loading old save files, since a save file only records where the allocations are, not how the backend is designed. Non-aligned allocations could be rounded up, for example.

And because the smallest practical allocation is 32 bytes, I suspect that even if Data Structures became commonly used, most allocations would be the same length, so a naive implementation might still be okay.

1 Like

I’m not sure what exactly you’re looking for. Part of why I opened this topic is that I’m worried about custom allocator shenanigans becoming entrenched and remaining even if/when they are (no longer) solving an actual issue. However, all signs point to them being useful today, if your heap is going to contain more than a few thousand objects.

There have been a couple of threads over the years where the Flex layer was (partially) blamed for slow performance. At one point, @eu1 did some template hackery to try and estimate the effect of using @malloc/@mfree instead. Results were… mixed, though there are several confounding factors (e.g., an optimization in BlkSize that is entirely unrelated to using @malloc) and the benchmark is very artificial. One interesting observation is that the terps (then and now, the relevant code hasn’t changed in forever) slow down considerably as the heap fills up with more blocks, though the Flex layer does even worse.

This is easy to reproduce, if you aren’t opposed to artificial micro-benchmarks. A trivial I6 program that just allocates N = 10k blocks in a loop and then exits takes ca. 100 ms to run on my machine with the Git and Glulxe builds I have lying around, ca. 400 ms for N = 20k, and ca. 900 ms for N = 30k. I swear I didn’t fudge these numbers, it’s just a textbook case of doing N² work (going through n extant blocks before finding a free block to use for the nth malloc).
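
For a sense of where the N² comes from, here is roughly the shape of what the current terps do, as a hedged C sketch (made-up names, not the actual heap code in Glulxe or Git): a single address-ordered list of blocks, scanned from the start on every @malloc. With n live blocks that is O(n) per allocation, hence O(N²) for the loop above.

#include <stdint.h>

/* One node per block, kept sorted by address (hypothetical layout). */
typedef struct block {
    uint32_t addr, len;
    int free;
    struct block *next;
} block_t;

static block_t *heap_blocks;   /* head of the address-ordered block list */

/* Address-ordered first fit: walk every block until a free gap of at
   least `len` bytes turns up. */
uint32_t heap_malloc(uint32_t len) {
    for (block_t *b = heap_blocks; b; b = b->next) {
        if (b->free && b->len >= len) {
            /* ... split off any remainder and keep it free ... */
            b->free = 0;
            return b->addr;
        }
    }
    return 0;                  /* no fit: would have to grow the heap here */
}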

At least the simple linear-list-of-blocks approach used by current terps can slow down to the point of being completely unusable (see above). Beating that with a pure-Glulx allocator isn’t hard, even if it’s a very conventional general-purpose allocator – it’s just Flex specifically that’s even worse. Using intrusive linked lists and inline object headers, the common path for (de-)allocation can be less than ten Glulx instructions, which is probably pretty competitive even with a smarter terp.
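
To show why that pure-Glulx fast path can be so short, here is the same idea as a hedged C sketch (the real thing would be Glulx code, and all names below are made up): each size class keeps an intrusive singly linked free list threaded through the free blocks themselves, so the common case is a couple of loads and stores.

#include <stddef.h>

/* Free blocks of one size class, chained through their own first word. */
typedef struct free_node { struct free_node *next; } free_node;

static free_node *free_list[8];   /* one list per size class (hypothetical) */

/* Fast-path allocation: pop the head of the matching free list. */
static void *fast_alloc(int size_class) {
    free_node *n = free_list[size_class];
    if (n) {
        free_list[size_class] = n->next;
        return n;
    }
    return NULL;   /* slow path: grab a fresh chunk (e.g., via @malloc) */
}

/* Fast-path free: push the block back onto its size class's list. */
static void fast_free(void *p, int size_class) {
    free_node *n = p;
    n->next = free_list[size_class];
    free_list[size_class] = n;
}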

Better algorithms are definitely possible, but they’re constrained or complicated by saved game states. To keep with the alignment example: if you start with an empty heap and do @malloc 1; @malloc 4; then all currently existing terps will put the first allocation at the heap start and the second one right after, and they have to remain there until they’re freed. So whatever a smarter terp does, it can’t simply assume that all block sizes or block addresses have any particular alignment. It may ensure that for its own allocations, but it has to be able to deal with exceptions somehow. I’ve come to the conclusion that it’s better to just choose implementation strategies that don’t need such assumptions in the first place.

It’s true that many programs mostly allocate objects of a few distinct sizes. This, together with other trends in program behavior, makes it feasible to have fast and good (with respect to fragmentation) memory allocators in the first place. The question is what you mean by “naive implementation”. Currently, all interpreters implement the address-ordered first fit policy, which actually seems pretty great at limiting fragmentation (see The Memory Fragmentation Problem: Solved? by Johnstone and Wilson). The specific data structure they use, however, scales very badly as the heaps get larger. On the other hand, going all-in on the assumption that most objects have the same fixed size may lead you to completely separate free lists per size class or object type. This also gives a simple implementation, and it’s easy to make (de-)allocation very fast, but the simple ways of doing it can easily lead to enormous fragmentation (again, see Johnstone and Wilson).

And after so many words about what doesn’t work well, I owe you some positivity. A relatively simple way to fix the worst-case performance while maintaining low fragmentation is to change the policy to some kind of best fit. That means allocating n bytes picks a free block of exactly n bytes if one exists, or otherwise one as close as possible (to minimize the remainder that’s split off and kept free). How to choose between different blocks of the same size doesn’t seem to make much difference; there are at least three simple variants that did well in past studies.

The important part is that “find a free block of exactly n bytes, or as close as possible” can be implemented in a scalable way by keeping one list of free blocks per size (e.g., all 10-byte free blocks, and only those, are stored together in one list). Then you just need scalable ways to:

  • Find the relevant free list for a requested size (for @malloc).
  • Find the size of a block from its start address (for @mfree).
  • Find out whether a block that was just freed is adjacent to another free block, and if so, merge them (putting the merged block into the free list for its combined size, so it’ll be available for large @mallocs).

It’s not a novel observation that all of these can be achieved with some sorted map data structures (e.g., a binary search tree or trie). The first point calls for a sorted map from block size to the free list for that size (storing only the sizes for which there’s any free block). The second and third points need a map from a block’s address to some metadata (its size, whether it’s free or in use, and maybe something that helps remove it from the free list it currently resides in). The only problem for Git, Glulxe and Quixe is that neither C nor JavaScript has such a data structure built in, so those terps would have to make one for themselves or pull in a library for it.
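
To make the shape concrete, here is a hedged C sketch of the two maps and how @malloc/@mfree might use them. The sorted_map_* functions and every other name below are stand-ins I’ve made up, not an existing library API; a real terp would substitute whatever balanced tree or B-tree it brings along.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-block metadata, kept in the by-address map (hypothetical layout). */
typedef struct block_info {
    uint32_t addr, size;
    bool     is_free;
    struct block_info *next_free, *prev_free;  /* links in its size's free list */
} block_info;

/* Stand-in for a sorted map (balanced tree, B-tree, skip list, ...). */
typedef struct sorted_map sorted_map;
void *sorted_map_get(sorted_map *m, uint32_t key);        /* exact key or NULL   */
void *sorted_map_get_ceil(sorted_map *m, uint32_t key);   /* smallest key >= key */
void  sorted_map_put(sorted_map *m, uint32_t key, void *value);
void  sorted_map_remove(sorted_map *m, uint32_t key);

static sorted_map *free_lists_by_size;  /* size    -> head of that size's free list */
static sorted_map *blocks_by_address;   /* address -> block_info                     */

/* Best fit: take a free block of exactly `len` bytes if any exists,
   otherwise the smallest larger one. */
uint32_t heap_malloc_best_fit(uint32_t len) {
    block_info *b = sorted_map_get(free_lists_by_size, len);
    if (!b)
        b = sorted_map_get_ceil(free_lists_by_size, len);
    if (!b)
        return 0;  /* nothing fits: grow the heap instead */
    /* ... unlink b from its free list (updating free_lists_by_size if that
       list becomes empty), split off any remainder and file it under the
       remainder's size ... */
    b->is_free = false;
    return b->addr;
}

void heap_mfree(uint32_t addr) {
    block_info *b = sorted_map_get(blocks_by_address, addr);
    b->is_free = true;
    /* ... look up the neighbouring addresses in blocks_by_address; if either
       neighbour is free, merge with it, then push the resulting block onto
       the free list for its (possibly combined) size ... */
}

The point is just that every step is a lookup or update in a sorted map, so each (de-)allocation costs O(log n) instead of a linear scan over all blocks.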

Things only get really complicated when you start caring a lot about (1) how much extra memory is needed for all that metadata, and (2) how many nanoseconds it typically takes to allocate or free a small block. But that shouldn’t stand in the way of fixing the O(N²) behavior of what terps currently do.

oh, I doubt I’ll pursue anything to the point of actual usefulness. I’m just playing around. I think there would very likely be a big speed advantage to using character array strings instead of texts, a big enough difference to make practical text manipulation that would be too slow today. But that’s if you replaced the text code altogether: you’d likely lose the advantage if you were translating texts back and forth to character array strings all the time. (You’d likely also need to implement hashes to be able to do a lot of text-munging things you might want to do.)

I haven’t really tackled the Neptune docs yet, but it looks like the punctuation feature might make it possible to say, e.g.,

q is a string variable.
q is initially «Hello world».

I’ll have to try it out some time.

1 Like

The worst thing for performance is the multiple-block system. Dropping that and directly accessing the allocation, rather than having to call a function (BlkValueRead), will make a huge difference.

I had just assumed that the VM’s malloc opcode would be faster than an I6 malloc implementation. If that’s not true, then we should definitely push for the terps to be updated with better implementations. If the savefile format means the VM is fundamentally limited then perhaps some hybrid system could be used where the VM allocates large arenas which I6 then subdivides.

Like with the random number system, I’m really not a fan of “interpreters don’t handle this efficiently at the moment, let’s abandon the interpreter implementation and handle it entirely in I6 code”. I understand it’s faster than changing every interpreter, but I think we should philosophically be able to trust that the interpreter will do what the specification says, in a way that works well on its platform.

2 Likes

The flip side is that you design a new interpreter feature, get it into all the interpreters, and then find that it doesn’t handle the real-world requirements and therefore nobody uses it.

I don’t think I noticed before exactly what these numbers are. Yes, 0.9 s is really bad. But 30k allocations is also absolutely massive. I doubt you’d even get to 1k concurrent allocations most of the time. (The standard Glulx Flex_Heap is 32kB, so it could fit at most 1000 allocations. It can then use @malloc to get more space, but I assume it usually doesn’t have to.) 100 ms cumulative for 10k allocations, which would normally be spread out over many turns, sounds reasonable enough to me. That’s why I asked if it’s actually been observed to be an issue; some of our concerns here could be premature optimisations. Of course that doesn’t mean we shouldn’t look for improvements.

Could you adjust your benchmark to also check how FlexAllocate compares?

We should also be able to profile something like Counterfeit Monkey to check how many times it calls FlexAllocate over a full playthrough. I used to remember how to do that…

1 Like

Well, I found an old CM profile, and FlexAllocate was called 357K times, taking 7 seconds. And that was for a test that only called glk_select 21 times…

Now I’d assume that it doesn’t keep anywhere near 300K concurrent allocations, but the brief profile doesn’t include FlexFree. Most of the time it will set up a bunch of allocations before a function, then free them afterwards.

I’m still not convinced the VM side of it needs a big overhaul. I think most of the improvements can be made on the I6 side.

Probably what we should really do is just make a prototype of replacing the code in Flex.i6t with code that calls the Glulx opcodes. Then we can run some proper benchmarks.

1 Like

Interesting, that’s a lot more than the current version I profiled yesterday: 15845 FlexAllocate and 14569 FlexFree over a full test_me.txt playthrough (554 glk_select calls). Guess it’s another thing where CM benefited from a lot of optimization effort over the years. Some time ago I tried to build a version of CM from around the time when that profile was taken, but I couldn’t find compatible versions of some extensions and gave up eventually.

I started doing this with the 6M62 version of Flex.i6t a few months ago, but never got it anywhere close to a working state before I put it aside. It definitely needs to be done before trying any changes that are supposed to make things 2x or 10x faster or lower fragmentation or whatever. It’s super easy to create synthetic benchmarks for memory allocators, and virtually all of them are misleading at best – there’s no substitute for careful experiments with real workloads.

However, “accidentally quadratic” style issues are a little different. The annoying thing about those is, even if the constant factors are very favorable, they go from irrelevant to game-breaking very quickly once you hit a certain size. Sometimes that’s an entirely theoretical concern (e.g., there will never be more than a few hundred Glk functions or a few dozen Glulx accelfuncs), but that’s not the case here. 10k or 100k heap blocks (just a few MB of RAM!) isn’t an inconceivably huge number, even for a text-based game. Existing Inform games that one would take as benchmarks are unlikely to reach it, but that’s partially because genre conventions, standard rules, and conventional wisdom in the community all encourage a style that minimizes dynamic memory allocation in general. Ambitious projects that ignore these “guardrails” quickly run into various performance issues, and may give up or change course before they get to a point where heap management becomes their biggest problem.

That’s not to say that IE-0008 won’t be an improvement all on its own. It very well might be! We’ll have to see what happens when it’s implemented, but it’s certainly plausible that it’ll increase the limits of what one can reasonably do with Inform. I’m just saying it probably won’t increase those limits anywhere close to where they ought to be (regarding heap size).

2 Likes

If you’re still interested in doing this, I might be able to help.

3 Likes

That would be great! I’ll get back to you with specific questions once I figure out what the hell I was doing. I just found a checkout of a July 2016 commit (76061d3) that had a lot of extensions added, but no clue where they’re sourced from. After adding one more missing extension, I got far enough for a Linux build of ni to crash while “drawing inferences” – but the 6M62 release of the Windows IDE manages to build it! The result appears to run correctly (with some debugging output that I guess is intentional), but I have no idea if I’ve picked the right versions of these extensions, how I ended up with that commit, and whether it’s an interesting commit to profile.

1 Like