Inform 7's (lack of) Speed

I’m interested, of course. I added the opcode calls partly for your benefit, so let me know how helpful they are, if at all. I think that for the typical turn in a typical game, the parser accounts for most of the processing a game needs to do, so speeding it up is fruitful for everyone.

I do feel there are a lot of performance opportunities that Inform 7 & 6 miss and that could be implemented with reasonable effort, but until it goes open source there ain’t much for us to do on this end. For example, wrapping #ifs around stuff like empty rulebooks.
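To sketch what I mean (purely hypothetical – the compiler emits nothing like this today, and the constant name is made up): if the compiler defined a constant for each rulebook it knew to be empty, the template could skip the call entirely:

[code]! Hypothetical: suppose the compiler defined EMPTY_RULEBOOK_204
! whenever rulebook number 204 ended up with no rules in it.
#Ifndef EMPTY_RULEBOOK_204;
ProcessRulebook(204);
#Endif;[/code]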

Of course, since the procedural rules are going away (and if Dannii gets his way with a series of rules being compiled into the same function), I imagine the overhead of calling rulebooks, and maybe rules, will decrease dramatically.

I suspect a lot of gain would come from changing the parser to use regexes throughout, and then adding acceleration for regexes (which would be pretty trivial, and a good idea even if the parser isn’t changed).

I have grave doubts about that. The parser is designed around the idea that words can be compared in constant time. Going to per-letter regexes sounds like a nightmare, even if they’re running in native code.

(Using a regex algorithm based on words, rather than letters, would avoid that. At that point you’re talking about building a state machine for word sequences, which does resemble the problem that the parser solves now… but you’d have to demonstrate that you’re accelerating the slow part of the problem, rather than the already-fast part. For example, the deciding-concealed-possessions code we were just talking about – that’s a hook for user code. It would be exactly as much of a problem inside a regex parser.)
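To make “a state machine for word sequences” concrete, here’s a toy sketch in I6 – purely illustrative, with made-up states and accept codes, nothing like real parser code. Each state is a row of (dictionary word, next state) pairs, and the matcher walks the typed words through it:

[code]Constant ACCEPT_ONE = -1;   ! "take lamp" style: a single noun
Constant ACCEPT_ALL = -2;   ! "take all"

Array state_1 --> 'take' 2 'get' 2 0;
Array state_2 --> 'lamp' ACCEPT_ONE 'all' ACCEPT_ALL 0;
Array states  --> 0 state_1 state_2;

[ MatchWords words numwords  state i j row;
    state = 1;
    for (i = 0 : i < numwords : i++) {
        row = states-->state;
        state = 0;
        for (j = 0 : row-->j ~= 0 : j = j + 2)
            if (row-->j == words-->i) { state = row-->(j+1); break; }
        if (state < 0) return -state;   ! reached an accepting transition
        if (state == 0) rfalse;         ! no transition: reject
    }
    rfalse;                             ! ran out of words mid-pattern
];[/code]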

If this has produced a noticeable speed-up, it may be worth looking at re-writing all the memory block handling code in Flex.i6t. The I7 library has logic here to maintain its own memory heap, but if you’re running on a modern Glulx interpreter then you should be able to just rely on the interpreter’s heap support (via the @malloc and @mfree opcodes) to get the same effect with much greater efficiency. (I haven’t actually tried to do this as yet, but it is something I’ve considered.)
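As a rough sketch of the idea (nowhere near the real Flex.i6t code, which also manages block headers and multi-block chains; the routine names here are mine): the allocation layer could become a thin wrapper over the interpreter’s heap:

[code]! Minimal sketch, assuming the terp supports the heap (checkable via @gestalt).
[ HeapBlockAllocate size  addr;
    @malloc size addr;     ! ask the interpreter's heap for a block
    if (addr == 0) { print "*** heap allocation failed ***^"; rfalse; }
    return addr;
];
[ HeapBlockFree addr;
    @mfree addr;           ! hand the block straight back to the interpreter
];[/code]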

Actually the parser does surprisingly little with actual text. Most of the code I’m sifting through deals with arrays of objects: the match list, the multiple object list, scoring objects, grouping objects together for printing purposes, seeing how pronouns and objects relate, and about 15 different ways to filter one set of objects down to a smaller set. Then there are a few oddball heterogeneous arrays like “pattern” which are sometimes objects and sometimes dictionary words.

The mapping of a mess of letters & spaces to a series of dictionary words happens in the VM already, not the parser.

Yeah, I haven’t looked at the parser closely enough to see what would help. But Aaron’s Smarter Parser sure does use a lot of regexes (and for things other than just detecting different types of errors), so I’d be surprised if some wouldn’t help. And I was thinking of regexes over whole phrases, definitely not letters.

It sounds to my unknowledgeable mind like too much state is being calculated while the parser is running. To borrow a metaphor from SQL, it sounds like more indexes or saved queries are needed! The parser checking whether an object is in some list is fine; constructing that list on the fly is not.

One of the first things I noticed when I looked at Ron’s code is that allowing access to the input buffers could hold tremendous advantages for Smarter Parser – I expect you could reduce the dependence on indexed text.

I did try this some time back, but then got distracted by other things. As I recall, the improvements were notable when the number of outstanding allocations remained small, but @malloc and @mfree (at least on the terp I was using, which would have been one of the terps shipped with the Mac IDE) slowed down considerably as allocations piled up. I will try to dig up the code again and have another look.

Indeed, my @malloc/@mfree implementations are not very smart, and I haven’t thought about improving them exactly because the I7 library doesn’t rely on them!

Anybody know much about heap management?

Sorry, but to a non-technical guy like me, “heap management” calls to mind the ancient vaudeville routine in which two guys (call them Sam and Irving) run into each other after a few years:

Sam: So, Irving, how ya doin’?
Irving: OK. I just got a new job, as a pilot.
Sam: A pilot? What airline do you work for?
Irving: I don’t work for an airline, I work in a stable.
Sam: A stable? What does a pilot do in a stable?
Irving: My job is to take all the manure and pile it.

Robert Rothman

Okay, so I did find the template hacking I mentioned earlier. It’s now bundled as an extension, attached for your reference, but be warned that the code is in quite a yucky state.

I also ran some quick-and-dirty performance tests on both Glulxe and Git, and found that I don’t have as good a @malloc/non-@malloc comparison as I thought yesterday: any performance differences from switching to @malloc are drowned out by other effects.

An example:[code]There is a room.

Instead of singing:
	let W be some indexed text;
	repeat with X running from one to 80:
		now W is “[W][W]”;
	say “[the number of characters in W].”

Foo relates numbers to numbers.

Instead of waving hands:
	repeat with the counter running from zero to 8000:
		now the foo relation relates the counter to the counter;
	showme the number that relates to 1592 by the foo relation.

Instead of jumping:
	let Y be a list of indexed text;
	repeat with X running from one to 8000:
		add “” to Y.[/code]
With a debug build of this source and the extension, Git ran the script sing/wave/quit about six times faster, and Glulxe (profiling) roughly ten times faster. (Though, admittedly, neither the sing nor the wave rule is likely to be representative of actual IF.) The profiling output is spoilered below.

Without the extension:[spoiler][code]Code segment begins at 0x3c
290 called functions found in ./profile-raw
1276 functions found in test-debug were never called
Functions that consumed the most time (excluding children):
RT__ChLDB:
at $03eddb (line 1); called 40991752 times (4294967296 accelerated)
43.076851 sec (4540917808 ops) spent executing
80.325556 sec (17753753712 ops) including child calls
Unsigned__Compare:
at $03e6f6 (line 1); called 46466034 times (4294967296 accelerated)
42.222459 sec (4666695568 ops) spent executing
42.222459 sec (17551597456 ops) including child calls
BlkSize:
at $0310e3 (line 27569); called 3792638 times (4294967296 accelerated)
30.291623 sec (4498572034 ops) spent executing
108.539218 sec (17942947802 ops) including child calls
RT__ChLDW:
at $03ee0c (line 1); called 3451082 times (4294967296 accelerated)
4.148779 sec (4319124870 ops) spent executing
7.317254 sec (17231635414 ops) including child calls
BlkValueWrite:
at $031e0b (line 27972); called 363600 times (4294967296 accelerated)
3.859299 sec (4319845324 ops) spent executing
70.812351 sec (17676125153 ops) including child calls
BlkValueRead:
at $031d12 (line 27946); called 361973 times (4294967296 accelerated)
3.269208 sec (4316540557 ops) spent executing
55.611922 sec (17569324293 ops) including child calls
RT__ChSTB:
at $03ee41 (line 1); called 793328 times (4294967296 accelerated)
1.255341 sec (4302107248 ops) spent executing
2.666186 sec (17199702384 ops) including child calls
RT__ChSTW:
at $03ee8b (line 1); called 218272 times (4294967296 accelerated)
0.390770 sec (4297150016 ops) spent executing
0.785204 sec (17185544256 ops) including child calls
BlkAllocate:
at $0314e0 (line 27686); called 39 times (4294967296 accelerated)
0.302045 sec (4296977667 ops) spent executing
1.964846 sec (17194490934 ops) including child calls
HashCoreCheckResize:
at $03bf70 (line 33297); called 8000 times (4294967296 accelerated)
0.224566 sec (4296280743 ops) spent executing
55.649696 sec (17571155823 ops) including child calls

^D[/code][/spoiler]With the extension:[spoiler][code]Code segment begins at 0x3c
285 called functions found in ./profile-raw
1273 functions found in test-debug were never called
Functions that consumed the most time (excluding children):
Unsigned__Compare:
at $03e013 (line 1); called 2349167 times (4294967296 accelerated)
2.177504 sec (4313760632 ops) spent executing
2.177504 sec (120277877624 ops) including child calls
RT__ChLDB:
at $03e6f8 (line 1); called 1452052 times (4294967296 accelerated)
1.681527 sec (4303679608 ops) spent executing
3.032960 sec (120279413016 ops) including child calls
BlkValueWrite:
at $03172a (line 27890); called 363600 times (4294967296 accelerated)
1.114526 sec (4302745734 ops) spent executing
3.697855 sec (120283951311 ops) including child calls
BlkValueRead:
at $0315c4 (line 27832); called 361973 times (4294967296 accelerated)
1.102141 sec (4302688473 ops) spent executing
3.708498 sec (120283816891 ops) including child calls
RT__ChLDW:
at $03e729 (line 1); called 896525 times (4294967296 accelerated)
1.061123 sec (4301242971 ops) spent executing
1.886649 sec (120272532163 ops) including child calls
BlkSize:
at $03115a (line 27628); called 725826 times (4294967296 accelerated)
0.642042 sec (4297870600 ops) spent executing
2.155998 sec (120272149156 ops) including child calls
HashCoreCheckResize:
at $03b88d (line 33225); called 8000 times (4294967296 accelerated)
0.226893 sec (4296280743 ops) spent executing
3.530675 sec (120282662662 ops) including child calls
INDEXED_TEXT_TY_Cast:
at $031edf (line 28337); called 80 times (4294967296 accelerated)
0.183588 sec (4295847646 ops) spent executing
3.825970 sec (120283077175 ops) including child calls
INDEXED_TEXT_TY_Say:
at $03218d (line 28469); called 164 times (4294967296 accelerated)
0.170247 sec (4295868348 ops) spent executing
1.868206 sec (120270349406 ops) including child calls
glk_put_char_uni:
at $000564 (line 1390); called 149902 times (4294967296 accelerated)
0.096039 sec (4295417002 ops) spent executing
0.175487 sec (120259533994 ops) including child calls

^D[/code][/spoiler]
The short story is that time spent in BlkSize drops considerably: it runs nine or ten times faster (presumably because I used @shiftl in place of a loop), and is called about one fifth as often (likely because I opted to keep data in a single block and relocate it when space ran out, rather than overflowing to multiple blocks). But without more, and more careful, experiments, it’s not clear how much of the speed-up is contributed by @malloc itself.
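To give the flavor of the @shiftl change (an illustrative sketch, not the extension’s actual code): where a block’s size is stored as a power-of-two exponent, computing the byte size collapses from a doubling loop into a single opcode:

[code][ PowerOfTwoSlow exp  size i;
    size = 1;
    for (i = 0 : i < exp : i++) size = size * 2;   ! one doubling per pass
    return size;
];
[ PowerOfTwoFast exp  size;
    @shiftl 1 exp size;   ! one native shift does all the doublings at once
    return size;
];[/code]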

Still, I can make some comment on how it scales with outstanding allocations. I was half-wrong yesterday: the terps do slow down as the number of outstanding allocations grows, but, using the story above and jump/quit, I watched the extension vs. no-extension runtime ratio steadily improve as the loop bound in the jump rule increased. So, it seems that the terps scale better than Inform’s code does.
Block Value Management via Malloc.txt (16.2 KB)

IIRC Smarter Parser recognized the questions and sentences that ordinary people are prone to typing, rather than just imperatives. That doesn’t really require a parser rewrite so much as a few additional Understand tokens and grammar lines that use them.
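For instance (hypothetical – the action and wording are mine, and in I7 you’d write an Understand line rather than raw I6 grammar), one such question could be caught with something as simple as:

[code]! Assumes a HelpMe action exists to respond sensibly.
[ HelpMeSub;
    "Try simple imperative commands, like EXAMINE THE LAMP or GO NORTH.";
];
Verb 'what'
    * 'should' 'i' 'do' -> HelpMe;[/code]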

As for regexes themselves, writers don’t regex, so the way Understand lines are written now can’t change. And since Understand lines aren’t as complicated as regexes, using them here seems like using a Buick to swat a fly. I’m not even sure how “the regex engine” would be sped up by opcode use. What would the opcode(s) even do? How would that work?

If it fell to me to rewrite Parser.i6t from scratch in I7, the first thing I’d do is replace any & all arrays with object attributes. For example, instead of adding objects to & removing them from the match list (an array), I’d give or remove the “marked for listing” property. It’s something of a dream of mine to have a parser whose first line is “now every object is not marked for listing” and go from there, efficiency be damned. I was tempted, when writing the default behavior for the which-did-you-mean activity, to do something very similar to that, so customizing that activity wouldn’t require the arcane I6 code seen at the bottom of the Cleopatra’s Nose example.

I’m currently about halfway through the convoluted flow of Parser__parse, and am waiting to see what else I learn. For example, did you know you could enter “LOOK, INVENTORY, EXAMINE ME” as a single command and have it work in every Inform game? I didn’t. I didn’t know commas could be used that way. But I think the ability to do that is far less important nowadays than being able to parse “WHAT SHOULD I DO?” and the like.

Currently the Inform library has a regex-matching routine, written in I6. It would be fairly easy to add a Glulx opcode which did exactly the same thing. The new implementation would be in native code (C, or whatever the interpreter was written in) and so would be faster.

We’ve been very conservative with these move-functionality-to-VM-opcodes decisions. It effectively locks down functionality. (Right now, if somebody finds a bug in the Inform library – or wants to add a feature – it can be patched in an individual game, and then integrated into the next I7 release. If the behavior has been moved into the VM, it’s a lot harder to change, and raises game/interpreter compatibility headaches.) Currently Glulx has opcodes for copying arrays; searching arrays and linked lists; and a few very low-level pieces of I6 functionality. (Property-checking and class-checking routines which haven’t changed since the early I6 releases.)
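(For the curious, here’s what using one of those search opcodes looks like from I6 – my own illustration, not library code. @linearsearch scans an array of fixed-size structures natively, where plain I6 would loop by hand:)

[code]! Find the 4-byte key stored at offset 0 of each 8-byte struct;
! returns the address of the matching struct, or 0 if none matches.
[ FindEntry key tbl numentries  res;
    @linearsearch key 4 tbl 8 numentries 0 0 res;
    return res;
];[/code]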

Mm. The tradeoff there is that adding objects to lists becomes somewhat faster, but iterating through lists becomes much slower. Whether this is an improvement would depend entirely on the code in question.
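In I6 terms the tradeoff looks roughly like this (illustrative only – “marked” is a stand-in attribute, not the library’s real one):

[code]Attribute marked;
Array match_list --> 32;
Global number_matched;

[ IterateArray i;
    ! touches exactly number_matched entries, no more
    for (i = 0 : i < number_matched : i++)
        print (name) match_list-->i, "^";
];
[ IterateAttribute obj;
    ! must test every object in the game, marked or not
    objectloop (obj has marked)
        print (name) obj, "^";
];[/code]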

Still, interesting. I can see that BlkSize() is the major hold-up in the current implementation, but I guess the real solution is to think about how to modify the I7 memory code (and possibly the Glulx opcodes) to fit together better, rather than just accelerating BlkSize().

Aren’t tables (or arrays) the only way to sort? Isn’t sorting important?

Nope. The objects in the match list & multiple object list (and by list I mean array) are populated in whatever order the objects come out of the object containment tree. The tree itself has a kind of ordering to it, and the DM4 spends some ink on managing it – for example, asserting X is in Y when it already was moves X to the front of the list, making it the “first child”.

Parser.i6t only ensures it doesn’t screw up whatever order the objects are already in. For example, when removing an element x at index i from an array, rather than copying the last element into the i-th slot, it shifts all the elements left by one slot, starting at i+1.
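A minimal sketch of that pattern (the routine name is mine):

[code]! Remove the element at index i from an array of length len,
! preserving the order of everything else, as Parser.i6t does.
[ RemoveFromList arr len i  j;
    for (j = i : j < len - 1 : j++)
        arr-->j = arr-->(j+1);
    return len - 1;   ! the new length
];[/code]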

I think, though I’m not sure, that I7’s “now every…” phrase traverses the containment tree in the same way. Assuming it does, removing an object (or objects, as with an ALL EXCEPT command) would be much faster using a property instead of an array.

But as I said, I wouldn’t really expect an easy, from-scratch I7 re-implementation to produce quick code by accident, so such speed comparisons are probably moot.

No, that statement usually iterates through all things. (Although if it can pick a narrower kind, e.g. “now every person…”, it will iterate through a shorter array containing only things of that kind.)

There may be other cases, but I don’t recall ever catching it traversing the world tree.

Other people may disagree with me, but if I added an accelerated function for something like regexes, it would be to implement the feature as it was intended, not to precisely duplicate the I6 code – i.e., more like a real opcode. Any bugs in the I6 code would be fixed independently of updating the accelerated function.

Of course it would be really good for the I7 test suite to be made public so that any new accelerated functions or opcodes could be tested more thoroughly!

In other news, my new VM will now sometimes produce real loops (as well as nested block if statements)! When I can get it to produce loops in more situations it will be even faster. And it won’t be hard to turn the e.pc= assignments into break statements either…

[code]/* 3462/2 */ while (!(!(e.U2S(l[3])<e.U2S(l[2])))) {
    s.push(/* 3466/15 */ m.getUint16(l[1]+2*e.U2S(l[3])));
    /* 3470/1 */ if(!(!(s.pop()==l[0]))) {
        l[5]=/* 3474/13 */ l[4];
        /* 3477/140 */ e.pc=7+3478; return
    };
    /* 3480/133 */ e.incdec(4,1)
};[/code]

Really sorry to be so long in getting this to you. I got kind of pulled away for a while there. (“Squirreled,” as my wife and I say in reference to our favorite character from Up.)
concealment.txt (8.07 KB)

I’ve been experimenting a bit with the opcode calls, and haven’t found them to deliver enough of a performance boost to be worth implementing throughout the parser. Which doesn’t mean that I don’t hugely appreciate your work, of course…

At this point I think any further performance gains I find will not come through the parser. In that spirit, I’m going to do one more round of performance profiling to see if any more functions elsewhere stand out as particular time sinks. If I turn up anything interesting, I’ll report back here again. If not, I’m about ready to judge performance acceptable (at least on this story file) and call it a day (at least for now).