The "Rewrite The Parser in Inform 7" Project

Ron_Newcomb · September 15, 2011, 1:05am

I’ve started on the grand ambition of rewriting Inform’s parser in Inform 7, meaning, the entire Parser.i6t template file. My goal in this is primarily for teaching and learning purposes: we get questions in this forum pretty regularly about just what, exactly, is going on inside that parser. While there’s a handful of Inform 6 experts here who know this, I feel it would be in everyone’s best interest if more people could diagnose parser-related problems with their works-in-progress.

My technical goal in this is to have the Inform 7 compiler generate Inform 6 code that resembles the original Parser.i6t as close as possible. This is so I don’t have to test anything. Seriously, a full regression test on a rewrite would be nearly impossible to construct, let alone verify. By ensuring the generated I6 corresponds line-by-line with the original, I also ensure that the I7 code isn’t slower than the original I6. Or at least, not by much.

I’ll further try to use the original I6 names where possible for variables, functions, etc., though I can only go so far. Inform names parameters and locals formulaically, like t_0, t_1, and so on, but the names will at least appear in the comments in the generated I6.

I have two examples already. Version 1 looks like I block-copied Parser.i6t from Appendix B into the IDE and hit compile. Except, it actually does compile. Version 1 is just a base-line to start from. Version 2 I’ve modernized three functions – ScopeWithin, PlaceInScope, and AddToScope – with quite a bit of technical bits added in a volume at the beginning of the file for support. Version 2 also compiles and works, and indicates what later versions will likely look like. Comments at the beginning of the game file are mine, while most of the comments spread throughout are Graham’s from Appendix B.

For the purposes of correctness I won’t include any fixes or upgrades that the extensions site have. For now, this is a pure translation project, nothing more.

zarf · September 15, 2011, 4:04am

Do you think that a line-by-line translation will be more readable than the I6 original?

Looking at this, I see mixed improvements. “The parser’s current word” is way better than “wn”. “Do scope action and recurse on … excluding … under …” is better than “DoScopeActionAndRecurse(…, …, …)”, but only by the prepositions. Then you have a lot of lines that are essentially adding spaces and “the” here and there.

The overall logic is exactly the same, which is to say occult in the extreme. (I see you translated the line “wn = match_from”, in PlaceInScope. A couple of days I mentioned that particular line, as the immediate cause in the infinite-loop-scope-bug thread. I said I had no idea what it’s doing there. I still don’t. Do you? Anybody?)

I realize that you’ve chosen a (sanely) limited goal here; the alternative is “write a new parser in pure I7.” (I’m not touching that either, believe me.) I’m wondering what the specific benefits will be. (“Let’s find out” is a perfectly good answer.)

Ron_Newcomb · September 15, 2011, 7:49am

Not as readable as a rewrite of course, but I think almost any function written in I7 is more readable than its I6 counterpart.

I had a good name for wn only because I already knew exactly what it was from my previous exposure to the code. The phrase and match_from I have less of an idea, hence the too-literal names. Since the I6 names are preserved for variables via “translates into I6 as” and to-phrase naming, I can completely rename their I7 selves like I did with wn. It’s just I need to understand the code better before I can pick better names. You should have seen what I had before I hit upon those particular prepositions. Ugh!

I have a philosophy that if working code is quick and relatively bullet-proof, but its logic looks convoluted, sometimes it’s just a question of what you’re naming things. Sometimes the name of a variable is a poor match for the concept that it eventually ended up standing for. Since this project allows me to name many of the I7 constructs however I wish (parameters excluded), it’s possible we find that the parser isn’t as convoluted as it seems. It might even be elegant but we can’t see it.

It’s also possible to highlight certain patterns in the code, such as a long string of if-statements, which look like they’d be happy as a rulebook. The scoring especially. This can be brought out with I7 comments even if I can’t express it directly code-wise. The line you bring up in particular I’ve seen multiple times, and the word “reset” comes to mind with it. I believe it’s a safety mechanism in case other GPRs didn’t reset wn like they’re supposed to, or an off-by-one error happened. (Or maybe, since many matching routines only work on wn, it’s using up wn to use one of those functions. If parsing is done, wn can be trashed.)

When constructing the Custom Library Messages extension, I slipped in the comment [reset to] in a now…is… assignment so I could tell when an assignment was meaningful or not. The same can apply here.

I don’t think a brand new I7 parser would gain much traction for the same reason that Glulx didn’t immediately obsolete the Z-machine. So at the very least this project could be a stepping stone toward a better day: more people could join in with sensible opinions on the parser due to better readability, increasing pressure to standardize things like improved disambiguation, transparent scoring, etc. Besides, not everyone speaks I6, and that alone is a hurdle.

So I think there isn’t much in the way of real technical benefits despite being a technical challenge. But I think the “soft” benefits of bringing in more community eyes are definitely worth it.

And I’ll try to comment more.

capmikee · September 15, 2011, 3:46pm

This is very cool! My idea for a parser rewrite was to keep it in I6 but to break the big routines into small, readable chunks with better naming. But this looks like the beginning of an even better solution. Thanks! (edit: I see that’s what you’ve actually done, by putting chunks of I6 code into the I7 source.)

Sheesh, sometimes I think I must be the dumbest programmer in the world. Why didn’t I think to use “to decide” phrases to set constants?

I have no shame in admitting that this is beyond me, though:

To decide which grammar token is the noun token: [0] (- NOUN_TOKEN -). To decide which grammar token is the end of line token: [15] (- ENDIT_TOKEN -).

Are you defining “fake” values for the grammar token KoV? Is this a form of typecasting? I wouldn’t have expected it to be legal to define a numeric value for a KoV while only giving its name as a phrase-name. I can see that it could be useful for 0, which is not a normally used KoV. (right?) But is it dangerous to specify another integer - couldn’t there be collisions? I’m assuming you’re doing it because there’s a gap in values between topic and EOL, so perhaps it makes sense as a way to keep the numeric values in sync with I6 constants, but it looks arcane and scary.

zarf · September 15, 2011, 7:42pm

It’s just type-casting. Specific values have to be defined for grammar-token values that are referred to in I6 code. Ultimately, I guess, once all the I6 code is gone, those specific value definitions will be eliminated – the code will be able to rely on “naturally-generated” numbers for each grammar-token value.

zarf · September 15, 2011, 7:43pm

Nope.

capmikee · September 15, 2011, 7:56pm

This project may inspire me to do that refactoring I wanted to do. But don’t let that stop you, if that’s something you’d like…

Ron_Newcomb · September 15, 2011, 9:28pm

Eventually those chunks will be commented out and the I7 version will sit beside, so no I6 other than some glue code will be left.

capmikee:

I have no shame in admitting that this is beyond me, though:To decide which grammar token is the noun token: (- NOUN_TOKEN -). To decide which grammar token is the end of line token: (- ENDIT_TOKEN -).Are you defining “fake” values for the grammar token KoV? Is this a form of typecasting? I wouldn’t have expected it to be legal to define a numeric value for a KoV while only giving its name as a phrase-name. I can see that it could be useful for 0, which is not a normally used KoV. (right?) But is it dangerous to specify another integer - couldn’t there be collisions? I’m assuming you’re doing it because there’s a gap in values between topic and EOL, so perhaps it makes sense as a way to keep the numeric values in sync with I6 constants, but it looks arcane and scary.

Apparently it isn’t beyond you because you understand it, and its potential problems, perfectly. Zarf’s also right in that the actual number that the enums/constants/kovs/words stand for isn’t important, so as long as every use of which is taken into account, I don’t have to use the above gymnastics to force a particular name to compile to a particular number.

O ye of little faith. A trick I’ve started using is exposing the same I6 variable to I7 multiple times under different names, and sometimes with different types. This is useful. A couple of parser variables actually (but only sometimes) hold a routine to call with indirect(), which one I7 incarnation of I have typed as “phrase object → nothing” or “phrase nothing → truth state” as appropriate. This helps.

(I just wish I7 implemented it’s “applied to” and other of its new functional-programming features were implemented with I6’s indirect(). That would have been tidy.)

zarf · September 16, 2011, 12:15am

I’m pretty sure that indirect() was deprecated after Inform 5, and only hangs on through habit and indifference. (It’s not mentioned in the DM4, for example.) Any line “indirect(x)” can be written as “x()”, and that’s got to be clearer I6 code.

Ron_Newcomb · September 16, 2011, 6:11am

Oh that’s interesting. The IDE still colorizes it like child(), sibling(), and parent(). Thank you for that info. An I7 rule is invoked the same exact way from the I6, so I might be able to simplify some stuff.

Right now I’m working out ways that I7 should access an I6 array, especially if it’s sometimes a routine instead, as with add_to_scope.

Ron_Newcomb · September 16, 2011, 10:46am

As of version 3 the Book of Scope is complete. The following functions have been translated: ScopeWithin, PlaceInScope, AddToScope, DoScopeAction, TestScope, LoopOverScope, SearchScope, and DoScopeActionAndRecurse. (I will likely rename some of the I7 constructs therein as the project continues, of course.)

I’ve had my first taste of dealing with I6 arrays, hidden object properties, and properties which can be arrays or routines at will, all at once thanks to DoScopeActionAndRecurse.

In the interests of I7 readability, I exchanged a while loop with a for loop. I had previously exchanged a switch statement with a else-if chain due to an I7 restriction. I don’t believe either of these changes should be a problem. Details on them are in the comments at the beginning of the game file. The modernized functions are about four-fifths of the way down in the source, or, use the navigation features to jump to the Book of Scope.

Sslaxx · September 17, 2011, 11:01am

How did you get around that?

capmikee · September 17, 2011, 2:35pm

Oooh, maybe we could have an “I6 Arrays” extension…

Ron_Newcomb · September 17, 2011, 8:03pm

For the type-stacking problem I made a to-decide-if phrase that invokes I6’s “ofclass Routine” condition. At the I7 level I call them rules rather than routines, of course, but add_to_scope isn’t the only place such a condition is useful.

Arrays are more difficult, but there is precedent for how they should work in I7: lists, which is how arrays are used, and table columns, which actually are arrays. So I’ll make repeat loops that set the ct_0 and ct_1 table variables to the array name and index. (Since the I6 parser doesn’t use I7 Tables they’re free for the taking.) With those I can make “the chosen row” same as for tables, and either “the chosen element” and/or “the (name of kind of value K) element” for ct_0–>ct_1.

Exposing them requires I6 inclusions on every single array like I did for the NOUN_TOKEN value up-thread. I did make a “1-based array” KOV for them: To decide which 1-based array is the multiple-object list: (-multiple_object-). I made the type mainly so I don’t accidentally pass the wrong kind of thing to the to-phrases that operate on such. And for documentation purposes too, of course, which this project largely is.

@Mikee: I doubt I’ll make an array extension, partly because creating a new one involves two I6 inclusions per array, and partly because I believe arrays are planned for I7 proper.

zarf · September 19, 2011, 3:53pm

I gotta say, looking at the I7 bug site, this project is worthwhile for purely for its stress-testing of the I7 code generator.

Ron_Newcomb · September 19, 2011, 11:12pm

Version 4 is up. A dozen more functions were converted: MakeMatch, MatchTextAgainstObject, MultiSub, MultiAdd, MultiFilter, Refers, WordInProperty, CreatureTest, ReviseMulti, BestGuess, SingleBestGuess, and PrintInferredCommand. Plus a few more have partial starts. The only thing these functions have in common is that they’re small or are otherwise low-hanging fruit. Only two are worth noting.

WordInProperty is the best example of clarity I hope this project can bring. WordInProperty’s former implementation was this:k = obj.&prop; l = (obj.#prop)/WORDSIZE-1; for (m=0 : m<=l : m++) if (wd == k-->m) rtrue; rfalse;But now, it is this:if the word is listed in the names of the obj, yes.
Relatedly is Refers. A minor goal of mine is, when I have to create a new name for a function or whatever, I try to make that name resemble similar stuff from Inform 7. Refers takes two parameters, an object and a “dictionary word”. But now there’s an I7 type called an “understood word”, as they come from Understand…as… statements, and invoking Refers resembles a relation: if the word can refer to the object. I can’t use actual relations due to the code they auto-generate, but I can make it possible to think of some things in that way, and further ease translation to pure I7 if this project once completed is taken further.

But most of the time I spent on this version went into thinking about all the different kinds of arrays in there. Some are 1-based with their size in the 0th element (multiple_object), some are 0-based with their size who-knows-where (match_list), one is even 2-based (parse – but only under the Z-machine, it seems). Some have object elements, some numbers, a few are byte-based ZSCII characters (buffer). At least one is an array of structs (parse), the shape of which depends on Z or Glulx. Sometimes arrays work in concert, like a struct of arrays: see how match_list and match_score share the same index. And then there’s array-valued properties, which are just different enough from 1-based arrays to be annoying.

Similar to the ideal of making I6 operations resemble I7’s is hiding some of these nitpicky differences. For the I7 author, “1” is always the first element, and size comes from a “size” property. So I spent a great deal of time creating heavily overloaded array accessor phrases that normalize as much of this as possible. And one idea that hopefully bears more fruit than problems is, again, relations.

Type-wise, I’ve changed arrays from a simple KOV to a relation of index-type to element-type. This is handy since I can use Inform’s generic types to make creating accessor functions easier. Given definitions like this:To decide which relation of 0-based index to numbers is the match scores list: (-match_scores-). To decide which relation of 1-based index to objects is the multiple-object list: (-multiple_object-). To decide which relation of 0-based index to ZSCII letters is the player's input buffer: (-buffer-).I can create phrases with a parameter of a relation of a 0-based index to a value of kind K and then re-use the K for the element’s type.

capmikee · September 20, 2011, 3:02pm

This is still so cool!

I don’t have anything else to add, just bumping the topic.

Ron_Newcomb · September 26, 2011, 9:01am

Version 5 is up, with 25 more functions translated, leaving 25 more yet to go.

One of the more useful translations is the Keyboard() function, which does the bulk of what we think of as the Reading A Command activity. It handles OOPS and UNDO processing, plus the blank line parser error. I took extra time to place comments into this function highlighting where and what the check, carry out, and report rules would be for each of those commands if anyone wants to promote them to full out-of-world actions, or just get a sense of how they work. ScoreMatchL, part of the does-the-player-mean subsystem, was done similarly in case someone wants to break it out into a rulebook or three. I’m unsure if the first two “rulebooks” are useful, or even necessary, but the third certainly looks promising.

It was during the translation of Keyboard() that I began to think of placing “landmarks” in the code. There’s a lot of code in here with little to distinguish one block of black ink from another block of black ink. Just scrolling around, it’s hard to tell where you are in the source. So I’ve decided that all of the calls to the library message system will also have, in blue text, the message that they’ll print. That text isn’t actually used, it’s just a visual cue and yet another way for one to see what the code is doing.

Another thing I’m trying to do is to remove the GOTO statements (called “jump” in Inform 6). Keyboard() used break and next statements to mimic them, but I removed even those via else-if chains. But ChooseObjects, on the other hand, took some doing. While it was conceptually simple – make the final third of the function the first third – code is fussy. And then it used a while-loop in a peculiar way: by combining a following if-statement into itself, using next-commands to do the looping, and skipping the following code should the loop fail prematurely. It was a clever trick, but doesn’t jive well with making the code clear and readable. The solution I found I’m still unsure of. It’s hard to find test cases that hit it particularly.

However, even if ChooseObjects was a great success, TryGivenObject broke me. TryGivenObject has only two labels, but with four different jumps that aim at them. The code is convoluted and difficult to follow even in the translated version. I certainly intend to come back to it after the rest of the functions are translated, as I’ll have a better understanding of the parser code (along with shorter code, I presume), but until then, Inform 7 now has GOTOs under the trumped-up moniker “control labels”. May God have mercy on my soul.

This version also introduces bit-twiddling to the project. This I’ve had experience with before, but I started afresh rather than importing an earlier solution. I’ve used the verbs include and exclude to do all the bit-twiddling. As imperatives they work much like Inform’s existing increment and decrement phrases. Because of a bug I hit a year or two ago, most bit tests here take the form “if (x & y) == y” rather than the “if (x & y) ~= 0” which is standard. This is so I can test “if x includes y1 + y2” safely. The values for Y are still defined as I6 constants and exposed through KOVs with to-decide phrases. Include/exclude work on any KOV so long as X and Y are the same KOV.

And finally but certainly (unfortunately) not least, my handling of arrays continues to evolve. It turns out that relations, at least the ones that are passed through phrases, are treated more like indexed text and less like objects: they’re passed by copy. Even though I’m using only them as a type, Inform wants to insert code to make copies. And while I’ve been avoiding that via the {-pointer-to:xxx} directive, it doesn’t work when I7 phrases call other I7 phrases. So arrays as relations, while making sense conceptually as a special case of associative arrays, were a blind alley for this project. Instead, arrays are now typed as rulebooks. This makes no sense conceptually, but works mechanically, so shall it be.

The second difference in how arrays are handled are the arrays of structs, such as the parse array and a couple of pronoun arrays. I had originally created the fields as a kind-of-value in their own right, but that disallowed the syntax of “the 1st word of the parsed input” since Inform’s phrases cannot have two parameters right next to each other. I found “the word # 1 of the parsed input” to be annoying, so each field now has a phrase dedicated to it. Inelegant, perhaps, but the client code is more readable as a result, and that’s what matters.

Relatedly, for one phrase, “if x is listed as a (field) in y”, I made a handful of field types specifically for that phrase. This is because the field, of type struct, is now a Unit rather than an empty KOV just for automatic phrase-selection purposes, and that allows me to pass two numbers for the price of one. (It’s just like how hours and minutes are both passed together as a single “time” unit.) The second number is the number of fields (“columns”) that that array has, a number that’s needed in the for-loop that that phrase masks. The effect is that I don’t need to clutter the actual parser code with extra appelations like “with element size 3” for the ct_1 += 3, because it’s hidden in the unit.

Ron_Newcomb · October 6, 2011, 8:17am

The project’s first draft nears completion, with 22 more functions translated in version 6. Highlights this time around include the translation of NounDomain and Adjudicate, neither of which were trivial to do. The GOTOs were removed from TryGivenObject and NounDomain, and though it involved creating a new function to do so, that means that this version has no GOTOs in it currently.

TryNumber was rewritten for efficiency’s and readability’s sake, namely, using the technique of subtracting a ZSCII character from ‘0’ to translate the ten digits into small numbers.

Some unidentified magic numbers used by PrepositionChain were given names.

UnpackGrammarLine was slightly modified for efficiency. It previously cleared out three arrays before filling them in, but now it fills them in first as far as possible, then clears out only the remaining entries.

And finally but most importantly, some of the array copying, searching, and related loops were replaced by VM-specific opcodes for performance reasons. As well I modified a few loop constructs to use an intermediate global variable to cache the result of array size calculations so it happens once per loop rather than once per iteration. (The original code always did this, but I didn’t hit upon a way of mimicing it while preserving readability until now.) The net result of this is that the I7 expression of the parser code should be able to keep up with, or even exceed, the I6 version’s performance.

Three more functions remain, including the grand-daddy of them all: the monstrous Parser__parse.

Peter_Piers1 · October 7, 2011, 8:54pm

While about 99% of this is way over my head, and I’m still not entirely sure what purpose this will serve, I am always drawn here for the latest progress report. Your efforts are tremendous, and your determination is an inspiration to us all. I salute you.