Extracting Vocabulary from a Compiled Game

Is it possible (at the code level) to extract a list of lexical items from a compiled I6 or I7 game? A friend of mine is looking into a concept that I think I’m not at liberty to disclose, as I may still be under NDA. It seems to me this is what he needs in order to create his exotic interpreter software.

Can it be done?

Absolutely. Internally, the Inform parser represents input words as indices into a sorted array of padded strings, so it shouldn’t be too difficult to locate that array and pull the strings back out of it. (In the Z-machine, the location of this array is stored in the header word at byte offset 8, i.e. the fifth word; in Glulx, finding it is a bit more difficult.)
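To make that concrete, here is a rough Python sketch of pulling the words out of a Z-machine story file. It assumes a version 3 or later file that uses the default alphabet table (a V5+ game can supply its own, which this ignores), and the function names are made up for illustration.

```python
import struct

# Default Z-machine alphabets (V2+). In A2, Z-char 6 is an escape and 7 is a newline.
A0 = "abcdefghijklmnopqrstuvwxyz"
A1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
A2 = " \n0123456789.,!?_#'\"/\\-:()"

def decode_zchars(data):
    """Decode packed Z-characters (three 5-bit values per 2-byte word)."""
    zchars = []
    for i in range(0, len(data), 2):
        word = (data[i] << 8) | data[i + 1]
        zchars += [(word >> 10) & 31, (word >> 5) & 31, word & 31]
        if word & 0x8000:          # top bit marks the last word
            break
    out, alphabet, i = [], 0, 0
    while i < len(zchars):
        z = zchars[i]
        if z in (4, 5):            # one-shot shift to A1 / A2 (also used as padding)
            alphabet = z - 3
        elif z == 0:
            out.append(" ")
            alphabet = 0
        elif z >= 6:
            if alphabet == 2 and z == 6:
                # Escape: the next two Z-chars form a 10-bit ZSCII code
                if i + 2 < len(zchars):
                    out.append(chr((zchars[i + 1] << 5) | zchars[i + 2]))
                i += 2
            else:
                out.append((A0, A1, A2)[alphabet][z - 6])
            alphabet = 0
        # Z-chars 1-3 (abbreviations) should not occur in dictionary words
        i += 1
    return "".join(out)

def zmachine_dictionary(story):
    """Return the dictionary words of a Z-machine story file given as bytes."""
    version = story[0]
    dict_addr = struct.unpack_from(">H", story, 0x08)[0]   # header word at byte 8
    n_separators = story[dict_addr]
    pos = dict_addr + 1 + n_separators
    entry_len = story[pos]
    count = struct.unpack_from(">h", story, pos + 1)[0]    # negative means unsorted
    pos += 3
    text_len = 4 if version <= 3 else 6                    # encoded bytes per word
    return [decode_zchars(story[pos + k * entry_len: pos + k * entry_len + text_len])
            for k in range(abs(count))]
```

You would call it with something like `zmachine_dictionary(open("story.z5", "rb").read())`, where the file name is just a placeholder.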

However, a couple of caveats.

One, the parser isn’t limited to words from the dictionary: it can theoretically handle parsing any way it wants, such as comparing text against user-supplied strings stored in RAM (the cube names in Spellbreaker) or parsing a word as a number in binary (one of the puzzles in Suveh Nux). The vast majority of parsing is restricted to dictionary words, but not quite all of it.

Two, the dictionary entries are cut off at a specific “resolution”: ENCYCLOPEDIAS might be stored as ENCYCL (Z3), or ENCYCLOPE (Z5), or ENCYCLOPEDIA (Glulx). This may or may not affect your use case.
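A tiny Python illustration of what that cutoff means if you want to compare typed words against an extracted list. The helper name is hypothetical, and it simply counts letters; on the Z-machine the limit is really in Z-characters, so digits and punctuation use up more of the budget than this suggests.

```python
def dictionary_key(word, resolution):
    """Approximate the form a word takes in the dictionary: lowercased, truncated."""
    return word.lower()[:resolution]

# The three resolutions quoted above
for label, resolution in (("Z3", 6), ("Z5", 9), ("Glulx", 12)):
    print(label, dictionary_key("ENCYCLOPEDIAS", resolution))
```

The practical upshot is that two different long words can collapse to the same entry, and you can't recover the full spelling from the dictionary alone.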

Thanks! I’ve passed your answer on to my friend. He’s the one who is contemplating building a new type of interpreter. I may be back later with more questions.

For starters, this: “…in Glulx, finding it is a bit more difficult.” Can you tell us exactly how that works?

In Glulx the dictionary table isn’t a VM feature. It’s just an array in memory. The compiler generates code that looks at the correct address, but to extract that address, you have to look into the compiled code and do some hunting around.

The problem is much easier if you’re the one who compiled the code in the first place. The Inform compiler generates a debug file (gameinfo.dbg) which lists all these addresses. (Look for the “#dictionary_table” entry in the XML.)
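Once you have that address, reading the table is straightforward. Here is a Python sketch under a few assumptions: the game was built with single-byte dictionary characters, the table starts with a 4-byte entry count, and each entry is a 0x60 type byte, then DICT_WORD_SIZE characters padded with zeros, then six bytes of flags and verb data. The function name and the `word_size` default of 9 are illustrative; a game compiled with a different DICT_WORD_SIZE needs that value passed in.

```python
import struct

def glulx_dictionary(story, table_addr, word_size=9):
    """Read dictionary words from a Glulx story file, given the table address.

    Assumes entries of word_size + 7 bytes with one byte per character.
    """
    count = struct.unpack_from(">I", story, table_addr)[0]
    entry_len = word_size + 7
    words = []
    for k in range(count):
        start = table_addr + 4 + k * entry_len
        raw = story[start + 1: start + 1 + word_size]   # skip the 0x60 type byte
        words.append(raw.rstrip(b"\x00").decode("latin-1"))
    return words
```

If the words come out as garbage or shifted by a few bytes, the most likely culprit is a wrong `word_size` or a game using multi-byte dictionary characters.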


The other caveat is that this dictionary is a pretty bad data source for most purposes. I remember some old Z-machine interpreters did tab completion (on the command line) based on the game dictionary. I suspect the results on a modern game would be 30% useful for gameplay, 30% spoilers, and 40% weird-ass synonyms that the author threw in just in case a player might type them. Browsing through the list would be distracting at best.

Thanks, Zarf. I’ve passed your info on to my friend. Based on what you’ve said, I’m afraid he’s going to have to rethink his whole approach.

(Though it could be potentially useful if, for example, you’re trying to slim down the possibility space for a voice-recognition system. Something being in the dictionary is no guarantee that it’s useful to the player, but something not being in the dictionary can be a pretty good indication that it’s not useful.)
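As a sketch of that filtering idea in Python: given several candidate transcriptions from a recognizer, keep only those whose words all appear in the game's vocabulary after truncation. The function name, the `resolution` parameter, and the sample data are all hypothetical, and since the parser can accept input that isn't in the dictionary (numbers, player-supplied names), a real system would want to treat this as a ranking hint rather than a hard filter.

```python
def filter_candidates(candidates, vocabulary, resolution):
    """Keep candidate commands whose every word matches a dictionary entry.

    vocabulary: set of lowercase dictionary words, already truncated by the game.
    resolution: number of characters the dictionary keeps per word.
    """
    kept = []
    for command in candidates:
        words = command.lower().split()
        if words and all(w[:resolution] in vocabulary for w in words):
            kept.append(command)
    return kept

vocabulary = {"take", "lamp", "encycl"}
candidates = ["take lamp", "bake lamp", "take encyclopedia"]
print(filter_candidates(candidates, vocabulary, 6))
```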
