I’m using a speech recognition layer implemented outside Inform 7 to feed input into my Inform 7 game. I’d like to bias my speech recognition algorithm towards the words that my game can understand. Is there any way to just get a list of all those words?
I don’t think there’s an officially supported way to do this. However, you could try applying the following regular expression on the generated Inform 6 source:
(?<!!.*)'(.[^&<>=|,.+'-\s]+)'(?!.*")
That should give you all the single-quoted dictionary words (like 'this'), which are all the ones the parser can even attempt to understand.
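The pattern above relies on a variable-length lookbehind, which some regex engines (Python's re module, for one) reject. As a rough alternative, here is a minimal Python sketch that walks the generated Inform 6 source in one pass, skipping double-quoted strings and ! comments and collecting single-quoted tokens. The name dictionary_words is made up for illustration. It assumes the usual Inform 6 conventions: a one-character token such as 'a' is a character constant rather than a dictionary word, a // suffix carries flags (as in 'apples//p'), and ^ stands for an apostrophe.

```python
import re

# One alternation, tried left to right at each position:
#   "..."   double-quoted string (skipped; may span lines)
#   !...    comment to end of line (skipped)
#   '...'   single-quoted token (captured in group 1)
_TOKEN = re.compile(r'"[^"]*"|![^\n]*|\'([^\'\s]+)\'')


def dictionary_words(i6_source: str) -> set[str]:
    """Collect single-quoted dictionary words from Inform 6 source text."""
    words = set()
    for match in _TOKEN.finditer(i6_source):
        token = match.group(1)
        # Strings and comments leave group 1 empty; 'x' is a character constant.
        if token is None or len(token) < 2:
            continue
        # Drop flags such as //p, and turn ^ back into an apostrophe.
        word = token.split("//", 1)[0].replace("^", "'").lower()
        if word:
            words.add(word)
    return words
```

Because strings and comments are consumed before any quoted token inside them can match, an apostrophe in printed text does not produce a false hit. It has only been thought through for ordinary source, so treat odd cases (a character constant like '!' or '"') as untested.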
In theory, one should be able to add:
Include (- Message "----------"; Trace dictionary; -) after "Output.i6t".
and then see the complete list of words in the story’s dictionary as part of compilation output on the Results/Progress tab.
In practice, this seems to work when compiling for Z-machine, but not for Glulx (though Glulx output does tell how many words are in the dictionary and is formatted in a way that suggests it is intended to output the list of words).
Note that this shows the dictionary entries as encoded, which means words can be truncated, especially those including special characters.
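If the truncation matters for your recognizer, one option is to map each truncated entry back to full words from a larger word list. A minimal Python sketch, assuming a 9-character dictionary resolution (check your own build) and a lexicon you supply yourself; expand_truncated and its parameters are illustrative names, not part of any tool:

```python
def expand_truncated(entries, lexicon, resolution=9):
    """Map each dictionary entry to the full words it could stand for.

    Entries shorter than `resolution` are assumed to be complete. That is
    an assumption: words with special characters can be cut off earlier.
    """
    expanded = {}
    for entry in entries:
        if len(entry) < resolution:
            expanded[entry] = [entry]
            continue
        # The parser only compares the first `resolution` characters,
        # so every lexicon word sharing that prefix is a candidate.
        candidates = sorted(w for w in lexicon if w.startswith(entry))
        expanded[entry] = candidates or [entry]
    return expanded
```

For example, an entry "conversat" would expand to whichever of "conversation", "conversational" and so on appear in your lexicon.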
The Glulx version of “Trace dictionary” only got implemented this year, so it’s not available in the current I7 release.
In the meantime, you can add the following to your game (Z-Machine or Glulx):
Include (-
#Ifdef TARGET_ZCODE;
[ ListAllDictWords i wsc da des dec;
	da = HDR_DICTIONARY-->0;
	wsc = da->0; ! word separator count
	des = (da+wsc+1)->0; ! dictionary entry size (in bytes)
	dec = (da+wsc+2)-->0; ! dictionary entry count
	print "Words recognized by this story (", dec, "):^^";
	for (i = da+wsc+4: i<da+wsc+4+(des*dec): i=i+des)
		print " ", (address) i, "^";
	new_line; new_line;
];
#Endif;
#Ifdef TARGET_GLULX;
[ ListAllDictWords da dec ce;
	da = #dictionary_table;
	dec = da-->0; ! dictionary entry count
	print "Words recognized by this story (", dec, "):^^";
	! each entry: 1 type byte, DICT_WORD_SIZE characters, three 2-byte fields
	for (ce = da+WORDSIZE: dec>0: ce=ce+1+DICT_WORD_SIZE+2+2+2) {
		print " ", (address) ce, "^";
		dec--;
	}
	new_line; new_line;
];
#Endif;
-).

To list all known dictionary words:
	(- ListAllDictWords(); -).

When play begins: [or set up your own debugging verb]
	list all known dictionary words.
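To get that printed list into an external recognizer, you could save the game's output to a text file and pull the words back out. A minimal Python sketch, assuming the output format printed by the routine above (a header line with the count, a blank line, one word per line, then a blank line); parse_word_dump is a made-up name:

```python
import re

_HEADER = re.compile(r"Words recognized by this story \((\d+)\):")


def parse_word_dump(transcript: str) -> list[str]:
    """Extract the word list printed by ListAllDictWords from saved output."""
    lines = transcript.splitlines()
    for index, line in enumerate(lines):
        header = _HEADER.search(line)
        if header:
            break
    else:
        raise ValueError("word list header not found")

    words = []
    for line in lines[index + 1:]:
        if not line.strip():
            if words:        # a blank line after the words ends the list
                break
            continue         # the blank line right after the header
        words.append(line.strip())

    expected = int(header.group(1))
    if len(words) != expected:
        raise ValueError(f"expected {expected} words, found {len(words)}")
    return words
```

The count check is there to catch an interpreter that wraps or pages the output in some unexpected way.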
I suspect that if you’re tuning speech recognition, you’re going to want to divide this list up further by hand. You’ve got:
- Words that you expect the player to use
- Words that you threw in because the player might use them (off-the-wall synonyms); you don’t want them getting in the way of primary vocabulary
- Words that I7 throws into every game which might not be relevant. (Every number from “one” to “thirty”, for example. “Lit”, “lighted”, and “unlit”.)
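The sorting itself has to be done by hand, but once you have written down the words you expect and the ones I7 adds on its own, splitting the dump is simple set arithmetic. A minimal Python sketch; partition_vocabulary and the two hand-made input lists are hypothetical, and anything in neither list is treated as an extra synonym:

```python
def partition_vocabulary(dictionary, primary, boilerplate):
    """Split dictionary words into primary, synonym and boilerplate groups.

    `primary` and `boilerplate` are hand-curated lists (an assumption of
    this sketch). A word in both lists counts as primary.
    """
    known = set(dictionary)
    primary_words = known & set(primary)
    boilerplate_words = (known & set(boilerplate)) - primary_words
    synonym_words = known - primary_words - boilerplate_words
    return {
        "primary": sorted(primary_words),
        "synonym": sorted(synonym_words),
        "boilerplate": sorted(boilerplate_words),
    }
```

With the examples from the list above, "lit", "lighted", "unlit" and the number words would go into the boilerplate list unless your game actually uses them.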
I’m now up to two extensions “by Otis T Dog” in my external dir (the other being Unavailable Things).
Thanks @ArdiMaster, that regexp could well be useful, but @otistdog wow you nailed it, thanks!
I was a little pleased to see my game understands more than 1,000 words.
This is totally a good idea. A fair amount of work but probably worth doing.
This is perhaps more annoying than the working solution you have already… but worth noting that it’s possible to dump the dictionary of any Zcode file, since it’s stored in a fairly structured way. (I’m working on a Z-machine that takes handwriting input, where I can’t modify the input stories, and this approach has been working well.) ZTools might be helpful if you end up going down that road?
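For anyone curious what dumping the dictionary from the story file involves, here is a minimal Python sketch. It assumes a version 3 or later story file that uses the default alphabet table (a custom table in later versions is not handled), and it ignores abbreviations, which should not occur in dictionary entries. The function names are illustrative only.

```python
A0 = "abcdefghijklmnopqrstuvwxyz"
A1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Default A2; slot 0 (Z-character 6) is the ZSCII escape, handled separately.
A2 = " \n0123456789.,!?_#'\"/\\-:()"
ALPHABETS = (A0, A1, A2)


def decode_entry(raw: bytes) -> str:
    """Decode the encoded text of one dictionary entry (versions 3+)."""
    zchars = []
    for i in range(0, len(raw), 2):
        word = (raw[i] << 8) | raw[i + 1]
        # Three 5-bit Z-characters per 16-bit word; the top bit marks the end.
        zchars += [(word >> 10) & 31, (word >> 5) & 31, word & 31]

    out, shift, i = [], 0, 0
    while i < len(zchars):
        z = zchars[i]
        i += 1
        if z in (4, 5):              # shift the next character to A1 or A2
            shift = z - 3
            continue                 # trailing 5s are padding and print nothing
        if shift == 2 and z == 6:    # escape: next two Z-chars form a ZSCII code
            if i + 1 < len(zchars):
                code = (zchars[i] << 5) | zchars[i + 1]
                out.append(chr(code) if 32 <= code < 127 else "?")
            i += 2
        elif z >= 6:
            out.append(ALPHABETS[shift][z - 6])
        elif z == 0:
            out.append(" ")
        shift = 0                    # Z-chars 1-3 (abbreviations) are ignored
    return "".join(out)


def dump_dictionary(story: bytes) -> list[str]:
    """Return every dictionary word stored in a Z-code story file."""
    version = story[0]
    if version < 3:
        raise ValueError("this sketch only handles versions 3 and later")
    text_len = 4 if version == 3 else 6          # encoded text bytes per entry
    da = int.from_bytes(story[8:10], "big")      # dictionary address from header
    nsep = story[da]                             # word separator count
    entry_len = story[da + nsep + 1]             # full entry size in bytes
    count = abs(int.from_bytes(story[da + nsep + 2:da + nsep + 4], "big", signed=True))
    first = da + nsep + 4
    return [
        decode_entry(story[first + n * entry_len:first + n * entry_len + text_len])
        for n in range(count)
    ]
```

Usage would be along the lines of dump_dictionary(open("story.z5", "rb").read()), where the file name is just a placeholder. The words come back truncated to the story's dictionary resolution, the same as with the in-game approaches.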
I’m not sure if there’s any equivalent way to grab the dictionary from a Glulx game. (From my brief look at the spec, Glulx seems more general purpose and thus rather more difficult to extract data from.)
Incidentally, my experience from using this data for handwriting recognition is that:
- You want a fairly strong bias towards commands that the player will either use constantly (directions, “look”) or use to get help (“help” or “about”, mostly), since if they don’t work it’s likely to constantly irritate the user. Thankfully a pretty short list.
- For other words, a modest bias is helpful, but not so strong that it makes it hard to recognize words outside the dictionary. (Users don’t know the dictionary, and they probably shouldn’t, so they can’t tell that recognition failed because a word isn’t in the dictionary… it just feels like the thing not working well.)
I haven’t yet found much payoff from going any finer than that. Your mileage may vary, of course!
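In code, the two tiers described above might look like the following minimal Python sketch. The function name and the weight values (3.0 and 1.5) are placeholders invented for illustration, not numbers from any real recognizer, so they would need tuning against whatever engine you use.

```python
def bias_weights(dictionary, core_commands, strong=3.0, modest=1.5):
    """Give core commands a strong bias and other dictionary words a modest one."""
    core = set(core_commands)
    return {word: (strong if word in core else modest) for word in dictionary}


def weight_for(word, weights, default=1.0):
    """Words outside the dictionary stay at the default (no bias, no penalty)."""
    return weights.get(word, default)
```

Leaving unknown words at the default rather than penalizing them is what keeps out-of-dictionary input recognizable.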