How do AnalyseToken and PrepositionChain work?

nelsnelson · September 4, 2014, 3:13am

I am having some trouble understanding how the AnalyseToken and PrepositionChain routines work.

[code]
[ AnalyseToken token;
    if (token == ENDIT_TOKEN) {
        found_ttype = ELEMENTARY_TT;
        found_tdata = ENDIT_TOKEN;
        return;
    }
    found_ttype = (token->0) & $$1111;
    found_tdata = (token+1)-->0;
];

[ PrepositionChain wd index;
    if (line_tdata-->index == wd) return wd;
    if ((line_token-->index)->0 & $20 == 0) return -1;
    do {
        if (line_tdata-->index == wd) return wd;
        index++;
    } until ((line_token-->index == ENDIT_TOKEN) || (((line_token-->index)->0 & $10) == 0));
    return -1;
];
[/code]

Specifically, I’m having difficulty with this line from AnalyseToken:

found_ttype = (token->0) & $$1111;

and also these sections from PrepositionChain:

(line_token-->index)->0 & $20
(line_token-->index)->0 & $10

What exactly is happening with these bitwise operations? What is the value in (line_token-->index)->0 supposed to represent?

When modifying the parser code to print out this value during parsing of commands like ‘say hi to mary’, I see this value:

What does this value represent?

zarf · September 4, 2014, 4:06pm

The UnpackGrammarLine() function and its subordinate functions read the grammar table, which is generated by the I6 compiler. The grammar table’s structure is described in inform-fiction.org/source/tm/chapter8.txt (section 8.6).

Note that modern games always use the “GV2” option. Also, in Glulx, the format is slightly different (see eblong.com/zarf/glulx/technical.txt).

So the line in AnalyseToken takes apart the token-type byte (as described in the first link), keeping only its bottom four bits. found_ttype winds up as 1 for an elementary token (such as “noun” or “held”), 2 for a fixed word, 3 for a “noun=Routine” filter, 4 for an attribute filter (usually “animate”), 5 for a “scope=Routine” token, or 6 for a general parsing routine.
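As a rough illustration of that masking step, here is a small sketch in Python (rather than Inform 6, only so that it can be run on its own). The descriptions follow the Technical Manual table; the helper name `token_type` and the two sample byte values are made up for the example.

```python
# Low four bits of the token-type byte select the token type (GV2 grammar).
# Descriptions follow the table in section 8.6 of the Technical Manual.
TOKEN_TYPES = {
    1: "elementary token",
    2: "preposition (fixed word)",
    3: "noun = Routine filter",
    4: "attribute filter",
    5: "scope = Routine",
    6: "general parsing routine",
}

def token_type(type_byte):
    """Hypothetical helper mirroring `(token->0) & $$1111` in AnalyseToken."""
    return type_byte & 0b1111  # $$1111 in Inform 6 is binary 1111, i.e. 15

# Sample type bytes, chosen only for illustration.
for b in (0x01, 0x42):
    t = token_type(b)
    print(b, "->", t, TOKEN_TYPES.get(t, "illegal"))
```

Whatever is stored in the upper four bits is simply discarded by the mask, which is why found_ttype is always a small number.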

The PrepositionChain routine is going through a list of fixed word alternatives (e.g. “in”/“on”/“into”/“onto”). These are marked by the “next two bits” (as described in the first link).
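A loose Python model of that loop may make the control flow easier to follow. It is only a sketch under some assumptions: tokens are modelled as `(type_byte, data)` pairs instead of addresses, Python strings stand in for dictionary word addresses, `ENDIT` stands in for ENDIT_TOKEN, and the reading of the two flag bits ($20 = alternatives follow this token, $10 = this token continues a chain) is inferred from how PrepositionChain tests them.

```python
ENDIT = None  # stand-in for ENDIT_TOKEN (assumption for this sketch)

def preposition_chain(tokens, wd, index):
    """Return wd if it matches the token at `index` or one of its
    alternatives, else -1. Mirrors the shape of PrepositionChain."""
    type_byte, data = tokens[index]
    if data == wd:
        return wd
    if (type_byte & 0x20) == 0:       # not the start of a list of alternatives
        return -1
    while True:
        if tokens[index][1] == wd:
            return wd
        index += 1
        nxt = tokens[index]
        # Stop at the end of the line, or at a token outside the chain.
        if nxt is ENDIT or (nxt[0] & 0x10) == 0:
            return -1

# Hypothetical grammar line: 'in'/'on'/'into'/'onto' noun
line = [(0x62, "in"), (0x72, "on"), (0x72, "into"), (0x52, "onto"),
        (0x01, 0), ENDIT]

print(preposition_chain(line, "onto", 0))   # matches the last alternative
print(preposition_chain(line, "under", 0))  # no alternative matches
```

A single preposition such as 'to' has neither flag set, so the routine either matches on its first line or returns -1 immediately.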

nelsnelson · September 4, 2014, 4:16pm

So I added a bunch of other debugging prints to the AnalyseToken and PrepositionChain routines, and this is what showed up:

[code]
>say hi to mary
AnalyseToken token: 6219
AnalyseToken (token->0): 1
AnalyseToken $$1111: 15
AnalyseToken token: 6222
AnalyseToken (token->0): 66
AnalyseToken $$1111: 15
AnalyseToken token: 6225
AnalyseToken (token->0): 1
AnalyseToken $$1111: 15
AnalyseToken token: 6219
AnalyseToken (token->0): 1
AnalyseToken $$1111: 15
PrepositionChain $20: 32
PrepositionChain index: 1
PrepositionChain line_token: 4224
PrepositionChain (line_token-->index): 6222
PrepositionChain (line_token-->index)->0: 66
PrepositionChain (line_token-->index)->0 & $20: 0
AnalyseToken token: 6222
AnalyseToken (token->0): 66
AnalyseToken $$1111: 15
AnalyseToken token: 6225
AnalyseToken (token->0): 1
AnalyseToken $$1111: 15
(Mary)
There is no reply.

[/code]

My guess is that 6222 must be the preposition token, and that it is the address of the dictionary word ‘to’?

I still don’t really understand what 66 represents, but it did show up in the AnalyseToken debug output as the second token. It’s always 66, though, even when the topic is ‘hello’ instead of ‘hi’.

zarf · September 4, 2014, 4:42pm

The “token” argument of AnalyseToken is the address of the grammar token. (token->0) is a packed field of three values.

Have you looked at the Inform Technical Manual page? The 66 breaks down into 01 / 00 / 0010 (binary). The third field is type 2, meaning preposition; the second indicates that it’s a single preposition rather than a list of alternatives; the first field is not used in parsing. The preposition word address is the two (or four) bytes after the 66 byte; it winds up in found_tdata.
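To make the breakdown concrete, here is a minimal Python sketch that splits the type byte into its three fields and reads the data word that follows it. The function name `unpack_token` and the address 0x1A2B are invented for the example, and it assumes the data word is stored big-endian, as both the Z-machine and Glulx do. The token addresses in your debug output (6219, 6222, 6225) are three apart, which fits one type byte plus a two-byte data word.

```python
def unpack_token(mem, addr, word_size=2):
    """Split the grammar token at `addr` into (top, alternatives, type, data).

    word_size is 2 for the Z-machine and 4 for Glulx.
    """
    b = mem[addr]
    top = (b >> 6) & 0b11      # top two bits: kind of data stored
    alt = (b >> 4) & 0b11      # next two bits: alternatives flags
    ttype = b & 0b1111         # bottom four bits: what found_ttype receives
    # What found_tdata receives: the word right after the type byte.
    data = int.from_bytes(mem[addr + 1 : addr + 1 + word_size], "big")
    return top, alt, ttype, data

# 66 followed by a made-up dictionary address, 0x1A2B.
mem = bytes([66, 0x1A, 0x2B])
top, alt, ttype, data = unpack_token(mem, 0)
print(f"{top:02b} / {alt:02b} / {ttype:04b}, data = {data:#06x}")
```

The byte is 66 for any single preposition, whatever the word is; only the data word after it changes.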

nelsnelson · September 4, 2014, 9:55pm

Oh, I see now, I think. So the token byte is sort of like a tuple, just packed into a conventional set of bits to reduce memory consumption.

Using (address) on line_tdata-->index seems to print out the preposition from the grammar line.

I was not familiar with chapter 8 of the Technical Manual, no.

I think I see what you are talking about now, though, thanks zarf.

Type  Means                     Data contains            Top bits
0     illegal (never compiled)
1     elementary token          0 "noun"                 00
                                1 "held"
                                2 "multi"
                                3 "multiheld"
                                4 "multiexcept"
                                5 "multiinside"
                                6 "creature"
                                7 "special"
                                8 "number"
                                9 "topic"
2     'preposition'             dictionary address       01
3     noun = Routine            routine packed address   10
4     attribute                 attribute number         00
5     scope = Routine           routine packed address   10
6     Routine                   routine packed address   10

So the top two bits ‘01’ in ‘01 / 00 / 0010’ are what indicate that found_tdata is a preposition, right?

zarf · September 4, 2014, 9:58pm

The top two bits indicate that found_tdata is a dictionary word address. The bottom four bits indicate that it is a preposition. A preposition always has a dict word in found_tdata, so the parser doesn’t really have to look at the top two bits (and it doesn’t).

Ron_Newcomb · September 7, 2014, 4:26pm

I rewrote that parser in Inform 7 as the extension “Original Parser”, which might help you understand how it works, since Inform 7 is rather English-like. It’s on the Inform 7 extensions page.

But yeah, zarf is right: bitfields and bitwise ANDing.