Working with the tool UnZ for unpacking z-files got me thinking about the grammar version 2 and ways of making it even more compact by only use 2 bytes per token and reintroducing the parsing routines table and adjectives table.
Using the definition of the current grammar version 2 the Inform Technical Manual, Ch. 8 as a starting point, this is a first draft:
GV3 is a variant of GV2 with a more compact data structure. GV3 only use
2 bytes for each token and removes the need for the ENDIT marker. In GV3
an individual grammar table has the format:
<number of grammar lines> 1 byte
followed by that many grammar lines. A grammar line have the form:
<action number> <token 1> ... <token N>
----2 bytes---- -2 bytes- -2 bytes-
The action number is actually contained in the bottom 10 bits of the word
given first: the top five contains the number of tokens in this grammar
line, which leaves.
action_number & $400
as a bit meaning "reverse the order of the first and second parameters
if this action is to be chosen".
There can be anything from 0 to 31 tokens, and each occupies two bytes,
arranged as:
<token type> <token data>
-- byte ---- --- byte ---
Token type bytes are divided into the top two bits, the next two and the
bottom four.
The "next two bits" are used to indicate alternatives. In a sequence of
tokens
T1 / T2 / T3 / ... / Tn
then T1 will have $$10 in its "next two bits", and each of T2 to Tn will
have $$01. Tokens not inside lists of alternatives always have $00. (Note
that at present only prepositions are allowed as alternatives, but the
format is designed to open the possibility of extending this to all tokens.)
The bottom four are the "type" of the token. The top two indicate what kind
of data is contained in the token data. Strictly speaking this could be
deduced from the bottom six bits, but it's convenient for making backpatching
GV3 tables a simple matter within the compiler.
Type Means Data contains Top bits
0 illegal (never compiled)
1 elementary token 0 "noun" 00
1 "held"
2 "multi"
3 "multiheld"
4 "multiexcept"
5 "multiinside"
6 "creature"
7 "special"
8 "number"
9 "topic"
2 'preposition' adjective number 01
3 noun = Routine parsing-routine-number 10
4 attribute attribute number 00
5 scope = Routine parsing-routine-number 10
6 Routine parsing-routine-number 10
GV3 identify a particular preposition or parsing-routine using a numbering system.
GV3 numbers parsing-routines upwards from 0 to 255, in order of first use.
A separate table translates these into routine packed addresses: the
"preactions" table. The preactions table is a simple --> array.
Prepositions are also identified by their "adjective number". Adjective
numbers count downwards from 255 in order of first use. They are translated
back into dictionary words using the "adjectives table".
The adjectives table starts with two bytes containing the number of
"adjectives" in the table. Each "adjective" entry then are three bytes:
<dictionary address of word> <adjective number>
----2 bytes----------------- ---- byte---------
These entries are stored in reverse order (i.e., lowest adjective number
first). The constant #adjectives_table refer to this table.
As in GV2, fake actions in GV3 are numbered from 4096 upwards.
Note that although GV3 reintroduces the preaction and adjective table,
the omission of the ENDIT marker and two byte tokens instead of three
byte, should produce a more economical grammar table.
Comparison table between the different grammar versions:
Limit in:
GV1 GV2 GV3
Prepositions per game 76 unlimited 256
Parsing routines (general ones,
noun= filters, scope= routines
all put together) per game 32 unlimited 256
Tokens per grammar line 6 unlimited 31
Actions per game 256 4096 1024
Inform verbs per game 256 256 256
(The manual says that there’s a possible 4096 actions per game but is that correct, because the action number is only 10 bits?)
I don’t think this would mean big changes either to the compiler or the relevant libraries (PunyInform, nudge, nudge) because the basic format is still there.
If there is any interest I’m happy to test out modifications to both compiler and library.
What do you think? Is it worth it, and are there more suggestions?
The prepositions could be numbered from 0 upward and then let the adjective table a simple → array (2 bytes per entry), where the position indicates the preposition number.
Repurpose top bits for preposition and parsing-routine and use them for numbering, setting the max number of each to 1024 instead of 256
@fredrik also suggested that it would be nice to find room in a grammar line definition to indicate that this individual line is meta.
The basic GV1 format is still there but it hasn’t been seriously tested in a very long time. Like z3 support five years ago, I expect that many bugs will turn up once you get into it.
Do you have an estimate of how much memory this would save for a typical game?
I havn’t counted in detail but I’ll guess normally somewhere between 500-1000 bytes (1 byte per grammar line and then 1 byte per token, expect for preposition and parsing routines that instead will cost an extra byte). I’ll try to make a calculated example tomorrow.
(To be clear, this would be for Z-code only. Glulx only supports GV2 and I don’t see any reason to change that.)
I’m happy to look at an I6 compiler change if you write it up. Try to minimize code churn and maximize readability, please :)
It would be good to have an experimental library change at the same time, so that we can tell whether the updated compiler is producing usable output. That doesn’t have to be clean or readable – a hacked-up 6.12 library that supports GV3 would be sufficient. We could compile Advent.inf and play through it and say yes, this grammar table works.
I havn’t seen any and have a hard time seeing that many being needed. Galatea has 120, Anchorhead has 144 and Jigsaw has 148.
Besides changes to the format the Inform6 language also will need a small change to accommodate a meta designation on an individual syntax. Suggestions?
The documentation for GV2 says the following regarding alternatives for prepositions:
In reality it’s $10 to start a list of alternatives, each member inside the list have $11 and the last alternative is marked $01. Are libraries dependent on this sequence and should it be preserved in future GVs?
Make it as similar to the original as possible, unless you have a reason to change it. If changing it saves space, change it. I can’t remember seeing any code outside the parser which cares about the details of how grammar lines are stored, but the parser may require less modification if the same values are used as for the old grammar version.
In reality it’s $10 to start a list of alternatives, each member inside the list have $11 and the last alternative is marked $01.
Huh. The library routine that uses this bits is PrepositionChain(); it looks like it’s the same in I6 and I7. I think it doesn’t care – that is, it only looks at the low bit of “next two” when scanning T2 to Tn. However, we should keep the existing compiler behavior wherever possible, so I say the Tech Manual comment is just a mistake.
To be completely consistent, we could allow reverse at the verb level, which sets the reverse flag on each individual syntax line? (Same handling as meta, I mean.)
Then we could also think about negative flags:
Verb meta 'restart'
* -> Restart
* noun -> SwitchOn ~meta;
Verb reverse 'show'
* creature held -> Show
* held 'to' creature -> Show ~reverse;
I vote against both of these ideas. :) But I thought I’d mention them for the sake of completeness.
While I think reverse on the verb level could do more harm than good, I think a negative meta flag is a good thing. It would mean that you could easily extend an existing meta verb with with non-meta grammar lines. E.g. “save duckling” or “restore painting”.
In Inform 6, you can set the meta global variable in an action routine to say that you’ve decided that this is now a meta action. If you know that you’d always set it for a certain verb, you can mark the verb as meta instead. This tells the parser to set the meta global variable so you don’t have to.
So, if you only set meta for some grammar lines, it should not be set on the verb level. If you extend the grammar with a line that has ~meta, the meta flag on the verb should be unset.