Is the world ready for a new Inform6 grammar format?

heasm66 · March 1, 2024, 1:15pm

Working with the tool UnZ for unpacking z-files got me thinking about the grammar version 2 and ways of making it even more compact by only use 2 bytes per token and reintroducing the parsing routines table and adjectives table.

Using the definition of the current grammar version 2 the Inform Technical Manual, Ch. 8 as a starting point, this is a first draft:

GV3 is a variant of GV2 with a more compact data structure.  GV3 only use
2 bytes for each token and removes the need for the ENDIT marker.  In GV3 
an individual grammar table has the format:

    <number of grammar lines>          1 byte

followed by that many grammar lines.  A grammar line have the form:

    <action number>  <token 1> ... <token N>
    ----2 bytes----  -2 bytes-     -2 bytes-

The action number is actually contained in the bottom 10 bits of the word
given first: the top five contains the number of tokens in this grammar 
line, which leaves.

    action_number & $400

as a bit meaning "reverse the order of the first and second parameters
if this action is to be chosen".

There can be anything from 0 to 31 tokens, and each occupies two bytes, 
arranged as:

    <token type>   <token data>
    -- byte ----   --- byte ---

Token type bytes are divided into the top two bits, the next two and the
bottom four.

The "next two bits" are used to indicate alternatives.  In a sequence of
tokens

    T1 / T2 / T3 / ... / Tn

then T1 will have $$10 in its "next two bits", and each of T2 to Tn will
have $$01.  Tokens not inside lists of alternatives always have $00.  (Note
that at present only prepositions are allowed as alternatives, but the
format is designed to open the possibility of extending this to all tokens.)

The bottom four are the "type" of the token.  The top two indicate what kind
of data is contained in the token data.  Strictly speaking this could be
deduced from the bottom six bits, but it's convenient for making backpatching
GV3 tables a simple matter within the compiler.

    Type  Means                       Data contains              Top bits
    0     illegal (never compiled)
    1     elementary token            0   "noun"                 00
                                      1   "held"
                                      2   "multi"
                                      3   "multiheld"
                                      4   "multiexcept"
                                      5   "multiinside"
                                      6   "creature"
                                      7   "special"
                                      8   "number"
                                      9   "topic"
    2     'preposition'               adjective number         	 01
    3     noun = Routine              parsing-routine-number     10
    4     attribute                   attribute number           00
    5     scope = Routine             parsing-routine-number     10
    6     Routine                     parsing-routine-number     10

GV3 identify a particular preposition or parsing-routine using a numbering system.  
GV3 numbers parsing-routines upwards from 0 to 255, in order of first use.  
A separate table translates these into routine packed addresses: the 
"preactions" table.  The preactions table is a simple --> array.

Prepositions are also identified by their "adjective number".  Adjective
numbers count downwards from 255 in order of first use.  They are translated
back into dictionary words using the "adjectives table".

The adjectives table starts with two bytes containing the number of
"adjectives" in the table. Each "adjective" entry then are three bytes:

    <dictionary address of word>  <adjective number>
    ----2 bytes-----------------  ---- byte--------- 

These entries are stored in reverse order (i.e., lowest adjective number 
first).  The constant #adjectives_table refer to this table.


As in GV2, fake actions in GV3 are numbered from 4096 upwards.


Note that although GV3 reintroduces the preaction and adjective table,
the omission of the ENDIT marker and two byte tokens instead of three 
byte, should produce a more economical grammar table.

Comparison table between the different grammar versions:

                                        Limit in: 
										GV1    GV2          GV3

    Prepositions per game               76     unlimited    256
    Parsing routines (general ones,
       noun= filters, scope= routines
       all put together) per game       32     unlimited    256
    Tokens per grammar line             6      unlimited    31
    Actions per game                    256    4096        1024
    Inform verbs per game               256    256          256

(The manual says that there’s a possible 4096 actions per game but is that correct, because the action number is only 10 bits?)

I don’t think this would mean big changes either to the compiler or the relevant libraries (PunyInform, nudge, nudge) because the basic format is still there.

If there is any interest I’m happy to test out modifications to both compiler and library.

What do you think? Is it worth it, and are there more suggestions?

heasm66 · March 1, 2024, 1:20pm

Two things I’ve already thought of are:

The prepositions could be numbered from 0 upward and then let the adjective table a simple → array (2 bytes per entry), where the position indicates the preposition number.
Repurpose top bits for preposition and parsing-routine and use them for numbering, setting the max number of each to 1024 instead of 256

@fredrik also suggested that it would be nice to find room in a grammar line definition to indicate that this individual line is meta.

zarf · March 1, 2024, 3:49pm

The basic GV1 format is still there but it hasn’t been seriously tested in a very long time. Like z3 support five years ago, I expect that many bugs will turn up once you get into it.

Do you have an estimate of how much memory this would save for a typical game?

heasm66 · March 1, 2024, 8:32pm

I havn’t counted in detail but I’ll guess normally somewhere between 500-1000 bytes (1 byte per grammar line and then 1 byte per token, expect for preposition and parsing routines that instead will cost an extra byte). I’ll try to make a calculated example tomorrow.

heasm66 · March 2, 2024, 10:26am

I did a theoretical calulation on Curses (release 16). The result should be something like:

                               GV1     GV2     GV3
A. # grammar lines             291     280     280
B. # tokens                      -     450     450
C. # unique prepositions        34       -      34
D. # unique parsing routines     7       -       7
E. # verb actions              146     149     149

Theoretical size:
GV1: A*8+E+C*4+D*2 = 2.624 bytes
GV2: (A+B)*3+E     = 2.339 bytes  
GV3: (A+B+C+D)*2+E = 1.691 bytes

zarf · March 2, 2024, 6:33pm

Well, it sounds good.

(To be clear, this would be for Z-code only. Glulx only supports GV2 and I don’t see any reason to change that.)

I’m happy to look at an I6 compiler change if you write it up. Try to minimize code churn and maximize readability, please :)

It would be good to have an experimental library change at the same time, so that we can tell whether the updated compiler is producing usable output. That doesn’t have to be clean or readable – a hacked-up 6.12 library that supports GV3 would be sufficient. We could compile Advent.inf and play through it and say yes, this grammar table works.

zarf · March 2, 2024, 6:38pm

Glulx only supports GV2 and I don’t see any reason to change that.

Hm, I wonder if it’s worth promoting the “meta flag for grammar line” idea to Glulx. Do you want to get that into the GV3 plan for Z-code?

fredrik · March 4, 2024, 12:09pm

Has any Z-code game ever come close to 500 actions? If not, maybe we could limit the action number to 9 bits, and use one bit for the meta flag?

heasm66 · March 4, 2024, 12:57pm

I havn’t seen any and have a hard time seeing that many being needed. Galatea has 120, Anchorhead has 144 and Jigsaw has 148.

Besides changes to the format the Inform6 language also will need a small change to accommodate a meta designation on an individual syntax. Suggestions?

fredrik · March 4, 2024, 1:14pm

Make it similar to the reverse keyword, since both set flags on grammar line level, i.e.

Verb 'help'
  *                -> Hint meta
  * creature       -> Help;

If they can only come in one order, I think first meta then reverse feels more natural.

heasm66 · March 4, 2024, 1:17pm

Sounds reasonable.

The order shouldn’t matter.

Should meta on verb level also set this flag on each idividual syntax line (dictionary word still get the flag)?

Should a meta flag on a line be ignored or elevated to verb if GV1 or GV2 are compiled?

heasm66 · March 4, 2024, 1:31pm

The documentation for GV2 says the following regarding alternatives for prepositions:

In reality it’s $10 to start a list of alternatives, each member inside the list have $11 and the last alternative is marked $01. Are libraries dependent on this sequence and should it be preserved in future GVs?

fredrik · March 4, 2024, 2:08pm

Make it as similar to the original as possible, unless you have a reason to change it. If changing it saves space, change it. I can’t remember seeing any code outside the parser which cares about the details of how grammar lines are stored, but the parser may require less modification if the same values are used as for the old grammar version.

zarf · March 4, 2024, 3:34pm

Accepting both orders will not be a problem.

Yes, definitely.

Report it as an error. (It’s an error now.)

zarf · March 4, 2024, 3:43pm

(That’s Inform Technical Manual chapter 8.6.)

In reality it’s $10 to start a list of alternatives, each member inside the list have $11 and the last alternative is marked $01.

Huh. The library routine that uses this bits is PrepositionChain(); it looks like it’s the same in I6 and I7. I think it doesn’t care – that is, it only looks at the low bit of “next two” when scanning T2 to Tn. However, we should keep the existing compiler behavior wherever possible, so I say the Tech Manual comment is just a mistake.

zarf · March 4, 2024, 11:31pm

Going a wee bit farther:

To be completely consistent, we could allow reverse at the verb level, which sets the reverse flag on each individual syntax line? (Same handling as meta, I mean.)

Then we could also think about negative flags:

Verb meta 'restart'
    *  -> Restart
    * noun  -> SwitchOn ~meta;

Verb reverse 'show'
    * creature held  -> Show
    * held 'to' creature  -> Show ~reverse;

I vote against both of these ideas. :) But I thought I’d mention them for the sake of completeness.

heasm66 · March 5, 2024, 6:58am

It’s not that useful but I like the symmetry of reverse on verb level and negative flags, though.

fredrik · March 5, 2024, 7:22am

While I think reverse on the verb level could do more harm than good, I think a negative meta flag is a good thing. It would mean that you could easily extend an existing meta verb with with non-meta grammar lines. E.g. “save duckling” or “restore painting”.

heasm66 · March 5, 2024, 7:38am

Valid point. How important is the meta-flag on the dictionary word and what rules should determine when to set it when not all grammar lines are meta?

fredrik · March 5, 2024, 7:54am

In Inform 6, you can set the meta global variable in an action routine to say that you’ve decided that this is now a meta action. If you know that you’d always set it for a certain verb, you can mark the verb as meta instead. This tells the parser to set the meta global variable so you don’t have to.

So, if you only set meta for some grammar lines, it should not be set on the verb level. If you extend the grammar with a line that has ~meta, the meta flag on the verb should be unset.