Seeking Dialog source code for optimization experiment

We’ve added a bunch of new features to the Dialog compiler in the last year, and I’m very happy with how it looks now! The $8000 bug in particular removed the most important RAM limitation from the Z-machine backend, allowing much larger games to be made.

But those larger games are starting to hit other limits now. Wise-Woman’s Dog is at about 97% of the total file size limit for Z-machine version 8, most of that in ROM—strings and routines. My next priority is getting that ROM usage down through new optimizations.

There’s not much to be done for routines, because the compiler guts are horrifically complicated and I don’t grok them well enough to make fundamental changes there. For strings, though, the Z-machine has two optimizations built in that Dialog doesn’t currently use: abbreviations and custom alphabet tables.

For the abbreviations, I want to start with the most basic possible implementation, so people can go in and make further enhancements later. In particular, I want to start with:

  • A default stock of abbreviations, not customized per game but built into the compiler
  • A greedy matching algorithm to apply them, not Wagner’s perfect match algorithm

Once this is implemented, we (probably meaning other people, not me) can go back and add other enhancements, like automatically calculating abbreviations for each game and using a better matching algorithm. But the basic implementation should help in the meantime, and it’ll give people a base to build from.

Which means first, I need a basic set of abbreviations that should be broadly applicable across games: things like " the " and "You " that are common in library messages and parser IF descriptions. There are tools like ZAbbrev to compute these automatically, but they need a set of strings to optimize.

And that’s where you come in! I’m going to be making a debug build of the compiler that dumps all its ZSCII strings to an external file, compiling a whole bunch of different Dialog games with that debug build, sticking the files together, and feeding that to ZAbbrev to make the standard abbreviation set. I already have all of my games—Miss Gosling, Familiar Problems, Wise-Woman’s Dog, Stage Fright—as well as Impossible Stairs, which is open-source. But if you’d like to contribute to this endeavor, let me know, and you can either send me your full source, or build the debug compiler and send me the output. My goal is to get a representative sample of Dialog games so that the default abbreviations will be widely applicable to anything people write.

If you’re interested, let me know here or in DMs. I don’t know when this project will really take off (possibly not for a month or two), but I think once a basic implementation is in place, people will be well-equipped to improve it.

1 Like

FWIW I just implemented abbreviations in my compiler a few hours ago.

I’d recommend having Dialog produce an Inform-style gametext.txt file. The format is really simple – each line starts with a letter, a colon, a space, then the text. I for information, O for objects, H for inline text was enough for my purposes.

Then you can feed it to either version (C# or C) of ZAbbrev. The output is a bunch of lines like

Abbreviate “The text to abbreviate” ! comments about savings

which are then simple enough to process as input to the compiler.

In my compiler, I have a 95x95 2D array indexed by the first two characters in the abbreviation. If there are multiple abbreviations starting with the same two characters, there is an additional 96 element array of next indices to link them all up.

That way, actually finding the abbreviation takes very little time. I assume longer abbreviations are listed earlier so we don’t need to use an exhaustive search.

Hope this helps,

-Dave

1 Like

That’s the sort of thing I’m going for! The problem is that Dialog isn’t really designed architecturally to read optimization data from an external file, so I want to start with a built-in default set of abbreviations to make sure everything works first. If that’s a success, other people can potentially make it read abbreviation data from another file, or (even better) compute them itself.

Dialog doesn’t use object names for anything except debugging, so everything is going to be H, but I can definitely make it print all the strings in that format. There’s a function called encode_chars that takes ZSCII text and converts it to pentets, and that’s where I’m planning to hook in: first, to dump all the strings passed to that routine, and then to implement the abbreviating.

Here’s the abbreviation matching code at the top of my encoding loop:

	while (srcSize && (!destSize || offset < destSize)) {
		if (!forDict && srcSize>=2 && src[0]>=32 && src[0]<127 && src[1]>=32 && src[1]<127) {
			uint8_t j = abbreviations_lut[src[0]-32][src[1]-32];
			while (j != 255) {
				if (!strncmp(abbreviations[j],src,abbreviation_lengths[j]))
					break;
				else
					j = abbreviations_next[j];
			}
			if (j != 255) {
				storeCode((j>>5)+1);
				storeCode(j&31);
				src += abbreviation_lengths[j];
				srcSize -= abbreviation_lengths[j];
				continue;
			}
		}
		—srcSize; // rest of encoding follows...

And here’s the code that actually builds the LUT after I’ve encoded all of the abbreviations

	const char *abbreviations[96];
	uint8_t abbreviation_lengths[96];
	uint8_t abbreviations_lut[95][95]; // memset this to 255
	uint8_t abbreviations_next[96];    // memset this to 255
	uint8_t abbreviation_count;

	for (uint8_t i=0; i<abbreviation_count; i++) {
		// Use a 2D LUT to identify the first two characters of the abbreviation.
		// If there's more than one, they form a linked list.
		auto *al = &abbreviations_lut[abbreviations[i][0]-32][abbreviations[i][1]-32];
		while (*al != 255)
			al = &abbreviations_next[*al];
		*al = i;
	}

If you make it a 127x127 array instead of 95x95 you can include newlines pretty easily.

1 Like

Anyway, I sort of derailed this thread with implementation details most don’t care about. I’m happy to delete my messages if they’re distracting.

Oh no, it’s good! I’m currently planning to go even simpler than that, since Dialog is not a fast or low-resource compiler already: go linearly through the list of abbreviations from longest to shortest, strncmp each one, apply it and skip ahead if relevant. My priority is just to get something working, and then other people are welcome to improve it; but it’s easier to improve than to write in the first place.

1 Like

I just ran into this myself. Originally I planned to declare the abbreviations in the source code (eventually placing them in an include file that could be replaced procedurally), but then I ran into an interesting catch-22.

My compiler operates in two passes. The first pass ignores string literals since otherwise they have to be allocated on the heap (this was a memory leak I just fixed earlier today). But the abbreviations need to be in place in time for the final text processing in the second pass! In my case it was easier then to specify a command line option naming the abbreviations file that is then read separately.

Anyway, both my compiler and interpreter are intended to run in relatively constrained environments (like Raspberry pi Pico or ESP-32, each of which have plenty of code space but only 256/512k of RAM)

-Dave

You actually don’t need to do an z-text output function. unz can extract the gametext.txt-file from any compiled story-file (z-macine, of course). The gametext-file can then be used as input for zabbrev.

3 Likes

Oh, even better! I’d still prefer to have authors’ explicit permission, but then there’s no need to send me your source code; I’ll just grab it from the files on IFDB.

I’m curious if that will actually make a measurable difference in compile time. Running 96 strcmp’s on every character of printable text in the game might add up to a signifcant amount of processing time.

That’s where a 2D array can help – it’s small enough to fit in the L1/L2 cache and quickly eliminates many candidates. On the other hand, assuming your library version of strcmp is reasonably-well optimized, it’s going to reject mismatches pretty quickly as well.

Another option is to have a 1D array, while still keeping the linked list of candidates, but I’m probably overthinking all of this.

-Dave

1 Like

That’s a good point! It looks like Inform 6 uses a 1D array of offsets into the abbreviation array, which is alphabetized. That shouldn’t be hard to implement, and if it’s fast enough for I6 (which runs on old hardware) it should be fast enough for us.

Here are the results on the first four games:

Abbreviate " doesn't ";                     !    64x10, saved   500
Abbreviate "Uncle Joe";                     !    53x11, saved   465
Abbreviate "already ";                      !    86x 8, saved   507
Abbreviate " through";                      !   105x 8, saved   621
Abbreviate " in the ";                      !   311x 8, saved  1857
Abbreviate " little ";                      !    67x 8, saved   393
Abbreviate " of the ";                      !   238x 8, saved  1419
Abbreviate " at the ";                      !   107x 8, saved   633
Abbreviate "to the ";                       !   376x 7, saved  1871
Abbreviate "Watson ";                       !   279x 8, saved  1665
Abbreviate " small ";                       !    73x 7, saved   356
Abbreviate "ed to ";                        !   172x 6, saved   682
Abbreviate "metal ";                        !    97x 6, saved   382
Abbreviate ", and ";                        !   269x 7, saved  1336
Abbreviate "looks ";                        !   145x 6, saved   574
Abbreviate ". You ";                        !   150x 8, saved   891
Abbreviate " this ";                        !   179x 6, saved   710
Abbreviate " again";                        !    80x 6, saved   314
Abbreviate " your ";                        !   449x 6, saved  1790
Abbreviate " that";                         !   408x 5, saved  1218
Abbreviate " says";                         !   158x 5, saved   468
Abbreviate "round";                         !   231x 5, saved   687
Abbreviate " you ";                         !   874x 5, saved  2616
Abbreviate " from";                         !   246x 5, saved   732
Abbreviate "ould ";                         !   212x 5, saved   630
Abbreviate " and ";                         !   663x 5, saved  1983
Abbreviate " some";                         !   205x 5, saved   609
Abbreviate "This ";                         !   164x 6, saved   650
Abbreviate ", but";                         !   263x 6, saved  1046
Abbreviate "It's ";                         !   321x 7, saved  1596
Abbreviate " with";                         !   411x 5, saved  1227
Abbreviate "thing";                         !   333x 5, saved   993
Abbreviate " down";                         !   157x 5, saved   465
Abbreviate "Davis";                         !    73x 6, saved   286
Abbreviate " the ";                         !  1661x 5, saved  4977
Abbreviate "house";                         !   117x 5, saved   345
Abbreviate " his ";                         !   222x 5, saved   660
Abbreviate ", the";                         !   173x 6, saved   686
Abbreviate " back";                         !   158x 5, saved   468
Abbreviate "Your ";                         !    95x 6, saved   374
Abbreviate " are ";                         !   133x 5, saved   393
Abbreviate ". The";                         !   194x 7, saved   961
Abbreviate " for";                          !   353x 4, saved   700
Abbreviate "here";                          !   476x 4, saved   946
Abbreviate "ight";                          !   304x 4, saved   602
Abbreviate " it ";                          !   364x 4, saved   722
Abbreviate "tion";                          !   309x 4, saved   612
Abbreviate "ough";                          !   175x 4, saved   344
Abbreviate "You'";                          !   170x 6, saved   674
Abbreviate " is ";                          !   419x 4, saved   832
Abbreviate "able";                          !   196x 4, saved   386
Abbreviate " the";                          !   448x 4, saved   890
Abbreviate "can ";                          !   245x 4, saved   484
Abbreviate " to ";                          !   972x 4, saved  1938
Abbreviate " of ";                          !   567x 4, saved  1128
Abbreviate "The ";                          !   307x 5, saved   915
Abbreviate "n't ";                          !   433x 5, saved  1293
Abbreviate " you";                          !   450x 4, saved   894
Abbreviate "have";                          !   249x 4, saved   492
Abbreviate "ing ";                          !   887x 4, saved  1768
Abbreviate "You ";                          !   594x 5, saved  1776
Abbreviate "Ada";                           !   170x 4, saved   334
Abbreviate "ind";                           !   333x 3, saved   330
Abbreviate " it";                           !   442x 3, saved   439
Abbreviate "ing";                           !   871x 3, saved   868
Abbreviate " in";                           !   663x 3, saved   660
Abbreviate ". I";                           !   166x 5, saved   492
Abbreviate "'s ";                           !   579x 4, saved  1152
Abbreviate " an";                           !   294x 3, saved   291
Abbreviate "way";                           !   259x 3, saved   256
Abbreviate " a ";                           !   892x 3, saved   889
Abbreviate "st ";                           !   346x 3, saved   343
Abbreviate "ver";                           !   357x 3, saved   354
Abbreviate "all";                           !   461x 3, saved   458
Abbreviate "ter";                           !   446x 3, saved   443
Abbreviate "en ";                           !   303x 3, saved   300
Abbreviate "ear";                           !   379x 3, saved   376
Abbreviate "...";                           !   115x 6, saved   454
Abbreviate "see";                           !   371x 3, saved   368
Abbreviate "es ";                           !   346x 3, saved   343
Abbreviate " on";                           !   508x 3, saved   505
Abbreviate "ent";                           !   467x 3, saved   464
Abbreviate " be";                           !   437x 3, saved   434
Abbreviate " no";                           !   397x 3, saved   394
Abbreviate "and";                           !   541x 3, saved   538
Abbreviate "re ";                           !   385x 3, saved   382
Abbreviate "he ";                           !   391x 3, saved   388
Abbreviate "rea";                           !   356x 3, saved   353
Abbreviate "out";                           !   612x 3, saved   609
Abbreviate "ly ";                           !   635x 3, saved   632
Abbreviate "ed ";                           !   570x 3, saved   567
Abbreviate "er ";                           !   635x 3, saved   632
Abbreviate "hat";                           !   333x 3, saved   330
Abbreviate ". ";                            !  1186x 3, saved  1183
Abbreviate ", ";                            !  1287x 3, saved  1284
Abbreviate "e.";                            !   346x 3, saved   343

Several of these are names that appear repeatedly in certain games, so I’ll be dropping those. The “…” is because of my own personal style. But the rest seem generally relevant. I’ll try running the experiment with a different set of four games and see if it digs up anything else good.

(I was going to wait until that second set of games had finished processing to post, but it’s currently 6% of the way through the refining process and my CPU is at 185° Fahrenheit / 85° Celsius, so I’m expecting to be here for A While.)

Second batch:

Abbreviate "points for ";                   !    53x11, saved   465
Abbreviate " direction";                    !    31x10, saved   236
Abbreviate " from the ";                    !   134x10, saved  1060
Abbreviate " into the ";                    !   145x10, saved  1148
Abbreviate " seems to ";                    !    71x10, saved   556
Abbreviate "You can't ";                    !   112x12, saved  1108
Abbreviate "Suddenly,";                     !    43x11, saved   375
Abbreviate " yourself";                     !    65x 9, saved   446
Abbreviate " continue";                     !    55x 9, saved   376
Abbreviate "something";                     !    50x 9, saved   341
Abbreviate "currently";                     !   100x 9, saved   691
Abbreviate "There's ";                      !   137x10, saved  1084
Abbreviate "massive ";                      !    55x 8, saved   321
Abbreviate "already ";                      !    86x 8, saved   507
Abbreviate " of the ";                      !   212x 8, saved  1263
Abbreviate " in the ";                      !   312x 8, saved  1863
Abbreviate "They're ";                      !    45x10, saved   348
Abbreviate " through";                      !   144x 8, saved   855
Abbreviate " to the ";                      !   342x 8, saved  2043
Abbreviate "It's a ";                       !   265x 9, saved  1846
Abbreviate " large ";                       !   148x 7, saved   731
Abbreviate " leads ";                       !    64x 7, saved   311
Abbreviate "This is";                       !    88x 8, saved   519
Abbreviate "You're ";                       !   140x 9, saved   971
Abbreviate "narrow ";                       !    54x 7, saved   261
Abbreviate "wooden ";                       !   120x 7, saved   591
Abbreviate "stone ";                        !   124x 6, saved   490
Abbreviate ". The ";                        !   150x 8, saved   891
Abbreviate "appear";                        !    85x 6, saved   334
Abbreviate " your ";                        !   325x 6, saved  1294
Abbreviate ", you ";                        !   117x 7, saved   576
Abbreviate "small ";                        !    95x 6, saved   374
Abbreviate "thick ";                        !    96x 6, saved   378
Abbreviate " that";                         !   218x 5, saved   648
Abbreviate " you ";                         !   553x 5, saved  1653
Abbreviate "round";                         !   260x 5, saved   774
Abbreviate " with";                         !   309x 5, saved   921
Abbreviate " from";                         !   189x 5, saved   561
Abbreviate "It's ";                         !   237x 7, saved  1176
Abbreviate " and ";                         !   891x 5, saved  2667
Abbreviate " the ";                         !  1576x 5, saved  4722
Abbreviate "north";                         !   132x 5, saved   390
Abbreviate "spell";                         !    82x 5, saved   240
Abbreviate "have ";                         !   110x 5, saved   324
Abbreviate "south";                         !   134x 5, saved   396
Abbreviate "floor";                         !    87x 5, saved   255
Abbreviate "You ";                          !   690x 5, saved  2064
Abbreviate "open";                          !   133x 4, saved   260
Abbreviate "ing ";                          !   844x 4, saved  1682
Abbreviate " of ";                          !   658x 4, saved  1310
Abbreviate " to ";                          !   571x 4, saved  1136
Abbreviate "door";                          !   161x 4, saved   316
Abbreviate "n't ";                          !   229x 5, saved   681
Abbreviate " you";                          !   315x 4, saved   624
Abbreviate "here";                          !   396x 4, saved   786
Abbreviate " is ";                          !   348x 4, saved   690
Abbreviate "The ";                          !   293x 5, saved   873
Abbreviate "able";                          !   162x 4, saved   318
Abbreviate "east";                          !   163x 4, saved   320
Abbreviate "ight";                          !   281x 4, saved   556
Abbreviate "tion";                          !   190x 4, saved   374
Abbreviate "his ";                          !   226x 4, saved   446
Abbreviate "ward";                          !   178x 4, saved   350
Abbreviate "side";                          !   169x 4, saved   332
Abbreviate "down";                          !   188x 4, saved   370
Abbreviate "the";                           !   311x 3, saved   308
Abbreviate "'s ";                           !   239x 4, saved   472
Abbreviate "re ";                           !   347x 3, saved   344
Abbreviate "en ";                           !   238x 3, saved   235
Abbreviate "est";                           !   359x 3, saved   356
Abbreviate "and";                           !   545x 3, saved   542
Abbreviate ". A";                           !   171x 5, saved   507
Abbreviate "out";                           !   305x 3, saved   302
Abbreviate " a ";                           !   840x 3, saved   837
Abbreviate "ver";                           !   441x 3, saved   438
Abbreviate "ock";                           !   269x 3, saved   266
Abbreviate " in";                           !   528x 3, saved   525
Abbreviate " it";                           !   324x 3, saved   321
Abbreviate "see";                           !   271x 3, saved   268
Abbreviate " be";                           !   342x 3, saved   339
Abbreviate "ed ";                           !   616x 3, saved   613
Abbreviate "ly ";                           !   539x 3, saved   536
Abbreviate "an ";                           !   314x 3, saved   311
Abbreviate "ing";                           !   827x 3, saved   824
Abbreviate "ain";                           !   315x 3, saved   312
Abbreviate "ear";                           !   306x 3, saved   303
Abbreviate "ent";                           !   337x 3, saved   334
Abbreviate "rea";                           !   279x 3, saved   276
Abbreviate "es ";                           !   353x 3, saved   350
Abbreviate "ter";                           !   447x 3, saved   444
Abbreviate "way";                           !   204x 3, saved   201
Abbreviate "all";                           !   475x 3, saved   472
Abbreviate ", ";                            !   806x 3, saved   803
Abbreviate ". ";                            !   913x 3, saved   910
Abbreviate "s.";                            !   279x 3, saved   276
Abbreviate ".~";                            !   137x 4, saved   268

This one feels more game-specific, but there’s some good generic stuff in here as well. Combining these should give me a good list.

The third batch would have been Wise-Woman’s Dog alone, but it’s too big for ZAbbrev to handle (…oops), so I’ll be starting with this.

1 Like

This automatically converts into the following C code.

// Generated automatically by maker.py
// Intended to be included directly into backend_z.c

#define N_ABBREVS 96
const char* abbreviations[N_ABBREVS] = {
	// ' ' (32)
	" from the ",
	" into the ",
	" doesn't ",
	" through",
	" to the ",
	" in the ",
	" little ",
	" of the ",
	" at the ",
	" this ",
	" again",
	" your ",
	" that",
	" says",
	" you ",
	" from",
	" and ",
	" some",
	" with",
	" down",
	" the ",
	" his ",
	" back",
	" are ",
	" for",
	" it ",
	" is ",
	" the",
	" to ",
	" of ",
	" you",
	" it",
	" in",
	" an",
	" a ",
	" on",
	" be",
	" no",

	// T (84)
	"They're ",
	"This ",
	"The ",

	// a (97)
	"already ",
	"able",
	"all",
	"and",

	// t (116)
	"to the ",
	"thing",
	"tion",
	"ter",

	// e (101)
	"ed to ",
	"en ",
	"ear",
	"es ",
	"ent",
	"ed ",
	"er ",
	"e.",

	// , (44)
	", and ",
	", but",
	", the",
	", ",

	// l (108)
	"looks ",
	"ly ",

	// . (46)
	". You ",
	". The",
	". I",
	"...",
	".\"",
	". ",

	// I (73)
	"It's ",
	"It's ",

	// r (114)
	"round",
	"re ",
	"rea",

	// o (111)
	"ould ",
	"ough",
	"out",

	// Y (89)
	"Your ",
	"You'",
	"You ",

	// h (104)
	"here",
	"have",
	"he ",
	"hat",

	// i (105)
	"ight",
	"ing ",
	"ind",
	"ing",

	// c (99)
	"can ",

	// n (110)
	"n't ",

	// ' (39)
	"'s ",

	// w (119)
	"way",

	// s (115)
	"st ",
	"see",
	"s.",

	// v (118)
	"ver",
};

// Maps codepoints to starting indices in the abbreviations array
const uint8_t abbrev_index[255] = {96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 0, 96, 96, 96, 96, 96, 96, 90, 96, 96, 96, 96, 57, 96, 63, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 69, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 38, 96, 96, 96, 96, 77, 96, 96, 96, 96, 96, 96, 96, 41, 96, 88, 96, 49, 96, 96, 80, 84, 96, 96, 61, 96, 89, 74, 96, 96, 71, 92, 45, 96, 95, 91, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96};

So there’s an array of abbreviations, which are sorted from longest to shortest within each starting character, and then an array of starting points indexed by the character’s value. (Most of these are 96, which is the end of the array, so the search loop will immediately exit—there are no abbreviations starting with \t or q, for example.)

I think this indexing will make lookup efficient enough; it’s the same thing Inform 6 currently does, and that runs at a decent speed. These abbreviations won’t be the most appropriate for every game, but I think they’ll be appropriate enough for some net savings. Choosing game-specific abbreviations can come later.

2 Likes

Success!

1 Like

A 1d array is better for Unicode anyway. My 2d approach is fine for English but explodes quickly if we start including characters above 128 or 256.

Congratulations!

1 Like