Dialog feature request: input preprocessing

Making an “official” feature request, since the original is likely to get buried:

It’s been explained before that a “preprocess” entry point (altering the parse buffer, as a list of numbers, before the tokenizer gets its hands on it) would break the language’s charset independence, which would be a problem.

However, Dialog does now have an equivalent to C’s chars: single-character dictionary words, which are already used for character-level manipulations, without imposing a specific character set on the language as a whole.

Would it be possible to add an entry point (preprocess $CharsIn into $CharsOut), allowing authors to manipulate the player’s raw input before it gets tokenized? This would be useful for e.g. recognizing numbers with decimal points, which is easy in the “preprocessing” stage but extremely annoying after tokenization has already happened.
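To illustrate the kind of transformation such an entry point would enable, here is a rough C sketch (the function name and the “point” rewrite are my own invention; the real entry point would operate on a Dialog list of single-character words, not a C string). It rewrites a decimal number like 123.4 as “123 point 4”, so an ordinary whitespace tokenizer afterwards sees three simple tokens:

```c
#include <stdio.h>
#include <ctype.h>

/* Illustration only: rewrite "123.4" as "123 point 4" before
   tokenization, so the tokenizer never has to deal with a '.'
   inside a word. Dialog's version would manipulate a list of
   single-character words instead of a C string. */
static void preprocess(const char *in, char *out, size_t outsize) {
    size_t j = 0;
    for (size_t i = 0; in[i] != '\0' && j + 8 < outsize; i++) {
        if (in[i] == '.' && i > 0
            && isdigit((unsigned char)in[i - 1])
            && isdigit((unsigned char)in[i + 1])) {
            j += (size_t)snprintf(out + j, outsize - j, " point ");
        } else {
            out[j++] = in[i];
        }
    }
    out[j] = '\0';
}
```

The point is that this is a few lines at the character level, but awkward to express once the input has already been split into dictionary words.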

Mechanically, this seems relatively straightforward (though somewhat beyond my abilities, since I don’t have much experience with Dialog’s internals): for the Z-machine, change the Z_AREAD instruction on line 2312 of runtime_z.c to take a null pointer as its second argument, call out to a special routine that converts the array of values into a list of characters (like a simplified version of R_PARSE_INPUT), call the entry point, turn the result back into an array of values, then call Z_TOKENIZE and return to the rest of R_GET_INPUT. It would be somewhat more difficult on the Å-machine, since it would involve a change to the fundamental input mechanism, but hopefully not too large a one.
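The array-to-list round trip in the middle of that pipeline could look something like the following C sketch (all names here are invented for illustration; it only shows the shape of the conversion, not the actual runtime code):

```c
#include <stdlib.h>

/* Hypothetical round trip: the raw input buffer (an array of
   character codes, roughly what Z_AREAD leaves behind) becomes a
   linked list of single-character cells, is handed to the entry
   point, and is flattened back before tokenization. */
struct cell { int ch; struct cell *next; };

static struct cell *buffer_to_list(const unsigned char *buf, int len) {
    struct cell *head = NULL, **tail = &head;
    for (int i = 0; i < len; i++) {
        struct cell *c = malloc(sizeof *c);
        c->ch = buf[i];
        c->next = NULL;
        *tail = c;
        tail = &c->next;
    }
    return head;
}

/* Flatten the (possibly modified) list back into the input buffer;
   returns the number of characters written. */
static int list_to_buffer(struct cell *list, unsigned char *buf, int max) {
    int n = 0;
    for (; list != NULL && n < max; list = list->next)
        buf[n++] = (unsigned char)list->ch;
    return n;
}
```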

The question is—would this be a good direction for Dialog? I’m not sure if it would have unforeseen consequences down the line. And, if it’s not the direction the language wants to go in, what’s the recommended way of handling things like decimal numbers, input transformations (like stripping diacritics or altering capitalization in ways beyond what the Z-machine does automatically), and so on?

(As a side note, is Dialog on Github/Gitlab/Bitbucket/etc? I know there was talk of that a while ago.)

The request is noted. Thanks!

There are a couple of issues to sort through, mostly related to performance on low-end systems. See, while it’s true that Dialog can work with characters (i.e. single-character words) and words created at runtime from such characters, those words are slower to work with than the ones that are known at compile-time and end up in the game dictionary.

The Dialog runtime environment uses tagged values—16-bit words where a few of the upper bits determine the type, and the remaining bits are data. A word in the game dictionary is one kind of tagged value, where the data is an index into the dictionary; a single-character word is another, where the data is a character code (the two categories have different tag bits, to keep them apart). But a word that is constructed at runtime is currently represented by a linked list of single-character words on the heap (the data bits of the value are a pointer into the heap area). Working with such lists incurs some overhead, which is negligible every now and then, but might have a noticeable impact if it happens on every single word of input.
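A toy model of that tagged-value scheme, in C (the actual tag layout in the Dialog runtime may well differ; here the top three bits select the type and the low thirteen carry the data):

```c
#include <stdint.h>

/* Toy model of 16-bit tagged values. Tag constants are invented;
   only the general idea matches the description above. */
typedef uint16_t value_t;

#define TAG_MASK     0xe000u
#define DATA_MASK    0x1fffu
#define TAG_DICTWORD 0x2000u  /* data = index into game dictionary */
#define TAG_CHAR     0x4000u  /* data = character code */
#define TAG_HEAPREF  0x6000u  /* data = pointer into the heap area */

static value_t make_dictword(uint16_t index) {
    return (value_t)(TAG_DICTWORD | (index & DATA_MASK));
}
static value_t make_char(uint16_t code) {
    return (value_t)(TAG_CHAR | (code & DATA_MASK));
}
static uint16_t data_of(value_t v) { return v & DATA_MASK; }
```

Because the tag bits differ, a dictionary word and a single-character word with the same data bits are still distinct values.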

But while performance on 8-bit systems is a stated design goal, so is elegance. The language contains an integrated feature called “removable word endings”, which was added to support a convention of German IF: words typed by the player are stripped of adjective endings until they match a word in the game dictionary. Now, if the tokenization process were opened up to the Dialog programmer, then removable word endings could be handled in the standard library, and the language definition could potentially be simplified.
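In C, the removable-endings idea could be sketched like this (the dictionary, the ending list, and all function names are stand-ins for illustration; this strips a single ending, not chains of them):

```c
#include <string.h>

/* Stand-in dictionary and removable endings (German adjective
   endings, as in the convention described above). */
static const char *dict[]    = { "klein", "rot", "haus", NULL };
static const char *endings[] = { "es", "en", "e", NULL };

static int in_dict(const char *w) {
    for (int i = 0; dict[i]; i++)
        if (strcmp(dict[i], w) == 0) return 1;
    return 0;
}

/* Returns 1 and writes the matched stem into out if the word is in
   the dictionary directly or after removing one listed ending. */
static int match_word(const char *w, char *out, size_t outsize) {
    size_t len = strlen(w);
    if (len >= outsize) return 0;
    if (in_dict(w)) { strcpy(out, w); return 1; }
    for (int i = 0; endings[i]; i++) {
        size_t elen = strlen(endings[i]);
        if (len > elen && strcmp(w + len - elen, endings[i]) == 0) {
            memcpy(out, w, len - elen);
            out[len - elen] = '\0';
            if (in_dict(out)) return 1;
        }
    }
    return 0;
}
```

With a preprocessing entry point, this loop could live in library code rather than in the language definition.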

One of the benefits of designing a high-level language is that the internal data representation isn’t set in stone. I’ve been toying with the idea of switching to a more efficient storage model for arbitrary words/strings. But I haven’t worked out all the details yet, and right now my design strategy is CDD (comp-driven development).

But this is the kind of idea that’ll sit at the back of my head, silently plotting to take over at the slightest sign of weakness.


Ahh, makes sense. The question of “how to parse 123.4kHz in Dialog?” has been gnawing at me for the past few days, so I’ve been working on different implementations for it; the current one involves breaking every word into characters, preprocessing, then joining them all back up again. But it sounds like that’s going to cause serious performance issues.

Would it be worth checking the dictionary when using (join $ into $), and representing the result as a standard dictionary word if possible? I’m not really sure what the expected use case is for that predicate; I’m assuming what I’m doing is not what you had in mind.

Yeah, this is supposed to happen at some point. All of the code is open source, as you know, but the source markup for the manual is still unreleased, and I have some local test cases and scripts for putting the tarball together, etc. It takes a bit of work to sort that out and prepare a nice repository structure. The loose plan has been to reach a point where more people than myself are publishing works in Dialog, to let the language gel and eventually leave beta, and somewhere around that time create a public repository and issue tracker.

I’d rather not commit to a deadline. Even long-term assurances like “before the end of the year” or “before next NarraScope” have a tendency to return and bite one in the back. =)


Actually, this is already done. Thus, if the dictionary contains ‘foobar’ and you attempt to join [f o o b a r] then you’ll get a word that’s represented by a simple 16-bit value. If baz is a removable word ending, and you attempt to join [f o o b a r b a z], then you’ll get a word that prints as foobarbaz, matches @foobar, and is internally represented by a pair of integers (indicating foobar and baz in the dictionary).

In other words, join operations are slow because they involve tokenizing the word. But once you have the word, comparisons are very fast, because a word that is represented by a linked list can only ever match another linked-list word. When doing a lookup (e.g. dispatching on a verb or action name), there’s no need for a special case to deal with linked-list words. Since they have different tag bits, they simply won’t match any of the cases, and that’s the desired behaviour.
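A minimal C model of why no special case is needed (the tag layout is invented; only the equality argument matters): dispatch is plain 16-bit comparison, and a heap-backed word, whose tag bits differ, can never equal a dictionary word, so it falls through every case for free.

```c
#include <stdint.h>

/* Invented tag layout; the point is that dispatch is plain 16-bit
   equality, and differing tag bits make a heap word fail every
   dictionary-word case automatically. */
typedef uint16_t value_t;
#define TAG_DICT 0x2000u
#define TAG_HEAP 0x6000u

static value_t dictword(uint16_t i) { return (value_t)(TAG_DICT | i); }
static value_t heapword(uint16_t p) { return (value_t)(TAG_HEAP | p); }

static int dispatch(value_t w) {
    if (w == dictword(1)) return 1;  /* e.g. the verb 'take' */
    if (w == dictword(2)) return 2;  /* e.g. the verb 'drop' */
    return 0;                        /* no match, incl. heap words */
}
```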
