Suggestion: treat ^L as whitespace

I had the idea to use ^L, the Form Feed character, commonly used as a page separator, to add some structure to my source files. Dialog gives the “Ignoring control character” message, but it would be nice to treat this as whitespace instead. Could be as simple as an extra else if in lexer_getc to return 0x20 (space) when it sees 12 (0x0C, form feed).
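
As a rough illustration of that idea, here is a minimal C sketch. The helper name `normalize_input_byte` is hypothetical; Dialog's real `lexer_getc` will look different, and this only shows the shape of the extra branch.

```c
/* Hypothetical helper, not Dialog's actual lexer_getc: maps a raw input
 * byte to the value the rest of the lexer should see. */
static int normalize_input_byte(int c)
{
	if (c == 0x0C) return 0x20;  /* form feed (^L, decimal 12) -> space */
	return c;                    /* everything else, including EOF, unchanged */
}
```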


There’s an open feature request to treat U+00A0 (non-breaking space) as whitespace as well; it shouldn’t be hard to add U+000C at the same time. Are there any other characters that people commonly use as whitespace?

I specifically don’t want to commit to handling every Unicode whitespace character, because I worry about edge cases, and Dialog currently doesn’t rely on any external libraries for Unicode support. But dealing with the more common ones seems reasonable enough.

This is what I use in various projects:

bool char_is_space (uchar_t ch)
{
	// note that `NUL` is a control char, hence it is considered
	// whitespace here.

	if (ch == ' ')  return true;
	if (ch == 0x7F) return true;  // DEL
	if (ch == 0xA0) return true;  // nbsp

	if (ch < 0x0020)                  return true;  // C0 control chars
	if (0x0080 <= ch && ch <= 0x009F) return true;  // C1 control chars
	if (0x2000 <= ch && ch <= 0x200B) return true;  // various

	if (ch == 0x2028) return true;  // line separator
	if (ch == 0x2029) return true;  // paragraph separator
	if (ch == 0x205F) return true;  // mathematical space
	if (ch == 0x3000) return true;  // ideographic space

	return false;
}

Assuming that you are doing this because you are using Emacs, the page delimiter is configurable. You could just use the same convention that the standard library uses (a full width sequence of % characters) and then navigate/narrow based on that:

(add-hook 'dialog-mode-hook
          (lambda ()
            (setq-local page-delimiter
                        (concat "^" (make-string fill-column ?%)))))

(If you modify the value of fill-column using the same hook then you just need to make sure that setting the page-delimiter value runs after that.)


Realizing I never came back to update here!

This is a great idea that’s unfortunately very annoying to implement in Dialog, because all lexing is done in bytes—decoding the UTF-8 into Unicode characters is only done afterward. So at the time whitespace matters, each non-ASCII character is broken up into a multi-byte sequence that the lexer really isn’t prepared to handle.

Here’s how NBSPs are detected. It’s not pretty. This macro is invoked in each place where the lexer is looking for whitespace between tokens (there are three separate places), and it converts the sequence $C2 $A0 into a single space while leaving $C2 [anything else] alone.
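
For readers without the source at hand, the same idea can be sketched in C roughly as follows. This is not Dialog's actual macro: the `cursor` struct and `next_byte_fold_nbsp` are made-up names, and the sketch reads from an in-memory buffer rather than a file.

```c
#include <stddef.h>

/* Hypothetical byte cursor over an in-memory buffer. */
struct cursor {
	const unsigned char *buf;
	size_t len;
	size_t pos;
};

/* Return the next byte, folding the UTF-8 encoding of U+00A0
 * (0xC2 0xA0) into a single ASCII space. A 0xC2 followed by anything
 * else is returned as-is, and the following byte is left for the next
 * call. Returns -1 at end of input. */
static int next_byte_fold_nbsp(struct cursor *c)
{
	if (c->pos >= c->len) return -1;

	int b = c->buf[c->pos++];

	/* One byte of lookahead: only consume it if it completes an NBSP. */
	if (b == 0xC2 && c->pos < c->len && c->buf[c->pos] == 0xA0) {
		c->pos++;
		return ' ';
	}
	return b;
}
```

Every additional non-ASCII whitespace character would need its own lookahead case like this one, and three-byte sequences would need two bytes of lookahead, which is why this approach scales poorly.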

Yeah, working in UTF-8 can be a pain, but it's much worse in a one-character-at-a-time context like that.

The best solution would probably be to refactor the lexer to do UTF-8 decoding as it reads the file, and operate on characters instead of bytes, so maybe that will be a future improvement. For now, though, I’m a bit afraid of breaking what works…
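
A minimal sketch of what that refactoring might look like, in C: decode each UTF-8 sequence into a code point first, then test the code point for whitespace. Both function names are hypothetical, the decoder skips overlong-encoding and surrogate checks for brevity, and the whitespace set (ASCII space, tab, LF, CR, plus the U+000C and U+00A0 discussed here) is an assumption rather than Dialog's actual rule.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf[0..len). On success, store the
 * code point in *out and return the number of bytes consumed (1-4).
 * Return 0 on empty, truncated, or malformed input.
 * Simplification: overlong encodings and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *out)
{
	if (len == 0) return 0;

	unsigned char b0 = buf[0];
	size_t n;
	uint32_t cp;

	if (b0 < 0x80)                { n = 1; cp = b0; }
	else if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; }
	else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; }
	else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; }
	else return 0;  /* stray continuation byte or invalid lead byte */

	if (n > len) return 0;  /* sequence cut off by end of buffer */

	for (size_t i = 1; i < n; i++) {
		if ((buf[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
		cp = (cp << 6) | (buf[i] & 0x3F);
	}

	*out = cp;
	return n;
}

/* Assumed whitespace set; extend with further code points as needed. */
static int is_lexer_space(uint32_t cp)
{
	return cp == 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0D
	    || cp == 0x0C   /* form feed */
	    || cp == 0xA0;  /* no-break space */
}
```

With this shape, accepting another whitespace character is one more comparison in `is_lexer_space`, with no byte-level lookahead in the lexer itself.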
