New built-in for wordsplitting

I’ve started messing with the rudiments of a Glulx assembler, but before I get too deep into the weeds on that, I want to test my ability to add builtins and new syntax. So I’m currently working on the “split recognized prefixes from unrecognized suffixes” part.

Turns out, the Z-machine code isn’t too bad, but adding a builtin that can be called as a multiquery is horrifyingly complex in the IR. So for now, I think I’m going to have this return a list instead. Less memory-efficient, but much easier. My current thought for a signature is (split $Word into recognized prefixes $Prefixes and suffixes $Suffixes). Thoughts on that?

My additions will also be Z-machine-specific at first, since I’m a bit leery of altering the Å-machine’s definitions and interpreters, and it would need new opcodes for basically any new builtins. On Å-machine, these predicates will simply fail, so (or) can be used to fall back on a software implementation for now.

Also, since this thread has somewhat become my place to document experiments and discoveries, here’s annotated pseudocode for how splitting and joining dictionary words works.

Pseudocode
def split_word(word): # Returns list of single-character words
	var tmp, accumulator, buffer
	
	word = deref(word)
	
	if word & $E000 == $E000:
		# Extended dictionary word, stored on the heap (label 1)
		word = word & $1FFF
		tmp = word[0] # 16 bits
		if tmp > $8000:
			return tmp # Unknown word, stored as list of characters (label 5)
		# Regular word + ending
		accumulator = word[2] # 16 bits
		word = tmp
	
	elif word < $2000:
		# (label 2)
		fail()
	
	elif word < $3E00:
		# Regular dictionary word (label 3)
		accumulator = []
	
	elif word >= $4000:
		# Integer (label 4)
		tmp = word & $3FFF
		if tmp == 0:
			return [word] # PUSH_LIST_V
		accumulator = []
		do: # (label 7)
			word = tmp % 10
			word = word | $4000
			accumulator = [word | accumulator] # PUSH_PAIR_VV
			tmp = tmp / 10
		while tmp != 0
		return accumulator
	
	else:
		# Single-character dictionary word (label 6)
		return [word] # PUSH_LIST_V
	
	# Prepend characters to list (label 9)
	buffer = scratch_space_addr
	# Convert `word` to a pointer into the dictionary table
	word = word & $1FFF
	word = word * 6
	word = word + dict_table_addr
	print_to_buffer(word, buffer) # Uses output stream 3
	tmp = buffer[0] # Length of what was printed
	buffer ++ # Pointer to the actual text
	
	do: # (label 10)
		word = buffer[tmp] # 8 bits
		if $30 <= word <= $39:
			# Convert digit character to int
			word = word + ($200 - $30)
		word = word + $3E00 # Convert to dictionary word (label 12)
		accumulator = [word | accumulator] # PUSH_PAIR_VV
	while --tmp > 0
	
	return accumulator

def join_word(chars): # Returns a single word, in whatever format
	var tmp, buffer, tmp2
	
	chars = deref(chars)
	tmp = chars & $E000
	if tmp != $C000: # Not a list
		fail() # (label 1)
	
	tmp = chars & $1FFF
	tmp2 = tmp[2] # 16 bits
	tmp2 = deref(tmp2)
	if tmp2 == []: # We were given a singleton list
		tmp2 = tmp[0]
		if $3E00 <= tmp2 < $4000: # It's a single character
			return tmp2
	
	# (label 3)
	tmp2 = heap_top # Save this for later
	buffer = malloc(1+134+1+2*1) # Why do we need 134 words of memory specifically?
	buffer[0] = 2*134 # Length byte
	tmp = buffer + 2
	tmp = join_word_sub(chars, 2*134, tmp)
	if tmp == 0: # join_word_sub returns 0 to indicate error
		heap_top = tmp2 # Deallocate the memory we used
		fail() # (label 1)
	
	# (label 2)
	buffer[1] = tmp
	tmp = buffer + 2+2*134
	tmp[0] = 1
	tokenize(buffer, tmp)
	tmp = parse_input(tmp, buffer)
	heap_top = tmp2
	
	tmp = deref(tmp)
	if tmp == []:
		fail() # (label 1)
	tmp = tmp & $1FFF
	return tmp[0]

def join_word_sub(chars, bufsize, buffer): # Prints each element of `chars`, in sequence, into `buffer`; returns total number of characters written, or 0 on error
	# Note that despite the name, `chars` doesn't need to consist of characters! It can also contain dictionary words (of all sorts) and integers
	var element, original, tmp
	original = buffer # Save this for later
	
	do:
		chars = chars & $1FFF
		element = chars[0]
		element = deref(element)
		
		if element >= $4000:
			# It's a number (label 3)
			if bufsize < 8: return 0
			element = element & $3FFF
			print_to_buffer(element, buffer) # Uses output stream 3
			finalize_after_printing()
		elif element >= $3E00:
			# It's a single-character word
			if bufsize == 0: return 0
			element = element & $00FF
			if element <= $0020: return 0
			if element in '.,";*()': return 0
			buffer[0] = element
			bufsize --
			buffer ++
		elif element & $E000 == $E000: # (label 4)
			# It's an extended word
			element = element & $1FFF
			tmp = element[0]
			if tmp < $8000:
				tmp = element
			# (label 8)
			tmp = join_word_sub(tmp, bufsize, buffer)
			if tmp == 0: return 0
			buffer = buffer + tmp
			bufsize = bufsize - tmp
		elif element < $2000: # (label 5)
			# It's an object, that's not supposed to be there!
			return 0
		else:
			# It's a regular word
			if bufsize < 12: return 0
			# Convert element to a pointer into the dict table
			element = element & $1FFF
			element = element * 6
			element = element + dict_table_addr
			print_to_buffer(element, buffer) # Uses output stream 3
			finalize_after_printing()
		
		# (label 6)
		chars = chars[2]
		chars = deref(chars)
		element = chars & $E000
	while element == $C000 # As long as the type bits keep indicating a pair
	
	if chars != []: return 0
	
	return buffer - original # Number of chars written
	
	def finalize_after_printing(): # (label 7)
		# turn off output stream 3
		element = buffer[0] # Now being used as a temporary to hold the number of bytes written
		tmp = buffer+2
		buffer[0:element] = tmp[0:element] # Move everything two bytes backward in memory, to overwrite the size word
		buffer = buffer + element
		bufsize = bufsize - element
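
The tag ranges that the pseudocode above tests against can be collected into one small classifier. This is a rough Python sketch, not anything taken from the Dialog runtime: the name `classify` and the label strings are made up for illustration, the ranges are only what the comparisons above imply, and treating $8000 to $BFFF as variable references is an assumption, since the pseudocode never names that range.

```python
def classify(value: int) -> str:
    """Name the kind of 16-bit tagged value, using the ranges the pseudocode tests."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if (value & 0xE000) == 0xE000:
        return "extended dictionary word"   # heap pointer in the low 13 bits
    if (value & 0xE000) == 0xC000:
        return "pair"                       # list cell; heap pointer in the low 13 bits
    if value >= 0x8000:
        return "reference (assumed)"        # assumption: this range is not named above
    if value >= 0x4000:
        return "integer"                    # payload is value & 0x3FFF
    if value >= 0x3E00:
        return "single-character word"      # character code is value & 0xFF
    if value >= 0x2000:
        return "regular dictionary word"    # dictionary index is value & 0x1FFF
    return "object"                         # split_word fails on these


print(classify(0x3E41))  # single-character word
print(classify(0x4005))  # integer
```

Checking the most specific bit patterns first mirrors the order of the tests in split_word, where the extended-word check has to come before the plain range comparisons.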

Actually, that’s a better question to poll the crowd on.

If I add the builtin to split words into recognized prefixes and whatever comes after, would you want the output for @architecture to be:

  • [a arch architect] [rchitecture itecture ure]
  • [[a rchitecture] [arch itecture] [architect ure]]

?
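
To make the two shapes concrete, here is a rough Python sketch (not Dialog code). The helper `prefix_splits` and the three-word dictionary are hypothetical, and it assumes the suffix is never empty, so a word that is fully recognized is not reported as a split of itself.

```python
def prefix_splits(word, dictionary):
    """Return (prefix, suffix) pairs where the prefix is a known word and the suffix is non-empty."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary
    ]


words = {"a", "arch", "architect"}  # hypothetical game dictionary
pairs = prefix_splits("architecture", words)

# First shape: two parallel lists.
prefixes = [p for p, _ in pairs]
suffixes = [s for _, s in pairs]

# Second shape: one list of [prefix, suffix] pairs.
paired = [[p, s] for p, s in pairs]

print(prefixes, suffixes)
print(paired)
```

The second shape keeps each prefix next to its own suffix, so nothing has to be matched up by position afterward.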

Second one, but without the outer list, would be my preference. This would be consistent with how the split anywhere predicate works, getting all possible splits by iterating over a multiquery. The author can then collect them in a list if they want to.

Edit: Heh, apparently I can’t read.

Yeah, my preferred solution would be to iterate over the options with a multiquery, but every builtin that can be multiqueried needs a great deal of special support in the compiler. I was hoping I could write a Z-machine routine that worked like a Python generator (iterate over the solutions one by one), but it seems like the closest I can come is building and returning a list.

Wait a sec! I can have the Z-machine runtime routine generate a list on the heap, but have the compiler then generate an IS_ONE_OF instruction afterward to iterate over that list. Which means someone with a better understanding of the compiler could replace it with a better implementation later without altering the interface.
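
The idea of keeping the interface while swapping the implementation can be sketched in Python. This is only an analogy under stated assumptions: `build_split_list`, `split_eager`, and `split_lazy` are hypothetical names, Python strings stand in for dictionary words, and a Python generator stands in for a multiquery.

```python
from typing import Iterator, List, Set, Tuple

Split = Tuple[str, str]


def build_split_list(word: str, dictionary: Set[str]) -> List[Split]:
    """Stand-in for the runtime routine: build every split up front."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary
    ]


def split_eager(word: str, dictionary: Set[str]) -> Iterator[Split]:
    """Build the whole list first, then hand out its elements one at a time."""
    yield from build_split_list(word, dictionary)


def split_lazy(word: str, dictionary: Set[str]) -> Iterator[Split]:
    """Find each split only when the caller asks for the next one."""
    for i in range(1, len(word)):
        if word[:i] in dictionary:
            yield word[:i], word[i:]


words = {"a", "arch", "architect"}  # hypothetical game dictionary
for prefix, suffix in split_eager("architecture", words):
    print(prefix, suffix)
```

A caller that loops over the results cannot tell the two versions apart, which is what would let the list-based version be replaced later without touching any code that uses it.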

In that case, I’m going to run with *(split $Word into recognized $Prefix and $Suffix) for now. I think that should work well.


Nice. It won’t be as efficient as a true generator of course, but

  1. the original plan returned the list anyway, so it is not worse than that, and
  2. we are running text adventures, not crunching numbers for a supercomputer.

Indeed! The results won’t be as efficient as the current stemmer, which is implemented with low-level Z-machine code, but they’ll be a lot more customizable and will let this be handled in the library instead of in a black box.

I have no idea how to adapt this to the Å-machine, either, but it’s possible to write an equivalent in Dialog code, which will be even less efficient but even more portable.