I have no idea if anyone will find this interesting, but my Christmas project this year is Mangle: a text manipulation virtual machine. I probably need to explain that.
Imagine a CPU designed to work with words (and characters), with an instruction set tuned for manipulating them.
Let’s start with the “hello world” of Mangle:
start:
ins_str "Hello"
halt
If you run it you get the following:
➜ mangle git:(main) ✗ ./mangle examples/ex_01.mangle
hello
Okay, not very inspiring. How about this:
# Pig Latin: "string" -> "ingstray"
# If starts with vowel: "apple" -> "appleway"
start:
load tv, "aeiou"
load tc, "bcdfghjklmnpqrstvwxyz"
# Check if starts with vowel
load rm, 0
match tv
jmp_ok vowel_start
consonant_start:
load rm, 0
match tc
jmp_fail move_done
# Move first char to end
move_char 0, @last_index
jmp consonant_start
move_done:
# Append "ay"
load rc, @len
ins_str "ay"
halt
vowel_start:
# Just add "way"
load rc, @len
ins_str "way"
halt
If you run this program you get the following output:
➜ mangle git:(main) ✗ ./mangle examples/pig_latin2.mangle <rain_in_spain
ethay
ainray
inway
ainspay
allsfay
ainlymay
onway
ethay
ainplay
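For readers who don't read assembly, here is a rough Python sketch of the same algorithm: rotate leading consonants to the end, then append "ay", or append "way" if the word starts with a vowel. The function name pig_latin is made up for this illustration and is not part of Mangle. The sketch also caps the rotation at the word length so that a word with no vowels still terminates; that guard is an assumption of the sketch, not something the Mangle listing does.

```python
VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def pig_latin(word: str) -> str:
    """Hypothetical Python equivalent of the Pig Latin listing above."""
    chars = list(word)
    # Vowel start: just add "way".
    if chars and chars[0] in VOWELS:
        return "".join(chars) + "way"
    # Consonant start: move the first character to the end while it is a
    # consonant. The loop is capped at len(chars) rotations (sketch-only guard).
    for _ in range(len(chars)):
        if chars[0] not in CONSONANTS:
            break
        chars.append(chars.pop(0))
    return "".join(chars) + "ay"


if __name__ == "__main__":
    for w in "the rain in spain falls mainly on the plain".split():
        print(pig_latin(w))
```

Run against the same sentence, this should print the same nine words as the Mangle run above.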
At a guess, some of you are also interested in conlangs, and that’s really the reason I built Mangle. Here is a program for generating words in the Shivan language (from my friend Chris Bateman’s 1995 sci-fi RPG Outlands).
start:
# load the shivan syllables. We map
# A=C1 B=V1 C=C2 D=V2 E=C3 F=V3
load ta, {y:2, p:1, sh:1, k:1, h:1, s:1, b:1, m:1, n:1, r:1, v:1, t:1, d:1, g:1, th:1, l:1, kh:2}
load tb, {aa:2, o:1, i:3, a:6, e:3, u:1, ia:2}
load tc, {n:2, k:2, sh:1, h:1, t:1, s:2, r:2, l:1, th:1, d:1, m:2, p:1, z:2}
load td, {o:3, i:4, a:6, e:3, u:3}
load te, {p:3, d:2, sh:1, s:2, m:1, n:2, r:2, t:2, th:1, l:3}
load tf, {o:5, a:6, i:3, u:5}
clear
load r0, rr
mod r0, 100
lte r0, 20
jmp_ok one_syllable
lte r0, 80
jmp_ok two_syllables
jmp three_syllables
one_syllable:
comment "One Syllable"
load r1, 1
jmp gen_syllables
two_syllables:
comment "Two Syllables"
load r1, 2
jmp gen_syllables
three_syllables:
comment "Three Syllables"
load r1, 3
gen_syllables:
eq r1, 0
jmp_ok finished
gen_syllable:
call roll_2d10
lte r4, 10
jmp_ok gen_cvc
lte r4, 14
jmp_ok gen_cv
lte r4, 18
jmp_ok gen_vc
jmp gen_vcv
gen_cvc:
comment "CVC"
ins_char ta
ins_char tb
ins_char tc
dec r1
jmp gen_syllables
gen_cv:
comment "CV"
ins_char ta
ins_char tb
dec r1
jmp gen_syllables
gen_vc:
comment "VC"
ins_char td
ins_char te
dec r1
jmp gen_syllables
gen_vcv:
comment "VCV"
ins_char td
ins_char te
ins_char tf
dec r1
jmp gen_syllables
finished:
halt
# Generate a value from 2-20 and put it in R4
roll_2d10:
load r4, 0
load r5, rr
mod r5, 10
inc r5
add r4, r5
load r5, rr
mod r5, 10
inc r5
add r4, r5
ret
Here are some Shivan words:
➜ mangle git:(main) ✗ ./mangle --n 5 examples/shivan.mangle
itanlo
ta
zel
aariyma
sukayziash
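The thresholds in the listing imply specific odds for each syllable shape and word length. The Python sketch below just enumerates them. It assumes rr yields uniformly distributed non-negative integers, so that mod 100 is uniform over 0..99 and each die is uniform over 1..10; the post does not state this. The helper names syllable_shape and syllable_count are made up for the illustration.

```python
from collections import Counter
from itertools import product


def syllable_shape(total: int) -> str:
    """Mirror the lte chain after `call roll_2d10` (total is 2..20)."""
    if total <= 10:
        return "CVC"
    if total <= 14:
        return "CV"
    if total <= 18:
        return "VC"
    return "VCV"


def syllable_count(r0: int) -> int:
    """Mirror the lte chain after `mod r0, 100` (r0 is 0..99)."""
    if r0 <= 20:
        return 1
    if r0 <= 80:
        return 2
    return 3


# All 100 equally likely outcomes of two ten-sided dice.
shape_odds = Counter(syllable_shape(a + b) for a, b in product(range(1, 11), repeat=2))
# All 100 equally likely values of r0.
count_odds = Counter(syllable_count(r0) for r0 in range(100))

if __name__ == "__main__":
    print(dict(shape_odds))  # counts out of 100
    print(dict(count_odds))  # counts out of 100
```

Under those assumptions the shapes come out at 45% CVC, 34% CV, 18% VC and 3% VCV, and word lengths at 21% one syllable, 60% two and 19% three.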
You might ask why I built this when I could have just written an Elixir, Ruby or Clojure program to mess about with strings. Partly it’s curiosity, since I’d had the idea of a “string vm” in my head, and partly it’s because I have some crazy ideas about “evolving” programs to generate languages, and I figured that would be a lot easier to do in a language closer to an assembly language instruction set.
If you’ve never written assembly language (and I guess you maybe have to be of a certain age…) it might look a little intimidating, but since there are a limited number of concepts and instructions, it’s actually relatively simple to grapple with. I think.
Some things of note. Mangle has a set of general purpose registers r0..r7 which do the usual things registers do. It also has table vector registers ta..tz, which can hold digraphs (multi-character entries such as sh or th) as well as single characters, and which can be used both to generate:
load r0, ta # generate a character from the table register a
and match:
match ta # match a character from the table register a
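To make the two roles of a table register concrete, here is a small Python sketch of one way a weighted table with digraphs could work. It is not Mangle's implementation. The function names generate and match_at are hypothetical, and two behaviours are assumptions: that the numbers in a table are relative weights, and that matching prefers the longest entry (so "sh" wins over "s").

```python
import random

# The ta table from the Shivan program, read as entry -> relative weight.
TA = {"y": 2, "p": 1, "sh": 1, "k": 1, "h": 1, "s": 1, "b": 1, "m": 1, "n": 1,
      "r": 1, "v": 1, "t": 1, "d": 1, "g": 1, "th": 1, "l": 1, "kh": 2}


def generate(table: dict, rng: random.Random) -> str:
    """Pick one entry, with probability proportional to its weight (assumed)."""
    entries = list(table)
    weights = list(table.values())
    return rng.choices(entries, weights=weights, k=1)[0]


def match_at(table: dict, word: str, start: int) -> int:
    """Return the length of the longest table entry found at word[start:],
    or 0 if nothing matches. Longest-first is an assumption of this sketch."""
    best = 0
    for entry in table:
        if word.startswith(entry, start) and len(entry) > best:
            best = len(entry)
    return best


if __name__ == "__main__":
    rng = random.Random(1)
    print(generate(TA, rng))
    print(match_at(TA, "shiva", 0))  # digraph "sh" beats single "s"
```

The returned length plays roughly the role that @match_len appears to play in Mangle, with start standing in for the rm register.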
There are also string registers s0..s3 for playing with bits of strings as you go.
You can write subroutines using call/ret, which uses a PC stack, although so far I haven’t needed a general purpose stack. I “reserve” r4..r7 for subroutines and r0..r3 for the main program. My roll_2d10 subroutine returns its value in r4.
rc is the “cursor” register, and many operations, such as ins_char or shuffle, are anchored around the cursor. inc rc or dec rc are easy ways to move around the word. rm is the match start register, and @match_len is a special value containing the length of the match.
The rr register is special: loading from it (load r0, rr) returns a random value. But if you load a value into rr:
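If the idea of a PC stack is new, the Python toy below shows the mechanism: call pushes the address of the next instruction and jumps, and ret pops it. This is only a sketch and not how Mangle is written. The tuple encoding, the run function and the rr_values parameter (a fixed list standing in for the random register so the run is repeatable) are all invented for the illustration.

```python
def run(program, rr_values):
    """Run a list of (op, *args) tuples and return the registers at halt."""
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "label"}
    regs = {f"r{i}": 0 for i in range(8)}
    rr = iter(rr_values)   # stand-in for the random register
    return_stack = []      # the "PC stack": holds return addresses only
    pc = labels["start"]
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "label":
            continue
        elif op == "load":  # source is either an integer literal or rr
            regs[args[0]] = next(rr) if args[1] == "rr" else args[1]
        elif op == "mod":
            regs[args[0]] %= args[1]
        elif op == "inc":
            regs[args[0]] += 1
        elif op == "add":
            regs[args[0]] += regs[args[1]]
        elif op == "call":
            return_stack.append(pc)  # pc already points at the next instruction
            pc = labels[args[0]]
        elif op == "ret":
            pc = return_stack.pop()
        elif op == "halt":
            return regs
        else:
            raise ValueError(f"unknown op: {op}")


# roll_2d10 from the Shivan program, transcribed into the toy encoding.
PROGRAM = [
    ("label", "start"), ("call", "roll_2d10"), ("halt",),
    ("label", "roll_2d10"),
    ("load", "r4", 0),
    ("load", "r5", "rr"), ("mod", "r5", 10), ("inc", "r5"), ("add", "r4", "r5"),
    ("load", "r5", "rr"), ("mod", "r5", 10), ("inc", "r5"), ("add", "r4", "r5"),
    ("ret",),
]

if __name__ == "__main__":
    print(run(PROGRAM, [13, 27])["r4"])  # (13 % 10 + 1) + (27 % 10 + 1) = 12
```

Because the stack only ever holds return addresses, values still have to be passed in registers, which is why a convention like "subroutines own r4..r7" is useful.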
load rr, "fourty two"
you seed the random number generator. For example, load rr, @word seeds it with the word you fed the VM (if any).
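A register that returns random values when read and reseeds when written can be modelled in a few lines of Python. The class name RandomRegister, its read/write methods and the 32-bit range are assumptions made for this sketch; the post does not say how Mangle implements rr or what range it returns.

```python
import random


class RandomRegister:
    """Toy model of rr: reading yields a random value, writing seeds."""

    def __init__(self):
        self._rng = random.Random()

    def read(self) -> int:
        # Like `load r0, rr`. The 32-bit range is an assumption.
        return self._rng.randrange(2**32)

    def write(self, seed) -> None:
        # Like `load rr, "fourty two"` or `load rr, @word`.
        self._rng.seed(seed)


if __name__ == "__main__":
    a, b = RandomRegister(), RandomRegister()
    a.write("fourty two")
    b.write("fourty two")
    # Same seed, same sequence: both lines print identical numbers.
    print([a.read() for _ in range(3)])
    print([b.read() for _ in range(3)])
```

Seeding from the input word has a nice property in this model: the same word always drives the same sequence of "random" choices, so any transformation of it is repeatable.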
There are many instructions for adding, moving, deleting, swapping, and shuffling characters.
If anyone is curious and would like to play with it, I will be publishing the source shortly and also putting it on the web.