Mangle VM

I have no idea if anyone will find this interesting but my Christmas project this year is Mangle: A text manipulation virtual machine. I probably need to explain that.

Imagine a CPU designed to work with words (& characters) and which has an instruction set tuned for working with them.

Let’s start with the “hello world” of Mangle:

start:
  ins_str "Hello"
  halt

If you run it you get the following:

➜  mangle git:(main) ✗ ./mangle examples/ex_01.mangle

hello

Okay not very inspiring. How about this:

# Pig Latin: "string" -> "ingstray"
# If starts with vowel: "apple" -> "appleway"

start:
  load tv, "aeiou"
  load tc, "bcdfghjklmnpqrstvwxyz"

  # Check if starts with vowel
  load rm, 0
  match tv
  jmp_ok vowel_start

consonant_start:
  load rm, 0
  match tc
  jmp_fail move_done

  # Move first char to end
  move_char 0, @last_index
  jmp consonant_start

move_done:
  # Append "ay"
  load rc, @len
  ins_str "ay"
  halt

vowel_start:
  # Just add "way"
  load rc, @len
  ins_str "way"
  halt

If you run this program you get the following output:

➜  mangle git:(main) ✗ ./mangle examples/pig_latin2.mangle <rain_in_spain
ethay
ainray
inway
ainspay
allsfay
ainlymay
onway
ethay
ainplay

At a guess some of you are also interested in conlangs and that’s really the reason I built Mangle. Here is a program for generating words in the Shivan language (from my friend Chris Bateman’s 1995 sci-fi RPG Outlands).

start:
  # load the shivan syllables. We map
  # A=C1 B=V1 C=C2 D=V2 E=C3 F=V3
  load ta, {y:2, p:1, sh:1, k:1, h:1, s:1, b:1, m:1, n:1, r:1, v:1, t:1, d:1, g:1, th:1, l:1, kh:2}
  load tb, {aa:2, o:1, i:3, a:6, e:3, u:1, ia:2}
  load tc, {n:2, k:2, sh:1, h:1, t:1, s:2, r:2, l:1, th:1, d:1, m:2, p:1, z:2}
  load td, {o:3, i:4, a:6, e:3, u:3}
  load te, {p:3, d:2, sh:1, s:2, m:1, n:2, r:2, t:2, th:1, l:3}
  load tf, {o:5, a:6, i:3, u:5}

  clear

  load r0, rr
  mod r0, 100
  lte r0, 20
  jmp_ok one_syllable
  lte r0, 80
  jmp_ok two_syllables
  jmp three_syllables

one_syllable:
  comment "One Syllable"
  load r1, 1
  jmp gen_syllables

two_syllables:
  comment "Two Syllables"
  load r1, 2
  jmp gen_syllables

three_syllables:
  comment "Three Syllables"
  load r1, 3

gen_syllables:
  eq r1, 0
  jmp_ok finished

gen_syllable:
  call roll_2d10
  lte r4, 10
  jmp_ok gen_cvc
  lte r4, 14
  jmp_ok gen_cv
  lte r4, 18
  jmp_ok gen_vc
  jmp gen_vcv

gen_cvc:
  comment "CVC"
  ins_char ta
  ins_char tb
  ins_char tc
  dec r1
  jmp gen_syllables

gen_cv:
  comment "VC"
  ins_char ta
  ins_char tb
  dec r1
  jmp gen_syllables

gen_vc:
  comment "VC"
  ins_char td
  ins_char te
  dec r1
  jmp gen_syllables

gen_vcv:
  comment "VCV"
  ins_char td
  ins_char te
  ins_char tf
  dec r1
  jmp gen_syllables

finished:
  halt

# Generate a value from 2-20 and put it in R4
roll_2d10:
  load r4, 0
  load r5, rr
  mod r5, 10
  inc r5
  add r4, r5
  load r5, rr
  mod r5, 10
  inc r5
  add r4, r5
  ret

Here are some Shivan words:

➜  mangle git:(main) ✗ ./mangle --n 5 examples/shivan.mangle
itanlo
ta
zel
aariyma
sukayziash

You might ask why I built this when I could have just written an Elixir, Ruby or Clojure program to mess about with strings. Partly it’s curiousity, I’d had the idea of a “string vm” in my head, and partly because I have some crazy ideas about “evolving” programs to generate langauges and I figured that would be a lot easier to do in a language closer to an assembly language instruction set.

If you’ve never written assembly language (and I guess you maybe have to be of a certain age…) it might look a little intimidating but since there are a limited number of concepts and instructions it’s actually relatively simple to grapple with. I think.

Some things of note. Mangle has a set of general purpose registers r0..r7 which do the usual things registers do. It also has table vector registers ta..tz which can support digraph characters and that can both generate:

load r0, ta # generate a character from the table register a 

and match

match ta # match a character from the table register a

There are also string registers s0..s3 for playing with bits of strings as you go.

You can write subroutines using call/ret which uses a PC stack although so far I haven’t needed a general purpose stack. I “reserve” r4..r7 for subroutines and r0..r3 for the main program. My roll_2d10 subroutine returns its value in r4.

rc is the “cursor” register and many operations, such as ins_char or shuffle are anchored around the cursor. inc rc or dec rc are easy ways to move around the word. rm is the match start register and @match_len a special value containing the length of the match.

The rr register is special in that it returns a random value if you load from it load r0, rr. But if you load a value into rr:

load rr, "fourty two"

you seed the random number generator. load rr, @word seeds the random number generator with the word you fed the VM (if any).

There are many instructions for adding, moving, deleting, swapping, and shuffling characters.

If anyone is curious and might like to play with it I will be publishing the source shortly and also put it on the web.

10 Likes

I can’t imagine why I might need this, and yet I would like to see the source and read the docs :smiley:

4 Likes

I’ve spent quite a bit of time revisiting core design choices and refactoring the code base. A change with a lot of ripples was making all registers explicit. I’m not quite ready to release this into the wild. But here’s an example program that rotates all the vowels of a word right.

start:
  load w, in
  load tv, "aeiou"

collect_vowels:
  gte rm, w.len
  jmp_ok rotate_vowels

  match w, rm, tv
  jmp_ok collect_vowel
  inc rm
  jmp collect_vowels

collect_vowel:
  loadchar r1, w, rm
  insert s0, s0.len, r1
  inc rm
  jmp collect_vowels

rotate_vowels:
  rotr s0

replace_vowels:
  load r0, 0
  load rm, 0

find_vowel:
  gte rm, w.len
  jmp_ok done

  match w, rm, tv
  jmp_ok replace_vowel
  inc rm
  jmp find_vowel

replace_vowel:
  loadchar r1, s0, r0
  replace w, rm, r1
  inc rm
  inc r0
  jmp find_vowel

done:
  halt 0

So:

➜  mangle git:(main) ✗ ./mangle examples/vowel_rotate.mangle <words
T odecutain
T bueatufil
T saqeuoi
T eatumoboli
T eaduince
T fucateios
T ubstameios
T qeusteinnoari
T ceoporeta
T rodaitain

How this program works is that it loads the input word into w and the English vowels into table register v. Then it scans w for vowels (using the matchinstruction) and appends them to string register s0.

So for “eduation” s0 ends up containing “euaio” (I note that education contains each English vowel once, neat!).

Then it rotates s0 one place to the right giving “oeuai”.

Then it re-traverses w and each time it finds a vowel it replaces it with the next vowel in s0. It keeps track of the “next vowel” index in r0.

This is kind of a silly example and you could do this much more elegantly in, say, ruby:

def rotate_vowels(word)
  vowels = %w[a e i o u]

  vowel_enum = word.chars
                   .select { |c| vowels.include?(c) }
                   .rotate(-1)
                   .each

  word.chars.map { |c| vowels.include?(c) ? vowel_enum.next : c }.join
end

puts rotate_vowels("education") # => "odecutain"

However, the point of Mangle is not particularly the kind of elegance nor conciseness you’d get from Ruby or Elixir. Mangle is designed to be a tool for a genetic programming system to express itself.

That said this example has motivated me to add call_ok/call_fail which should streamline this example a little by turning collect_vowel and replace_vowel into subroutines.

2 Likes

With call_ok my example is just a little neater:

start:
  load w, in
  load tv, "aeiou"

collect_vowels:
  gte rm, w.len
  jmp_ok rotate_vowels

  match w, rm, tv
  call_ok collect_vowel
  inc rm
  jmp collect_vowels

collect_vowel:
  loadchar r1, w, rm
  insert s0, s0.len, r1
  ret

rotate_vowels:
  rotr s0

replace_vowels:
  load r0, 0
  load rm, 0

find_vowel:
  gte rm, w.len
  jmp_ok done

  match w, rm, tv
  call_ok replace_vowel
  inc rm
  jmp find_vowel

replace_vowel:
  loadchar r1, s0, r0
  replace w, rm, r1
  inc r0
  ret

done:
  halt 0

My mangle code takes 33 lines to do what Ruby can do in 8. I’ll take it. Performance is a less happy picture where Ruby can handle /usr/share/dict/words in about 1.3s and Mangle takes closer to 13s. That said, a naive home-brew VM (implemented in Elixir) being only 10x as slow as one that has had a team optimising it for years, maybe isn’t so bad.

2 Likes

They say “build one to throw it away” but I’ve gone one better and thrown 2 away because I was unhappy with some of the design & implementation choices I had made. I’ve spent entirely too much time building this but had a lot of fun in the process and now have something that is working more or less how I want.

Here is another of the Mangle examples: fremen.mangle

start:
  # Generate a word using the Fremen language schema defined in the Outlands
  # RPG Fifth Edition 1995. Copyright Chris M. Bateman. Page 187

  # Syllable Structure CVC, CV
  # Syllable C
  load ta, {j:2, f:1, kr:1, n:1, sh:1, t:1, d:1, m:1, b:1, r:1, h:1, k:1, s:1, l:1, z:1, w:1, gh: 2}
  # Syllable V
  load tb, {u:4, a:5, i:4, ai: 2, e: 4}
  # Syllable C
  load tc, {m:3, q:1, th:1, jj:1, s:1, k:1, h:1, n:1, kh:1, d:1, l:1, t:1, z:2, b:3}

  # Syllable Structure VC
  # Syllable V
  load td, {u:6, a:5, i:5, ai: 2, e:2}
  # Syllable C
  load te, {m:2, n:1, h:2, k:4, d:4, b:2, kh:1, l:1, f:2}

  clear w

  call roll_d10
  eq i4, 1
  jmp_ok one_syllable
  lt i4, 8
  jmp_ok two_syllables
  jmp three_syllables

# i0 stores the syllable count
one_syllable:
  comment "One Syllable"
  load i0, 1
  jmp gen_syllables

two_syllables:
  comment "Two Syllables"
  load i0, 2
  jmp gen_syllables

three_syllables:
  comment "Three Syllables"
  load i0, 3

gen_syllables:
  eq i0, 0
  jmp_ok finished

gen_syllable:
  call roll_2d10 # roll is returned in i4
  lte i4, 12
  jmp_ok gen_cv
  lte i4, 16
  jmp_ok gen_cvc
  jmp_ok gen_vc

gen_cv:
  comment "CV"
  insert w, w.len, ta
  insert w, w.len, tb
  jmp next_syllable

gen_cvc:
  comment "CVC"
  insert w, w.len, ta
  insert w, w.len, tb
  insert w, w.len, tc
  jmp next_syllable

gen_vc:
  comment "VC"
  insert w, w.len, td
  insert w, w.len, te
  jmp next_syllable

next_syllable:
  dec i0
  jmp gen_syllables

finished:
  halt

# Generate a value from 1-10 and put it in i4
roll_d10:
  load i4, r
  mod i4, 10
  inc i4
  ret

# Generate a value from 2-20 and put it in i4
roll_2d10:
  load i4, 0
  load i5, r
  mod i5, 10
  inc i5
  add i4, i5
  load i5, r
  mod i5, 10
  inc i5
  add i4, i5
  ret

Here’s Mangle in action generating new Fremen words:

➜  mangle git:(main) ✗ ./mangle --quiet --n 10 examples/fremen.mangle
ju
fadeghi
lashinu
fighuz
tuhe
ghisait
su
jahu
sas
nudaju

The VM has 43 instructions (including comparisons like eq, gt, lte and different forms of call and jmp). The Fremen example uses just 17 instructions and 8 registers.

The VM has a set of index registers (integer based for characters and offsets), string registers, table registers, and flags.

The table registers are used both for matching characters and also for generating characters. So match w, c, tc, tv will succeed if w[c] is any of the characters in table c and w[c+1] is any of the characters in table v. While load i0, tv will load a random grapheme from table v based on the relative frequency of the table entries.

Grapheme support includes single characters, di-graphs ab and tri-graphs kth. When inserting into a string register you’ll get all the characters. When loading into an index register you will get a random character from the selected grapheme. The latter isn’t a behaviour I’ve needed yet (typically I use di & tri graphs for generating not matching) but seemed the most reasonable way to handle the case without throwing an error.

The index registers are for calculation and string offsets. The string registers are for managing substrings. For example, here is a program that uses a table register and string register to rotate all the vowels in a word one place to the right:

start:
  load w, in
  load tv, "aeiou"

collect_vowels:
  gt c, w.li
  jmp_ok rotate_vowels

  match w, c, tv
  call_ok collect_vowel
  inc c
  jmp collect_vowels

collect_vowel:
  load_char i5, w, c
  insert s0, s0.len, i5
  ret

rotate_vowels:
  rotr s0

replace_vowels:
  load c, 0
  load i0, 0 # index into vowels

find_vowel:
  gt c, w.li
  jmp_ok done

  match w, c, tv
  call_ok replace_vowel
  inc c
  jmp find_vowel

replace_vowel:
  load_char i5, s0, i0
  replace w, c, i5
  inc i0
  ret

done:
  halt 0

It loops through the characters of the input word matching vowels (match w, c, tv uses the table register v as a matcher) and collecting them into s0. It then uses rotr to rotate s0 to the right. Then it loops the word register w again and replaces each vowel it finds with the next vowel from s0.

➜  mangle git:(main) ✗ ./mangle --compare --quiet examples/vowel_rotate.mangle <words
christmas -> chrastmis
france -> frenca
mallam -> mallam
education -> odecutain
beautiful -> bueatufil
sequoia -> saqeuoi
automobile -> eatumoboli
audience -> eaduince
quebec -> qeubec
facetious -> fucateios
abstemious -> ubstameios
questionnaire -> qeusteinnoari
cooperate -> ceoporeta
radiation -> rodaitain

My main purpose for this tool is generating new words either from whole cloth (the Shivan and Fremen generator examples) or by modify existing language words (the vowel rotation example) using a program built from a limited set of easy to understand instructions running on a machine with quite a restricted set of features, i.e. the whole thing is easy to write and reason about.

I want to tidy up the documentation which has gotten seriously of out of date as I have rewritten the code. Then I will do a release in case this idea tickles anyone else. If you have any conlangs you’d like it to generate I am happy to try and write them.

For those who might be wondering, I paired with Claude Code to work on the original implementation of Mangle. At first I was impressed and working test-first (essentially doing XP with Claude as the partner with the keyboard) had something that demonstrably worked.

But I quickly decided the code quality did not meet my standards so, painful as it was, I started again and wrote the parser and instruction set implementation by hand. Claude was still useful in helping me identify the source of some of the more complex (or stupid) bugs in my code and tests.

2 Likes

Belatedly I realised that because of the words I had started playing with long ago I was being very ASCII-centric. In fact it was relative straightforward to make Mangle unicode aware (Elixir has excellent unicode support).

As a proof of concept I hacked together a poor man’s Quenya generator

➜  mangle git:(main) ✗ ./mangle --quiet --n 20 examples/quenya.mangle
cuntëmat
irusëva
meremo
vutuë
ñuiha
pifeine
ceiaimú
oren
hwilióhwix
pëtsirtër
muyaur
áí
liññó
neauriñpu
sániya
puera
sortulen
ñeñnapó
vaiapefut
feiotéhwirir
2 Likes

I think this might be the tool I was contemplating for years. At some point I would love to give this a spin!

1 Like

The instruction set has settled down, I’ve done about as much performance optimisation as I plan to. Mangle is about 6-8x slower than the equivalent Ruby code which is probably about as good as it gets without using NIFs for the hot paths or switching to another language. The main thing I will work on from here is debugging. There’s a –-trace option that dumps a lot of information about the VM state but it could be better. I’d like a gdb style, break, single-step, and inspect.

I’m not quite ready to publish yet but I have been working on the

programmers_guide_v0.2.pdf (125.9 KB)

Which covers the VM, instruction set, operations, and so on.

The main thing I have added recently are a loop instruction copied from the 8086 instruction set. It’s purely a syntax sugar since:

load i0, 5
something:
  someop
  dec i0
  eq i0, 0
  jmp_fail something

and

load i0, 5
something:
  someop
  loop i0, something

Are equivalent and you’re not saving many instructions. But I think it makes the intent much clearer.

I have also added FSM based matchers. You build an FSM by loading into one of the new m registers. Example:

load m0, "bn"
load m0, "bt"
load m0, "bp"
match w, 0, m0, i0

Will match any of those 3 strings. You can add strings of varying lengths and i0 will be populated with the length of the match. This makes, for example, finding potentially variable length prefixes or suffixes much more convenient. The matchers are not presently greedy, I’m not sure if in practice that is an issue.

Lastly you can now provide additional input to the VM by passing initial values for registers, e.g.

./mangle -i0 1 -s0 "hello" examples/foo.mangle

Will pre-populate the values of i0 and s0 when the program starts running. This would allow you to, for example, vary operation or pass in a random seed directly. E.g.

start:
  load r, s0

Would seed the random generator with the string passed from the command-line.

As an aside, when I built mangle the path of least resistance was to store strings as character lists which Elixir handles quite naturally. But list performance is poor for indexed access and mutation beyond the start of the list. I experimented with both tuples (e.g. {?h, ?e, ?l, ?l, ?o}) and Erlang :array’s but despite them having better performance characteristics in general for a mixed use case, with such low values of N the constants dominated. So char lists it is!

If you think I am insane to have put this much effort into something this niche you’d at least be in a category of 2 but what can I say? I like it.

3 Likes