Mangle VM

I have no idea if anyone will find this interesting but my Christmas project this year is Mangle: A text manipulation virtual machine. I probably need to explain that.

Imagine a CPU designed to work with words (& characters) and which has an instruction set tuned for working with them.

Let’s start with the “hello world” of Mangle:

start:
  ins_str "Hello"
  halt

If you run it you get the following:

➜  mangle git:(main) ✗ ./mangle examples/ex_01.mangle

hello

Okay not very inspiring. How about this:

# Pig Latin: "string" -> "ingstray"
# If starts with vowel: "apple" -> "appleway"

start:
  load tv, "aeiou"
  load tc, "bcdfghjklmnpqrstvwxyz"

  # Check if starts with vowel
  load rm, 0
  match tv
  jmp_ok vowel_start

consonant_start:
  load rm, 0
  match tc
  jmp_fail move_done

  # Move first char to end
  move_char 0, @last_index
  jmp consonant_start

move_done:
  # Append "ay"
  load rc, @len
  ins_str "ay"
  halt

vowel_start:
  # Just add "way"
  load rc, @len
  ins_str "way"
  halt

If you run this program you get the following output:

➜  mangle git:(main) ✗ ./mangle examples/pig_latin2.mangle <rain_in_spain
ethay
ainray
inway
ainspay
allsfay
ainlymay
onway
ethay
ainplay

At a guess some of you are also interested in conlangs and that’s really the reason I built Mangle. Here is a program for generating words in the Shivan language (from my friend Chris Bateman’s 1995 sci-fi RPG Outlands).

start:
  # load the shivan syllables. We map
  # A=C1 B=V1 C=C2 D=V2 E=C3 F=V3
  load ta, {y:2, p:1, sh:1, k:1, h:1, s:1, b:1, m:1, n:1, r:1, v:1, t:1, d:1, g:1, th:1, l:1, kh:2}
  load tb, {aa:2, o:1, i:3, a:6, e:3, u:1, ia:2}
  load tc, {n:2, k:2, sh:1, h:1, t:1, s:2, r:2, l:1, th:1, d:1, m:2, p:1, z:2}
  load td, {o:3, i:4, a:6, e:3, u:3}
  load te, {p:3, d:2, sh:1, s:2, m:1, n:2, r:2, t:2, th:1, l:3}
  load tf, {o:5, a:6, i:3, u:5}

  clear

  load r0, rr
  mod r0, 100
  lte r0, 20
  jmp_ok one_syllable
  lte r0, 80
  jmp_ok two_syllables
  jmp three_syllables

one_syllable:
  comment "One Syllable"
  load r1, 1
  jmp gen_syllables

two_syllables:
  comment "Two Syllables"
  load r1, 2
  jmp gen_syllables

three_syllables:
  comment "Three Syllables"
  load r1, 3

gen_syllables:
  eq r1, 0
  jmp_ok finished

gen_syllable:
  call roll_2d10
  lte r4, 10
  jmp_ok gen_cvc
  lte r4, 14
  jmp_ok gen_cv
  lte r4, 18
  jmp_ok gen_vc
  jmp gen_vcv

gen_cvc:
  comment "CVC"
  ins_char ta
  ins_char tb
  ins_char tc
  dec r1
  jmp gen_syllables

gen_cv:
  comment "VC"
  ins_char ta
  ins_char tb
  dec r1
  jmp gen_syllables

gen_vc:
  comment "VC"
  ins_char td
  ins_char te
  dec r1
  jmp gen_syllables

gen_vcv:
  comment "VCV"
  ins_char td
  ins_char te
  ins_char tf
  dec r1
  jmp gen_syllables

finished:
  halt

# Generate a value from 2-20 and put it in R4
roll_2d10:
  load r4, 0
  load r5, rr
  mod r5, 10
  inc r5
  add r4, r5
  load r5, rr
  mod r5, 10
  inc r5
  add r4, r5
  ret

Here are some Shivan words:

➜  mangle git:(main) ✗ ./mangle --n 5 examples/shivan.mangle
itanlo
ta
zel
aariyma
sukayziash

You might ask why I built this when I could have just written an Elixir, Ruby or Clojure program to mess about with strings. Partly it’s curiousity, I’d had the idea of a “string vm” in my head, and partly because I have some crazy ideas about “evolving” programs to generate langauges and I figured that would be a lot easier to do in a language closer to an assembly language instruction set.

If you’ve never written assembly language (and I guess you maybe have to be of a certain age…) it might look a little intimidating but since there are a limited number of concepts and instructions it’s actually relatively simple to grapple with. I think.

Some things of note. Mangle has a set of general purpose registers r0..r7 which do the usual things registers do. It also has table vector registers ta..tz which can support digraph characters and that can both generate:

load r0, ta # generate a character from the table register a 

and match

match ta # match a character from the table register a

There are also string registers s0..s3 for playing with bits of strings as you go.

You can write subroutines using call/ret which uses a PC stack although so far I haven’t needed a general purpose stack. I “reserve” r4..r7 for subroutines and r0..r3 for the main program. My roll_2d10 subroutine returns its value in r4.

rc is the “cursor” register and many operations, such as ins_char or shuffle are anchored around the cursor. inc rc or dec rc are easy ways to move around the word. rm is the match start register and @match_len a special value containing the length of the match.

The rr register is special in that it returns a random value if you load from it load r0, rr. But if you load a value into rr:

load rr, "fourty two"

you seed the random number generator. load rr, @word seeds the random number generator with the word you fed the VM (if any).

There are many instructions for adding, moving, deleting, swapping, and shuffling characters.

If anyone is curious and might like to play with it I will be publishing the source shortly and also put it on the web.

10 Likes

I can’t imagine why I might need this, and yet I would like to see the source and read the docs :smiley:

4 Likes

I’ve spent quite a bit of time revisiting core design choices and refactoring the code base. A change with a lot of ripples was making all registers explicit. I’m not quite ready to release this into the wild. But here’s an example program that rotates all the vowels of a word right.

start:
  load w, in
  load tv, "aeiou"

collect_vowels:
  gte rm, w.len
  jmp_ok rotate_vowels

  match w, rm, tv
  jmp_ok collect_vowel
  inc rm
  jmp collect_vowels

collect_vowel:
  loadchar r1, w, rm
  insert s0, s0.len, r1
  inc rm
  jmp collect_vowels

rotate_vowels:
  rotr s0

replace_vowels:
  load r0, 0
  load rm, 0

find_vowel:
  gte rm, w.len
  jmp_ok done

  match w, rm, tv
  jmp_ok replace_vowel
  inc rm
  jmp find_vowel

replace_vowel:
  loadchar r1, s0, r0
  replace w, rm, r1
  inc rm
  inc r0
  jmp find_vowel

done:
  halt 0

So:

➜  mangle git:(main) ✗ ./mangle examples/vowel_rotate.mangle <words
T odecutain
T bueatufil
T saqeuoi
T eatumoboli
T eaduince
T fucateios
T ubstameios
T qeusteinnoari
T ceoporeta
T rodaitain

How this program works is that it loads the input word into w and the English vowels into table register v. Then it scans w for vowels (using the matchinstruction) and appends them to string register s0.

So for “eduation” s0 ends up containing “euaio” (I note that education contains each English vowel once, neat!).

Then it rotates s0 one place to the right giving “oeuai”.

Then it re-traverses w and each time it finds a vowel it replaces it with the next vowel in s0. It keeps track of the “next vowel” index in r0.

This is kind of a silly example and you could do this much more elegantly in, say, ruby:

def rotate_vowels(word)
  vowels = %w[a e i o u]

  vowel_enum = word.chars
                   .select { |c| vowels.include?(c) }
                   .rotate(-1)
                   .each

  word.chars.map { |c| vowels.include?(c) ? vowel_enum.next : c }.join
end

puts rotate_vowels("education") # => "odecutain"

However, the point of Mangle is not particularly the kind of elegance nor conciseness you’d get from Ruby or Elixir. Mangle is designed to be a tool for a genetic programming system to express itself.

That said this example has motivated me to add call_ok/call_fail which should streamline this example a little by turning collect_vowel and replace_vowel into subroutines.

2 Likes

With call_ok my example is just a little neater:

start:
  load w, in
  load tv, "aeiou"

collect_vowels:
  gte rm, w.len
  jmp_ok rotate_vowels

  match w, rm, tv
  call_ok collect_vowel
  inc rm
  jmp collect_vowels

collect_vowel:
  loadchar r1, w, rm
  insert s0, s0.len, r1
  ret

rotate_vowels:
  rotr s0

replace_vowels:
  load r0, 0
  load rm, 0

find_vowel:
  gte rm, w.len
  jmp_ok done

  match w, rm, tv
  call_ok replace_vowel
  inc rm
  jmp find_vowel

replace_vowel:
  loadchar r1, s0, r0
  replace w, rm, r1
  inc r0
  ret

done:
  halt 0

My mangle code takes 33 lines to do what Ruby can do in 8. I’ll take it. Performance is a less happy picture where Ruby can handle /usr/share/dict/words in about 1.3s and Mangle takes closer to 13s. That said, a naive home-brew VM (implemented in Elixir) being only 10x as slow as one that has had a team optimising it for years, maybe isn’t so bad.

2 Likes

They say “build one to throw it away” but I’ve gone one better and thrown 2 away because I was unhappy with some of the design & implementation choices I had made. I’ve spent entirely too much time building this but had a lot of fun in the process and now have something that is working more or less how I want.

Here is another of the Mangle examples: fremen.mangle

start:
  # Generate a word using the Fremen language schema defined in the Outlands
  # RPG Fifth Edition 1995. Copyright Chris M. Bateman. Page 187

  # Syllable Structure CVC, CV
  # Syllable C
  load ta, {j:2, f:1, kr:1, n:1, sh:1, t:1, d:1, m:1, b:1, r:1, h:1, k:1, s:1, l:1, z:1, w:1, gh: 2}
  # Syllable V
  load tb, {u:4, a:5, i:4, ai: 2, e: 4}
  # Syllable C
  load tc, {m:3, q:1, th:1, jj:1, s:1, k:1, h:1, n:1, kh:1, d:1, l:1, t:1, z:2, b:3}

  # Syllable Structure VC
  # Syllable V
  load td, {u:6, a:5, i:5, ai: 2, e:2}
  # Syllable C
  load te, {m:2, n:1, h:2, k:4, d:4, b:2, kh:1, l:1, f:2}

  clear w

  call roll_d10
  eq i4, 1
  jmp_ok one_syllable
  lt i4, 8
  jmp_ok two_syllables
  jmp three_syllables

# i0 stores the syllable count
one_syllable:
  comment "One Syllable"
  load i0, 1
  jmp gen_syllables

two_syllables:
  comment "Two Syllables"
  load i0, 2
  jmp gen_syllables

three_syllables:
  comment "Three Syllables"
  load i0, 3

gen_syllables:
  eq i0, 0
  jmp_ok finished

gen_syllable:
  call roll_2d10 # roll is returned in i4
  lte i4, 12
  jmp_ok gen_cv
  lte i4, 16
  jmp_ok gen_cvc
  jmp_ok gen_vc

gen_cv:
  comment "CV"
  insert w, w.len, ta
  insert w, w.len, tb
  jmp next_syllable

gen_cvc:
  comment "CVC"
  insert w, w.len, ta
  insert w, w.len, tb
  insert w, w.len, tc
  jmp next_syllable

gen_vc:
  comment "VC"
  insert w, w.len, td
  insert w, w.len, te
  jmp next_syllable

next_syllable:
  dec i0
  jmp gen_syllables

finished:
  halt

# Generate a value from 1-10 and put it in i4
roll_d10:
  load i4, r
  mod i4, 10
  inc i4
  ret

# Generate a value from 2-20 and put it in i4
roll_2d10:
  load i4, 0
  load i5, r
  mod i5, 10
  inc i5
  add i4, i5
  load i5, r
  mod i5, 10
  inc i5
  add i4, i5
  ret

Here’s Mangle in action generating new Fremen words:

➜  mangle git:(main) ✗ ./mangle --quiet --n 10 examples/fremen.mangle
ju
fadeghi
lashinu
fighuz
tuhe
ghisait
su
jahu
sas
nudaju

The VM has 43 instructions (including comparisons like eq, gt, lte and different forms of call and jmp). The Fremen example uses just 17 instructions and 8 registers.

The VM has a set of index registers (integer based for characters and offsets), string registers, table registers, and flags.

The table registers are used both for matching characters and also for generating characters. So match w, c, tc, tv will succeed if w[c] is any of the characters in table c and w[c+1] is any of the characters in table v. While load i0, tv will load a random grapheme from table v based on the relative frequency of the table entries.

Grapheme support includes single characters, di-graphs ab and tri-graphs kth. When inserting into a string register you’ll get all the characters. When loading into an index register you will get a random character from the selected grapheme. The latter isn’t a behaviour I’ve needed yet (typically I use di & tri graphs for generating not matching) but seemed the most reasonable way to handle the case without throwing an error.

The index registers are for calculation and string offsets. The string registers are for managing substrings. For example, here is a program that uses a table register and string register to rotate all the vowels in a word one place to the right:

start:
  load w, in
  load tv, "aeiou"

collect_vowels:
  gt c, w.li
  jmp_ok rotate_vowels

  match w, c, tv
  call_ok collect_vowel
  inc c
  jmp collect_vowels

collect_vowel:
  load_char i5, w, c
  insert s0, s0.len, i5
  ret

rotate_vowels:
  rotr s0

replace_vowels:
  load c, 0
  load i0, 0 # index into vowels

find_vowel:
  gt c, w.li
  jmp_ok done

  match w, c, tv
  call_ok replace_vowel
  inc c
  jmp find_vowel

replace_vowel:
  load_char i5, s0, i0
  replace w, c, i5
  inc i0
  ret

done:
  halt 0

It loops through the characters of the input word matching vowels (match w, c, tv uses the table register v as a matcher) and collecting them into s0. It then uses rotr to rotate s0 to the right. Then it loops the word register w again and replaces each vowel it finds with the next vowel from s0.

➜  mangle git:(main) ✗ ./mangle --compare --quiet examples/vowel_rotate.mangle <words
christmas -> chrastmis
france -> frenca
mallam -> mallam
education -> odecutain
beautiful -> bueatufil
sequoia -> saqeuoi
automobile -> eatumoboli
audience -> eaduince
quebec -> qeubec
facetious -> fucateios
abstemious -> ubstameios
questionnaire -> qeusteinnoari
cooperate -> ceoporeta
radiation -> rodaitain

My main purpose for this tool is generating new words either from whole cloth (the Shivan and Fremen generator examples) or by modify existing language words (the vowel rotation example) using a program built from a limited set of easy to understand instructions running on a machine with quite a restricted set of features, i.e. the whole thing is easy to write and reason about.

I want to tidy up the documentation which has gotten seriously of out of date as I have rewritten the code. Then I will do a release in case this idea tickles anyone else. If you have any conlangs you’d like it to generate I am happy to try and write them.

For those who might be wondering, I paired with Claude Code to work on the original implementation of Mangle. At first I was impressed and working test-first (essentially doing XP with Claude as the partner with the keyboard) had something that demonstrably worked.

But I quickly decided the code quality did not meet my standards so, painful as it was, I started again and wrote the parser and instruction set implementation by hand. Claude was still useful in helping me identify the source of some of the more complex (or stupid) bugs in my code and tests.

2 Likes

Belatedly I realised that because of the words I had started playing with long ago I was being very ASCII-centric. In fact it was relative straightforward to make Mangle unicode aware (Elixir has excellent unicode support).

As a proof of concept I hacked together a poor man’s Quenya generator

➜  mangle git:(main) ✗ ./mangle --quiet --n 20 examples/quenya.mangle
cuntëmat
irusëva
meremo
vutuë
ñuiha
pifeine
ceiaimú
oren
hwilióhwix
pëtsirtër
muyaur
áí
liññó
neauriñpu
sániya
puera
sortulen
ñeñnapó
vaiapefut
feiotéhwirir
2 Likes

I think this might be the tool I was contemplating for years. At some point I would love to give this a spin!

1 Like

The instruction set has settled down, I’ve done about as much performance optimisation as I plan to. Mangle is about 6-8x slower than the equivalent Ruby code which is probably about as good as it gets without using NIFs for the hot paths or switching to another language. The main thing I will work on from here is debugging. There’s a –-trace option that dumps a lot of information about the VM state but it could be better. I’d like a gdb style, break, single-step, and inspect.

I’m not quite ready to publish yet but I have been working on the

programmers_guide_v0.2.pdf (125.9 KB)

Which covers the VM, instruction set, operations, and so on.

The main thing I have added recently are a loop instruction copied from the 8086 instruction set. It’s purely a syntax sugar since:

load i0, 5
something:
  someop
  dec i0
  eq i0, 0
  jmp_fail something

and

load i0, 5
something:
  someop
  loop i0, something

Are equivalent and you’re not saving many instructions. But I think it makes the intent much clearer.

I have also added FSM based matchers. You build an FSM by loading into one of the new m registers. Example:

load m0, "bn"
load m0, "bt"
load m0, "bp"
match w, 0, m0, i0

Will match any of those 3 strings. You can add strings of varying lengths and i0 will be populated with the length of the match. This makes, for example, finding potentially variable length prefixes or suffixes much more convenient. The matchers are not presently greedy, I’m not sure if in practice that is an issue.

Lastly you can now provide additional input to the VM by passing initial values for registers, e.g.

./mangle -i0 1 -s0 "hello" examples/foo.mangle

Will pre-populate the values of i0 and s0 when the program starts running. This would allow you to, for example, vary operation or pass in a random seed directly. E.g.

start:
  load r, s0

Would seed the random generator with the string passed from the command-line.

As an aside, when I built mangle the path of least resistance was to store strings as character lists which Elixir handles quite naturally. But list performance is poor for indexed access and mutation beyond the start of the list. I experimented with both tuples (e.g. {?h, ?e, ?l, ?l, ?o}) and Erlang :array’s but despite them having better performance characteristics in general for a mixed use case, with such low values of N the constants dominated. So char lists it is!

If you think I am insane to have put this much effort into something this niche you’d at least be in a category of 2 but what can I say? I like it.

4 Likes

In trying to describe what Mangle is I realise I need to first explain what a Virtual Machine is. I hope I have hit upon a useful analogy.

  • What is a Virtual Machine?
  • Think of it like a small, purpose-built computer, defined entirely by rules. A familiar example is a calculator.
  • A calculator has internal state such as the number you are currently typing and the running total.
  • It has a small set of instructions like +, -, ×, ÷ and =.
  • The calculator “runs a program” made up of button pushes on its keyboard.
  • For example we can think of 10÷2= as a tiny “program” that produces the result 5.
  • When you type 10, then ÷, then 2, then =, the calculator reads each instruction in order, updating its internal state as it goes, before finally displaying the result 5.
  • This combination of state, a defined instruction set, and step-by-step process is what makes it a virtual machine.
  • Mangle works the same way but also deals in strings of characters and adds instructions for doing things like matching, inserting, or replacing characters.
  • The main difference is that instead of pressing buttons you write Mangle instructions into a file and run them all at the same time.
  • But the basic process is entirely the same.

What do you think?

Btw… I realise that this has only a tenuous connection to IF, despite my using this tool in creating my own fictional words for my fictional worlds. So if this is annoying people I will take it elsewhere (assuming there is an elsewhere anywhere!)

I, too, think this is only marginally connected to IF. But’s it’s not annoying me in any way. I won’t use it, but being a programming language nerd I find this interesting.

1 Like

Yeah, parser IF developers specifically tend to enjoy VMs, weird programming languages, and text manipulation, so I think this is fine here.

4 Likes

I’ve released v1.0.0 of Mangle now with binaries for Linux, macOS, and Windows. I am pretty happy with this thing, especially how I was able to very quickly build an interactive debugger (a necessity really).

If you’re interested in how to build a VM the source is there and pretty easy to understand I think (modulo unfamiliarity with Elixir). If anyone wants me to walk them through part of it to learn some Elixir I’d be happy to. I’m a big fan of parser combinators and there are two parsers in Mangle.

There is a programmers guide and a 10-min intro video that runs some Mangle programs and explains what is going on.

1 Like

Yippee! :grin:

1 Like

Phase 2 of this project is an evaluator. Given a language spec, how “fit” is any given word to that language. At this point I don’t know how to specify a language. In the past I’ve used n-gram analysis of a dictionary to come up with a fitness evaluator, that might be good enough.

One enhancement of that might be to also combine word frequency (or, for some definition of it “familiarity”) into the analysis. So not just “how often does this n-gram appear in English words” but “How ‘English’ is this word anyway?” (which feels somewhat recursive).

1 Like

I am ready for the user pool of this tool to be N=1 but if you do happen to make anything interesting with it I’d be made up. If memory serves you have good programming experience but, in any case, I’m happy to help you out if you get stuck.

2 Likes

You may have already considered this, but I’ve found entropy measurements like surprisal to be very good for that kind of thing (usually the surprisal conditioned on the preceding letter). I helped a colleague gather a corpus of keysmashes at one point by using that to detect words that were definitely not English (and probably not any language using the Latin alphabet).

1 Like

Interesting counter point to the n-gram analysis approach, thank you. I’ll dig a little deeper into this.

Excellent! Once the powerlines are repaired after the incoming icy destruction has passed, I’ll make sure to give this a spin! :star_struck:

I’ve released Mangle VM 1.1.0. It contains two main changes:

  • Support for negative values (load i0, -1)
  • A new jmp_table instruction

The latter makes it easier to conditionally use different behaviours, e.g.

load i0, r
mod i0, 4
jmp_table i0, label1, label2, label3, label4

In this case we pick a number from 0-3 and then jump to a different label (0=>label1, 1=>label2, and so on).

This replaces somewhat laborious code like:

eq i0, 0
jmp_ok label1
eq i0, 1
jmp_ok label2
eq i0, 2
jmp_ok label3
eq i0, 3
jmp_ok label4
1 Like