ISHML: New JavaScript Library for Parser IF

lft · August 13, 2019, 7:00am

The traditional way of doing this is to always tokenize based on the longest match. So, if “>>” is a valid token, that never comes out as two instances of “>” (assuming that is also a valid token). For identifiers, you use a regex that doesn’t include whitespace, e.g. “[a-z]+”. Then you have a separate regex for whitespace that doesn’t produce a token.
Of course, just because this is the traditional way, it isn’t necessarily the best approach in your context where you want to return every possible parse. Perhaps it could be a configurable option.
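
As a rough illustration of the longest-match approach described above, here is a minimal sketch in plain JavaScript. It is not ISHML code; the token names and patterns are made up for the example, and whitespace is matched but never emitted as a token.

```javascript
// Hypothetical token table; names and patterns are illustrative only.
// The sticky flag (y) makes each regex match exactly at lastIndex.
const TOKEN_TYPES = [
  { type: "shift", pattern: />>/y },
  { type: "gt", pattern: />/y },
  { type: "ident", pattern: /[a-z]+/y },
  { type: "space", pattern: /\s+/y, skip: true }, // matched but never emitted
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let best = null;
    for (const { type, pattern, skip } of TOKEN_TYPES) {
      pattern.lastIndex = pos;
      const match = pattern.exec(input);
      // Keep the longest match; the earliest table entry wins ties.
      if (match && (best === null || match[0].length > best.text.length)) {
        best = { type, text: match[0], skip };
      }
    }
    if (best === null) {
      throw new Error(`Unexpected character at position ${pos}: ${input[pos]}`);
    }
    if (!best.skip) tokens.push({ type: best.type, text: best.text });
    pos += best.text.length;
  }
  return tokens;
}

console.log(tokenize("a >> b > c").map(t => t.type));
// → [ 'ident', 'shift', 'ident', 'gt', 'ident' ]
```

A parser that wants every possible parse could instead keep all matches at each position rather than only the longest one, which is the configurable option suggested above.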

bikibird · August 26, 2019, 4:12pm

Thanks for this. It got me thinking.

Currently the parser only has regex support for separators. Tokens are looked up in a trie data structure. The parser operates in two passes, one to tokenize the string and the other to parse the tokens.

I’m thinking about changing this around a bit to a “tokenize as you go” design. When a terminal grammar rule is hit, the rule will bite a token (including alternative matches) off the input string. Or, if specified as an option for the rule, match a regex pattern to create a token on the fly. This would certainly make handling numbers easier.
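
To make the "bite a token, including alternative matches" idea concrete, here is a small sketch of a trie lookup in plain JavaScript. It is only an assumption about how such a lookup could work, not ISHML's internal structure; the `Trie`, `register`, and `bite` names are invented for the example, and separator handling is left out.

```javascript
// Illustrative trie; this is not ISHML's internal structure.
class Trie {
  constructor() {
    this.root = { children: new Map(), definitions: [] };
  }
  // Store a definition under the given lexeme.
  register(lexeme, definition) {
    let node = this.root;
    for (let i = 0; i < lexeme.length; i++) {
      const ch = lexeme[i];
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), definitions: [] });
      }
      node = node.children.get(ch);
    }
    node.definitions.push(definition);
    return this;
  }
  // Return every lexeme that starts at `start`, shortest first,
  // together with the text left over after it.
  bite(input, start = 0) {
    const matches = [];
    let node = this.root;
    for (let i = start; i < input.length; i++) {
      node = node.children.get(input[i]);
      if (!node) break;
      if (node.definitions.length > 0) {
        matches.push({
          lexeme: input.slice(start, i + 1),
          definitions: node.definitions,
          remainder: input.slice(i + 1),
        });
      }
    }
    return matches;
  }
}

const lexicon = new Trie()
  .register("put", { part: "verb", key: "put" })
  .register("put on", { part: "verb", key: "wear" });

console.log(lexicon.bite("put on the ring").map(m => m.lexeme));
// → [ 'put', 'put on' ]
```

Both alternatives come back, so a grammar rule can try each one and continue parsing from the corresponding remainder.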

vaporware · September 14, 2019, 2:12pm

What are the advantages of using this over, say, the Antlr JavaScript target? The latter can deal with (some?) left recursion, it has a compact regex-like syntax, and it can target multiple platforms with the same grammar.

Using a parser generator-like system for IF is an interesting idea… Infocom made a move in that direction with their V6 games, but I’m not sure what it achieved.

bikibird · September 14, 2019, 4:38pm

Thanks for this question. While I don’t have any experience with ANTLR, early on I did play with PEGjs, which has similar goals. The idea of not having to figure out how to write a parser was certainly appealing as I contemplated the ISHML project.

Ultimately, I decided to not go that route. Although ISHML is currently only a parser, it will eventually cover all aspects of writing IF in Javascript. The idea of sending someone to a third party tool to make a parser added more complexity and user unfriendliness than I wanted. Also, the code generated by PEGjs was hard to follow. If you ever lost the BNF, etc., you would have a very hard time reconstructing it from the generated Javascript. An important value for ISHML is self-documenting code. PEGjs really didn’t provide that.

ANTLR and PEGjs are designed to be general purpose, but the ISHML parser is specialized to interactive fiction. The parser is doing more than parts of speech tagging. When the parser finds a match in the lexicon, it returns a definition. The definition is an arbitrary payload. This allows the parser to bind the results to the story model. For example, for the verb “take”, it might return a function for taking. For the noun “slipper”, it might return a data structure representing the slipper. While you could possibly do something like this in PEGjs (and I suspect ANTLR), ISHML has facilities built-in to make it fairly intuitive to someone familiar with the IF domain. So the intimidation factor is greatly reduced and parsing doesn’t seem so impossibly difficult to understand.
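
A toy version of the "definition as arbitrary payload" idea might look like the following. This is plain JavaScript rather than the ISHML API, and every name in it (`storyLexicon`, `take`, `run`, the shape of the definitions) is an assumption made for illustration.

```javascript
// Hypothetical story model and lexicon; none of these names come from ISHML.
const slipper = { name: "ruby slipper", location: "floor" };
const player = { inventory: [] };

function take(thing) {
  thing.location = "player";
  player.inventory.push(thing);
  return `You take the ${thing.name}.`;
}

// Each word maps to a list of definitions, since a word may have several senses.
// The payload can be anything: a function for a verb, an object for a noun.
const storyLexicon = new Map([
  ["take", [{ part: "verb", action: take }]],
  ["slipper", [{ part: "noun", thing: slipper }]],
]);

// A toy "verb noun" command: look both words up and bind them together.
function run(command) {
  const [verbWord, nounWord] = command.trim().split(/\s+/);
  const verb = (storyLexicon.get(verbWord) || []).find(d => d.part === "verb");
  const noun = (storyLexicon.get(nounWord) || []).find(d => d.part === "noun");
  if (!verb || !noun) return "I don't understand.";
  return verb.action(noun.thing);
}

console.log(run("take slipper")); // → You take the ruby slipper.
```

The point of the sketch is that the parse result already carries the function and the object, so no second lookup from word to story model is needed.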

I did figure out how to add regex support and that will be available the next time I release.

For more info on ISHML and where it’s headed, read this blog post.

vaporware · September 17, 2019, 12:19pm

I see. Yeah, I think what I missed the first time through was the way those definitions enable the token filtering in rules.

I had the impression that the author would need to write separate grammar for each action (and/or object), but if I understand correctly, the filtering system would let you (the system author) write general rules to parse things like “command with a verb and two noun phrases separated by a preposition”, and then the game author would be able to reuse them just by adding their new words to the lexicon with definitions that pass the right filters. Very cool - like a more flexible version of the Z-machine dictionary.
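
Here is how that reuse could look in a stripped-down form. This is a plain JavaScript sketch, not ISHML's rule syntax; the rule shape, the `part` tags, and the `match` helper are all hypothetical, and noun phrases are reduced to single words to keep it short.

```javascript
// Hypothetical lexicon entries; the `part` tags are what the filters inspect.
const words = new Map([
  ["put", [{ part: "verb", key: "put" }]],
  ["ring", [{ part: "noun", key: "ring" }]],
  ["in", [{ part: "preposition", key: "in" }]],
  ["tumbler", [{ part: "noun", key: "tumbler" }]],
]);

// One general rule, written once by the system author:
// verb, noun, preposition, noun.
const verbNounPrepNoun = [
  { name: "verb", filter: d => d.part === "verb" },
  { name: "directObject", filter: d => d.part === "noun" },
  { name: "preposition", filter: d => d.part === "preposition" },
  { name: "indirectObject", filter: d => d.part === "noun" },
];

function match(rule, command) {
  const tokens = command.trim().toLowerCase().split(/\s+/);
  if (tokens.length !== rule.length) return null;
  const result = {};
  for (let i = 0; i < rule.length; i++) {
    const definition = (words.get(tokens[i]) || []).find(rule[i].filter);
    if (!definition) return null; // no sense of this word passes the filter
    result[rule[i].name] = definition.key;
  }
  return result;
}

console.log(match(verbNounPrepNoun, "put ring in tumbler"));

// A game author adds new words; the rule itself is untouched.
words.set("drop", [{ part: "verb", key: "drop" }]);
words.set("slipper", [{ part: "noun", key: "slipper" }]);
console.log(match(verbNounPrepNoun, "drop slipper in tumbler"));
```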

One disadvantage I’ve seen in tools like ANTLR has been the way they report errors. Out of the box, you only get super-concrete messages like “expected INT, FLOAT, ‘{’, but got STR”… but for anything more, you end up writing twice as many grammar rules to cover all the error cases. Any plans for a domain-specific error reporting system in ISHML?

bikibird · September 17, 2019, 2:30pm

Yes, exactly that.

In the next version to be released, if the text cannot be fully interpreted, you will get all possible partial interpretations sorted by the length of the remaining uninterpreted text. So, if you try to take something non-existent, you would get the gist populated with the definition for the verb “take” and the remainder would contain the rest of the string. This allows you to write error messages like “You want to take the shiny thingamabob, but I don’t know what that is.”
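
Roughly, the idea could be used like this. The object shape below (`gist`, `remainder`) follows the description above, but the exact structure ISHML returns is an assumption here, and `rank` and `explain` are hypothetical helpers.

```javascript
// Hypothetical partial interpretations of "take the shiny thingamabob".
const interpretations = [
  { gist: {}, remainder: "take the shiny thingamabob" },
  { gist: { verb: "take" }, remainder: "the shiny thingamabob" },
];

// Shortest remainder first: the interpretation that got furthest leads.
function rank(list) {
  return [...list].sort((a, b) => a.remainder.length - b.remainder.length);
}

function explain(list) {
  const [best] = rank(list);
  if (!best) return "I don't understand that.";
  if (best.remainder === "") return "OK.";
  if (best.gist.verb) {
    return `You want to ${best.gist.verb} ${best.remainder}, but I don't know what that is.`;
  }
  return "I don't understand that.";
}

console.log(explain(interpretations));
// → You want to take the shiny thingamabob, but I don't know what that is.
```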

vaporware · September 20, 2019, 12:29am

Is there a way to identify which rule caused the parse failure for each interpretation? I think that usually has more effect on the choice of error message than the parts that were parsed successfully.

Draconis · September 20, 2019, 2:07am

I’m not quite sure I understand—if I type VZLONB THE BOOK, which rule would cause that to fail? It seems like it would be all of them, since all of them rejected it on the first word (it didn’t match the verbs any of them were looking for).

vaporware · September 20, 2019, 4:02am

Well, there are a few ways a command could successfully start with VLONZB, given the right definitions. For example:

As a verb: VLONZB THE BOOK, fitting the rule <verb> <noun phrase>
As a direction: VLONZB, fitting the rule <direction>
As part of a noun phrase: VLONZB, LEND ME YOUR TOWEL, fitting the rule <noun phrase> <comma> <command>

If VLONZB isn’t actually defined in a way that makes any of those work, then knowing the ways it could have matched helps us generate the right error message. We tried interpreting it as a verb, a direction, and an object being ordered; of those, “verb” is the most likely interpretation, so maybe we’d print an error message about not knowing how to vlonzb.

For another example, consider GIVE SMALL ORANGE VLONZB, where the offending word could be read:

As the first or second word of a noun phrase, in a command with no preposition separating the objects, as in GIVE THE SMALL ORANGE THE VLONZB, or GIVE INSPECTOR SMALL THE ORANGE VLONZB
As the third word of the first noun phrase in the same sort of command, as in GIVE THE SMALL ORANGE VLONZB A KISS
As the third word of the first noun phrase in a command with a preposition between objects, as in GIVE THE SMALL ORANGE VLONZB TO THE GUARD

In that case, we need to decide between messages like:

I don’t see any vlonzb here.
I don’t see any orange vlonzb here.
You’ll have to tell me what you want to give the small orange vlonzb.
You’ll have to tell me who you want to give the small orange vlonzb to.

Rather than just knowing which rule failed (which, indeed, will always be “all of them”), I think maybe the way to model this would be to let each (sub)rule set an optional failure context, which will be returned if the parse fails inside that rule. Something like:

var reversedCmd = ISHML.Rule()
    .snip("verb", verb.clone())
    .snip("indirectObject", nounPhrase.clone())
    .snip("directObject", nounPhrase.clone());

reversedCmd.verb.failure = {state: "verb"};

reversedCmd.indirectObject.failure = {state: "nounPhrase", which: "indirectObject"};

reversedCmd.directObject.failure = {state: "nounPhrase", which: "directObject"};

bikibird · September 20, 2019, 2:54pm

Currently, there isn’t a way to do that, but I can see how it would be useful. I’ll look into it.

What would you be looking for in terms of an error object?

bikibird · December 15, 2019, 8:15pm

I want to thank everyone for their feedback on ISHML. Based on what you’ve told me, I’ve reworked the API quite a bit and a new release of ISHML (0.1.2) is now available. This release is not backward compatible with the prior release. If anyone needs help upgrading their code, just send it to me in a PM.

Added support for regular expression matching.
Added support for custom error handling in the event of a rule mismatch.
The API is now exposed through global object ishml, not ~~ISHML~~.
ishml.Parser.analyze no longer takes an options argument. Instead, the options are set directly on rules.
Added many new options for configuring ishml.Rule.

I significantly reworked the API documentation and tutorials, which are available at https://whitewhalestories.com.

*If you’ve been paying attention, you will note that I went backward on release numbers. I want to more properly follow SemVer for release numbers and starting with 0 indicates that the API is still under development and should be regarded as unstable. Please regard release 1.1.2 which came before release 0.1.2 as apocryphal and I will try to do better managing my releases in the future.

dcsan · July 6, 2020, 12:25am

This looks like an amazing piece of work. I’m still reading the docs but I wanted to confirm some of my understanding:

Is this mainly a parser that would be used for parsing “inputs” from a user interacting with some IF content? eg the examples like `take the ruby slipper` as user actions?

Would you say it is relevant for authors to write a story and then use the parser to try and understand fairly freeform literature, to turn it into some type of object system for a game?
eg There is a book on the table
Or would you say it’s most useful as a user input parser, combined with a more structured approach to defining the story and world (I think that’s the direction I would take)

Are there any examples of content that leverage this system?

Most of the docs provided focus on the API details and parsing individual sentences.

How do the results compare with NLP “dependency parsing”?

eg spaCy provides this API:

The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”.

Which seems similar in terms of results, although I’m not sure about the underlying path you took to reach your goal.
Perhaps because of the lexicon the ISHML results can be more specific, including concepts like an “npc” that don’t exist in spaCy’s concept of “natural language”.

Your fluent API looks very nice for navigating parser results.
JS recently has nullish coalescing to make it easier for others to create these types of API. pointfree is another alternative style from functional programming eg with ramdaJS

Thanks!

bikibird · July 6, 2020, 8:06pm

Thanks for the kind words. Regarding what ISHML is and isn’t…

Eventually ISHML will be a complete library for writing interactive fiction in JavaScript. It encourages a highly legible, “fluent” coding style. The target audience is folks who have at least an introductory understanding of JavaScript.

ISHML is still under active development.

The main components of ISHML are:

Parsing

This portion has been released and I know there are others who have experimented with it, but I’m not aware of any completed projects of note.

Generally, the lexicon, rules, and parser are designed to work together to match a user’s input to data objects in a program. The data in the application can be anything and may be unrelated to interactive fiction. (However, I’m particularly interested in applying this to IF.)

The ISHML library implements a recursive descent parser for parsing Context Free Grammars (CFG) that you build using the API. The tutorials include several example scripts. The ISHML parser handles ambiguous input by returning all possible interpretations (according to the grammar it’s processing against).
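
To show what "returning all possible interpretations" means in practice, here is a very small sketch built from parser functions in plain JavaScript. It is not the ISHML API; `word`, `sequence`, `choice`, `optional`, and the sample senses are all invented for the example.

```javascript
// Each parser takes a list of tokens and returns EVERY way it can match a
// prefix of them, as { value, rest } pairs. An empty list means no match.
const word = (part, lexicon) => tokens => {
  if (tokens.length === 0) return [];
  return (lexicon.get(tokens[0]) || [])
    .filter(s => s.part === part)
    .map(s => ({ value: { [part]: s.key }, rest: tokens.slice(1) }));
};

// Run parsers in order, carrying every partial result forward.
const sequence = (...parsers) => tokens =>
  parsers.reduce(
    (partials, parser) =>
      partials.flatMap(({ value, rest }) =>
        parser(rest).map(next => ({
          value: { ...value, ...next.value },
          rest: next.rest,
        }))
      ),
    [{ value: {}, rest: tokens }]
  );

// Try every alternative and keep all of their results.
const choice = (...parsers) => tokens => parsers.flatMap(p => p(tokens));
const optional = parser =>
  choice(parser, tokens => [{ value: {}, rest: tokens }]);

// Hypothetical lexicon in which "glass" is ambiguous.
const senses = new Map([
  ["take", [{ part: "verb", key: "take" }]],
  ["glass", [
    { part: "adjective", key: "glass" },
    { part: "noun", key: "tumbler" },
    { part: "noun", key: "mirror" },
  ]],
  ["slipper", [{ part: "noun", key: "slipper" }]],
]);

const nounPhrase = sequence(optional(word("adjective", senses)), word("noun", senses));
const command = sequence(word("verb", senses), nounPhrase);

// Keep only interpretations that consumed the whole input.
function interpret(text) {
  return command(text.trim().split(/\s+/))
    .filter(r => r.rest.length === 0)
    .map(r => r.value);
}

console.log(interpret("take glass"));
// → [ { verb: 'take', noun: 'tumbler' }, { verb: 'take', noun: 'mirror' } ]
```

Both readings of "glass" survive, and it is up to the caller to choose between them or ask the player.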

Analyzing long passages of free-form text is probably not its best application. It is not doing any statistical NLP processing and is definitely not doing any AI. However, by configuring a semantics function for each rule in your grammar, you could in theory also provide some sort of evaluation of the likelihood of each interpretation. So, potentially you could do more of a statistical approach, but it would require more work on your part.

Story World

The ISHML story world feature is still under development.

It’s a fluent API for creating a network (directed weighted graph) of the data (objects, rooms, people, etc.) that makes up the story. ISHML takes a bit of its theming from Moby Dick. Therefore the network is referred to as a “net,” with the data nodes called knots. You create relations by tying a cord between two knots.
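
Since this component is unreleased, the following is only a guess at the general shape of such a structure: a directed weighted graph with a chainable interface, written in plain JavaScript. The `Net` class and its `knot`, `tie`, and `related` methods are hypothetical names and should not be read as the ISHML API.

```javascript
// Hypothetical names throughout; this is not the ISHML story world API.
class Net {
  constructor() {
    this.knots = new Map(); // knot id -> data
    this.cords = [];        // directed, weighted edges between knots
  }
  knot(id, data = {}) {
    this.knots.set(id, { id, ...data });
    return this;            // returning `this` allows fluent chaining
  }
  tie(from, relation, to, weight = 1) {
    if (!this.knots.has(from) || !this.knots.has(to)) {
      throw new Error(`Cannot tie ${from} to ${to}: unknown knot`);
    }
    this.cords.push({ from, relation, to, weight });
    return this;
  }
  // Follow cords of one relation outward from a knot.
  related(from, relation) {
    return this.cords
      .filter(c => c.from === from && c.relation === relation)
      .map(c => this.knots.get(c.to));
  }
}

const net = new Net()
  .knot("foyer", { kind: "room" })
  .knot("table", { kind: "thing" })
  .knot("book", { kind: "thing" })
  .tie("foyer", "contains", "table")
  .tie("table", "supports", "book");

console.log(net.related("table", "supports").map(k => k.id)); // → [ 'book' ]
```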

There are many ways to navigate and query the net, but as this component is not finished yet, I hesitate to say much about it at this time.

Plot Points

Plot points are still under development.

Plot points are the smallest possible units of story and are strung together in interesting ways to tell a story based on the user’s input. Input may be in the form of text typed by the user, links clicked, or drag and drop. From ISHML’s point of view there isn’t a whole lot of distinction regarding the type of input. Conceptually, plot points are similar to Twine’s passages.

By the end of July I’m releasing a new update that will include a fix to the parsing system. If I’m extremely productive, it will also include an extension to grammar rules so that they can be used for procedural text generation.

By the end of the year I hope to release the story world and plot points components. I also hope to release starter code for Infocom style text adventures. The goal is to have a coding experience that is fun and efficient and does not demand too much of novice users of JavaScript.

dcsan · July 8, 2020, 9:33am

hi! I finally had a chance to try it out a bit.

I sent a PR here, just to see it working in nodeJS (then we could write tests etc.)

However the parser doesn’t do quite what I was expecting…

---
input:		 take the ruby slipper
verb/key 	 take / take
adj/key 	 ruby / ring  <== why ring?
noun/key 	 slipper / slipper
---
input:		 drop the ring in the tumbler
verb/key 	 drop / drop
adj/key 	 undefined /
noun/key 	 ring / ring

<== possible to build more complex grammars with subject/object?
---
input:		 put on the ring
verb/key 	 put on / wear
adj/key 	 undefined /
noun/key 	 ring / ring
---
input:		 take the glass
verb/key 	 take / take
adj/key 	 undefined /
noun/key 	 glass / slipper
---
input:		 take the tumbler
verb/key 	 take / take
adj/key 	 undefined /
noun/key 	 tumbler / tumbler
---
input:		 take the glass slipper
verb/key 	 take / take
adj/key 	 glass / slipper  <== glass is adj. but also a slipper?
noun/key 	 slipper / slipper

It’s a neat library but I need more time to experiment with the grammar.
also I just used the CDN build so it might be out of date

bikibird · July 10, 2020, 6:06pm

What you are missing is a semantics function defined on the nounPhrase to filter out mismatches between adjectives and nouns. Reread the section on semantics in the tutorial Parsing Part 2 – White Whale Stories – ISHML and try adding a semantics function.
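
The general idea, sketched in plain JavaScript rather than with ISHML's actual semantics hook (see the tutorial for that), is a check that rejects interpretations whose adjective and noun point at different things. The candidate shape and the `nounPhraseSemantics` name below are assumptions for illustration.

```javascript
// Hypothetical candidate interpretations of "take the ruby slipper".
// Each pairs one sense of the adjective with one sense of the noun;
// `key` names the story object that the sense refers to.
const candidates = [
  { adjective: { word: "ruby", key: "ring" }, noun: { word: "slipper", key: "slipper" } },
  { adjective: { word: "ruby", key: "slipper" }, noun: { word: "slipper", key: "slipper" } },
];

// Accept a noun phrase only when its adjective (if any) describes the
// same object as its noun.
function nounPhraseSemantics(interpretation) {
  const { adjective, noun } = interpretation;
  return adjective === undefined || adjective.key === noun.key;
}

const valid = candidates.filter(nounPhraseSemantics);
console.log(valid.map(c => `${c.adjective.key} / ${c.noun.key}`));
// → [ 'slipper / slipper' ]
```

With a filter like this in place, the "ruby / ring" pairing from the output above would be discarded instead of being reported first.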
