Anyone tried testing with Selenium or similar

Twine Version: 2.6.0.0

Has anyone tried automated testing of Twine games, for example with Selenium?

What I want is for the testing software to repeatedly click links at random, and see how far it gets before it hits an error.

I can save a transcript:

I think Selenium needs a proper URL, so I have RebexTinyWebServer serving the Twine web page, and I can get Selenium to access the page. But I have to admit I am struggling after that.
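
To be concrete, this is roughly the loop I am imagining (just a sketch, not working code; the local URL is whatever your web server serves, and the CSS selectors are guesses for a SugarCube game):

```python
# Rough sketch: random walk through a locally served Twine game,
# stopping when there are no links left or an error is rendered.
# The URL and the ".passage a" / ".error" selectors are assumptions
# for a SugarCube game; other formats will need different selectors.
import random
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://localhost:8080/MyGame.html")  # hypothetical local URL

for step in range(200):                          # cap the walk length
    links = driver.find_elements(By.CSS_SELECTOR, ".passage a")
    if not links:                                # dead end: nothing left to click
        print(f"Dead end after {step} clicks")
        break
    random.choice(links).click()
    errors = driver.find_elements(By.CSS_SELECTOR, ".error")
    if errors:                                   # SugarCube marks macro errors with class "error"
        print(f"Error after {step + 1} clicks: {errors[0].text}")
        break

driver.quit()
```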

Has anyone tried this? And had some success?

2 Likes

I once wrote a script for testing SugarCube with Selenium:

I’m not sure if it still entirely works…

1 Like

Generally a test should be a repeatable sequence of actions that generates a known outcome; that way you can run the test more than once and achieve a consistent result if everything is working correctly.

If you are just randomly selecting links, then there is no guarantee that:

  1. any specific bug will be found;
  2. a bug introduced into existing code will be found;
  3. all bugs will be found.

There are use-cases where some randomness added to a repeatable sequence of actions can help find specific types of bugs, but those use-cases have limited usefulness when testing.

I take your point, but I am not talking about unit testing. I do not think there is enough interaction between different parts of my games to make that worthwhile (and I do do unit testing in QuestJS; it is pretty much essential there).

Also your points 1 and 3 apply to unit testing too. If I know of a section that might have errors, I can check that bit thoroughly. It is the rest of the game that concerns me.

Perhaps I should say that my games are fairly straightforward, with little more complicated than setting a variable in one passage and then checking the value in a later passage. What I want is something that has a high chance of hitting every chunk of code, and so will pick up every missing > or $ or misspelt variable name. The output can then go into a word processor to pick out extra or missing spaces around conditional statements.
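
Or, instead of eyeballing it in a word processor, a simple regular-expression pass over the saved transcripts might catch most of the spacing problems (a sketch; it assumes the transcripts end up as plain-text files in a hypothetical transcripts folder):

```python
# Sketch: scan saved transcript files for spacing glitches that a
# conditional might have introduced. The "transcripts" folder and
# the two heuristics are assumptions, not a complete check.
import pathlib
import re

for path in pathlib.Path("transcripts").glob("*.txt"):
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
        if re.search(r"  +", line):          # doubled spaces left by an <<if>>
            print(f"{path.name}:{lineno}: extra space: {line.strip()!r}")
        if re.search(r"\w[,.!?]\w", line):   # missing space after punctuation
            print(f"{path.name}:{lineno}: missing space: {line.strip()!r}")
```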

My thought was that setting it up to do it randomly would be a “quick” and general solution. A big Twine game might have 30 passages in one play-through, and hundreds of possible routes through it.

If I can set Selenium to do a hundred runs through it with a single button press, and then examine the results, that will be much easier than setting up just a dozen specific paths. And if I am setting up a dozen specific paths, I do not need Selenium, as I am going through it anyway (I just need a way to jump to the problem bit if necessary).

Or is there another way to test the code in a game?

1 Like

My short Twine game Esther’s uses browser automation (Puppeteer, a browser-automation library along the same lines as Selenium) to exhaustively test the game’s possibility space.

I wrote about this a bit on my blog. My game uses Snowman, not SugarCube, but I think you could probably take SyntheticPlayer.js as a starting point, adapt it for compatibility with your story, and write full explorations, partial explorations, or custom outside-in test cases.
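
The core idea is nothing more than a depth-first walk that restarts the game and replays each prefix of choices. Very roughly, and in Python/Selenium terms rather than my actual Puppeteer code (the URL and the ".passage a" selector are placeholders, and it assumes the story is a finite tree with no loops):

```python
# Rough sketch of exhaustive exploration by replaying choice prefixes.
# Not SyntheticPlayer.js itself: a Python/Selenium analogue with a
# hypothetical URL/selector, assuming the story is a finite tree (no loops).
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "http://localhost:8080/index.html"   # hypothetical

def play(driver, path):
    """Restart the game, click the links named by the index sequence `path`,
    and return how many links are available at the end of it."""
    driver.get(URL)
    for index in path:
        driver.find_elements(By.CSS_SELECTOR, ".passage a")[index].click()
    return len(driver.find_elements(By.CSS_SELECTOR, ".passage a"))

def explore(driver, path=()):
    """Depth-first walk over every choice sequence; report each ending reached."""
    count = play(driver, path)
    if count == 0:
        print("Ending reached via", path)
        return
    for index in range(count):
        explore(driver, path + (index,))

driver = webdriver.Chrome()
explore(driver)
driver.quit()
```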

3 Likes

I wasn’t talking specifically about unit testing, as the same practices are used for things like User Interface testing (e.g. Selenium), Integration testing, Performance testing, etc.

Hold up. That is a test, but it’s not the only way we think about testing in game development.

“Repeatedly click links at random” is a valid thing to do with a choice-based(*) game. I’m not sure if that feature exists in the Twine world, but it’s completely standard if you’re developing a large commercial game in Ink or ChoiceScript. You do a thousand (or five thousand! ten thousand!) random run-throughs of your game, and look at the results.

It’s not a functionality test; it’s a statistical test. But the statistics are very useful! This is how you learn whether your major story paths are well-balanced. If one ending is reached in only 0.05% of runs, you need to know that. (Maybe you want that to be the hard ending, but you still need to know!) If a particular branch is reached in 0% of runs, that may be a logic bug. If chapter outcome X gets three times the hits of chapter Y, maybe you want to split it into X1, X2, and X3.

(* The only reason it’s not standard in parser-based IF is that it’s too hard to generate random meaningful parser inputs. If you just randomly combine nouns and verbs, most of your responses are parser errors and you don’t make meaningful progress.)
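
The bookkeeping is trivial once you can do a single random run; something like this shape (a sketch only: random_playthrough is a stand-in for whatever run harness you use, and it is assumed to return the name of the ending passage it stopped at):

```python
# Sketch of the statistical side: tally which ending each random run reaches.
from collections import Counter

def ending_report(random_playthrough, runs=1000):
    """`random_playthrough` is a hypothetical callable that plays one random
    run and returns the name of the final passage it reached."""
    tally = Counter(random_playthrough() for _ in range(runs))
    for ending, hits in tally.most_common():
        print(f"{ending}: {hits / runs:.1%} of runs")
    return tally
```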

4 Likes

I’ve used a plugin called iMacros for other things. I think the scripting is a little more straightforward but I am not sure if it is useful for what you want.

I did start that statement with “Generally”, which indicates that it’s not the only methodology that can be used.

Can you explain how making random choices that aren’t based on the context (1) in which those choices appear is a good indication of how the end-user will travel through the story’s major story-lines?

(1) When said context is generally one of the means that the end-user uses to make their choices.

For an example of statistical balancing with the aid of randomized testing, see Emily Short’s post on balancing Bee:

I wrote scripts for testing dendry: bee/automated_player.py at master · aucchen/bee · GitHub and bee/analytics-2.ipynb at master · aucchen/bee · GitHub for the analysis.

More related to this topic, my Python script for random testing of Twine/SugarCube games does the same thing (run through the game n times, randomly clicking on links until reaching a dead end, and save the transcript and the final stats): automated random tester for twine-sugarcube · GitHub

All this uses Selenium; I have just updated the script and tested it on a very simple game, where it seems to work.
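
The “final stats” part is just a matter of asking SugarCube for its variables at the end of a run (a simplified sketch, not the gist verbatim; it assumes SugarCube 2, which normally exposes a global SugarCube object whose State.variables holds the story’s $variables):

```python
# Sketch: pull the final story variables out of a SugarCube 2 game via the
# global SugarCube object, then save them as JSON. `driver` is a live
# Selenium WebDriver sitting on the game page at the end of a run.
import json

def final_stats(driver):
    """Return the game's $variables as a plain Python dict."""
    return driver.execute_script("return SugarCube.State.variables;")

def save_stats(driver, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(final_stats(driver), f, indent=2, default=str)
```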

4 Likes

You can’t predict what a user is going to do.

So the discussed project is using randomness internally to influence / determine which choices (storylet links) to display to the end-user for their selection, and the author wanted to balance the frequency of those random(ish)ly determined choices.

One thing that is unclear is whether the display order of the “current” list of choices is consistent, or whether that is also randomly influenced / determined. That can also be important, due to the “first / last item” selection bias that can occur in surveys / questionnaires.

e.g. if the current randomly determined list of choices consists of “apple, banana, cherry”, are those choices always displayed in that order each time that specific combination is produced? Or does the system randomise the order, so that it might display them as “banana, apple, cherry” or “cherry, apple, banana”, etc.?

In the case of the above type of project I can see how randomly selecting links could help determine the frequency of each choice’s availability, but I would argue that it is the number of test runs performed that is the critical part in determining that frequency.

However, the original question was about testing for (coding?) errors within a SugarCube-based project…

…and there was no mention of needing to determine the availability frequency of each potential choice.

So I’m still unsure how randomly selecting links can guarantee high enough code coverage to find the majority of “broken code” (1) type errors currently in a project, unless the number of test runs performed is very high.

And I definitely don’t see how such can be used to find any “logic” errors, because such testing generally requires knowledge of the expected outcome.

But the project isn’t mine, and I’m not doing the testing, so what I think doesn’t really matter. 🙂

(1) where “broken code” represents error types like invalid syntax; using misnamed variables / macros / functions; etc…

By all means perform a very high number of test runs!

And I definitely don’t see how such can be used to find any “logic” errors

I’m making the assumption that every node should be reachable. If thousands of test runs fail to reach a given node, you want to look at its preconditions to see if you screwed them up.
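
Mechanically, that check is cheap (a sketch; it assumes you collected the set of visited passage names during the runs, and it reads the declared passages out of the compiled story’s tw-passagedata elements):

```python
# Sketch of a reachability report: which declared passages never turned up
# in any run? `visited` is assumed to be a set of passage names gathered
# during the runs. Special passages (StoryInit, etc.) will show up here
# too and can simply be ignored.
from selenium.webdriver.common.by import By

def unreached_passages(driver, visited):
    declared = {el.get_attribute("name")
                for el in driver.find_elements(By.CSS_SELECTOR, "tw-passagedata")}
    return sorted(declared - visited)
```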