T3 module for running statistical tests

Dunno how much general interest there is in this kind of thing, but I just put up another git repo, this time for statTest, a T3 module for running statistical tests.

The module provides several base classes, each of which implements a different statistical test: chi square goodness of fit, a simple fenceposting/off-by-one error detector, a normalized root mean square error calculator, and an implementation of the Wald-Wolfowitz runs test.

There are demos for each in the repo, but the basic usage is something like:

        class DemoTest: StatTestRMS
                outcomes = static [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
                pickOutcome() { return(rand(10) + 1); }
        ;

…to define a test, where StatTestRMS is one of the base classes from the module, outcomes is a List containing the allowed return values from whatever you’re planning on testing, and pickOutcome() is a method that should pick a single result value via whatever method you’re testing (in this example it’s just the builtin rand() function, but it could be whatever).

Having defined the test, you use it via something like:

        local t = new DemoTest();
        t.runTest();                    // actually runs the test
        t.report();                     // outputs the test-specific report(s)

This is intended to be a simple, modular way to test/verify procgen stuff, random selections, and so on.

This isn’t intended to be a rigorous test of the underlying PRNG(s) or anything like that. I wrote it to catch my own implementation errors. E.g., in writing an in-game poker implementation, making sure I’m not doing anything silly like picking a card via return(cards[rand(52)]) (which looks plausible but contains an off-by-one error because of how rand() works) and so on.
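
To make that concrete, here’s roughly what I mean (cards and pickCard() here are hypothetical, not anything from the module or my actual poker code):

        // Looks plausible, but rand(52) returns 0..51, and T3 lists
        // are indexed from 1:  cards[0] is an error and cards[52] can
        // never come up.
        //pickCard() { return(cards[rand(52)]); }

        // Either of these works:
        pickCard() { return(cards[rand(52) + 1]); }     // shift into 1..52
        //pickCard() { return(rand(cards)); }           // rand() picks a list element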

Dunno how many other people out there are worried about this kind of thing. But if you are…well here it is.

5 Likes

Just adding a completely worked example here, including an implementation quirk/feature I didn’t mention above.

Let’s imagine we’re implementing an engine to play rock paper scissors in-game. And, for some reason, we’re concerned with the fairness with which NPCs will be throwing the various moves. Our basic NPC AI for playing rock paper scissors might look something like:

// A naive engine for playing rock paper scissors.
rockPaperScissors: object
        randomNumber() { return(rand(3) + 1); }

        // We can't call this method "throw", because that's a reserved word.
        shoot() {
                local r;

                switch(randomNumber()) {
                        case 1:
                                r = 'rock';
                                break;
                        case 2:
                                r = 'paper';
                                break;
                        case 3:
                                r = 'scissors';
                                break;
                }

                return(r);
        }
;

This is a little silly and ornate for what it does, but it’s vaguely in the form of something we might use for more serious stuff. Specifically, we’re designing the decision logic as a sort of black box that we access by a method that’s meant to be called externally: rockPaperScissors.shoot(). That is, the behavior we’re concerned about being “fair” is at base dependent on something we’re probably not worried about (the native T3 rand() function), but we want to make sure whatever we’re doing with it isn’t introducing any hidden errors.

Okay, so much for the RPS “engine”. To test this via the stuff in the statTest module, we just wrap a call to the decision method in a test class:

// Test the rock paper scissors "logic" defined above.
class RPSTest: StatTestFencepost
        // These are the outcomes
        outcomes = static [ 'rock', 'paper', 'scissors' ]

        // Invoke the rock paper scissors selection logic once.
        pickOutcome() { return(rockPaperScissors.shoot()); }
;

The thing to note here is that the options we feed the statistical methods don’t have to be numeric. Under the hood the individual tests are doing something like idx = outcomes.indexOf(pickOutcome()) and then using the index values (instead of using the return values from pickOutcome() directly). This means there’s no problem using strings or objects or enums. The only caveat is that the elements of outcomes need to be unique (that is, each value can only appear in the array once).
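
In other words, each trial boils down to something like this (a simplified sketch of the bookkeeping, not the module’s actual code; counts here is a hypothetical per-outcome tally Vector):

        local idx = outcomes.indexOf(pickOutcome());    // e.g. 'paper' -> 2
        if(idx != nil)
                counts[idx] += 1;       // tally by index, not by return value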

All that’s left is to actually run the test, and that works exactly like the example I posted upthread:

                local t;

                t = new RPSTest();
                t.runTest();
                t.report();

Here’s the complete source as a compilable ā€œgameā€:

#charset "us-ascii"
#include <adv3.h>
#include <en_us.h>

// A naive engine for playing rock paper scissors.
rockPaperScissors: object
        // Our method for selecting a random number.  This version
        // works;  to illustrate how the fenceposting logic can catch
        // off-by-one errors, change this to be just "return(rand(3))".
        randomNumber() { return(rand(3) + 1); }

        // This is a plausible-looking off-by-one error.
        //randomNumber() { return(rand(3)); }

        // We can't call this method "throw", because that's a reserved word.
        shoot() {
                local r;

                switch(randomNumber()) {
                        case 1:
                                r = 'rock';
                                break;
                        case 2:
                                r = 'paper';
                                break;
                        case 3:
                                r = 'scissors';
                                break;
                }

                return(r);
        }
;

// Test the rock paper scissors "logic" defined above.
class RPSTest: StatTestFencepost
        // These are the outcomes
        outcomes = static [ 'rock', 'paper', 'scissors' ]

        // Invoke the rock paper scissors selection logic once.
        pickOutcome() { return(rockPaperScissors.shoot()); }
;

versionInfo: GameID;
gameMain: GameMainDef
        newGame() {
                local t;

                t = new RPSTest();
                t.runTest();
                t.report();
        }
;

Compiling this with the -D __DEBUG_STAT_TEST flag and then running it in your favorite interpreter will output something like:

StatTestFencepost: passed
[Hit any key to exit.]

If you compile without the -D __DEBUG_STAT_TEST flag, it’ll just exit without output, indicating that it detected no errors (this is intended to make scripted regression testing a little easier).

In the complete example, there’s an alternate randomNumber() method that contains an off-by-one error:

        randomNumber() { return(rand(3)); }

This will return values from 0 to 2, not from 1 to 3. That means it’ll generate a bunch of values that are out of range low, and it will never pick the third defined option (scissors). If you comment out the “good” method, uncomment the “bad” one, re-compile, and run the new story file, it will report the problems just mentioned:

StatTestFencepost: ERROR: 3346 outcomes under value
StatTestFencepost: ERROR: bucket 3 (scissors) empty
StatTestFencepost: FAILED
[Hit any key to exit.]

This is probably the test in the module that will be most useful to most developers, because this kind of bug can be pretty sneaky.

1 Like

Fortunately those types of bugs can be minimized by using rand() directly on the options, like rand('paper', 'rock', 'scissors') or rand(cards).
It was weird for me, learning TADS first, to discover that “real” languages index from 0… are there any mainstream ones that index like TADS does?

1 Like

MATLAB is the only one I can think of that indexes from one instead of zero.

1 Like

Lots of languages index from 1: Comparison of programming languages (array) - Wikipedia
But it does seem like they’re often more domain-specific than general purpose. Which is a bit odd, because, for example, mathematicians/scientists would be used to counting from 0.

3 Likes

Yeah, for the example as written that’s absolutely true.

The reason I formatted the problem more elaborately is to better reflect the kind of cases where I’m actually worried about this sort of thing. For example, I’ve got NPCs who, in the process of playing poker, have to repeatedly decide if they want to fold, call, or raise. And in that case treating all options as equally probable and just picking one doesn’t really work. This could be handled 100% deterministically (that is, they always evaluate some algorithm and always pick the same result given the same situation)…but then the AI is trivially exploitable. So they need some variability, just like a human player.

The underlying logic is pretty complicated, but the core mechanic relies on selecting FCR (fold, call, raise) probability tuples from a table of different values that apply to different circumstances (like whether the NPC thinks they have a strong hand vs a weak hand, for example).
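
Just to show the shape of it, a stripped-down version of that kind of table might look something like this (the bucket names and weights are made up; the real thing has a lot more entries):

// Purely illustrative:  FCR (fold, call, raise) weight triples for a
// couple of rough hand-strength buckets.
fcrWeights: object
        weakHand = static [ 600, 300, 100 ]     // mostly fold
        strongHand = static [ 50, 350, 600 ]    // mostly raise
;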

Without getting into the poker logic, re-framing this as a rock paper scissors problem, the tuple would look something like:

class RockPaperScissorsTuple: object
        rock = nil
        paper = nil
        scissors = nil

        // Threshold values
        _rp = nil
        _total = nil

        construct(r?, p?, s?) {
                rock = (r ? r : 1000);
                paper = (p ? p : 1000);
                scissors = (s ? s : 1000);

                _rp = rock + paper;
                _total = _rp + scissors;
        }

        decision() {
                local i;

                i = rand(_total) + 1;

                if(i <= rock)
                        return('rock');

                if(i <= _rp)
                        return('paper');

                return('scissors');
        }
;

In this case the tuple starts out with each option having the same weight (arbitrarily selected as 1000). The decision logic picks an integer between 1 and the sum of the weights, and chooses an option based on where the random value falls.
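
And since decision() has the same shape as shoot() from the earlier example, wiring it into the statTest module follows the same pattern (a sketch only, using the default weights):

// Sketch:  testing the tuple's decision() method with the same pattern
// used in the RPSTest example above.
class WeightedRPSTest: StatTestFencepost
        outcomes = static [ 'rock', 'paper', 'scissors' ]

        // A tuple instance, created at preinit with the default weights.
        tuple = static new RockPaperScissorsTuple()

        pickOutcome() { return(tuple.decision()); }
;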

At first glance this is just a more complicated and error-prone version of rand(options), but this version lets us update the weights to change the AI behavior. If, for example, after every round we subtract one (or 10, or 100, or whatever we decide the learning rate should be) from the weight of the losing move and add the same amount to the weight of the move that would’ve won, then, assuming some simple bounds checking, over time the NPC can “figure out” that a player always picks “rock”, say, and start consistently beating them.
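
A sketch of what that kind of update might look like on the tuple above (the method name, argument convention, learning rate, and bounds are all made up for illustration):

modify RockPaperScissorsTuple
        // Hypothetical learning step:  nudge the weights after a round.
        // losingMove and winningMove are property pointers (&rock,
        // &paper, or &scissors) for the move that lost and the move
        // that would have won;  rate is the learning rate.
        adjustWeights(losingMove, winningMove, rate?) {
                rate = (rate ? rate : 10);

                // Simple bounds check so no weight ever drops to zero.
                if(self.(losingMove) > rate)
                        self.(losingMove) = self.(losingMove) - rate;
                self.(winningMove) = self.(winningMove) + rate;

                // Recompute the cached thresholds used by decision().
                _rp = rock + paper;
                _total = _rp + scissors;
        }
;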

1 Like

In the ancient days of M$-derived BASICs (where the RNG output was a float between 0 inclusive and 1 exclusive), there was the well-known +1 fenceguard, e.g. INT(6*RND()+1)…

Best regards from Italy,
dott. Piergiorgio.

1 Like

Yeah, having the system rand()-equivalent return a float in [0, 1) is what I’d expect by default. In JavaScript, for example, if you want a random integer between min and max (inclusive) you want something like Math.floor(Math.random() * (max - min + 1)) + min.

I’d gladly be using a float value in [ 0, 1 ) if it weren’t for the fact that floating point in T3 is soooooo sloooooow. In the decision-making tuple logic I used as an example above, I’d much rather normalize the probabilities of the outcomes by summing the weights and dividing each weight by the sum, so that each outcome gets a probability in [ 0, 1 ] and they all sum to 1, and then make a selection by picking a float in [ 0, 1 ). And update the weights by multiplying/dividing by 1.05 or whatever.
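
For reference, a BigNumber version of the tuple’s decision() along those lines would look something like this (a sketch only, assuming bignum.h is included; this is exactly the kind of thing I’m avoiding):

        decision() {
                local sum, pRock, pRP, x;

                // Normalize the integer weights into probabilities.
                sum = new BigNumber(rock + paper + scissors);
                pRock = new BigNumber(rock) / sum;
                pRP = new BigNumber(rock + paper) / sum;

                // A float in [ 0, 1 ), more or less.
                x = new BigNumber(rand(1000000)) / 1000000;

                if(x < pRock) return('rock');
                if(x < pRP) return('paper');
                return('scissors');
        }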

But half a dozen floating point operations whenever you want to do something is very expensive in T3 when you have a bunch of things that want to do this every turn.