To be honest, coming off a moderately complicated CYOA-style project, I think that for sufficiently long or complex twines this falls apart. If you're doing a lot of state-tracking (for instance, in what maga calls a "Branch and Bottleneck" story), there's plenty of room to tangle yourself up in your own world model or introduce bugs. For Spring Thing, I'm releasing an 11,000-word Undum project that uses that kind of structure, and towards the end I would have paid money for some kind of unit-test system so I could verify that my assumptions about the story held (and especially that it wasn't possible to get stuck, trigger a runtime error, etc.). I'd be very curious about how the Inkle and Choice of Games authors handle this; was Creatures Such as We just painstakingly tested by hand? Does ChoiceScript have some kind of testing harness they use? I can't imagine 80 Days was developed without unit testing, for instance - that thing is a quarter of a million words long.
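For the curious, here's roughly what I mean, as a minimal sketch (in TypeScript, with made-up passage and flag names - none of this is from my actual project): model the story as a graph of passages with guarded choices, then brute-force every reachable combination of flags and fail if any reachable state is a dead end that isn't a declared ending.

```typescript
// Minimal sketch of a "can the reader always reach an ending?" check for a
// branch-and-bottleneck structure. Everything here (passage names, the flag
// set, the shape of a Choice) is hypothetical, not lifted from a real project.

type Flags = Record<string, boolean>;

interface Choice {
  target: string;                         // id of the destination passage
  guard?: (flags: Flags) => boolean;      // choice is only offered if this passes
  effect?: (flags: Flags) => Flags;       // state change applied when taken
}

interface Passage {
  id: string;
  ending?: boolean;                       // true if this passage ends the story
  choices: Choice[];
}

// Toy story graph: two branches that bottleneck at "reunion".
const passages: Record<string, Passage> = {
  start: {
    id: "start",
    choices: [
      { target: "forest", effect: f => ({ ...f, tookForest: true }) },
      { target: "river" },
    ],
  },
  forest: { id: "forest", choices: [{ target: "reunion" }] },
  river: { id: "river", choices: [{ target: "reunion" }] },
  reunion: {
    id: "reunion",
    choices: [
      { target: "goodEnd", guard: f => f.tookForest === true },
      { target: "okayEnd" },
    ],
  },
  goodEnd: { id: "goodEnd", ending: true, choices: [] },
  okayEnd: { id: "okayEnd", ending: true, choices: [] },
};

// Breadth-first search over (passage, flags) states. Fails loudly if any
// reachable state offers no choices and isn't a declared ending.
function checkNoDeadEnds(startId: string): void {
  const seen = new Set<string>();
  const queue: Array<[string, Flags]> = [[startId, {}]];

  while (queue.length > 0) {
    const [id, flags] = queue.shift()!;
    const key = id + "|" + JSON.stringify(flags);
    if (seen.has(key)) continue;
    seen.add(key);

    const passage = passages[id];
    if (!passage) throw new Error(`Choice points at missing passage "${id}"`);
    if (passage.ending) continue;

    const open = passage.choices.filter(c => !c.guard || c.guard(flags));
    if (open.length === 0) {
      throw new Error(`Dead end: "${id}" with flags ${JSON.stringify(flags)}`);
    }
    for (const c of open) {
      queue.push([c.target, c.effect ? c.effect(flags) : flags]);
    }
  }
  console.log(`OK: ${seen.size} reachable states, no dead ends.`);
}

checkNoDeadEnds("start");
```

Even something this crude would catch the class of bug where a bottleneck's guard assumes a flag that one particular path never actually sets.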
Even then, though - in Undum, I can get away with manual testing and a good proofread or two. In a parser game, you can't get away with that; the parser is a finicky, delicate machine, and worse, people's attempts at interacting with it are unpredictable. You need extensive user testing to figure out what people expect so you can support it. Ultimately, IF parsers are not true "natural language parsers" (a Hard Problem that corporations have invested literally billions of dollars into solving, with mixed results at best) but rather approximations that understand a simplistic dialect of English. Usability for those parsers relies heavily on knowing possible player inputs ahead of time and ensuring that those inputs work, which in turn relies heavily on playtesting.
I can definitely think of a lot of improvements that could be made to I7 (an automated test runner more advanced than the Skein and the test command would be great, for instance). But I can't think of a way to make it hugely easier to write a good parser.
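By way of illustration, here's the sort of runner I'm imagining, again just a sketch: pipe a scripted walkthrough into a command-line interpreter (something like dfrotz, which will read commands from standard input), save the transcript the first time through for a human to approve, and diff future runs against that approved copy. The interpreter name, story file, and paths below are all placeholders.

```typescript
// Sketch of a transcript-regression runner: feed a scripted list of commands
// to a command-line interpreter over stdin, capture the transcript, and diff
// it against a previously approved "golden" copy. Interpreter, story file,
// and paths are placeholders.

import { execFile } from "node:child_process";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";

const INTERPRETER = "dfrotz";        // any interpreter that reads commands from stdin
const STORY_FILE = "mygame.z8";      // placeholder compiled story file

function runScript(commands: string[]): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = execFile(
      INTERPRETER,
      [STORY_FILE],
      { maxBuffer: 10 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(stdout)),
    );
    // Pipe the whole command script in, ending with quit (plus a
    // confirmation) so the interpreter exits instead of waiting for input.
    child.stdin?.write(commands.join("\n") + "\nquit\ny\n");
    child.stdin?.end();
  });
}

async function checkTranscript(name: string, commands: string[]): Promise<void> {
  const transcript = await runScript(commands);
  const goldenPath = `golden/${name}.txt`;

  if (!existsSync(goldenPath)) {
    // First run: save the transcript for a human to review and approve.
    mkdirSync("golden", { recursive: true });
    writeFileSync(goldenPath, transcript);
    console.log(`[new] ${name}: saved golden transcript, please review`);
    return;
  }
  const golden = readFileSync(goldenPath, "utf8");
  console.log(transcript === golden ? `[ok] ${name}` : `[FAIL] ${name}: transcript changed`);
}

// Example: a walkthrough slice that should always keep working.
checkTranscript("lantern-puzzle", ["take lantern", "light lantern", "go north"]);
```

It's basically the Skein's replay idea, but scriptable from the command line so it can run on every build rather than only when you remember to open the IDE.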
Ultimately, players have expectations built up over years of play, and meeting those expectations is the baseline. People expect to be able to refer to objects mentioned in room descriptions because that holds true often enough that when it doesn't, it's grating. Yeah, QA for a parser game is really, surprisingly hard, but people keep consistently succeeding at it, so that's where the bar is set.
With regard to reviews, I think the best you can do is try to recognise when a scoping error has occurred and be kind about it. Maybe more comps should include a "back garden" for playable demos of unfinished work; there's no shame in pushing back a release date so you can properly finish something and release it when it's done, even if that means missing a competition.