I feel like what this basically boils down to is: there is a chance, however small, of this rule having unforeseen complications, but there is a presently existing reality of low-quality/broken AI entries hampering enthusiasm and participation for these already-niche competitions. “We shouldn’t try to fix things because new things might break” isn’t a tenable path.
I’ll likewise just express my happiness with the rule. Sure, there will be potential edge cases and drama, but we’ve all been dealing with that already (I was on the qualification committee last year, and we had to vet whether games were complying with the AI disclosure rule, which is pretty much exactly the same process that I’d expect would have to happen with the ban). So much has already been said that I think I just have two brief thoughts to add:
- We know that there were people who hopped onto the Comp site last year, excited to play, and got turned off by seeing so much obviously LLM-generated cover art. If it does nothing else, this rule change should pretty clearly fix that problem.
- I am as down on LLMs as anybody, I think, but it’s also worth acknowledging that people who use them are not by and large bent on slipping their AI-created stuff past the rules as part of a bad-faith trolling campaign; we have had many many examples of AI games and their authors are pretty much all either trying to show off how an LLM can make dialogue more natural or parser-based games more open-ended, or just share a game they think is fun but where they had an LLM write the boring bits. Yeah, the ParserComp guy sucked, but he was engaged in voting manipulation; his LLM games obviously used an LLM. So I have a hard time imagining that there are many people who, upon seeing the new rule, will decide to try to subvert it, and therefore I think it’s appropriate to give most of the potential edge cases the benefit of the doubt.
I guess taking these two together, I see the intent and likely effect of the rule as eliminating obvious LLM use from the Comp, which is great; I don’t see any intention of, or much upside to, a Stasi-style guilty-until-proven-innocent investigation of non-obvious LLM use, so I’m not too worried about that.
So several times my work has been called AI. It’s a new form of hostile dismissal that variegates the usual profusion of “aesthetically you are a mistake” or “this is extremely embarrassing for you” or “why would you choose to be deliberately annoying” or “may there truly lurk somewhere in the deepest of waters a leviathan sufficiently goliath to devour dissolve the depths of your disgrace” to which most authors too stubborn for sense will become inured. While someone erasing your time and energy by insisting a machine did this is frustrating, it’s no more frustrating than the fact that many people don’t like what you made and that many of them are, in some wandering objectivity you will never reach, right. We can tear our clothes weeping blood in righteous anger against God that the hypothetical Sufferer has someone misunderstand them online, but the reality is that whatever the rules are the most likely outcome for your Submission is that you will be judged and this will be uncomfortable. So what are we to do? I assume, first of all, be incredibly grateful that a surprisingly large team of volunteers have made the mistake of taking time away from their family to build this cool little thing to facilitate the possibility of your creative fulfillment which the rest of the world is seemingly so desperate to destroy. People care that you exist is the impetus of IFComp. That’s worth protecting, even against the possibility that someone somewhere will be unhappy for all the reasons inevitability supplies. To my mind, a rule that cherishes you to the exception of replication excess is true to this spirit.
I just want to be clear that I, too, support an AI-ban rule. I just strongly recommend clarity on how it will be enforced in a sustainable way.
My favorite approach:
- Require authors to check a box promising that they didn’t use any AI in their cover art, and a separate box promising that they didn’t use any AI in their game.
- Remove no games for breaking the AI rule unless the author admits to using AI (or, of course, if the game uses a live LLM service, which is easy to detect/verify).
- Allow judges to flag games/art that they suspect to be AI-generated, and treat each flag as a vote of 1 out of 10 for that game. (We already have a process for detecting votes in bad faith.) When a judge flags a game as AI, they would not be expected to “make a good-faith effort” to continue playing, as is currently required by Rule 7.
- If multiple judges flag a game as AI, the IFComp committee would reach out to the author to ask them to make a statement. (The author might admit to using AI at that point.)
- If the author denies using AI, but the game “feels like slop” to the committee, the committee would recommend, but not require, that the author withdraw from the competition, pointing out that the game is certain to rank very poorly.
This way, we can all enforce the AI rule together, banning AI without disqualifying any author who insists that their work was hand-written, but still ensuring that liars cannot win. If one or two judges misidentify a game as AI slop, the game would still get to compete.
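To make the arithmetic of the proposal above concrete, here is a rough sketch of how the tally could work. It is only an illustration, not how the IFComp site actually computes anything: the function name `tally`, the constant names, and the choice of "two or more flags" as the meaning of "multiple judges" are all assumptions made up for this example.

```python
from statistics import mean

AI_FLAG_SCORE = 1       # from the proposal: a flag counts as a vote of 1 out of 10
OUTREACH_THRESHOLD = 2  # assumption: "multiple judges" means at least two flags


def tally(ratings, flags):
    """Combine ordinary 1-10 ratings with AI flags for one game (hypothetical helper).

    ratings: scores from judges who rated the game normally
    flags:   number of judges who flagged the game as AI instead of rating it
    """
    # Each flag is added to the pool as if it were a rating of 1.
    votes = list(ratings) + [AI_FLAG_SCORE] * flags
    return {
        "average": mean(votes) if votes else None,
        # The committee only contacts the author once enough judges have flagged.
        "contact_author": flags >= OUTREACH_THRESHOLD,
    }


# One mistaken flag among four ordinary ratings lowers the average
# but does not trigger any outreach.
result = tally(ratings=[7, 8, 6, 9], flags=1)
print(result)
```

With these made-up numbers, a single flag pulls the average from 7.5 down to 6.2, which is the sense in which one or two mistaken flags hurt a game without knocking it out.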
Or, you know, just wing it. The IFComp committee could simply review all games that are flagged as AI under a microscope, then make an official ruling that the author has/hasn’t lied, entirely on vibes, and see what happens. How hard could that be, amirite? /s
I like all of these except step 5. I think if the author refuses to admit that it’s AI even tho it feels like “slop”, then step 3 already damns their game to a low ranking and the committee doesn’t have to recommend anything (or ban the game). Then no rulings on lying have to be made at all (officially). The “no harassment” rule would discourage (and allow enforcement against) accusations of lying from the public.
I’d also recommend that for step 3 you can only flag a game if you’ve played a little of it, at least for the “no AI writing” flag (they should be two separate flags).
I don’t dislike this, honestly, but there are some caveats. This would certainly ease the burden on the committee and reduce the concerns over potentially arbitrary decisions. Having individual judges make individual decisions would mean no potential for ‘comparing notes’, but… that might be good or bad depending on one’s perspective. I don’t think this will alleviate the concerns over false accusations, although one judge making a snap decision wouldn’t be totally fatal to a game’s ranking.
I would suggest asking judges to play the game in good faith and give a rating regardless, and just check the boxes. If a certain number or proportion of judges flag it, then the ratings are zeroed.
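As a rough illustration of that variant (everyone rates as usual, and the flag only matters once enough judges agree), something like the sketch below would do it. The 50% cutoff, the function name `final_score`, and the ballot layout are all hypothetical; the suggestion itself only says "a certain number or proportion".

```python
def final_score(ballots, min_flag_share=0.5):
    """Average the ratings unless enough judges flagged the game as AI.

    ballots:        list of (rating, flagged) pairs, one per judge
    min_flag_share: assumed cutoff; the share of flagging judges at which
                    the game's ratings are zeroed
    """
    if not ballots:
        return None
    flagged = sum(1 for _, was_flagged in ballots if was_flagged)
    if flagged / len(ballots) >= min_flag_share:
        return 0.0  # enough judges agree, so the ratings are zeroed
    # Below the cutoff, flags are ignored and every rating counts normally.
    return sum(rating for rating, _ in ballots) / len(ballots)


# One judge out of three flags the game: the ratings still stand.
lone_flag = final_score([(7, False), (3, True), (8, False)])
# Two judges out of three flag it: the score is zeroed.
majority_flag = final_score([(2, True), (1, True), (6, False)])
print(lone_flag, majority_flag)
```

The point of the cutoff is that a single judge seeing something nobody else sees has no effect on the result at all.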
Also, how do we pass this suggestion on to the committee instead of hashing it out in a thread they don’t all read? Email?
I only mentioned this because I’d prefer AI games not show up on the list at all, and so at least recommending that authors withdraw might clean up the results a bit. Ideally, the lowest rated game should be a game like Uninteractive Fiction 2, not like Space Mission: 2045.
In previous threads, people objected to playing and rating “in good faith” when they’re pretty sure the game is AI. If you think the art is AI, it’s hard to swallow giving the game itself even 10 minutes of good-faith time.
Fair play, tho I just think since the committee’s already overworked and this is a high pressure thing, step #5 might be a lil too far for them to adjudicate ykno
Also I think playing the game a little (even if not necessarily “in good faith”, just, enough to see the writing with your own eyes) would be the appropriate price to pay for the opportunity to flag a game. Otherwise you just get flooding from folks who just heard it’s probably AI.
I sent an email just now.
For comparison, my understanding of the current status quo for rules is:
- Authors promise that their submission doesn’t break any of the rules at submission time. (And checkboxes for “this was AI generated” already exist.)
- The organizers check for obvious rule violations before the comp goes live. If they have questions about a submission, they can contact the author and ask more specific questions.
- Anyone who notices a rule violation is encouraged to contact the organizers. (Whether this is catching something the organizers missed in step 2, or rules that were only violated after the original release.)
- The organizers can disqualify a game at any time, whether based on information learned in 3 or in any other manner. (And they can reach out to authors for any reason, including that the game reads like AI and may get downvoted.)
This is - pretty close to what Dan suggests? Differences I see:
- Violations are reported via email instead of via checkbox. Personally the status quo seems fine to me. (As a judge, I don’t love the implicit expectation that I should be acting as an AI cop for every entry, though if I happen to notice something I’m happy to send an email.)
- You change the voting expectation for games in the case of slop vibes: judges are not expected to play sloppy games, and they are effectively required to give such games a 1. (Personally on the fence about this, but I think it mainly comes up for games that are both good and heavily reliant on AI, which hasn’t been a problem we’ve faced yet.)
- In the new proposal, organizers do less up-front - they don’t pre-screen entries, and their only response to a bunch of flags is to ask for a statement / withdrawal of the entry. (Less work for organizers, but also fewer tools or opportunities for organizers to intervene, which they can now do at many points in the process.)
I think this is a little unfair! As Zarf mentions above, there are good reasons for the committee to be a bit vague about enforcement, but I don’t think I’ve seen any reason to think they’re planning some kind of vibes-based winging… especially since they have a good track record of pretty fair and low-drama enforcement in the past, and since the proposed ideal case is not super far from what they’re doing already.
There’s no harm in assuming the best of people. While I haven’t always agreed with the committee’s process, it is made up of intelligent people of goodwill. They will do their best to handle this rule fairly and consistently. I don’t think there’s any basis for imagining future tyrannies exercised by the committee. In fact, I find such speculation uncharitable.
We cannot live in fear of liars. Liars are everywhere, and they lie. They mess things up. The existence of liars cannot prevent organizations from making new policies that respond to emerging ethical and community concerns. I think most people that participate in this community respect its norms and institutions, and ultimately policies should reflect their best interests and values.
I also think that it is reasonable, even desirable, for something called the Interactive Fiction Technology Foundation to take positions on interactive fiction technologies. It would be rather bizarre if it didn’t. Sometimes people won’t like those positions. That’s natural. It’s the nature of leadership.
This is new ground. There may be problems! I’m sure that we will treat volunteer organizers with the respect and patience they deserve. In time, we will all reap the benefits of their decisions (I include other event organizers, too) to do what is right and best for this community.
The difference, for the 1000th time, is that there’s no way to prove whether text was generated by AI. The only way to identify AI text is by vibes.
A plan to have the organizers check for AI is, literally, a plan to wing it by vibes. That’s literally the only thing an organizer can do.
That’s what makes this rule different from all the others.
(I’m getting really frustrated having to spell this out, over and over, as if you had no idea how AI detection works.)
If y’all want me to just shut up, maybe one or two more people could just ask a simple question about what’s so hard about screening for AI? “Why is this a big deal? I just don’t get it.”
Yeah, this is reasonable. I’ve done that with some slop but only when playtesting and I had to hold my nose.
Maybe just… have the judges check the box instead of rating? The situation I’m trying to avoid is one where one judge sees something that no one else does and it mars the game’s rating.
You’ve convinced me some kind of crowdsourced solution is ideal, and I would support this; I don’t know if there’s much hope for it to be implemented for this year’s comp at this rate. People will probably have to just wait and see what happens with this year’s comp. I definitely think some level of formalized community involvement, such as having judges tick a box that means the game doesn’t get a rating or gets an automatic 1/10 rating, is better than putting the primary onus onto the organizers, and it also provides a structured way for the community to flag a specific game — and see how many times a game has been flagged, or what percentage of judges flagged it — instead of just having people send emails and hope for the best.
I mean, Dan’s right, there’s precisely one way to identify AI-generated text (at least only one that can’t be circumvented by an adversarial actor, but that’s a totally different story) - and it’s having people read it and check the vibes. There are certain measures you can take to systematize it and there are patterns you can find, but it’s vibes.
But like I said upthread - it’s good to be thinking about the 10% of cases that will be ambiguous in practice, but not good to think about them for 90% of the time. People putting in far more effort than your average comp-rigging Dick Dastardly type are having serious trouble getting their models to produce cogent text for IF - much of it is plainly obvious.
And, as an aside, I suppose - ten years ago a bulleted list of short paragraphs, each preceded by an emoji, would probably have passed the Turing Test. Today it doesn’t.
I agree with the approach IF Comp has suggested.
I’m reluctant to agree with community proposals that integrate the mechanics of upholding the rule specifically with the numerics of IF Comp ratings. We don’t need game mechanics for maintaining a community standard.
I’m disappointed at the speed at which we’ve moved from labeling AI submissions, to disallowing them from entry, to arguably needing a process for audience members to do enforcement before we’ve seen if people intend to break the rule (or submit reports) en masse.
If anyone has any concerns, they should presumably email the competition organizers, who can handle things on a case-by-case basis as usual.
The AI rules aren’t different from other rules. The recent age gating rules, transformative use rules, and previous release rules all have edge cases. For the first two, the organizers have had time to discuss the rules at length with me and whether my entries were appropriate.
They also seemed to appreciate that I was engaging with the rules — I didn’t get the sense that I was bothering them over what were only potential issues.
I’ve never submitted reports about someone else’s entry, but I’m sure they’d spend just as much time on that. If the organizers don’t have time to handle things as usual, I’m sure they’ll make this clear.
I think you and Ben might be talking past each other a bit? Like, it is the case that rules about AI are often more vibes-based than “was this previously released?”, though there are edge cases in both. But it’s also the case that since the AI disclosure rule was added, organizers and volunteers have needed to go through the “is this maybe AI?” vibes-based check. This rule change is just about consequences, not about whether there’s a process or not a process about AI content.
There can certainly be second-order effects on that process, of course – upping the consequences ups the stakes, as you’ve said! So that could mean that the process will have to be more involved. But since presumably fewer LLM-using entries will be coming in the door given the rule change, organizers will likely have more time to deal with potentially-questionable ones too.
We’ll just have to see, and I agree that it’s helpful to talk about the process with more granular detail, but just thought it might be helpful to be clear that the current checkbox disclosure also involves a process.
EDIT: on the substance (of the process…) I think I’m also not seeing a strong case for bespoke reporting/voting systems around this rule, though emphasizing that if folks are worried that an entry breaks a rule, they should contact the organizers rather than brigade or make public accusations definitely seems helpful.
I was just typing up a response, but I think this captures it pretty fairly.
I do think it is sometimes likely to be extremely clear that the new rule is violated. I’m not sure Dan and I disagree on this either, since Dan listed three examples in the upstream post: “calls out to a known API service”, “doesn’t check the no-AI checkbox”, “author admits to it when the committee reaches out”. Personally I feel there are other signals that are “beyond a reasonable doubt”, e.g. an item description that says “Certainly! Let me rewrite the description of the magic sword for you, in a way that enhances the dark and gloomy atmosphere of the story.”
But I don’t really think it matters; there are definitely edge cases here, and the edge is large and vague, probably larger and vaguer than any previous rule, and the committee may place it differently than me. There will almost certainly be cases where a game in the comp breaks the sniff test, but there’s not enough there to act on it. This feels a bit sad to me but basically unavoidable. (The proposed changes to procedures don’t avoid it either.)
What I do think we on the forums can do, as a community, is to try and be chill. The organizers have earned our goodwill over many years, they have existing processes that are only semi-public for good reasons, and I have no evidence that they are taking this lightly and quite a bit of evidence that they are being thoughtful about it. If the committee this year disqualifies half the entries for use of the word “delve”, I will reconsider this! But for now I’m happy to express the belief and hope that they won’t do anything silly like this, and to try and take the temperature down a bit.