IFComp 2019 Follow-Up Survey: Responses Requested by Feb 9th

Denk · January 30, 2020, 10:07am

I didn’t say that the judges should rank games they had not played. I was talking about the underlying assumptions behind the method. Here is another quote from the article:

Surely it must make a difference whether the IFcomp organizers assume the above or if they assume the opposite. If they just follow the method blindly, they will be using the same assumptions.

I am quite sure the basic Condorcet method described in the article favours games which are played more.

Take a look at the section “Pairwise counting and matrices”, starting with the first matrix, which corresponds to one ballot. Here all possible pairs are considered since the voter apparently ranked them all.

But what if the voter didn’t play “game D”?
How will the organizer then fill out the matrix?

If you then use the above-mentioned assumption (“Usually, when a voter does not give a full list of preferences they are assumed, for the purpose of the count, to prefer the candidates they have ranked over all other candidates.”), you are still putting ones in the rightmost column (the D column, which contains a “1” whenever game D lost a comparison) even though game D might be the best of the 4 games.

It would be more fair (but still problematic) if all the numbers in column D were set to zero, since we do not know if game D is better or worse than the others. But still, if a medium game gets played a lot, it will get a lot of “ones” in the corresponding matrix row, whereas a very good game which is played very little will get very few “ones” in its matrix row. Since the winner is simply found as the game with the most “points” in its row after summing the matrices, it can be concluded that the Condorcet method favours games which have many votes.
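To make the two conventions above concrete, here is a minimal Python sketch of a single ballot matrix. The function name `ballot_matrix`, the `unranked_lose` flag, and the sample ballot are hypothetical and only illustrate the argument; they are not taken from the Wikipedia article or from anything IFComp uses.

```python
def ballot_matrix(games, ranking, unranked_lose=True):
    """Build one ballot's pairwise matrix.

    m[a][b] == 1 means this ballot counts game a as preferred over game b.
    games: every entry in the competition
    ranking: the games this judge ranked, best first
    unranked_lose: if True, ranked games are assumed to beat every unranked
                   game (the quoted assumption); if False, those cells stay 0
    """
    position = {g: i for i, g in enumerate(ranking)}
    m = {a: {b: 0 for b in games} for a in games}
    for a in games:
        for b in games:
            if a == b:
                continue
            if a in position and b in position:
                # Both games were ranked: the earlier one wins the pair.
                m[a][b] = 1 if position[a] < position[b] else 0
            elif a in position and unranked_lose:
                # Only a was ranked: the assumption awards it the pair.
                m[a][b] = 1
    return m


games = ["A", "B", "C", "D"]
ballot = ["B", "C", "A"]  # hypothetical judge who never played D

with_assumption = ballot_matrix(games, ballot, unranked_lose=True)
zeros_for_unplayed = ballot_matrix(games, ballot, unranked_lose=False)

# Column D: one entry per row, in the order A, B, C, D.
print([with_assumption[g]["D"] for g in games])
print([zeros_for_unplayed[g]["D"] for g in games])
```

With the assumption switched on, every played game scores a "1" against D on this ballot even though the judge never saw D; with it switched off, column D is all zeros.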

If anyone can propose how to fill out a ballot matrix in a fair way, when game D hasn’t been played and keeping in mind that perhaps most players didn’t play game D, I would be very interested.

Please also consider that on most ballots, most of the games haven’t been played, i.e. also consider games E, F, G etc. which have not been played either.

Sorry if the above is confusing to read.

But to sum up:
How do you fill out the ballot matrix (not the sum matrix) shown in the section “Pairwise counting and matrices” of the Wikipedia article, if game D hasn’t been played? You must do it in a way that doesn’t favor games which get played a lot AND please consider that there might be several games the judge hasn’t played (games E, F, G etc.).

The0didactus · January 30, 2020, 3:23pm

I agree that Condorcet presents to judges a much easier task than the current environment: “Please rank all games you played in the order of quality” is an easier question to answer than “How good is Zozzled on a scale of 1-10”

However, I’m not sure the results are easy to interpret or calculate, particularly with 80 games in the running. (does “20th place” mean something in a Condorcet election? Currently I think getting 20th in IFComp really means something).

I also agree with Denk that the Condorcet system seems to be fatal to games which in the current environment accrue fewer votes. Now maybe we should just accept that accessibility and mass appeal are important or nigh-essential qualities for a modern game, but I’m not entirely onboard with that. It’s possible to fix this, but I don’t see how a solution doesn’t do one of the following if not both:

become so confusing as to border on nonsense for the mathematically untalented
massively bias the few times a niche game is rated (particularly when paired with a “favorite”)

I’m most familiar with Condorcet-style voting (or things that seem similar to it, anyway) in the context of elections and motions, both of which generally assume a small number of candidates or options, and a “pass” or “no opinion” to be equivalent to the lowest rank.

IceCreamJonsey · January 30, 2020, 3:39pm

75-80 games is an overwhelming number, sure. Here are some rando thoughts on what could be done:

An author can only have worked on one game. In terms of limiting games while being fair to people, this would accomplish that, but as others have stated, this isn’t the issue here: people aren’t entering the max number of games. But it is painless and fair.
Only allow entries from the existing pool of IF dev tools instead of one-off, custom, homebrew engines. Purely in terms of limiting games but letting anyone enter, this maybe funnels off those games to other competitions or release dates.
Split the games into two “Leagues,” parser and CYOA. Now you have two divisions of 40 games instead of one of 80. I guess I’d use sports analogies here - there are people that I know that are obsessed with the Big 12 in college football and don’t know much about the WAC or whatever, so to them the pool of teams is however many teams are currently in the big 12, not all 100+ college teams. So their world of awareness shrinks.
The entry fee would probably cut down on games, not saying I am advocating for it but I would agree that it would probably do the trick. You could whitelist authors that finished in the top X previously (or maybe more usefully, whitelist the next entry for someone in the bottom 10) like I think professional golf does with certain events. Or make a free entry be a prize or something. I remember the Spring Thing used to have a fee and it (as a result?) would have 4 games for its entire comp. So a fee might overly limit the number.
Maybe increase the time between when you can announce and when the comp starts? I dunno.

Just throwing those suggestions out there.

And then in terms of getting votes up, I’d advocate for lowering the requirement for games played from 5 to, say, 3. 5 is a lot for people that are generally outside the community. That’s about ten hours (or more, as a lot of games were entered that were more than 2 hours) right off the top before any vote can be cast at all. That’s a lot.

nilsf · January 30, 2020, 4:18pm

That would have excluded Detectiveland which won in 2016.

zarf · January 30, 2020, 4:29pm

The general answer is “you leave it blank and perform the vote calculation using a partial matrix”. Look at the “Ranked Pairs” article in Wikipedia: “unstated candidates are assumed to be equal to [all] the stated candidates”, where “equal” indicates “indifferent”.
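As a rough illustration of counting on partial ballots, here is a Python sketch in which a pair is only tallied when a judge ranked both games, so an unplayed game is neither above nor below anything on that ballot. It uses a simplified head-to-head check for a Condorcet winner rather than the full Ranked Pairs procedure, and the function names and ballots are hypothetical.

```python
from itertools import combinations


def pairwise_tally(games, ballots):
    """wins[a][b] = number of ballots that ranked both a and b, with a above b."""
    wins = {a: {b: 0 for b in games if b != a} for a in games}
    for ranking in ballots:
        # combinations() keeps list order, so hi is ranked above lo.
        for hi, lo in combinations(ranking, 2):
            wins[hi][lo] += 1
    return wins


def condorcet_winner(games, wins):
    """Return the game that wins every head-to-head contest, or None."""
    for a in games:
        if all(wins[a][b] > wins[b][a] for b in games if b != a):
            return a
    return None


games = ["A", "B", "D"]
ballots = [
    ["A", "B"], ["A", "B"], ["A", "B"], ["B", "A"],  # four judges skipped D
    ["D", "A", "B"], ["D", "B", "A"],                # two judges played all three
]

wins = pairwise_tally(games, ballots)
print(wins["D"]["A"], wins["A"]["D"])
print(condorcet_winner(games, wins))
```

In this made-up data D appears on only two of six ballots, yet it wins both of its head-to-head contests, because ballots that omit D contribute nothing to any pair involving D.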

Yeah, as someone suggested above, it would be desirable to work with the IFComp committee to test the algorithm on the raw data of previous comps.

You can work out a ranking for the whole comp, yes. (Determine a winner; delete that game from the vote data; re-run the algorithm to determine second place; repeat.)
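The determine, delete, repeat loop can be sketched in Python as follows. This is only an assumption-laden illustration: it picks each winner with a plain head-to-head (Condorcet winner) check instead of full Ranked Pairs, and when no such winner exists it simply reports the remaining games as an unresolved group. All names and ballots are hypothetical.

```python
from itertools import combinations


def condorcet_winner(games, ballots):
    """Return the game that beats every other game in `games` head-to-head, or None."""
    wins = {(a, b): 0 for a in games for b in games if a != b}
    for ranking in ballots:
        # Ignore games that have already been placed and removed.
        kept = [g for g in ranking if g in games]
        for hi, lo in combinations(kept, 2):
            wins[(hi, lo)] += 1
    for a in games:
        if all(wins[(a, b)] > wins[(b, a)] for b in games if b != a):
            return a
    return None


def full_ranking(games, ballots):
    """Repeatedly find a winner, remove it, and re-run on the remaining games."""
    remaining = list(games)
    placed = []
    while remaining:
        winner = condorcet_winner(remaining, ballots)
        if winner is None:
            # No clear winner: report the rest as one unresolved tie or cycle.
            placed.append(tuple(remaining))
            break
        placed.append(winner)
        remaining.remove(winner)
    return placed


ballots = [
    ["A", "B"], ["A", "B"], ["A", "B"], ["B", "A"],
    ["D", "A", "B"], ["D", "B", "A"],
]
print(full_ranking(["A", "B", "D"], ballots))
```

A real tally would need a rule for the unresolved case, which is where methods such as Ranked Pairs differ from this sketch.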

It is true that these systems produce a lot of ties, particularly if the number of judges or the number of games-played-per-judge is too small. On the other hand, the current Comp voting system has the same problem! It’s “disguised” by running the score calculation out to two decimal places, which is really getting lost in the noise. We want a nice clean ordering of games, so we pretend that the difference between a 7.82 and 7.78 average is meaningful in terms of the consensus of judges.
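One way to see the "lost in the noise" point is to compare the gap between two averages with the standard error of each average. The Python sketch below uses invented vote lists for two hypothetical games; it is not IFComp data, and the standard error is just one conventional yardstick for noise.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_stderr(votes):
    """Return (mean, standard error of the mean) for a list of 1-10 votes."""
    return mean(votes), stdev(votes) / sqrt(len(votes))


# Invented votes for two hypothetical games, ten judges each.
votes_x = [9, 8, 8, 7, 10, 6, 8, 7, 9, 6]
votes_y = [8, 8, 7, 7, 9, 8, 7, 8, 8, 7]

mean_x, se_x = mean_and_stderr(votes_x)
mean_y, se_y = mean_and_stderr(votes_y)
gap = mean_x - mean_y

print(f"X: {mean_x:.2f} +/- {se_x:.2f}")
print(f"Y: {mean_y:.2f} +/- {se_y:.2f}")
print(f"gap between averages: {gap:.2f}")
```

In this invented example the gap between the two averages is smaller than either standard error, so the two-decimal ordering would not reflect a reliable difference.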

I’ll stop there, as condorcet/ranked-pairs arguments have been known to rapaciously consume entire forum threads. We could kick off a separate thread if we want to get into algorithmic details.

vivdunstan · January 30, 2020, 4:45pm

Not for me personally. I have significant memory problems from progressive neurological disease, and quickly forget details of a game after I’ve played it. I find rating each one separately works well for me, even if I had to evolve my own ratings guide to do it efficiently on a 1-10 scale.

I must admit I haven’t read all the discussion above. But I’m starting to get a bit concerned that this scheme might exclude me as a judge. I’ve judged IF Comp since it started, all those years ago. Still if it worked better for newer judges then I’d see it as a good thing.

dibianca · January 30, 2020, 5:47pm

I may be in an uncommon position of having submitted one entry per year for the last 6 years. These are the vote counts:

2014 - 67
2015 - 95
2016 - 60
2017 - 56
2018 - 56
2019 - 42

The 2015 result was surely an anomaly caused by a high-profile link from JayIsGames. (Over 5,600 transcripts were generated.)

Otherwise, there does seem to be a downward trend. I don’t know how much of it is because votes are getting spread over a larger number of entries. I wouldn’t mind seeing some comp stats history, like number of judges and average votes per judge.

Denk · January 30, 2020, 11:33pm

Thanks for clarifying that. Leaving them blank and calculating on the partial matrices corresponds to putting zeros in all rows and columns of unplayed games. This method, therefore, favours the games which get played the most. Thus I would prefer to keep the existing average rating method, which doesn’t favour the games which get played the most.

EDIT: Zarf has explained a slightly different variant (here) which allows for indifference and unplayed games. This method does not favor the games which get played the most.

bg · January 30, 2020, 11:49pm

What about prizes for judges? Cash or gift cards, or maybe even donated items. Choose, say, 50 judges at random to receive a prize. If you’re concerned this might attract non-serious judges who are just going to put random ratings on games, you could require written feedback on the games in order to be eligible for these prizes.

zarf · January 31, 2020, 12:59am

No, a blank is not a zero. Unplayed games would be treated as neither higher nor lower than any played game. See the rest of my comment.

FriendOfFred · January 31, 2020, 3:41am

I’d like to push back on the idea that ranking all the games is easier than rating each one from 1 to 10. If I have to put all the games in a ranked list, I can’t say I liked any two equally, which means I have to make a much greater number of hard decisions. Putting each game in one of ten slots sounds much easier.

Also, I don’t like to normalize my ratings. I prefer to save the 10 and the 1 for extraordinarily good or bad entries. So I don’t like the idea of a system that assumes my first-place game is equivalent to a 10 or my last-place game is equivalent to a 1. On the other hand, if the stars aligned and I found several astonishingly great games in one comp, I’d like to be able to reflect that in my vote as well.

aschultz · January 31, 2020, 7:29am

I think there’s a question that remained unasked here, because it’s tough to.

Conscientious judges/reviewers want to be fair. They may feel a small amount of guilt putting their preferences over others, or they may worry they’re just Missing Something. The more games there are, the more likely this is to happen. So this worry and fear increases with the number of games.

Judges/reviewers need some reassurance that they are doing the best they can, and they will average out if enough others play. I think we sort of know this, but I know it can be intimidating to me to have that many choices I need to make.

borg323 · January 31, 2020, 8:54am

Idle thought: what about a system where absolute votes are used to extract the relative order and for tie-breaks? This way the scoring can be kept as is, but there is no need for the 5 game limit, even scoring a single game may be meaningful (in case of a tie), hopefully encouraging more participation.
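Under one possible reading of this idea, each judge's 1-10 scores are turned into pairwise preferences, with equal scores treated as indifference. A minimal Python sketch of that conversion follows; the function name and the sample ballot are hypothetical.

```python
from itertools import combinations


def preferences_from_scores(scores):
    """Turn one judge's scores into (preferred, other) pairs.

    Games with equal scores produce no pair, i.e. the judge is indifferent.
    """
    prefs = []
    for a, b in combinations(sorted(scores), 2):
        if scores[a] > scores[b]:
            prefs.append((a, b))
        elif scores[b] > scores[a]:
            prefs.append((b, a))
    return prefs


scores = {"Game X": 8, "Game Y": 6, "Game Z": 8}  # hypothetical ballot
print(preferences_from_scores(scores))
```

A ballot that scores only one game yields no pairs at all, so under this reading it would matter only through the score-based tie-break.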

Disclosure: I had no free time this spring, so even that wouldn’t have helped in my case.

Denk · January 31, 2020, 1:03pm

That’s interesting. I have made a new thread to discuss the details of the Condorcet method:
https://intfiction.org/t/is-the-condorcet-method-suitable-for-ifcomp/44160

bg · January 31, 2020, 1:56pm

For voters who want to have played a similar combination of games as other voters and be able to discuss among themselves, or for voters who are simply overwhelmed by having too many games to choose from, would it help to offer “book clubs,” or teams of voters?

Someone would divide the competition games into sets of, say, 15 games. Each set could include a mix of long and short games, different genres, and well-known and lesser-known authors. Each voter who wants to participate in this, would be randomly assigned to a team. Team 1 plays games from set 1, team 2 plays games from set 2, and so on. People on the same team would have other people they could chat with about the same combination of games. Team members wouldn’t be required to play every game in their set, but would be encouraged to play at least N games from their set before moving onto the rest of the competition. I don’t know whether it’d be run by the comp organizers or whether someone could do it independently on the side.

BitterlyIndifferent · January 31, 2020, 4:10pm

I’ve only been involved in the discussions from the entrants’ side, but this really wasn’t a problem for me. I noticed two things happening:

I’d search for people talking about games that I had played, which was a way for me to explore new ideas from people I didn’t know.
I’d exchange messages with people I did know to ask whether they had played specific games. Then I could recommend the games that I really enjoyed, and a few times they recommended games of their own, which is helpful when you know you can’t get to every title.

I know I’m missing out. Time constraints mean that I can’t play all the games, I can’t take part in every discussion, and I can’t say that there is one universal, discussable experience of participating in IFComp (either as a judge or as an entrant). That’s part of what makes it so compelling.

datalexic · January 31, 2020, 5:46pm

I’m enjoying this discussion a lot—it’s amazing how many different viewpoints and suggestions there have been. People seem to have a wide variety of goals as judges.

As a less competitive person, I enjoy judging but I honestly see IFComp mostly as a festival where I get to play the games I’m intrigued by, signal boost the ones I enjoyed (in part by giving them high scores), and read lots of insightful commentary from others about a wide variety of games. On that level, there’s nothing that really needs to be fixed.

There could definitely be ways to improve the scoring to make the final rankings more accurately reflect judges’ intentions and preferences, as folks have mentioned, whether through a ranked pairs scoring system or a multi-round approach à la XYZZY awards (first round nominate, second round score the top 25). I’m all for that.

I’ll echo some other responses, though, and add that changes that wouldn’t allow judges to pick their own games (making them randomly assigned, grouping judges into cohorts and having them play pre-specified games, etc.) would make it less likely for me to participate as a judge. Time constraints happen, and I’d prioritize playing the games I’m intrigued by (and providing thoughtful commentary on them) over participating in the scoring aspect of the event.

bikibird · January 31, 2020, 7:25pm

I know everyone is free to decide what merits a 10 vs 5 vs 1, but I think it might be less intimidating for new judges if there were a suggested criteria guide. Experienced judges could continue to do their own thing, but novices would have some reassurance that when they rate something a 7 they are doing right by the author. I know that was my fear: was I being too harsh or too easy? Being too harsh disadvantaged the work I was judging. Being too easy disadvantaged the works I wasn’t judging.

aschultz · January 31, 2020, 7:33pm

Good point. There are a lot of fears – if we see super detailed reviews, that can be intimidating to us to say “I can’t even rate this on a scale of 1 to 10!” (Note: as an aside, super detailed reviews are good. But I also think that sometimes reviews the writer may think are just overviews can 1) point out something the author didn’t see or 2) act as another prod to the author saying “Hey! This is important for your next project/re-release!” Even a small twist on a standard observation goes a long way. And I think even writing plain-vanilla reviews, even short ones, may help judges feel they justified their score, even if they don’t publish the reviews or their score.)

And I can see how there would be fears that rating things too much by your own scale might feel disruptive, but too by-the-book ratings might leave you saying “why bother? I’m not really changing anything.” So that is another trap.

As a competitor I accept that there’s too much to take into account to give an Official Accurate Grade. I’d just like there to be enough judges that, if someone looks back and says they should’ve given higher/lower grades, they realize there were enough judges that they didn’t do anything drastic or horrible to the standings.

Melendwyr · January 31, 2020, 8:24pm

My suggestion would be to consider dividing the competition into categories. However, given the hostility towards subcategorizing IF even for the purposes of theoretical analysis, I can’t imagine the people in charge of the competition being willing to make distinctions between different works.

If they did, it would solve both the issue of comparing apples to oranges and that of having an overlarge pool of works to test.