What was the minimum number of ratings any entry got this year?
The lowest number of ratings was 11. The highest was 16. They were actually fairly well distributed.
Good question. The itch.io algorithm is meant to adjust for this situation, but if that game got 5 and all the others got (say) 3 or 4, it would probably still win. If we go for the raw score, then it would definitely win.
EDIT: I can’t see any way to specify a minimum number of ratings before a game is eligible for placings, so this would have to be done manually. You may not know what threshold to apply until you’ve received all the votes. If this were to be adopted, then it may seem unfair to those who receive fewer votes than the threshold. It may also cause entrants to encourage others to vote for their game, which is not permitted.
Perhaps the threshold could be half the number the most-rated game received? Adjust at the organizer’s discretion if this disqualifies too many.
I don’t like itch’s system either, but we could consider whether it might make sense to modify it, in the light of preventing the situation that a game with (e.g.) two 5-votes would win over a game with nineteen 5-votes and one 4-vote.
One idea would be to apply the negative adjustment (according to the itch formula) only to games which do not reach half of the median number of votes.
It’s probably clear, but just to be sure: I don’t mean the 25th percentile, which would punish a quarter of games. Instead, the criterion could be half of the median number, in absolute terms.
In the current jam, the median number of votes was 14, so we would only negatively adjust the scores of games with fewer than 7 votes. In this case, no game would have been punished, since the game with the fewest votes still received 11.
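A minimal sketch of this rule in Python, assuming the itch adjustment is the raw score times the square root of (votes / median), as a later post in this thread describes it. The function name and the raw scores are made up for illustration; only the vote counts come from this year's jam.

```python
from math import sqrt
from statistics import median

def adjust_below_half_median(raw_scores, vote_counts):
    """Apply the itch-style penalty only to games with fewer than
    half the median number of votes; leave all other scores untouched."""
    med = median(vote_counts)
    cutoff = med / 2
    return [
        raw * sqrt(n / med) if n < cutoff else raw
        for raw, n in zip(raw_scores, vote_counts)
    ]

# Vote counts from this year's jam (median 14, so the cutoff is 7).
votes_2026 = [11, 12, 12, 13, 14, 14, 14, 15, 16]
# Hypothetical raw scores, purely for illustration.
raw_2026 = [3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 4.2, 4.4, 4.6]

unchanged = adjust_below_half_median(raw_2026, votes_2026)

# Hypothetical extra entry with two 5-star votes: the median becomes 13.5,
# the cutoff 6.75, and only this entry is scaled down.
with_outlier = adjust_below_half_median(raw_2026 + [5.0], votes_2026 + [2])
```

With this year's counts nothing changes, while the hypothetical two-vote entry drops well below its raw 5.0.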
Heh, we could require a minimum number of votes for every submission and just not end the voting period until that happens. That could be amusing.
My only concern there is that some games get relatively few votes for technical reasons—The Abbey of the Hidden Rose seems fun, for example, but it’s literally unplayable for me without a Windows machine. If someone submits a game that’s only playable on (for a pathological example) versions of Android from before 2019, I don’t want the competition to keep running until we find ten people with old Androids who are willing to play and rate it.
The stats:
Ratings per game in 2026:
| Game | Author | Ratings |
|---|---|---|
| The Abbey of the Hidden Rose.TALP | catventure | 11 |
| Big Deal, Oh! (TALP) | Andrew Schultz | 12 |
| The Pattern Beneath (TALP) | relei2004 | 12 |
| Ransack! (TALP) | improvmonster | 13 |
| The Antediluvian Weapon [TALP] | Dercomai | 14 |
| [TALP] Beneath The Exhibition | patricksgamecorner | 14 |
| Adventure in the Crypt (TALP 2026) | Candy64 | 14 |
| The Gnomish Treasury (TALP) | Lamp Post Projects | 15 |
| Epic Expedition (TALP) | dgtziea | 16 |
However, this situation was better than in several other years, for both TALP and ParserComp.
Perhaps, unless someone gets an exact average score of 5 stars, be bold and say threshold 3 ratings?
Or just say the organizer will decide a threshold if needed. If it involves the first place it is much worse than if it involves 5th place or so.
On the contrary, as a potential minority opinion, I would find that deeply amusing. Reminds me of Festivus or watching Belgium spend 2 years trying to form a government, but with thankfully much smaller stakes. (That isn’t a shot against Belgium btw, I wouldn’t care to publicly examine comparative dysfunction between my country and… well almost anywhere.) It also forces the community to come together and engage with all the submissions before we consider that year done and over, which is nice.
Then again, I dip my mozzarella sticks in Ranch, so adjust your judgement of my judgement appropriately.
(Also, as an aside, the system where IFComp requires a minimum number of ratings to vote is probably a great idea, but has zero enforcement function through itch.io.)
Ideally no one gets disqualified as that is very demotivating. So only “disqualify” when absolutely necessary. And those authors did nothing wrong so we need another word for it.
Sounds like the best of both worlds, though it does not have to be half the median necessarily. Only those who would have been “disqualified” by a threshold get their scores adjusted?
The itch.io game jam infrastructure has a thing called a ‘rating queue’, whereby a judge has to rate a randomly allocated set of games before they can rate all the other games. You can read about it here.
This is similar to the IFComp rule that you must vote on a minimum number of games. The only difference is that you are allocated which games to rate. This ensures that the ratings are reasonably well distributed and a single judge can’t use sock-puppet accounts to vote on a single game.
There are settings to allow a judge to request a different list (but only once) and to specify their preferred platforms.
I haven’t seen this used in a game jam, but I think it sounds good in theory. What does everyone else think? I’d suggest a minimum number of three games in the list.
If I were running a comp, I’d pick a fixed threshold for the number of ratings a game needs to be ranked. All games below that threshold would be marked as “unranked”. If there are prizes, e.g. money, then these games all get a specific amount (perhaps the same as the lowest ranked game would get).
I think for these smaller comps, the threshold should be somewhere between 5 and 10. There could be a proviso of “this threshold may be lowered at the organizer’s discretion”, in case it turned out too many games were under that threshold.
Having a complex equation for the threshold, or a system like itch’s current one, I feel would suffer the same problem which avoiding itch’s algorithm was meant to solve: being or seeming unfair and/or arbitrary.
For the issue of games that may get few ratings because they are impossible for many people to play, e.g. a MacOS binary, my feeling is that is an inherent risk of using such a system. Someone submitting a game which only runs on a Commodore 64 should expect a high chance of not getting enough votes to be ranked. This would be mentioned in the rules or guidelines.
P.S. I agree the rating queue sounds good, especially the “unlock size” feature allowing people to play a few games which they want to play, and the rest are chosen for them.
This goes both ways. The opposite effect happened to my submission. I’ve taken steps to address it. The original effect was not actually that bad because my entry played in the browser. I felt really good about my entry because rule #1 of the contest was very generous and said that any computing platform can be used, though a browser game typically gets more screen time. The offline version was really just for bookkeeping and archival reasons of the contest; at least that was how the intent felt to me, a newcomer.
What I found in practice is that people prefer, and almost expect, that a game be available for their preferred system. Some of the judges even stated that they would play the games in order based on this criterion and might not even get to the games on the systems they didn’t care for, whatever the reason given.
The part that was unexpected to me was when a game is published on a system that is “too modern” or has minimum OS requirements. My offline submission was such an entry. It is challenging as a content creator to reach the widest audience. As much as TALP seemed like the center of the universe for me, as it was my first game jam dedicated to the genre and my second one ever, I also want to reach the broadest audience. I’ve taken steps to minimize this. In the future I will make sure to submit an entry for the browser, native Windows 10/11, native Linux, and native Android. So lessons learned, and that is a good thing.
Ultimately, platform diversity is good for all of us.
Yeah, I tend to agree that a disqualification would be quite harsh.
Especially since it implies that a game with very good, but few votes, could be disqualified, whereas a game about which a lot of voters completely agreed that it is rather bad, would be qualified and thus rank higher (or, technically, get a rank at all).
For example, if the cutoff was at six votes, then a game with three 5-votes and two 4-votes (so, score = 4.6) would be disqualified, whereas a game with eight 1-votes and four 2-votes (so, score = 1.33) would be qualified.
Admittedly, every system will probably have unintuitive results in certain constellations. But going through some of these examples can be useful to be clear what the consequences could be.
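To make the cutoff example above concrete, here is a small sketch of a fixed-threshold ranking in Python. The function name, the game labels and the "unranked" list are my own illustration, not anything itch.io provides; the votes are the ones from the example.

```python
from statistics import mean

def rank_with_cutoff(games, min_ratings):
    """games maps a name to its list of votes.
    Returns (ranked, unranked), each a list of (name, raw average)."""
    ranked = sorted(
        ((name, mean(votes)) for name, votes in games.items()
         if len(votes) >= min_ratings),
        key=lambda item: item[1],
        reverse=True,
    )
    unranked = [
        (name, mean(votes)) for name, votes in games.items()
        if len(votes) < min_ratings
    ]
    return ranked, unranked

games = {
    "well liked, few votes": [5, 5, 5, 4, 4],       # 5 votes, below the cutoff
    "disliked, many votes": [1] * 8 + [2] * 4,      # 12 votes, above the cutoff
}
ranked, unranked = rank_with_cutoff(games, min_ratings=6)
```

Here the only ranked game is the one with the much lower average, which is exactly the unintuitive outcome described above.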
I decided to vote for “adjusted scores”. It is not perfect but I don’t like the idea of anyone getting “disqualified” (or “unranked”) and with adjusted scores that is not an issue. With raw scores and no threshold, somebody could submit a game in an obscure format and win the whole thing with a single rating. By keeping itch.io’s result page, people can still see their own average raw score, and get an idea of how well their game was received. No matter what, we should in general support organizers for doing their best.
My final idea: The organizers will only apply itch’s system if they have to, i.e. if some games had very few ratings. Hmm….
Personally I am not too bothered about the automatic adjustments that itch makes to the voting averages. It is better than some alternatives, like just using raw scores no matter how few votes an entry has.
Having said that, I do think it is important to listen to the fact that some people are not happy about the adjustments. Potential entrants could be put off by it, and we don’t want that.
There’s no ideal solution, but some sort of threshold needs to be applied; there are some good ideas in this thread.
I think the important thing is, whatever is decided, the competition needs to be very clear at the outset what the rule is, and communicate it prominently. It’s no good to see if an entry gets a low number of votes and then decide what to do about it.
I collected some data for past jams; maybe that can help to assess possible measures/criteria. (Though I did not yet look at the details of how far the rankings would change with or without itch’s formula.)
| Jam | Entries | Ratings | Avg. Ratings per game | Median | Ratings first to last place | Ratings as ordered list | Half of max. ratings | Half of median | 0.75 * median |
|---|---|---|---|---|---|---|---|---|---|
| Adventuron CaveJam (2019) | 10 [11] | 92 [103] | 9.2 [9.36] | 8 | 18, 9, 8, 12, 9, 7, 6, 8, 8, 7 | 6, 7, 7, 8, 8, 8, 9, 9, 12, 18 | 9 | 4 | 6 |
| Treasure Hunt Jam (2020) | 15 [16] | 80 [86] | 5.33 [5.38] | 5 | 5, 5, 5, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 6, 5 | 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6 | 3 | 2.5 | 3.75 |
| Adventuron Christmas Jam (2020) | 12 | 193 | 16.1 | 14.5 (rnd.15) | 28, 15, 20, 17, 13, 14, 19, 14, 13, 13, 12, 15 | 12, 13, 13, 13, 14, 14, 15, 15, 17, 19, 20, 28 | 14 | 7.25 | 10.875 |
| TALJ 2021 | 10 | 102 | 10.2 | 10 | 14, 10, 11, 10, 10, 11, 8, 10, 10, 8 | 8, 8, 10, 10, 10, 10, 10, 11, 11, 14 | 7 | 5 | 7.5 |
| ParserComp 2021 | 18 | 355 | 19.7 | 20 | 24, 26, 20, 19, 19, 24, 27, 18, 21, 24, 22, 16, 20, 13, 22, 16, 11, 13 | 11, 13, 13, 16, 16, 18, 19, 19, 20, 20, 21, 22, 22, 24, 24, 24, 26, 27 | 13.5 | 10 | 15 |
| TALJ 2022 (*) | 15 [16] | 164 [171] | 10.9 [10.7] | 11 | 11, 11, 14, 11, 17, 11, 11, 11, 12, 10, 9, 11, 9, 8, 8, [7] | [7], 8, 8, 9, 9, 10, 11, 11, 11, 11, 11, 11, 11, 12, 14, 17 | 8.5 | 5.5 | 8.25 |
| TALJ 2023 | 9 | 146 | 16.2 | 17 | 15, 17, 17, 17, 17, 17, 18, 15, 13 | 13, 15, 15, 17, 17, 17, 17, 17, 18 | 9 | 8.5 | 12.75 |
| ParserComp 2023 (Classic Category) | 11 | 114 | 10.4 | 9 | 18, 12, 9, 14, 10, 13, 9, 9, 9, 6, 5 | 5, 6, 9, 9, 9, 9, 10, 12, 13, 14, 18 | 9 | 4.5 | 6.75 |
| TALJ 2024 | 10 | 114 | 11.4 | 11 | 11, 14, 16, 12, 11, 11, 11, 10, 11, 7 | 7, 10, 11, 11, 11, 11, 11, 12, 14, 16 | 8 | 5.5 | 8.25 |
| TALJ 2025 | 7 | 72 | 10.3 | 10 | 9, 10, 10, 11, 10, 12, 10 | 9, 10, 10, 10, 10, 11, 12 | 6 | 5 | 7.5 |
| TALJ 2026 | 9 | 121 | 13.4 | 14 | 15, 16, 14, 13, 11, 12, 14, 14, 12 | 11, 12, 12, 13, 14, 14, 14, 15, 16 | 8 | 7 | 10.5 |
The last three columns contain the thresholds for three different tentative criteria: Draconis’ suggestion (half the number the most-rated game received), my suggestion (half the median number of ratings), and one that’s similar to mine but stricter (three quarters of the median number of ratings).
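For anyone who wants to recompute those three columns for another jam, a short Python sketch follows. The function and key names are hypothetical; the input is the TALJ 2026 row from the table.

```python
from statistics import median

def candidate_thresholds(rating_counts):
    """Return the three tentative thresholds for one jam."""
    med = median(rating_counts)
    return {
        "half_of_max": max(rating_counts) / 2,
        "half_of_median": med / 2,
        "three_quarters_of_median": 0.75 * med,
    }

talj_2026 = [11, 12, 12, 13, 14, 14, 14, 15, 16]
thresholds = candidate_thresholds(talj_2026)

# Which rating counts would fall below each threshold?
below = {
    name: [n for n in talj_2026 if n < limit]
    for name, limit in thresholds.items()
}
```

For TALJ 2026 this reproduces 8, 7 and 10.5, and no entry falls below any of the three.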
Edited to add:
(*) For TALJ 2022, it seems that originally 16 games were entered, and therefore Itch calculated the average as 171 / 16 = 10.6875 = (rounded) 10.7, which is what shows on the jam page on itch. The game with 7 votes was probably taken out of the comp later, so the real values are 164 ratings, average 164 / 15 = 10.93.
Edited again to add source links and correct a typo.
Edited again (sigh) to correct an oversight on my part, and some errors due to Itch, which, AFAICT, sometimes seems to have mysterious ways when entries get taken out of comps (cf. the TALJ 2022 issue above)? (Like counting the sum of all votes by including votes for ghost entries, but then calculating the average by dividing that sum by the number of final entries, excluding ghost entries. E.g., for the Treasure Hunt, Itch says “15 entries were submitted”, but also “86 ratings were given to 16 entries” and the “average number of ratings per game was 5.7”. If we calculate 86 / 15 we get 5.7, but the 15 listed entries actually only have 80 votes total, so it’s either 80 / 15 = 5.33 or 86 / 16 = 5.375 ≈ 5.38, but 86 / 15 makes no sense.)
Thank you for those stats. That exemplifies different relative “targets” (e.g. half the median)
I am no expert but actually found it very helpful to “discuss” the topic with AI. E.g. I realized that a threshold OR a target number of ratings (e.g. median or fixed number) could depend on whether we are trying to prevent cheating or if we are “just” interested in decreasing the effect of randomness. I guess it is hard to predict all sorts of cheating.
The itch method could be generalized to:
FinalScore=RawScore*(ratings/target)^p
Itch.io uses p=0.5 (square root) and target is the median. We might as well have e.g. p=1/3 (cube root) and target = 5 ratings etc.
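A rough Python sketch of the generalized formula, to play with different values of p and target. One assumption that is not spelled out in the formula as written: I cap ratings/target at 1, so games with more ratings than the target are not boosted (the adjustment is described as a negative one earlier in the thread). The function name and the `cap` switch are my own.

```python
def adjusted_score(raw_score, ratings, target, p=0.5, cap=True):
    """FinalScore = RawScore * (ratings / target) ** p.
    With cap=True the ratio is limited to 1 (an assumption, see above)."""
    ratio = ratings / target
    if cap:
        ratio = min(ratio, 1.0)
    return raw_score * ratio ** p

# The example from earlier in the thread: two 5-votes versus
# nineteen 5-votes and one 4-vote (raw average 4.95).
few_raw, few_n = 5.0, 2
many_raw, many_n = (19 * 5 + 4) / 20, 20

# itch-style: p = 0.5, target = median (14 is this year's median).
few_itch = adjusted_score(few_raw, few_n, target=14)
many_itch = adjusted_score(many_raw, many_n, target=14)

# Milder variant: cube root with a fixed target of 5 ratings.
few_mild = adjusted_score(few_raw, few_n, target=5, p=1/3)
many_mild = adjusted_score(many_raw, many_n, target=5, p=1/3)
```

In both variants the game with twenty votes ends up ahead; the cube-root version just penalizes the two-vote game less harshly.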
But I am not sure if @Warrigal wants to deviate in any way from the two obvious choices (raw or itch’s standard adjustment)?
Thank you for all the thoughtful discussion so far. First of all, I’d like to state that there will be no artificial threshold, as this is too hard to estimate in advance. itch.io’s use of the median is a fair one, as this is the midpoint of the ratings distribution.
Secondly, no games will be removed from the rankings, as this would be unfair to the authors after spending so much hard work on writing a game for our enjoyment.
The question is “which is the fairest way of ranking games given that each game may have a different number of ratings?”
After mucking about with spreadsheets and doing a little extra research, I’ve found that if the number of ratings is evenly distributed across games, then it doesn’t matter which formula you use, as the relative results are very similar. So, I am going to try the ratings queue next year. I might keep the results hidden until I’ve had a chance to make sure there are no anomalies.
Any formula that adjusts for “fairness” is attempting to account for the outlying cases where games receive an extremely high or extremely low number of ratings. After reading the itch.io threads on their formula and the rationale behind it, I’m now convinced that it is actually fair and addresses all the concerns expressed above. In the hypothetical case of one game receiving ratings of 5 in every category in TALJ 2026, but only receiving one vote, this game would have come last.
The fairest formula is generally regarded to be the Bayesian average. When I applied this to this year’s results, the overall results were compressed so that the higher-rating games scored less and the lower-rating games scored more. Positions 4 and 5 were swapped, positions 8 and 9 were swapped and the hypothetical single-rated game mentioned above was placed in the middle. This is the same as the results from using the raw score, except that the hypothetical single-rated game would have won.
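For reference, a common form of the Bayesian average is sketched below in Python. The post does not say which prior was used, so `prior_mean` and `prior_weight` and all the votes here are hypothetical values chosen only to show the compression effect.

```python
def bayesian_average(votes, prior_mean, prior_weight):
    """Blend a game's votes with `prior_weight` imaginary votes
    at `prior_mean`; few real votes means the prior dominates."""
    return (prior_weight * prior_mean + sum(votes)) / (prior_weight + len(votes))

# Hypothetical prior: 5 imaginary votes at 3.0 stars.
PRIOR_MEAN, PRIOR_WEIGHT = 3.0, 5

strong = bayesian_average([4] * 12, PRIOR_MEAN, PRIOR_WEIGHT)   # raw 4.0
weak = bayesian_average([2] * 12, PRIOR_MEAN, PRIOR_WEIGHT)     # raw 2.0
single = bayesian_average([5], PRIOR_MEAN, PRIOR_WEIGHT)        # raw 5.0
```

The well-rated game is pulled down a little, the poorly rated one is pulled up a little, and the single-vote game lands between them rather than first or last.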
FWIW - I like 9th place better than 8th.
Who got the Golden Banana of Discord, as IFComp calls it?