Mini-vent about AI: if it failed on "this", how can it be relied on for "that" EDIT: Nothing "mini" about this anymore!

So, I’m sorry about this, but I’ve had this on my mind for a couple of days, and I don’t really have anywhere else to discuss it, since I don’t have much of an online presence. While I make no secret of my dislike for AI, I hope to be able to address this constructively, so that AI proponents and opponents may engage equally constructively.

It’s very simple. I engaged ChatGPT a couple of days ago, and as part of that interaction, it made such an obvious blunder that I wonder how it can be trusted to do anything at all. I started to play Othello with it. (Two years ago, I experimented in the same way; at the time it always formatted the board incorrectly, and however many times I told it how to format it correctly, it always messed up, and then the moves were all messed up too.) After two moves, I ended the game, because I had placed a piece that should have flipped two of its pieces. It didn’t. Not only that, it actually said, after I made the move, that “no pieces were flipped because there was no sandwich”, or something like that; I’m paraphrasing.

Setting aside ChatGPT’s sudden pang for bready goodness, I confronted it with what had happened, and asked it why it had not seen that my move should trigger pieces flipping. It responded with something like, it had failed to analyse the entire board and instead had analysed the pieces around a certain area. Something like that.

For the sake of discussion, we could say that that’s the type of mistake that a person could make as well… to be sure… but the whole point of turning to computers as tools is precisely because this is not the type of mistake they make. Unless there’s a bug, which gets complained about and fixed. It’s like, a human person might forget to carry the 2 in calculations, but a calculator does not. If the calculator did, we would not be praising it for its near-human capabilities; we’d get a new calculator.

This incident alone just makes me wonder… and there are AI users and proponents around here, so maybe this is a space where the discussion can turn very AI-supportive… how can you trust a tool like this? Don’t you have to be on top of it permanently to make sure it doesn’t make dumb mistakes like this? What about when it makes mistakes that are not dumb, and that you can’t catch? How do you debug that?

I can see AI being useful as a tool for bringing up code examples, to help you understand certain pieces of code, so that you code it yourself; tutorial- or explainer-style. But if you go any further than that… is it really something that is worth your trust?

EDIT - Keeping in mind that, in Othello, if your move doesn’t flip pieces it’s not a legal move. So this underscores how dumb the mistake was.

The point of turning to AI as a tool is precisely that it behaves more like us. It can do tasks that deterministic computers running hand-written code have so far been unable to do, at the cost of giving up some of that determinism.

The same way I can trust a person: by acknowledging that it’ll be wrong sometimes, and hardening the surrounding system so that occasional mistakes aren’t catastrophic. (And by not asking it to do things that I know it’s bad at.)

This is why AI is so useful for coding. It doesn’t need to write perfect code on the first try, because with tests in place, its mistakes can be caught and corrected immediately.

Doesn’t that make them more fallible? Fallible to the point of being no better than a person, really? Don’t you lose the advantage of using a computer? Isn’t it best to just improve your skills, or to use the help of a person who really knows their stuff?

I see. That is probably where I, a non-coder, seem to view it differently. I can trust a person to make mistakes, but I expect a program, by virtue of its specificity, to not make them.

Which brings another point to mind. ChatGPT is anything but specific. Should I have instead put my trust in a boardgame-specific LLM? That makes more sense, and is more likely to give me behaviour I could trust. In which case, what this possibly highlights is that specific, focused LLMs may be trustworthy, whereas general ones like ChatGPT must always be considered to be as fallible as Cliff Clavin.

I think this is kind of like trying to use a shovel to turn a screw instead of a screwdriver. You could maybe kinda do it but it won’t work very well. That doesn’t mean the shovel is a bad tool, it’s just not the appropriate tool for what you’re trying to do.

LLMs are good at certain things (processing, summarizing, and re-writing text, writing code) and bad at other things (mathematics and calculations, deterministic outputs). Even the things LLMs are good at require checking, because an LLM is a probabilistic machine. So people who are serious about using LLMs build guardrails around how they work and what they output, add verification checks, and of course keep human oversight.

LLMs are not programs in the traditional sense. They’re neural networks that do probabilistic text prediction. You can ask the exact same model the exact same question several times and get different responses every time. Sometimes they’re slightly different, sometimes vastly different.
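
To make the "same question, different answer" point concrete, here is a toy sketch of weighted random sampling. It is not how any particular model is implemented: the candidate moves and their probabilities are invented for illustration, and `sampleToken` is a hypothetical helper name.

```typescript
// Toy illustration only: these candidates and probabilities are made up.
type Candidate = { token: string; probability: number };

// Hypothetical continuations a model might weigh for the same prompt.
const candidates: Candidate[] = [
  { token: "d3", probability: 0.6 },
  { token: "c4", probability: 0.3 },
  { token: "h8", probability: 0.1 },
];

/**
 * Pick one candidate at random, weighted by probability.
 * `random` is injectable so the behaviour can be checked deterministically.
 */
function sampleToken(
  options: Candidate[],
  random: () => number = Math.random,
): string {
  const total = options.reduce((sum, o) => sum + o.probability, 0);
  let threshold = random() * total;
  for (const option of options) {
    threshold -= option.probability;
    if (threshold < 0) return option.token;
  }
  // Floating-point rounding fallback: return the last option.
  return options[options.length - 1].token;
}

// "Asking" five times with the same input can produce different answers.
const answers = Array.from({ length: 5 }, () => sampleToken(candidates));
console.log(answers.join(" "));
```

Most runs will mostly print `d3`, but the less likely answers turn up too, which is roughly why repeated prompts can diverge.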

I see. In essence, it’s a different approach, a different paradigm, that is addressed with a different mindset. So the way I approach it, the way I expect it to work, is somewhat akin to turning to Inform 7 without reading the manual and complaining “Hey, it’s supposed to understand natural language, what gives? Why doesn’t this compile?”

I’m not certain the overall awareness of AI and its usage makes that clear, to be honest. But that is probably the one really big problem about AI; it’s marketed as being a panacea for everything, and while that may be overall true, each independent LLM is best used in a specific function; no LLM can or should be expected to do well at every level.

My takeaway from the discussion as it stands, and I do feel clearer on this (thank you, Tara McGrew, and thank you, Storyfall), is that I unwittingly used an LLM for a task it wasn’t meant to be particularly good at, and extrapolated a general failing from that, when the truth is more nuanced. I’m also getting the takeaway that AI really is meant to replace people and the way people work. My personal feelings of still preferring to have a person instead, especially if the AI is mimicking what the person does, are not relevant to the scope of this mini-vent! So I think I’m content with this resolution. :slight_smile:

A parenthesis. I spoke at length with ChatGPT about a number of subjects, asking ChatGPT to adopt a more conversational style so I didn’t feel like I was talking to a machine. I asked ChatGPT to challenge me and give me pushback on my views so I didn’t feel like I was in an echo-chamber. After a conversation of about an hour and a half (if not that long, it felt that long), I left with the distinct impression I had been talking to myself. After interacting with two actual people in this thread, I feel that I’ve grown an itsy-bitsy little bit. Food for thought.

To be fair, AI companies generate a tremendous amount of hype and really do try to make LLMs seem like a panacea. Partly that’s for marketing reasons, and partly that’s because the entire AI business is largely a very expensive loss leader, and they need to raise massive amounts of funding to sustain themselves until it maybe turns profitable in the next 4-5 years.

In my job, AI has automated the drudgery and let me focus on the things I want to do. In that sense, it’s been a boon. I know not everyone’s experience is like this and many people are suffering from this technology. I suppose that’s how it is with any technology, only we’re in an era when new technology adoption can happen at lightning speed (especially when it comes to software), and the displacement it causes in two years is equivalent to what an automation process in a factory might have taken decades to cause.

That’s what a lot of conversations devolve into! AI models are trained to have high sycophancy and make you happy. You’re more likely to be able to get it to seriously disagree with some statement if you say it came from “some other person on the internet”, rather than yourself.

More fallible, yes. But the point of using AI over humans isn’t that it’s infallible; it’s that it’s fast, it doesn’t need to eat or sleep, it has broad knowledge across a huge range of subjects, it can be automated, you can run several of them at once, etc.

Depends on what you mean by “best”. I can hand the AI problems to work on while I go eat dinner. I can ask it for help with some obscure task, for which I don’t know any experts to ask, at 3 AM when the experts are asleep anyway. It’s not as good as having human experts in every field at my beck and call who never tire, but it’s achievable and the latter isn’t.

When there’s a part of the task that can easily be done by regular software, it probably should be. An LLM is probably the wrong tool for something like keeping track of the state of the board.
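
As a rough sketch of that split, ordinary code can own the rules and treat whatever the model suggests as an untrusted proposal. Everything here is hypothetical: `MoveProposer` stands in for a call to a model (a real one would call an API and parse its reply), and the list of legal squares is hard-coded demo data where a real rules engine would go.

```typescript
type Move = { row: number; col: number };

// Hypothetical stand-in for "ask the model for a move".
type MoveProposer = (attempt: number) => Move;

// Deterministic rule check supplied by ordinary code, e.g. an Othello engine.
type LegalityCheck = (move: Move) => boolean;

/**
 * Ask the untrusted proposer for a move, but only accept one that the
 * deterministic checker approves. Returns null if every attempt fails.
 */
function chooseCheckedMove(
  propose: MoveProposer,
  isLegal: LegalityCheck,
  maxAttempts = 3,
): Move | null {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const move = propose(attempt);
    if (isLegal(move)) return move;
    // An illegal suggestion is discarded instead of corrupting the game state.
  }
  return null;
}

// Demo with made-up data: the fake "model" suggests an illegal square first.
const legalSquares = new Set(["2,3", "3,2", "4,5", "5,4"]);
const isLegalDemo: LegalityCheck = (m) => legalSquares.has(`${m.row},${m.col}`);
const fakeModel: MoveProposer = (attempt) =>
  attempt === 1 ? { row: 0, col: 0 } : { row: 2, col: 3 };

const chosen = chooseCheckedMove(fakeModel, isLegalDemo);
console.log(chosen);
```

The model never touches the board directly in this arrangement, so a bad suggestion costs a retry and nothing more.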

Since its arrival, GenAI has been misunderstood. The underlying technology is just a very large database that only contains the matrix of training data it’s been given.

Your Othello example is an excellent instance of using an LLM for a computational outcome. In order to succeed, it would need to have been trained on every mathematical state in Othello.

What GenAI can do is help you write a program to play Othello and it probably knows the rules.

I have always approached GenAI with the skepticism of a software engineer. I don’t care that it can be prompted for code. I care that it’s faster than I am and I can use unit tests and reviews to validate the code.

It is a very different experience building software with GenAI.

So I would say most of us that have found success are usually the ones that are good at breaking complex logic into small chunks and rigorously testing both the chunks and the aggregate of chunks.
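
As a loose illustration of testing the chunks and then the aggregate, here is a small sketch built around the Othello "sandwich" rule. All of the function names are hypothetical, and `expectEqual` is a minimal stand-in for a real test framework.

```typescript
// Minimal stand-in for a test framework's assertion (hypothetical helper).
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

// Chunk 1: count consecutive opponent pieces at the start of a line of cells.
function countRun(line: string, opponent: string): number {
  let n = 0;
  while (n < line.length && line[n] === opponent) n++;
  return n;
}

// Chunk 2: a run only counts if it is closed off by one of our own pieces.
function capturedInLine(line: string, player: string, opponent: string): number {
  const run = countRun(line, opponent);
  return run > 0 && line[run] === player ? run : 0;
}

// Aggregate: total captures across every line radiating from a square.
function totalCaptured(lines: string[], player: string, opponent: string): number {
  return lines.reduce(
    (sum, line) => sum + capturedInLine(line, player, opponent),
    0,
  );
}

// Test each chunk on its own...
expectEqual(countRun("WWB.", "W"), 2, "countRun counts leading opponents");
expectEqual(countRun("BW..", "W"), 0, "countRun stops at first non-opponent");
expectEqual(capturedInLine("WWB.", "B", "W"), 2, "closed run is captured");
expectEqual(capturedInLine("WW..", "B", "W"), 0, "open run is not captured");
// ...and then the aggregate of chunks.
expectEqual(totalCaptured(["WB", "W.", "WWB"], "B", "W"), 3, "captures add up");
```

If generated code breaks one of these functions, the failing label points at the chunk responsible.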

Failure usually comes from those that expect GenAI to do what you ask without validation.

Right, but what was to me the biggest issue was: what does it matter if it can do all those things when the results are not trustworthy? But this issue was cleared up for me: it needs a different approach, and, as you say, a resilient framework that can handle the odd mistake. I’ll point out a particular thing in there: it has broad knowledge across a huge range of subjects. From what I’m getting from all this, the broader the knowledge, the more superficial it is, right? So you may still prefer to use specific LLMs for specific tasks instead of relying on one to have broad coverage.

By “best” I meant “best to use your skills, and the skills of others within your reach, to create something where you know how it works, how it does what it’s meant to, and how to debug it; as opposed to having an LLM work on it and come up with stuff whose particulars you may be hazy about, that you can’t fix directly if you need to, and that may remain a black box to you, which it wouldn’t be if you’d done it yourself”. Now, this is a very very specific definition of “best”, I now realise, and I really shouldn’t hold anyone else to that definition!

EDIT - And of course, not being a programmer, I’m probably ignorant of how it might very well ALREADY work like that when you’re working on code that other people have coded. My experience as the hobbyist has always been that of the lone programmer, and I tend to forget that things aren’t like that. After all, aren’t some I6 and I7 extensions black boxes to me? And do I not still use them and trust them? Food for thought for myself; challenging my own points. Heck, just using the Inform Library is a black box for me (not for others). But it’s also a tool I can trust to behave in certain ways. Except when it doesn’t. :slight_smile:

I think I’m realising that coding is simply something I am much too far from to use meaningfully as an example of any sort, and I should find other examples if I want to discuss this topic.

This is a really interesting example (Othello) because the algorithm for deciding whether an Othello move is legal is so simple that it could be assigned to a human student as a programming exercise they could complete in a short time. If you told ChatGPT “write a computer program to decide if an Othello move is legal” it might get it right on the first try. If you tell it “play Othello” it gets it bonkers wrong.
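
For a sense of how small that exercise is, here is one possible sketch of the legality check. The board encoding (eight strings using `B`, `W` and `.`) and the names `flipsFor` and `isLegalMove` are assumptions made for this example, not anything from ChatGPT or a library.

```typescript
// The eight directions a line of pieces can run in, as [rowDelta, colDelta].
const DIRS = [
  [-1, -1], [-1, 0], [-1, 1],
  [0, -1],           [0, 1],
  [1, -1],  [1, 0],  [1, 1],
] as const;

type Disc = "B" | "W";

/** Count how many opposing discs a move at (row, col) would flip. */
function flipsFor(board: string[], row: number, col: number, player: Disc): number {
  // The target square must exist and be empty.
  if (board[row]?.[col] !== ".") return 0;
  const opp: Disc = player === "B" ? "W" : "B";
  let total = 0;
  for (const [dr, dc] of DIRS) {
    let r = row + dr;
    let c = col + dc;
    let run = 0;
    // Walk over a contiguous run of opposing discs.
    while (board[r]?.[c] === opp) {
      run++;
      r += dr;
      c += dc;
    }
    // The "sandwich": the run only counts if our own disc closes it off.
    if (run > 0 && board[r]?.[c] === player) total += run;
  }
  return total;
}

/** A move is legal only if it flips at least one disc. */
function isLegalMove(board: string[], row: number, col: number, player: Disc): boolean {
  return flipsFor(board, row, col, player) > 0;
}

// Standard opening position, rows listed top to bottom.
const openingBoard = [
  "........",
  "........",
  "........",
  "...WB...",
  "...BW...",
  "........",
  "........",
  "........",
];

console.log(isLegalMove(openingBoard, 2, 3, "B")); // true: sandwiches the W at (3,3)
console.log(isLegalMove(openingBoard, 2, 4, "B")); // false: nothing to flip
```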

I get frustrated that it is increasingly difficult to avoid AI. I was looking at a student made slideshow the other day (I teach chemistry). At the bottom of the Google slide was an invitation “beautify this slide”. Uncertain what that meant, but suspecting it was an integrated AI function, I did a search and discovered this interesting review from Swarthmore. Unpacking the accessibility trap of Google Slides’ “Beautify” feature – Swarthmore College – ITS Blog

It might be helpful to use examples where we did previously have tools with non-deterministic outcomes, but which assigned a probability. You may have seen ‘sentiment analysis’ software that could be fed a comment from social media and a subject name and would spit out results like “Opinion of Bob - Positive: 89%”, or image recognition tools that would say something like “Elephant: 75%; small car: 34%”, where the percentages are a very rough gauge of how likely the result was to be true. In the broader world, we use tools like polls to assess how likely something is to happen even though we know there’s only a possibility of them being correct.

LLMs operate in a similar space; they present answers that are ‘likely to be the next thing humans might have written as a carry on of the current piece of text’ (roughly, but that’s close enough to the real criteria to build a mental model).

This is good, and bad. For constrained spaces like programming syntax, the probability of them being actually correct, rather than just sounding like something a human would say, goes up. Writing unstructured prose about a problem with logical constraints (like Othello) means the probability of a ‘correct sounding’ reply being actually correct goes down.

But there’s no understanding; for instance, LLMs will famously make up the contents of images that don’t actually exist, because in a lot of the material they are trained on there’s plenty of example text that says things like “as you can see from figure 9” and then describes the contents of the image. So if you ask them “what’s in figure 9?” they’ll describe something the image could be even if there isn’t an image at all.

What does this make them good for? That’s an interesting question, but as several other people have already said, asking them to do things where you don’t know how to check the quality of the results is definitely unwise. They are not nearly so general purpose as the companies selling them would like you to think: “quickly do jobs whose correctness is easier to check than the job is to do yourself” isn’t actually that large a problem space.

In theory, yes. Or you can fine-tune a general model for your specific task.

But general models are already pretty good at a lot of things. A recent project of mine required low-level programming/analysis on four different antiquated computer platforms, and writing a lot of assembly code for a system that had basically never been seen by anyone before. I would’ve had to scour the internet to find experts in the former (and instantly wear out their patience with my newbie questions), and there are no experts in the latter, but GPT and Claude were able to chug right through it.

In fact, it can already be like that when you’re working on code that you wrote yourself a year earlier! Having to spend some time getting up to speed in an unfamiliar codebase is par for the course.

It can be worse with LLM-written code though, just because they can tolerate worse code. Code that’s so disorganized that a person would have to improve it before doing any more work on it, an LLM will happily keep going with, making it even more disorganized, if you don’t keep an eye on it.

Personally, this is the thing that actually bothers me most about how the tools are presented (I have separate concerns about how they are made) and about the chat style interfaces. It is very good at sounding like it is having a conversation with you while actually doing nothing of the sort (for instance, this study is one of several about LLMs not meeting social needs https://www.sciencedirect.com/science/article/pii/S0022103126000417). This… doesn’t seem to be very healthy, from the results we’re starting to see. Which is very annoying, because the underlying tech, if presented with a sane interface and trained ethically/legally, is actually good at some things computers have historically been very bad at.

If you’re using the publicly available GPT, you’re using a very old model. The new models are vastly improved in reasoning.

Just for kicks, this is the Othello program Claude Opus 4.6 wrote for me in under 5 minutes.

I have Claude Code creating an Othello game in Typescript and React. My guardrails slow it down, so we’re hitting about 20 minutes.

An article I found interesting makes the case that that’s true only in the same sense that the answers humans give are “likely to be the thing that increased their ancestors’ chances of finding a mate and producing offspring that survived long enough to have offspring of their own”:

This is a great example of how GenAI will do the same thing differently:

This is almost identical to the Othello @rileypb created with Claude.

Yeah, I’m aware that the maths of LLMs is a bit different from, say, just building a longer Markov chain dataset. Hence the parenthetical comment :slight_smile:. That doesn’t alter the fact that I think it’s a helpful model for understanding even the most advanced of the current models as it helps you spot the ways that they fail regularly that are distinct from how you would expect a human to fail at a task :person_shrugging:

And here is the code Claude created in one prompt without error (though my guardrails demand unit tests and those are in a second file):

/**
 * Pure-function game engine for Othello.
 *
 * Public interface: createInitialState, makeMove, getValidMoves, isValidMove
 * Owner: game engine (src/game/)
 *
 * Invariants:
 * - Board is always 8×8.
 * - Black moves first.
 * - A move must flip at least one opponent piece.
 * - When neither player can move, the game ends.
 */

import type { Board, CellState, GameState, Player, Position, GameStatus } from './types';

const BOARD_SIZE = 8;

/** Eight compass directions as [rowDelta, colDelta]. */
const DIRECTIONS: readonly [number, number][] = [
  [-1, -1], [-1, 0], [-1, 1],
  [0, -1],           [0, 1],
  [1, -1],  [1, 0],  [1, 1],
];

/**
 * Return the opponent of the given player.
 * @param player - Current player
 * @returns The opposing player
 */
export function opponent(player: Player): Player {
  return player === 'black' ? 'white' : 'black';
}

/**
 * Create a fresh 8×8 board with the four centre pieces placed.
 * @returns A new Board
 */
function createEmptyBoard(): Board {
  const board: Board = Array.from({ length: BOARD_SIZE }, () =>
    Array.from<CellState>({ length: BOARD_SIZE }).fill('empty'),
  );
  // Standard Othello opening: white on d4/e5 diagonal, black on d5/e4.
  board[3][3] = 'white';
  board[3][4] = 'black';
  board[4][3] = 'black';
  board[4][4] = 'white';
  return board;
}

/**
 * Deep-clone a board so mutations don't leak.
 * @param board - Board to clone
 * @returns Independent copy
 */
function cloneBoard(board: Board): Board {
  return board.map(row => [...row]);
}

/**
 * Return positions of opponent pieces that would be flipped
 * if `player` places a piece at (row, col) along one direction.
 * Returns empty array if no flips occur in that direction.
 */
function getFlipsInDirection(
  board: Board,
  row: number,
  col: number,
  dRow: number,
  dCol: number,
  player: Player,
): Position[] {
  const opp = opponent(player);
  const flips: Position[] = [];
  let r = row + dRow;
  let c = col + dCol;

  // Walk outward, collecting opponent pieces.
  while (r >= 0 && r < BOARD_SIZE && c >= 0 && c < BOARD_SIZE && board[r][c] === opp) {
    flips.push({ row: r, col: c });
    r += dRow;
    c += dCol;
  }

  // The line must terminate with our own piece to count.
  if (
    flips.length > 0 &&
    r >= 0 && r < BOARD_SIZE &&
    c >= 0 && c < BOARD_SIZE &&
    board[r][c] === player
  ) {
    return flips;
  }

  return [];
}

/**
 * Return all opponent pieces that would be flipped if `player` places at (row, col).
 * @param board - Current board state
 * @param row - Target row
 * @param col - Target column
 * @param player - Player making the move
 * @returns Array of positions that would be flipped (empty if move is invalid)
 */
export function getAllFlips(board: Board, row: number, col: number, player: Player): Position[] {
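  // Guard against out-of-range coordinates before indexing the board;
  // otherwise board[row] can be undefined and the lookup below throws.
  if (row < 0 || row >= BOARD_SIZE || col < 0 || col >= BOARD_SIZE) return [];
  if (board[row][col] !== 'empty') return [];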

  const flips: Position[] = [];
  for (const [dRow, dCol] of DIRECTIONS) {
    flips.push(...getFlipsInDirection(board, row, col, dRow, dCol, player));
  }
  return flips;
}

/**
 * Check whether placing at (row, col) is a valid move for `player`.
 * @param board - Current board state
 * @param row - Target row
 * @param col - Target column
 * @param player - Player to check
 * @returns True if the move is legal
 */
export function isValidMove(board: Board, row: number, col: number, player: Player): boolean {
  return getAllFlips(board, row, col, player).length > 0;
}

/**
 * Return all valid moves for `player` on the given board.
 * @param board - Current board state
 * @param player - Player to find moves for
 * @returns Array of valid positions
 */
export function getValidMoves(board: Board, player: Player): Position[] {
  const moves: Position[] = [];
  for (let r = 0; r < BOARD_SIZE; r++) {
    for (let c = 0; c < BOARD_SIZE; c++) {
      if (isValidMove(board, r, c, player)) {
        moves.push({ row: r, col: c });
      }
    }
  }
  return moves;
}

/**
 * Count pieces on the board for each player.
 * @param board - Board to count
 * @returns Tuple [blackCount, whiteCount]
 */
function countPieces(board: Board): [number, number] {
  let black = 0;
  let white = 0;
  for (let r = 0; r < BOARD_SIZE; r++) {
    for (let c = 0; c < BOARD_SIZE; c++) {
      if (board[r][c] === 'black') black++;
      else if (board[r][c] === 'white') white++;
    }
  }
  return [black, white];
}

/**
 * Determine game status from scores and whether any moves remain.
 */
function determineStatus(
  blackScore: number,
  whiteScore: number,
  hasMovesForEither: boolean,
): GameStatus {
  if (hasMovesForEither) return 'playing';
  if (blackScore > whiteScore) return 'black_wins';
  if (whiteScore > blackScore) return 'white_wins';
  return 'draw';
}

/**
 * Create the initial game state with the standard Othello opening.
 * @returns Fresh GameState ready for Black's first move
 */
export function createInitialState(): GameState {
  const board = createEmptyBoard();
  const validMoves = getValidMoves(board, 'black');
  return {
    board,
    currentPlayer: 'black',
    status: 'playing',
    blackScore: 2,
    whiteScore: 2,
    validMoves,
    lastMoveWasPass: false,
  };
}

/**
 * Apply a move and return the resulting game state.
 *
 * DOES: Places `player`'s piece at (row, col), flips captured pieces,
 *       advances the turn (skipping if next player has no moves),
 *       and ends the game if neither player can move.
 * WHEN: Called with a valid move position for the current player.
 * BECAUSE: This is the core state-transition function — every game
 *          action flows through it.
 * REJECTS WHEN: The position is not a valid move (returns null).
 *
 * @param state - Current game state
 * @param row - Row to place piece
 * @param col - Column to place piece
 * @returns New GameState after the move, or null if the move is invalid
 */
export function makeMove(state: GameState, row: number, col: number): GameState | null {
  const { board, currentPlayer } = state;

  const flips = getAllFlips(board, row, col, currentPlayer);
  if (flips.length === 0) return null;

  // Apply the move on a cloned board.
  const newBoard = cloneBoard(board);
  newBoard[row][col] = currentPlayer;
  for (const { row: fr, col: fc } of flips) {
    newBoard[fr][fc] = currentPlayer;
  }

  // Determine next player, handling pass.
  const next = opponent(currentPlayer);
  const nextMoves = getValidMoves(newBoard, next);

  if (nextMoves.length > 0) {
    // Normal turn advance.
    const [b, w] = countPieces(newBoard);
    return {
      board: newBoard,
      currentPlayer: next,
      status: 'playing',
      blackScore: b,
      whiteScore: w,
      validMoves: nextMoves,
      lastMoveWasPass: false,
    };
  }

  // Next player has no moves — check if current player can still go.
  const currentMoves = getValidMoves(newBoard, currentPlayer);
  const [b, w] = countPieces(newBoard);

  if (currentMoves.length > 0) {
    // Pass: turn stays with current player.
    return {
      board: newBoard,
      currentPlayer,
      status: 'playing',
      blackScore: b,
      whiteScore: w,
      validMoves: currentMoves,
      lastMoveWasPass: true,
    };
  }

  // Neither player can move — game over.
  return {
    board: newBoard,
    currentPlayer: next,
    status: determineStatus(b, w, false),
    blackScore: b,
    whiteScore: w,
    validMoves: [],
    lastMoveWasPass: false,
  };
}