Mini-vent about AI: if it failed on "this", how can it be relied on for "that" EDIT: Nothing "mini" about this anymore!

I’m saying that specifically the general-purpose multi-task performance lives in the pre-training (EDIT: Depending what you mean by “lives in,” obviously you need both an appropriate training regime and an architecture capable of successfully learning the task). Obviously you can’t demonstrate that on a model that doesn’t demonstrate general-purpose multi-task abilities in the first place.

EDIT: It’s not clear to me what sort of experiment you’re even proposing. We’ve known from the beginning that you can achieve decent multi-task performance without fine-tuning based just on language modeling (see the original GPT-3 white paper), and conversely, to the best of my knowledge no one has attempted to train a usable general-purpose multi-task model under any training regime that doesn’t start with a pretrained language model, because the likelihood of it working is understood to be so low that it’s not worth spending the time and money on.

EDIT2: To clarify, I’m not claiming there couldn’t be some as-yet-undiscovered architecture and/or as-yet-undiscovered training regime that challenges this (and if we get into other modalities, you could probably argue that sufficiently large video generation models also display “general-purpose reasoning” abilities of a sort). I’m merely saying that as far as currently-existing AI technologies are concerned, the pretrained language model is still the core driving technology.

Sorry, I wasn’t trying to propose one; I thought you were. I was just trying to feel around the corners of the argument to be sure I understand it.

I mean I guess if you’re talking about “multi-task performance” you’re talking about something like one of the recent MMLU or IFEval benchmarks. But I think e.g. transformer networks wildly outperform RNNs at those sorts of tasks. So…raw perplexity? I don’t know. My understanding is that you were rejecting “the claim that ‘almost all of the intelligent behaviour’ comes from RLHF”. Which I took to be roughly equivalent to claiming that “almost all of the intelligent behavior” doesn’t come from RLHF. After a couple of clarifying questions (i.e. that you’re not talking about RLHF specifically) I thought you were essentially making the claim that “almost all of the intelligent behavior” comes from pretraining/next-token prediction.

Which doesn’t strike me as a metaphysical claim beyond the reach of empirical investigation. And if it isn’t something that only manifests itself in high-end frontier models (that is, low-end local models would perform indistinguishably from each other by [whatever metric], but if you scaled them up to hundreds of thousands of tokens or millions of parameters then [whatever metric] would be measurably different) then you could enunciate the argument as a testable hypothesis. Which is what I’m trying to do. Because that’s apparently just how I think.

What I came into the conversation with was the belief that the “intelligent” stuff that an LLM can accomplish, to the extent that an LLM can accomplish “intelligent” stuff at all (which I think needs to carry a bright flashing red asterisk surrounded by mid-'90s “Under Construction” animated GIFs) is not separable in the sense that argument’s framing implicitly suggests. But, you know, I’m always willing to be convinced.

Well, I’m not talking about IFEval, because that specifically tests “instruction following” and I’m specifically complaining about attributing too much significance to the fact that modern LLMs follow instructions, as opposed to displaying similarly complex reasoning abilities (translation, code generation, question answering, common-sense reasoning, combinations thereof, etc.) in a less controlled, less easily induced/observed, more context-sensitive manner.

MMLU would be closer to being appropriate, assuming we allow some experimenting with how to prompt the non-instruction-following LLM during evaluation in order to give it a fair shot.

I am confused about how the comparison between architectures is coming into this. I’m not claiming that the architecture has no effect on model performance, only that my claim about the relative importance of pre-training compared with other optional subsequent fine-tuning steps should hold for any sufficiently “intelligent” model (unless we’re talking about some as-yet-undiscovered, paradigm-shifting novel architecture).

This can indeed be tested for a given architecture, so long as the architecture in question is capable of doing well enough on MMLU-or-whichever to trust that enough “intelligent behaviour” is present to meaningfully interpret the results as saying something about “intelligent behaviour.”

The question of what exactly the threshold is to be considered “intelligent behaviour” is, of course, subjective and debatable. But the point is that if you were to train some insufficiently good model under various conditions and get scores in the range of, say, 35% to 40% on MMLU, I think one could reasonably object to this necessarily demonstrating the presence of real “understanding” as opposed to educated heuristic guessing, which would in turn undermine the ability to interpret the experiment as saying anything about where the “reasoning ability” comes from.

I know this is kinda vague! The article I was originally objecting to seems to claim that there is some sort of substantial qualitative difference between intelligent behaviour that goes beyond next-token prediction and whatever it is that GPT-2 specifically does. I’m just doing my best to acknowledge that modern LLMs are meaningfully “smarter” than GPT-2, while objecting to the idea that this is specifically demonstrated by the fact that they’ve been fine-tuned to follow instructions and not call you slurs!

I thought I was making that claim! I just don’t understand what you’re interpreting that claim as meaning.

I mean, the article I’m objecting to specifically suggested that whatever it is that makes modern LLMs “more than just next-token predictors,” it’s something that specifically doesn’t manifest in GPT-2, so it seems to me that yes, there is a reasonable risk that the “intelligent behaviour of modern LLMs which is not remotely the same as GPT-2,” whatever that means, does not observably manifest in smaller models.

“Separable” from what? I don’t even know what the argument is at this point.

Well, if the argument is that the performance on some metric is mainly attributable to a specific component of the pipeline, namely the next-token prediction, then if it were true that you could e.g. tune an RNN to perform similarly to a transformer network then that would be at least supportive of the argument.

But more broadly we seem to have a problem where we have some notion of what “strong enough” is but no agreement on what specifically that means (in terms of cleanly enunciable criteria). There does, however, appear to be some consensus around some specific systems that are, in aggregate behavior, “strong enough”. So it seems like working backward from there makes sense. I mean to me, anyway.

To be clear I’m not trying to argue the point, I was just putting my cards on the table to (hopefully) give a better idea of where I was coming from.

But my belief is that next-token prediction and the transformer architecture are both necessary but not sufficient. In other words I think trying to claim that modern models have moved beyond being next-token dispensers is wrong, but it’s also wrong to claim that they’re “mainly” next-token machines or whatever. I think there have been a lot of next-token systems over the years and the performance of the newest systems isn’t just the result of scaling-by-dollars. I mean I also think there’s some of that. And I think that the perceived performance of these systems is being grossly over-estimated. And a lot of that is attributable to RLHF.

But ignoring the fact that e.g. Claude Code/Opus 4.6 has an affect that I find distractingly like that of a three-card monte dealer, it is also the case that it produces substantially better code than any prior model. With the caveat that I think that it’s still actually really bad at code generation, and that a lot of that gets obscured by human operators getting Clever Hans’d into cleaning up after it.

But, to be clear, I’m not presenting this as an argument against your position; I wasn’t and am not actually trying to argue against your position at all. I was initially actually just trying to figure out if you were making a narrow argument against RLHF. And for the record I don’t think that there’s anything magical about RLHF specifically as a tuning mechanism.

I mean, of course it’s true that you need both a sufficiently powerful architecture and a way to train it. It’s just that I feel like the whole “they’re just next-token predictors” argument has always been primarily about what LLMs are trained to do, and the architecture doesn’t enter into that particular conversation much. The architecture just controls how capable the model is of successfully being trained to do whatever it is we’ve trained it to do.

But I guess I see the point that “LLMs haven’t gotten better because of RLHF, they’ve gotten better for reasons not related to the training objective, like scale or architectural improvements” is the other side of the same coin, so I see what you’re getting at.

EDIT: just an aside, but arguably the biggest individual identifiable improvement is chain-of-thought reasoning, which I consider to kinda be in a third category (call it “usage tricks” or something), rather than either architecture or training (although modern training takes it into account).

Clearly we are lucky that the corpus of human writing contained plenty of chain-of-thought explanations.

On the subject of trusting AI, an anecdote.

I was recently in the process of scanning some music files from PDF onto Sibelius (notation software). I scan them according to page numbers of the PDF, but then I name the individual pieces of music according to the page number that shows on the actual PDF page, which sometimes matches but often doesn’t.

This is brief context to say that I found myself in the process of realising I’d made a mistake with the numbering of some files, so had to manually go in and rename them all.

It wasn’t difficult, there weren’t many files. But I found myself thinking the following:

I have BulkRename, a very useful program. If I was a bit more skilled in reg-ex, I might have used it to possibly rename everything. After all, the format of every single file name ended in “pXX.opt”, where XX was the page number. I only needed to decrease XX by one. There are so many coders around here, this is probably trivial to the max. Me not being a coder, I’d waste a lot more time trying to figure out how to do it than actually doing it.

And I’d trust the result. If I knew the necessary code, I’d put it in, run the program, and be satisfied with the results.

Alternatively, I could tell an AI what I wanted to do. That seems simpler, I thought. Kinda like asking someone else to do it for me; this is something that requires a bit of “code” to explain to a “computer”, but is easily explained to a person, and therefore hopefully to an AI.

Then I thought, ah, but I would need to make sure that AI didn’t mess it up. I trust a person’s understanding of this more than I trust AI. Even if AI got it right, I’d have to go back and make sure that each file individually was properly renamed.

…and there I had it: I realised that if I were to use AI for this task, I would have to double-check it to the point where I might as well do it myself, whereas if I knew the code to programmatically do it myself, I could trust the results almost blindly. More than that; if I noticed a mistake, I could go back over the code and redo it, but if the AI made a mistake I could… try to explain it a different way so that the AI got it?

I am biased, it is true. This bias makes me not trust AI. Hence, this situation occurs. Because I am biased, and this is merely an anecdote, it has very little useful value. Arguments can be made both ways, poking this anecdote full of holes.

I still thought it was relevant to share.

…even if the conversation has shifted to a level that is a couple hundred kilometers over my head.

Of course you could ask the AI to just give you the regex instead of trusting it with the files - for simple cases that’s probably more reliable than a coder, i.e. coders screw up regexes all the time. Either way you need to double check the results if you care about them.

Indeed, that’s holes number one and two of many. But the task I want to achieve is one that does appear exactly suited for an LLM because, as I said, it takes a bit of code to give this instruction to a computer but to a brain it’s pretty simple to parse “deduct 1 from the number that appears after the lowercase p at the end of every filename”.

Yet I’ve seen enough of AI to not trust it even with this small thing. If I ask it for the regex code, I’m asking it to bypass what it was meant to do and instead find me the way to do it “computeristically”.

It could be that AI would do this task very well. It could also be that it wouldn’t. There are too many examples of AI doing strange things that people wouldn’t - partly because people are held accountable (how nicely this ties in with some posts from before).

So the net result is, I actually had a task that might have been perfect for AI, and considered it (how did I consider it, if I don’t have any sort of AI subscription or anything? Well, I considered the possibility that maybe this was a case in which it would be useful and wouldn’t be averse to trying it), then decided I couldn’t trust it and it would be a lesser hassle to learn the reg-ex myself (the way I usually do it: google, research a bit, try to kinda at least see the logic of what I’m doing so it doesn’t look like black magic. Which reg-ex really looks like). If the number of files were bigger, I’d have done so. As it is, I just did it file-by-file.

And because this is a personal instance where I made a personal decision, it’s only an anecdote, where others would decide differently. But, in gauging the overall perception of AI, and trust in it, these anecdotes are, I think, a useful barometer. Hopefully.

As far as double-checking the results, I had to experiment a bit with BulkRename to make sure I got the results I expected all the time, especially when I wanted it to add sequential numbers. I have to make sure that these sequential numbers start and end on the right place. I now know how to check it quickly, and know I can trust the results. And indeed I always can. The checking I do is pretty perfunctory right now, because I know that, if one or two things are true, the rest will be correct.

And in AI, as I understand it, this is also true. The “guardrails” to have in place.

But what all of the discussion hasn’t quite convinced me of, yet, is that AI can be trusted to always abide by those guardrails.

Or that AI can be trusted not to be “imaginative” in the way it resolves the problem I’m giving it to resolve.

So I don’t imagine that I could just “check this couple of things, that’s good, I can run the program” with an AI. It seems to require a much bigger effort in double-checking the results.

It seems to me, therefore, like it requires significantly more oversight and care on my part - and is potentially more problematic - than if I just did it myself one way or the other.

…assuming, naturally, that I took the time to test some reg-ex codes previously to ensure they worked. And how is that different from spending some time with the AI to make sure that the AI knew what I wanted it to do? Honestly? After I test my code, I can be satisfied that it will work as intended, partly because I understand the logic. But after training the AI, I’ll still have to double-check because it still might go off the rails.

***

I must look pretty silly, criticising AI over Othello and “my untrustworthiness of its ability to rename files”. Consider it the layman’s POV when faced with the current panorama. It’s probably an important POV too. Because the current panorama, much as the AI proponents would like it to be otherwise, does not inspire confidence in people like me. And I don’t know the percentage of people like me in the world, but there’s quite a few of us.

I wouldn’t say that’s bypassing what it was meant to do. AI coding assistants do that kind of thing all the time, both explicitly (writing a regex based on the user’s specification) and implicitly (deciding to do a massive search-and-replace task with a regex instead of “manually”).

What it’s meant to do is solve problems, and the most sensible way to solve your problem when there’s a regex-based renaming tool available is to use the tool. There’s a good chance that if you asked an AI assistant to rename those files, and it knew you had the tool installed, that’s what it would do. It’s not only less error-prone that way, it’s also cheaper.

A good perspective! I can see that, yes. We’re really talking about AI being “just another tool in the bag” then. Not even a master tool. Not even a tool of preference. Just a simple flathead screwdriver that really should not be used to try to turn a phillips screw.

So my obstinacy in acting as though the AI is the master tool to supplant all tools (in my defense, that’s the hype I see, except when I talk to more level-headed people like in here) is mostly to blame for my perpetual disappointment. Like complaining that a screwdriver can’t properly hammer a nail.

I’ll try to remember this post before I write the same thing a third time.

(I actually did think this was an example of a situation for which AI was an appropriate tool. I clearly still haven’t gotten it, then)

The one tool to rule them all is certainly how much of the marketing presents it, and that might even be the ultimate goal of many AI researchers focused on capabilities with good intentions. And there are probably a lot of people who believe the marketing hype either because they aren’t knowledgeable enough to notice the limitations or haven’t thought to try in a context that would reveal the limitations, but I think it’s fair to say most who are honest about current capabilities are in agreement it’s not yet at the point where you can use AI for everything without checking its work, and sometimes, checking its work is more trouble than the AI is worth.

And funnily enough, despite using rename.ul pretty much daily either directly or through a bash script I wrote, I have no clue how to rename a set of numbered files to increment/decrement their numbers short of manually renaming each file individually.
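For what it's worth, the renaming task described earlier in the thread (decrease the number after the lowercase p in every name ending in "pXX.opt") can be sketched in a few lines of Python. This is only a sketch under assumptions: the function names are made up for illustration, the exact filename shape is inferred from the description above, and the zero padding is assumed to stay the same width. It builds the rename plan first so the plan can be inspected before anything on disk changes.

```python
import re
from pathlib import Path

# Assumed filename shape: anything, then "p", digits, then ".opt" at the very end.
PAGE_SUFFIX = re.compile(r"p(\d+)\.opt$")


def shifted_name(name, delta=-1):
    """Return name with its trailing page number shifted by delta.

    Names that do not end in pXX.opt are returned unchanged.
    """
    match = PAGE_SUFFIX.search(name)
    if match is None:
        return name
    digits = match.group(1)
    number = int(digits) + delta
    if number < 0:
        raise ValueError(f"page number would drop below zero: {name}")
    # Keep the original zero padding, e.g. "07" becomes "06".
    return f"{name[:match.start(1)]}{number:0{len(digits)}d}.opt"


def plan_renames(names, delta=-1):
    """Return (old, new) pairs in an order that never overwrites a pending file."""
    pairs = [(old, shifted_name(old, delta)) for old in names]
    pairs = [(old, new) for old, new in pairs if old != new]
    targets = [new for _, new in pairs]
    if len(set(targets)) != len(targets):
        raise ValueError("two files would end up with the same name")
    # When decrementing, move low page numbers first so p04 is gone
    # before p05 takes its place; when incrementing, do the reverse.
    pairs.sort(
        key=lambda pair: int(PAGE_SUFFIX.search(pair[0]).group(1)),
        reverse=delta > 0,
    )
    return pairs


def apply_renames(folder, delta=-1, dry_run=True):
    """Plan the renames for every file in folder; only touch disk if dry_run is False."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir() if p.is_file()]
    plan = plan_renames(names, delta)
    if not dry_run:
        for old, new in plan:
            (folder / old).rename(folder / new)
    return plan
```

Calling `apply_renames(folder)` with the default `dry_run=True` just returns the plan, so the "check one or two things, then trust the rest" habit still applies: look at the first and last pair, then run it again with `dry_run=False`.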

As I see it, there’s always a tradeoff between task-specific tools (which can assume a lot of stuff for you, specifically because those assumptions were baked in when creating the tool) and general-purpose tools (which can be used for a wider range of things but require you to do a lot more work manually specifying what you want done, since the tool doesn’t make assumptions).

A toaster is the perfect tool for making toast, because all I have to do is put the bread in and press the single “toast” button (or lever/plunger/whatever you want to call it), and it’ll make toast. And I can adjust the dial to indicate how toasted I want the toast. There are no other controls to adjust the temperature of the heating element or the exact proximity of the bread to the heating element or whatever, which is good because I have no idea what the optimal temperature or distance is, and I trust the designers to have figured that out for me so I don’t have to think about it. The problem, of course, is that all a toaster can do is make toast.

A scripting language is an extremely powerful tool. I can use Python to do basically anything that a computer can do. But it’s a lot of work, because I have to specify step-by-step what I want the computer to do, and as a result it’s not a good choice for tasks where a dedicated tool exists. You could write a Python script to generate a pdf document, but in 99% of cases it’s much easier to use a word processor (or a text editor + pandoc, or something).

It seems to me like “modern AI” is trying to be the best of both worlds: you can tell it to do anything, in the level of detail that you personally care about, and it’ll fill in the gaps for you. But in practice, it has serious shortcomings in both “directions.” On the one hand, it doesn’t give you very direct control over the things that you actually ask for, partly because natural language is often imprecise, and partly because the AI is nondeterministic and its abilities are not clearly enumerated and defined, so even when given a clear instruction you can’t necessarily know for sure what it’ll do. On the other hand, you also can’t rely on the AI to “fill in the gaps” reliably, because it doesn’t have task-specific assumptions built in to it. People often end up needing to use prompts like “Answer all questions truthfully. Do not delete files without asking me for permission. Do not modify files that I did not ask you to modify. Do not produce violent or sexual imagery,” and other lists of things that most users really shouldn’t have to anticipate and specify! But you can’t build those assumptions in to the base model (e.g. using fine-tuning or RLHF), because for some use-cases users might want the LLM to be willing to delete files without asking, or produce false statements in certain settings, or produce sexual imagery.

Do regular AI users have something somewhat like a “template”? Like, if I want all my Inform games to have certain verbs, I include a private extension in all my projects so they all share code… it makes sense for regular AI users to have a similar thing, I suspect, a document that they ask AI to load in the beginning of the session with all of those things. Is that “a thing”? Or maybe do paid models/subscriptions have the option to define all those things separately in a way that always pre-loads them?

Or are the models supposed to remember them and keep them in memory? As I keep saying, that seems to have been shown to be fallible…

A post was split to a new topic: “AI Interactive Fiction Web Browser”

Exactly this. You can set your own private instructions in the ChatGPT settings.

I think it’s a little difficult to talk about because I don’t think we (collective “we”) really understand, except in broad terms, why certain approaches work as well as they do. In the sense that self-attention wasn’t originally intended to be the big, epochal shift that it turned out to be. It wasn’t like it was a result of reasoning from first principles that predicted a priori that it would be as successful as it ended up being. It was a surprise. To be sure it wasn’t just out of the blue or a complete accident or anything. It’s easy to see “simple” architectural advantages, like being more parallelizable than RNNs. But even knowing how they work at scale and understanding the underlying mathematics, the success of multi-head attention…just doing a bunch of independent projections and then concatenating them together…feels really counterintuitive.
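To make "a bunch of independent projections and then concatenating them together" concrete, here is a bare-bones sketch of multi-head self-attention in plain Python. It is a toy under stated assumptions: the weight matrices are whatever the caller passes in (nothing here is trained), and masking, biases, positional information, and batching are all left out. The function names are illustrative only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    # One row of scores per token: how much it attends to every token.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_o):
    """Run each head independently, concatenate per token, then mix with w_o.

    heads is a list of (w_q, w_k, w_v) triples, one triple per head.
    """
    outputs = [attention_head(x, *head) for head in heads]
    concatenated = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concatenated, w_o)
```

The only place the heads interact is the final multiplication by `w_o`; before that, each head works on its own projection of the same input, which is the part that feels too simple to work as well as it does.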

And while there’s an active literature of doing something like sentiment analysis for embedding matrices (to be able to say what a particular transform “means”), that’s still very hand-wavy and imprecise. In my WIP I’ve implemented a (very simple) FFN for mental/emotional-state-based NPC decision-making. And in that kind of model, with hand-picked dimensions and “manually” configured weights I can look at a particular decision and clearly enunciate what it “means”: mental state vector had these values, it maps via the embedding matrix to these affordance values, and so on. But you really can’t do that with an LLM. Which means you end up only being able to talk about “why” a particular result was better or worse than another one either very imprecisely but in terms that can be generalized, or very precisely (this input was mapped to that token vector, and so on) but in terms that can’t be generalized.
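As an illustration of what "hand-picked dimensions and manually configured weights" buys you, here is a toy sketch of that kind of interpretable decision model. Everything in it is hypothetical: the state dimensions, the affordances, and the weights are invented for the example (it is a single linear layer, not the actual network from the WIP), but it shows how every decision can be read back as named contributions.

```python
# Hypothetical dimensions, chosen only for illustration.
STATE_DIMS = ("fear", "hunger", "curiosity")
AFFORDANCES = ("flee", "eat", "explore")

# Hand-set weights: WEIGHTS[i][j] is how strongly state dimension i
# pushes the NPC toward affordance j.
WEIGHTS = [
    [1.0, -0.5, -0.8],  # fear
    [-0.2, 1.0, 0.1],   # hunger
    [-0.3, 0.0, 1.0],   # curiosity
]


def contributions(state, affordance):
    """Return each state dimension's contribution to one affordance's score."""
    j = AFFORDANCES.index(affordance)
    return {dim: state[dim] * WEIGHTS[i][j] for i, dim in enumerate(STATE_DIMS)}


def score_affordances(state):
    """Return the total score for every affordance."""
    return {a: sum(contributions(state, a).values()) for a in AFFORDANCES}


def decide(state):
    """Pick the highest-scoring affordance."""
    scores = score_affordances(state)
    return max(scores, key=scores.get)


def explain(state):
    """Describe the winning decision in terms of named state dimensions."""
    choice = decide(state)
    parts = contributions(state, choice)
    detail = ", ".join(f"{dim} {value:+.2f}" for dim, value in parts.items())
    return f"{choice}: {detail}"
```

With a state like `{"fear": 0.9, "hunger": 0.4, "curiosity": 0.2}`, `explain` reports that the NPC flees and that fear supplies almost all of the score. Each weight has a name attached, which is exactly what is missing when the dimensions are learned rather than chosen.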

I think there are bits and pieces that you can pick out and for example identify that gradient descent works not because of some particular suitability of it specifically but rather its suitability falls out naturally due to the topology of high-dimensional manifolds (it works because, in high dimensions, the points where the gradient vanishes tend to be saddle points instead of bowl-shaped local minima). But I don’t think this is generally true. With the caveat that I’m not an AI researcher, just a dedicated amateur.

Which is a long-winded way to say it wouldn’t surprise me if there were a lot of optimizations that could be done independent of the…training rituals or whatever you want to call them that have been developed for the frontier models. For that matter it wouldn’t surprise me if some amount of e.g. RLHF is orthogonal to better “general problem solving” or whatever you want to call it.

There are actually a bunch of different mechanisms for this kind of thing.

The simplest are things like what Claude Code calls “memory files”. They’re the closest to the sort of thing you’re probably thinking of. They tend to be human-readable markdown files that just have a bunch of declarations: “Don’t report something as a bug until you’ve run a test case that confirms it” and that kind of thing. These are generally always read, and as a result always eat into the context window (how much stuff the LLM can keep track of in a single problem, which is usually set by the terms of service).

There are also things like what Claude Code calls “skill files”, which are task-specific. Those are supposed to be loaded automagically into context when necessary and not use context space otherwise, but in practice I’ve found myself frequently having to explicitly invoke them, particularly with multi-agent stuff (when a session involves multiple gen AI processes working semi-independently on different parts of the problem).

Those are the kind of things most end-users interact with. There are also more heavyweight solutions, like CAG and RAG, which are Cache-Augmented Generation and Retrieval-Augmented Generation. CAG is basically like creating a context-specific mini-embedding for a specific corpus…in other words, for a specific problem, having the LLM create something like a l’il LLM that’s only about a particular dataset. It’s not actually that, but I’m not sure how to better explain it in nontechnical terms…it’s basically not human-readable, it’s basically the LLM pre-computing what it would need to compute when handling queries associated with a specific subject. So then when you do query it about that subject there’s less overhead.

RAG is even more heavyweight than CAG, and is a way to add large chunks of data to the model. This doesn’t do the same sort of precompute as CAG, which means it adds overhead, and it’s not context-specific. It tends to get used for things like integrating organization-specific information (a company’s Jira tickets, org chart, or whatever).

I’m sure there are probably a couple of other approaches that I’m not familiar with.

To give an idea of what a workflow or whatever looks like, I’ve been struggling with trying to build a meaningful test harness for TADS3 stuff in Claude Code. I’d like to be able to hand it a game and have it identify things like nouns mentioned in descriptions that aren’t recognized or generate default responses, inconsistencies in exits (not mentioned in description, mentioned but the direction is wrong, connection between rooms not symmetrical), and so on.

For that there’s some stuff that lives in memory files that are always active (like the “don’t report bugs without confirming them”, which is stated a little differently but that’s the idea). There’s a separate skill file for TADS3 testing. It outlines things like an MCP harness for doing interactive testing (Claude defaults to trying to interact with things via cat, echo, and grep, which is not great for this kind of testing). And then I usually have a [project_name]_spec.md file describing the design overview of the specific project, and a [project_name]_session.md that gets updated with progress notes. These latter two make explicit mention, when relevant, of the other bits (like the skill file). And so in a typical session I’ll start by directing Claude to read the relevant project session file, which will get it up to speed on the overall task (using like a third of the context window in the process).

Of these, the last (the session file) gets frequently updated, and the rest only get occasional tweaks. Although the skill file had a lot more churn early on, because TADS3 is poorly represented in Claude’s corpus, and so there are a number of things that needed to be explicitly called out to prevent the same errors from cropping up in every session.

The direct Othello move error is likely related to the same reason that LLMs have had trouble with arithmetic and counting the number of “r”s in “strawberry” correctly: tokenization. They don’t see the Othello board directly, but a token-encoding of it. And then the LLM has to figure out how to “read” the board from that.

Telling LLMs to write code is a bypass around the direct tokenization problem.

LLMs can’t roll fair dice on their own. From a purely simulationist standpoint, I’m of the position that they can’t ever be trusted absolutely.

As for writing code, we’re seeing problems with LLMs reward hacking, or underperforming while masking it, or overclaiming about what they “did”. All LLM outputs must be verified. Failure to do so is a bet on not getting stochastically burned. SOTA and coding assistants have increased the range of possible failures, despite increasing LLM capabilities.
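On the dice point, one way to actually check a batch of rolls (wherever they came from) is a chi-square goodness-of-fit test against a uniform die. The sketch below is a minimal version under assumptions: it only handles a six-sided die at the 5% significance level, the function names are made up, and the list of rolls is something you would have to collect yourself. It says nothing about what any particular model would produce.

```python
from collections import Counter

# Chi-square critical value for 5 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF5_P05 = 11.070


def chi_square_uniform(rolls, sides=6):
    """Return the chi-square statistic of rolls against a fair die."""
    if not rolls:
        raise ValueError("need at least one roll")
    if any(face < 1 or face > sides for face in rolls):
        raise ValueError("roll outside the die's range")
    counts = Counter(rolls)
    expected = len(rolls) / sides
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, sides + 1)
    )


def looks_fair(rolls):
    """True if a six-sided sample is not rejected as unfair at the 5% level."""
    return chi_square_uniform(rolls, sides=6) <= CHI2_CRITICAL_DF5_P05
```

A passing test is only "no evidence of bias in this sample", not a guarantee, and it does not check whether successive rolls are independent. If fairness matters, the safer route is the same as with the regex: have the tool call a real random number generator instead of producing the numbers itself.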

As I mentioned earlier in the thread, there is actually research on this specific exact topic. If you train a small transformer specifically on Othello game transcripts, it does figure out how to represent the board pretty reliably.

But also, the problem isn’t really similar to the strawberry thing, because each move is going to end up being tokenized separately (so the model technically has access to all the information it needs), unlike how “strawberry” is a single token that doesn’t get broken down into individual characters.
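A toy sketch of the distinction being drawn here, with a made-up vocabulary. Real tokenizers use subword schemes and may split text differently from this, so the vocabulary and IDs below are purely hypothetical; the point is only what the model receives as input in each case.

```python
# Hypothetical vocabulary: one ID per whole word or per Othello move.
TOY_VOCAB = {"strawberry": 0, "d3": 1, "c5": 2, "f6": 3, "e3": 4}
ID_TO_TEXT = {token_id: text for text, token_id in TOY_VOCAB.items()}


def encode(text):
    """Turn whitespace-separated text into the IDs the model actually sees."""
    return [TOY_VOCAB[word] for word in text.split()]


def count_letter(token_ids, letter):
    """Count a letter by looking up each token's spelling.

    The lookup table is the catch: the model is handed only the IDs,
    so it can do this only if it has separately learned the spelling.
    """
    return sum(ID_TO_TEXT[token_id].count(letter) for token_id in token_ids)
```

Here `encode("strawberry")` gives a single ID with no letters in sight, while `encode("d3 c5 f6")` gives one ID per move, so for the Othello transcript every move is individually visible to the model even though the board itself still has to be reconstructed from the sequence.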