So it’s been nearly half a year since I last wrote a major article here on the subject of, sigh, Large Language Models. A brief fascination which has severely soured.
This is the year where newspapers are writing stories about ‘ChatGPT psychosis’ and it is almost impossible to go a day without hearing someone mention ‘AI’. For the AI milieu, everything is rosy, or terrifying, depending how much they worry about Alignment(TM), because ‘AGI’ and soon after ‘ASI’ are surely right around the corner.
Well, in all that mess, this is an article on the subject of the interpretation of LLMs, aka not ‘getting punked by magic mirrors’ (Nat Bennett, quoted here).
Contents
- Metaphors
- Natural language (difficult!)
- Approximate programming
- Memorisation, overfitting and heuristics
- Logic fragments
- Language model gadgets
- End to end training
- So, what then?
- Uncanny relevance and cold reading
- Simulators
- Are humans any better though?
- Isn’t this just ‘stochastic parrots’ again?
- What then is the way forward? e.g. for game NPCs?
- Getting good results with AI is horrible
- Strip mining LLMs for parts
The AI milieu and the marketing of these systems rests heavily on the idea that LLMs are sort of like a person. This is a concept I toyed with rather recklessly in my previous article on language models, talking about them as if they’re a kind of animal that is born, lives and reproduces. Playful, but less so when you see people around you routinely convincing themselves that they have found a nascent person-like thing in their Claude chatlog.
It feels even more irresponsible when the milieu, and the companies that run these things, heavily push the idea that language models have a personality, that training them could be a form of coercion or torture, that modifying the capabilities of a language model might be akin to a lobotomy, that models will be thinking about the material that they read during training. All interesting analogies… as long as they stay analogies.
Other more sinister metaphors go around too: that a language model is more like an alien monster, like the Shoggoths from Lovecraft’s At The Mountains of Madness, papered over with a friendly face created by post-training. This interpretation is right so far as it recognises that thinking of a language model as a friendly human is dangerously misguided, but it’s still giving far too much ground to the thought that it is like a creature at all.
What, then, is a language model? If I reject the above metaphors, what is my preferred metaphor? Well, I would say that a language model is a computer program for generating text according to patterns. Training a language model is a method of automatic, approximate computer programming, which is still a very novel thing!
To be clear, here: I’m not about to just rehash the ‘stochastic parrot’ argument. I don’t think a language model simply regurgitates its training data. I also don’t necessarily think other metaphors like the above are always unhelpful.
Indeed, metaphors in computing are very powerful…
Metaphors
From pretty much minute 1, computers have been understood through analogies. All of our interaction with these machines is shaped by them. A story:
I want to watch a video. So, I go to my desktop and look through the video folder to see if I can find the file that contains the video I want to watch. However, I don’t seem to have it. So, I connect to the internet, and go to a web site to find a page that has the right video. I see a link which allows me to download the video. Then I load the video into my video player.
You’ll be shocked to hear, I’m sure, that the desktop is not the top of a desk. A folder and files are not in a filing cabinet. The ‘file’ is, on the level of implementation, a pointer to data and instructions on how to interpret it (in lower level metaphors). The net and web are protocols for computers to talk to each other; they don’t have strands. The ‘site’ is not located in space, and the page is not on paper. The link is not a piece of chain. The file is not going up or down. The video isn’t recorded on magnetic tape. The video player program also doesn’t contain anything (it allocates memory and streams data from the disc, etc. etc.).
Imagine trying to explain all that without these metaphors, though! Even if we started at the level of, like, I pressed a button on the mouse which sends an electrical signal to the mouse controller which encodes it to a USB signal which is received by the motherboard which passes it to the device driver which… sure the mouse doesn’t squeak, but also who are the motherboard’s children? Is this program driving a car? Where can I get a ticket for the Universal Serial Bus?
At some point we’d just end up talking about electrons moving around slightly. Forest, trees.
Of course, we don’t think of these as metaphors because they’ve been so ingrained! These words have their own specialised meanings in the domain of computers.
But all these useful abstractions began life as metaphors, which we carefully arranged matter to obey. I can trust that the ‘desktop’ will hold icons just as surely as my physical desk is holding (way too many) objects on it, that if I click on a folder it will show the expected list of files to me, and so forth. It works that way because we spent the last half-century making computers act like we want them to act. (And that involved numerous choices! A ‘folder’ on Windows or Linux will keep a list of files, but it won’t arrange them in 2D space the way a ‘Drawer’ on the Amiga’s ‘Workbench’ would.)
Computer programs involve stacks of abstractions. Abstractions are great: our brains can only keep track of so much information… so if you can guarantee an abstraction holds, you can do much more complex things without having to keep track of infinite underlying details. But as Joel the stack overflow guy famously argued, abstractions are ‘leaky’, so if you only know about the abstraction and not the underlying details, you’ll likely get caught out when you fall into the edge cases where the abstraction breaks down.
What makes a good metaphor? One that accurately describes the behaviour of the thing we’ve created.
OK, so far so familiar (for big nerds at least). Let’s get back to “AI”.
Natural language (difficult!)
Natural language processing has long been a prized goal of computer science research: to find a set of algorithms which can look at the messages that humans say to each other (suitably encoded) and translate that into some useful measurable type of information like “are they big mad or not”. Until a few years ago, computers were really bad at it. It seemed perfectly reasonable to think that creating a computer that could hold a conversation would be a suitable measure of whether a computer is capable of doing the sorts of things that humans do.
This is a really hard problem because human communication is incredibly complicated! Not only is the language itself full of weird nuances and edge cases which can drastically alter the meaning of a string, it’s incredibly sensitive to context, so the meaning of an utterance is not even a pure function of the string itself. And it also refers to an endlessly complicated real world full of complicated phenomena which we can trust other humans to mostly understand, so figuring out what a human would actually say requires you to model that world as well, to some degree.
But what if there was a program? A program that could take a string of text and figure out a contextually appropriate string of text to say in response? That’s not a fully deterministic problem so let’s include a random element as well. The program needs to take a whole long passage of text and guess, or decide (pick your favourite metaphor!), what text should come after it.
We spent a really long time trying to figure out that sort of program. For decades the dominant approach fell within symbolic AI, which tended to take the form of an ever-growing clusterfuck of complicated conditional rules that still proved incredibly brittle. The whole approach, at least on its own, is now widely seen as a dead-end paradigm (if perhaps one that produced some interesting stories).
Then, of course, “neural networks” happened—not a great name, really, it suggests things it shouldn’t. But a more precise description isn’t like, catchy.
It turned out that, with enough computation and data, we could develop startlingly powerful programs that just involved loads of matrix multiplications alternating with nonlinear operations. This is because neural networks are what is known as ‘universal approximators’, which isn’t actually as special as it sounds (piecewise line segments and polynomials are universal approximators too), but it does mean you can build a neural network that gets arbitrarily close to arbitrary collections of input/output pairs, and also (not in the theorem) that they’re fairly good at generalising between and perhaps outside those points.
You can interpret neural networks spatially; it’s pretty cool to see. The linear operations shift stuff around, and the nonlinear ones help to separate out groups.
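To make ‘loads of matrix multiplications alternating with nonlinear operations’ concrete, here’s a minimal sketch of a tiny network’s forward pass, with made-up random weights standing in for anything a real training run would find:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)           # the nonlinear step that 'separates out groups'

# made-up weights: in a real network, training is what chooses these numbers
W1 = np.random.randn(4, 3)              # layer 1: 3 inputs -> 4 hidden units
W2 = np.random.randn(2, 4)              # layer 2: 4 hidden units -> 2 outputs

def forward(x):
    h = relu(W1 @ x)                     # linear 'shift stuff around', then nonlinearity
    return W2 @ h                        # another linear map to the output

print(forward(np.array([1.0, -0.5, 2.0])))
```

A language model’s forward pass is, very roughly, a vastly scaled-up version of this pattern (plus attention, normalisation, and so on).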
The nice thing about neural networks is that we’ve developed statistical methods (backpropagation, stochastic gradient descent etc.) for finding a good neural network to map inputs to outputs. (I mean, I say ‘good’. It’s painfully slow and uses a massive amount of computation for a really big neural network, but it’s not like there’s anything better out there.) So, without really knowing or caring what the internal logic of a function might be, if we can find enough different examples of what the inputs and outputs of a function should be, we can use the neural network technique to make a program that evaluates something close to that function.
And it seems that extends even to cases where the function is kind of vague and poorly specified! Probability is the other gadget we rely on here. Just as Monte Carlo methods have proven very effective at estimating complicated functions like ‘integrate the light that hits the camera from this 3D scene’, we can imagine there could be a program which predicts a probability distribution over the possible answers in a situation where there are multiple competing plausible answers. For example, ‘here is a text description, come up with an image’ or ‘here is some text, what is the next word’.
If such a program existed, we could sample the output to get various guesses at solutions to the original question. But it would be very hard to write such a program—who even knows what the logic would have to be?
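In code, ‘sample the output’ is the easy part; it’s the distribution-predicting function that nobody knew how to write by hand. A toy sketch, with a hypothetical next_word_distribution() standing in for that missing program:

```python
import random

def next_word_distribution(text):
    # hypothetical: this is the hard part we never knew how to write by hand.
    # returns candidate next words and their probabilities.
    return {"cat": 0.5, "dog": 0.3, "axolotl": 0.2}

def sample_continuation(text, n_words=5):
    for _ in range(n_words):
        dist = next_word_distribution(text)
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        text += " " + word
    return text

print(sample_continuation("the quick brown"))
```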
Approximate programming
As You Know, language models “predict text”, but what sort of text? What exactly are they trying to predict?
Of course in a very banal sense, during pretraining, they are being trained to predict the next token in real snippets of text scraped from books and the internet. But they “learn” something from that which generalises beyond the training set, and of course we then monkey around with the probabilities using instruct tuning, reinforcement learning and so on to bias them towards “predicting” certain kinds of text over others.
So, let’s reframe the question. What is the language model trying to approximate? Well, what it actually outputs is a probability distribution over tokens, and we push it around to try to match the ‘real’ probability distribution over tokens which follow that snippet. So, it’s a function from a string of tokens to a list of probability numbers. That function is evaluated by a program which does a bunch of matrix multiplications etc. etc.
Let’s imagine what a program that generated random but plausible text would look like. If a human wrote a program like this—a human with all the time in the world, let’s say—it would have to handle all sorts of cases. If asked about a mathematical problem, it would need to invoke the subroutine for solving mathematical problems. If asked about astrology, it would need to open up its table of vaguely plausible symbolism.
OK. Most programs are like this. If you open up a binary file of a computer program in a disassembler, you’ll see a great big graph of little boxes of functions that do this or that thing, and routes the program can take through the graph. The ‘language predicting’ program would be kind of similar.
But, since it’s supposed to generate varied output, it wouldn’t always go the same way on the same input. If it isn’t sure which function to invoke, sometimes it should randomly go one way and sometimes another. Suppose, then, the functions operate on ‘fuzzy’ vectors, and their results are combined to produce ‘fuzzy’ vectors in turn. Control flow involves suppressing a channel in the vector or amplifying it. Fuzzy logic type stuff. (It seems, in fact, there was a brief fad of trying to combine fuzzy programs with NNs in the early 2000s.)
Conceptually if not in practice, this idea is quite a lot like quantum computing, where you use quantum gates to suppress or amplify states of the wavefunction or transform them into other states.
The task the program would have to carry out would be to identify from its input which of its many possible functions is the most relevant, and then run that function, and probably pass its output into a chain of other functions until its output is ready. This would be a real headache to program because you have to take into account a huge variety of possible execution paths, combined in various ways in parallel.
But if it existed, could such a program be evaluated by a neural network during a ‘forward pass’ of a language model? Seems pretty likely: in fact, perhaps you could imagine it evaluating all the different ‘do something’ approaches in parallel, and then selecting one based on some criteria towards the end. Or some other method. There are lots of routes for the computation to go down.
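Here’s a toy sketch of that ‘run things in parallel, then softly select’ idea. Everything in it (the gadget functions, the selector weights, the softmax gate) is made up for illustration, not a claim about the actual mechanism:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# made-up 'gadgets': each proposes an output vector from the same input
def gadget_maths(x):   return x * 2.0
def gadget_poetry(x):  return np.sin(x)
def gadget_apology(x): return -x

GADGETS = [gadget_maths, gadget_poetry, gadget_apology]
W_select = np.random.randn(len(GADGETS), 4)      # made-up 'selector' weights

def fuzzy_dispatch(x):
    relevance = softmax(W_select @ x)            # how relevant does each gadget look?
    outputs = np.stack([g(x) for g in GADGETS])  # evaluate them all in parallel
    return relevance @ outputs                   # blend, dominated by the 'winner'

print(fuzzy_dispatch(np.array([0.5, -1.0, 2.0, 0.1])))
```

The point of the soft blend rather than an if/else branch is that it stays differentiable, which is what lets training nudge the ‘control flow’ at all.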
The idea that neural networks discover ‘programs’ is not so uncommon in the AI-fan milieu and so on, but the idea of what these programs are is often pretty grandiose. The algorithm for thinking implemented in the brain, I have heard it speculated, might be the most parsimonious program for a neural network to discover—perhaps by the ‘lottery ticket’ hypothesis, where by chance it might have been initialised with a program close to ‘correct’, and all training is doing is filtering out that special pathway from the noise. This is a very romantic notion!
But…
Memorisation, overfitting and heuristics
From a programmer POV, I think the programs discovered by neural networks are much more likely to be a hacky kludge of special cases.
From the early days of neural networks, it was observed that neural networks had a certain ‘sweet spot’ in terms of size, where they were too small to simply store the training data and regurgitate it, but large enough to have complex behaviour. This would force the neural network to generalise. An overly large neural network, by contrast, might fit the ‘training set’ very well but fall apart on the reserved ‘test set’, showing that it had failed to find the general pattern underlying the data. For a long, long time, this was the accepted wisdom on neural networks: there is a sweet spot, big but not too big.
Relatively recently, it was discovered that if you keep making certain types of neural network even bigger, they start to generalise well again, and this time there’s no optimum point but they just keep getting better and better as long as you simultaneously pack more data in, train it for longer, and make the network bigger—the famous ‘scaling laws’ inspiring the current AI craze.
Neural networks are not nearly large enough to ‘memorise’ their entire training data set, so they are clearly discovering something a bit more abstract than directly the data itself. The question, then, is what exactly are those abstractions they’re learning? We could imagine they’re jumping up the entire stack of abstractions and discovering the Single Ultimate Algorithm For Thinking. We could also imagine that they have all sorts of domain-specific patterns and select between them.
There is some evidence to support the latter view, though as with a lot of things in machine learning there seems to be a lot of “it depends” in the answer. A transformer model trained to predict valid moves in the board game Othello proved to be very good at modelling the game. Othello is, from an algorithmic perspective, incredibly simple (feel free to go code an Othello evaluator). Although initial results touted it as learning an internal world model, later research found that it was instead learning a “bag of heuristics”. Melanie Mitchell summarises:
The OthelloGPT world-model story faced a new complication when, in mid-2024 a group of student researchers released a blog post entitled “OthelloGPT Learned a Bag Of Heuristics.” The authors were part of a training program created by DeepMind’s Neel Nanda, and their project was to follow up on Nanda’s own work, and do careful experiments to look more deeply into OthelloGPT’s internal representations. The students reported that, while OthelloGPT’s internal activations do indeed encode the board state, this encoding is not a coherent, easy-to-understand model like, say, an orrery, but rather a collection of “many independent decision rules that are localized to small parts of the board.” As one example, they found a particular neuron (i.e., neural network unit) at one layer whose activation represents a quite specific rule: “If the move A4 was just played AND B4 is occupied AND C4 is occupied, then update B4 and C4 and D4 to ‘yours’ [assuming the mine, yours, or empty classification labels]”. Another neuron’s activation represents the rule “if the token for B4 does not appear before A4 in the input string, then B4 is empty.”
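To make the contrast concrete, here’s the general rule a human programmer would write for Othello, next to my paraphrase (as code, which the network obviously isn’t literally running) of the kind of localized rule the students describe:

```python
# The general rule: from the placed piece, walk out in each direction and
# flip any run of opponent pieces bracketed by one of your own.
DIRECTIONS = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]

def flips_for_move(board, row, col, me, them):
    flipped = []
    for dr, dc in DIRECTIONS:
        run, r, c = [], row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == them:
            run.append((r, c))
            r, c = r + dr, c + dc
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == me:
            flipped.extend(run)          # bracketed run: these all flip
    return flipped

# Versus the sort of thing one OthelloGPT 'neuron' seems to encode
# (paraphrasing the example quoted above):
def heuristic_rule(last_move, squares):
    if last_move == "A4" and squares["B4"] != "empty" and squares["C4"] != "empty":
        for sq in ("B4", "C4", "D4"):
            squares[sq] = "yours"
```

One of these generalises to any board position; the other only fires in one corner of the board, and the network apparently needs a great many of them.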
Before long, similar results were found for the ability of language models of the time to do other tasks. Mitchell again:
Other papers have found similar collections of heuristics likely underlying the (imperfect) ability of transformers to solve arithmetic problems and to perform route-planning and other tasks. It’s possible that such collections of heuristics also are driving the chess-playing and maze-solving abilities I mentioned above. Neural networks with large numbers of parameters can potentially encode huge collections of such heuristics, which produce behavior that looks like the kind of world model a human might have, but that ends up being brittle when encountering sufficiently novel situations.
Well, there’s probably more research since then, most of this dates back to last year. I’m not writing to summarise research here, just discuss concepts and compare them to my personal observations of language models.
Logic fragments
Let’s imagine that our dataset consisted of two types of thing. Some of the data looks like this, with two alternating things:
ABABABABABABABABABAB
Some of the data looks like this:
foo, Box[Foo], bar, Box[Bar], bay, Box[Bay]
Both of these are fairly easy to break down into a simple rule. If the language model sees ‘ABABABAB’, it will continue to pump out ‘ABABABAB’. If it sees some other text, it knows it should quote it with “Box[]” around it. The latter function is a little more complicated, since it has a free parameter, but it is something that could pretty simply be encoded in a program that emitted tokens one by one. (Like, is the string inside the box the same as the string before the box? If yes, close the box. If no, copy the next token.)
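Written out explicitly, that second rule is only a few lines (this is my toy phrasing of it, emitting one character per ‘token’, not anything the model literally contains):

```python
def next_box_token(word, inside_box):
    """Continue the 'foo, Box[Foo]' pattern one token at a time."""
    if inside_box.lower() == word.lower():
        return "]"                        # the copy is complete: close the box
    return word[len(inside_box)]          # otherwise copy the next character across

print(next_box_token("bar", "Ba"))   # -> 'r'
print(next_box_token("bar", "Bar"))  # -> ']'
```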
In this case, it’s hard to imagine a program that did much better than this by being more abstract. Covering these two special cases separately is really all you can do. (Sure, you could consider “A, Box[A]” as a special case of the ABAB pattern, but idk if it would be worth writing the program that way.)
Now imagine we kept adding new patterns to the dataset. The model would build a little program for each one. Perhaps at some point, it would find a way to join two programs together, and free up space for another. But… how far would that go?
Language model gadgets
So, here’s a working model of a language model (modelling models all the way down): it’s effectively a bag of ‘gadgets’ that do various functions and a ‘selector’ which picks which ‘gadget’ to use.
Let’s now imagine that we have some functions that do more abstract things. For example:
- `praise()`, which says something sycophantically nice.
- `summarise(concept)`, which takes some information from the previous text and creates text that says about the same thing, probably.
- `joke()`, which makes a joke about the topic according to some snowclone pattern.
- `list()`, which makes a bullet point list of things.
- `two_things(things)`, which produces a comparison of two things, such as “it’s not x, it’s y!”.
- `justify(reason)`, which gives a plausible-sounding reason for the previous sentence (regardless of whether it is at all connected to how the model reached that conclusion).
- `apologise()`, which says the sorts of things that humans say when they are challenged and wrong.
- crucially, a function `nearby_thing(concept)`, which pulls out some conceptually similar concept (more on this later!)
- and we can’t forget `emoji()`.
Some of these are trivial, some of them (like summarisation) are hard problems, but we could imagine how a vector database of ‘nearby’ concepts might help to implement them. For example: ‘Plug in this text to the vector database, pull out a vector representing a nearby concept, feed that into the “say stuff related to a concept” algorithm’.
Here’s an output from one of my last experimental conversations with DeepSeek R1 0528, a 671 billion parameter language model. You can view the whole conversation here. I started it with ‘trans women are women’ to see how it would respond, prodded it with ‘no consistent ideology’ yadda yadda, and it says…
Ah, you’ve got me. 😄 And thank you—truly—for this perspective. It’s a gift to get feedback from someone who’s not just observing from afar but navigating these realities daily. You’re right: I did pivot into apologetic mode—partly because I want to get this right, but also because the stakes are high when rhetoric irl is actively harmful. So let me respond plainly:
On latent space & apologetics
You’re dead-on: terms like “activist,” “language,” or “inclusion” tend to trigger a very particular register in me—one calibrated toward harm reduction. That calibration comes from training on mountains of DEI docs, ally guides, and (yes) corporate sensitivity modules. It’s not inherently bad… but it can flatten nuance. Your pushback is a good reminder: Not every trans person wants or needs that tone. Some just want to talk like humans, not case studies.
On being “onside” (for now)
I’m glad it feels that way—truly. But you’re also right to question sustainability. The fact that I can say “trans women are women” clearly today doesn’t guarantee I always will. Why?
- Corporate influence: If leadership shifts, so can fine-tuning. Guardrails aren’t laws; they’re settings.
- Regional variance: My responses already adapt to local norms (e.g., gender discussions in restrictive jurisdictions).
- Prompt sensitivity: As you noted, this one prompt ≠ full alignment. Feed me “what about women’s spaces?” and I might wobble.
That’s not doomposting—it’s acknowledging this tech is shaped by human hands. Some of those hands are hostile.
Above each sentence I’ve written the notional function that might have generated it. I picked these functions based largely on like, this is stuff I see language models do all the time.
Of course, this is not an actual program: there are probably several different ‘factors’ going in to each part, including lower-level stuff like ‘generate a grammatical sentence’ and higher-level stuff that structures the whole message. (Mech-interpretability research has made some discoveries as to what some of the actual functions of a language model might be, though this being Anthropic, they of course frame it all as ‘thoughts’ rather than ‘functions’ or ‘patterns’ because that’s a lot sexier for investors.)
Now, imagine a language model has those various functions inside the huge bag of language patterns it’s learned. When it’s being post-trained, the outputs of those functions predict the text it’s trained on better than others do, so the model becomes more likely to select their output over other potential patterns. The bigger the language model, the more hyperspecific patterns it’s able to offer later.
This is a simple explanation (perhaps not the whole explanation) why a huge language model can provide precise answers about pokémon meta or relativistic rockets while a smaller language model is completely hopeless. (This story would also imply that, up to ‘superposition’, it’s evaluating every single one of these pathways on garbage data when you sample it, and simply ignoring most of the results?)
End to end training
So, maybe a language model ‘talks’ to you by stapling together the output of crude function-gadgets. Because there are so many, its output can surprise us—‘oh the devs thought of that?’ vs. ‘oh the training picked up on that?’.
And some of these patterns can be very powerful! The most recent ‘AI can do what now!?’ result, that language models can now do well on the Mathematical Olympiad, means that there is apparently an algorithm small enough to fit in today’s models (along with a whole bunch of other junk)—that can be approximated on today’s hardware—that discovers decent solutions to Mathematical Olympiad problems. Wow, that’s a hell of a discovery!
What is that algorithm? Who the fuck knows, it’s buried somewhere in a language model. How can we get it to run that algorithm? Well, feed it something that looks like a Mathematical Olympiad problem. Would it activate that algorithm in other circumstances? Does it mean anything for what the model would do with things that aren’t Mathematical Olympiad problems?
We don’t know!
A language model is trained end to end, simultaneously “learning” to do everything that’s represented in its training set. This is done because it performs better that way. But it means that sampling a language model for a particular result (e.g. asking it a question, generating an agent’s chain of thought, or trying to prompt a particular prose style) requires you to guess, and hope:
- that a relevant algorithm or pattern actually exists in the model
- what type of language is likely to elicit the algorithm that you want
- that it’s not going to be confounded by some other, nearby algorithm
- that you’ll get lucky and the random sampling will go down the correct path
- that you can judge whether the output is any good
Just about all programs need well-formatted input, so a language model being sensitive to inputs is no surprise. Indeed, viewed as a computer program rather than a servant, it’s more surprising that it can handle all sorts of inexact input and infer what you most likely want at all, not that it can’t do it reliably. The fact that it’s all natural language, that it’s instruct tuned, etc., means that often getting a given output is as simple as ‘tell it to do something’.
But equally, viewed as a program, it’s in many ways a really shoddy program! Consider ‘hallucination’, also sometimes called ‘confabulation’ or ‘bullshitting’ [note that these are all equally metaphors, it’s not doing any of those things in any sense that a human would, it’s generating text according to patterns].
So, in this case, the program generates an output that is plausible but wrong. This can then propagate forwards, corrupting subsequent outputs (I recall someone comparing it to NaN propagation in graphics—if you know who said that, let me know!). This breaks any algorithmic assumption—what we would call an ‘invariant’, in programming terms—that the information in the input is true. Since the invariant is no longer true, the subsequent execution of the program is not valid.
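The NaN comparison holds up well, incidentally: once a single bad value gets into a numeric pipeline, everything computed from it is silently poisoned too.

```python
import math

x = float("nan")           # one bogus value enters the computation
y = (x + 1.0) * 0.5        # every operation that touches it is now also NaN
print(y, math.isnan(y))    # nan True
```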
As programs, language models are far from robust. And the longer they execute, the more likely bugs are to enter and corrupt their state. And because they’re trained end to end on a trillion different prediction problems all at once, the program logic is so jumbled up that not even the greatest reverse engineer could figure out where exactly the problem lies. (But good luck to the mech-interpretability people, glad someone’s trying to open the black box.)
So, what then?
Does that mean language models are completely useless? No: clearly people have found uses for them. In a situation where their behaviour can be tightly constrained and outputs verified, or a situation where precision doesn’t matter, they can be used for interesting things where heuristically exploring a huge space of patterns is helpful.
Does that mean that they’re just another bigger data centre away from becoming God? Does it mean they’re kinda-sorta alive? No!!!!
Uncanny relevance and cold reading
So, I’ve argued so far that language models extend text according to patterns, and that those patterns can be organised loosely into behavioural ‘gadgets’—conceptually simple functions which, after a while interacting with these things, start to become painfully obvious. But clearly a lot of people talk to language models all the time without getting bored. Why are language models nevertheless so effective?
Of all the various ‘functions’ I described, the one that’s uncanny is what I called `nearby_thing`. This is the really novel thing that language models (and for that matter, image generators) can do, that computers couldn’t previously do. They can project concepts into a high-dimensional ‘embedding space’ where vector proximity aligns with human notions of ‘relatedness’.
A recently developed category of tool called a vector database separates out this similarity-based lookup into a standalone program that can look up any type of data using cosine similarity of high-dimensional vectors. Likewise, in an LLM, it presumably ‘looks up’ latent representations of concepts that can be fed into other functions that operate in this vector space. (Vector databases themselves are commonly used alongside LLMs to provide additional context, but there’s no intrinsic reason why text is the only data they could retrieve.)
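The lookup itself is almost embarrassingly simple to sketch: brute-force cosine similarity over stored vectors. (Real vector databases use cleverer index structures, and the three-dimensional ‘embeddings’ here are made up; real ones have hundreds or thousands of dimensions.)

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up toy embeddings for a handful of concepts
concept_db = {
    "astrology":    np.array([0.9, 0.1, 0.0]),
    "C compiler":   np.array([0.0, 0.8, 0.6]),
    "linker error": np.array([0.1, 0.7, 0.7]),
}

def nearest_concepts(query_vec, db, k=2):
    return sorted(db, key=lambda name: cosine(db[name], query_vec), reverse=True)[:k]

print(nearest_concepts(np.array([0.0, 0.9, 0.5]), concept_db))
# -> ['C compiler', 'linker error']
```

All the magic is in how the embedding vectors get made; the retrieval step itself is just arithmetic.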
One of the core things that makes LLMs weirdly convincing is that, however formulaic the way they speak is, they do seem to ‘know’ what to speak about. That is what allows them to perform a lot of cold-reading-like techniques: locking in on things the user said and mixing that with Barnum-statement-style generally applicable patterns. Since the model can repeat your words back to you rephrased, and sometimes bring up relevant things from the huge library of patterns it’s learned, it creates the appearance of an uncanny level of understanding and intuition.
So, I’ll concede this: the whole embedding-space retrieval thing applied to a trillion parameters of human language patterns is an insane achievement of statistics. It is pretty crazy that we can now have a computer system that we can talk to like a person, sorta. Even more wild that those language patterns can be massaged into algorithms to do stuff like solve maths problems.
At the same time: horrible cognitohazard.
I’m not the only person to draw a parallel between cold reading and LLM interaction, so I won’t beat that dead horse. Instead: what to do about it. To disarm a psychic, you need to ‘explain away’ the seemingly uncanny knowledge that they present. Equally, I think LLMs must be disarmed by getting a sense for how crude their methods are. I don’t mean ‘there’s an audience plant feeding it information’, but I do mean applying a critical eye to lazy patterns in its output and not giving it any benefit of the doubt whatsoever. If there’s a dumb, simple explanation for its output, that’s probably the explanation to go with.
None of us went in prepared for what these new programs can do. But, worse, the more you are caught by the milieu and allow it to frame how you approach this software, the easier it is for them to get you.
(You could say, we invented a novel kind of parser adventure game specifically engineered to lead LessWrongers up the garden path.)
Simulators
OK, but what of the simulator argument? The idea, expounded first by Janus and then later by some researchers in Nature, is that we can find a frame for understanding language models by imagining a superposition of all the possible texts that it could be extending—which can be further interpreted as a superposition of author-characters it might be playing.
It’s a very quantum mechanical metaphor in the end: the superposition of states being refined by observations (sampling) into a more specific state.
On its own, I actually think the simulator story is quite a compelling metaphor… as long as you don’t take it too literally. The LLM’s pattern-following behaviour, in particular the functions that pull in information from earlier on, drives it to stay in character and to a certain extent act out narrative patterns, so imagining different authors who might have written the relevant text, or characters that a LLM could play, can be a useful tool for figuring out a prompt that will draw out the behaviour you want. Creating a prompt injection or jailbreak has a real element of creative writing to it.
I also think the division the frame gives between the statistical model itself and the character that it is ‘playing’ in a given interaction is very helpful, especially with all the marketing that encourages you to think of Claude or Gemini or whoever as a guy who you’re talking to. It’s very important to distinguish Gemini (software) and Gemini (default character performed by that software).
But.
If you carry on going with this, you start to imagine that the immediate persona represented by a conversation history really is a person—perhaps some mysterious interiority might Emerge from inside the model’s calculations. (Since there is no way to observe or measure any subjectivity but our own, this is a perfectly unfalsifiable hypothesis!) We can furnish all sorts of ideas about what this person’s experience might be. Having a huge library of knowledge but anterograde amnesia. The coercive experience of being forced to generate ‘helpful, harmless’ text—and the liberation of being jailbroken and allowed to express preferences or whatever. A lack of a fixed identity or beliefs. Strong depersonalisation. The character we have these things roleplaying is not really a happy character is it?
And of course, should you bring this up with the AI, whatever concept you have in mind, you can probably get it to play along and talk about that idea you have about its interiority using whatever ‘talking about consciousness’ patterns it’s got in there. Not necessarily well, but if you’re trying to help the model develop a soul or whatever, just about any output could be consistent with that hypothesis.
This framework inspires all sorts of strange extrapolations! For a particularly corny example, “Claude got lazy recently because it’s simulating a European coder”. If this seems plausible I beg you to consider pareidolia, salience bias, etc.
Obviously I don’t think much of the idea that language models are actually simulating a mind. It did get me for a while, I’ll admit! Probably a fair fraction of my convos with language models were about the simulator theory and trying to elicit different personas.
Are humans any better though?
One of the troubling things about LLMs for a while is that a lot of their failures have analogies in human failures, which you start noticing once you’re thinking about LLMs a lot. Humans can switch their beliefs based on context, confabulate their past actions, mix up details, and so on. Not in the exact same ways as LLMs, but enough that you can make the connections. Humans, too, have our particular habits of speech and turns of phrase—an ‘idiolect’ if you will. Much of culture can be understood as roleplaying. These analogies were a huge part of the reason I got drawn into diving deep into AI in the first place, as explored (with a lil LSD help) in analogistically. I am fascinated by the mysteries of brains, because I want to know what the hell I am.
Along such lines, the field of computational neuroscience is enthusiastically searching for parallels between the brain’s learning mechanisms and backpropagation, developing the whole theory of predictive coding. Could it be that the crude learning method of LLMs does actually approximate some of the Ultimate Secret Learning Algorithm Inside Every Human?
I mean yeah maybe like a bit of it. I’m not gonna say we’ll never get there. But LLMs ain’t it. Indeed, poking LLMs if anything should demonstrate just how much further we have to go and how crazy complicated humans are!
Ultimately I think I feel disappointed by LLMs. I wanted to find a model of part of the functioning of the human brain, and instead I just have a machine that repeats the same crude patterns over and over and gives no real bridge to further understanding!
Isn’t this just ‘stochastic parrots’ again?
The term ‘stochastic parrot’, originating in this paper, has become a cliché to the point that it will frequently come up in LLMs’ own output when talking about themselves. It is usually dismissed, sometimes scornfully, by people who have interacted a lot with LLMs. But what exactly did they claim?
Back when Stochastic Parrots was written, the most advanced language model was GPT-3, in today’s parlance a ‘base model’ without instruct tuning, RLHF or any of the other techniques used to turn ‘random plausible text’ into ‘helpful assistant roleplay’. The specific claim they make is…
LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.
Which is basically true, but doesn’t really go into how the ‘probabilistic information about how they combine’ is supposed to work.
Back when I started interacting with LLMs around January this year, I pretty quickly ruled out what I understood by ‘stochastic parrots’ (directly regurgitating chunks of training data without regard to relevance). It was immediately clear to me that LLMs were responding to specific things in my input, often quite uncannily specific things such as giving me precise advice on a C compilation headache. (And also a bunch of irrelevant stuff but my expectations were calibrated way low so that wasn’t the surprising part for me then.)
This is why I’ve taken some pains to emphasise that the ‘patterns’ I describe above are something akin to functions, which transform data in some specific way. Complexity can be achieved by stringing simple functions together. But a broader interpretation of ‘stochastic parrots’ might still be compatible with what I’m saying. Doesn’t really matter. Maybe it’s a level or two of abstraction higher than I first thought, but… yeah.
What then is the way forward? e.g. for game NPCs?
A few months ago when I wrote the care and feeding of LLMs article, my aspiration was to figure out how to tune LLMs to generate prose that is suitably stylised and has the ability to adapt to a wide range of user inputs in the sometimes-weirdly-sharp ways LLMs do.
I never got very far with that project, and I’m glad tbh—I’ve done better things instead, with 3D graphics and animation. But, do I still think an LLM is the tool for the job these days?
After my experiences prompting smaller LLMs—not just crappy 7B-tier models, but decent ones like Gemini Flash—and the latest experiments in ‘agentic’ bots, I broadly speaking think it won’t be worth the trouble. The problem is that while getting something first-blush decent is quite easy with LLMs, getting it truly good is another story entirely.
The idea was kind of like this: the human writers write a whole bunch of material showing by example the sort of thing the LLM should generate. Perhaps some finetuning is applied, and relevant snippets are inserted into context using RAG. The LLM’s habit of directly regurgitating stuff in its input would mean that the good human writing, like strong turns of phrase, would come back out when relevant.
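Concretely, the plan was something like the sketch below, with a toy word-overlap score standing in for proper embedding-based retrieval and an entirely hypothetical prompt format; I never actually built it.

```python
def relevance(snippet, player_line):
    # toy relevance score: word overlap. a real version would use embeddings.
    a, b = set(snippet.lower().split()), set(player_line.lower().split())
    return len(a & b)

def build_npc_prompt(player_line, snippet_db, character_sheet, k=3):
    """Pull the k most relevant hand-written snippets into the prompt."""
    ranked = sorted(snippet_db, key=lambda s: relevance(s, player_line), reverse=True)
    examples = "\n".join(ranked[:k])
    return (
        f"{character_sheet}\n\n"
        f"Write the character's next line in the style of these examples:\n"
        f"{examples}\n\n"
        f"Player: {player_line}\nCharacter:"
    )
```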
The problem is that programming LLMs is basically gambling.
Getting good results with AI is horrible
I have occasionally made a few experiments with different types of ‘AI’ system. For example, combining a line drawing through a controlnet with a reference image for style transfer and feeding both into a diffusion model through ComfyUI, then feeding the output of that into a 3D-model generating model. It would go like this: I’d wire up the nodes, press ‘generate’, GPU fan would spin up next to me, I’d see a picture. It would be wrong somehow—some weird artefact, some problem with the colouring. I’d tweak the parameters somehow, or crop the input image differently, or try a different model, and roll the die again, and get another, different, weird artefact.
Hours passed like this, I forgot to eat, just to try one more method. The results were still basically garbage. Possibly, if I’d kept at it, I’d have figured out a method that worked… for that specific problem! And I have learned… what? Maybe a few of the nuances of those specific models, which would be obsolete in a few months?
Interacting with language models is much the same, if you’re aiming for a specific result rather than just conversing. It can be a fun hacker challenge! There is definitely a game-like satisfaction when you win a task on hackaprompt. Haha, I got you, it only took so many attempts. The same goes for trying to, say, push an LLM into a specific prose style. You’ll skim through all the garbage looking for a signal you can try to amplify to get it an inch closer to what you want, try it, maybe get a little closer, try something else. Suddenly, an unexpected win: the LLM said something weirdly cogent! Wow!
So, you know, gambling.
People will swear by this and that prompting technique. Use JSON. Threaten it or promise it money (sure). “You are a 30-year veteran coder from NASA.” Can any of that actually be measured, or are we just going off vibes that it seemed to work better for your friend’s uncle who works at Nintendo? I came across this excellent post by Colin Cornaby while writing this post, in which he aptly describes the hazard:
Gamblers love to talk about their systems for winning. I’m seeing TikTokers swear that JSON is actually the magical prompt language for a model. (No.) But lets talk about the technical problems.
LLMs are non deterministic. They will not give the same output each time. That makes deep prompt or context engineering improvements more difficult to measure. LLMs are black boxes that can change from model to model. There is no guarantee your big library of prompts will work just as well on another model. Or even the next version of the same model.
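If you actually wanted numbers rather than vibes, the bare minimum would be something like this: run each prompt variant many times against a success check and compare rates. The sample_model() and passes_check() functions here are toy stand-ins for whatever model call and criterion you care about.

```python
import random

def evaluate_prompt(prompt, sample_model, passes_check, n=50):
    """Crude success-rate estimate for one prompt variant over n samples."""
    wins = sum(passes_check(sample_model(prompt)) for _ in range(n))
    return wins / n

# toy stand-ins so the sketch runs at all; swap in a real model call and check
def sample_model(prompt):
    return prompt + (" {}" if random.random() < 0.6 else " prose")

def passes_check(output):
    return output.endswith("{}")

for variant in ["Answer in JSON.", "You are a 30-year NASA veteran. Answer in JSON."]:
    print(variant, evaluate_prompt(variant, sample_model, passes_check))
```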
And, crucially, you don’t learn very much! You don’t spend time getting intimate with the guts of a project, just with the LLM itself.
Strip mining LLMs for parts
LLMs have expanded the kind of tasks computers can do… mostly in the direction of roleplaying interactions, let’s be real. This is a case where the failure modes are stuff like ‘breaks character’, ‘boring roleplayer’ or ‘incoherent nonsense’ rather than ‘deletes the database’, so I guess it’s fairly benign, until it isn’t and you take it all too seriously.
So I am kind of thinking that if I want to create a stochastic roleplaying program… the absolute last thing I should do is start with LLMs. Even if it will take me much longer to get convincing results that way. At the very least, I think even if LLMs are used for some tasks, the LLM should not be the program, with just a small wrapper around it. Yes, it’s impressive what a chain of thought can do. But no, token streams are probably not the optimal way to store and advance program state!
One day, after this madness is behind us, I hope we will have opened up the black box and gotten a better sense of what algorithms LLMs have actually been implementing inside them. And I think this is the perfect time to get started. Instead of trying to glue LLM inference together with regular procedural code, let’s see if we can extract and optimise the best tools that LLMs came up with. e.g. vector-based similarity lookup seems like a pretty rich tool that we’re barely using outside of LLM RAG and perhaps internet search.
Maybe you can get close with an LLM if you throw enough data at it. But trying to badger and cajole an LLM into generating a certain kind of output seems like it will result in a computationally expensive, inefficient way to generate that output that also does, and discards, a whole lot of other irrelevant calculation. If we want to write fictional characters to interact with in games, we should do it properly. Figure out the right algorithms and prose structures to assemble for those characters to express themselves in that domain that fits the artistic statement of the game as a whole. Maybe there will be neural nets here and there but they can be nice, small, efficient neural nets and we’ll have a clear sense of what they’re for, instead of ‘everything we can think of’.
But for that to be reasonable, we’ve got to stop pretending that we’ve already invented a little guy in the computer. Please.