Dramatis Personae
AI: artificial intelligence
LLMs: large language models
NNs: neural networks
CNNs: convolutional NNs
People: people
Bots: not yet people
Over a trillion dollars is being invested in AI. Nvidia is a hot stock, and Transformers are topical. People are failing to find jobs because employers are banking on the future promise of what large language models will deliver. In my last post I covered both the extraordinary success we’ve recently seen and one key limitation of LLMs: once trained, they can learn nothing new.
There we also learnt that the second AI winter came about not so much because expert systems failed as because of secondary issues: they didn’t learn well, they made silly mistakes, and they were expensive to maintain. Perhaps a bit like LLMs, which eat cash and have largely failed to deliver a meaningful return on investment.
This shouldn’t surprise us, because we’ve already worked out that if you want to improve efficiency at your company, quick fixes are rare; lasting improvement comes from process re-engineering. Certainly, if as part of that re-engineering you find a bottleneck in, say, document retrieval and can engineer an LLM-based solution, productivity may improve. But most companies don’t seem to be thinking like this. Generally, a bot-ex-machina won’t descend from the wings and fix things!
We also discovered that Large Language Models are not Large Logical Models. An LLM may well provide an appealing answer over a logical one—that’s one reason for their apparent success. But facts count. In 2024, Air Canada were taken to court after its helpful chatbot ‘hallucinated’ the idea that passengers could claim a retroactive bereavement discount. The airline lost the case.
Enthusiasts will however still object to my analysis. This post looks at some potential objections, and delves into other ‘non-trivial’ problems we’re seeing with LLMs. This sets up my third and final post in the series, where we look at solutions.
A moving target
In a developing field, it’s easy to protest “You’re criticising old tech”. As a doctor, I’m familiar with this, especially from device manufacturers. “The next defibrillator won’t fail to fire”. If you use new software a lot, you’ll also know it well. Report a bug—“it will be fixed in the next release”. Even if it is, this may be at the cost of two new and slightly more irritating bugs. Be wary.
Consider, for example, the Tesla video at the start of my post, where one of Elon’s finest goes almost full-tilt into a large truck on its side.1 The obvious lessons here are that abandoning lidar was a mistake, and that you can’t train up neural nets for every eventuality. The moving-target defence may run something like “Yes, but we’re continually improving, replacing CNNs with LLMs, which will work better.” The important question though is:
What assurances can we give about the performance of neural nets like CNNs and LLMs?
Before we examine this, let’s look at some claims about (and failures of) Transformer-related technology.
Zero shot
‘Zero shot inference’ is a play on ‘one shot’ learning, where you see one, you learn, and you perform perfectly the next time you encounter the problem.2 Zero shot is even more ambitious—you determine ‘what class a problem is’ without ever having encountered it before. It’s a test of generalisation. Knowing horses, you might identify a zebra when told “A zebra is like a horse with stripes”.
The problem — as we discovered recently with our example of “a mammal that wears trousers and swims in the ocean” — is that current LLMs generalise rather poorly. We’ve found that once taught “A=B”, they can’t even work out that “B=A”!
The limitations of zero shot have been scrutinised recently. Udandarao and colleagues looked at data quantities needed for zero-shot image recognition. They examined 34 models and 5 standard pre-training datasets.3 Disappointment: O(2^n). Exponentially more data are required to achieve linear improvements in performance. This may well be why newer bots seem to be running out of data to train on.
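To see what that scaling means in practice, here is a tiny sketch with entirely made-up numbers (not figures from the paper): if accuracy climbs with the logarithm of the data, then linear gains in accuracy demand multiplicative increases in training data.

```python
# Illustrative only: hypothetical numbers, not figures from the paper.
# Assume zero-shot accuracy scales roughly log-linearly with the number of
# pre-training examples n, i.e. accuracy ≈ a * log10(n) + b.
a, b = 0.08, -0.2            # made-up fit parameters

def examples_needed(target_accuracy):
    """Invert the log-linear fit: n = 10 ** ((accuracy - b) / a)."""
    return 10 ** ((target_accuracy - b) / a)

for acc in (0.4, 0.5, 0.6, 0.7):
    print(f"accuracy {acc:.0%} needs ~{examples_needed(acc):,.0f} examples")

# Each extra 10 points of accuracy multiplies the data requirement by
# 10 ** (0.1 / a), roughly 18x here: linear gains, exponential appetite.
```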
(Thanks, Google Gemini)
I have two watches ...
We know that current LLMs can’t learn a thing after being trained. AI promoters like Sam Altman know this too, but are undeterred. Their techies say they have fixes.
So far, re-training even a tiny part of an LLM runs the risk of catastrophic forgetting, because the whole point of an LLM is that it makes global associations and stores these inscrutably.4 More interesting is retrieval-augmented generation (RAG). This links in an external database, and modulates the oracular pronouncements of the LLM with the leaven of learning. In theory.
The “I have two watches so I don’t know the time” problem here is obvious: the new data are not integrated into the LLM.5 Where there is discord, whom do we trust? RAG can fix very specific defects (“If someone asks about mammals that wear trousers and swim in the ocean, don’t mention sea otters”) but doesn’t address the internal, more general problem.
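To make the bolt-on nature of RAG concrete, here is a minimal sketch. The `embed` and `call_llm` functions are hypothetical stand-ins for whichever embedding model and LLM endpoint you actually use; the point is the architecture: the retrieved facts are pasted into the prompt, never into the model’s weights.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
from typing import Callable, List

def rag_answer(question: str,
               documents: List[str],
               embed: Callable[[str], List[float]],
               call_llm: Callable[[str], str],
               k: int = 3) -> str:
    # 1. Rank the external documents by cosine similarity to the question.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    q_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), q_vec), reverse=True)

    # 2. Stuff the top-k documents into the prompt as 'context'.
    context = "\n".join(ranked[:k])
    prompt = (f"Answer using ONLY the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}")

    # 3. The frozen LLM answers from its weights *plus* the pasted context.
    #    Nothing is learnt; if the two disagree, we have two watches.
    return call_llm(prompt)
```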
You can work out that other prosthetic limbs are similarly limited. If we grasped the entire internal LLM ‘data structure’, then we could either supplant it or fix its inability to learn without catastrophe. Perhaps this will happen someday, but I have my doubts. Inscrutable structure and costly coding fight against a magic fix without massive re-design.
Ball-and-chain of thought
We must examine two other ‘fixes’. The first is the idea that you simply need to ask the right questions. There’s a grain of truth here, as with any oracle, and indeed, any problem.
In the old days, when Google worked and didn’t simply try to push stuff on you, savvy users quickly discovered that the quality of the query counted. If you simply searched for “making stone tools”, you’d get a lot of guff. Find the magic word ‘flintknapping’, and the search suddenly becomes sensitive and specific. Similarly, searching for the mechanism of the haemoglobin oxygen dissociation curve is clumsy, but singling out the three keywords ‘haemoglobin’, ‘relaxed’ and ‘tense’ is a winner. The problem is, you need to know a lot about the subject already.
A more recent ‘solution’ has been “chain of thought prompting”. You can do this manually, nudging the LLM to take baby steps through a complex query, but increasingly it is built into the shroud of code around LLMs like GPT-4o. The catch is still the same: it’s prompt engineering.
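Doing it manually looks something like the sketch below, with `call_llm` once again a hypothetical stand-in for your LLM API of choice. Note that all the ‘reasoning’ scaffolding lives in the prompt text and the sampled output, not in the model itself.

```python
# A minimal sketch of manual chain-of-thought prompting.

def chain_of_thought(question: str, call_llm) -> str:
    prompt = (
        "Work through the following problem in small, explicit steps. "
        "Number each step, state any intermediate result, and only then "
        "give the final answer on a line starting with 'Answer:'.\n\n"
        f"Problem: {question}"
    )
    return call_llm(prompt)

# Usage (hypothetical): chain_of_thought("If a train leaves at 09:40 ...", call_llm)
# Nothing about the model changes; only the text we put in front of it does.
```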
The recent, controversial ‘Apple paper’ on The Illusion of Thinking attacks the issue head-on. The authors evaluate built-in ‘large reasoning models’ (LRMs) that wrap chain-of-thought around current LLMs, and show that beyond a certain complexity they simply collapse. Utterly. This is shown in their Figure 1 (above), where the bot tackles the Tower of Hanoi problem. Performance improves up to around seven disks. Then, bam!
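To see why the complexity bites so quickly, recall that the optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so the move sequence the bot must produce grows exponentially. A quick sketch:

```python
# Tower of Hanoi: the optimal solution for n disks takes 2**n - 1 moves,
# so 7 disks is 127 moves and 10 disks is already 1,023.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((source, target))
    else:
        hanoi(n - 1, source, spare, target, moves)   # move n-1 disks out of the way
        moves.append((source, target))               # move the big disk
        hanoi(n - 1, spare, target, source, moves)   # move n-1 disks back on top
    return moves

for n in (3, 7, 10):
    print(n, "disks:", len(hanoi(n)), "moves")       # 7, 127, 1023
```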
Chain of thought seems to be just a sophisticated form of RAG: it tweaks the LLM by bolting something onto the front of the prompt, but it doesn’t fix the internal issues.6
Creatively, LLMs only confabulate
A persistent problem with LLMs is that they just make shit up. There are countless examples. This has been called ‘hallucination’ but a better term is ‘confabulation’, for two reasons.7 The first is that hallucination seems to imply that the LLM has a mind in which it can construct, well, hallucinations. It doesn’t. That’s the whole point—it just associates data globally. The second is that LLMs work by filling in gaps. Anything that’s not a gap-filler is a regurgitated datum. Take your pick. This is well shown when you try to set rules.
Show me the rules
LLMs can regurgitate rules. They struggle, however, to play by them. In fact, precisely because they are so good at interpolating between data points, they get creative at all the wrong times. In the example above, ChatGPT tried to get its black queen to jump over the knight and take the white queen!
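Contrast that with a conventional rules engine, where legality is checked against an explicit board state. A small sketch using the python-chess library (assumed to be installed):

```python
# A rules engine cannot make an illegal move, because every move is checked
# against an explicit game state.
import chess

board = chess.Board()
board.push_san("e4")                     # 1. e4, so it is now Black to move
dodgy = chess.Move.from_uci("d8h4")      # black queen 'leaping' over its own e7 pawn
print(dodgy in board.legal_moves)        # False: the engine simply refuses

# An LLM, by contrast, emits whatever move *looks* plausible in context;
# there is no internal board state to constrain it.
```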
This gives us an answer to our Tesla-induced question at the start. We can only give assurances about software performance when we can trust the constraints on that performance. For example, the seL4 microkernel (a hardened member of the L4 family) has been mathematically proven to conform to its specification: a “proof of correctness”. Proofs don’t need to be deterministic. We can, for example, use statistical tests to show that a large number is prime to a “hell freezes over” level of confidence.
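For the curious, here is roughly what “hell freezes over” confidence looks like: a sketch of the Miller-Rabin probabilistic primality test, where each extra random round multiplies the residual doubt by at most a quarter, so forty rounds leave an error bound of about 1 in 10^24.

```python
# Miller-Rabin probabilistic primality test: a composite number survives
# each random round with probability at most 1/4, so k rounds leave an
# error bound of at most 4**-k.
import random

def probably_prime(n: int, rounds: int = 40) -> bool:
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # write n - 1 as d * 2**r with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False           # definitely composite
    return True                    # prime, with error below 4**-rounds

print(probably_prime(2**89 - 1))   # True: a Mersenne prime
```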
But we simply can’t apply such confidence to LLM performance. What we can readily do is demonstrate failures, and this bleeds over into the very basics. As the Tesla example shows, rare failures can be very costly.
(Punch’s final take on their original “Curate’s egg” cartoon)
Good in parts?
You might think from the above that I see no role for AI. Nothing could be further from the truth. I am merely doing good Science—examining the defects so that we can do better. As we’ll see in my next post, suitable deployment of AI has the potential to do better than people can, even better than experts.
We do however need to look at claims critically. Have we allowed bias to creep in? There is already a multitude of tasks that machines do better than we do. Even on thinking tasks, we have no hope of besting a modern chess engine like Stockfish. Certain AIs can read images like retinal photographs better than we can, and in some circumstances out-perform radiologists when looking at x-rays. The list keeps growing, and once we realise that intelligence is a continuum, we should be no more surprised at escalating machine performance than we are that a chainsaw is often better than an axe, that a surgical stapler usually beats a surgeon at joining bowel during colorectal surgery, or that a robotic welder exceeds human skill on a car production line.
But bizarrely, we think the human brain occupies some sort of lofty, separate position. We’ve already dispensed with a lot of the associated mythology, but we still tend to make the same mistake. And the reverse mistake!
Take programming. For years now, it’s been clear that machines are better at some tasks than programmers. A compiler takes code in a higher-level language and translates it into efficient machine code that will run on a microprocessor. A few decades ago, a smart human could often squeeze out more by hand-compiling sections of the code in assembly language. That is generally no longer true; we now do worse. In contrast, when it comes to programming in the large, ‘AIs’ have been pretty poor. Recently, LLMs have made great strides, to the point where a new mythology is becoming established.
For example, in a 2025 study published this month, Becker and colleagues took experienced open-source developers and randomly assigned 246 real tasks, drawn from mature projects in which those developers had years of experience, to be completed with or without AI assistance. The results are startling: participants anticipated that AI would cut their completion time by about a quarter, and afterwards estimated that it had been cut by a fifth. But when measured, those who used AI were actually slowed down by 19%!
(ChatGPT needed several revisions and a lot of prompting to produce the above8)
LLMs cannot do Science
I’ve previously presented a solid model of Science. The nice thing about it is that if you can think of a better way of doing science, the model accommodates this. It’s built in. Even better, Science has built-in feedback assessment, encouraging us to identify bias and then purge it. Science works on Pearl’s levels 2 (causation) and 3 (counterfactuals). It’s part of the model. Effectively, it is the model.
From time to time, an LLM will seem to operate on Pearl’s levels 2 & 3. There are two problems, though. The first is Magritte’s lesson: a drawing of causal thinking isn’t causal thinking :)
But the bigger problem here is that we know what is going on under the hood. The LLM works by association, so it cannot be operating on a higher level. An apparent success may represent:
Random variation (‘luck’);
Regurgitation of someone’s prior thinking—a drawing of causal thinking;
A cheat, or tweak, for example leading the LLM by the nose into a pattern that appears smart and causal or counterfactual.
Which brings us to the inevitable question: how then do we fix things? LLMs have so much potential. AI has so much potential. Machines can already do things that we previously considered near-impossible for them. So how do we make things good?
This is the topic of my third AI post.
My 2c, Dr Jo
A widely circulated 2024 report asserted that the death rate from Tesla crashes is twice that of comparable cars, but this has not been substantiated.
In Medicine, the traditional saying is “See one, do one, teach one”, sometimes cynically modified (by me) to “See one, kill one, teach one”.
The models depend on CLIP for image recognition, and Stable Diffusion for text-to-image generation. Transformer-based CLIP, which stands for Contrastive Language-Image Pre-training, works by juxtaposing two models. One ingests text and outputs a vector for ‘semantic content’; the other ingests an image and outputs a vector for ‘visual content’. A shared vector space is produced, with semantically similar text-image pairs closer to one another in multidimensional space.
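A conceptual sketch of how that shared space gets used for zero-shot classification is below. The `encode_image` and `encode_text` functions stand in for the two real encoders; the only things the sketch commits to are the shared vector space and a cosine similarity.

```python
# Conceptual sketch of CLIP-style zero-shot classification.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image, labels, encode_image, encode_text):
    img_vec = encode_image(image)
    # Compare the image embedding with an embedding for each candidate caption.
    scores = {lab: cosine(img_vec, encode_text(f"a photo of a {lab}"))
              for lab in labels}
    return max(scores, key=scores.get)

# zero_shot_classify(photo, ["horse", "zebra", "sea otter"], enc_i, enc_t)
# works without ever training on zebras, provided 'zebra' text and striped
# images co-occurred often enough in the pre-training pairs.
```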
Stable Diffusion can then take the CLIP vector space and do something extraordinary. Starting with random noise and a vector of semantic content (built from a text input string), it progressively ‘de-noises’ the image, working from the association of text and images. Lo and behold, we have a picture!
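Very roughly, and leaving out the noise schedules, scaling factors and latent-space details of the real thing, the generation loop looks like the sketch below; `denoise_step` stands in for the trained network that predicts the noise to strip away at each step.

```python
# Highly simplified sketch of the text-to-image denoising loop described above.
import numpy as np

def generate(prompt, encode_text, denoise_step, steps=50, shape=(64, 64, 4)):
    text_vec = encode_text(prompt)           # semantic content vector
    latent = np.random.randn(*shape)         # start from pure noise
    for t in reversed(range(steps)):
        predicted_noise = denoise_step(latent, t, text_vec)
        latent = latent - predicted_noise    # peel away a little noise each step
    return latent                            # decoded to pixels by a separate model
```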
We should note that Stable Diffusion and CLIP are open source, but their performance has been supplanted by multiple proprietary and other constructs like those in Midjourney 6, DALL·E 3, Sora and so on. SD+CLIP had particular issues with hands, but these still aren’t completely fixed, even with more sophisticated models. Try e.g. “Draw a close-up photograph of the two hands of one person, with the fingers interlocked. It is important that the fingers of one hand are interlocked with the fingers of the other, so there's an alternation between the fingers of one hand and the fingers of the other. The hands are resting on the person's chest. The person is wearing a multi-coloured robe. Studio lighting.”
MEMIT claims to identify local factual defects and fix them, but how this works on generalised domain-wide knowledge is unclear. There is now evidence that models edited with MEMIT or ROME degrade after sequential edits, culminating in catastrophic forgetting.
Did you notice the times on those watches? Web pictures of watches are set to a similar time (or 01:50), for aesthetic reasons. You will likely struggle to find an adequate training set of watch pictures displaying other times!
We should also not confuse these ‘fixes’ with things like Model Context Protocol, which is a great idea for communication between applications and LLMs.
From a medical point of view, confabulation is characteristic of people whose ability to lay down new memories is destroyed, notably in the Wernicke-Korsakoff syndrome, brought about by damage to the mammillary bodies. This is usually due to profound vitamin B1 deficiency.
With apologies to René Magritte. (Even with repeated revisions, I eventually ran out of patience and clumsily edited the image in Gimp.)
“Trained on 20 years of blues, an LLM won’t produce rock ’n’ roll.”
A badly remembered comment from one of Zitron’s blog posts.