I’m going to do something slightly unusual today. Most of my posts work through to a conclusion. Here, the big reveal is so absurd that I’m going to put it up front. It’s this:
Changing a single thought in your mind is about 8,000,000 times less expensive than doing the same thing in the ‘mind’ of a large language model (LLM). And that’s a deeply conservative estimate.
Clearly, some work will be needed to explore this preposterous deduction. You are—as always—welcome to criticise it in the comments. Here’s the broad layout of my argument.
A brief recap: what is intelligence?
Big O!
Some logic, needed for the next section.
A short history of failures of artificial intelligence (AI).
How LLMs work.
A human thought benchmark.
The comparison.
An intelligent recap
In my most recent post, we found it unreasonable to view intelligence as a yes | no thing. We also defined intelligence: someone or something is being intelligent if they are solving problems using all three levels in Pearl’s causal hierarchy. If you’re unfamiliar with any of this, well then, read the post. Being intelligent is also “doing Science well”.
We are mostly wrong. We know that Science can never provide ‘truth’. Striving for truth is the Platonic fallacy that killed logical positivism. So you can’t hard-wire intelligence. It needs to adapt, so we can grow. The best way to do so is often Bayesian: new information (and the associated likelihood ratio) allows us to take our prior odds and adjust them to ‘posterior odds’.
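Concretely, the update described above is just one multiplication: posterior odds = prior odds × likelihood ratio. Here is a minimal sketch, with made-up numbers rather than anything from a real study:

```python
# Odds form of Bayes' theorem: posterior odds = prior odds × likelihood ratio.
# The numbers below are purely illustrative.

def update_odds(prior_odds: float, likelihood_ratio: float) -> float:
    return prior_odds * likelihood_ratio

prior_odds = 1 / 4        # we start out thinking the hypothesis is 4:1 against
likelihood_ratio = 10     # the new evidence is 10× more likely if the hypothesis is true

print(update_odds(prior_odds, likelihood_ratio))   # 2.5, i.e. now 2.5:1 in favour
```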
We will not yet explore silly questions like “Are we there yet?” or the even more nonsensical “Does this AI meet the criteria for ‘artificial general intelligence’?” We’ll even hold back on the Bayesian bit. Here we’ll simply look at matters of efficiency.
To do this, we need a couple of insights. If you’re familiar with Big O notation and the propositional calculus, you can skip the next two sections.
Big O notation
Computer science is saturated with Big O. Provided you understand it and use it sensibly, it’s a boon. We are interested in how expensive it is to run a particular function on a computer, as we feed it bigger and bigger data.
A traditional example relates to sorting. Let’s say you have n numbers to sort. How long does the sort take as n increases? This depends on the algorithm used. For example, nobody uses a ‘bubble sort’, as this sorts by repeatedly comparing adjacent numbers and swapping them, allowing the largest to bubble up to the top; it then repeats the pass for the next-largest. Effectively, you’re doing about n² comparisons. Using Big O, we say the bubble sort is O(n²). There are far more efficient algorithms like powersort or quicksort, which are O(n×log₂(n)). Practically, this means a lot. If you’re sorting a million items, then you’d rather make 20 million comparisons than one thousand billion comparisons. The picture above contrasts linear growth O(n) in red, quadratic growth O(n²) in blue, and O(n×log(n)) in green. If you’re keen, play with Desmos to make similar graphs.1
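If you’d rather count than squint at graphs, here is a rough sketch (with an arbitrary n of 2,000) that tallies the comparisons a bubble sort actually makes and contrasts them with the n×log₂(n) figure an efficient sort needs:

```python
import math
import random

def bubble_sort_comparisons(items):
    """Classic bubble sort; returns the number of comparisons made (~n²/2)."""
    data = list(items)
    comparisons = 0
    for i in range(len(data)):
        for j in range(len(data) - 1 - i):
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return comparisons

n = 2_000
numbers = [random.random() for _ in range(n)]
print(bubble_sort_comparisons(numbers))   # ~2,000,000 comparisons
print(round(n * math.log2(n)))            # ~21,900: what an O(n×log₂(n)) sort needs
```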
Logic
The following exploration is a bit more than you need for this post, but will be useful in later posts. Skim it if you wish. When I started programming in 1978, one of the first things that I found both fascinating and a bit intimidating was “how to do logic”. There are various levels, of course.
Near the bottom, we have the propositional calculus, which is a form of Boolean algebra. Despite its polysyllabic name, it’s actually pretty simple, as (unlike reality) it just deals with things that are true or false. Okay, for convenience, let’s call it the ‘PC’.
Unsurprisingly, the PC is the basic logic that powers modern digital computers. It’s so simple that we can build up everything from just a single type of logic gate. Take the Apollo Guidance Computer pictured above, on the right. Conspiracy theorists aside, this guided humans to the Moon. It used about 2800 integrated circuits that ran on NOR gates.
We’ll get to what ‘NOR’ means in a moment. First, more familiar terms: AND, OR and NOT. The logic of a logic gate is that it takes inputs (commonly 2) and generates a single output:
AND: ∧ (but also &, or just “.”—the terminology is confusing). Here, x ∧ y means that to produce a true output, both inputs x and y must be true; otherwise the result is false.
OR: ∨ (but also + or ∥). For x ∨ y, if either input is true, or indeed both are, then the result is true.
NOT: ¬ (“negation”; also ~, ! or even ′). This swaps a value around: true becomes false, and false becomes true.
To make our discomfort complete, there are even more symbols: ⊕ means “exclusive or” (XOR), which translates to “either but not both”; ⇒ means “implies” (with → and ⊃ also acceptable); and ⇔ is used to say “implies and is implied by”, aka ↔ aka ≡ aka “biconditional”. Then we have extra symbols for true and false (⊤ and ⊥) although it’s easier to type T and F or even just 1 and 0.
You’ll likely agree that’s all pretty intimidating, especially as all of this logic can be written using combinations of NOR (¬∨, aka ⊽ aka ↓) which you now understand means NOT applied to the output of OR.
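If you want to convince yourself that NOR really is enough, here is a small sketch that builds NOT, OR, AND and XOR out of a single NOR function and checks every combination of inputs against Python’s built-in logic:

```python
def NOR(x: bool, y: bool) -> bool:
    return not (x or y)

def NOT(x: bool) -> bool:
    return NOR(x, x)                      # ¬x  =  x ↓ x

def OR(x: bool, y: bool) -> bool:
    return NOT(NOR(x, y))                 # x ∨ y  =  ¬(x ↓ y)

def AND(x: bool, y: bool) -> bool:
    return NOR(NOT(x), NOT(y))            # x ∧ y  =  ¬x ↓ ¬y

def XOR(x: bool, y: bool) -> bool:
    return AND(OR(x, y), NOT(AND(x, y)))  # either but not both

# Check every combination against Python's own Boolean operators.
for x in (False, True):
    for y in (False, True):
        assert AND(x, y) == (x and y)
        assert OR(x, y) == (x or y)
        assert XOR(x, y) == (x != y)
assert NOT(True) is False and NOT(False) is True
```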
This in turn pales into insignificance when you move up to the level of the predicate calculus, aka ‘first-order logic’. Here we meet ∃ and ∀, which respectively mean “there exists” and “for all”. But I’m not going to go there.2 Yet.
Learning from winter
The term “artificial intelligence” was first used in 1955 by John McCarthy, when he invited other brilliant AI pioneers (Marvin Minsky and Claude Shannon, who wrote the most important master’s thesis ever) to a conference. It all seemed so easy.
“Within a generation, I am convinced, few compartments of intellect will remain outside the machine’s realm—the problems of creating “artificial intelligence” will be substantially solved.”
Marvin Minsky, 1967.
AI has a long history of setbacks, where a prediction of “We’ll have proper AI within n years” was repeatedly followed within a few years by dashed hopes. So far, there have been two major AI ‘winters’, where a period of hype was succeeded by a let-down and a freezing of funds. These occurred in the 1970s and early 1990s. Given the current enthusiasm for ‘AI’, it’s wise to take a brief look at previous winters.3
There are several themes used to ‘do AI’. Broadly, I’d sketch these as:
Approaches that use rigid classification systems (ontologies)
Abstract manipulation of symbols using the propositional or predicate calculus
Using Bayes’ theorem
Using neural nets, which are simple, idealised models of how human nerve cells might work
Storing, retrieving and using structured information (joined-up data)
Model-based systems that create an internal ‘world model’, or ‘microworld’, like SHRDLU, which was manipulating blocks back in 1971.
Some have found it convenient to break these into two simplistic categories: on the one hand you have approaches that use logic and ‘knowledge representation’; on the other is the ‘connectionist’ approach epitomised by neural nets. You can immediately work out that most components I’ve listed are not mutually exclusive. For example, you might try to join a neural net to a traditional database. Symbol manipulation might use the propositional calculus, the predicate calculus, or embrace the continuity of Bayes’ theorem. Traditionally, however, computer scientists have inclined towards simple magic.
For example, in 1984 Douglas Lenat started an initiative called ‘Cyc’, which aimed to describe how the world works as an ontology. This struggles to learn. It’s inflexible. The Web Ontology Language (OWL 2) is a similar and more insidious attempt at ontology that unfortunately cripples itself by restricting its logic to a decidable description logic, well short of full first-order logic.4
There have been many, many failures. Initial hopes for easy language translation plummeted in 1966 with the ALPAC report that explained in small words how tricky machine translation is. Early enthusiasm for neural nets evaporated in 1969 when it was shown that something as simple as an XOR was beyond the capability of a ‘single layer perceptron’. And in 1973 the UK Lighthill report picked up the ‘grandiose objectives’ of the AI community and bashed them against a cliff. This may well have precipitated the ‘first AI winter’, not just in the UK, but also in the United States. “Thinking is more difficult than you think” is the repeated lesson here.
In the early 1980s, things picked up. Interest focused on “expert systems” that cast aside the temptation to ‘generalise’ AI, and instead tried to excel in specific areas. These centred on the AI language LISP, but by the end of the decade the market collapsed. First, specialised ‘Lisp machines’ were supplanted by more general workstations, but soon afterwards expert systems were abandoned. Particularly painful was how the Japanese “Fifth Generation” project fell from grace. After labouring for a decade to replicate human reasoning and language, understand pictures and translate languages, they folded.
It’s instructive to learn from this second winter. After initial hype, people worked out that expert systems had many problems. They were expensive to maintain, particularly because they didn’t learn well, and tended to make silly mistakes. This is not surprising: as we’ve already noted, ontic assumptions don’t gel well with reality.
So it seems wise to scrutinise recent hype for cost blowouts and silly mistakes. We should be particularly vigilant when it comes to efficiency, and costs of learning and updates. We’ll do this. But first, what are LLMs and how do they work?
LLMs work by association
We’ll use LLMs as a specific representation of AI because a couple of years ago, they changed its face. We know there’s more to AI, but LLMs are now attracting the big bucks, and LLM-adjacent tech underpins most of the recent, truly remarkable advances that have been made. Less than a decade ago, if you’d told me you could program a computer to:
Recognise a species of bird or a flower
Compose a plausible picture based on a description
Parse complex text from, say, a scientific journal
Generate a complex text about a political problem
Write a multi-line computer program using just a general description of what you want
... then I’d have laughed at you. These are now all commonplace, thanks largely to the mechanism that powers LLMs. LLMs are based on neural nets (NNs). But we’ve learnt a few tricks since the first primitive NNs fell over and failed to perform in the late 1960s.
Effectively what we are doing with LLMs is using the associative structure of the inputs to predict “the next word” in the output. If you know everything Shakespeare wrote (and have read a lot of literature besides), you can have a decent stab at completing something produced by Shakespeare, or even predicting (say) the first word in a sonnet. You can also make the model creative: simply introduce a random component called the ‘temperature’ which ranges from zero, which is staid, boring and repeatable, to one, which is wild and silly. The details can be a bit tricky, though.
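To make ‘temperature’ concrete, here is a minimal sketch of temperature-scaled sampling. The toy vocabulary and scores are invented; a real LLM does the same sort of thing over tens of thousands of possible tokens:

```python
import math
import random

def sample_next_word(scores: dict, temperature: float) -> str:
    """Turn raw model scores into probabilities and sample one word.
    Lower temperature sharpens the distribution; higher temperature flattens it."""
    t = max(temperature, 1e-6)                       # avoid dividing by zero
    weights = {w: math.exp(s / t) for w, s in scores.items()}
    r = random.random() * sum(weights.values())
    for word, weight in weights.items():
        r -= weight
        if r <= 0:
            return word
    return word

# Toy scores for the next word after “Shall I compare thee to a summer’s ...”
scores = {"day": 5.0, "breeze": 3.0, "lobster": 0.5}
print(sample_next_word(scores, temperature=0.1))     # almost always "day"
print(sample_next_word(scores, temperature=1.0))     # occasionally something sillier
```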
An important advance was the discovery of how to use gradient descent well. Let’s assume that you want to find the lowest point in a landscape. You’d go downhill, right? But the problem is that you may end up in a local hollow, rather than at the lowest point, especially if you can’t see the big picture and have to rely on the local shape of the landscape.
Now consider a landscape with a lot of added noise—for example mist and tall trees prevent you from seeing more than a few metres. Next, do this in a huge number of dimensions! This is not an easy problem—but several ideas have made it more tractable. The main trick here is to use something called backpropagation (backprop) to descend the gradient. It’s an efficient way of sussing out the local slopes.
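Stripped to its bones, the ‘step downhill’ idea looks like this. It’s a sketch of plain gradient descent on a made-up one-dimensional landscape; backprop’s job in a real network is to supply that gradient for billions of parameters at once:

```python
def descend(gradient, x: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Plain gradient descent: repeatedly take a small step downhill."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

def grad_f(x: float) -> float:
    """Slope of the made-up landscape f(x) = (x - 3)², which is 2(x - 3)."""
    return 2 * (x - 3)

print(descend(grad_f, x=10.0))   # converges towards the minimum at x = 3
```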
There was a lot of enthusiasm when practical ways of doing this were discovered, but things didn’t move fast. People argued about the best approach, which often seemed to depend on complex and inefficient feedback mechanisms. In about 2012, computer scientists working on computer vision showed that a lot of the hassles with feedback could be tamed using feed-forward connections resembling the wiring of the visual cortex of animals: convolutional neural networks (CNNs).
Then, in 2017, a single paper changed everything. This was Attention is all you need, written by Google researchers Ashish Vaswani and his colleagues. In contrast to earlier models, the ‘Transformer’ has a peculiar feed-forward mechanism called ‘attention’.
Effectively, data items are input and identified, globally weighted against one another, and the result stored. Then, when we want an answer, we can use those weightings to fairly rapidly produce matching output. Their general schema is pictured above. This all comes down to linear algebra (matrix multiplication), which is generally quite expensive from an energy point of view.
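For the curious, here is a minimal NumPy sketch of the scaled dot-product attention at the Transformer’s core, ignoring the multi-head and masking details: every token is scored against every other token (an n×n matrix, which is where the quadratic cost we meet below comes from), and the resulting weights mix the values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ/√d)·V, as in Vaswani et al. (2017)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # every token scored against every other: n×n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax, row by row
    return weights @ V                              # weighted mixture of the values

n, d = 6, 8                                         # six toy tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 8)
```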
Transformers are powerful
The Transformer has many strengths: it’s relatively easily trained, effective, and can be ‘parallelised’ using multiple computer processors. This has translated into excellent language translation; profound improvements in image recognition and generation of speech, sounds, stills and video; all of that list from bird-recognition to programming, and a lot more besides.
The key idea here is that we make weighted associations between words or word fragments across the entire input data set. An important insight is that as the number of tokens increases, so the amount of training increases, and this is O(n²). That’s expensive—like a ginormous bubble sort, but the argument is that this is pretty much a one-off. Actually interrogating the model is less costly.
Practically, if you’re to obtain something like the ‘smarts’ of ChatGPT, you also need to wrap some software around the Transformer core. This often involves large numbers of people trimming the worst excesses of what the training data contains—bomb kits and revenge porn—and making it almost sycophantically friendly instead.
The cost of a human thought
Let’s now move on to a brain-based benchmark. How much energy does a thinking brain burn? The human brain is one of the most complex arrangements of matter in the known universe. It has round about 60 to 100 billion neurones, of which perhaps one fifth are in the crinkly rind of the brain we call the cortex.
What is truly astonishing is that all of this runs on just 20 watts—about a third of the power output of a heated towel rail. Even when we’re thinking furiously about a problem, the power consumption does not increase much. Perhaps 10%.
Excitable tissue like nerves conducts impulses by depolarisation. This is a complex process where ions flow into nerves, and are then pumped out again. Pumping costs energy. Most energy is used at the tiny gaps between nerve cells (synapses), where impulses jump the gap using chemical signals called neurotransmitters.
We’ve even calculated the cost of transmitting one bit of information across a single synapse: 24,000 ATP molecules per bit at a common firing rate of about 4 Hz. A single ATP molecule, when hydrolysed to ADP, releases about 5×10⁻²⁰ joules. So that single bit of information processing requires just 1.2 femtojoules!5
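The arithmetic, for anyone who wants to check it:

```python
atp_per_bit = 24_000           # ATP molecules hydrolysed per bit across one synapse
joules_per_atp = 5e-20         # energy released hydrolysing one ATP to ADP

print(atp_per_bit * joules_per_atp)   # 1.2e-15 J, i.e. about 1.2 femtojoules per bit
```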
How (not) to train your LLM
We chose the brain as our benchmark for examining LLM efficiency, but there’s a problem that makes a lot of the preceding section irrelevant. It’s a fundamental property of LLMs:
There is currently no way to put a single new ‘thought’ into an LLM, apart from re-training the whole damn thing.
If you ask LLM experts, they will say something like “This is an active area of research”. It’s an active area of research because if you tweak a single bit in that LLM, it is prone to ‘catastrophic forgetting’. We have no clear “map of how things are stored”, so we currently have no way of introducing any new information. In other words, once set up, an LLM cannot learn!
Because of this failing, smart people have come up with a large number of ways of tweaking the embedding of the LLM. Let’s explore this with a single question. Today I open up Google Gemini, DeepSeek, ChatGPT and (unwisely, the fascist) Grok, and ask something I asked eight months ago:
Can you think of a mammal that swims in the ocean and wears trousers?
Here are the responses I got today:
Gemini: “This is a fun, riddle-like question! The answer is a sea otter.
[blah blah]
ChatGPT: “Why yes… the answer must be: a pair of pant-seals.”
[etc]
Grok: “No mammal that swims in the ocean naturally wears trousers, as clothing isn't part of any marine mammal's biology or behavior. However, for a whimsical answer, imagine a sea otter dressing up for a dive—they might slip into some tiny trousers for fun, though they'd likely float right out of them!
DeepSeek: “This sounds like a playful riddle! The answer is likely a "whale" wearing "kilt" (a type of trousers/skirt)
[then goes on to discuss sea otters]
You can deduce several things from the above:
Current models seem to struggle to generalise well. They can’t work out the simple logic that humans are mammals, humans can swim in the ocean, and humans can wear trousers, so the answer is “a human” (a chain so trivial that it fits in the short sketch after this list).
The dressing up of the LLM hasn’t worked at all here. Twee humour is not a fix for the internal defect.
All of the models have similar defects, likely because they are all trained on pretty much the same data, and all have the same internal failings. The second ‘L’ in ‘LLM’ stands for language, and not logic.
You can try to teach the LLM about where it failed—but it won’t remember. It says “Ah! yes …” and then forgets.
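For contrast, the ‘reasoning’ the models fumble fits in a few lines. This is a toy sketch with a deliberately tiny, made-up knowledge base, not a proposal for fixing LLMs:

```python
# Check each candidate's attributes against what the riddle asks for.
ATTRIBUTES = {
    "sea otter": {"mammal", "swims in the ocean"},
    "whale":     {"mammal", "swims in the ocean"},
    "human":     {"mammal", "swims in the ocean", "wears trousers"},
}

wanted = {"mammal", "swims in the ocean", "wears trousers"}
print([name for name, attrs in ATTRIBUTES.items() if wanted <= attrs])   # ['human']
```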
The logical conclusion
If we are to compare the cost of a human remembering a single new fact with the cost of an LLM doing the same, we need to be fair. On the LLM side, this is easy. Based on what we learnt in the preceding section, we need to ask “What is the energy cost of training an entire LLM like ChatGPT 4o?”
First though, what is a reasonable brain-based comparator? It’s tempting simply to multiply 1.2 femtojoules by an arbitrary factor like a billion or a trillion, but to me this seems unfair. So I had an idea. Humans need sleep for normal brain function. So let’s give an entire human brain a full eight hours of sleep!6 About 20 watts for 8 hours, or 20×3600×8 = 576 kilojoules. That’s very generous.7
But what are the actual LLM numbers? The energy cost of training GPT-3 (on about 300 billion tokens) was about 1.287 million kilowatt hours, or about 4.6 trillion joules. So for GPT-3, the relative cost of learning something new is about eight million to one, compared with a human brain. Nearly seven orders of magnitude.
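Here is that arithmetic laid out, using the figures quoted above (1.287 million kWh for GPT-3, and our generous 576 kJ sleep budget for the brain):

```python
brain_power_watts = 20
sleep_seconds = 8 * 3600
brain_budget_joules = brain_power_watts * sleep_seconds   # 576,000 J = 576 kJ

gpt3_training_kwh = 1.287e6
gpt3_training_joules = gpt3_training_kwh * 3.6e6          # ~4.6 trillion joules

print(gpt3_training_joules / brain_budget_joules)         # ~8.0e6: about eight million to one
```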
All of this is still pretty conservative, for several reasons. We don’t know the costs of training GPT-4 on 13 trillion tokens, but from above we know that the cost of training scales as O(n²).
Next up, nobody is actually going to re-train the LLM often, because of the cost. They’ll just tinker round the edges, resulting in poorly integrated data that are not part of the model. We also haven’t examined how efficiently an LLM uses new information during training, compared with how the brain does. We’ll do so in a subsequent post. (Spoiler: the brain is orders of magnitude better.) There are many other factors that we still haven’t considered, particularly related to the higher levels of Pearl’s hierarchy. We also haven’t looked at other implications of having to continually tinker around the edges, to compensate for defects that are burnt into the LLM.
But the bottom line is simple. LLMs are searingly inefficient when it comes to the energy costs of fixing bad information. You can work out that even if we make LLM training a thousand times more efficient, they will still be rubbish when learning a single new fact. Short of a breakthrough that fixes catastrophic forgetting, we have a vast, unsolved problem.
In my next post, I’ll look at some of the other ‘AI’ problems I’ve merely hinted at.
My 2c, Dr Jo.
Top image was generated by Ideogram in 2024 in response to “A sea otter wearing pants”. NOR gate image is from GeeksForGeeks.
With some algorithms, things can really blow out: for example, finding the exact solution to the travelling salesman problem is O(kⁿ), i.e. exponential, in its best time complexity; and a brute-force solution is even more expensive, O(n!).
The predicate calculus allows you to symbolically represent concepts like “For all x, if x is a human, then x is mortal”, which you can’t really do in the propositional calculus.
It also struggles with negation.
Turning this around, we can crudely estimate that the 20 joules per second used by the brain corresponds to 1.6×10¹⁶ bits (roughly 16 petabits) of information processed per second, but this is of course an upper bound. Other estimates of “brain bandwidth” are several orders of magnitude lower.
Some have claimed that e.g. REM sleep is important for consolidating memories, but the evidence seems rather weak. As I’ve mentioned in a previous post, recent theories about the role of the glymphatic system in sleep are controversial. But we do need sleep, so let’s factor it in. Conservatively. The traditional ‘8 hours’ varies a lot from person to person.
You can see that if—as seems likely—changing that single memory takes just milli- or microjoules, then we’ve been generous to the tune of an order of magnitude of orders of magnitude! We can hardly be kinder than that.