Forty is an interesting number. It seems to show up throughout literature, be it Ali Baba and his forty thieves, the Israelites wandering for forty years in the wilderness, or Jesus fasting for forty days and nights in the desert. One thousand is a similarly interesting number. In literature we have Helen of Troy’s face that “launched a thousand ships” and Juliet wishing Romeo “a thousand times good night”. Mathematically, one thousand and forty are separated by 960, but in literature they are very close: both represent an arbitrarily large quantity.
It’s not just literature that fails to make a meaningful distinction between forty and one thousand. Research has shown that animals[1], and even infant humans[2], can distinguish one from two, and two from three, but that as numbers become larger, their ability to tell specific quantities apart fades. To a dog, one hundred bits of kibble is just as good as one hundred and one, or one hundred and ten. Intuitively, this makes sense. Even a highly trained mathematician is unlikely to count the individual nuts in a bowl offered at a party before reaching in and grabbing a handful.
What does any of this have to do with AI? In my previous post, I explored how the technology that eventually led to the current batch of Large Language Models (LLMs) came to emphasize the mapping of language (and other forms of input) onto concepts. Appreciating this is important not only for understanding what LLMs are good for, but also for understanding what they are very much not good for. It turns out, LLMs kind of suck at math (and, at a deeper level, logic…more on that in a bit).
To understand why math is hard for LLMs, consider the following two statements:
Ugh! Math is so hard. There’s 1000s of rules I have to remember!
Ugh! Math is so hard. I have 1001 rules I have to remember!
Here we have two statements that express the exact same concept using two different numbers. If an LLM were to map both of these statements into the same location in concept space, it would be absolutely correct to do so. But now consider these next two statements:
My math teacher asked me if 1000 is even.
My math teacher asked me if 1001 is even.
Here, the concept represented by “1000” is very much distinct from the concept represented by “1001”, and confusing the two would lead to an incorrect answer. The problem is that humans tend to use numbers to represent both amounts and exact quantities. That is, we can say “40 thieves”, “99 problems”, or “1001 rules” either to express “some arbitrarily large amount of a thing” or to express “a counted quantity of a thing”. What makes working with numbers complicated for LLMs is the hidden meaning in the second. What is this hidden meaning? The answer is: counting.
Inductive reasoning is extremely powerful. When it comes to math, using inductive reasoning allows us to answer a question such as “what is 40?” rather elegantly: 40 is one more than 39. Now, that may not seem like an earth-shattering revelation, but consider what thinking about numbers in this way allows us to do. If we start with just two numbers, one and zero, and then say that one more than a number is the next number and zero more than a number is just the same number, we can not only define a list of numbers that extends forever off into infinity, but we can also define all of the rules of arithmetic.
How? Consider that 4 is 1 more than 3, 3 is 1 more than 2, 2 is 1 more than 1, and 1 is 1 more than 0. This means that we can express “4” as “1 more than 1 more than 1 more than 1 more than 0”. If we, likewise, expand “2” as “1 more than 1 more than 0”, we can see that addition is just the process of combining two numbers’ lists of “1 more than”s. So “4 + 2” is “1 more than 1 more than 1 more than 1 more than 1 more than 1 more than 0”, which is “6”! Similarly, we can define subtraction as laying two numbers’ “1 more than” lists side by side and crossing out “1 more than”s from both until one list is reduced to a bare “0”. The list that remains is the answer. So “4 - 2” leaves “1 more than 1 more than 0”, which is “2”. We can extend this system of working with numbers to define multiplication, division, and exponentiation. This approach to basic math is closely related to Church Encoding, named after the mathematician and logician Alonzo Church.
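The “1 more than” construction above can be sketched in a few lines of code. This is my own minimal illustration (the names `succ`, `add`, and so on are mine, not from the post or any library): a number is either zero or “1 more than” another number, and addition and subtraction work exactly as described.

```python
# A minimal sketch of the "1 more than" construction: a number is
# either ZERO or ("succ", n), i.e. "1 more than n".

ZERO = None

def succ(n):
    """Return '1 more than n'."""
    return ("succ", n)

def from_int(k):
    """Build up k by applying '1 more than' k times to zero."""
    n = ZERO
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    """Count the '1 more than' wrappers to recover an ordinary int."""
    count = 0
    while n is not None:
        count += 1
        n = n[1]
    return count

def add(a, b):
    """Addition: graft a's list of '1 more than's onto b."""
    return b if a is None else succ(add(a[1], b))

def sub(a, b):
    """Subtraction: cross out '1 more than's from both lists
    until one is reduced to bare zero; what's left is the answer."""
    while b is not None and a is not None:
        a, b = a[1], b[1]
    return a

four, two = from_int(4), from_int(2)
print(to_int(add(four, two)))  # 6
print(to_int(sub(four, two)))  # 2
```

True Church encoding represents numbers as functions rather than nested data, but the idea is the same: the numeral *is* its own count.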
What Church Encoding of numerals can teach us about math, particularly as it relates to LLMs, is that numbers and basic arithmetic are, fundamentally, systems of logic, and that logic is formed of concepts that map to concepts. It’s this second insight that is critical to understanding why LLMs are bad at math, and logic in general. Insofar as LLMs map language (and sound, images, videos, etc.) to a concept space, and then navigate around within that space, they are very good at holding basic conversations with us humans. But concept space is not flat. Logic is like a subway system layered beneath the surface of concept space. So long as LLMs navigate only along the surface, they will easily be led astray, taking spurious paths when attempting to answer questions that involve logic.
This affects more than just math. Whether we realize it or not, a significant amount of our day-to-day lives involves the use and application of logic. Take the simple example of family relationships. What does it mean to be someone’s grandparent? If your child has a child, then you are that person’s grandparent. As simple as it seems, that statement describes a system of logic. Of course the concept of how grandparents and grandchildren are related is a common one that can be understood without working through the logic of the relationship every time. This actually gives us a useful way to probe LLMs and understand their difficulty with logic.
We can present an LLM with the following problem statement:
Person A is Person B’s child if Person B is Person A’s parent. Person A is Person B’s grandchild if Person B is Person A’s parent’s parent.
Alice is Bob’s parent and Claire’s child. What is Bob’s relationship to Claire?
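For a program with the rule written down explicitly, this problem is purely mechanical. Here is a toy sketch of my own (the dictionary encoding and the helper function are illustrative, not from the post): the grandchild rule is just “follow the parent relation twice”.

```python
# The facts: a child -> parent mapping.
parent_of = {"Bob": "Alice", "Alice": "Claire"}

def is_grandchild(a, b):
    """A is B's grandchild if B is A's parent's parent."""
    return parent_of.get(parent_of.get(a)) == b

print(is_grandchild("Bob", "Claire"))  # True
```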
Chances are good that the LLM will come up with the right answer, but is this due to luck? An understanding of the concepts? Or the ability to evaluate the logic as presented? Well, if we consider one example session with ChatGPT:
Let’s break it down step by step:
Alice is Bob’s parent → So Bob is Alice’s child.
Alice is Claire’s child → So Claire is Alice’s parent.
This gives us the family chain:
Claire → Alice → Bob
Since Claire is Alice’s parent, and Alice is Bob’s parent, Claire is Bob’s parent’s parent.
By definition, that makes Claire Bob’s grandparent.
Therefore:
Bob is Claire’s grandchild.
…it’s fairly obvious that ChatGPT is leaning heavily on the concepts presented, because it mentions “grandparent”, a concept which we did not introduce in the statement of the problem! The only way ChatGPT could have known that Claire is Bob’s grandparent is that it has trained on the concept and recognized it as the inverse of “grandchild”.
What happens if we remove the LLM’s ability to rely on the concepts we’re presenting and force it to work by logic alone? We can do this by simply switching out the familiar terms with completely made-up words:
Person A is Person B’s snuzzlement if Person B is Person A’s quibbleshank. Person A is Person B’s flinderbond if Person B is Person A’s quibbleshank’s quibbleshank.
Alice is Bob’s quibbleshank and Claire’s snuzzlement. What is Bob’s relationship to Claire?
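To a mechanical rule-applier, this renaming changes nothing: the relation names are opaque labels, and only the structure of the rules matters. A toy sketch of my own (not a real theorem prover) makes the point:

```python
def second_degree(rel, a):
    """A is B's flinderbond if B is A's quibbleshank's quibbleshank,
    i.e. follow the relation twice -- whatever the relation is called."""
    return rel.get(rel.get(a))

# "Alice is Bob's quibbleshank", and (unfolding the snuzzlement
# definition) "Claire is Alice's quibbleshank":
quibbleshank_of = {"Bob": "Alice", "Alice": "Claire"}

print(second_degree(quibbleshank_of, "Bob"))  # Claire
```

The function is identical in shape to the grandparent rule; swapping in nonsense words costs a program nothing, which is exactly what makes this a good probe of whether an LLM is reasoning or pattern-matching.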
Now the only way an LLM can arrive at the correct answer is if it can analyze the logic of the relationships and apply that logic to the question presented (or it can guess and get lucky). Early in the development of LLMs, solving this sort of problem was effectively impossible. Today, if you present this problem to ChatGPT, it will arrive at the correct answer more often than not:
Let’s translate the relationships into simpler terms.
Step 1: Understand the definitions
Snuzzlement is the inverse of quibbleshank:
If A is B’s snuzzlement, then B is A’s quibbleshank.
Flinderbond is the inverse of “double quibbleshank”:
If A is B’s flinderbond, then B is A’s quibbleshank’s quibbleshank.
Step 2: Use the given information
Alice is Bob’s quibbleshank
→ Bob → Alice
Alice is Claire’s snuzzlement
Since snuzzlement is the inverse of quibbleshank:
→ Claire is Alice’s quibbleshank
→ Alice → Claire
Step 3: Chain the relationships
We now have:
Bob → Alice
Alice → Claire
So:
Bob’s quibbleshank’s quibbleshank = Claire
In other words:
Claire is Bob’s double-quibbleshank.
Step 4: Apply the definition of flinderbond
A is B’s flinderbond if B is A’s double-quibbleshank.
Since Claire is Bob’s double-quibbleshank:
Bob is Claire’s flinderbond.
Final Answer:
Bob is Claire’s flinderbond.
If you compare these two ChatGPT responses, you should notice two things. First, in the scenario with the made-up terms, only those terms appear in the answer, unlike in the first scenario, where the LLM related the commonplace terms to a concept we had not introduced. More revealingly, though, in the second scenario ChatGPT’s answer is considerably longer! While it does, eventually, work things out, it is clearly working hard to do so.
Why is it the case that ChatGPT has to work so much harder on the same problem when we present it in this fashion? And how is it that ChatGPT can work out the answer when we’ve just discussed how LLMs are really bad at logic? The answer to both questions is that ChatGPT, in its modern form, is more than just an LLM. The ways in which ChatGPT, and LLM-based AI systems in general, have evolved in the face of LLMs’ inherent difficulty in dealing with logic is central to how and why it seems like AI has taken over in the last year or so…and key to understanding AI’s one remaining fundamental limitation. More on that next time…