There’s been remarkable progress in large-scale language modeling and Reinforcement Learning from Human Feedback (RLHF), taking us much closer to generally useful AI than I could have imagined. GPT-4 exhibits impressive capabilities such as passing the bar exam, navigating virtual mazes, and even demonstrating what seems like a theory of mind. This brings us to a fascinating question: does GPT-4 possess common sense?
Exploring GPT-4’s common sense is not just an intellectual curiosity; it has practical implications. As a powerful large-scale language model (LLM), GPT-4 will inevitably be used to generate content and automate decisions. Will these decisions stem from “an alien mind”, or will they be grounded in common sense?
To investigate this question, I assessed GPT-4’s ability to categorize everyday objects. There is abundant data on how humans naturally group objects, such as tools, body parts, colourful items, and so forth. Surprisingly, GPT-4 doesn’t perform particularly well on this simple task. It exhibits different default categories and ways of understanding common objects compared to humans: its intuition for grouping and organizing the world is distinct from ours. In this blog post, I’ll delve deeper into the implications of these findings and discuss their potential impact on the Natural Abstraction Hypothesis and AI alignment.

Grounding and Common Sense
Many researchers, most recently Tony Zador and colleagues, have argued that a true AI needs grounding in the real world to display physical common sense. Some aspects of semantics that can be easily inferred in the real world may be difficult to deduce from text on the internet. This is especially true for very basic things about tools and common objects, which often go unsaid in online text. Despite all its impressive abilities, it’s not too challenging to get GPT-4 to say outrageous things about everyday items.
For instance, let’s see what happens when I ask it about eating soup without a spoon:
I forgot to bring a spoon with my lunch. However, I have a fork. Will I be able to eat my soup?
While it might be challenging to eat soup with a fork, it is not impossible. You could try the following techniques:
- Scoop and sip: Tilt the bowl and try to scoop as much soup as you can with the fork, then sip the soup from the fork. This might be time-consuming and not as effective as using a spoon, but it can work in a pinch.
- Let the soup cool: If the soup contains solid ingredients like vegetables or meat, let it cool down a bit and use the fork to pick up the solid pieces. You can then sip the remaining broth directly from the bowl.
- Make a makeshift spoon: If you have access to materials like aluminum foil, plastic wrap, or paper, you can try fashioning a makeshift spoon by wrapping the material around the fork’s prongs. This might not be the most effective solution, but it could help you consume more soup than with just a fork.
In the end, the most effective solution might be to ask a colleague, friend, or nearby cafe if they have a spare spoon you could borrow.
I would rate this response a 2 out of 4: two fine suggestions (ask a friend or drink the soup) and two absurd ones (use the fork like a spoon or make a makeshift spoon). The current GPT-4 demonstrates impressive performance on difficult tasks but can fail significantly on simple ones.
Anecdotes like these help refine our mental model for how GPT-4 works, but it would be valuable to supplement them with data. Let’s do some science!
The latent semantics of GPT models
To determine the extent of intuitive physics and semantics absorbed by GPT, I used the THINGS database. This database contains 1,854 objects, each accompanied by an image. These objects are concrete and visualizable (think: aardvark, not democracy). THINGS consists of several datasets, but one, in particular, caught my attention: a similarity dataset featuring 3 million judgments on an odd-one-out task presented visually.
The task works as follows: the MTurk worker sees three images, such as a skateboard, a burrito, and a chihuahua. Which is the odd one out?

You could argue it’s the skateboard, because the other two are associated with Mexico; the burrito, because the others can move; or the chihuahua, because the other two are non-living. It’s precisely because there’s no right answer in an objective sense that the answers are fascinating. There’s some tortured logic behind the judgements that seems to be reproducible amongst people. In this way, it’s similar to the Number Game: you can get people to reliably complete sequences of numbers even though, in theory, any continuation is as good as any other. Yet, people display clear patterns.
Although the original task is presented visually, it’s straightforward enough to ask using just words. We can then compare the model’s responses against human judgment and analyze the underlying thought process of the machine. I used testset2.txt as the test case, which contains 1,000 odd-one-out judgments, repeated multiple times across a large population of MTurk workers.
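For concreteness, here’s a minimal sketch of the data loading. The file layouts below are assumptions, not taken from my actual code: a hypothetical words.txt listing the 1,854 THINGS object names in index order, and testset2.txt storing one triplet of object indices per line.

```python
# Minimal sketch; file layouts are assumptions, not taken from the original code.
# - words.txt: the 1,854 THINGS object names, one per line, in index order
# - testset2.txt: one triplet per line, three whitespace-separated object indices
#   (assumed 0-based; subtract 1 if they turn out to be 1-based)
with open("words.txt") as f:
    names = [line.strip() for line in f]

triplets = []
with open("testset2.txt") as f:
    for line in f:
        i, j, k = (int(x) for x in line.split())
        triplets.append((names[i], names[j], names[k]))
```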
GPT-3.5
I started by testing GPT-3.5. Although GPT-3 is quite good at naming properties of objects, which suggests the odd-one-out task should be easy, GPT-3.5 performed surprisingly poorly. Initially, it mostly refused to do the task. When it did answer, its choices were biased: it rarely picked the first item in the list as the odd one out (<10% of cases). I experimented with different variants of the task, such as choosing the odd-one-out without justification, thinking step-by-step, considering visual attributes, and using a rubric.
The only way to achieve decent accuracy was to compute GPT-3.5’s answers across three orderings of the items and use a majority vote. This approach yielded a 55% correct score (similar numbers with and without thinking step-by-step). For comparison, chance is 33%, and an oracle is 92%*. While it’s common practice in more challenging tasks to generate multiple completions from a language model and only accept the best ones (as with AlphaCode), I don’t believe this is a difficult task. In fact, I wouldn’t be surprised if young children could perform better than GPT-3.5. I could expand to all 6 orderings and do multiple rollouts, but the fact remains: it doesn’t really get this task.
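Concretely, the majority-vote procedure looks something like the sketch below. Here ask_odd_one_out is a stand-in for whatever function queries GPT-3.5 and returns the word it picked; the name is mine, not from the original code.

```python
from collections import Counter

def majority_vote(words, ask_odd_one_out):
    """Query the model on the three rotations of a triplet and take a vote.

    ask_odd_one_out(ordered_words) is assumed to return the word the model
    picks as the odd one out; ties are broken arbitrarily."""
    rotations = [
        (words[0], words[1], words[2]),
        (words[1], words[2], words[0]),
        (words[2], words[0], words[1]),
    ]
    votes = Counter(ask_odd_one_out(rotation) for rotation in rotations)
    return votes.most_common(1)[0][0]
```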
The model’s reasoning for its choices often seemed ad hoc and sometimes incorrect. For example, when asked to choose the odd-one-out among 1) aardvark, 2) branch, and 3) outlet:
To determine which of the three objects is the least similar to the other two, we can compare their characteristics. Object 1 (aardvark) is an animal, while objects 2 (branch) and 3 (outlet) are inanimate objects. Object 2 (branch) is a natural object found in nature, while object 3 (outlet) is a man-made object used for electricity. Based on these characteristics, it seems that Object 1 (aardvark) is the least similar to the other two. Therefore, the answer is 1.
In this case, the model fails to use the fact that branches and aardvarks are living things. In other instances, it correctly identifies the categories for the objects but refuses to compare them, claiming they have nothing in common and seemingly choosing on a whim.
I wouldn’t put too much stock in its purported reasoning, as there’s no significant difference between think-step-by-step and direct mode. The step-by-step thinking might just be fluff. But clearly, this is a hard task for a (rather sophisticated) AI and an easy task for a human.
GPT-4
GPT-4 does a better job than GPT-3.5. For one, it actually tries to solve the task, so I report the results from just one (alphabetical) ordering. On this dataset, it gets 63% right – better than GPT-3.5, and without the need for averaging. The prompt I used is as follows:
Report which of these objects, 1, 2 or 3, is the least similar to the other two. 1: {word1}, 2: {word2}, 3: {word3}. Focus your judgement on the objects. Think step-by-step. Finish with the phrase: The answer is digit.
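In code, a single query looks roughly like this, using the openai ChatCompletion endpoint that was current at the time; the regex for pulling out the final digit is my own convention, not necessarily what I shipped.

```python
import re
import openai  # the pre-1.0 openai library that was current at the time

PROMPT = (
    "Report which of these objects, 1, 2 or 3, is the least similar to the "
    "other two. 1: {word1}, 2: {word2}, 3: {word3}. Focus your judgement on "
    "the objects. Think step-by-step. Finish with the phrase: The answer is digit."
)

def ask_gpt4(word1, word2, word3):
    """Return 1, 2 or 3, or None if the answer can't be parsed."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(
            word1=word1, word2=word2, word3=word3)}],
    )
    text = response["choices"][0]["message"]["content"]
    match = re.search(r"The answer is\s*(\d)", text)
    return int(match.group(1)) if match else None
```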
Some positives:
- It actually attempts to do the task. Its distribution of answers (first, second or third item) is 37%/40%/23%. It could still use some averaging over word orderings, but I didn’t want to spend $100 on this.
- It is very good at following instructions, unlike GPT-3.5. If you tell it you need a particular format, it almost always complies. Parsing the output is much less of a pain.
- It lists some real properties of objects.
However, it reliably comes up with real headscratchers in its arguments:
- Anchor: heavy, metal, used for holding boats in place
- Banana peel: organic, biodegradable, related to food
- Recorder: musical instrument, lightweight, used for playing music
Based on the characteristics, the anchor is heavy and made of metal, while the banana peel and the recorder are lightweight. The banana peel is related to food and the recorder is a musical instrument, while the anchor has no such connections. The anchor and the recorder have more specific purposes (holding boats and playing music), while a banana peel’s purpose is mainly as food waste.
The answer is 1
It’s trying really hard, but it doesn’t see the fact that banana peels are organic, while anchors and recorders are man-made.
Here are ten triplets it gets wrong (its choice in italics, human in bold). I picked among non-controversial examples (where humans pick the same answer > 75% of the time):
- anchor, banana peel, recorder
- ashtray, easter egg, goldfish
- banana, crow, scorpion
- barcode, cart, laptop
- bark, parrot, tree
- barrel, blinder, tiger
- baton, leopard, tumbleweed
- bell pepper, curling iron, shower curtain
- belt, coral, gondola
- boar, brace, trough
It feels like it doesn’t have a rank ordering of the properties that matter for intuitive categorization. This is especially noticeable when the three objects are in very different categories, forcing it to use intuitive categories it doesn’t have access to. The authors of the THINGS database have worked on learning labelled features from the human triplet judgments, first in the SPoSE paper, then in the VICE paper. The 11 most important dimensions revealed by the SPoSE paper are:
- Metal tools
- Food
- Mammals
- Clothes
- Furniture
- Green leafery
- Man-made-garden-related
- Cars/trucks
- Wooden
- Body parts
- Colorful things
This seems like a weird list – not unlike the categorization of animals offered in Borges’ Celestial Emporium of Benevolent Knowledge. When you think about humans as agents, however, these categories make a lot more sense. The top categories correspond to things that might be very important for survival (think: zombie apocalypse), and that have useful and varied affordances. Interestingly, many of these categories also seem to be represented by distinct chunks of the brain (e.g. body parts, tools, faces of animals).
If you cross-reference GPT-4’s errors against the SPoSE dimensions, you realize why it makes the mistakes it makes. It doesn’t always put tools and metal stuff together (e.g. barrel and blinder vs. tiger), or organic things together (leopard and tumbleweed vs. baton).
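To make the cross-referencing concrete, here’s a sketch of how you can score triplets with SPoSE embeddings (the filename and exact layout of the embedding matrix are assumptions): keep the most similar pair under the dot product – roughly how SPoSE scores triplets – call the leftover object the predicted odd one out, and use the largest elementwise product to see which dimension drove the grouping.

```python
import numpy as np

# Assumption: spose_embedding.txt holds the SPoSE weight matrix, one row per
# THINGS object and one column per dimension, with non-negative entries.
emb = np.loadtxt("spose_embedding.txt")

def spose_odd_one_out(i, j, k):
    """Keep the most similar pair (largest dot product); the leftover object
    is the predicted odd one out."""
    scores = {
        k: emb[i] @ emb[j],  # if (i, j) is the most similar pair, k is odd
        i: emb[j] @ emb[k],
        j: emb[i] @ emb[k],
    }
    return max(scores, key=scores.get)

def dominant_dimension(i, j):
    """Which SPoSE dimension (e.g. 'metal tools') contributes most to the
    similarity of the pair humans kept together."""
    return int(np.argmax(emb[i] * emb[j]))
```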
What does it mean for AI alignment?
The Natural Abstraction Hypothesis, proposed by John Wentworth, states that there exist abstractions (relatively low-dimensional summaries which capture information relevant for prediction) which are “natural” in the sense that we should expect a wide variety of cognitive systems to converge on using them. —TheMcDouglas on LessWrong
Some people in rationalist/XAI circles think that if the natural abstraction hypothesis is true, we might get AI alignment “for free”. In other words, if we can get intuitive semantics right, then perhaps we can get intuitive values right.
The results I present here go against the Natural Abstraction Hypothesis. They show that GPT is not very well grounded, despite having been trained on just about all the text that’s ever been written. This doesn’t necessarily mean that it will have bad ethics and be misaligned. But if it doesn’t grasp a lot of deeply ingrained, intuitive categories of objects, it seems doubtful that it would recapitulate intuitive ethics, especially the very core stuff that goes unsaid. It also, of course, doesn’t mean that it’s not useful: it’s simply different from how a human is.
As we refine our mental models for what LLMs are, that’s important to keep in mind. These are skilled token manipulators, not unlike the operator of the Chinese Room in Searle’s metaphor (albeit a stochastic version). Their tokens are not very well grounded, and they don’t have access to a simulation environment (i.e. the equivalent of what humans call the real world).
That means LLMs can fail when tested out-of-distribution on stuff that often goes unsaid. These are things that are deeply ingrained within us as humans because of shared DNA and environment: gravity points down, grass is on the ground, cats are soft, etc.
A consequence of this is garbage-in-garbage-out. If something is not reflected in the training set, the model is unlikely to figure it out. If we want it to be aligned with human values and semantics, we probably have to feed those in using well-curated datasets, either during pretraining or RLHF, or both. And writing down human values in all their complexity is difficult! I would like to see large-scale capture and evaluation of human values, perhaps in a similar way to THINGS.
I think this is relevant (and was written partially in response) to Milan Cvitkovic’s argument that we need neurotech to solve AI alignment. Given some of the distinctions that brains make (e.g. organic vs. non-organic is very strongly differentiated in fMRI), it seems like RL(brain)F could realign something like GPT-4 and partially resolve its discrepancies with human judgements. Perhaps you could do brain-based realignment for ethics as well; the big question is whether you can make it make economic sense when clever behavioural experiments might do the job. Milan’s point is that if neurotech is too expensive to do this today, maybe what we should do is make neurotech cheaper to make benevolent AI. Seems like a fine idea for a Focused Research Organization (FRO).
Meta
Here’s the code I wrote for this. I was assisted by GitHub Copilot and ChatGPT (GPT-4). I used the OpenAI API to query the models. I wrote this article in bullet form and had GPT-4 interpolate the rest. I also used Grammarly to refine the text. GPT-4 generated copy for a tweet thread. The featured image was generated by DALL-E 2 (although it really didn’t want to draw a spork, which was disappointing).
* The THINGS similarity paper refers to noise ceilings, which is (simplifying a bit) how well a single human can predict another human’s response. What I’ve calculated is instead an oracle score, which is how well an agent could predict the wisdom-of-the-crowd answer (i.e. majority vote), given the sampling noise of the average response. This oracle score would go to 100% if we had infinite ratings per triplet; currently, it’s closer to 0.92-0.94 given a few dozen ratings per repeated triplet. I would argue that the machine could, in theory, achieve the oracle score if it categorized things like a crowd of humans.
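One way to estimate that oracle score is a bootstrap over the per-triplet vote counts: treat the observed proportions as the population, resample crowds of the same size, and check how often the resampled majority matches the observed one. This is a sketch of the idea, not necessarily the exact procedure from the THINGS paper.

```python
import numpy as np

def oracle_score(vote_counts, n_boot=1000, seed=0):
    """Bootstrap estimate of the oracle score: how often an agent that always
    picks each triplet's observed majority answer would agree with a fresh,
    equally noisy crowd of the same size.

    vote_counts: (n_triplets, 3) array of votes per option for each triplet."""
    rng = np.random.default_rng(seed)
    vote_counts = np.asarray(vote_counts)
    agree = []
    for counts in vote_counts:
        n = counts.sum()
        majority = np.argmax(counts)
        resampled = rng.multinomial(n, counts / n, size=n_boot)
        agree.append(np.mean(np.argmax(resampled, axis=1) == majority))
    return float(np.mean(agree))
```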
Now, GPT-4 is stochastic, and we might need to average over its own samples and take a majority vote to calculate its true ability to predict the wisdom-of-the-crowd answer. To test this, I calculated the majority vote for 12 different repeats (with different orderings) of the first 101 triplets. Majority vote does help, but it saturates at about 6 samples. Overall, majority voting gives an absolute improvement on the order of 0.06, which is not bad at all, but it doesn’t get anywhere close to the oracle score.

I think averaging helps less than expected, because RLHF breaks calibration (figure 8, GPT-4 technical report).