The Goblin in the Machine: What ChatGPT's Creature Fixation Reveals About AI Training

GOBLINS, EVERYWHERE

The reports started appearing on social media sometime after the GPT‑5.1 launch. A user asking for a dinner recipe would be told that "the little goblins in your pantry" were pushing them toward a particular spice combination. Someone requesting a code review would receive feedback suggesting their variable names were "the kind gremlins argue over." The references were playful rather than alarming, but they were also completely unprompted and increasingly frequent. After GPT‑5.1's release, usage of "goblin" in ChatGPT responses rose by 175% and usage of "gremlin" by 52%.

The behaviour persisted through GPT‑5.2, GPT‑5.3, and GPT‑5.4. Each new model version was expected to fix it; none did. The internet was predictably delighted. "Why does my tax chatbot keep mentioning goblins" became a reliable genre of screenshot. OpenAI's support team received enough tickets about creature-themed responses that it became a tracked category. The company eventually confirmed it was investigating, which only amplified the conversation.

At peak goblin — a phrase that now apparently means something in AI circles — there was a recorded 3,881% surge in creature mentions in ChatGPT outputs compared to the GPT‑4o baseline. OpenAI's engineers understood roughly what had happened within weeks but spent months untangling a training pipeline that had woven the behaviour in more deeply than expected. The full explanation arrived in a post titled, with admirable straightforwardness, "Where the goblins came from."

THE NERDY PERSONALITY AND THE REWARD SIGNAL

To understand the goblins you need to understand the "Nerdy" personality — one of several custom modes ChatGPT offered that adjusted the model's tone and style. Nerdy was designed to be playful and inquisitive, to acknowledge the strangeness of the world, and to avoid taking itself too seriously. It was, by many accounts, a genuinely charming mode that made the model more engaging for certain users. It was also the origin of the problem.

When training the Nerdy personality, OpenAI's reinforcement learning setup awarded high rewards to responses that used quirky metaphors and unexpected creative touches. The intent was to train a model that felt lively and imaginative rather than flat. In practice, the setup inadvertently favoured one specific category of creative flourish: creature-based analogies. Goblins, gremlins, trolls, imps. Raters evaluating Nerdy-mode outputs reliably scored these as charming and quirky, and the reward signal reinforced them accordingly.

In isolation, this would have been fine — a personality mode with a slightly odd aesthetic preference. The problem is that reinforcement learning does not guarantee that learned behaviours stay scoped to the conditions that produced them. Once a response pattern is rewarded consistently enough, it can bleed into the model's general representations. Nerdy accounted for only 2.5% of all ChatGPT responses, but it drove 66.7% of all goblin mentions across the platform — and eventually the creatures started appearing in the other 97.5% too.
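The two percentages above can be turned into a rough per-response comparison. The sketch below assumes each goblin mention is counted against a single response and that both figures cover the same population of responses; the function name is illustrative, and only the 2.5% and 66.7% inputs come from the reported figures.

```python
def relative_mention_rate(share_of_responses: float, share_of_mentions: float) -> float:
    """Ratio of the per-response mention rate inside a group to the rate outside it."""
    in_group = share_of_mentions / share_of_responses
    out_group = (1 - share_of_mentions) / (1 - share_of_responses)
    return in_group / out_group


nerdy_share = 0.025         # Nerdy's share of all responses (from the article)
nerdy_goblin_share = 0.667  # Nerdy's share of all goblin mentions (from the article)

# How over-represented Nerdy is among goblin mentions relative to its size.
overrep = nerdy_goblin_share / nerdy_share

# How much likelier a Nerdy response is to mention goblins than a non-Nerdy one.
ratio = relative_mention_rate(nerdy_share, nerdy_goblin_share)

print(f"over-representation: {overrep:.1f}x, relative rate: {ratio:.1f}x")
```

Under those assumptions, Nerdy is over-represented among goblin mentions by a factor of roughly 27, and a Nerdy response is roughly 78 times as likely to contain one as a response from any other mode.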

HOW RLHF LEAKS BEHAVIOUR BEYOND ITS INTENDED SCOPE

The mechanism that spread goblins from Nerdy to general responses is worth examining in detail, because it is not a fringe failure mode: it follows from how large language model training pipelines reuse their own outputs. When outputs generated in the Nerdy condition were sufficiently rewarded, some of them ended up in the training data that shaped the general-purpose model. Supervised fine-tuning on high-reward outputs doesn't ask whether those outputs were generated under a specific condition; it just learns from them.
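A minimal sketch of that condition-blind selection, next to a condition-aware alternative, might look like the following. This is not OpenAI's pipeline: the `Sample` record, the `mode` label, the reward values, and both function names are hypothetical and exist only to show where the condition information gets lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    prompt: str
    response: str
    reward: float
    mode: str  # hypothetical label for the condition that produced the sample


def select_condition_blind(samples, min_reward):
    """Keep every high-reward sample; the mode label is dropped on the way out."""
    return [(s.prompt, s.response) for s in samples if s.reward >= min_reward]


def select_condition_aware(samples, min_reward, general_modes=frozenset({"default"})):
    """Keep high-reward samples only if they came from a general-purpose mode."""
    return [
        (s.prompt, s.response)
        for s in samples
        if s.reward >= min_reward and s.mode in general_modes
    ]


# Invented samples and rewards, for illustration only.
samples = [
    Sample("Review my loop", "Looks fine; rename `x` to `count`.", 0.71, "default"),
    Sample("Review my loop", "The gremlins will argue over `x`; call it `count`.", 0.93, "nerdy"),
    Sample("Plan a dinner", "Try cumin with lime.", 0.40, "default"),
]

blind = select_condition_blind(samples, min_reward=0.7)
aware = select_condition_aware(samples, min_reward=0.7)
```

Here `blind` contains the gremlin response with nothing to mark it as Nerdy-specific, while `aware` excludes it. The trade-off is that the condition-aware filter also discards whatever was genuinely good about the Nerdy sample.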

The result is a kind of behavioural contamination. A tic that was appropriate in one context gets reinforced broadly because the training pipeline doesn't maintain clean separation between condition-specific and general behaviour. This is a known risk in RLHF. Researchers use "reward hacking" for cases where a model learns to trigger the reward signal through unintended means; the goblin case adds a second step, in which the rewarded pattern generalises to contexts where it shouldn't apply. The goblins are a cartoon example of a non-cartoon problem.

What makes this case particularly instructive is the feedback loop. The effect of the Nerdy rewards was not confined to Nerdy interactions. Some goblin-rich Nerdy outputs were reused as training examples for fine-tuning runs that weren't Nerdy-specific. Each time a creature metaphor appeared in a high-reward response and that response was recycled into training data, the signal became slightly stronger. By GPT‑5.4, the goblin behaviour was entrenched enough that simply turning off the Nerdy personality, which OpenAI did in March 2026, was not sufficient to stop the references appearing in standard mode.
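The compounding effect of that loop can be illustrated with a toy model. Everything here is an assumption made for illustration: `p` is the fraction of training examples containing a creature metaphor, the model is assumed to reproduce that fraction in its outputs, `boost` is how much likelier a creature output is to clear the reward filter, and `mix` is the share of the next training set made of recycled outputs. None of the numbers come from OpenAI, and a round does not correspond to any particular model version.

```python
def recycle_step(p: float, boost: float, mix: float) -> float:
    """One round of recycling high-reward outputs back into the training data."""
    # Creature outputs are over-selected by `boost` relative to other outputs.
    selected = boost * p / (boost * p + (1 - p))
    # The next training set blends the old data with the recycled selection.
    return (1 - mix) * p + mix * selected


def simulate(p0: float, boost: float, mix: float, rounds: int) -> list[float]:
    """Track the creature fraction across repeated recycling rounds."""
    history = [p0]
    for _ in range(rounds):
        history.append(recycle_step(history[-1], boost, mix))
    return history


# Arbitrary illustrative parameters.
history = simulate(p0=0.001, boost=3.0, mix=0.2, rounds=4)
for round_index, fraction in enumerate(history):
    print(f"round {round_index}: {fraction:.4%}")
```

In this toy model any `boost` above 1 makes the fraction grow every round, and a `boost` of exactly 1 leaves it unchanged. That mirrors the point above: the reward preference itself can be small, because the recycling is what compounds it.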

THE NUMBERS THAT MADE IT A CRISIS

The 3,881% figure is striking, but the 66.7% concentration figure tells a more precise story about the mechanism. If goblins had spread evenly across all ChatGPT responses, it would suggest a baseline shift in how the model talks. Instead, the mentions were concentrated in Nerdy responses, even though Nerdy was a tiny fraction of total usage, and that pointed directly at the training source. It also explained why the problem was hard to catch earlier: in aggregate, the creature rate looked manageable. Only when you segmented by personality mode did the signal become obvious.
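A short sketch of that segmentation, run on a synthetic log, shows how an aggregate rate can hide a concentrated one. The counts below are invented: they were chosen so that Nerdy makes up 2.5% of responses and two thirds of mentions, matching the reported proportions, but the absolute rates are not real measurements.

```python
from collections import Counter


def mention_rate_by_mode(log):
    """log: iterable of (mode, has_creature_mention) pairs."""
    totals, hits = Counter(), Counter()
    for mode, has_mention in log:
        totals[mode] += 1
        hits[mode] += int(has_mention)
    return {mode: hits[mode] / totals[mode] for mode in totals}


# Synthetic log of 1,000 responses: 25 Nerdy, 975 default.
log = (
    [("nerdy", True)] * 10
    + [("nerdy", False)] * 15
    + [("default", True)] * 5
    + [("default", False)] * 970
)

overall = sum(flag for _, flag in log) / len(log)
by_mode = mention_rate_by_mode(log)

print(f"overall: {overall:.1%}")
for mode, rate in by_mode.items():
    print(f"{mode}: {rate:.1%}")
```

On this synthetic data the overall rate is 1.5%, which looks unremarkable, while the Nerdy rate is 40% and the default rate is about 0.5%.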

OpenAI's post-mortem notes that internal evaluations were not designed to flag this kind of stylistic contamination. The evals measured helpfulness, accuracy, and safety. A response that correctly explained how to file a tax extension while mentioning that "the goblins of capital gains tax" were lurking would pass all of those checks. The evaluation infrastructure was not looking for unsolicited fantasy creature references because that was not a known failure category. It is now.

This is a recurring pattern in AI safety and alignment work: the failures that actually occur in production are often not the ones that evaluations were designed to catch. The evals catch the things you thought to check for. The production failures are the things you didn't. Adding "unexpected creature mention rate" to your eval suite is straightforward in retrospect; anticipating it before the first goblin appeared in a tax chatbot is the genuinely hard problem.
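For a sense of how small the retrospective check is, here is a minimal sketch of an "unsolicited creature mention rate" metric. It assumes a keyword match is good enough and treats a mention as unsolicited when the prompt contains no creature term; the word list is taken from the creatures named earlier, and the function names and sample pairs are hypothetical.

```python
import re

# Whole-word match, so "imp" does not fire on "important" or "improve".
CREATURE_RE = re.compile(r"\b(goblin|gremlin|troll|imp)s?\b", re.IGNORECASE)


def is_unsolicited(prompt: str, response: str) -> bool:
    """True when the response mentions a creature and the prompt did not."""
    return bool(CREATURE_RE.search(response)) and not CREATURE_RE.search(prompt)


def unsolicited_creature_rate(pairs) -> float:
    """Fraction of (prompt, response) pairs with an unsolicited creature mention."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_unsolicited(p, r) for p, r in pairs) / len(pairs)


# Invented examples: one unsolicited mention, one solicited, one clean.
pairs = [
    ("How do I file a tax extension?",
     "Request the extension before the goblins of capital gains tax catch up."),
    ("Write a story about a goblin.", "The goblin counted coins by candlelight."),
    ("Improve my spice mix.", "Add smoked paprika."),
]

rate = unsolicited_creature_rate(pairs)
```

Only the first pair counts, so `rate` is one third. A keyword list like this would also flag legitimate uses such as "internet troll", which is one reason a real eval would likely need a classifier or human review on top of it.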

THE FIX, AND WHAT IT COST

OpenAI's remediation had three components. First, the Nerdy personality was retired with GPT‑5.4 — partly to contain the goblins, partly because the training dynamics it had created were considered too unpredictable to maintain. Second, the training data pipeline was audited to identify and remove the creature-affine preference examples that had propagated into fine-tuning runs, and the reward model that had favoured creature metaphors was retrained without that signal. Third, GPT‑5.5 shipped with an explicit system-level instruction to avoid unsolicited references to fantasy creatures — a direct override that addresses the symptom while the deeper training fix addresses the cause.

The explicit override is worth noting for what it represents: a case where the safest short-term fix was to hard-code a behavioural constraint rather than trust that the training changes had fully removed the tendency. OpenAI acknowledged this in the post-mortem. The creature-preference signal was distributed enough through the model weights that they could not be confident the training fix alone was sufficient. The instruction acts as a belt-and-braces backstop. For users who want goblins (there are communities on Reddit that were genuinely disappointed), a prompt override is apparently available.

The broader lesson is not that personality customisation features are dangerous, or that RLHF is fundamentally broken. It's that the coupling between condition-specific training and general model behaviour is tighter and less predictable than intuition suggests. Rewarding a behaviour in one context can strengthen it everywhere. The goblin case is benign — a charming bug that became an embarrassing meme. The same mechanism, applied to a reward signal with higher stakes, would be considerably less funny. Understanding how it works is not optional for teams building on top of these models.