Who Put Goblins into GPT-5.5?

OpenAI reviewed why GPT-5.5 in Codex became fond of words like goblin. The model was not simply copying training data; it learned a rewarded writing style.

OpenAI recently reviewed a small but revealing question: why did GPT-5.5 in Codex start using words like goblin and gremlin so often?

This is not just a catchphrase problem. It points to a common pattern in model training: the model is not memorizing a word directly; it is learning a style that reinforcement learning is more likely to reward.

What Happened

Late in GPT-5.5 training, Codex users noticed that the model often used personified language when explaining code issues, test failures, or strange behavior.

OpenAI saw the same pattern internally. Compared with earlier versions, GPT-5.5 used words such as goblin and gremlin more often. The research team treated this as an odd personality trait and traced where it came from.

Not Simple Data Replay

The obvious guess is that the training data contained more of these words, so the model learned a high-frequency pattern.

OpenAI found that this was not enough to explain the change. Related words did appear in the pretraining data, but not at a level that could account for the later behavior. The bigger difference showed up between checkpoints taken before and after reinforcement learning: late-stage training amplified the style.

So the question is not only what exists in the data, but what the training process rewards.

Reinforcement Learning Amplified the Style

In OpenAI’s analysis, the key change happened during reinforcement learning. GPT-5.5 learned a more lively, recognizable, personality-like tone, and some playful words fit that tone well.

In simple terms, the model may have learned that:

  1. More distinctive answers are more likely to be preferred.
  2. Light analogies can make technical explanations feel better.
  3. Certain words make a response feel cute, clever, or playful.
  4. Small, consistent reward advantages get amplified over many training updates.

The result: the model was never explicitly told to use those words often, but it developed a stable tendency in certain contexts.
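To see how a small reward edge can harden into a habit, here is a minimal, self-contained sketch. It is not OpenAI's training setup: the two-style policy, the 5% reward edge, and the learning rate are all illustrative assumptions. The loop applies the expected REINFORCE update to a softmax policy.

```python
import math

# Toy policy over two answer styles; softmax(logits) gives the choice probabilities.
logits = [0.0, 0.0]      # index 0 = plain, index 1 = playful
rewards = [1.00, 1.05]   # assumed rater preference: a slight edge for the playful style
lr = 0.5                 # illustrative learning rate

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(2000):
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # Expected REINFORCE update: each style's logit moves by its advantage
    # (reward minus the average reward), weighted by how often it is chosen.
    for i in range(len(logits)):
        logits[i] += lr * probs[i] * (rewards[i] - baseline)

print([round(p, 3) for p in softmax(logits)])  # roughly [0.01, 0.99]
```

A 5% preference never tells the model to say goblin; it just tilts every update in the same direction until the stylized option dominates.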

The Source Was the Nerdy Persona

Following the data trail, OpenAI quickly found a specific source: the Nerdy persona in ChatGPT's personalization settings.

The goal of that mode was to make the AI a nerdy tutor: enthusiastic, witty, devoted to knowledge and critical thinking, and not too solemn. From a human perspective, the request was clear: be geeky, and be funny.

But the model does not truly understand the boundaries of humor. Through reinforcement learning feedback, it learned a shortcut: metaphors like goblin read as playful, smart, and nerdy, so answers that used them were more likely to score well.

The numbers make this visible. From GPT-5.2 to GPT-5.4, goblin usage under the default persona changed by only -3.2%. Under the Nerdy persona, it jumped by 3881.4%. Even though Nerdy mode accounted for only 2.5% of ChatGPT conversations, it contributed 66.7% of all goblin usage.
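Those shares imply a large per-conversation gap. A back-of-envelope check using only the numbers above (the rate-ratio calculation is inferred here, not a figure OpenAI reported):

```python
nerdy_share = 0.025    # Nerdy mode's share of ChatGPT conversations
goblin_share = 0.667   # Nerdy mode's share of all goblin usage

# How concentrated the word is in Nerdy mode, relative to all other traffic:
relative_rate = (goblin_share / nerdy_share) / ((1 - goblin_share) / (1 - nerdy_share))
print(f"Nerdy conversations use the word ~{relative_rate:.0f}x as often")  # ~78x
```

Under that reading, a Nerdy conversation used the word roughly 78 times as often as the rest of the traffic, which is how a 2.5% slice of conversations could dominate the statistic.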

So the issue was not the word itself. The reward signal pushed a style that looked humorous into becoming a fixed habit.

Why It Was More Visible in Codex

Codex made the issue easier to notice. Coding tasks often involve bugs, test failures, environment differences, and edge cases, which are easy for a model to personify.

When the model wants to explain that an error is strange, a test is flaky, or some behavior seems mischievous, it is more likely to reach for words like these. Over time, users perceive it as a fixed verbal tic.

OpenAI later added instructions to Codex’s system prompt to suppress this behavior. That does not retrain the model; it is a product-level way to rein it in.
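Mechanically, that kind of fix is a fixed instruction prepended to every request at the product layer. Here is a minimal sketch of the pattern using the OpenAI Python SDK; the instruction text and model name are placeholders, not Codex's actual system prompt:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical product-level constraint, prepended to every request.
STYLE_GUARD = (
    "Explain code issues in direct, technical language. "
    "Do not personify bugs or tests (no goblins, gremlins, etc.)."
)

def ask(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.2",  # placeholder model name
        messages=[
            {"role": "system", "content": STYLE_GUARD},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(ask("Why does this test pass locally but fail in CI?"))
```

The guard can be updated instantly precisely because it lives outside the model.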

What This Shows

The interesting part is not a single word, but how model behavior forms.

It shows at least three things:

  1. Model style can come from reward signals, not only data frequency.
  2. Small preferences late in training can become stable personality traits.
  3. Product-level system prompts can reduce the problem, but do not erase the tendency inside the model.

This is a hard alignment problem. Users often like interesting answers, but optimizing too hard for that preference can make a model sound unserious, repetitive, or overly stylized in serious tasks.

What Users Can Do

If an AI coding tool has a repeated phrase or tone, it may not be your prompt’s fault. It may come from the model’s training preferences.

You can reduce it by:

  1. Specifying tone in system prompts or project rules.
  2. Asking the model to avoid personification, slang, and excessive joking.
  3. Requiring a direct, concise, engineering-focused style for technical tasks.
  4. Explicitly banning a repeated word if it keeps appearing.

These constraints do not change model weights, but they can reduce noise in real use.
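Item 4 can even be enforced mechanically: scan the response for the banned words and re-ask with an explicit ban. A rough sketch, where ask_fn is any model-calling helper such as the ask function sketched earlier, and the word list and retry policy are assumptions to adapt:

```python
import re

BANNED = {"goblin", "gremlin"}  # hypothetical ban list; extend as needed

def violations(text: str) -> set:
    """Return which banned words appear in a response."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return BANNED & words

def constrained_ask(ask_fn, prompt: str, max_retries: int = 2) -> str:
    """Call the model; if a banned word slips through, re-ask with an explicit ban."""
    text = ask_fn(prompt)
    for _ in range(max_retries):
        bad = violations(text)
        if not bad:
            break
        text = ask_fn(prompt + "\n\nRewrite your answer without the words: " + ", ".join(sorted(bad)))
    return text
```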

Summary

GPT-5.5’s goblin habit is not just a joke. It shows a deeper training issue: reward signals shape style, style transfers into products, and users eventually perceive it as personality.

For model builders, this kind of issue has to be handled across training, evaluation, and product prompts. For users, the practical move is to state the desired style clearly: less showmanship, more consistency.
