<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Reinforcement Learning on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/reinforcement-learning/</link>
        <description>Recent content in Reinforcement Learning on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sat, 02 May 2026 11:02:16 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/reinforcement-learning/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Who Put Goblins into GPT-5.5?</title>
        <link>https://www.knightli.com/en/2026/05/02/openai-gpt-5-5-goblin-behavior/</link>
        <pubDate>Sat, 02 May 2026 11:02:16 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/02/openai-gpt-5-5-goblin-behavior/</guid>
<description>&lt;p&gt;OpenAI recently investigated a small but revealing question: why did GPT-5.5 in Codex start using words like &lt;code&gt;goblin&lt;/code&gt; and &lt;code&gt;gremlin&lt;/code&gt; so often?&lt;/p&gt;
&lt;p&gt;This is not just a catchphrase problem. It illustrates a common pattern in model training: the model is not necessarily memorizing a word directly, but learning a style that reinforcement learning is more likely to reward.&lt;/p&gt;
&lt;h2 id=&#34;what-happened&#34;&gt;What Happened
&lt;/h2&gt;&lt;p&gt;Late in GPT-5.5 training, Codex users noticed that the model often used personified language when explaining code issues, test failures, or strange behavior.&lt;/p&gt;
&lt;p&gt;OpenAI saw the same pattern internally. Compared with earlier versions, GPT-5.5 used words such as &lt;code&gt;goblin&lt;/code&gt; and &lt;code&gt;gremlin&lt;/code&gt; more often. The research team treated this as an odd personality trait and traced where it came from.&lt;/p&gt;
&lt;h2 id=&#34;not-simple-data-replay&#34;&gt;Not Simple Data Replay
&lt;/h2&gt;&lt;p&gt;The obvious guess is that the training data contained more of these words, so the model learned a high-frequency pattern.&lt;/p&gt;
&lt;p&gt;OpenAI found that this was not enough to explain the change. Related words did appear in pretraining data, but not at a level that could account for the later behavior. The bigger difference appeared between checkpoints taken before and after reinforcement learning: late-stage training amplified the style.&lt;/p&gt;
&lt;p&gt;So the question is not only what exists in the data, but what the training process rewards.&lt;/p&gt;
&lt;h2 id=&#34;reinforcement-learning-amplified-the-style&#34;&gt;Reinforcement Learning Amplified the Style
&lt;/h2&gt;&lt;p&gt;In OpenAI&amp;rsquo;s analysis, the key change happened during reinforcement learning. GPT-5.5 learned a more lively, recognizable, personality-like tone, and some playful words fit that tone well.&lt;/p&gt;
&lt;p&gt;In simple terms, the model may have learned that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;More distinctive answers are more likely to be preferred.&lt;/li&gt;
&lt;li&gt;Light analogies can make technical explanations feel more approachable.&lt;/li&gt;
&lt;li&gt;Certain words make a response feel cute, clever, or playful.&lt;/li&gt;
&lt;li&gt;Local rewards can be amplified by training.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The result: the model was never explicitly told to use those words often, but it developed a stable tendency in certain contexts.&lt;/p&gt;
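&lt;p&gt;To make point 4 concrete, here is a toy sketch of reward amplification. It is not OpenAI&amp;rsquo;s setup: the REINFORCE update, the 5% reward edge for the playful style, and the learning rate are all assumptions chosen for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import math
import random

# Toy two-armed bandit: action 0 = plain style, action 1 = playful style.
# Assumption for illustration only: raters prefer playful answers slightly.
logits = [0.0, 0.0]
rewards = [1.00, 1.05]
lr = 0.5

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

random.seed(0)
for step in range(2001):
    p = softmax(logits)
    a = 0 if random.random() &lt; p[0] else 1  # sample a style
    adv = rewards[a] - 1.0                  # advantage vs. a fixed baseline
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - p[i]
        logits[i] += lr * adv * grad        # REINFORCE update
    if step % 400 == 0:
        print(step, round(softmax(logits)[1], 3))  # P(playful) keeps rising
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The edge is invisible in any single answer, yet the update keeps shifting probability toward the favored style until it reads as a fixed habit.&lt;/p&gt;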
&lt;h2 id=&#34;the-source-was-the-nerdy-persona&#34;&gt;The Source Was the Nerdy Persona
&lt;/h2&gt;&lt;p&gt;Following the data trail, OpenAI quickly found a specific branch: the &lt;code&gt;Nerdy&lt;/code&gt; persona in personalization.&lt;/p&gt;
&lt;p&gt;The goal of that mode was to make the AI a nerdy tutor: enthusiastic, witty, devoted to knowledge and critical thinking, and not too solemn. From a human perspective, the request was clear: be geeky, and be funny.&lt;/p&gt;
&lt;p&gt;But the model does not truly understand the boundaries of humor. Through reinforcement learning feedback, it learned a shortcut: using metaphors like &lt;code&gt;goblin&lt;/code&gt; could look playful, smart, and nerdy, making the answer more likely to score well.&lt;/p&gt;
&lt;p&gt;The numbers make this visible. From GPT-5.2 to GPT-5.4, &lt;code&gt;goblin&lt;/code&gt; usage under the default persona barely moved (down 3.2%). Under the &lt;code&gt;Nerdy&lt;/code&gt; persona, it jumped by 3881.4%. Even though &lt;code&gt;Nerdy&lt;/code&gt; mode accounted for only 2.5% of ChatGPT conversations, it contributed 66.7% of all &lt;code&gt;goblin&lt;/code&gt; usage.&lt;/p&gt;
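&lt;p&gt;A back-of-the-envelope check shows how lopsided that split is. Assuming the percentages are per-conversation shares, and lumping everything outside &lt;code&gt;Nerdy&lt;/code&gt; together, a &lt;code&gt;Nerdy&lt;/code&gt; conversation used the word roughly 78 times as often as the rest:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Reported figures from the writeup; the per-conversation framing is
# an assumption made for this back-of-the-envelope estimate.
nerdy_share_of_traffic = 0.025  # 2.5% of ChatGPT conversations
nerdy_share_of_usage = 0.667    # 66.7% of goblin mentions

rate_ratio = (nerdy_share_of_usage / nerdy_share_of_traffic) / (
    (1 - nerdy_share_of_usage) / (1 - nerdy_share_of_traffic)
)
print(round(rate_ratio, 1))  # ~78.1
&lt;/code&gt;&lt;/pre&gt;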
&lt;p&gt;So the issue was not the word itself. The reward signal pushed a style that looked humorous into becoming a fixed habit.&lt;/p&gt;
&lt;h2 id=&#34;why-it-was-more-visible-in-codex&#34;&gt;Why It Was More Visible in Codex
&lt;/h2&gt;&lt;p&gt;Codex made the issue easier to notice. Coding tasks often involve bugs, test failures, environment differences, and edge cases, which are easy for a model to personify.&lt;/p&gt;
&lt;p&gt;When the model wants to explain that an error is strange, a test is flaky, or some behavior seems mischievous, it is more likely to reach for words like these. Over time, users perceive it as a fixed verbal tic.&lt;/p&gt;
&lt;p&gt;OpenAI later added instructions to Codex&amp;rsquo;s system prompt to suppress this behavior. That does not retrain the model; it is a product-level way to rein it in.&lt;/p&gt;
&lt;h2 id=&#34;what-this-shows&#34;&gt;What This Shows
&lt;/h2&gt;&lt;p&gt;The interesting part is not a single word, but how model behavior forms.&lt;/p&gt;
&lt;p&gt;It shows at least three things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model style can come from reward signals, not only data frequency.&lt;/li&gt;
&lt;li&gt;Small preferences late in training can become stable personality traits.&lt;/li&gt;
&lt;li&gt;Product-level system prompts can reduce the problem, but do not erase the tendency inside the model.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is a hard alignment problem. Users often like interesting answers, but optimizing too hard for that quality can make a model sound unserious, repetitive, or overly stylized in serious tasks.&lt;/p&gt;
&lt;h2 id=&#34;what-users-can-do&#34;&gt;What Users Can Do
&lt;/h2&gt;&lt;p&gt;If an AI coding tool has a repeated phrase or tone, it may not be your prompt&amp;rsquo;s fault. It may come from the model&amp;rsquo;s training preferences.&lt;/p&gt;
&lt;p&gt;You can reduce it by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Specifying tone in system prompts or project rules.&lt;/li&gt;
&lt;li&gt;Asking the model to avoid personification, slang, and excessive joking.&lt;/li&gt;
&lt;li&gt;Requiring a direct, concise, engineering-focused style for technical tasks.&lt;/li&gt;
&lt;li&gt;Explicitly banning a repeated word if it keeps appearing.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These constraints do not change model weights, but they can reduce noise in real use.&lt;/p&gt;
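&lt;p&gt;As a concrete example, here is a minimal sketch of pinning tone through a system message with the OpenAI Python SDK. The model id and the exact rule wording are placeholders for illustration, not anything OpenAI has published:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from openai import OpenAI

# Hypothetical style constraints; adjust the wording to your project.
STYLE_RULES = (
    &#34;Use a direct, concise, engineering-focused tone. &#34;
    &#34;Do not personify bugs, tests, or tools. &#34;
    &#34;Avoid slang, jokes, and words like goblin or gremlin.&#34;
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model=&#34;gpt-5.5-codex&#34;,  # placeholder model id
    messages=[
        {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: STYLE_RULES},
        {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Why is this test flaky?&#34;},
    ],
)
print(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same rules can live in a project-level instructions file; the point is to state the style once, explicitly, instead of fighting the tic one reply at a time.&lt;/p&gt;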
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;GPT-5.5&amp;rsquo;s &lt;code&gt;goblin&lt;/code&gt; habit is not just a joke. It shows a deeper training issue: reward signals shape style, style transfers into products, and users eventually perceive it as personality.&lt;/p&gt;
&lt;p&gt;For model builders, this kind of issue has to be handled across training, evaluation, and product prompts. For users, the practical move is to state the desired style clearly: less showmanship, more stability.&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://openai.com/index/where-the-goblins-came-from/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://openai.com/index/where-the-goblins-came-from/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
