<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Cost Analysis on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/cost-analysis/</link>
        <description>Recent content in Cost Analysis on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sat, 25 Apr 2026 08:44:32 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/cost-analysis/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Why LLM APIs Charge by Tokens: A Clear Guide to Input, Output, and Context Costs</title>
        <link>https://www.knightli.com/en/2026/04/25/llm-token-pricing-principles/</link>
        <pubDate>Sat, 25 Apr 2026 08:44:32 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/25/llm-token-pricing-principles/</guid>
<description>&lt;p&gt;One of the most confusing things about LLM API billing is that almost every platform eventually settles on one unit: &lt;code&gt;token&lt;/code&gt;. The real question is simple: &lt;strong&gt;why do LLMs charge by token, and why can different tokens have different prices?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For many people who are just starting to use model APIs, the most confusing part is not model capability but the bill. Why does the cost rise so quickly even when you only ask a few questions? Why is input cheaper than output? Why does the bill start growing much faster once context becomes long?&lt;/p&gt;
&lt;p&gt;A simple way to think about it is this: &lt;strong&gt;you are not paying for &amp;ldquo;one answer.&amp;rdquo; You are paying for the compute and bandwidth consumed throughout the whole inference process.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-what-is-a-token&#34;&gt;1. What is a token
&lt;/h2&gt;&lt;p&gt;In LLM billing, a &lt;code&gt;token&lt;/code&gt; is neither a character count nor a word count. It is the unit a model uses when processing text.&lt;/p&gt;
&lt;p&gt;A token might be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A single Chinese character&lt;/li&gt;
&lt;li&gt;Part of an English word&lt;/li&gt;
&lt;li&gt;A punctuation mark&lt;/li&gt;
&lt;li&gt;A short chunk of frequently seen text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why API platforms usually do not charge per sentence or per request. They charge according to how many tokens the model actually reads and generates.&lt;br&gt;
This is much more reasonable than charging by request count, because one request might contain a 20-token question, while another might include 200,000 tokens of context. The resource consumption is nowhere near the same.&lt;/p&gt;
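&lt;p&gt;Token counts are tokenizer-specific, but a common rule of thumb for English text is roughly 4 characters per token. A minimal sketch of that heuristic (the 4:1 ratio is an assumption, not an exact rule; real BPE tokenizers differ, especially for code and non-English text):&lt;/p&gt;

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: assumes about 4 characters per token
    for English text. Real tokenizers will give different counts."""
    return max(1, round(len(text) / chars_per_token))

short_request = "What is a token?"
long_request = "x" * 800_000  # e.g. a huge pasted context

print(estimate_tokens(short_request))  # 4
print(estimate_tokens(long_request))   # 200000
```

&lt;p&gt;The point is simply that the two requests above would be identical under per-request billing, yet differ by five orders of magnitude in the work the model does.&lt;/p&gt;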
&lt;h2 id=&#34;2-why-input-and-output-are-priced-separately&#34;&gt;2. Why input and output are priced separately
&lt;/h2&gt;&lt;p&gt;Most model APIs today split pricing into two parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input token price&lt;/li&gt;
&lt;li&gt;Output token price&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And in many cases, &lt;strong&gt;output tokens cost more than input tokens.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The reason is not hard to understand.&lt;/p&gt;
&lt;p&gt;When a model processes input, it is mainly reading and encoding existing content. But when it generates output, it has to predict tokens one at a time, each new token conditioned on everything before it. This is not just reading. It is an ongoing process of inference and sampling, which usually costs more compute per token.&lt;/p&gt;
&lt;p&gt;You can think of it roughly like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input: handing materials to the model&lt;/li&gt;
&lt;li&gt;Output: asking the model to write the answer on the spot&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Writing on the spot usually costs more than reading the materials once, so it is very common for output pricing to be higher.&lt;/p&gt;
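&lt;p&gt;Separate input and output prices reduce to a small per-call formula. A minimal sketch with hypothetical per-million-token rates (the $3 and $15 figures are placeholders, not any vendor&amp;rsquo;s real prices):&lt;/p&gt;

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_m, output_price_per_m):
    """Cost of one API call, with input and output billed separately.
    Prices are in dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical rates: input $3/M tokens, output $15/M tokens
cost = call_cost(input_tokens=8_000, output_tokens=1_000,
                 input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f} per call")  # $0.0390: input $0.0240 + output $0.0150
```

&lt;p&gt;Even at a 5x higher output rate, a context-heavy call like this one still spends most of its budget on input, which is exactly why the next section matters.&lt;/p&gt;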
&lt;h2 id=&#34;3-why-long-context-makes-costs-easier-to-lose-control-of&#34;&gt;3. Why long context makes costs harder to control
&lt;/h2&gt;&lt;p&gt;Many people think they are only adding a bit more background information, but from the model billing perspective, the impact is often much bigger than expected.&lt;/p&gt;
&lt;p&gt;The reason is that &lt;strong&gt;each model call usually has to process the full context included in that request again.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That means if your request currently contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A system prompt&lt;/li&gt;
&lt;li&gt;Conversation history&lt;/li&gt;
&lt;li&gt;Tool return values&lt;/li&gt;
&lt;li&gt;Long document chunks&lt;/li&gt;
&lt;li&gt;Source code files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;all of that goes into input token billing.&lt;/p&gt;
&lt;p&gt;So what really makes bills grow is often not the final question itself, but the long chain of context attached before it.&lt;br&gt;
As the number of conversation turns increases, tool calls accumulate, and prior messages keep getting fed back in, token cost grows round after round.&lt;/p&gt;
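&lt;p&gt;Because the full history is re-sent on every call, billed input grows roughly quadratically with conversation length. A minimal simulation (the 500-token system prompt and 200-token turns are made-up assumptions):&lt;/p&gt;

```python
def billed_input_per_turn(system_tokens, turn_tokens, num_turns):
    """Input tokens billed on each turn when the whole history is
    re-sent: system prompt plus all prior turns plus the new question."""
    billed = []
    history = 0
    for _ in range(num_turns):
        billed.append(system_tokens + history + turn_tokens)
        history += turn_tokens * 2  # assume the reply is about as long as the question
    return billed

# Hypothetical sizes: 500-token system prompt, 200-token questions
per_turn = billed_input_per_turn(500, 200, 10)
print(per_turn[0], per_turn[-1])  # 700 4300 -- same-sized question, 6x the input
print(sum(per_turn))              # 25000 tokens billed, not 10 * 700 = 7000
```

&lt;p&gt;Nothing about the questions changed between turn 1 and turn 10; the growth comes entirely from the accumulated context riding along with each one.&lt;/p&gt;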
&lt;h2 id=&#34;4-why-tool-calls-are-especially-likely-to-inflate-token-usage&#34;&gt;4. Why tool calls are especially likely to inflate token usage
&lt;/h2&gt;&lt;p&gt;In scenarios like agents, coding assistants, and workflow automation, token usage is often much higher than in ordinary chat.&lt;/p&gt;
&lt;p&gt;The issue is not just that the model wrote a paragraph. It is that the workflow keeps producing content like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading files&lt;/li&gt;
&lt;li&gt;Inspecting logs&lt;/li&gt;
&lt;li&gt;Calling APIs&lt;/li&gt;
&lt;li&gt;Returning JSON&lt;/li&gt;
&lt;li&gt;Feeding tool results back into the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As long as the result of each tool call gets inserted into the next round of context, it becomes a new source of input tokens.&lt;/p&gt;
&lt;p&gt;That is why many developers eventually realize:&lt;br&gt;
&lt;strong&gt;the model&amp;rsquo;s unit price is not always the real problem. The workflow itself may be stacking token cost layer by layer.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, imagine a coding agent doing the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the project structure&lt;/li&gt;
&lt;li&gt;Open several source files&lt;/li&gt;
&lt;li&gt;Run a test suite&lt;/li&gt;
&lt;li&gt;Feed the error logs back into the model&lt;/li&gt;
&lt;li&gt;Read more related files&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step can make later requests carry even more context. Even if the unit price does not change, the total bill can rise quickly.&lt;/p&gt;
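&lt;p&gt;The coding-agent loop above can be sketched as context that grows with every tool result fed back in. All of the token sizes below are hypothetical, chosen only to show the compounding:&lt;/p&gt;

```python
# Hypothetical token sizes for each tool result the agent feeds back
steps = [
    ("read project structure", 1_000),
    ("open source files",      6_000),
    ("run test suite",         1_500),
    ("feed error logs back",   3_000),
    ("read more related files", 5_000),
]

context = 2_000          # assumed base: system prompt + task description
total_billed_input = 0
for name, result_tokens in steps:
    context += result_tokens        # tool result is inserted into the prompt
    total_billed_input += context   # the whole context is billed as input again

print(context)             # 18500 tokens now carried on every call
print(total_billed_input)  # 54500 input tokens billed across just 5 rounds
```

&lt;p&gt;Note that the 54,500 billed tokens are several times the 16,500 tokens of tool output actually produced, because every result is re-read on every later round.&lt;/p&gt;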
&lt;h2 id=&#34;5-why-the-same-kind-of-model-can-have-very-different-prices&#34;&gt;5. Why the same kind of model can have very different prices
&lt;/h2&gt;&lt;p&gt;Differences in token pricing between models are not only about vendors wanting to charge more. They are usually tied directly to several factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model size&lt;/li&gt;
&lt;li&gt;Inference efficiency&lt;/li&gt;
&lt;li&gt;Context length&lt;/li&gt;
&lt;li&gt;Deployment cost&lt;/li&gt;
&lt;li&gt;Target market&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The larger the model, the more parameters are active per token and the more complex its inference path, so the cost of generating one token usually rises.&lt;br&gt;
If the model also supports ultra-long context, more complex reasoning, or better tool use, the infrastructure pressure increases even more.&lt;/p&gt;
&lt;p&gt;So pricing is really covering several kinds of cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU or accelerator resources&lt;/li&gt;
&lt;li&gt;VRAM usage&lt;/li&gt;
&lt;li&gt;Inference latency&lt;/li&gt;
&lt;li&gt;Network and service stability&lt;/li&gt;
&lt;li&gt;Peak concurrency capacity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A cheaper model is not necessarily bad, and a more expensive model is not necessarily the right choice for every task. In many cases, the price gap reflects how much infrastructure cost a certain level of capability requires.&lt;/p&gt;
&lt;h2 id=&#34;6-why-cached-input-is-cheaper&#34;&gt;6. Why cached input is cheaper
&lt;/h2&gt;&lt;p&gt;Many model platforms now offer features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cached input&lt;/li&gt;
&lt;li&gt;prompt caching&lt;/li&gt;
&lt;li&gt;prefix caching&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The shared idea behind them is simple: if a large chunk of input has already been processed once, do not keep recomputing it from scratch at full price.&lt;/p&gt;
&lt;p&gt;For example, if you repeatedly send the same system prompt, the same tool instructions, or the same long document prefix, the platform may be able to cache part of that computation. Then even though it is still input token usage, the cached portion can be billed at a lower rate.&lt;/p&gt;
&lt;p&gt;This also explains why many API pricing pages show three or more price tiers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Standard input&lt;/li&gt;
&lt;li&gt;Cached input&lt;/li&gt;
&lt;li&gt;Output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The difference is not that the text means different things. It is that the underlying computation may or may not be reusable.&lt;/p&gt;
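&lt;p&gt;The three tiers combine into one per-call formula. A minimal sketch, assuming a hypothetical 90% discount on cached input (discount rates and prices vary widely by platform):&lt;/p&gt;

```python
def call_cost_with_cache(cached_in, fresh_in, out,
                         in_price_per_m=3.0, out_price_per_m=15.0,
                         cache_discount=0.9):
    """Per-call cost with a cached-input tier. All rates are
    hypothetical, in dollars per million tokens; cached input is
    billed at (1 - discount) times the standard input rate."""
    cached_price = in_price_per_m * (1 - cache_discount)
    return (cached_in * cached_price
            + fresh_in * in_price_per_m
            + out * out_price_per_m) / 1_000_000

# A 6k-token system prompt + tool spec hits the cache; 2k tokens are new
with_cache = call_cost_with_cache(cached_in=6_000, fresh_in=2_000, out=1_000)
no_cache   = call_cost_with_cache(cached_in=0,     fresh_in=8_000, out=1_000)
print(f"${with_cache:.4f} vs ${no_cache:.4f}")  # $0.0228 vs $0.0390
```

&lt;p&gt;The saving scales with how much of the prompt is a stable prefix, which is why long, repeated system prompts and tool definitions benefit the most.&lt;/p&gt;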
&lt;h2 id=&#34;7-why-cheap-tokens-do-not-automatically-mean-lower-total-cost&#34;&gt;7. Why &amp;ldquo;cheap tokens&amp;rdquo; do not automatically mean lower total cost
&lt;/h2&gt;&lt;p&gt;When people see a model advertised as &amp;ldquo;very cheap per million tokens,&amp;rdquo; the first instinct is often that total cost must also be lower. In reality, not always.&lt;/p&gt;
&lt;p&gt;That is because total cost is roughly:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;token unit price × actual token volume&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And actual token volume can be amplified by many things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompts that are too long&lt;/li&gt;
&lt;li&gt;Conversation history that is never trimmed&lt;/li&gt;
&lt;li&gt;Too much tool output fed back in&lt;/li&gt;
&lt;li&gt;Overly verbose model output&lt;/li&gt;
&lt;li&gt;Repeated retries for the same task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the real bill is not determined by price alone. It is usually determined by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model unit price&lt;/li&gt;
&lt;li&gt;Input length per round&lt;/li&gt;
&lt;li&gt;Output length per round&lt;/li&gt;
&lt;li&gt;Number of calls&lt;/li&gt;
&lt;li&gt;Workflow design&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is also why a &amp;ldquo;low-cost model&amp;rdquo; can still end up expensive in some agent workflows. It may need more rounds, more supplemental context, and more retry cycles.&lt;/p&gt;
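&lt;p&gt;That effect can be made concrete by comparing two hypothetical models on the same task. Every number below is invented for illustration: the cheap model needs more rounds and carries more supplemental context per round:&lt;/p&gt;

```python
def task_cost(rounds, in_tokens, out_tokens, in_price, out_price):
    """Total cost of one task: unit price times actual token volume,
    summed over every round. Prices are per million tokens."""
    per_round = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return rounds * per_round

# Hypothetical: the cheap model retries more and needs more context
cheap     = task_cost(rounds=30, in_tokens=12_000, out_tokens=2_000,
                      in_price=0.5, out_price=1.5)
expensive = task_cost(rounds=6,  in_tokens=8_000,  out_tokens=1_000,
                      in_price=3.0, out_price=15.0)
print(f"cheap model:     ${cheap:.3f}")      # $0.270
print(f"expensive model: ${expensive:.3f}")  # $0.234
```

&lt;p&gt;With these made-up numbers, a 6x cheaper unit price still loses on total cost once the round count and per-round context are accounted for.&lt;/p&gt;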
&lt;h2 id=&#34;8-how-developers-should-estimate-token-cost&#34;&gt;8. How developers should estimate token cost
&lt;/h2&gt;&lt;p&gt;If you want better budget control in a real project, a simple way to estimate cost is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure average input tokens per request&lt;/li&gt;
&lt;li&gt;Measure average output tokens per request&lt;/li&gt;
&lt;li&gt;Estimate how many rounds one complete task requires&lt;/li&gt;
&lt;li&gt;Multiply by the model&amp;rsquo;s pricing&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8k tokens&lt;/code&gt; of input per round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1k tokens&lt;/code&gt; of output per round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;10&lt;/code&gt; rounds for one task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then what you are really consuming is not &amp;ldquo;one Q&amp;amp;A exchange,&amp;rdquo; but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;About &lt;code&gt;80k tokens&lt;/code&gt; of input&lt;/li&gt;
&lt;li&gt;About &lt;code&gt;10k tokens&lt;/code&gt; of output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And if logs, tool results, and file contents keep being added along the way, the total grows even further.&lt;/p&gt;
&lt;p&gt;That is why budget planning should not only look at a single round. It should look at &lt;strong&gt;how many tokens a full task loop will consume from start to finish.&lt;/strong&gt;&lt;/p&gt;
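&lt;p&gt;The four estimation steps above can be sketched directly. The token counts are the example&amp;rsquo;s own assumptions, and the $/M-token rates are placeholders:&lt;/p&gt;

```python
def estimate_task_tokens(in_per_round, out_per_round, rounds):
    """Steps 1-3: total token volume for one complete task loop."""
    return {"input": in_per_round * rounds, "output": out_per_round * rounds}

totals = estimate_task_tokens(in_per_round=8_000, out_per_round=1_000, rounds=10)
print(totals)  # {'input': 80000, 'output': 10000}

# Step 4: multiply by the model's pricing (hypothetical rates per M tokens)
bill = (totals["input"] * 3.0 + totals["output"] * 15.0) / 1_000_000
print(f"${bill:.2f} for the full task")  # $0.39, not the cost of one exchange
```

&lt;p&gt;This is still an underestimate if context grows across rounds, as the previous sections showed, so treat it as a floor rather than a forecast.&lt;/p&gt;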
&lt;h2 id=&#34;9-how-to-control-the-bill-in-practice&#34;&gt;9. How to control the bill in practice
&lt;/h2&gt;&lt;p&gt;If you are already using APIs or agents, the following methods are usually the most effective:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Shorten the system prompt and cut repeated wording&lt;/li&gt;
&lt;li&gt;Trim old conversation history regularly&lt;/li&gt;
&lt;li&gt;Keep only necessary fields from tool outputs&lt;/li&gt;
&lt;li&gt;Retrieve first, then send only relevant parts of long documents&lt;/li&gt;
&lt;li&gt;Limit output length and avoid unbounded expansion&lt;/li&gt;
&lt;li&gt;Use expensive models for high-value tasks and cheaper ones for lower-value tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In many cases, the best way to save money is not to switch blindly to a cheaper model. It is to remove unnecessary token consumption from the workflow first.&lt;/p&gt;
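&lt;p&gt;Of the methods above, trimming old history is the easiest to automate. A minimal sketch that keeps only the newest turns within a token budget (the budget and the &lt;code&gt;(role, text, tokens)&lt;/code&gt; message shape are assumptions for illustration):&lt;/p&gt;

```python
def trim_history(messages, budget_tokens):
    """Keep the most recent messages that fit within budget_tokens;
    oldest turns are dropped first. Each message is an assumed
    (role, text, tokens) tuple."""
    kept = []
    used = 0
    for msg in reversed(messages):   # walk newest to oldest
        role, text, tokens = msg
        if used + tokens > budget_tokens:
            break                    # this turn and everything older is dropped
        kept.append(msg)
        used += tokens
    kept.reverse()                   # restore chronological order
    return kept

history = [
    ("user", "first question", 900),
    ("assistant", "long first answer", 2_500),
    ("user", "follow-up", 300),
    ("assistant", "second answer", 1_200),
    ("user", "current question", 400),
]
trimmed = trim_history(history, budget_tokens=2_000)
print([m[0] for m in trimmed])  # ['user', 'assistant', 'user'] -- newest 1900 tokens
```

&lt;p&gt;Production systems often summarize the dropped turns instead of discarding them outright, trading a small summarization cost for a much smaller prompt on every later round.&lt;/p&gt;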
&lt;h2 id=&#34;10-how-to-think-about-all-of-this&#34;&gt;10. How to think about all of this
&lt;/h2&gt;&lt;p&gt;At the end of the day, token pricing is a way of charging for how much the model had to read, infer, and write.&lt;/p&gt;
&lt;p&gt;It is not like traditional software pricing, where per-account, per-request, or monthly billing is enough to describe resource use. A model call is a dynamic computation process. The amount of context you send, the tools you invoke, and the output length you request all directly affect cost.&lt;/p&gt;
&lt;p&gt;So the most important thing is not memorizing price tables. It is building the right intuition:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long context increases input cost&lt;/li&gt;
&lt;li&gt;Long output increases generation cost&lt;/li&gt;
&lt;li&gt;Tool chains amplify total token usage&lt;/li&gt;
&lt;li&gt;Caching and workflow design can change the bill significantly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once those points are clear, the pricing structure of most LLM APIs becomes much easier to understand.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
