<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Cost Analysis on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/cost-analysis/</link>
        <description>Recent content in Cost Analysis on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sat, 25 Apr 2026 08:44:32 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/cost-analysis/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Why LLM APIs Charge by Tokens: A Clear Guide to Input, Output, and Context Costs</title>
        <link>https://www.knightli.com/en/2026/04/25/llm-token-pricing-principles/</link>
        <pubDate>Sat, 25 Apr 2026 08:44:32 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/25/llm-token-pricing-principles/</guid>
<description>&lt;p&gt;One of the most confusing things about LLM API billing is that almost every platform eventually settles on one unit: &lt;code&gt;token&lt;/code&gt;. The real question is simple: &lt;strong&gt;why do LLMs charge by token, and why can different tokens have different prices?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For many people who are just starting to use model APIs, the most confusing part is not model capability but the bill. Why does the cost rise so quickly even when you only ask a few questions? Why is input cheaper than output? Why does the bill start growing much faster once context becomes long?&lt;/p&gt;
&lt;p&gt;A simple way to think about it is this: &lt;strong&gt;you are not paying for &amp;ldquo;one answer.&amp;rdquo; You are paying for the compute and bandwidth consumed throughout the whole inference process.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-what-is-a-token&#34;&gt;1. What is a token
&lt;/h2&gt;&lt;p&gt;In LLM billing, a &lt;code&gt;token&lt;/code&gt; is neither a character count nor a word count. It is the unit a model uses when processing text.&lt;/p&gt;
&lt;p&gt;A token might be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A single Chinese character&lt;/li&gt;
&lt;li&gt;Part of an English word&lt;/li&gt;
&lt;li&gt;A punctuation mark&lt;/li&gt;
&lt;li&gt;A short chunk of frequently seen text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why API platforms usually do not charge per sentence or per request. They charge according to how many tokens the model actually reads and generates.&lt;br&gt;
This is much more reasonable than charging by request count, because one request might contain a 20-token question, while another might include 200,000 tokens of context. The resource consumption is nowhere near the same.&lt;/p&gt;
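&lt;p&gt;Token counts are tokenizer-specific, but a common rule of thumb for English text is roughly 4 characters per token. A minimal sketch of that heuristic (the 4:1 ratio is an assumption, not an exact rule; real BPE tokenizers differ, especially for code and non-English text):&lt;/p&gt;

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: assumes about 4 characters per token
    for English text. Real tokenizers will give different counts."""
    return max(1, round(len(text) / chars_per_token))

short_request = "What is a token?"
long_request = "x" * 800_000  # e.g. a huge pasted context

print(estimate_tokens(short_request))  # 4
print(estimate_tokens(long_request))   # 200000
```

&lt;p&gt;The point is simply that the two requests above would be identical under per-request billing, yet differ by five orders of magnitude in the work the model does.&lt;/p&gt;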
&lt;h2 id=&#34;2-why-input-and-output-are-priced-separately&#34;&gt;2. Why input and output are priced separately
&lt;/h2&gt;&lt;p&gt;Most model APIs today split pricing into two parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input token price&lt;/li&gt;
&lt;li&gt;Output token price&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And in many cases, &lt;strong&gt;output tokens cost more than input tokens.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The reason is not hard to understand.&lt;/p&gt;
&lt;p&gt;When a model processes input, it is mainly reading and encoding existing content. But when it generates output, it has to predict tokens one at a time, each new token conditioned on everything before it. This is not just reading. It is an ongoing process of inference and sampling, which usually costs more compute per token.&lt;/p&gt;
&lt;p&gt;You can think of it roughly like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input: handing materials to the model&lt;/li&gt;
&lt;li&gt;Output: asking the model to write the answer on the spot&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Writing on the spot usually costs more than reading the materials once, so it is very common for output pricing to be higher.&lt;/p&gt;
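&lt;p&gt;Separate input and output prices reduce to a small per-call formula. A minimal sketch with hypothetical per-million-token rates (the $3 and $15 figures are placeholders, not any vendor&amp;rsquo;s real prices):&lt;/p&gt;

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_m, output_price_per_m):
    """Cost of one API call, with input and output billed separately.
    Prices are in dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical rates: input $3/M tokens, output $15/M tokens
cost = call_cost(input_tokens=8_000, output_tokens=1_000,
                 input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f} per call")  # $0.0390: input $0.0240 + output $0.0150
```

&lt;p&gt;Even at a 5x higher output rate, a context-heavy call like this one still spends most of its budget on input, which is exactly why the next section matters.&lt;/p&gt;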
&lt;h2 id=&#34;3-why-long-context-makes-costs-easier-to-lose-control-of&#34;&gt;3. Why long context makes costs harder to control
&lt;/h2&gt;&lt;p&gt;Many people think they are only adding a bit more background information, but from the model billing perspective, the impact is often much bigger than expected.&lt;/p&gt;
&lt;p&gt;The reason is that &lt;strong&gt;each model call usually has to process the full context included in that request again.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That means if your request currently contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A system prompt&lt;/li&gt;
&lt;li&gt;Conversation history&lt;/li&gt;
&lt;li&gt;Tool return values&lt;/li&gt;
&lt;li&gt;Long document chunks&lt;/li&gt;
&lt;li&gt;Source code files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;all of that goes into input token billing.&lt;/p&gt;
&lt;p&gt;So what really makes bills grow is often not the final question itself, but the long chain of context attached before it.&lt;br&gt;
As the number of conversation turns increases, tool calls accumulate, and prior messages keep getting fed back in, token cost grows round after round.&lt;/p&gt;
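&lt;p&gt;Because the full history is re-sent on every call, billed input grows roughly quadratically with conversation length. A minimal simulation (the 500-token system prompt and 200-token turns are made-up assumptions):&lt;/p&gt;

```python
def billed_input_per_turn(system_tokens, turn_tokens, num_turns):
    """Input tokens billed on each turn when the whole history is
    re-sent: system prompt plus all prior turns plus the new question."""
    billed = []
    history = 0
    for _ in range(num_turns):
        billed.append(system_tokens + history + turn_tokens)
        history += turn_tokens * 2  # assume the reply is about as long as the question
    return billed

# Hypothetical sizes: 500-token system prompt, 200-token questions
per_turn = billed_input_per_turn(500, 200, 10)
print(per_turn[0], per_turn[-1])  # 700 4300 -- same-sized question, 6x the input
print(sum(per_turn))              # 25000 tokens billed, not 10 * 700 = 7000
```

&lt;p&gt;Nothing about the questions changed between turn 1 and turn 10; the growth comes entirely from the accumulated context riding along with each one.&lt;/p&gt;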
&lt;h2 id=&#34;4-why-tool-calls-are-especially-likely-to-inflate-token-usage&#34;&gt;4. Why tool calls are especially likely to inflate token usage
&lt;/h2&gt;&lt;p&gt;In scenarios like agents, coding assistants, and workflow automation, token usage is often much higher than in ordinary chat.&lt;/p&gt;
&lt;p&gt;The issue is not just that the model wrote a paragraph. It is that the workflow keeps producing content like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading files&lt;/li&gt;
&lt;li&gt;Inspecting logs&lt;/li&gt;
&lt;li&gt;Calling APIs&lt;/li&gt;
&lt;li&gt;Returning JSON&lt;/li&gt;
&lt;li&gt;Feeding tool results back into the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As long as the result of each tool call gets inserted into the next round of context, it becomes a new source of input tokens.&lt;/p&gt;
&lt;p&gt;That is why many developers eventually realize:&lt;br&gt;
&lt;strong&gt;the model&amp;rsquo;s unit price is not always the real problem. The workflow itself may be stacking token cost layer by layer.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, imagine a coding agent doing the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the project structure&lt;/li&gt;
&lt;li&gt;Open several source files&lt;/li&gt;
&lt;li&gt;Run a test suite&lt;/li&gt;
&lt;li&gt;Feed the error logs back into the model&lt;/li&gt;
&lt;li&gt;Read more related files&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step can make later requests carry even more context. Even if the unit price does not change, the total bill can rise quickly.&lt;/p&gt;
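&lt;p&gt;The coding-agent loop above can be sketched as context that grows with every tool result fed back in. All of the token sizes below are hypothetical, chosen only to show the compounding:&lt;/p&gt;

```python
# Hypothetical token sizes for each tool result the agent feeds back
steps = [
    ("read project structure", 1_000),
    ("open source files",      6_000),
    ("run test suite",         1_500),
    ("feed error logs back",   3_000),
    ("read more related files", 5_000),
]

context = 2_000          # assumed base: system prompt + task description
total_billed_input = 0
for name, result_tokens in steps:
    context += result_tokens        # tool result is inserted into the prompt
    total_billed_input += context   # the whole context is billed as input again

print(context)             # 18500 tokens now carried on every call
print(total_billed_input)  # 54500 input tokens billed across just 5 rounds
```

&lt;p&gt;Note that the 54,500 billed tokens are several times the 16,500 tokens of tool output actually produced, because every result is re-read on every later round.&lt;/p&gt;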
&lt;h2 id=&#34;5-why-the-same-kind-of-model-can-have-very-different-prices&#34;&gt;5. Why the same kind of model can have very different prices
&lt;/h2&gt;&lt;p&gt;Differences in token pricing between models are not only about vendors wanting to charge more. They are usually tied directly to several factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model size&lt;/li&gt;
&lt;li&gt;Inference efficiency&lt;/li&gt;
&lt;li&gt;Context length&lt;/li&gt;
&lt;li&gt;Deployment cost&lt;/li&gt;
&lt;li&gt;Target market&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The larger the model, the more parameters are active per token and the more complex its inference path, so the cost of generating one token usually rises.&lt;br&gt;
If the model also supports ultra-long context, more complex reasoning, or better tool use, the infrastructure pressure increases even more.&lt;/p&gt;
&lt;p&gt;So pricing is really covering several kinds of cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU or accelerator resources&lt;/li&gt;
&lt;li&gt;VRAM usage&lt;/li&gt;
&lt;li&gt;Inference latency&lt;/li&gt;
&lt;li&gt;Network and service stability&lt;/li&gt;
&lt;li&gt;Peak concurrency capacity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A cheaper model is not necessarily bad, and a more expensive model is not necessarily the right choice for every task. In many cases, the price gap reflects how much infrastructure cost a certain level of capability requires.&lt;/p&gt;
&lt;h2 id=&#34;6-why-cached-input-is-cheaper&#34;&gt;6. Why cached input is cheaper
&lt;/h2&gt;&lt;p&gt;Many model platforms now offer features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cached input&lt;/li&gt;
&lt;li&gt;prompt caching&lt;/li&gt;
&lt;li&gt;prefix caching&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The shared idea behind them is simple: if a large chunk of input has already been processed once, do not keep recomputing it from scratch at full price.&lt;/p&gt;
&lt;p&gt;For example, if you repeatedly send the same system prompt, the same tool instructions, or the same long document prefix, the platform may be able to cache part of that computation. Then even though it is still input token usage, the cached portion can be billed at a lower rate.&lt;/p&gt;
&lt;p&gt;This also explains why many API pricing pages show three or more price tiers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Standard input&lt;/li&gt;
&lt;li&gt;Cached input&lt;/li&gt;
&lt;li&gt;Output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The difference is not that the text means different things. It is that the underlying computation may or may not be reusable.&lt;/p&gt;
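&lt;p&gt;The three tiers combine into one per-call formula. A minimal sketch, assuming a hypothetical 90% discount on cached input (discount rates and prices vary widely by platform):&lt;/p&gt;

```python
def call_cost_with_cache(cached_in, fresh_in, out,
                         in_price_per_m=3.0, out_price_per_m=15.0,
                         cache_discount=0.9):
    """Per-call cost with a cached-input tier. All rates are
    hypothetical, in dollars per million tokens; cached input is
    billed at (1 - discount) times the standard input rate."""
    cached_price = in_price_per_m * (1 - cache_discount)
    return (cached_in * cached_price
            + fresh_in * in_price_per_m
            + out * out_price_per_m) / 1_000_000

# A 6k-token system prompt + tool spec hits the cache; 2k tokens are new
with_cache = call_cost_with_cache(cached_in=6_000, fresh_in=2_000, out=1_000)
no_cache   = call_cost_with_cache(cached_in=0,     fresh_in=8_000, out=1_000)
print(f"${with_cache:.4f} vs ${no_cache:.4f}")  # $0.0228 vs $0.0390
```

&lt;p&gt;The saving scales with how much of the prompt is a stable prefix, which is why long, repeated system prompts and tool definitions benefit the most.&lt;/p&gt;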
&lt;h2 id=&#34;7-why-cheap-tokens-do-not-automatically-mean-lower-total-cost&#34;&gt;7. Why &amp;ldquo;cheap tokens&amp;rdquo; do not automatically mean lower total cost
&lt;/h2&gt;&lt;p&gt;When people see a model advertised as &amp;ldquo;very cheap per million tokens,&amp;rdquo; the first instinct is often that total cost must also be lower. In reality, not always.&lt;/p&gt;
&lt;p&gt;That is because total cost is roughly:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;token unit price × actual token volume&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And actual token volume can be amplified by many things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompts that are too long&lt;/li&gt;
&lt;li&gt;Conversation history that is never trimmed&lt;/li&gt;
&lt;li&gt;Too much tool output fed back in&lt;/li&gt;
&lt;li&gt;Overly verbose model output&lt;/li&gt;
&lt;li&gt;Repeated retries for the same task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the real bill is not determined by price alone. It is usually determined by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model unit price&lt;/li&gt;
&lt;li&gt;Input length per round&lt;/li&gt;
&lt;li&gt;Output length per round&lt;/li&gt;
&lt;li&gt;Number of calls&lt;/li&gt;
&lt;li&gt;Workflow design&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is also why a &amp;ldquo;low-cost model&amp;rdquo; can still end up expensive in some agent workflows. It may need more rounds, more supplemental context, and more retry cycles.&lt;/p&gt;
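&lt;p&gt;That effect can be made concrete by comparing two hypothetical models on the same task. Every number below is invented for illustration: the cheap model needs more rounds and carries more supplemental context per round:&lt;/p&gt;

```python
def task_cost(rounds, in_tokens, out_tokens, in_price, out_price):
    """Total cost of one task: unit price times actual token volume,
    summed over every round. Prices are per million tokens."""
    per_round = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return rounds * per_round

# Hypothetical: the cheap model retries more and needs more context
cheap     = task_cost(rounds=30, in_tokens=12_000, out_tokens=2_000,
                      in_price=0.5, out_price=1.5)
expensive = task_cost(rounds=6,  in_tokens=8_000,  out_tokens=1_000,
                      in_price=3.0, out_price=15.0)
print(f"cheap model:     ${cheap:.3f}")      # $0.270
print(f"expensive model: ${expensive:.3f}")  # $0.234
```

&lt;p&gt;With these made-up numbers, a 6x cheaper unit price still loses on total cost once the round count and per-round context are accounted for.&lt;/p&gt;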
&lt;h2 id=&#34;8-how-developers-should-estimate-token-cost&#34;&gt;8. How developers should estimate token cost
&lt;/h2&gt;&lt;p&gt;If you want better budget control in a real project, a simple way to estimate cost is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure average input tokens per request&lt;/li&gt;
&lt;li&gt;Measure average output tokens per request&lt;/li&gt;
&lt;li&gt;Estimate how many rounds one complete task requires&lt;/li&gt;
&lt;li&gt;Multiply by the model&amp;rsquo;s pricing&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8k tokens&lt;/code&gt; of input per round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1k tokens&lt;/code&gt; of output per round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;10&lt;/code&gt; rounds for one task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then what you are really consuming is not &amp;ldquo;one Q&amp;amp;A exchange,&amp;rdquo; but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;About &lt;code&gt;80k tokens&lt;/code&gt; of input&lt;/li&gt;
&lt;li&gt;About &lt;code&gt;10k tokens&lt;/code&gt; of output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And if logs, tool results, and file contents keep being added along the way, the total grows even further.&lt;/p&gt;
&lt;p&gt;That is why budget planning should not only look at a single round. It should look at &lt;strong&gt;how many tokens a full task loop will consume from start to finish.&lt;/strong&gt;&lt;/p&gt;
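&lt;p&gt;The four estimation steps above can be sketched directly. The token counts are the example&amp;rsquo;s own assumptions, and the $/M-token rates are placeholders:&lt;/p&gt;

```python
def estimate_task_tokens(in_per_round, out_per_round, rounds):
    """Steps 1-3: total token volume for one complete task loop."""
    return {"input": in_per_round * rounds, "output": out_per_round * rounds}

totals = estimate_task_tokens(in_per_round=8_000, out_per_round=1_000, rounds=10)
print(totals)  # {'input': 80000, 'output': 10000}

# Step 4: multiply by the model's pricing (hypothetical rates per M tokens)
bill = (totals["input"] * 3.0 + totals["output"] * 15.0) / 1_000_000
print(f"${bill:.2f} for the full task")  # $0.39, not the cost of one exchange
```

&lt;p&gt;This is still an underestimate if context grows across rounds, as the previous sections showed, so treat it as a floor rather than a forecast.&lt;/p&gt;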
&lt;h2 id=&#34;9-how-to-control-the-bill-in-practice&#34;&gt;9. How to control the bill in practice
&lt;/h2&gt;&lt;p&gt;If you are already using APIs or agents, the following methods are usually the most effective:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Shorten the system prompt and cut repeated wording&lt;/li&gt;
&lt;li&gt;Trim old conversation history regularly&lt;/li&gt;
&lt;li&gt;Keep only necessary fields from tool outputs&lt;/li&gt;
&lt;li&gt;Retrieve first, then send only relevant parts of long documents&lt;/li&gt;
&lt;li&gt;Limit output length and avoid unbounded expansion&lt;/li&gt;
&lt;li&gt;Use expensive models for high-value tasks and cheaper ones for lower-value tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In many cases, the best way to save money is not to switch blindly to a cheaper model. It is to remove unnecessary token consumption from the workflow first.&lt;/p&gt;
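&lt;p&gt;Of the methods above, trimming old history is the easiest to automate. A minimal sketch that keeps only the newest turns within a token budget (the budget and the &lt;code&gt;(role, text, tokens)&lt;/code&gt; message shape are assumptions for illustration):&lt;/p&gt;

```python
def trim_history(messages, budget_tokens):
    """Keep the most recent messages that fit within budget_tokens;
    oldest turns are dropped first. Each message is an assumed
    (role, text, tokens) tuple."""
    kept = []
    used = 0
    for msg in reversed(messages):   # walk newest to oldest
        role, text, tokens = msg
        if used + tokens > budget_tokens:
            break                    # this turn and everything older is dropped
        kept.append(msg)
        used += tokens
    kept.reverse()                   # restore chronological order
    return kept

history = [
    ("user", "first question", 900),
    ("assistant", "long first answer", 2_500),
    ("user", "follow-up", 300),
    ("assistant", "second answer", 1_200),
    ("user", "current question", 400),
]
trimmed = trim_history(history, budget_tokens=2_000)
print([m[0] for m in trimmed])  # ['user', 'assistant', 'user'] -- newest 1900 tokens
```

&lt;p&gt;Production systems often summarize the dropped turns instead of discarding them outright, trading a small summarization cost for a much smaller prompt on every later round.&lt;/p&gt;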
&lt;h2 id=&#34;10-how-to-think-about-all-of-this&#34;&gt;10. How to think about all of this
&lt;/h2&gt;&lt;p&gt;At the end of the day, token pricing is a way of charging for how much the model had to read, infer, and write.&lt;/p&gt;
&lt;p&gt;It is not like traditional software pricing, where per-account, per-request, or monthly billing is enough to describe resource use. A model call is a dynamic computation process. The amount of context you send, the tools you invoke, and the output length you request all directly affect cost.&lt;/p&gt;
&lt;p&gt;So the most important thing is not memorizing price tables. It is building the right intuition:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long context increases input cost&lt;/li&gt;
&lt;li&gt;Long output increases generation cost&lt;/li&gt;
&lt;li&gt;Tool chains amplify total token usage&lt;/li&gt;
&lt;li&gt;Caching and workflow design can change the bill significantly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once those points are clear, the pricing structure of most LLM APIs becomes much easier to understand.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
