Context Compaction: Why Your AI Bill Just Doubled

On March 13, 2026, Anthropic announced that Claude's 1M token context window was "now generally available" at "standard pricing, with no special pricing tier, no beta header, no asterisks." Three weeks later, developers discovered their bills had doubled.

The disconnect reveals a hidden truth about AI economics: large context windows are expensive, and someone has to pay. For developers, that "someone" is increasingly themselves.

The Harvey Tweet That Started It

On March 30, 2026, developer Harvey (yorkeccak) tweeted: "Opus 4.6 1Mil context now billed as extra usage... Guess we are back to /compact-ing our way through life."

The tweet referenced a GitHub issue documenting a silent billing change. In February 2026, a Max plan subscriber ran identical workloads before and after a Claude Code update. Same model, same plan, same task. The difference:

Before v2.1.51: 644 API calls, 85M tokens processed, zero extra usage charges.

After v2.1.51: 392 API calls, 80M tokens processed, $48.79 in extra usage charges.

The billing changed. The announcement didn't.

Why Context Costs Money

Large language models don't just process input — they store it. The "KV cache" (key-value cache) holds the intermediate state for every token in your conversation. For 1M tokens, that's gigabytes of GPU memory that must remain allocated for the entire session.

The math is brutal. Transformer attention is quadratic: at 10,000 tokens, the model tracks 100 million pairwise relationships. At 100,000 tokens, that's 10 billion. At 1M tokens, the computational cost explodes.
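The scale of both costs can be sketched with back-of-envelope arithmetic. The model dimensions below (48 layers, 8 KV heads, 128-dim heads, fp16) are hypothetical round numbers, not any particular model's architecture, and the estimate ignores optimizations like cache quantization:

```python
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Two tensors (K and V) per layer, per KV head, per token.
    # Dimensions here are illustrative, not a real model's config.
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem

def attention_pairs(tokens):
    # Full attention relates every token to every other token.
    return tokens * tokens

print(f"KV cache at 1M tokens: {kv_cache_bytes(1_000_000) / 1e9:.0f} GB")   # ~197 GB here
print(f"Pairwise relationships at 100K tokens: {attention_pairs(100_000):,}")  # 10 billion
```

Even with generous assumptions, the cache footprint grows linearly with context while attention compute grows quadratically, which is why long sessions get expensive fast.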

This is why Anthropic's "no surcharge" announcement was notable. Google charges a premium for Gemini's 2M context. OpenAI charges per token for GPT-4's 128K context. Anthropic was promising 1M at standard rates. The fine print revealed otherwise.

The "Re-Reading Loop" Problem

Even when billing works correctly, large context creates a technical problem: signal dilution. As context grows, LLM performance degrades even when the window isn't full.

The "Lost in the Middle" paper found that LLM accuracy drops by over 30% when relevant information sits in the middle of the context rather than at the beginning or end. Every irrelevant token makes the model worse at attending to what matters.

This creates the "re-reading loop": an agent searches for code and fills its context, the summarizer compresses the results, the summary paraphrases away exact file paths and line numbers, and the agent, needing those precise references for its next edit, searches again. The cycle repeats. Agents can spend 60% of their time re-searching for information that was summarized away.

Seven Methods of Context Compression

Developers have responded with increasingly sophisticated compression techniques. Each trades off compression ratio, fidelity, speed, and hallucination risk.

1. LLM Summarization (70-90% compression)

The most widely deployed approach. An LLM rewrites conversation history into organized sections: completed work, current state, pending tasks. Claude Code's auto-compact uses this method. Factory.ai's evaluation scored structured summarization 3.70/5 on 36,000 real engineering messages. The tradeoff: code snippets get paraphrased, file paths become "the auth module," line numbers vanish.
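The core of this approach is a compaction prompt that forces the summary into fixed sections. A minimal sketch (this is an illustrative prompt shape, not Claude Code's or Factory.ai's actual prompt):

```python
def build_compaction_prompt(history):
    # Flatten the transcript, then ask the model to rewrite it into
    # the three sections structured summarization relies on.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return (
        "Rewrite this conversation into three sections: "
        "## Completed work, ## Current state, ## Pending tasks. "
        "Preserve exact file paths and line numbers where possible.\n\n"
        + transcript
    )

history = [
    {"role": "user", "content": "Fix the login bug in src/auth.ts"},
    {"role": "assistant", "content": "Patched the token check at line 42."},
]
prompt = build_compaction_prompt(history)
```

The "preserve exact file paths" instruction helps, but as the tradeoff above notes, the rewriting model is still free to paraphrase them, which is exactly where fidelity is lost.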

2. Opaque Compression (variable compression)

OpenAI's Codex uses server-side compression via /responses/compact. You send context in, you get a smaller representation back. You cannot inspect what was kept or dropped. Factory.ai scored this 3.35/5 — lower than structured summarization, likely because the compressed output lacks explicit sections for done/failed/next.

3. Verbatim Compaction (50-70% compression, zero hallucination)

Morph Compact takes a different approach: deletion, not rewriting. The system identifies which tokens carry signal and removes the noise. Every surviving sentence is word-for-word identical to the original. Zero hallucination risk. Processing speed: 3,300+ tokens per second.
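The key property is that output is a strict subset of input. A toy sketch of deletion-based compaction, using a crude regex signal heuristic that stands in for Morph's actual (unpublished here) scoring:

```python
import re

# Toy heuristic, NOT Morph's scorer: paths, line numbers, error markers,
# and code keywords mark a sentence as carrying signal.
SIGNAL = re.compile(r"/[\w./-]+|line \d+|Error|def |class ")

def verbatim_compact(text, keep_ratio=0.5):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    n_keep = max(1, int(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(SIGNAL.findall(sentences[i])),
                    reverse=True)
    keep = set(ranked[:n_keep])
    # Surviving sentences are word-for-word identical to the original,
    # so nothing can be hallucinated into existence.
    return " ".join(s for i, s in enumerate(sentences) if i in keep)
```

Because the system only deletes, the worst failure mode is dropping something useful, never inventing a file path that was never there.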

4. Token-Level Pruning (2-20x compression)

Microsoft's LLMlingua scores each token by information entropy, then removes low-information tokens. The result reads like telegraphic text: grammatically broken but semantically preserved. LLMlingua-2 runs 3-6x faster than the original with 2-5x compression. Limitation: it operates below the semantic level, so file paths might lose their line numbers.
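The idea can be sketched with a crude frequency proxy for entropy (LLMLingua uses a small LLM's token probabilities, not raw counts, so this is an analogy, not the algorithm):

```python
from collections import Counter

def prune_tokens(text, keep_ratio=0.5):
    # Stand-in for entropy scoring: treat frequent tokens as
    # low-information and drop them first, preserving word order.
    tokens = text.split()
    counts = Counter(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i]])
    keep = set(ranked[:max(1, int(len(tokens) * keep_ratio))])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)
```

The surviving text reads telegraphically ("bug auth module fix src/auth.ts"), which models parse fine but which can sever a file path from its line number, the limitation noted above.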

5. Observation Masking (~100% compression per output)

The simplest approach: replace tool outputs with placeholders after the model has processed them. "[File read: src/auth.ts, 247 lines]" takes 7 tokens instead of 247 lines. JetBrains tested this against full LLM summarization on SWE-bench and found it matched quality at zero compute cost.
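The mechanism fits in a few lines. The message schema below (`role`, `tool`, `target` keys) is hypothetical, chosen for illustration rather than taken from any specific agent framework:

```python
def mask_observation(message):
    # After the model has processed a tool output, replace its body
    # with a short placeholder so it stops consuming context.
    if message.get("role") != "tool":
        return message
    n_lines = message["content"].count("\n") + 1
    placeholder = f"[{message['tool']}: {message['target']}, {n_lines} lines]"
    return {**message, "content": placeholder}
```

No LLM call is involved, which is why this matches summarization quality at zero compute cost: the model already extracted what it needed on the turn where it read the output.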

6. Adaptive Context Control / ACON (26-54% compression)

Microsoft Research's framework learns compression guidelines by analyzing why compressed versions fail. It treats compression as an optimization problem, not a fixed algorithm. Achieves 95%+ task accuracy while reducing tokens by up to 54%.

7. Multi-Agent Isolation (architectural approach)

Don't compress at all. Decompose tasks so each sub-task runs in its own context window. Anthropic's multi-agent research found that an Opus 4 lead agent delegating to Sonnet 4 subagents outperformed a single Opus 4 agent by 90.2% on research tasks. Tradeoff: 7x more total tokens across all agents, but each context stays clean.
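The architectural pattern is simple: the lead agent sees only each sub-agent's final answer, never the working context that produced it. A minimal sketch (the `call_model` callable and message shape are placeholders, not Anthropic's orchestration code):

```python
def run_subagent(task, call_model):
    # Each sub-task starts from an empty, isolated context window.
    context = [{"role": "user", "content": task}]
    return call_model(context)

def lead_agent(subtasks, call_model):
    # The lead accumulates only final answers; the token-heavy
    # search-and-read context of each sub-agent is discarded.
    answers = [run_subagent(t, call_model) for t in subtasks]
    digest = "\n".join(answers)
    return call_model([{"role": "user", "content": "Synthesize:\n" + digest}])
```

This is where the 7x token cost comes from: every sub-agent re-reads what it needs, but no single context ever fills up.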

Prevention Over Compression

The most effective compression is the compression you never need to run. Morph's FlashCompact approach centers on prevention: return only the relevant code in the first place, mask tool outputs once they have been processed, and keep context lean from the start.

The combined effect: 3-4x longer context life. Auto-compact fires 3-4x less often. More time reasoning, less time compressing and re-searching.

What This Means for Developers

The pricing confusion is a symptom of a deeper reality: AI companies are still figuring out how to price large context windows. Anthropic announced "no surcharge" but bills it as "extra usage." Google and OpenAI charge explicitly by token. No pricing model is quite right yet.

For developers, the practical response is twofold:

Prevention: Use semantic search (like WarpGrep) to return only relevant code, not entire files. Mask observations after processing. Keep context clean from the start.

Compression: When context fills, choose based on your needs. Summarization for highest compression. Verbatim compaction for zero hallucination risk. Multi-agent isolation for complex tasks.

The 1M context window is real. The "no surcharge" promise may not be. Either way, understanding context compression is now a core developer skill.


This article reflects our analysis and opinion based on publicly available information at the time of publication. The AI landscape evolves rapidly. Verify important claims independently. Views expressed are those of Singularity.Kiwi editors.