
3 Simple Techniques to Reduce Token Consumption in Claude Code and Codex
Slavik Shynkarenko · June 22, 2026 · 12 min read
At AutoScout24 Group, Codex and Claude Code are now part of everyday engineering work. Teams, and increasingly non-technical colleagues, use coding agents for exploration, coding, and automation. As Head of Platform Engineering, I care about the bill that comes with that momentum: my team has to make LLM usage transparent, standardise the optimisations that work, and keep agents efficient without slowing anyone down.
The surprising part is where much of that cost comes from. It is not the model thinking through a difficult feature or finding a subtle bug; that is the useful part. The waste is the thousand-line npm test log sent to the model so it can find one failing assertion. It is the five files dumped into context because nobody was sure where the relevant function lived. It is a frontier model writing a docstring that a much cheaper model would have handled just as well.
Tokens are the unit of cost, latency, and context capacity. Every unnecessary token in the context window is one the model cannot spend on the actual problem. Reducing token usage is not about penny-pinching. It is about preserving context for the work that actually matters, which makes agents cheaper to run, quicker to respond, and easier to keep on task. Across dozens of teams we have found that three techniques capture most of the value, and each one is small enough for an engineer to adopt without changing their whole workflow.
Technique 1: Stop feeding the model noise
The first rule of token economics: you pay for everything the model reads and writes back. A useful first step is to separate real signal from routine development noise.
Look at a single agent turn in a normal dev loop. The agent runs git status and receives 80 lines of untracked files. It runs the test suite and gets a wall of passing tests around the one failing assertion. It runs a build and reads every dependency warning emitted since 2021. The model will process all of that, at full token cost, to extract one or two facts.
You can reduce that traffic on both sides.
Trim the tool output: RTK
RTK sits between your agent and the commands it runs. It wraps noisy commands such as git status, npm test, and cargo build, then strips the output down to what actually matters before it reaches the model. The failing test survives. The 200 lines of passing scaffolding around it do not.
The useful part is how little discipline it requires from the agent. In Claude Code, rtk init --global installs a PreToolUse hook that rewrites Bash commands transparently. The agent types git status, the hook turns it into rtk git status, and the model never has to know.
As I’m writing this, Codex works differently in RTK’s current setup. RTK documents Codex as an instruction-based integration: rtk init --codex creates or patches AGENTS.md so Codex learns to prefer rtk <command> for noisy shell output. So you do not need to hand-write Codex hook configuration for the normal RTK path; run the installer for the agent you use, restart it, and keep working. The point is the same: less text entering the context window with very little workflow change. In practice, we see a 60% to 90% reduction on dev-loop output, exactly the high-frequency, low-value traffic worth squeezing.
RTK also keeps score. rtk gain shows the running tally of tokens saved, and rtk discover mines transcripts for noisy commands you have not wrapped yet. That measurement matters, because token optimisation gets much easier when you can see which changes actually move the numbers.
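The filtering idea behind a wrapper like this can be sketched in a few lines of Python. This is not RTK's implementation; the function name `trim_output`, the failure pattern, and the context size are all assumptions chosen for illustration, and a real wrapper would need per-command rules.

```python
import re

# Assumed failure markers; real test runners differ, so adjust for yours.
FAILURE_PATTERN = re.compile(r"FAIL|Error")


def trim_output(log: str, context: int = 1) -> str:
    """Keep lines near a failure marker and replace the rest with a one-line tally."""
    lines = log.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if FAILURE_PATTERN.search(line):
            # Keep the matching line plus a little surrounding context.
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    kept = [lines[i] for i in sorted(keep)]
    kept.append(f"[{len(lines) - len(kept)} lines omitted]")
    return "\n".join(kept)


if __name__ == "__main__":
    noisy = "\n".join(
        [f"PASS test_{n}" for n in range(200)]
        + ["FAIL test_token_expiry", "  AssertionError: expected 401, got 200"]
    )
    print(trim_output(noisy))
```

The trailing tally line matters: it tells the model that output was removed, so it can ask for the full log if the trimmed view is not enough.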
Trim the model’s own prose: Caveman
Output tokens are easy to ignore because they feel like “just the answer”, but they are expensive and they add latency. Left alone, a chatty model will preface work with “Great question”, restate your diff back to you, and close with a summary of what it just did. That can be pleasant in a tutorial. In a coding loop, it is often paid filler.
Caveman mode is the deliberately blunt version of the fix. It is a skill that compresses the model’s communication style by dropping articles, pleasantries, and filler while preserving the technical substance. Instead of:
The bug you're seeing is likely caused by the token expiry check, which appears to be using a strict less-than comparison.
you get:
Bug in auth middleware. Token expiry uses `<`, should be `<=`. Fix:
Users report significant reductions in output-token usage, particularly in iterative coding sessions where agents tend to produce verbose explanations. The style is not for everyone, but the underlying principle is useful even if you prefer a softer tone.
You do not need a dedicated tool for the gentler version. A few lines in CLAUDE.md or AGENTS.md go a long way:
## Output style
- Reply in unified diff form. No full-file rewrites unless asked.
- No preamble, no trailing summary of what you just did.
- For research questions, answer in 10 lines or fewer unless I ask for depth.
It is also worth setting explicit output controls. In Claude Code, CLAUDE_CODE_MAX_OUTPUT_TOKENS caps most response output. In Codex, model_verbosity = "low" tells GPT-5-family models to be terse. That setting is not a hard ceiling, but it does reduce default prose. If you use MCP tools, their own output limits matter too, because an overly generous tool response can flood the context before the agent has made a single decision.
Key point of this technique: stop paying for words that carry no valuable information.
Technique 2: Give the agent a map, not a phone book
The second source of waste is more subtle. It appears when the agent does not know where something lives, so it goes looking. For an LLM, “looking” usually means reading.
Imagine asking an agent to change how authentication works. It does not have your mental model of the codebase. It greps for auth, opens six files to see which one matters, reads each file top to bottom, follows an import into a seventh file, and only then starts to understand the shape of the change. Every read is full-price input, and much of it is irrelevant.
What the agent needed was a map rather than a scavenger hunt.
The graph approach
A code graph tool parses your repository once, usually with Tree-sitter, and builds a structural model of it: which functions call which, what depends on what, where the tests live, and how wide the blast radius of a change might be. It then exposes that model to the agent, usually through MCP. Instead of reading ten files to discover that validateToken is called from three places and tested in one, the agent asks the graph and receives the answer in a few hundred tokens.
That is the trade: structured queries instead of brute-force reads. A targeted “who calls this?” replaces a fan-out of file dumps. The savings scale with repository size. The larger and more tangled the codebase, the more a blind agent would have had to read, and the more the graph saves.
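To make the "who calls this?" query concrete, here is a minimal sketch of a caller index. It uses Python's standard `ast` module as a stand-in for Tree-sitter and only understands Python source; the function name `build_caller_index` and the sample files are hypothetical, and a real graph tool resolves names, imports, and types far more carefully than this name-based lookup does.

```python
import ast
from collections import defaultdict


def build_caller_index(sources):
    """Map each called function name to the 'file:function' sites that call it."""
    callers = defaultdict(set)
    for filename, code in sources.items():
        tree = ast.parse(code)
        for func in ast.walk(tree):
            if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    target = node.func
                    # Handle plain calls (f()) and attribute calls (obj.f()).
                    if isinstance(target, ast.Name):
                        name = target.id
                    else:
                        name = getattr(target, "attr", None)
                    if name:
                        callers[name].add(f"{filename}:{func.name}")
    return callers


if __name__ == "__main__":
    # Hypothetical two-file repository, inlined so the sketch is self-contained.
    repo = {
        "auth.py": "def validate_token(t):\n    return bool(t)\n\n"
                   "def login(t):\n    return validate_token(t)\n",
        "api.py": "def handler(req):\n    return validate_token(req)\n",
    }
    index = build_caller_index(repo)
    print(sorted(index["validate_token"]))
```

The index is built once and queried many times, which is the economics of the graph approach in miniature: the answer to a caller query is a short list of locations rather than the contents of every file that might contain one.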
At AutoScout24, we experimented with several tools. One I use daily is code-review-graph, an MCP server that maintains a persistent, incremental graph and auto-rebuilds on file save. The habit that makes it pay off is a simple rule in CLAUDE.md or AGENTS.md:
## Code exploration
Before any Grep/Glob/Read, call the code graph tools:
- semantic_search_nodes to find functions/classes by intent
- get_impact_radius before edits with a wide blast radius
- detect_changes + get_review_context for code review
Fall back to Grep/Read only when the graph cannot answer.
Without that nudge, agents tend to fall back to grep-and-read habits. With it, structural context becomes the first move and file reading becomes the fallback. That ordering is where most of the benefit comes from.
Pick the flavour that fits your repo
code-review-graph is what I use, but it is one option in a fast-moving field. The right choice depends on what kind of “understanding” your agent needs.
Structural / call-graph tools answer “how does this code connect?”: callers, callees, dependencies, impact. Serena leans on language servers (LSP) to give the agent precise go-to-definition and find-references. Aider’s repo map builds a ranked structural summary so the agent sees the skeleton of a large project without loading every file. Meta’s Glean is the heavy-duty option when you operate at serious scale.
Semantic search tools answer “where is the code that means this?” using embeddings rather than structure. Claude Context is a popular open-source option here. It reports meaningful token reductions by retrieving relevant slices instead of whole files. This is useful when intent matters more than call graphs.
Repo-packing tools answer “give me the whole thing, but cleanly packaged.” Repomix flattens a repo into a single AI-optimised file, respecting .gitignore and counting tokens as it goes. It is less surgical than a graph, but it is a one-command way to hand a model a deduplicated project view.
You do not have to agonise over the perfect choice. The practical point holds across tools: any structured index beats blind file reads. Start with the category that matches your pain, whether that is call-graph confusion, semantic “where is it?” hunts, or the occasional need to pack a whole repo into context. Wire in one tool first. Adding a second rarely pays until the first is part of the workflow.
Technique 3: Use the right model for the job
Using the strongest model for every task feels safe, but it is often the expensive version of laziness. A frontier model is worth it for ambiguous, high-stakes work. It is wasteful for mechanical edits, log classification, or a single-file documentation cleanup.
Larger models are more expensive per token, and the performance argument matters just as much. A smaller, faster model often returns a simple answer sooner, without the deliberation you did not need in the first place.
So the move is to route tasks to the model that fits them.
The routing map
The specific model names will change. The principle will not: use small models for high-volume, low-ambiguity work, and reserve frontier models for tasks where mistakes are expensive.
In practice the split looks like this:
- Small / fast models (Claude Haiku, GPT mini) -> triage, simple tests, doc edits, log classification, single-file refactors. High volume, low ambiguity.
- Frontier or coding-specialist models (Claude Sonnet/Opus, GPT-5) -> multi-file changes, hard reviews, migrations, anything where a wrong call is expensive. Low volume, high stakes.
Codex guidance at the time of writing recommends a frontier model such as gpt-5.5 for complex coding and research, and a faster *-mini variant for lighter tasks and subagents. Those names will age. The routing habit is the part that should stick.
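The routing map can be written down as a small decision function. The sketch below is one possible encoding, not a recommended policy: the `Task` fields, the tier labels, and the rule that anything touching more than one file goes to the frontier tier are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever model names are current.
SMALL = "small-fast"
FRONTIER = "frontier"

# Task kinds taken from the "high volume, low ambiguity" bucket above.
LOW_AMBIGUITY_KINDS = {
    "triage",
    "simple_test",
    "doc_edit",
    "log_classification",
    "single_file_refactor",
}


@dataclass(frozen=True)
class Task:
    kind: str            # e.g. "triage", "doc_edit", "migration"
    files_touched: int   # rough size of the change
    high_stakes: bool    # True when a wrong call is expensive


def pick_tier(task: Task) -> str:
    """Return the model tier for a task; unknown kinds default to the safer tier."""
    if task.high_stakes or task.files_touched > 1:
        return FRONTIER
    if task.kind in LOW_AMBIGUITY_KINDS:
        return SMALL
    return FRONTIER


if __name__ == "__main__":
    print(pick_tier(Task("doc_edit", 1, False)))
    print(pick_tier(Task("migration", 12, True)))
```

Defaulting unknown task kinds to the frontier tier trades some cost for safety. A team that measures its error rates could reasonably flip that default.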
Use subagents and profiles deliberately
Claude Code has a mature subagent workflow: keep a capable model as the orchestrator, then delegate bounded work to cheaper specialists. A triage subagent pinned to Haiku can classify a failure and hand back the next step. A review-diff subagent can inspect a patch and return only concrete findings. Each specialist runs on the smallest model that can do the job, with a tight output budget:
---
name: triage
description: Classify an issue or test failure into a known bucket
model: haiku
tools: Read, Grep
---
You triage. Reply in 120 tokens or fewer with: bucket, confidence, next step.
Codex supports the same pattern through subagents as well. Custom agents live as TOML files under ~/.codex/agents/ or .codex/agents/, and they can override the same settings as a normal Codex session: model, reasoning effort, sandbox mode, tools, MCP servers, and instructions. A documentation specialist might look like this:
name = "docs_researcher"
description = "Verify APIs, options, and version-specific behaviour before documentation edits."
model = "gpt-5.4-mini"
model_reasoning_effort = "medium"
sandbox_mode = "read-only"
developer_instructions = """
Use docs and local evidence before making claims.
Return concise answers with links or file references.
Do not edit application code.
"""
Profiles are useful for manual routing in Codex. In current Codex CLI versions, profiles are separate config overlays such as ~/.codex/cheap.config.toml and ~/.codex/frontier.config.toml, selected with --profile.
~/.codex/cheap.config.toml:
model = "gpt-5.4-mini"
model_reasoning_effort = "low"
model_verbosity = "low"
~/.codex/frontier.config.toml:
model = "gpt-5.5"
model_reasoning_effort = "high"
model_verbosity = "medium"
Then choose the profile at the task boundary:
codex --profile cheap "regenerate the changelog"
codex --profile frontier "plan the auth migration"
The important knob is not only the model. Reasoning effort matters too. Even within one model family, shallow tasks do not need deep reasoning. Dialling reasoning down for straightforward work saves tokens; dialling it up for hard work spends them where they are most likely to pay back.
The mindset shift is to stop choosing one model for an entire session. A session is a mix of tasks with very different risk and difficulty. Pay accordingly.
Conclusion
The techniques I mentioned are easy to implement. No retrieval platform, no fine-tuning project, no grand rewrite of your developer workflow. Three practical moves cover most of the waste:
- Trim the noise. Wrap tool output with RTK, compress the model’s prose, and set sensible output controls.
- Give the agent a map. Use a code graph or another structured index so targeted queries replace brute-force file reads.
- Route to the right model. Use frontier models for hard calls, cheaper specialists for bounded work, and reasoning effort that matches the task.
The common theme behind all three is simple: reasoning is usually not the expensive part. Information movement is. Every unnecessary log line, file read, and verbose response competes with the actual problem you are trying to solve. Reduce the noise, improve navigation, and match the model to the task. The result is lower cost, lower latency, and often better outcomes.
If you take one habit away, make it measurement. Watch your cache-hit ratio and your cost per merged AI-assisted PR month over month. If both trend in the right direction, scale what works. Token efficiency should not be a one-time cleanup but an easy-to-follow discipline.
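Both metrics reduce to simple arithmetic once the raw numbers are exported. The sketch below assumes that cache-hit ratio means cached input tokens divided by total input tokens, and that you can obtain monthly spend and merged-PR counts from your own billing and source-control data; the function names are hypothetical.

```python
def cache_hit_ratio(cached_input_tokens, total_input_tokens):
    """Share of input tokens served from the prompt cache (assumed definition)."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens


def cost_per_merged_pr(total_cost, merged_prs):
    """Spend divided by merged AI-assisted PRs; None when nothing was merged."""
    if merged_prs == 0:
        return None
    return total_cost / merged_prs


if __name__ == "__main__":
    # Made-up figures, only to show the calculation.
    print(cache_hit_ratio(750_000, 1_000_000))
    print(cost_per_merged_pr(120.0, 40))
```

Tracking the two together guards against a misleading win: a falling cost per PR with a collapsing cache-hit ratio usually means fewer PRs were attempted, not that the agents became more efficient.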
