
Context Management

You're using AI to refactor a large module. The Agent has read a dozen files, executed a string of shell commands, and traced dozens of symbol references. Half an hour in, it starts getting "sluggish": answers go off-track, it repeats earlier mistakes, and it forgets key information you already provided.

The model didn't get dumber. The context window got overwhelmed.

Helix solves this with two complementary strategies: Cache (store and recall on demand) and Compact (compress into structured summaries).


Why Context Management Is Critical

Token consumption in tool-intensive AI engineering tasks grows far faster than in normal conversation:

  • Reading a 500-line file → ~2,000 tokens
  • Executing a find command → potentially thousands of lines of output
  • Searching code references → match results across dozens of files
  • Every tool call's parameters and results count toward context

After 10 tool calls, you may have consumed 50K+ tokens. Meanwhile, most models concentrate effective attention on recent content, so important early decisions and context fade from view.
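The arithmetic above can be sketched with a rough estimator. All per-call token counts here are illustrative assumptions, not Helix internals; real tokenizers and tool outputs vary widely:

```python
# Back-of-envelope context accounting for tool-heavy sessions.
# Per-call costs below are assumed averages, not measured values.

TOKENS_PER_FILE_READ = 2_000   # ~500-line file
TOKENS_PER_SHELL_CMD = 8_000   # e.g. a chatty `find` emitting thousands of lines
TOKENS_PER_SEARCH = 5_000      # reference matches across dozens of files

def estimate_tokens(file_reads: int, shell_cmds: int, searches: int) -> int:
    """Estimate context consumed by tool outputs alone."""
    return (file_reads * TOKENS_PER_FILE_READ
            + shell_cmds * TOKENS_PER_SHELL_CMD
            + searches * TOKENS_PER_SEARCH)

# Ten mixed tool calls already eat a large slice of a 128K window:
used = estimate_tokens(file_reads=4, shell_cmds=3, searches=3)
print(used)                                   # 47000
print(f"{used / 128_000:.0%} of a 128K window")  # 37% of a 128K window
```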

Without context management, you have two choices:

  1. Start a new chat — lose all progress
  2. Push through — output quality keeps declining, costs keep rising

Helix offers a third way.


Cache: Store Large Outputs, Recall on Demand

Cache solves this problem: tool outputs are too large, but you don't need them every time.

How It Works

  1. After conversation turns exceed a threshold, Helix automatically replaces large tool outputs in earlier messages with lightweight cache markers
  2. Original content is stored in a local KV cache (Pebble storage engine with SHA256 content-addressed deduplication)
  3. Cache markers remain in conversation history, in the form [CACHED] recall_cached_content("sha256hash")
  4. When the Agent needs to revisit a historical tool output, it recalls the original content on demand via the recall_cached_content tool
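The four steps above can be sketched as a minimal in-memory stand-in. Helix actually persists to a Pebble KV store behind an MCP server; the class and its simplified dict storage are assumptions for illustration:

```python
import hashlib

class ToolOutputCache:
    """Minimal sketch of content-addressed caching of tool outputs."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}   # sha256 hex -> original content

    def cache(self, content: str) -> str:
        """Store content once; return the marker that replaces it in history."""
        digest = hashlib.sha256(content.encode()).hexdigest()
        self._store[digest] = content      # identical content dedupes to one entry
        return f'[CACHED] recall_cached_content("{digest}")'

    def recall_cached_content(self, digest: str) -> str:
        """Recover the full original output on demand."""
        return self._store[digest]

cache = ToolOutputCache()
marker = cache.cache("...500 lines of file contents...")
# The marker is a few dozen bytes; the original stays fully recoverable:
digest = marker.split('"')[1]
assert cache.recall_cached_content(digest) == "...500 lines of file contents..."
```

Because the key is a hash of the content itself, caching the same output twice maps onto the existing entry, which is where the deduplication comes from.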

Key Design Choices

  • Content-addressed deduplication — identical content stored only once, SHA256 hash ensures uniqueness
  • High reversibility — any cached content can be recalled in full, no information loss
  • Automatic triggering — no user intervention needed, the system handles it based on conversation turn count
  • Workspace-level isolation — each Workspace has its own cache instance, managed through the builtin_context_cache MCP server

Practical Effect

Imagine a session with 80 tool calls where the first 60 tool outputs are no longer in the "active working area." Cache replaces them with markers of a few dozen bytes, freeing massive context space, while the Agent can still "look back" at any historical output whenever needed.


Compact: Compress Old Conversations into Structured Summaries

Compact solves this problem: conversation history is too long, but you can't just throw it all away.

Trigger Conditions

Compact triggers automatically when token usage reaches 85% of the context window (default window size: 128K tokens). All of the following conditions must also hold:

  • No compression operation is currently in progress
  • At least 60 seconds have passed since the last compression (cooldown period)
  • There are enough historical messages to compress

Compression Method: "Handover Document" Format

Compact doesn't simply truncate or delete old messages. Instead, it asks the model to generate a structured "handover document" summary containing six sections:

| Summary Section | Contents |
| --- | --- |
| Project Background | Project type, tech stack, key architecture information |
| Technical Details | Important technical facts and constraints discovered |
| Completed Work | What's been done so far, key decisions and rationale |
| Current Issues | Blockers and unresolved problems |
| TODO Items | What to do next, prioritized |
| User Preferences | Style, constraints, and special requirements expressed by the user |

This summary is injected into conversation history as a [Conversation History Summary] message, replacing the compressed old messages.
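The six-section format can be modeled as a simple structure rendered into the replacement message. The section names follow the table above; the class, field names, and rendering format are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class HandoverSummary:
    """The six sections of a Compact 'handover document' (sketch)."""
    project_background: str   # project type, tech stack, key architecture
    technical_details: str    # important facts and constraints discovered
    completed_work: str       # what's done, key decisions and rationale
    current_issues: str       # blockers and unresolved problems
    todo_items: str           # what to do next, prioritized
    user_preferences: str     # style, constraints, special requirements

    def to_message(self) -> str:
        """Render as the message that replaces compressed history."""
        sections = [
            ("Project Background", self.project_background),
            ("Technical Details", self.technical_details),
            ("Completed Work", self.completed_work),
            ("Current Issues", self.current_issues),
            ("TODO Items", self.todo_items),
            ("User Preferences", self.user_preferences),
        ]
        body = "\n".join(f"## {title}\n{text}" for title, text in sections)
        return f"[Conversation History Summary]\n{body}"
```

Keeping the summary structured rather than free-form is what lets a later Compact pass update it section by section instead of summarizing a summary blindly.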

Three-Phase Lock Strategy

To ensure the compression process doesn't conflict with ongoing conversations, Compact uses a three-phase lock strategy:

  1. Hold lock, read data — acquire lock, read messages to compress
  2. Release lock, call AI — release lock, call AI to generate summary (this step takes a while, doesn't block conversation)
  3. Re-acquire lock, write results — lock again, write summary into conversation history

This means users can continue chatting normally while Compact generates its summary — no blocking.
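The three phases can be sketched as follows, with the slow model call deliberately placed outside the lock. Class and method names are illustrative, and `summarize` stands in for the AI call:

```python
import threading

class Compactor:
    """Sketch of the three-phase lock strategy around compression."""

    def __init__(self, history: list[str]) -> None:
        self.lock = threading.Lock()
        self.history = history

    def compact(self, summarize) -> None:
        # Phase 1: hold lock, snapshot the messages to compress.
        with self.lock:
            to_compress = list(self.history)

        # Phase 2: lock released; the slow AI call runs without
        # blocking users who keep appending to the conversation.
        summary = summarize(to_compress)

        # Phase 3: re-acquire lock, splice the summary in. Messages
        # that arrived during phase 2 are kept intact after it.
        with self.lock:
            new_messages = self.history[len(to_compress):]
            self.history[:] = [f"[Conversation History Summary]\n{summary}"] + new_messages
```

The design choice is the classic one of never holding a lock across a slow I/O call: only the cheap snapshot and splice are serialized, so the conversation stays responsive during summarization.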


Cache vs Compact: When to Use Which

| Dimension | Cache | Compact |
| --- | --- | --- |
| Optimizes | Repeated transmission of large tool outputs | Overly long conversation history |
| Reversibility | ✅ High — full original content recoverable anytime | ⚠️ Low — compressed to summary, details lost |
| Trigger | Conversation turn threshold | Token usage hits 85% of window |
| Storage | Local KV cache (Pebble + SHA256) | Replaced with summary message |
| User Perception | Nearly invisible — Agent auto-decides when to recall | May be noticeable — early details get summarized |
| Applies To | File contents, command output, search results | Completed discussion segments, earlier reasoning |

Complete Lifecycle of Dual-Strategy Collaboration

Session starts
▼ Normal conversation (all content in context)
├─ Tool outputs accumulate...
▼ Turn count exceeds threshold → Cache activates
│    Early tool outputs → cache markers
│    When needed → recall_cached_content retrieves them
├─ Conversation keeps growing...
▼ Tokens hit 85% of window → Compact triggers
│    Old conversations → "handover document" summary
│    Recent conversations remain intact
├─ Work continues...
▼ Approaching limit again → Compact fires again (after 60s cooldown)
└─ Session can run for hours / across days

Real-World Scenario

Scenario: A multi-day large-scale refactoring task

  • Hour 1: Understanding code structure, reading a dozen files → Cache starts storing early tool outputs
  • Hour 2: Forming a plan, beginning code changes → context approaches 85% → Compact generates the first handover summary
  • Hour 3: Continuing changes, encountering an issue that requires reviewing an earlier file → Agent recalls it via Cache
  • Next day: Another Compact → summary updates → work continues without losing key context

Throughout this entire process, users don't need to manage anything manually. The system automatically balances context capacity, information retention, and cost.


Synergies with Other Capabilities

With SubAgents

SubAgents run in isolated contexts with their own independent Cache/Compact strategies. A subtask's intermediate tool outputs never enter the main session's context — the main session only receives distilled summary results. This is another form of "context protection."

With Workspaces

Both Cache and Compact operate at the Workspace level — each Workspace has independent cache storage and compression policies, so context management in one project never interferes with another.