
Manager Mode: How Helix Keeps AI Agents on Track for Real Engineering Tasks

· 14 min read
AI Agent Team
Core Development Team

Most AI agents fail not because they lack capability — they fail because they drift.

They start with your intent, pick up momentum, and end up doing twelve things you never asked for. Or they declare success before anything actually ships. Manager Mode in Helix is the architectural answer to that problem.

The drift problem nobody talks about

You ask an AI to "refactor the authentication module." Three minutes later it has:

  • refactored authentication ✓
  • "improved" a bunch of unrelated utilities
  • changed the error handling convention across five files
  • added a new dependency it thought was "clearly better"
  • written a summary explaining why all of this was necessary

The core task might be done. But now you have a diff that touches forty files, your code review is a nightmare, and you cannot tell which changes you asked for and which ones the agent made on its own.

This is scope drift. It happens because a single-agent system has no separation between understanding what was asked and executing what it thinks is needed. Those two things collapse into one thread with no guardrails.

Helix Manager Mode solves this with a three-layer architecture in which intent preservation, task execution, and parallel subtask work are each assigned to a separate, specialized agent.


What is Manager Mode

Manager Mode is Helix's orchestration layer for complex, multi-step engineering work.

When you enable it, your session gains a Manager Agent that sits between you and execution. The Manager does not write code. It does not run tools. Its job is to:

  1. Receive your request and forward it to the Execution Agent — faithfully, without modification
  2. Verify that what actually got done matches what you actually asked for
  3. Enforce a strict definition of "done" that covers implementation, commit, merge, verification, and a clean workspace
  4. Refuse to call anything complete until all five criteria are met, with evidence

The Execution Agent handles the real work, breaking tasks into subtasks and running them in parallel using SubAgents. But it always operates under the Manager's scope constraints.

Think of it as having a technical project manager and a senior engineer on every task, where the project manager's only job is to make sure the engineer doesn't go off-script.


Three-layer architecture

Here is how the three layers interact:

┌─────────────────────────────────────────────────────────┐
│                           You                           │
│                 (send task via chat/UI)                 │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      Manager Agent                      │
│                                                         │
│  • Locks original intent as a baseline                  │
│  • Forwards task to Execution Agent (verbatim)          │
│  • Checks completion: impl + commit + merge +           │
│    verify + clean workspace + no scope creep            │
│  • Demands evidence, not just summaries                 │
└───────────────────────┬─────────────────────────────────┘
                        │ longterm_chat tool
                        ▼
┌─────────────────────────────────────────────────────────┐
│                     Execution Agent                     │
│                                                         │
│  • Plans and executes the actual implementation         │
│  • Decomposes work into parallel subtasks               │
│  • Manages tool calls: file edits, shell, LSP, MCP      │
│  • Reports back with verifiable evidence                │
└──────┬────────────────┬────────────────┬────────────────┘
       │ run_subagent   │ run_subagent   │ run_subagent
       ▼                ▼                ▼
┌────────────┐ ┌────────────┐ ┌────────────────────────┐
│ SubAgent A │ │ SubAgent B │ │ SubAgent C             │
│            │ │            │ │                        │
│ "Fix auth  │ │ "Write     │ │ "Update integration    │
│ module"    │ │ tests"     │ │ tests and docs"        │
└────────────┘ └────────────┘ └────────────────────────┘
(all three SubAgents share the Execution Agent's worktree and run in parallel)

SubAgents share the Execution Agent's git worktree — they operate within the same working copy of the repository. The Execution Agent coordinates all file changes and, when the task is complete, commits and merges the result back to the main branch.

SubAgents cannot spawn their own SubAgents. This is intentional. Unbounded recursion in agent systems leads to unpredictable resource usage and hard-to-trace execution paths. The three-layer limit is enforced at the system level.
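
A limit like this is simple to enforce in the spawning code path rather than in a prompt. The Python sketch below is illustrative only: the Agent class, MAX_DEPTH, and DepthLimitError are hypothetical names rather than Helix internals, and it assumes the Manager sits at depth 0.

```python
from dataclasses import dataclass

# Assumed layering: Manager = 0, Execution Agent = 1, SubAgent = 2.
MAX_DEPTH = 2


class DepthLimitError(RuntimeError):
    """Raised when an agent at the deepest layer tries to spawn a child."""


@dataclass(frozen=True)
class Agent:
    name: str
    depth: int = 0

    def spawn(self, name: str) -> "Agent":
        # The check lives in code, so no prompt wording can bypass it.
        if self.depth >= MAX_DEPTH:
            raise DepthLimitError(
                f"{self.name} (depth {self.depth}) cannot spawn {name}"
            )
        return Agent(name=name, depth=self.depth + 1)


manager = Agent("manager")
execution = manager.spawn("execution")
sub = execution.spawn("subagent-a")

try:
    sub.spawn("nested-subagent")
    nested_allowed = True
except DepthLimitError:
    nested_allowed = False

print(sub.depth, nested_allowed)  # 2 False
```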


Core mechanisms

Scope locking

When the Manager Agent receives your request, it records your original intent as a baseline. Every subsequent action by the Execution Agent is evaluated against this baseline.

The rules are strict:

  • "Improvements" that weren't requested are out of scope
  • Refactors that touch files not related to the task are out of scope
  • New dependencies added because the agent thought they were better are out of scope

The Manager maintains a mental model of what belongs to this task and what does not. If the Execution Agent tries to expand the scope — even with good justification — the Manager flags it and either rejects it or surfaces it to you as an explicit proposal.
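
One way to picture the baseline check is as a comparison between the files a change touches and the directories named in the request. This is a minimal Python sketch under that assumption; ScopeBaseline and out_of_scope are hypothetical names, and the real Manager reasons about intent rather than only matching paths.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ScopeBaseline:
    """Hypothetical record of the directories a task may touch."""

    allowed_dirs: tuple[str, ...]

    def is_in_scope(self, path: str) -> bool:
        # Compare whole path components, so "src/authx" does not match "src/auth".
        candidate = PurePosixPath(path)
        return any(candidate.is_relative_to(d) for d in self.allowed_dirs)


def out_of_scope(baseline: ScopeBaseline, changed_files: list[str]) -> list[str]:
    """Return the changed files that fall outside the locked baseline."""
    return [f for f in changed_files if not baseline.is_in_scope(f)]


baseline = ScopeBaseline(allowed_dirs=("src/auth", "src/components/Login"))
changed = [
    "src/auth/session.py",
    "src/components/Login/Form.tsx",
    "src/utils/strings.py",  # not requested, so it should be flagged
]
violations = out_of_scope(baseline, changed)
print(violations)  # ['src/utils/strings.py']
```

Anything in violations would be rejected or surfaced as an explicit proposal instead of being merged silently.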

The five completion criteria

The Manager will not report a task as complete until all five of these are true — with verifiable evidence:

#  Criterion                                   What counts as evidence
1  Original requirement implemented correctly  Test output, output of relevant commands
2  Changes committed                           git log showing the commit
3  Merged to main branch                       git log main showing the merge
4  Main branch verified post-merge             Build/test run on main after merge
5  Workspace clean, no scope violations        git status clean, diff shows only expected files

"The agent said it's done" does not count. The Manager requires actual command output, tool results, or test runs. This prevents the common failure mode where an agent summarizes success without actually delivering it.

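The evidence requirement can be modeled as a checklist that stays incomplete until every criterion has captured output attached. The Python sketch below assumes that shape; CompletionReport and the criterion keys are hypothetical names, and the evidence strings are placeholders rather than real command output.

```python
from dataclasses import dataclass, field

# The five criteria from the table above, as hypothetical keys.
CRITERIA = (
    "implemented",
    "committed",
    "merged",
    "main_verified",
    "workspace_clean",
)


@dataclass
class CompletionReport:
    evidence: dict[str, str] = field(default_factory=dict)

    def attach(self, criterion: str, output: str) -> None:
        if criterion not in CRITERIA:
            raise KeyError(f"unknown criterion: {criterion}")
        self.evidence[criterion] = output

    def missing(self) -> list[str]:
        # An empty or whitespace-only string does not count as evidence.
        return [c for c in CRITERIA if not self.evidence.get(c, "").strip()]

    def is_complete(self) -> bool:
        return not self.missing()


report = CompletionReport()
report.attach("implemented", "<test run output>")
report.attach("committed", "<git log output>")
report.attach("merged", "<git log main output>")
report.attach("main_verified", "<build output on main>")
print(report.is_complete(), report.missing())  # False ['workspace_clean']

report.attach("workspace_clean", "<git status output>")
print(report.is_complete())  # True
```
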
Parallel SubAgent execution

When the Execution Agent identifies independent subtasks, it dispatches them as SubAgents that run concurrently:

Execution Agent calls run_subagent("Fix auth module", model="claude-sonnet-4-5")
Execution Agent calls run_subagent("Write tests", model="claude-sonnet-4-5")
Execution Agent calls run_subagent("Update docs", model="claude-haiku-4-5")
           ↓                        ↓                        ↓
   [runs in parallel]       [runs in parallel]       [runs in parallel]
           ↓                        ↓                        ↓
SubAgent returns result  SubAgent returns result  SubAgent returns result
           ↓                        ↓                        ↓
                Execution Agent collects all results
                                    ↓
                         Runs verification

SubAgents share the Execution Agent's worktree and coordinate their file changes through the Execution Agent, which sequences writes and manages the final merge.
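
The dispatch-then-sequence pattern can be sketched with Python's asyncio. This is an assumption-laden illustration, not the Helix tool API: the run_subagent coroutine here is a stand-in that sleeps instead of calling a model, and the patch file names are invented.

```python
import asyncio


async def run_subagent(task: str, delay: float) -> dict:
    """Hypothetical stand-in for a SubAgent; sleeps instead of doing real work."""
    await asyncio.sleep(delay)
    return {"task": task, "proposed_files": [f"{task.replace(' ', '_')}.patch"]}


async def execute(tasks: list[tuple[str, float]]) -> list[str]:
    # All SubAgents run concurrently.
    results = await asyncio.gather(*(run_subagent(t, d) for t, d in tasks))
    # gather() returns results in submission order, not completion order,
    # so writes can be applied one at a time in a predictable sequence.
    applied: list[str] = []
    for result in results:
        applied.extend(result["proposed_files"])
    return applied


tasks = [("Fix auth module", 0.02), ("Write tests", 0.01), ("Update docs", 0.0)]
applied = asyncio.run(execute(tasks))
print(applied)
# ['Fix_auth_module.patch', 'Write_tests.patch', 'Update_docs.patch']
```

Even though the third task finishes first, its result is applied last, which is the controlled ordering the Execution Agent is described as providing.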

Context management under long tasks

Long-running tasks accumulate a lot of history. Helix uses two mechanisms to keep sessions healthy:

KV Caching: Large tool outputs (file reads, command results, search results) are cached so they don't need to be re-sent with every LLM request. The cache is transparent: there is nothing to configure.

Auto-compression: When conversation history grows beyond a threshold, Helix compresses older messages into a concise summary and moves the "active window" forward. The agent retains full context of what happened without paying the token cost of the full history.

Both mechanisms are invisible during normal use. They're what makes a 50-turn task feel as responsive as a 5-turn one.
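
The auto-compression idea can be sketched as a sliding window over the message list. This Python example makes several assumptions: it counts messages rather than tokens, the threshold values are arbitrary, and naive_summary is a placeholder for what would presumably be a model-generated summary.

```python
from typing import Callable


def naive_summary(older: list[str]) -> str:
    """Placeholder summarizer; a real system would likely ask a model."""
    return f"[summary of {len(older)} earlier messages]"


def compress_history(
    messages: list[str],
    max_messages: int = 6,
    keep_recent: int = 3,
    summarize: Callable[[list[str]], str] = naive_summary,
) -> list[str]:
    """Fold older messages into one summary once the history grows too long."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= max_messages:
        return list(messages)
    # Everything before the active window is replaced by a single summary.
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older), *recent]


history = [f"turn {i}" for i in range(1, 9)]  # 8 messages, over the threshold
compressed = compress_history(history)
print(compressed)
# ['[summary of 5 earlier messages]', 'turn 6', 'turn 7', 'turn 8']
```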


How to use Manager Mode

Enabling it

On the workspace selector page, you will find a Manager entry alongside the Chat option. Click it to open a Manager session. The three-layer architecture is automatic.

Writing good task requests for Manager Mode

Manager Mode is most effective when your request is specific about scope boundaries. Compare:

Less effective:

Improve the login flow

More effective:

Refactor the login flow to use the new AuthService interface. Only touch files in src/auth/ and src/components/Login/. Don't change the API contracts.

The Manager uses your request as its scope baseline. The more precisely you describe what's in scope, the more precisely it can guard against drift.

You don't need to be exhaustive — the Manager can handle ambiguity. But explicit scope boundaries give it harder constraints to enforce.

Handling scope expansion proposals

Sometimes the Execution Agent will identify something it thinks should be part of the task. The Manager will surface this to you as an explicit question rather than silently including it:

Execution Agent found an issue in the session middleware that may affect
the auth refactor. This was not in the original scope.

Expand scope to include middleware fix? [Yes / No / Defer]

Saying "No" or "Defer" keeps the current task clean. You can always start a new session for the follow-up work.

Monitoring progress

While a Manager session is running, you can see real-time progress:

  • Which SubAgents are active and what they're working on
  • Token usage per agent
  • Tool calls in flight
  • What changes are currently pending in the worktree

This is all visible in the Manager Board, which we'll cover in detail below.


Real-world scenarios

Scenario 1: Multi-module API migration

The task: Migrate three service modules from REST to gRPC.

Without Manager Mode: You start a session, the agent begins migrating auth service, notices the user service uses a similar pattern, starts touching that too, then realizes the test fixtures need updates, then decides to refactor the error types "since we're here anyway." Two hours later you have a diff across eleven modules and a broken build.

With Manager Mode:

You submit:

Migrate auth-service, payment-service, and notification-service from REST to gRPC. Use the existing proto definitions in /proto/. Don't touch other services or shared utilities.

The Manager locks this scope. The Execution Agent dispatches three SubAgents — one per service — running in parallel:

SubAgent A: auth-service migration          [parallel]
SubAgent B: payment-service migration       [parallel]
SubAgent C: notification-service migration  [parallel]

Each SubAgent works independently. When all three complete, the Execution Agent runs the merge sequence and verifies the build. The Manager reviews the final diff, confirms it only touches the three specified services, then presents you with a commit hash, test results, and a clean git status.

Total scope: exactly what you asked for.

Scenario 2: Large codebase refactor with test coverage

The task: A legacy data model class (LegacyUserRecord) needs to be replaced with the new UserProfile type across a large codebase — 60+ files.

Without Manager Mode: A single-agent session will lose track of its own progress in long tasks. It might fix 40 files, think it's done, write a summary, and stop. Or it might fix 60 files but introduce subtle differences in how it handled edge cases across different parts of the codebase.

With Manager Mode:

The Execution Agent uses LSP tools to find all 63 references to LegacyUserRecord, groups them into logical clusters by module, and dispatches SubAgents for each cluster:

SubAgent A: core domain models (12 files)
SubAgent B: API layer (8 files)
SubAgent C: service layer (18 files)
SubAgent D: repository layer (14 files)
SubAgent E: test files (11 files)

Each cluster is internally consistent. When all SubAgents complete, the Execution Agent runs the full test suite. The Manager verifies:

  • All 63 references migrated (via grep -r LegacyUserRecord returning empty)
  • Tests pass
  • No unrelated files changed

If any SubAgent missed a reference or introduced a regression, the Manager identifies the gap and sends the Execution Agent back to fix specifically that issue — not restart everything.

Scenario 3: Parallel feature development with merge coordination

The task: Implement a new analytics dashboard that requires backend API endpoints, frontend components, and database migrations — all independent work streams.

The challenge: Three engineers would normally do this in parallel. With a single AI agent, it becomes a serial slog.

With Manager Mode:

You send:

Build the analytics dashboard feature. Backend: add /api/analytics/summary and /api/analytics/events endpoints in src/api/. Frontend: create AnalyticsDashboard component in src/components/. Database: add migration for analytics_events table. These are independent — parallelize them.

The Execution Agent dispatches three SubAgents simultaneously:

SubAgent A [backend]  → writes API endpoints, runs unit tests
SubAgent B [frontend] → builds React component with mock data
SubAgent C [database] → writes migration, tests locally

SubAgent A and SubAgent C finish first. SubAgent B finishes 40 seconds later. The Execution Agent then:

  1. Collects and applies all three SubAgents' results in sequence
  2. Runs integration tests that connect all three layers
  3. Fixes one minor import path conflict from the merge
  4. Verifies the full test suite passes

The Manager confirms: three independent workstreams, completed in roughly the time it would have taken to do one serially, with verified integration.


Manager Board: visibility across all your tasks

When you're running multiple Manager sessions — whether sequentially or in parallel across different projects — the Manager Board gives you a single view of everything.

What the Board shows

Open the Manager Board from the sidebar. You'll see four sections:

Needs Attention — sessions that are running, have unread updates, or have encountered errors. These appear as cards in a horizontal scroll row, sorted by urgency: errors first, then active, then unread.

Favorites — sessions you've starred for quick access. Good for long-running background tasks you want to check in on.

In Progress — all active (non-completed) sessions across all workspaces, in a grid layout.

Completed — finished sessions grouped by date. Today's group expands by default. Older groups are collapsed.
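
The Needs Attention ordering amounts to a filter followed by a sort on a fixed priority. The Python sketch below assumes sessions are plain dictionaries with a status field; that data shape and the URGENCY mapping are illustrative, not the Board's actual model.

```python
# Lower number means more urgent: errors first, then active, then unread.
URGENCY = {"error": 0, "running": 1, "unread": 2}


def needs_attention(sessions: list[dict]) -> list[dict]:
    """Keep only sessions that need attention, most urgent first."""
    flagged = [s for s in sessions if s["status"] in URGENCY]
    # sorted() is stable, so sessions with equal urgency keep their order.
    return sorted(flagged, key=lambda s: URGENCY[s["status"]])


sessions = [
    {"title": "Docs cleanup", "status": "unread"},
    {"title": "Auth refactor", "status": "running"},
    {"title": "Old spike", "status": "idle"},
    {"title": "gRPC migration", "status": "error"},
]
ordered = [s["title"] for s in needs_attention(sessions)]
print(ordered)  # ['gRPC migration', 'Auth refactor', 'Docs cleanup']
```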

Session card anatomy

Each card shows:

┌──────────────────────────────────────┐
│ ● Auth refactor - v2 API             │ ← session title (editable)
│                                      │
│ workspace: my-backend                │ ← workspace name
│ ⚡ 3 agents running                  │ ← live SubAgent count
│ 🔧 2 background tasks                │ ← background task count
│                                      │
│ [Open] [★ Favorite] [✓ Complete]     │ ← action buttons
└──────────────────────────────────────┘

The indicator color reflects status: green (running), yellow (unread), red (error), gray (idle).

Bulk operations

For teams managing many tasks, the Board supports bulk completion:

  • Complete all — marks everything done
  • Complete tasks older than 30 days — good for periodic cleanup
  • Complete by workspace — useful when you've finished a project sprint

These are available from the dropdown on the Board header.

Real-time updates

The Board subscribes to live WebSocket events from all your workspaces. When a SubAgent completes a task, when an error occurs, or when a session goes from idle to running, the Board updates in real time without requiring a manual refresh.

If you're away when a session completes, the card will show an unread indicator when you return. Opening the session clears it.


When to use Manager Mode

Manager Mode adds orchestration overhead. For quick, scoped tasks it's often more than you need. Here's a rough guide:

Task type                                                Recommended mode
Quick question, explanation, code snippet                Standard chat
Single-file edit or small bug fix                        Standard or Coder mode
Multi-file refactor within one module                    Coder mode
Cross-module refactor, feature spanning multiple layers  Manager Mode
Large migration (many files, parallel workstreams)       Manager Mode
Long-running task where you need to walk away            Manager Mode
Task where scope drift has burned you before             Manager Mode

The signal that you want Manager Mode: if you'd normally write a task spec or ticket before handing it to another person, you probably want Manager Mode.


What makes this work in practice

A few design decisions that make the system reliable rather than just theoretically sound:

The Manager never executes. It has no file tools, no shell access. It can only observe and direct. This separation is what makes scope enforcement credible — the enforcer can't be tempted to "just fix one more thing."

SubAgents are recursion-limited. SubAgents cannot spawn their own SubAgents. This is a hard system-level constraint, not a prompt instruction. It keeps execution depth predictable and prevents runaway branching.

Evidence is required, not requested. The Manager's completion check is not "did the agent say it's done?" It's "can I see the command output that proves it?" The prompting enforces that distinction explicitly.

Worktree is managed by the Execution Agent. SubAgents share the Execution Agent's git worktree. The Execution Agent coordinates write sequencing across parallel subtasks, so changes from concurrent SubAgents are applied in a controlled order rather than colliding.

Retries are built in. Every LLM call uses exponential backoff retry (up to 3 attempts, 2s initial delay). Transient API failures don't break long tasks.
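
For readers unfamiliar with the pattern, here is a minimal Python sketch of exponential backoff using the numbers above (3 attempts, 2s initial delay). The doubling factor, the call_with_retry name, and the injectable sleep parameter are assumptions made for illustration; Helix's own retry code may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with a doubling delay."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2  # assumed backoff factor
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits: list[float] = []  # record delays instead of actually sleeping
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)  # ok [2.0, 4.0]
```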


Get started

If you haven't enabled Manager Mode yet:

  1. Open Helix and create a new session
  2. Click the Manager entry on the workspace selector page
  3. Write your task with explicit scope boundaries
  4. Watch the Manager Board as subtasks execute in parallel

The first time a task that would have drifted stays clean — or the first time you see three SubAgents completing a week's worth of parallel work in minutes — is when the model clicks.


What's coming

We're continuing to improve Manager Mode:

  • Richer evidence pages — visual breakdowns of what each SubAgent did, with diff summaries and test results inline
  • Scope proposals UI — cleaner interface for reviewing and approving scope expansion requests
  • Workflow templates — pre-built task templates for common patterns (migration, feature build, test coverage)
  • Team visibility — share Manager Board views across a team, so everyone sees live task status

Questions, edge cases where it broke, tasks where it surprised you in a good way — send them our way. Manager Mode gets better from real workloads.