Helix Blog

HelixVM: Stop Letting AI Agents Run Naked on Your Real Machine

Mon, 11 May 2026 00:00:00 GMT

What's really blocking AI agent adoption isn't model capability — it's the conflict between permission and safety. HelixVM exists to let agents work efficiently without putting your real machine at risk.

The harder I think about AI agent products, the more I believe the central problem isn't "can the model write code?" — it's:

Do you actually trust it enough to hand over the keys?

Give it too little permission, and it has to stop and ask for approval on every step. Give it too much, and it can quietly delete files, corrupt your environment, or even break your system.

This isn't an abstract design question. To actually finish a task, an agent has to touch real things: read files, change code, install dependencies, run commands, start services, move directories, access local ports.

So we built HelixVM — not because the world needs another VM tool, but to answer this:

How do we let everyday users run a high-efficiency AI agent safely, without forcing them to learn virtualization, buy cloud resources, or tolerate constant approval prompts?

A quick naming note: internally you'll see project names like aiagent, agentui, and helix-vm. For end users we want the brand to be cleaner: Helix Agent, Helix, and HelixVM.

1. Aggressive prompting looks safe — but isn't really

The permission model in many agent products is essentially: every potentially risky action triggers a confirmation dialog and waits for the user to approve.

On paper, this is reasonable. Should you confirm a file read? A code change? Running a shell command? Installing a dependency? Deleting a file? The user should know — sure.

The problem is what happens once the agent actually starts working continuously. The prompts pile up fast, and the user quickly falls into a pattern:

Prompt, click approve
Prompt again, click approve
Prompt again, keep approving
Eventually stop reading at all and just rubber-stamp everything

At that point "user confirmation" stops meaning anything. It doesn't reduce risk, but it absolutely slows everything down.

Worse, this design quietly shifts the safety burden onto the user: "I asked you. You clicked approve." But can a normal user really judge whether each shell command, each file write, each dependency install is safe? Usually not.

Constant approval is not safety. It's mostly the UI of safety.

2. Full permission is genuinely dangerous

The other extreme isn't an answer either.

If you grant the agent everything in the name of efficiency and let it execute freely on the host, the experience is glorious: no interruptions, no approvals, automatic code changes, automatic commands, automatic dependency installs, automatic cleanup.

This is exactly the kind of high-efficiency experience an agent should deliver — except it's all happening on your real machine.

Which means it can: delete your actual project files, corrupt your system environment, pollute your global dependencies, break a previously working dev setup, or run a destructive command in the wrong directory.

These aren't theoretical. Helix's early users hit them. Users of other agents have hit them too.

So we're stuck in a very concrete tension:

Efficiency means not asking every time. Safety means not letting the agent run naked on the host.

Most existing solutions swing between the two extremes. HelixVM tries a third path.

3. Cloud containers and sandboxes — right direction, wrong fit for many users

For the last few years, every cloud vendor has been talking about containers, sandboxes, remote dev environments, isolated cloud execution. The direction is right. An agent running inside an isolated environment is safer than one running directly on the user's machine.

But cloud-based solutions usually have a hidden assumption:

Safety = move to the cloud = consume cloud resources = enter the vendor's infrastructure stack.

That's great business for cloud vendors. But for many individual users, indie devs, and small teams, it's not the most comfortable answer. It typically means extra budget, remote environments, network latency, cloud resource management, and migrating your workflow onto someone else's infrastructure.

A lot of users do care about safety — they just don't want to buy a cloud subscription, learn another console, and maintain a remote environment as a prerequisite. If safety requires "go cloud first," you've already excluded all the local-first users.

4. Traditional local VMs aren't the answer either

So why not just spin up a local VM? VMware, VirtualBox, Parallels, or roll your own Linux with QEMU.

From a pure isolation standpoint, sure. But from a product experience standpoint, traditional VMs are too heavy.

First, they demand a lot of resources. Traditional VMs eat memory, CPU, and disk. Most users don't want a permanent desktop VM running just so an agent can edit some code.

Second, they demand a lot of knowledge. You need to understand images, disk layouts, networking, port forwarding, shared folders, system bootstrap, dependency installation. That's a lot for a normal user — and even people with a CS background often won't bother learning virtualization just to use an agent.

Third, even if you know how, it's still annoying. Today everyone is spoiled by one-click products. You can't really ask users to install VM software, download images, configure networking, tune ports, and manually launch the agent — all in the name of safety.

If the cost of safety is significantly more complexity, most users will just abandon safety.

5. So HelixVM's goal: make local VM isolation feel invisible

HelixVM's idea is simple:

Run the agent inside a lightweight local VM, but don't make the user feel the weight of a traditional VM.

Users don't need to know what QEMU is. They don't prepare images by hand. They don't configure port forwarding. They don't SSH into the VM to set up services. They don't buy a cloud server.

The experience should feel more like:

Pick an environment image → click Create → wait for boot → land in a ready agent workspace.

The isolation happens underneath. The complexity stays out of the user's face.

A traditional VM is a tool for the user to operate. HelixVM is an isolated runtime for the agent to live in. What the user actually cares about isn't "how do I configure the VM?" — it's "can my agent safely and efficiently get this done?"

6. How it works in Helix: Helix + HelixVM + Helix Agent

HelixVM isn't a single feature — it's an experience produced by several layers of Helix working together. From the user's perspective: pick a VM workspace, pick an image, create, then enter a ready agent workspace. Underneath, three layers cooperate.

Layer 1: Helix — the user-facing experience

Helix is the product surface users actually see. It exposes the VM workspace entry, starts the local HelixVM control plane, lets the user pick an image, creates the VM, manages port mappings, waits for both VM and in-guest Helix Agent readiness, and finally drops the user into a usable workspace.

One thing worth emphasizing: HelixVM's control plane is local by default — you don't need to register a remote cloud console first.

Layer 2: HelixVM — the local VM control plane

HelixVM is where all the messy VM details get wrapped up. It handles things the user shouldn't have to: the VM registry, template and downloaded image management, parsing bundles / disk images, allocating control / SSH / business ports, generating low-level VM launch plans, starting and stopping VMs, checking guest readiness, and cleaning up residual processes.

In other words, the user doesn't deal with QEMU directly. QEMU is still doing the heavy lifting underneath, but it's hidden behind HelixVM.

Layer 3: Helix Agent — inside the guest

Once the VM is up, Helix's execution environment runs inside the guest, with Helix Agent at its core. It does the actual work: reading and writing the workspace, executing shell, running builds and tests, managing sessions, exposing the agent API, establishing trusted connections back to Helix.

What the user opens at the end isn't an abstract VM — it's a ready agent workspace.

7. Invisible pairing: no manual pairing codes

A lot of local / self-hosted agent systems have one painful step: pairing. To make the UI trust the local agent, you usually have to start a service, find a pairing code, paste it into the client, wait for the binding, and save credentials.

HelixVM does it more smoothly. When the VM is created, Helix generates a one-time bootstrap secret and hands it to HelixVM as part of the launch. HelixVM injects this bootstrap data into the guest startup parameters. Once Helix Agent starts inside the guest, it reads the secret and treats it as the first-time binding credential.

Helix then calls Helix Agent's pairing endpoint with that secret. Helix Agent checks that the secret matches, has not expired, and has not already been consumed. On success, Helix Agent enters a bound state and issues a long-lived credential.
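As a rough illustration of those three checks, here is a minimal sketch of a one-time secret. The class name, the five-minute lifetime, and the token sizes are assumptions made for the example, not Helix Agent's actual implementation.

```python
import hmac
import secrets
import time


class BootstrapSecret:
    """Hypothetical one-time pairing secret: valid once, and only before it expires."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        # The value that would be injected into the guest startup parameters.
        self.value = secrets.token_urlsafe(32)
        self.expires_at = time.monotonic() + ttl_seconds
        self.consumed = False

    def redeem(self, presented: str) -> str:
        """Return a long-lived credential, or raise if pairing must be refused."""
        if self.consumed:
            raise PermissionError("bootstrap secret already consumed")
        if time.monotonic() > self.expires_at:
            raise PermissionError("bootstrap secret expired")
        # Constant-time comparison avoids leaking how much of the secret matched.
        if not hmac.compare_digest(presented.encode(), self.value.encode()):
            raise PermissionError("bootstrap secret mismatch")
        self.consumed = True                  # enter the bound state
        return secrets.token_urlsafe(48)      # stand-in for the long-lived credential
```

In this sketch a wrong guess does not consume the secret; a stricter design could also burn it after a failed attempt.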

From the user's perspective, all they see is: the VM is ready, and the agent workspace is already connected.

8. Port forwarding + ready checks: making the in-VM agent feel local

The agent runs inside the VM. Helix runs on the host. They need to talk. HelixVM's default network mode is port forwarding, allocating control, SSH, and business service ports.
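To make the port-forwarding idea concrete, here is a small sketch of how a control plane might reserve free host ports and turn them into QEMU user-mode forwarding rules. The function names and the guest port numbers used in the test values are hypothetical; only the hostfwd rule syntax is QEMU's own, and nothing here claims to be how HelixVM builds its launch plan.

```python
import socket


def allocate_host_ports(names: list[str]) -> dict[str, int]:
    """Ask the OS for one free loopback TCP port per named service."""
    socks = []
    try:
        for _ in names:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.bind(("127.0.0.1", 0))   # port 0 means "pick any free port"
            socks.append(sock)
        # All sockets are still open here, so the ports are guaranteed distinct.
        return {name: s.getsockname()[1] for name, s in zip(names, socks)}
    finally:
        for sock in socks:
            sock.close()


def qemu_netdev_arg(forwards: dict[str, tuple[int, int]]) -> str:
    """Build the value for QEMU's -netdev option from (host_port, guest_port) pairs."""
    rules = ",".join(
        f"hostfwd=tcp:127.0.0.1:{host}-:{guest}" for host, guest in forwards.values()
    )
    return f"user,id=net0,{rules}"
```

A port picked this way can still be taken by another process before the VM starts, so a real launcher would need to retry on a bind failure.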

Helix then runs health checks to confirm the in-guest Helix Agent is actually responding — not just that a VM process exists.

This matters. "VM running" only means the VM process started; it does not mean the agent inside is ready. So Helix waits for two layers of readiness:

HelixVM reports guest control plane ready.
Helix Agent's health check inside the guest passes.
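The two-layer wait can be sketched as a loop over ordered probes that share one deadline. The probe callables are placeholders: in practice one would ask HelixVM about the guest control plane and the other would hit the in-guest health endpoint, and both of those calls are assumptions here.

```python
import time
from typing import Callable


def wait_until_ready(
    probes: list[tuple[str, Callable[[], bool]]],
    timeout: float = 60.0,
    interval: float = 0.5,
) -> None:
    """Block until every probe returns True, checking the layers in order."""
    deadline = time.monotonic() + timeout
    for name, probe in probes:
        # A later layer is only probed once the earlier one has reported ready.
        while not probe():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} not ready after {timeout}s")
            time.sleep(interval)
```

Calling it with a "guest control plane" probe followed by an "agent health" probe means a running VM process alone is never treated as a ready workspace.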

That's why a HelixVM workspace looks simple to create, but the underlying flow is more than "spawn a VM process."

9. Image marketplace: a safe environment shouldn't start from scratch

Handing the user a blank VM isn't enough. What they actually need is a usable dev environment, not a Linux system waiting to be configured.

So HelixVM ships with a template / image marketplace. Users can pick an image that fits what they're doing, for example:

Lightweight Linux + Helix Agent environment
Common dev toolchain environments
Browser-enabled automation environments
Environments tailored to specific languages or project types

This is where the experience really clicks. "Safe isolated environment" stops being an ops task and becomes a product choice: What am I working on today? Pick the matching image.

10. For the first time, safety and efficiency aren't opposites

Helix has always been focused on high-efficiency agents. We don't want users to be interrupted constantly while the agent is working — the whole point of an agent is to chain a complex task end to end: search code, edit files, run tests, analyze errors, fix, verify, summarize.

If every step needs user approval, you lose most of that automation value. But we also don't want the agent running unrestricted on the host.

So HelixVM's role is very clear:

Get safety from an isolated runtime, not from constant approval prompts.

With HelixVM, the agent can move much more freely inside the VM. Even if it makes a mistake, the blast radius stays inside the virtual machine — not on your real system.

11. HelixVM isn't a "VM feature" — it's infrastructure for agent products

The VM is just the mechanism. The real change is what it does to the agent's permission model.

Traditional agent products keep swinging between two bad options:

Option A: constant approvals. Looks safe, but it fatigues the user, kills efficiency, and turns approvals into a formality.
Option B: full permission. Efficient, but the agent is touching your host directly.

HelixVM offers a third one:

Option C: a highly permissive agent, inside an isolated environment. Inside the VM, the agent works fast. Outside the VM, the host still has a clear safety boundary.

This is the shape of AI agents I think actually fits everyday users: safe but not annoying; efficient but not naked on your machine.

Closing

HelixVM isn't trying to teach users how to run a VM. The opposite — we want users to get VM-level isolation without ever needing to understand a VM.

What we want to leave them with is a simple loop:

Pick an image. Click create. Let the agent work.

No cloud server. No VMware install. No virtualization crash course. No frantic approval clicking. And no agent running naked on your real machine.

We want AI agents to be more automated — but automation shouldn't cost you your real system. That's what HelixVM is for.

HelixVM is currently in private beta. If this resonates, come join the Helix beta and try it out.

Automatic Worktree: Stop Letting Agents Run Around on Your Main Branch

Thu, 02 Apr 2026 00:00:00 GMT

Every coding agent eventually writes to your repository.
The question is: what branch does it write to?

Most AI coding tools answer that question by making it your problem. Helix answers it at the system level, before the agent touches a single file.

This is the third boundary in Helix's multi-agent architecture. Manager Mode guards the boundary of intent. HelixVM guards the boundary of the host machine. Automatic Worktree guards the boundary that matters most to your code: the repository branch.

1. Writing directly to main: the part agent demos quietly skip

When an AI agent edits files, it needs somewhere to put them. The path of least resistance is the working directory the user opened — which is usually main itself.

This creates a class of problems nobody talks about in agent demos:

A task that dies halfway leaves main dirty. The agent started a refactor, got three files in, hit an error, and stopped. git status is a mess. The user has no clean way to tell what is safe to commit and what should be rolled back.
Concurrent tasks collide. Run multiple agents or sessions simultaneously, and they all write to the same checkout. File conflicts are unpredictable, hard to debug, and impossible to attribute — you cannot always tell which session dirtied which file.
Rollback is painful. The agent made changes the user did not want, but also made changes the user did want. They are tangled together in the same working copy, with no clean boundary to revert.
There is no "review before merge". The code is already on main. Review becomes retroactive acknowledgement instead of a preventive gate.

This problem is not unique to AI agents. It is a well-understood challenge in any parallel development workflow. Git's own answer is the worktree: a separate checkout of the repository on a separate branch in a separate directory, with changes kept isolated until they are deliberately merged.

The real question is: who creates and manages that worktree?

2. The industry's answer: opt-in worktree + manual merge

Several coding agent tools have added worktree support. The pattern is consistent:

User: enable worktree mode
Tool: ok, worktrees are now enabled
User: run task
Tool: creating worktree... done
User: review output
User: merge branch          ← manual step

There are two structural problems with this shape:

Worktrees are opt-in. Users have to know the feature exists and remember to turn it on. Forget — or decide a small task is not worth the ceremony — and the agent writes directly to the working copy again.

Merge is always manual. The tool creates the worktree and supervises the agent, but bringing changes back to main is the user's job. That is fine for a single task. Across five concurrent tasks, or a workflow with dozens of daily agent runs, the manual merge cost compounds into real friction.

The direction is right. But "opt-in worktree with manual merge" still leaves the default path — the unconfigured, don't-think-about-it path — pointing straight at main.

And most accidental main-branch pollution starts exactly there: "It's just a small task, I won't bother with a worktree."

3. Helix's approach: worktree as a system constraint

In Helix, worktree isolation is not a feature users enable. It is an architectural constraint built into how agents interact with repositories. The rules are enforced at the system level, not asked for in a prompt.

Helix puts three hard rules around worktrees:

No binding, no write. Neither the Execution Agent nor any SubAgent can write to a git-tracked file on the current branch until a worktree binding exists for that repository. This is not a warning — the system rejects the write outright.
Bindings are declared by the agent, not configured by the user. When the agent decides a task requires changes to a repository, it calls create_worktree_binding first. The system creates the worktree, generates an isolated branch, and returns the path the agent should work in — all of this happens before the agent touches a single file.
Merge is automatic. When the session completes, code review passes, and the changes are committed, the system automatically merges the worktree branch back to base, removes the worktree, and deletes the temporary branch. Users do not have to manage any of it.
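The first rule, "no binding, no write", can be pictured as a guard that sits in front of every file write. This is a sketch under assumptions: the function name, the exception, and the idea of passing in a set of known repository roots are all illustrative, and the real check may be structured quite differently.

```python
from pathlib import PurePosixPath


class WriteRejected(Exception):
    """Raised when a write targets a repository that has no worktree binding."""


def route_write(
    target: str,
    repo_roots: set[str],
    bindings: dict[str, str],
) -> PurePosixPath:
    """Return where a write may land, or raise if it would touch an unbound repo.

    repo_roots: repositories the session knows about (hypothetical input).
    bindings:   repo root -> worktree path, as recorded in session state.
    """
    path = PurePosixPath(target)
    for root in repo_roots:
        if path.is_relative_to(root):
            if root not in bindings:
                # Not a warning: the write is refused outright.
                raise WriteRejected(f"no worktree binding for {root}")
            # Same relative path, but inside the isolated worktree.
            return PurePosixPath(bindings[root]) / path.relative_to(root)
    return path  # outside any tracked repository: untouched by this rule
```

The point of the sketch is that the decision depends only on session state, not on anything the agent was told in a prompt.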

The end-to-end flow, as seen by a user, looks like this:

Agent: I need to write to /projects/my-repo
Agent: create_worktree_binding(project="/projects/my-repo", task="add auth middleware")
System: worktree created
        branch: aiagent/{session}/add-auth-middleware-a3f7c91d
        base:   main
Agent: [works entirely inside the worktree path]
Agent: [task complete, code review passed]
System: stage → commit → switch to main → pull → no-ff merge → remove worktree → delete branch

The agent never wrote to main. Main was not dirty for a single moment during the task. The merge landed as a clean, traceable commit.

Isolation is not a toggle. It is the system's default shape.

4. The session lifecycle: from binding to cleanup

Once a binding is in place, every write the agent performs is routed to the worktree path. The original checkout stays untouched.

In practical terms, create_worktree_binding does a few things:

Walks up from the given path to find the Git repository root (the user's opened repo)
Reads the current branch — that becomes the merge target
Generates a branch name from the session ID and a sanitized task description, shaped like aiagent/{session}/{task}-{hash}, so any future change can be traced back to the session that produced it
Creates the worktree in a dedicated directory outside the repository
Records the binding — {project_root} → {worktree_path, branch, base_branch} — in session state
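The steps above can be sketched with plain git commands. The slug rules, the choice of SHA-1 over session ID plus task as the hash input, and the helper names are assumptions; only the aiagent/{session}/{task}-{hash} shape comes from the description above.

```python
import hashlib
import re

# Queries a binding step would typically run first (standard git commands).
FIND_REPO_ROOT = ["git", "rev-parse", "--show-toplevel"]
CURRENT_BRANCH = ["git", "rev-parse", "--abbrev-ref", "HEAD"]


def worktree_branch_name(session_id: str, task: str) -> str:
    """Build a traceable branch name from the session ID and task description."""
    # Sanitize: lowercase, collapse anything non-alphanumeric into single dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", task.lower()).strip("-")[:40].strip("-") or "task"
    # Short hash keeps names unique when two tasks sanitize to the same slug.
    digest = hashlib.sha1(f"{session_id}:{task}".encode()).hexdigest()[:8]
    return f"aiagent/{session_id}/{slug}-{digest}"


def worktree_add_command(repo_root: str, worktree_path: str, branch: str) -> list[str]:
    """`git worktree add -b` creates the branch from HEAD and checks it out at the path."""
    return ["git", "-C", repo_root, "worktree", "add", "-b", branch, worktree_path]
```

Because the session ID is part of the name, `git branch --list 'aiagent/<session>/*'` is enough to find everything a given session created.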

When the session enters finalization, the system runs merge and cleanup in a fixed sequence:

Stage and commit any uncommitted changes left in the worktree
Switch back to the base branch and pull --ff-only to pick up remote updates first
Perform a non-fast-forward merge of the worktree branch into base — preserving an explicit merge commit
Remove the worktree directory
Delete the temporary branch
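Expressed as git commands, the sequence might look like the sketch below. The helper names and the injected runner are illustrative assumptions; the git subcommands and flags themselves are standard.

```python
from typing import Callable, Optional


def finalize_commands(
    repo_root: str, worktree_path: str, branch: str, base_branch: str, message: str
) -> list[list[str]]:
    """Return the merge-and-cleanup steps in the fixed order described above."""
    return [
        ["git", "-C", worktree_path, "add", "-A"],
        # Note: commit exits non-zero when there is nothing to commit,
        # so a real runner would have to treat that case as success.
        ["git", "-C", worktree_path, "commit", "-m", message],
        ["git", "-C", repo_root, "checkout", base_branch],
        ["git", "-C", repo_root, "pull", "--ff-only"],
        ["git", "-C", repo_root, "merge", "--no-ff", "--no-edit", branch],
        ["git", "-C", repo_root, "worktree", "remove", worktree_path],
        ["git", "-C", repo_root, "branch", "-d", branch],
    ]


def run_in_order(
    commands: list[list[str]], run: Callable[[list[str]], int]
) -> Optional[int]:
    """Run steps until one fails; return the failing index, or None on success."""
    for index, command in enumerate(commands):
        if run(command) != 0:
            return index  # later steps (including cleanup) are skipped
    return None
```

Stopping at the first failure matters: if the merge step fails, the worktree and branch removal never run, so the work is still there to inspect.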

The non-fast-forward merge is deliberate. Branch history stays intact in the git log, and every stretch of agent work shows up as a distinct merge node. Anyone reviewing, auditing, or trying to revert later has a clean boundary to operate on.

After all of that, main has one new merge commit. The session's intermediate state is gone. No half-staged files, no orphan branches, no leftover worktree directories.

5. Cross-repo sessions: one binding per repository

Real engineering tasks rarely fit inside a single repository. A gRPC migration may touch backend, frontend, and a shared library. An analytics event addition may need both an app and a tracking SDK update.

Helix accounts for this at the worktree layer itself: one session can hold multiple worktree bindings — one per repository.

The session state holds a map of project_root → {repo_path, worktree_path, branch, base_branch}. Every repository gets:

its own isolated branch
its own worktree directory
its own base branch (each repo's main might be named differently)

At finalization, the system processes each binding in turn: merge, then clean up. If a particular repository's merge fails, the error is surfaced explicitly and that repository's worktree is preserved for human inspection — but repositories that have already merged successfully are not dragged into the failure.
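A minimal sketch of that per-repository isolation follows. It assumes the loop keeps going after a failure and simply records it; whether Helix continues or stops at the first failed repository is not stated above, and the function and field names are hypothetical.

```python
from typing import Callable


def finalize_all(
    bindings: dict[str, dict],
    finalize_one: Callable[[dict], None],
) -> tuple[list[str], dict[str, str]]:
    """Finalize each repository independently and report results per repo.

    bindings:     project_root -> binding record from session state.
    finalize_one: merges and cleans up one binding, raising on failure.
    """
    merged: list[str] = []
    failed: dict[str, str] = {}
    for project_root, binding in bindings.items():
        try:
            finalize_one(binding)
            merged.append(project_root)
        except Exception as exc:
            # Surface the error and leave this repo's worktree in place;
            # repositories that already merged are not rolled back.
            failed[project_root] = str(exc)
    return merged, failed
```

The caller ends up with two explicit lists: repositories that now carry a merge commit, and repositories whose worktrees are waiting for a human.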

This is what makes cross-repo tasks actually tractable. A single session can span five repositories without dirtying any of them; at the end, each repository receives its own clean merge commit.

6. Working with Manager Mode: parallel SubAgents inside a shared boundary

The worktree system becomes significantly more powerful when combined with Manager Mode's parallel SubAgent execution.

In Manager Mode the Execution Agent can dispatch multiple SubAgents to run concurrently. Each SubAgent has its own context, its own tool calls, its own LLM interaction. Without worktree isolation, parallel SubAgents writing to the same repository would collide instantly.

With worktree isolation, the picture is different:

The worktree is created by the top-level Execution Agent before any SubAgent is dispatched.
All SubAgents work in the same worktree path — that path is the shared isolation boundary.
SubAgents cannot create their own worktree bindings. create_worktree_binding is filtered out of the SubAgent tool list at the system level.
They inherit the worktree context that was set up before they were dispatched, but they cannot change it.
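Filtering at the tool-list level, rather than in the prompt, might look something like this sketch. Apart from create_worktree_binding, the tool names are made up for illustration, and the frozen dataclass is just one way to model "inherit but cannot change".

```python
from dataclasses import dataclass

# Hypothetical tool list for the Execution Agent.
EXECUTION_TOOLS = ["read_file", "write_file", "run_shell", "create_worktree_binding"]

# Tools a SubAgent is never offered, regardless of what its prompt says.
SUBAGENT_DENIED = {"create_worktree_binding"}


@dataclass(frozen=True)
class WorktreeContext:
    """Read-only view of the binding a SubAgent inherits."""
    worktree_path: str
    branch: str
    base_branch: str


def subagent_tools(tools: list[str]) -> list[str]:
    """Drop denied tools before the list ever reaches the SubAgent."""
    return [tool for tool in tools if tool not in SUBAGENT_DENIED]
```

A tool that is absent from the list cannot be called, which is a stronger guarantee than asking the model not to call it.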

In other words, no matter how many SubAgents run in parallel — ten, twenty — the Execution Agent remains the single point of coordination for repository state. Manager Mode guards the boundary of intent; Automatic Worktree guards the boundary of physical writes. Together they make "a fleet of agents working in one repository without hurting each other" something the user no longer has to think about.

7. The code review gate: skip review, skip merge

Worktree finalization — merge plus cleanup — is gated on a passing code review. The session cannot complete and merge until that gate flips green.

This is not a prompt instruction asking the agent to review its work. It is a state machine check inside the session: a "review passed" flag must be set, or the finalize call refuses to proceed. The merge does not happen.

In practice, the workflow is forced into this exact order:

The agent decides it is done with the task
A code review is triggered
Review passes → the flag is set; review fails → the session continues working
Changes are summarized
The worktree branch is merged to main
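The gate itself is simple to sketch: finalize checks a flag and refuses to run without it. The class and method names below are assumptions for illustration, not Helix's session API.

```python
class ReviewNotPassed(Exception):
    """Raised when finalize is attempted without a passing review."""


class Session:
    """Hypothetical session with a review gate in front of the merge."""

    def __init__(self) -> None:
        self.review_passed = False
        self.merged = False

    def record_review(self, passed: bool) -> None:
        # A failed review leaves the flag unset, so the session keeps working.
        self.review_passed = passed

    def finalize(self) -> None:
        if not self.review_passed:
            # The merge is never attempted; the worktree stays where it is.
            raise ReviewNotPassed("code review has not passed; refusing to merge")
        self.merged = True  # stand-in for the real merge and cleanup
```

Because the check lives in finalize rather than in the agent's instructions, there is no wording the agent can use to talk its way past it.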

Skipping the review is not a way to merge faster. Skipping review means skipping merge, period. The two are tied together at the system level. If you want to bypass review, you have to give up the merge — and the worktree stays quietly in its isolated directory waiting for human attention.

This turns "review before merge" from a good practice into a path the agent cannot work around.

8. Failure and cleanup: an explicit destructive boundary

Worktree creation and merge can both partially fail: the directory exists but branch creation failed, or the session is interrupted before finalize runs. Helix splits these into two clearly separated paths:

Explicit failure during finalize. The error is surfaced as-is, the session is not marked complete, and the worktree and branch both stay intact. The user can inspect the state, fix things manually, and retry. This is the "nothing is broken, it just hasn't merged yet" path.

Abandoning without merge. When a session is being given up (the task was cancelled, or an error makes the changes unwanted), the system invokes an explicit "best-effort cleanup": remove the worktree directory, and force-delete the unmerged branch.

The force-delete is intentional. The normal git branch -d refuses to delete an unmerged branch, which is a protection in regular development. On the "discard agent work-in-progress" path, that protection becomes an obstacle. So Helix opts in to destructive deletion only on this specific path.

This path is explicit about what it is: it represents discarded work.
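In git terms, the abandon path might be sketched as two forced commands run on a best-effort basis. The helper names and the injected runner are illustrative; `git worktree remove --force` and `git branch -D` are the standard forced variants.

```python
from typing import Callable


def abandon_commands(repo_root: str, worktree_path: str, branch: str) -> list[list[str]]:
    """Destructive cleanup for work that is being discarded, not merged."""
    return [
        # --force removes the worktree even if it holds uncommitted changes.
        ["git", "-C", repo_root, "worktree", "remove", "--force", worktree_path],
        # -D deletes the branch even though it was never merged.
        ["git", "-C", repo_root, "branch", "-D", branch],
    ]


def run_best_effort(
    commands: list[list[str]], run: Callable[[list[str]], int]
) -> list[list[str]]:
    """Run every step even if an earlier one fails; return the steps that failed."""
    return [command for command in commands if run(command) != 0]
```

Best-effort here means a missing worktree directory does not stop the branch from being deleted.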

With both paths clearly separated, user expectations become predictable. Merge failure leaves things intact and recoverable. Abandonment leaves the repository clean and free of residue. The two never cross-contaminate.

9. Why automatic beats opt-in

The case for opt-in worktree is "give users control" — skip the worktree overhead on small tasks.

Turn that around: "small tasks" are exactly where most accidental main-branch pollution starts.

The task looked small. The user did not bother to set up a worktree. The agent did nine things the user expected and one thing the user did not. Now the user is untangling a mixed-up working copy with no clean boundary to revert.

Helix's position is that the overhead of worktree isolation is now low enough — creating a worktree is a fast git operation, cleanup is automatic, branch naming is automatic — that the tradeoff is worth making unconditionally. The constraint removes an entire category of repository-state problems from the user's mental load.

You don't enable worktree isolation. You don't remember which tasks to enable it for. It's simply how Helix works.

10. What users actually get

Translated into day-to-day experience, Helix's Automatic Worktree system means:

Main is never dirty from agent work. In-progress tasks always live in isolated branches and separate directories.
Parallel sessions don't interfere. Each session has its own worktree and its own branch. Five concurrent sessions, five completely independent working copies.
Merge stops being a manual step. When a task completes and clears review, the change lands on main as a proper merge commit.
History is clean and traceable. Every agent-driven change appears as a distinct merge commit. Branch names encode session ID and task description — any change traces back to the session that produced it.
Rollback is unambiguous. If a session produced changes you don't want, the merge commit itself is a clean revert target. No need to untangle half-finished file edits.
Cross-repo tasks are first-class citizens. One session gracefully handles changes across multiple repositories, each receiving its own clean merge commit.

11. How to use it

Worktree isolation is enabled by default in every Helix session. There is nothing to configure.

Open a session, run a task that writes to a repository, and the agent will set up the worktree before writing; when the task completes, merge and cleanup happen automatically. From the user's perspective, the experience is just "tell an AI coworker what to do, watch the result land on main" — every isolation step in between is invisible.

If you want to observe the behavior directly:

Open a Helix session in a workspace with a git repository
Run any task that writes to repository files
While it runs, peek at ~/.aiagent/worktree/ — you'll see the isolated working copy
When the task completes, the worktree is gone and the changes are on main as a merge commit

For complex tasks that span multiple repositories or need parallel execution, combine Automatic Worktree with Manager Mode to get the full benefit of multiple SubAgents working safely inside one isolation boundary.

12. What's coming

A few worktree workflow improvements are in flight:

Worktree inspection UI — view active worktrees, their branches, and pending changes directly from the session panel, without dropping to a terminal.
Selective merge — approve or reject individual commits from a session before they land on main.
Cross-session worktree sharing — let related sessions share a worktree boundary for coordinated multi-session work.
Conflict resolution tooling — a better UI for cases where automatic merge fails and human intervention is required.

One core principle is not going to change:

Agents work on isolated branches. Main only receives deliberate, reviewed merges.

This is one of the inevitable consequences of designing agents as an engineering system rather than as a chat box. Together with Manager Mode's goal-keeping and HelixVM's execution boundary, Automatic Worktree forms Helix's answer to a deceptively simple question: when an agent is genuinely doing the work for you, who is keeping its boundaries?


Manager Mode: How Helix Keeps AI Agents on Track for Real Engineering Tasks

Wed, 01 Apr 2026 00:00:00 GMT

Most AI agents fail not because they lack capability — they fail because they drift.

They start with your intent, pick up momentum, and end up doing twelve things you never asked for. Or they declare success before anything actually ships.

Manager Mode in Helix is the architectural answer to that problem. It is not a longer prompt or a cleverer system message. It is a real multi-agent orchestration layer implemented at the system level — one agent guards intent, one agent runs the work, and several SubAgents go deep in parallel, with strict separation of duties and mutual constraint.

1. The drift problem nobody talks about

You ask an AI to "refactor the authentication module." Three minutes later it has:

refactored authentication ✓
"improved" a bunch of unrelated utilities
changed the error handling convention across five files
added a new dependency it thought was "clearly better"
written a summary explaining why all of this was necessary

The core task might be done. But now you have a diff that touches forty files, your code review is a nightmare, and you have no idea what actually changed versus what the agent decided to change on its own.

This is scope drift. It happens because a single-agent system has no separation between understanding what was asked and executing what it thinks is needed. Those two things collapse into one thread with no guardrails.

There is a dual problem that is just as common — premature completion. The agent writes a tidy summary: "I've finished refactoring the authentication module with changes X, Y, Z." You go check the repository and find no commit in git log, no merge to main, no test run at all.

Drift and premature completion look like two different bugs. The root cause is the same: no independent role is responsible for "what did the user originally ask?" and "is this task actually done?".

Manager Mode solves this with a three-layer architecture where intent preservation, task execution, and parallel subtask handling are handled by separate, specialized agents.

2. What is Manager Mode

Manager Mode is Helix's orchestration layer for complex, multi-step engineering work.

When you enable it, your session gains a Manager Agent that sits between you and execution. The Manager does not write code. It does not run tools. Its job is to:

Receive your request and forward it to the Execution Agent — faithfully, without modification
Verify that what actually got done matches what you actually asked for
Enforce a strict definition of "done" that includes commit, merge, verification, and clean workspace
Refuse to call anything complete until every criterion is met and backed by evidence

The Execution Agent handles the real work, breaking tasks into subtasks and running them in parallel using SubAgents. But it always operates under the Manager's scope constraints.

Think of it as having a technical project manager and a senior engineer on every task, where the project manager's only job is to make sure the engineer doesn't go off-script. That role split has worked in real engineering teams for decades; Helix transplants it verbatim into the AI agent system.

3. Three-layer architecture

Here is how the three layers interact:

You submit a task via chat or UI.
Manager Agent locks the original intent as a baseline, forwards the task verbatim to the Execution Agent, and checks completion across six dimensions: implementation + commit + merge + verify + clean workspace + no scope creep. It demands evidence, not summaries.
Execution Agent is the layer that actually does the work: plans and implements, decomposes work, manages all tool calls (file edits, shell, LSP, MCP), and reports back with verifiable evidence.
SubAgent A / B / C are parallel execution units dispatched by the Execution Agent via run_subagent. Each one focuses on a single independent task and all of them share the Execution Agent's git worktree.

Because every SubAgent operates within the same working copy of the repository, the Execution Agent coordinates all file changes and, when the task is complete, commits and merges the result back to the main branch.

SubAgents cannot spawn their own SubAgents. This is intentional. Unbounded recursion in agent systems leads to unpredictable resource usage and hard-to-trace execution paths. The three-layer limit is enforced at the system level, not as a polite reminder in a prompt.
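
As a minimal sketch of what "enforced at the system level" could look like, the depth check below lives in code rather than in a prompt. The `Agent` class, `MAX_DEPTH`, and `spawn` are hypothetical names chosen for illustration; they are not Helix's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical layer indices: Manager = 0, Execution Agent = 1, SubAgent = 2.
MAX_DEPTH = 2


class DepthLimitError(RuntimeError):
    """Raised when an agent at the deepest layer tries to spawn a child."""


@dataclass
class Agent:
    name: str
    depth: int = 0
    children: list["Agent"] = field(default_factory=list)

    def spawn(self, name: str) -> "Agent":
        # The check is ordinary code, so no prompt wording can talk its way past it.
        if self.depth >= MAX_DEPTH:
            raise DepthLimitError(f"{self.name} (depth {self.depth}) cannot spawn agents")
        child = Agent(name=name, depth=self.depth + 1)
        self.children.append(child)
        return child
```

In this sketch a SubAgent that calls `spawn` gets an exception instead of a fourth layer, which keeps execution depth bounded regardless of what the model decides to try.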

The key design stance: no single agent simultaneously owns "defining the goal" and "executing the goal." The PM does not write code, and the engineer does not change the requirements. Real engineering teams call this "separation of duties." In AI agents, we call it Manager Mode.

4. Core mechanisms

4.1 Scope locking

When the Manager Agent receives your request, it records your original intent as a baseline. Every subsequent action by the Execution Agent is evaluated against this baseline.

The rules are strict:

"Improvements" that weren't requested → out of scope
Refactors that touch files not related to the task → out of scope
New dependencies added because the agent thought they were better → out of scope

The Manager maintains a mental model of what belongs to this task and what does not. If the Execution Agent tries to expand the scope — even with good justification — the Manager flags it and either rejects it or surfaces it to you as an explicit proposal.
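
One way to picture a baseline check is as a comparison between the files a task changed and the directories the request put in scope. The sketch below assumes the baseline can be reduced to a list of allowed directories; `out_of_scope` is a hypothetical helper, and a real scope judgment would also weigh intent, not only paths.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_files: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return the changed files that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    violations = []
    for path in changed_files:
        p = PurePosixPath(path)
        # A file is in scope if some allowed directory is one of its ancestors.
        if not any(a in p.parents for a in allowed):
            violations.append(path)
    return violations
```

Anything this returns would be a candidate for rejection or for an explicit proposal to the user, rather than something silently included.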

An often-overlooked detail: the Manager has no file tools and no shell access. It can only observe and direct. This "invisible hand" design is precisely what makes scope enforcement credible — the enforcer cannot be tempted to "just fix one more thing."

Why can't a prompt solve this? This is the first question many people ask about Manager Mode — if the Manager only "guards intent," why not just write "don't go out of scope" in the system prompt?

The answer: prompts have no enforcement. If the same agent both interprets what you want and decides what to do, it will eventually override "the user didn't ask for this" with "I think this change is better." Only when the enforcer and the executor are two separate instances, and the enforcer literally has no ability to act, does the constraint actually hold. That is an architectural decision, not something prompt engineering can patch over.

4.2 The five completion criteria

The Manager will not report a task as complete until all five of these are true — with verifiable evidence:

#	Criterion	What counts as evidence
1	Original requirement implemented correctly	Test output, output of relevant commands
2	Changes committed	`git log` showing the commit
3	Merged to main branch	`git log main` showing the merge
4	Main branch verified post-merge	Build/test run on main after merge
5	Workspace clean, no scope violations	`git status` clean, diff shows only expected files

"The agent said it's done" does not count. The Manager requires actual command output, tool results, or test runs. This eliminates the most common — and most insidious — failure mode: an agent that summarizes success without actually delivering it.

This standard can feel overly strict, until the first time the Execution Agent reports "done", the Manager runs git status, finds uncommitted changes still sitting in the worktree, and sends the task back. That is the moment you realize what the word "done" really means.
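
The evidence rule can be sketched as a gate that only counts a criterion when a command and its captured output are attached. `Evidence`, `CRITERIA`, and `missing_criteria` are hypothetical names. The sketch only checks that evidence is present, not that the output actually proves the claim, and it assumes commands that print something on success.

```python
from dataclasses import dataclass

# The five completion criteria, in the order the table lists them.
CRITERIA = (
    "requirement implemented",
    "changes committed",
    "merged to main",
    "main verified post-merge",
    "workspace clean, no scope violations",
)


@dataclass
class Evidence:
    criterion: str
    command: str  # the command that was run, e.g. "git status"
    output: str   # its captured output


def missing_criteria(evidence: list[Evidence]) -> list[str]:
    """Return the criteria that have no command output behind them."""
    # A bare claim (no command, or no output) proves nothing.
    proven = {e.criterion for e in evidence if e.command.strip() and e.output.strip()}
    return [c for c in CRITERIA if c not in proven]
```

A task would be reported complete only when `missing_criteria` returns an empty list. An entry such as "agent says merged" with no command attached leaves its criterion open.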

4.3 Parallel SubAgent execution

When the Execution Agent identifies independent subtasks, it dispatches them as SubAgents that run concurrently:

Execution Agent calls run_subagent("Fix auth module",  model="claude-sonnet-4-5")
Execution Agent calls run_subagent("Write tests",      model="claude-sonnet-4-5")
Execution Agent calls run_subagent("Update docs",      model="claude-haiku-4-5")
          ↓                              ↓                      ↓
    [runs in parallel]            [runs in parallel]     [runs in parallel]
          ↓                              ↓                      ↓
    SubAgent returns result       SubAgent returns result  SubAgent returns result
          ↓                              ↓                      ↓
                    Execution Agent collects all results
                              ↓
                    Runs verification, hands evidence to Manager

SubAgents share the Execution Agent's worktree and coordinate their file changes through the Execution Agent, which sequences writes and manages the final merge. This avoids the race condition of concurrent SubAgents stepping on each other's writes.
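
The dispatch-then-sequence pattern can be sketched with asyncio. The `run_subagent` below is a stand-in that only simulates work, and its signature is an assumption based on the calls shown above. `dispatch` is a hypothetical name.

```python
import asyncio


async def run_subagent(task: str, model: str) -> dict:
    """Stand-in for a real SubAgent call; it only simulates work."""
    await asyncio.sleep(0.01)
    return {"task": task, "model": model, "status": "ok"}


async def dispatch(tasks: list[tuple[str, str]]) -> list[dict]:
    # SubAgents run concurrently; gather returns results in submission order.
    results = await asyncio.gather(*(run_subagent(t, m) for t, m in tasks))
    applied = []
    for result in results:
        # Results are applied one at a time, so writes never overlap.
        applied.append(result)
    return applied
```

Calling `asyncio.run(dispatch([...]))` with the three tasks from the diagram returns their results in the order they were submitted, regardless of which SubAgent finished first.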

Parallelism is not only a performance win. It fundamentally changes the waiting experience — the original "ask, wait, ask again, wait again" serial rhythm is replaced by "set the full goal once, watch multiple workstreams converge."

And because the Manager Agent is guarding scope upstream, the "loss of control" risk that usually comes with parallelism is held in check. No matter how fast three SubAgents run, the Execution Agent still merges in order, and the Manager still verifies the whole delivery against the same standard. Faster, without losing control.

4.4 Context management under long tasks

Long-running tasks accumulate a lot of history. Helix uses two mechanisms to keep sessions healthy:

KV Caching: Large tool outputs (file reads, command results, search results) are cached so they don't need to be re-sent with every LLM request. The cache is transparent — you don't configure it, it just works.

Auto-compression: When conversation history grows beyond a threshold, Helix compresses older messages into a concise summary and moves the "active window" forward. The agent retains full context of what happened without paying the token cost of the full history.

Both mechanisms are invisible during normal use. They're what makes a 50-turn task feel as responsive as a 5-turn one.
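
A rough sketch of the auto-compression idea, assuming the threshold is a message count and the summary is produced elsewhere. `compact`, `max_messages`, and `keep_recent` are hypothetical names, and the placeholder string stands in for a model-written summary.

```python
def compact(history: list[str], max_messages: int, keep_recent: int) -> list[str]:
    """Fold older messages into one summary entry once history exceeds a threshold."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(history) <= max_messages:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    # Placeholder summary: a real system would ask a model to summarize `older`.
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent
```

The recent messages form the "active window"; everything before it costs one summary entry instead of its full token count.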

4.5 Session-level identity isolation

When you run multiple Manager sessions in parallel — one refactoring auth, one running a data migration, one building a new feature — Helix guarantees:

Each session has an independent Manager / Execution / SubAgent stack
Sessions do not bleed into each other — task A's scope baseline does not pollute task B
Switching workspaces also switches session state, model selection, and connection configs — you don't re-explain context

This isolation is a background mechanism. Users rarely notice it. But it is exactly what makes "leave Manager Mode running on multiple tasks and walk away" a safe thing to do.

5. How to use Manager Mode

5.1 Enabling it

On the workspace selector page, you will find a Manager entry alongside the Chat option. Click it to open a Manager session. The three-layer architecture is automatic — no extra configuration required.

5.2 Writing good task requests for Manager Mode

Manager Mode is most effective when your request is specific about scope boundaries. Compare:

Less effective:

Improve the login flow

More effective:

Refactor the login flow to use the new AuthService interface. Only touch files in src/auth/ and src/components/Login/. Don't change the API contracts.

The Manager uses your request as its scope baseline. The more precisely you describe what's in scope, the more precisely it can guard against drift.

You don't need to be exhaustive — the Manager can handle ambiguity. But explicit scope boundaries give it harder constraints to enforce.

A simple heuristic: if you would normally write a task spec or ticket before handing this work to another person, it's a Manager Mode task.

5.3 Handling scope expansion proposals

Sometimes the Execution Agent will identify something it thinks should be part of the task. The Manager will surface this to you as an explicit question rather than silently including it:

Execution Agent found an issue in the session middleware that may affect
the auth refactor. This was not in the original scope.

Expand scope to include middleware fix? [Yes / No / Defer]

Saying "No" or "Defer" keeps the current task clean. You can always start a new session for the follow-up work. This is Manager Mode's explicit boundary between focus and flexibility.
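
The three answers can be modeled as a small state transition on the scope baseline. This is a sketch under assumptions: `Decision`, `apply_decision`, and the `backlog` list are hypothetical, and whether Helix records deferred proposals anywhere is not stated in this post.

```python
from enum import Enum


class Decision(Enum):
    YES = "yes"
    NO = "no"
    DEFER = "defer"


def apply_decision(scope: set[str], proposal: str, decision: Decision,
                   backlog: list[str]) -> set[str]:
    """Return the scope after the user answers a scope-expansion proposal."""
    if decision is Decision.YES:
        return scope | {proposal}   # baseline grows only with explicit approval
    if decision is Decision.DEFER:
        backlog.append(proposal)    # assumed: remembered for a follow-up session
    return scope                    # NO and DEFER leave the baseline untouched
```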

5.4 Monitoring progress

While a Manager session is running, you can see in real time:

Which SubAgents are active and what they're working on
Token usage per agent
Tool calls in flight
What changes are currently pending in the worktree

All of this is visible in the session's live event stream. If you step away and come back, you can scroll the stream to see which SubAgents the Execution Agent dispatched, their individual results, and the Manager's per-criterion completion checks.

6. Real-world scenarios

Scenario 1: Multi-module API migration

The task: Migrate three service modules from REST to gRPC.

Without Manager Mode: You start a session, the agent begins migrating auth service, notices the user service uses a similar pattern, starts touching that too, then realizes the test fixtures need updates, then decides to refactor the error types "since we're here anyway." Two hours later you have a diff across eleven modules and a broken build.

With Manager Mode:

You submit:

Migrate auth-service, payment-service, and notification-service from REST to gRPC. Use the existing proto definitions in /proto/. Don't touch other services or shared utilities.

The Manager locks this scope. The Execution Agent dispatches three SubAgents — one per service — running in parallel:

SubAgent A: auth-service migration      [parallel]
SubAgent B: payment-service migration   [parallel]
SubAgent C: notification-service        [parallel]

Each SubAgent works independently. When all three complete, the Execution Agent runs the merge sequence and verifies the build. The Manager reviews the final diff, confirms it only touches the three specified services, then presents you with a commit hash, test results, and a clean git status.

Total scope: exactly what you asked for.

Scenario 2: Large codebase refactor with test coverage

The task: A legacy data model class (LegacyUserRecord) needs to be replaced with the new UserProfile type across a large codebase — 60+ files.

Without Manager Mode: A single-agent session will lose track of its own progress in long tasks. It might fix 40 files, think it's done, write a summary, and stop. Or it might fix 60 files but introduce subtle differences in how it handled edge cases across different parts of the codebase.

With Manager Mode:

The Execution Agent uses LSP tools to find all 63 references to LegacyUserRecord, groups them into logical clusters by module, and dispatches SubAgents for each cluster:

SubAgent A: core domain models (12 files)
SubAgent B: API layer (8 files)
SubAgent C: service layer (18 files)
SubAgent D: repository layer (14 files)
SubAgent E: test files (11 files)

Each cluster is internally consistent. When all SubAgents complete, the Execution Agent runs the full test suite. The Manager verifies:

All 63 references migrated (via grep -r LegacyUserRecord returning empty)
Tests pass
No unrelated files changed

If any SubAgent missed a reference or introduced a regression, the Manager identifies the gap and sends the Execution Agent back to fix specifically that issue — not restart everything. In long tasks this matters: one missing reference should never force a full redo.

Scenario 3: Parallel feature development with merge coordination

The task: Implement a new analytics dashboard that requires backend API endpoints, frontend components, and database migrations — all independent work streams.

The challenge: Three engineers would normally do this in parallel. With a single AI agent, it becomes a serial slog.

With Manager Mode:

You send:

Build the analytics dashboard feature. Backend: add /api/analytics/summary and /api/analytics/events endpoints in src/api/. Frontend: create AnalyticsDashboard component in src/components/. Database: add migration for analytics_events table. These are independent — parallelize them.

The Execution Agent dispatches three SubAgents simultaneously:

SubAgent A [backend]   → writes API endpoints, runs unit tests
SubAgent B [frontend]  → builds React component with mock data
SubAgent C [database]  → writes migration, tests locally

SubAgent A and SubAgent C finish first. SubAgent B finishes 40 seconds later. The Execution Agent then:

Collects and applies all three SubAgents' results in sequence
Runs integration tests that connect all three layers
Fixes one minor import path conflict from the merge
Verifies the full test suite passes

The Manager confirms: three independent workstreams, completed in roughly the time it would have taken to do one serially, with verified integration.

7. When to use Manager Mode

Manager Mode adds orchestration overhead. For quick, scoped tasks it's often more than you need. Here's a rough guide:

Task type	Recommended mode
Quick question, explanation, code snippet	Standard chat
Single-file edit or small bug fix	Standard or Coder mode
Multi-file refactor within one module	Coder mode
Cross-module refactor, feature spanning multiple layers	Manager Mode
Large migration (many files, parallel workstreams)	Manager Mode
Long-running task where you need to walk away	Manager Mode
Task where scope drift has burned you before	Manager Mode

The signal: if you'd normally write a task spec or ticket before handing it to another person, you probably want Manager Mode.

The reverse is also true: if you just want to ask "why is this code erroring?", the three-layer architecture is overkill — standard chat is enough. One mark of a good tool is knowing when not to use it.

8. What makes this work in practice

A few design decisions that make the system reliable rather than just theoretically sound:

The Manager never executes. It has no file tools, no shell access. It can only observe and direct. This separation is what makes scope enforcement credible — the enforcer can't be tempted to "just fix one more thing."

SubAgents are recursion-limited. SubAgents cannot spawn their own SubAgents. This is a hard system-level constraint, not a prompt instruction. It keeps execution depth predictable and prevents runaway branching.

Evidence is required, not requested. The Manager's completion check is not "did the agent say it's done?" It's "can I see the command output that proves it?" The Manager's prompt states that distinction explicitly, and the system makes it part of the completion judgment.

Worktree is managed by the Execution Agent. SubAgents share the Execution Agent's git worktree. The Execution Agent coordinates write sequencing across parallel subtasks, so changes from concurrent SubAgents are applied in a controlled order rather than colliding.

Retries are built in. Every LLM call uses exponential backoff retry (up to 3 attempts, 2s initial delay). Transient API failures don't break long tasks.

Sessions are isolated. Multiple Manager sessions don't bleed into each other — role baselines, scope memory, SubAgent state are all kept separate. That's what makes "run several tasks in parallel" a safe thing to do.
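
The retry policy mentioned above (up to 3 attempts, 2s initial delay) might look like the following sketch. The doubling factor and the choice to retry on any exception are assumptions; a production version would likely retry only on transient errors. `with_retry` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T], attempts: int = 3, initial_delay: float = 2.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call`, retrying failures with a delay that doubles each time."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise          # out of attempts: surface the last error
            sleep(delay)
            delay *= 2         # 2s, then 4s with the defaults
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; with the defaults, a call that fails twice and then succeeds waits 2s and then 4s before returning.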

9. Get started

If you haven't enabled Manager Mode yet:

Open Helix and create a new session
Click the Manager entry on the workspace selector page
Write your task with explicit scope boundaries
Watch the live event stream as subtasks execute in parallel

The first time a task that would have drifted stays clean — or the first time you see three SubAgents completing a week's worth of parallel work in minutes — is when the model clicks.

To understand the bigger picture of how Helix treats agents as an engineering system, read Introducing Helix.
Manager Mode runs on top of Automatic Worktree — the Execution Agent never modifies your main branch directly; all changes happen in an isolated branch first.
If your Manager Mode task is doing environment-heavy work, HelixVM turns the execution boundary into a security boundary as well.

10. What's coming

We're continuing to improve Manager Mode:

Richer evidence pages — visual breakdowns of what each SubAgent did, with diff summaries and test results inline
Scope proposals UI — cleaner interface for reviewing and approving scope expansion requests
Workflow templates — pre-built task templates for common patterns (migration, feature build, test coverage)
Team visibility — let collaborators see live task execution status too, so "what is the agent doing right now?" becomes a team-level signal rather than a single user's view

Questions, edge cases where it broke, tasks where it surprised you in a good way — send them our way. Manager Mode gets better from real workloads.

Switch Models Mid-Conversation: No Restarts, No Lost Context

Wed, 01 Apr 2026 00:00:00 GMT

Want a cheaper model? Paste your requirements again.
Want a stronger model? Re-explain the whole context.
That is what switching models looks like in most AI tools today.

Switching models mid-session sounds simple. In most systems, it actually means: start over.

You pick a model at the beginning of a conversation. You build context — twenty messages deep, a dozen tool calls, a pile of file reads. Then you realize the model is too slow, too expensive, or missing a capability you need. Your options: abandon the session and start fresh, or keep going with the wrong tool for the job.

Helix is not designed that way.

In Helix, the model is the session's current engine, not its identity. Users can change models at any point in a conversation — the same history, the same tool config, the same thinking context, just a different engine driving the next message.

This is the concrete product expression of the position stated in the Helix introduction: "the session is the unit of continuity; the model is just the current engine."

1. Committing to a model upfront is a structural waste

Every model has a different cost-capability tradeoff.

A model that's ideal for deep reasoning on a complex architecture problem is expensive for quick drafting work. A fast, cheap model that handles routine edits well falls short when you need multi-step reasoning across a large codebase.

Real work doesn't fit neatly into one category.

A coding session often starts with exploration — reading files, understanding structure, asking clarifying questions — and ends with implementation work that demands more capability. Or the reverse: start with a powerful model on the hard part, then switch to something faster for the follow-through.

The conventional approach forces users to make this choice once, at session start, with the least information they will ever have about what the task actually requires.

This isn't a limitation of the models themselves. It's the product form putting "pick the model" at the wrong moment.

2. Helix model switching: traditional way vs Helix way

In Helix, the model selector is always available in the chat toolbar. Users can change it at any point during a conversation.

The next message they send uses the new model — with full access to everything that happened before.

No reset. No re-explaining. No "let me catch you up."

The conversation history travels with the user across model changes. This is not a hand-off trick where the previous messages are summarized and passed along: the actual message history is transferred to the new model directly, so it has the same context depth as if it had been in the conversation from the start.

Helix exposes model switching at three different levels of granularity:

Default model — the account- or workspace-level default engine; new sessions inherit from here
In-session switch — replace the engine at any point mid-conversation; takes effect on the next message
Per-tool-call routing — certain specialized tools (lightweight prompts, code completion) can use a cheaper model than the main conversation, routed automatically by the Agent system

All three levels share the same underlying switching mechanism; they differ only in where the switch is triggered.

3. Three switching paths

Not all model switches are the same. Helix handles three distinct scenarios differently, based on what needs to change under the hood.

Same provider, same base URL (e.g., gpt-4o → gpt-4.1): the LLM session updates its model ID in place. The existing connection, tool configuration, and message history are untouched. This is nearly instantaneous.

Same provider family, different base URL (e.g., switching between two custom OpenAI-compatible endpoints): the session updates its model ID, base URL, and API key. No Runner rebuild required.

Cross-provider switch (e.g., GPT-4o → Claude Sonnet, or any model → CLI mode): a full Runner rebuild happens. The message history is extracted from the old Runner, sanitized, and loaded into the new one. This is the most interesting case — and the one worth understanding in detail.

At the engineering level, "light switches" and "heavy switches" travel different code paths. From the user's point of view, they are all the same single click in the toolbar. Hiding that complexity is part of what the product is for.
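
The three paths can be read as a decision on two fields: provider and base URL. The sketch below is illustrative only. `ModelRef`, `SwitchPath`, and `choose_path` are hypothetical names, and "provider family" is simplified to a plain provider string.

```python
from dataclasses import dataclass
from enum import Enum


class SwitchPath(Enum):
    NONE = "no switch"
    MODEL_ID_ONLY = "update model ID in place"
    ENDPOINT_UPDATE = "update model ID, base URL, and API key"
    RUNNER_REBUILD = "rebuild Runner and reload history"


@dataclass(frozen=True)
class ModelRef:
    provider: str
    base_url: str
    model_id: str


def choose_path(current: ModelRef, target: ModelRef) -> SwitchPath:
    if current == target:
        return SwitchPath.NONE
    if current.provider != target.provider:
        return SwitchPath.RUNNER_REBUILD    # the heavy path
    if current.base_url != target.base_url:
        return SwitchPath.ENDPOINT_UPDATE   # same family, different endpoint
    return SwitchPath.MODEL_ID_ONLY         # the lightest path
```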

4. What happens to your conversation history on a cross-provider switch

When you switch across providers, Helix performs a canonicalization pass on your message history before handing it to the new model.

Here's why that's necessary.

Different providers have subtly different requirements for what constitutes a valid message sequence. After a long session, your history may contain:

Empty assistant messages — left behind when a response was interrupted before content arrived
Orphaned tool calls — the assistant requested a tool but the result was never received (cancellation, network interruption, etc.)
Consecutive messages from the same role — an artifact of certain error recovery paths

Any one of these can cause the new provider's API to return a 400. The session would appear broken even though the underlying content is intact.

Canonicalization fixes this before the new model ever sees the history, in three passes:

Pass 1 — Remove empty assistant messages. content="" with no tool_calls triggers "non-empty content" errors. Strip them first.
Pass 2 — Merge consecutive user messages. After Pass 1 you may end up with two adjacent user messages. Identical ones are deduplicated; different ones are joined with \n.
Pass 3 — Trim unpaired tool calls. Scan the last 5 assistant tool-call messages, find any tool_call_id with no matching tool response, and truncate from that point.

The cleaned history is loaded into the new Runner. Tool config is restored (toolChoice: auto). If the new model supports extended thinking and it was enabled before, it is re-enabled — otherwise it is automatically disabled for that model.

Result: the new model sees a complete, clean conversation. Content is fully preserved. The provider-specific quirks from the old session are gone.
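
The three passes can be sketched over a list of message dictionaries. The message shape (`role`, `content`, `tool_calls`, `tool_call_id`) is an assumption modeled on common chat-completion formats, and `canonicalize` is a hypothetical name. "Truncate from that point" is interpreted here as dropping the unpaired tool-call message and everything after it.

```python
def canonicalize(messages: list[dict], scan_last: int = 5) -> list[dict]:
    # Pass 1: drop assistant messages with no content and no tool calls.
    kept = [m for m in messages
            if not (m["role"] == "assistant"
                    and not m.get("content") and not m.get("tool_calls"))]

    # Pass 2: merge adjacent user messages (skip identical ones, join others with "\n").
    merged: list[dict] = []
    for m in kept:
        prev = merged[-1] if merged else None
        if prev and prev["role"] == "user" and m["role"] == "user":
            if prev["content"] != m["content"]:
                merged[-1] = {**prev, "content": prev["content"] + "\n" + m["content"]}
            continue
        merged.append(m)

    # Pass 3: among the last `scan_last` assistant tool-call messages, find the
    # earliest one with an unanswered tool call id and truncate from there.
    answered = {m.get("tool_call_id") for m in merged if m["role"] == "tool"}
    call_idx = [i for i, m in enumerate(merged)
                if m["role"] == "assistant" and m.get("tool_calls")]
    for i in call_idx[-scan_last:]:
        if any(c["id"] not in answered for c in merged[i]["tool_calls"]):
            return merged[:i]
    return merged
```

Pass order matters in this sketch: removing an empty assistant message in Pass 1 is what can leave two user messages adjacent, which Pass 2 then merges.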

5. Three real-world scenarios

Scenario 1: Draft fast, refine with depth

You're writing a technical spec for a new API. The structure is straightforward — resource definitions, endpoint signatures, error codes. You want to get the draft out quickly without burning expensive reasoning capacity on scaffolding work.

You start with a fast, cost-efficient model. It handles the scaffolding well: proposes the initial endpoint list, drafts the request/response schema, sketches the error taxonomy. Thirty messages in, you have a solid skeleton.

Now the hard part: inconsistencies in the auth model, edge cases in the pagination design, questions about backward compatibility. This is where you want the strongest reasoning you can get.

You switch to your most capable model — right there, same session. It picks up the draft exactly where it is. You ask it to audit the auth design. It reads the full thirty-message history of decisions already made, flags two contradictions you hadn't noticed, and proposes a cleaner approach that's consistent with the patterns already established.

You didn't restart anything. You didn't paste the spec into a new window. The fast model did the work it was good at; the powerful model did the work it was good at. Total cost: a fraction of what it would have cost to run everything on the capable model from the start.

Scenario 2: Hit a capability wall mid-session, keep going

You're in a debugging session. A Go service is misbehaving under load — requests are stalling and you suspect a goroutine leak. You've been using a model with strong reasoning capability. Over the past fifteen messages it has traced the issue to a goroutine that's consuming from a message queue without a timeout.

Now you need to fix it: edit three files, run the test suite, check that the queue consumer behavior changes as expected. Your current model doesn't support tool calls.

You switch to a model with tool access. Same session, same history.

The new model can see the full diagnostic trail — the stack traces you explored, the hypothesis you validated, the exact files you identified. It doesn't need any re-explanation. It goes straight to the implementation, runs the tests, confirms the fix holds.

No re-diagnosis. No "can you summarize what we found?" The context is already there because the history is already there.

Scenario 3: Cost-aware multi-phase review

Your team has a batch of pull requests queued for AI-assisted review. Most are mechanical — check for common patterns, flag style violations, confirm test coverage. A few are genuinely complex — architecture decisions, security-sensitive changes, subtle logic in concurrent code.

You work through the batch in a single session. For the routine reviews, you stay on a fast, cost-effective model. It handles the pattern-matching well. When you hit a PR that touches the auth layer and the billing service simultaneously, you switch to your highest-capability model for that one.

Then switch back.

The session thread keeps the full record of every review, every flag, every comment. The model switches are invisible to anyone reading the session history — they just see a coherent thread of review work. The cost profile matches the actual complexity of each piece of work, not the worst-case complexity of any single piece.

6. Why the model is a parameter, not an identity

The design decision here is that the session, not the model, is the persistent entity.

Your conversation state lives in the session. The model is a parameter of how the next message gets processed. Switching the model doesn't change whose conversation this is — it's still the same conversation; only the next message gets handled by a different engine.

This means the model selector in Helix works differently from a provider switcher in other tools. You're not starting a "new conversation with Claude" — you're continuing the same conversation, but with a different engine processing the next message.

The WebSocket protocol reflects this. Every outbound message carries the current model ID. The backend checks it against the session's current model on each message and runs the appropriate switch path before sending to the LLM. There is no separate "switch model" API call. The switch and the message are one atomic operation.

Every message over WebSocket:
{
  "type": "message",
  "content": "...",
  "model": "builtin-anthropic:claude-sonnet-4-5",   ← current selection
  "req_id": "req_xxx"
}

Backend on receipt:
  if msg.Model != session.CurrentModel {
      runSwitchPath(session, msg.Model)   // one of the three paths above
  }
  // then process message with (possibly new) model

A side effect of this design: you can change models as often as you want. There is no accumulated penalty for switching back and forth. Every switch is evaluated fresh against the current state. There is no "the session gets weird after three switches" trap — because the switch itself carries no accumulated state.

It also means the session's continuity does not depend on "not changing models." Continuity comes from the message history stored on the session; the model is just the component consuming that history and producing the next response.

7. Get started

Model switching requires no configuration. The model selector is in the chat toolbar of every Helix session.

A few things worth knowing before you use it:

Switch any time. There is no right or wrong moment. The switch takes effect on the next message you send.
History is fully preserved. The new model sees everything that happened before it — not a summary, the actual history.
Tool configuration carries over. The new model gets the same tool access, provided it supports tool calls.
Thinking mode follows capability. If the new model supports extended thinking and you have it enabled, it continues. If it doesn't, it's disabled automatically for that model.
Switching is free. There's no cost to the switch itself — only to the messages you send after it.

The session is the conversation. The model is just whichever engine is running it right now.

Go deeper into Helix

Model switching is one concrete expression of a broader stance in Helix: the session is the persistent entity. The same thinking shows up in several larger places in the product.

Or just open Helix, grab a session that's currently stuck, and switch to a different model — see whether it picks up where the last one left off.

Introducing Helix — Not a Smarter AI Assistant, but an AI Teammate That Actually Delivers

Tue, 20 Jan 2026 00:00:00 GMT

Most AI coding assistants keep getting smarter at answering.
They've barely moved at delivering.
That's what Helix is built to change.

Over the past year, AI coding assistants have multiplied. Models got smarter, context windows grew longer, reasoning chains went deeper — and yet, for people actually running real engineering tasks through them, the lived experience didn't improve nearly as much.

They all follow a similar arc. The first session feels magical: it writes a function that almost looks production-ready. After a few weeks of real use, it starts to feel off. After a few more weeks, the pattern becomes clear: what these tools are really good at is answering beautifully, not getting things done.

Ask one to "refactor the auth module." It returns a polished, well-structured explanation: "Here's the new AuthService design…" — and then what? Editing the code, running the tests, fixing the build, committing, merging — that's still on the user.

It hands back an answer, not an outcome.

Real work needs outcomes. A cross-module refactor that actually runs to completion. A test suite that turns green. A commit that lands on main. A board where progress is visible to the team. None of that involves "saying it well."

That's the problem Helix was built to solve.

Helix isn't another AI assistant that answers questions — it's designed to be an AI teammate that actually delivers.

1. The ceiling of single-thread chat

Most AI coding tools are, at heart, just a smarter chat box.

The user talks, the model replies. Every piece of work gets compressed into the same conversational thread: understanding intent, exploring code, writing code, running tests, explaining errors, summarizing progress — all in sequence, all in one timeline.

The real problem isn't model capability. It's that the chat-box shape structurally can't carry complex work:

A task spans three modules; thirty turns in, the user has lost track of what the agent promised and what it skipped.
Long sessions get expensive and fragile: every new message replays the entire history through the model, so the cost of each message grows with the length of the session while quality often doesn't.
The user can't see what the agent is actually doing. It says "done" — open the IDE, and the work is incomplete, or wrong.
Tasks cross local and remote environments; every switch means re-explaining context.

None of this is the model's fault. It's the ceiling of "conversation" as a shape.

How do human engineering teams handle problems like this? They split the work. A PM keeps the goal. Engineers implement. Multiple people work on sub-tasks in parallel. A board makes progress visible.

AI agents should do the same.

2. Helix treats agents as a system, not a conversation

The core stance of Helix can be put in one sentence:

The session is the unit of continuity; the model is just the current engine. The task is the unit of delivery; the conversation is just the record of how it happened.

Once the frame shifts from "who am I talking to" to "how does this task ship," the rest follows naturally:

Tasks can be decomposed, so there should be Manager + Execution + SubAgent role separation.
Sub-tasks are independent, so they should run in parallel instead of serially queuing.
Long tasks inevitably accumulate context, so Cache + Compact should protect quality and contain cost.
Real work moves between local and remote, so a workspace should be an independent execution boundary, not tied to a specific model or chat.

The rest of this post unpacks each of those.

3. Three-layer multi-agent architecture: agents that constrain each other

The most damaging failure mode of single-agent systems is scope drift. The user says "refactor the auth module," and three minutes later the agent has also "optimized" five unrelated utilities, swapped in a new error-handling library, and pulled in a new dependency — then written a thorough summary explaining "why all of this was necessary."

The root cause is direct: a single-agent system never separates "understand what the user asked for" from "decide what to do." Both responsibilities get fused into the same loop, with no guardrail between them.

Helix's answer is to split those responsibilities across different agents, so they can constrain each other.

Manager Agent — guards the goal, never touches code

The Manager doesn't write code, doesn't run shell commands, doesn't call tools. Its single responsibility is this: make sure what gets delivered is what was originally asked for.

It locks the user's original request as a baseline. Every action by the execution layer is evaluated against that baseline for scope creep. And it requires evidence of completion — not "the agent said it's done," but the commit actually in git log, the working tree actually clean, the test output actually green.
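One way to picture that check is as two plain functions: one that compares changed files against the locked baseline, and one that accepts a delivery only when the evidence is observable. This is a minimal sketch, not Helix's actual implementation; the `Baseline`, `Evidence`, `out_of_scope`, and `accept_delivery` names, and the idea of expressing scope as path patterns, are assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Baseline:
    """The user's original request, frozen when the task starts."""
    request: str
    allowed_globs: tuple[str, ...]  # hypothetical: paths the task may touch


@dataclass(frozen=True)
class Evidence:
    """Facts observed after execution, not the agent's own claim."""
    commit_in_log: bool
    working_tree_clean: bool
    tests_green: bool


def out_of_scope(baseline: Baseline, changed_files: list[str]) -> list[str]:
    """Return the changed files that match none of the allowed patterns."""
    return [
        path for path in changed_files
        if not any(fnmatch(path, glob) for glob in baseline.allowed_globs)
    ]


def accept_delivery(baseline: Baseline, changed_files: list[str],
                    evidence: Evidence) -> bool:
    """Accept only if nothing drifted out of scope and every check passed."""
    drift = out_of_scope(baseline, changed_files)
    checks = (evidence.commit_in_log, evidence.working_tree_clean,
              evidence.tests_green)
    return not drift and all(checks)
```

With a baseline of `("src/auth/*", "tests/auth/*")`, a change to `src/utils/strings.py` is reported as drift, and the delivery is rejected even if every test passes.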

Execution Agent — drives execution, decomposes work

This is the layer that actually works. It reads code, writes code, calls tools, runs tests. But it doesn't try to be a hero — it actively breaks work into parallelizable pieces and dispatches them to SubAgents below.

SubAgent — goes deep, runs in parallel

Each SubAgent is an isolated execution context. It focuses on one thing inside its own view of the world, then reports results back to the Execution Agent.

The elegance of this architecture isn't "we have N agents." It's this:

No single agent is responsible for both defining the goal and executing it.

PMs don't write code. Engineers don't redefine requirements. For the first time, an AI system has role boundaries that resemble a real team.

For a deep dive into how the three-layer architecture defends scope in complex tasks, see Manager Mode.

4. Parallel scheduling: three tasks in the time of one

A chat box is a serial interface. Real work is often parallel.

Once the Execution Agent decomposes a task, Helix actually runs the sub-tasks concurrently — not just lists them, but dispatches and executes them at the same time.

Concrete example. The user says: "Migrate auth, payment, and notification services from REST to gRPC."

In traditional chat-based AI, this becomes:

user → model → auth migrated → user reviews → user says continue →
model → payment migrated → user reviews → user says continue → ...

In Helix it becomes:

Execution Agent
   ├─→ SubAgent A: auth migration         [parallel]
   ├─→ SubAgent B: payment migration      [parallel]
   └─→ SubAgent C: notification migration [parallel]
                ↓
   Execution Agent: collect → verify → deliver

Three independent pieces of work, finished in roughly the time of the slowest one plus a final merge and verification pass.
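The fan-out and collect shape in that diagram can be sketched with `asyncio`. This is an illustration under stated assumptions, not Helix's scheduler: `run_subagent` is a hypothetical stand-in that sleeps instead of calling a model or tools, and the result dictionary format is invented for the example.

```python
import asyncio


async def run_subagent(name: str, task: str) -> dict:
    """Stand-in for one SubAgent; a real one would call a model and tools."""
    await asyncio.sleep(0.01)  # placeholder for real work
    return {"agent": name, "task": task, "status": "done"}


async def execute(tasks: dict[str, str]) -> list[dict]:
    """Dispatch every sub-task at once, then collect and verify the results."""
    results = await asyncio.gather(
        *(run_subagent(name, task) for name, task in tasks.items())
    )
    # gather returns results in the order the awaitables were passed in,
    # so the outputs can be merged in a fixed sequence afterwards.
    failed = [r for r in results if r["status"] != "done"]
    if failed:
        raise RuntimeError(f"sub-tasks failed: {failed}")
    return list(results)


if __name__ == "__main__":
    delivered = asyncio.run(execute({
        "A": "auth migration",
        "B": "payment migration",
        "C": "notification migration",
    }))
    for result in delivered:
        print(result["agent"], result["task"], result["status"])
```

Because the three coroutines wait concurrently, the total wall time is close to that of the slowest sub-task rather than the sum of all three.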

Parallelism isn't just a speedup. It changes the waiting experience: the rhythm of "say something, wait, say something else, wait again" gets replaced by "state the full goal once, watch multiple tracks advance simultaneously."

More importantly — because the Manager Agent guards the upstream boundary, the "loss of control" usually associated with parallel agents is contained. SubAgents can run as fast as they want; the Execution Agent still merges their outputs in sequence, and the Manager still validates the integrated delivery against the original scope.

Faster, but the boundary holds.

5. Context management: a 50-turn task that feels like a 5-turn one

The most demoralizing thing about long sessions is that the deeper progress gets, the worse the experience becomes.

In the first few turns the AI is fast and accurate. After dozens of turns, the whole session feels heavy — every message reloads the entire history through the model, latency creeps up, quality drifts down, and the token bill starts to hurt.
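A little arithmetic shows why the bill grows faster than the turn count. The sketch below assumes a simplified model: every turn adds the same number of tokens and nothing is cached, so turn k resends k turns of history. The function name and the 1,000-token figure are illustrative, not measurements.

```python
def replayed_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when every turn resends all earlier turns."""
    # Turn k resends k turns' worth of history: 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


short_task = replayed_tokens(5, 1_000)   # 15_000
long_task = replayed_tokens(50, 1_000)   # 1_275_000
print(long_task // short_task)           # 85
```

Under those assumptions, ten times the turns costs 85 times the input tokens.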

Helix counters that curve with two mechanisms:

KV Caching. Large tool outputs — file reads, shell command results, search results — are cached. The agent knows "that history is sitting there, retrieve it when needed" instead of re-sending it verbatim with every model request.

Auto-compression. Once history exceeds a threshold, Helix compresses older sections into concise summaries and slides the active window forward. The agent still understands what happened; the user doesn't pay token cost for the entire transcript.

Both are enabled by default and invisible to the user. The goal isn't to look technically sophisticated — it's to make a 50-turn complex task feel as responsive as a 5-turn quick one.
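To make the two mechanisms concrete, here is a minimal sketch of each. It is not Helix's implementation: the `OutputCache` class, the `compact` function, the thresholds, and the reference format are all assumptions, and the `summarize` callback stands in for what would presumably be a model call.

```python
import hashlib

INLINE_LIMIT = 200  # hypothetical: outputs up to this many characters stay inline
KEEP_RECENT = 4     # hypothetical: number of recent turns left untouched


class OutputCache:
    """Keeps large tool outputs out of the prompt, addressable by a short key."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, output: str) -> str:
        """Return small outputs unchanged; replace large ones with a reference."""
        if len(output) <= INLINE_LIMIT:
            return output
        key = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
        self._store[key] = output
        return f"[cached:{key} {len(output)} chars]"

    def get(self, key: str) -> str:
        """Fetch the full output when the agent actually needs it."""
        return self._store[key]


def compact(history: list[str], summarize) -> list[str]:
    """Replace everything but the most recent turns with one summary entry."""
    if len(history) <= KEEP_RECENT:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(older)] + recent
```

A 1,000-character file read becomes a reference of about 30 characters in the prompt, and a 10-turn history compacts to one summary plus the last four turns.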

6. Multi-workspace: local and remote stop being two separate products

Many AI tools assume the user works in one place — a local repo, a cloud IDE, or a remote dev container. The moment work crosses that boundary, the experience breaks.

Helix treats a workspace as an independent execution boundary.

Within a workspace, the sessions running there, the code those sessions write, the tools they call, the ports they reach — all stay inside that workspace's boundary. The user can have a local repo, a remote dev environment, an ephemeral VM, and a Docker container open at the same time. They're independent, but all driven by the same agent system.

That means:

"An experimental task on the laptop" and "a real deployment task on the remote machine" can run at the same time.
Switching workspaces carries session state, model choice, and connection config along — no re-explaining.
When something goes wrong, the workspace is a natural isolation unit — one workspace blowing up doesn't affect the others.
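As a rough picture of what "execution boundary" could mean in code, the sketch below models a workspace as a record that carries its own state and can answer whether a path falls inside it. The `Workspace` class and its fields are hypothetical, and the check is purely lexical: it does not resolve symlinks, which a real boundary would need to handle.

```python
import posixpath
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Workspace:
    """One execution boundary with the state that travels with it."""
    name: str
    kind: str                # e.g. "local", "remote", "vm", "docker"
    root: PurePosixPath
    model: str               # model choice carried along on a switch
    sessions: list[str] = field(default_factory=list)

    def contains(self, path: str) -> bool:
        """True if path, after collapsing '..', sits under this root."""
        # join() keeps an absolute path as-is and anchors a relative one at root.
        joined = posixpath.join(str(self.root), path)
        candidate = PurePosixPath(posixpath.normpath(joined))
        return candidate == self.root or self.root in candidate.parents


laptop = Workspace("laptop", "local", PurePosixPath("/home/dev/app"), "model-a")
print(laptop.contains("src/auth.py"))         # True
print(laptop.contains("../secrets/key.pem"))  # False
```

Keeping the model choice and session list on the workspace record is what lets a switch carry state along without the user re-explaining anything.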

This idea extends further in HelixVM: when the workspace itself is a virtual machine, the agent's execution boundary becomes a safety boundary as well.

And at the source-control layer, Helix provides Automatic Worktree: the agent never touches the user's main branch directly. All changes happen on isolated branches, with code review gating the merge back to main.
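The underlying git mechanism is `git worktree add`, which creates a separate checkout on its own branch. The sketch below shows how a session might be bound to one. The `agent/<session_id>` branch naming and both helper functions are assumptions for illustration, not Helix's actual behavior.

```python
import subprocess
from pathlib import Path


def worktree_command(worktree_dir: Path, session_id: str) -> list[str]:
    """Build the git command that creates an isolated branch and checkout."""
    branch = f"agent/{session_id}"  # hypothetical naming scheme
    path = worktree_dir / session_id
    return ["git", "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, worktree_dir: Path, session_id: str) -> Path:
    """Run the command inside the repo; raises CalledProcessError on failure."""
    subprocess.run(worktree_command(worktree_dir, session_id),
                   cwd=repo, check=True)
    return worktree_dir / session_id
```

The agent then edits and commits only inside the returned directory, so the user's own checkout and main branch are left untouched until a review approves the merge.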

7. The commitments Helix makes to itself

By now, the way Helix talks about agents probably feels different from most AI products.

Helix doesn't lean on phrases like "smarter model," "longer context," or "deeper reasoning." The team has seen too many products where the model keeps getting stronger but the user experience barely moves — great demo videos, same old problems in real use.

Helix holds itself to a short, plain list:

Not a pretty demo — a real task that actually runs to completion. A task that "looks done" doesn't count. A commit on main with a green test run does.
Not a smooth conversation — a change that actually lands. What matters in the end is whether the commit exists in git log, not how nicely the chat read.
Not the quality of a single answer — the reliability of the whole delivery. Turn 47 should still behave like turn 1.
Not the model doing the work for the user — the agent system doing it. The model is the engine; the architecture surrounding it is what shapes the experience.

If someone just wants an AI that answers questions, there are plenty of options out there.

But for people who actually run engineering tasks through AI — the kind of 30-turn task that spans five files, needs parallelism, needs visibility, needs to keep going when something breaks — Helix is built for that.

8. What's coming next

Helix is iterating quickly. On the roadmap:

Richer evidence views: visualizing what each SubAgent did, which files it touched, which commands it ran.
Stronger workflow templates and a skill system: turning repeatable engineering patterns — migrations, test coverage, logging instrumentation — into reusable "playbooks."
Cross-platform desktop parity: bringing macOS, Windows, and Linux experiences to the same level.
Team collaboration paths: letting humans and agents co-operate on the same task flow.

One thing won't change:

An agent shouldn't be designed as a toy that talks. It should be designed as a teammate that delivers.

Want to try it?

Helix is currently open to beta users.

If the idea that "agents should be designed as an engineering system" resonates, take Helix for a spin on the project you know best, and don't pick an easy task. Pick the one that makes you hesitate to hand it to an AI again. That's the case Helix really wants to be tested against.
