How I Built a Deterministic Multi-Agent Dev Pipeline Inside OpenClaw (and Contributed a Missing Piece to Lobster)

TL;DR: I needed a code → review → test pipeline with autonomous AI agents, where the orchestration is deterministic (no LLM deciding the flow). After two months exploring Copilot agent sessions, building my own wrapper (Protoagent), evaluating Ralph Orchestrator, and diving deep into OpenClaw’s internals, I found that Lobster (OpenClaw’s workflow engine) was the right foundation — except it lacked loops. So I contributed sub-workflow steps with loop support to Lobster, enabling fully deterministic multi-agent pipelines where LLMs do creative work and YAML workflows handle the plumbing. GitHub Copilot coding agent wrote 100% of the implementation.

Table of Contents

  • The Backstory: Two Months of Chasing Autonomous Dev Agents
  • The Problem
  • Attempt 1: Ralph Orchestrator
  • Attempt 2: OpenClaw Sub-Agents
  • Attempt 3: The Event Bus Architecture (Overengineered)
  • The Breakthrough: Reading the Docs More Carefully
  • Attempt 4: Skill-Driven Self-Orchestration
  • Attempt 5: Plugin Hooks as an Event Bus
  • The Solution: Lobster + Sub-Lobsters
  • The Architecture
  • What I Learned
  • Current Status
  • How This Was Built

The Backstory: Two Months of Chasing Autonomous Dev Agents

This didn’t start last weekend. It started two months ago when GitHub shipped the Copilot coding agent — the ability to assign a GitHub issue to @copilot and have it work autonomously in a GitHub Actions environment, pushing commits to a draft PR. The Agent Sessions view in VS Code acted as mission control for all your agents, local or cloud.

That planted the seed: if a cloud agent can work on one issue autonomously, what if you could chain multiple specialized agents into a pipeline? Programmer → reviewer → tester, all running in the background, all pushing to PRs.

Building Protoagent

The first thing I built was Protoagent — a multi-channel AI agent wrapper in TypeScript/Bun that bridges Claude SDK and GitHub Copilot CLI to Telegram and REST API. The idea was to control AI agents from my phone, using my own subscriptions, with no vendor lock-in. It supported multi-provider switching, voice messages via Whisper, session management, crash recovery, and a REST API for Siri/Apple Watch integration.

Protoagent solved the “talk to an agent from anywhere” problem, but not the orchestration problem. It was still one agent, one session, one task at a time. I needed the pipeline.

Discovering Ralph and OpenClaw

Around the same time, I found Ralph Orchestrator — an elegant pattern for autonomous agent loops with hard context resets. And then OpenClaw — which turned out to be a much more complete version of what I was trying to build with Protoagent: multi-channel, multi-agent, with a full tool ecosystem, skills marketplace, and a Gateway architecture.

OpenClaw made Protoagent redundant. But none of these tools solved the specific problem I was after.

The Problem

I wanted autonomous AI agents working as a dev team: a programmer, a reviewer, and a tester, running in parallel across multiple projects. The pipeline: code → review (max 3 iterations) → test → done. No human in the loop unless something breaks.

The requirements were clear:

  • Deterministic orchestration — a state machine controls flow, not an LLM deciding what to do next
  • Parallel execution — 4 projects × 3 roles = up to 12 concurrent agent sessions
  • Event-driven coordination — agents finish work and the next step triggers automatically
  • Full agent capabilities — each agent gets its own tools, memory, identity, and workspace

I spent a full day exploring options. This is the journey.

Attempt 1: Ralph Orchestrator

Ralph Orchestrator implements the “Ralph Wiggum technique” — an elegant pattern where you trade throughput for correctness by doing hard context resets between iterations. The agent has no memory except a session file (goal, plan, status, log), and each iteration starts fresh with only that file as context.

Ralph is solid, and it does support multiple parallel loops with Telegram routing (reply-to, @loop-id prefix). But for my use case it fell short:

  • Event detection is opaque. Ralph expects agents to emit events (like human.interact for blocking questions), but it’s unclear how to define custom events — say, code_complete or review_rejected — that would trigger transitions between different loops. The orchestration between agents (programmer finishes → reviewer starts) would require inventing the event emission and routing mechanism myself.
  • Limited channel connectivity. Ralph has basic Telegram integration for human-in-the-loop, but it’s not a multi-platform messaging gateway. I needed agents reachable from Telegram, WhatsApp, Discord, and potentially webhooks from CI systems.
  • No tool ecosystem. Each agent in my pipeline needs different tools — the programmer needs code execution and write access, the reviewer needs read-only access, the tester needs test runners. Ralph doesn’t have a plugin/skill/MCP management layer; you’d hardcode tool access per loop.
  • Agents aren’t fully customizable. No isolated workspaces, no per-agent identity or personality, no per-agent model selection (e.g., Opus for the programmer, Sonnet for the reviewer to save costs).

Ralph solved the “how to make one agent iterate reliably with hard context resets” problem beautifully. The session file pattern (goal, plan, status, log) is elegant. But I needed inter-agent coordination with event-driven transitions, not better intra-agent loops.

Attempt 2: OpenClaw Sub-Agents

OpenClaw is the open-source AI agent platform (150K+ GitHub stars) that connects to messaging platforms and runs locally with full tool access. It already had multi-agent support, so the obvious question was: can I use OpenClaw’s built-in sessions_spawn to create my pipeline?

Short answer: no. Here’s why.

sessions_spawn creates child agents within a parent session. The parent is an LLM that decides when to spawn children. This means:

  • Non-deterministic flow control. The LLM decides when the reviewer runs, when to retry, when to give up. That’s exactly what I wanted to avoid.
  • Auto-generated session IDs. Sub-agent sessions get keys like agent:<agentId>:subagent:<uuid>. I can’t address them by project name.
  • Spawn depth limits. maxSpawnDepth defaults to 1, max 2. An orchestrator pattern needs depth 2, and sub-agents at depth 2 can’t spawn further children.
  • Concurrency ceiling. maxConcurrent: 8 globally. With 4 projects × 3 roles, I’d hit the limit immediately.

The sub-agent model is designed for “main agent delegates subtask to helper” scenarios, not for peer-to-peer agent coordination with deterministic state machines.

Attempt 3: The Event Bus Architecture (Overengineered)

At this point I started sketching a custom architecture:

[Telegram] → [OpenClaw Gateway] ← WebSocket ← [External Orchestrator]
                    │                                    │
              [Agent Workspaces]                   State Machine
              - programmer/                        Redis Streams
              - reviewer/                          Worker Pool
              - tester/

The idea: use OpenClaw purely as I/O (messaging + agent execution), and build an external event bus with Redis Streams or NATS for routing, a state machine engine per project, and a worker spawner with pool control.

It would work. It would also be a massive amount of infrastructure for what should be a simple pipeline. I was reinventing half of what OpenClaw already does.

The Breakthrough: Reading the Docs More Carefully

Three OpenClaw features changed everything when I actually found them:

1. agentToAgent — Native Peer Messaging

Buried in the multi-agent docs:

{
  "tools": {
    "agentToAgent": {
      "enabled": true,
      "allow": ["programmer", "reviewer", "tester"]
    }
  }
}

When enabled, agents can send messages directly to other agents. Not sub-agents, not spawned children — peer agents with their own workspaces and identities.

2. sessions_send — Addressable Sessions

sessions_send(sessionKey, message, timeoutSeconds?)

An agent can send a message to any session key. Fire-and-forget with timeoutSeconds: 0, or synchronous (wait for the response). Combined with OpenClaw’s session key convention (agent:<agentId>:<key>), this means:

agent:programmer:project-a
agent:reviewer:project-a
agent:tester:project-b

The session key is the address. Agent + project as coordinates.

3. Webhooks with Session Routing

curl -X POST http://127.0.0.1:18789/hooks/agent \
  -H 'Authorization: Bearer SECRET' \
  -d '{
    "message": "Implement JWT auth",
    "agentId": "programmer",
    "sessionKey": "hook:project-a:programmer",
    "deliver": false
  }'

External triggers that route to specific agents and sessions. The deliver: false flag keeps everything internal — no Telegram notification until you explicitly want one.
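The same trigger can be fired programmatically — say, from a CI job. A minimal TypeScript sketch: the endpoint and payload fields come from the curl example above, but `buildHookRequest` and the `HOOK_SECRET` variable are illustrative names of my own, not OpenClaw APIs.

```typescript
// Build the request options for OpenClaw's /hooks/agent endpoint.
// buildHookRequest and HOOK_SECRET are illustrative, not part of OpenClaw.
function buildHookRequest(task: string, project: string, role: string) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HOOK_SECRET ?? "SECRET"}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      message: task,
      agentId: role,
      sessionKey: `hook:${project}:${role}`,
      deliver: false, // stay internal: no Telegram notification
    }),
  };
}

// Usage (not executed in this sketch):
// await fetch("http://127.0.0.1:18789/hooks/agent",
//   buildHookRequest("Implement JWT auth", "project-a", "programmer"));
```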

Attempt 4: Skill-Driven Self-Orchestration

With these primitives, I could have each agent carry a “pipeline skill” that tells it to use sessions_send to pass the baton:

# Pipeline Skill
When you finish coding, call sessions_send to notify the reviewer.
When you finish reviewing, call sessions_send to notify the tester or programmer.
Read the session history to know which iteration you're on.

This works, but the state machine lives inside the LLM’s head. It’s reading the skill, interpreting rules, and deciding what to do. If the LLM misinterprets the iteration count or forgets to call sessions_send, the pipeline breaks silently.

I wanted deterministic orchestration. The LLM does creative work (writing code, reviewing code, running tests). A machine does the routing.

Attempt 5: Plugin Hooks as an Event Bus

OpenClaw supports custom hooks — TypeScript handlers that fire on events like message_sent, tool_result_persist, etc. My idea:

  1. Each agent emits a structured event at the end of its response: [event:code_complete] {"project": "project-a"}
  2. A plugin hook intercepts the output, parses the event
  3. The hook looks up a subscriptions.json to find the next agent
  4. It calls POST /hooks/agent to trigger the next step

const handler: HookHandler = async (event) => {
  // Match a trailing "[event:<type>] {...json payload...}" marker.
  const match = event.context.lastMessage.match(/\[event:(\w+)\]\s*(\{.*\})/s);
  if (!match) return;

  const [, eventType, payload] = match;
  const data = JSON.parse(payload);
  const targets = subscriptions[eventType] ?? [];

  for (const target of targets) {
    await fetch("http://127.0.0.1:18789/hooks/agent", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        message: data.message,
        agentId: target.agentId,
        sessionKey: `hook:${data.project}:${target.role}`,
        deliver: false
      })
    });
  }
};

This was closer — deterministic routing, testable without LLMs, extensible via JSON config. But it required writing a custom plugin, maintaining subscription mappings, and handling iteration counting in the hook.
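For concreteness, this is roughly the shape I had in mind for the subscription mapping — hypothetical, since I never shipped this plugin; `Target` and `resolveTargets` are my own names:

```typescript
// Hypothetical subscriptions mapping: event type → target agents.
type Target = { agentId: string; role: string };

const subscriptions: Record<string, Target[]> = {
  code_complete:   [{ agentId: "reviewer", role: "reviewer" }],
  review_rejected: [{ agentId: "programmer", role: "programmer" }],
  review_approved: [{ agentId: "tester", role: "tester" }],
};

// Resolve targets for an emitted event; unknown events route nowhere.
function resolveTargets(eventType: string): Target[] {
  return subscriptions[eventType] ?? [];
}
```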

Then I found the real solution.

The Solution: Lobster + Sub-Lobsters

What is Lobster?

Lobster is OpenClaw’s built-in workflow engine. It’s a typed, local-first pipeline runtime with:

  • Deterministic execution — steps run sequentially, data flows as JSON between them
  • Approval gates — side effects pause until explicitly approved
  • Resume tokens — paused workflows can be continued later without re-running
  • One call instead of many — OpenClaw runs a single Lobster tool call and gets a structured result

The analogy: Lobster is to OpenClaw what GitHub Actions is to GitHub — a declarative pipeline spec that runs within the platform.

A Lobster workflow file looks like this:

name: email-triage
steps:
  - id: collect
    command: inbox list --json
  - id: categorize
    command: inbox categorize --json
    stdin: $collect.stdout
  - id: apply
    command: inbox apply --json
    stdin: $categorize.stdout
    approval: required

Lobster can call any OpenClaw tool via openclaw.invoke, including agent-send (to message other agents) and llm-task (for structured LLM calls with JSON schema validation).

The Missing Piece: Loops

My pipeline needs to loop the code→review cycle up to 3 times. Lobster’s step model was linear — no native loop construct.

So I built it.

Sub-Lobsters: Nested Workflows with Loops

I opened PR #20 on the Lobster repo, introducing sub-lobster steps — the ability to embed a .lobster file as a step, with optional loop support.

New fields on WorkflowStep:

  • lobster — path to a .lobster file to run as a sub-workflow
  • args — key/value map passed to the sub-workflow
  • loop.maxIterations — maximum number of iterations
  • loop.condition — shell command evaluated after each iteration; exit 0 = continue, non-zero = stop
The loop condition receives LOBSTER_LOOP_STDOUT, LOBSTER_LOOP_JSON, and LOBSTER_LOOP_ITERATION as environment variables, so you can inspect the sub-workflow’s output to decide whether to continue.
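The loop contract can be sketched in plain TypeScript. This is a simulation of the semantics as I understand them, not Lobster’s actual implementation; `runLoop`, `runOnce`, and `shouldContinue` are my own names:

```typescript
// Simulate sub-lobster loop semantics: run the sub-workflow up to
// maxIterations times; after each run, the condition decides whether
// to continue (true ≈ condition exited 0) based on that iteration's output.
type IterationResult = { stdout: string; json: unknown };

function runLoop(
  maxIterations: number,
  runOnce: (iteration: number) => IterationResult,
  shouldContinue: (result: IterationResult, iteration: number) => boolean,
): { iterations: number; last: IterationResult } {
  let last: IterationResult = { stdout: "", json: null };
  let i = 0;
  while (i < maxIterations) {
    i += 1;
    last = runOnce(i); // env: LOBSTER_LOOP_ITERATION = i
    if (!shouldContinue(last, i)) break; // condition exited non-zero → stop
  }
  return { iterations: i, last };
}

// Example: the review approves on iteration 2, so the loop stops early.
const result = runLoop(
  3,
  (i) => ({ stdout: "", json: { approved: i >= 2 } }),
  (r) => !(r.json as { approved: boolean }).approved, // continue while not approved
);
// result.iterations === 2
```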

The Final Pipeline

Main workflow (dev-pipeline.lobster):

name: dev-pipeline
args:
  project: { default: "project-a" }
  task: { default: "implement feature" }

steps:
  - id: code-review-loop
    lobster: ./code-review.lobster
    args:
      project: ${project}
      task: ${task}
    loop:
      maxIterations: 3
      condition: '! echo "$LOBSTER_LOOP_JSON" | jq -e ".approved" > /dev/null'

  - id: test
    command: >
      openclaw.invoke --tool agent-send --args-json '{
        "agentId": "tester",
        "message": "Test the approved code: $code-review-loop.stdout",
        "sessionKey": "pipeline:${project}:tester"
      }'
    condition: $code-review-loop.json.approved == true

  - id: notify
    command: >
      openclaw.invoke --tool message --action send --args-json '{
        "provider": "telegram",
        "to": "${chat_id}",
        "text": "✅ ${project}: pipeline complete"
      }'
    condition: $test.exitCode == 0

Sub-workflow (code-review.lobster):

name: code-review
args:
  project: {}
  task: {}

steps:
  - id: code
    command: >
      openclaw.invoke --tool agent-send --args-json '{
        "agentId": "programmer",
        "message": "${task}. Iteration $LOBSTER_LOOP_ITERATION.",
        "sessionKey": "pipeline:${project}:programmer"
      }'

  - id: review
    command: >
      openclaw.invoke --tool agent-send --args-json '{
        "agentId": "reviewer",
        "message": "Review this: $code.stdout",
        "sessionKey": "pipeline:${project}:reviewer"
      }'
    stdin: $code.stdout

  - id: parse
    command: >
      openclaw.invoke --tool llm-task --action json --args-json '{
        "prompt": "Did the review approve? Return approved (bool) and feedback (string).",
        "input": $review.json,
        "schema": {
          "type": "object",
          "properties": {
            "approved": {"type": "boolean"},
            "feedback": {"type": "string"}
          },
          "required": ["approved", "feedback"]
        }
      }'
    stdin: $review.stdout

Here’s what happens when someone sends “project-a: implement JWT” on Telegram:

  1. Lobster runs code-review.lobster as a sub-workflow
  2. The programmer agent writes code (full OpenClaw agent with tools, memory, identity)
  3. The reviewer agent reviews it (different agent, different workspace, potentially different model)
  4. llm-task parses the review into structured JSON: {approved: false, feedback: "..."}
  5. The loop condition checks $LOBSTER_LOOP_JSON.approved — if false and iteration < 3, go to step 2
  6. When approved (or max iterations reached), control returns to the parent workflow
  7. The tester agent runs tests
  8. Telegram notification sent

All deterministic. All inside OpenClaw. Zero external infrastructure.

The Architecture

Telegram
    │
    ▼
OpenClaw Gateway (:18789)
    │
    ├── Agents (isolated workspaces, tools, identity, models)
    │   ├── programmer/
    │   ├── reviewer/
    │   └── tester/
    │
    ├── Lobster (workflow engine)
    │   ├── dev-pipeline.lobster    (main: loop → test → notify)
    │   └── code-review.lobster     (sub: code → review → parse)
    │
    ├── llm-task plugin (structured JSON from LLM, schema-validated)
    │
    └── Webhooks (/hooks/agent)
        └── Trigger pipelines per project with isolated session keys

Each agent is a full OpenClaw agent:

  • Own workspace with AGENTS.md, SOUL.md
  • Own tools (programmer gets exec, write; reviewer gets read only; tester gets exec + test runners)
  • Own model (Opus for programmer, Sonnet for reviewer to save cost)
  • Own memory and session history

The LLMs do what LLMs are good at: writing code, analyzing code, running tests. Lobster does what code is good at: sequencing, counting, routing, retrying.

What I Learned

1. Don’t orchestrate with LLMs. Every time I tried to put flow control in a prompt (“when you’re done, send to the reviewer”), I introduced a failure mode. LLMs are unreliable routers. Use them for creative work, use code for plumbing.

2. Read the docs twice. I almost built an entire external event bus before discovering that OpenClaw already had agentToAgent, sessions_send, and webhooks with session routing. The primitives were there — I just hadn’t found them yet.

3. Contribute the missing piece instead of working around it. Lobster didn’t have loops. Instead of building a wrapper script or a plugin hook to simulate loops, I added loop support to Lobster itself. The sub-lobster PR is 129 lines of implementation + 186 lines of tests. It took less time than any of the workarounds would have.

4. Session keys are your data model. The pattern pipeline:<project>:<role> gives you project isolation, role separation, and addressability in one string. No database needed — the session key is the address.
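A key like pipeline:project-a:reviewer can be built and parsed with a few lines — `buildSessionKey` and `parseSessionKey` are illustrative helpers of mine, not OpenClaw APIs:

```typescript
// Build and parse session keys of the form pipeline:<project>:<role>.
function buildSessionKey(project: string, role: string): string {
  return `pipeline:${project}:${role}`;
}

function parseSessionKey(key: string): { project: string; role: string } | null {
  const [prefix, project, role] = key.split(":");
  if (prefix !== "pipeline" || !project || !role) return null;
  return { project, role };
}
```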

5. Typed pipelines beat prompt engineering for coordination. A YAML file with condition, loop, and stdin piping is infinitely more reliable than telling an LLM “if the review is negative, go back to step 2, but only up to 3 times.”

Current Status

  • PR #20 is open on the Lobster repo — sub-workflow steps with optional loop support
  • The architecture works end-to-end with OpenClaw’s existing multi-agent, webhooks, and Lobster tooling
  • Next step: production testing with real projects

If you’re building multi-agent systems, consider whether your orchestration layer needs to be an LLM at all. Sometimes the best agent architecture is one where the agents don’t know they’re being orchestrated.

How This Was Built

This article describes work that spanned about two months and involved several different tools and approaches.

Claude helped me think through the architecture options — bouncing ideas, evaluating trade-offs between approaches, and structuring the decision tree. It was a thinking partner for the design phase.

The exploration of OpenClaw’s internals was largely manual. Claude wasn’t able to fully parse OpenClaw’s documentation and source code to surface the key primitives I needed (agentToAgent, sessions_send, Lobster workflows, plugin hooks). I found those by reading the docs myself, tracing through the codebase, and connecting dots that weren’t obvious from search results alone. If you’re building on a fast-moving open-source project, there’s no substitute for reading the source.

GitHub Copilot coding agent wrote 100% of the Lobster fork code. I assigned the task, described what I wanted (sub-workflow steps with loop support), and Copilot worked autonomously in its cloud environment. My only involvement was code review on the PR. The irony isn’t lost on me: an autonomous coding agent built the loop primitive that enables autonomous coding agent pipelines.
