Building SwiftDeploy: A Self-Writing Infrastructure Tool with OPA Policy Enforcement and Prometheus Observability


Introduction

What if your deployment tool could refuse to deploy when your disk is full? What if it could block a canary promotion when error rates spike — automatically, based on policy — without a single hardcoded if statement in the CLI?

That’s exactly what I built for Stage 4b of the HNG14 DevOps track. In this post I’ll walk through the full journey: from a manifest-driven deployment engine to a policy-enforced, fully observable stack with a live terminal dashboard and audit trail.

The Architecture at a Glance

manifest.yaml  (single source of truth)
      |
      v
swiftdeploy CLI
      |
      +-- Jinja2 templates --> docker-compose.yml + nginx.conf
      |
      +-- OPA policy check --> allow / block + reason
      |
      v
Docker Compose Stack
  ├── app (FastAPI + /metrics)
  ├── nginx (public ingress on swiftdeploy-net)
  └── opa (isolated on opa-internal, queried via docker exec)

The core principle: manifest.yaml is the only file a human ever edits. Everything else — config files, policy decisions, audit reports — is generated.
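
To make that concrete, here is a minimal sketch of what such a manifest could look like. The field names (image, port, mode, network) come from the description in this post; the exact layout is illustrative, not the real file:

# Illustrative sketch only -- the real manifest.yaml may differ
app:
  image: swiftdeploy-app:latest
  port: 8081
  mode: canary            # stable or canary
network:
  name: swiftdeploy-net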

Stage 4a Recap: The Engine

In Stage 4a I built the foundation:

  • A manifest.yaml that describes the entire stack (image, port, mode, network)
  • A Python CLI (swiftdeploy) that reads the manifest and renders Jinja2 templates into docker-compose.yml and nginx.conf
  • Subcommands: init, validate, deploy, promote, teardown
  • A FastAPI service with /, /healthz, and /chaos endpoints
  • Canary/stable mode switching via promote

The key insight: the CLI never writes config by hand — it always renders from templates. Change one field in manifest.yaml, re-run init, and the entire stack config regenerates consistently.
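
As a rough sketch of that render step (file names, template names, and structure here are assumptions, not the actual SwiftDeploy source), init essentially loads the manifest and pushes it through Jinja2:

# Sketch of the init render step -- names and paths are assumptions
import yaml
from jinja2 import Environment, FileSystemLoader

def render_configs(manifest_path="manifest.yaml", template_dir="templates"):
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    env = Environment(loader=FileSystemLoader(template_dir))
    for template_name, output_path in [
        ("docker-compose.yml.j2", "docker-compose.yml"),
        ("nginx.conf.j2", "nginx.conf"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_path, "w") as out:
            out.write(rendered)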

Stage 4b: The Eyes and the Brain

Stage 4b adds three major capabilities:

  1. The Eyes — Prometheus /metrics endpoint
  2. The Brain — OPA policy sidecar enforcing deploy/promote gates
  3. The Memory — audit trail and report generation

1. Instrumentation: The /metrics Endpoint

The FastAPI service now exposes a /metrics endpoint in Prometheus text format. I implemented the metrics collector entirely in Python without any external library — just a middleware that intercepts every request and records it.

import time

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    dur = time.time() - start
    # Don't let scrapes of /metrics inflate the request counters
    if request.url.path != "/metrics":
        # record_request() updates the in-process counters and histogram
        record_request(request.method, request.url.path,
                       response.status_code, dur)
    return response

Five metrics are exposed:

Metric                          Type       Description
------------------------------  ---------  -------------------------------------
http_requests_total             counter    Requests by method, path, status_code
http_request_duration_seconds   histogram  Latency with 11 standard buckets
app_uptime_seconds              gauge      Seconds since process start
app_mode                        gauge      0 = stable, 1 = canary
chaos_active                    gauge      0 = none, 1 = slow, 2 = error

The histogram uses standard Prometheus buckets (0.005s through 10s) so P99 latency can be calculated from bucket counts — no extra libraries needed.
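
For reference, here is roughly how a P99 estimate falls out of cumulative bucket counts. This is an illustrative sketch, not the exact code in the CLI:

def p99_from_buckets(buckets, total):
    """Estimate P99 latency from cumulative Prometheus histogram buckets.

    `buckets` maps the `le` label (upper bound in seconds) to the
    cumulative count; `total` is the overall observation count.
    Illustrative sketch, not the SwiftDeploy implementation.
    """
    target = 0.99 * total
    for le in sorted(buckets, key=float):
        if buckets[le] >= target:
            return float(le)          # first bucket whose count covers P99
    return float("inf")               # fell into the +Inf bucket

# Example: 1000 requests, 995 of them completed within 10 ms
buckets = {"0.005": 900, "0.01": 995, "0.025": 999, "+Inf": 1000}
print(p99_from_buckets(buckets, 1000))  # -> 0.01 (i.e. 10 ms)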

2. The Policy Sidecar: OPA

Why OPA?

The spec had a critical requirement: the CLI must not make any allow/deny decision itself. All decision logic lives exclusively in OPA. This is the separation of concerns that makes the system auditable and extensible — you can change policy without touching the CLI.

Isolation Architecture

OPA runs as a sidecar in Docker Compose but on a completely separate network from nginx:

networks:
  swiftdeploy-net:    # nginx + app live here
    driver: bridge
  opa-internal:       # OPA lives here, isolated
    driver: bridge

services:
  nginx:
    networks: [swiftdeploy-net]   # can NOT reach OPA
  opa:
    networks: [opa-internal]      # can NOT be reached via nginx

This means there is zero path from the public port 8081 to the OPA API. The spec's "no leakage" requirement is satisfied architecturally, not just by configuration.

Domain-Isolated Policies

I wrote two completely independent Rego policies, each owning exactly one domain:

policies/infrastructure.rego — answers: Is this host safe to deploy onto?

package swiftdeploy.infrastructure

default allow := false

allow if { count(violations) == 0 }

violations contains msg if {
    input.disk_free_gb < data.infrastructure.min_disk_free_gb
    msg := sprintf("Disk free (%.1f GB) is below minimum threshold (%.1f GB)",
                   [input.disk_free_gb, data.infrastructure.min_disk_free_gb])
}

violations contains msg if {
    input.cpu_load > data.infrastructure.max_cpu_load
    msg := sprintf("CPU load (%.2f) exceeds maximum threshold (%.2f)",
                   [input.cpu_load, data.infrastructure.max_cpu_load])
}

policies/canary.rego — answers: Is the canary safe to promote?

package swiftdeploy.canary

default allow := false

allow if { count(violations) == 0 }

violations contains msg if {
    input.error_rate_percent > data.canary.max_error_rate_percent
    msg := sprintf("Error rate (%.2f%%) exceeds maximum threshold (%.2f%%)",
                   [input.error_rate_percent, data.canary.max_error_rate_percent])
}

Crucially, all threshold values live in policies/data.json — not in the Rego files:

{
  "infrastructure": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0
  },
  "canary": {
    "max_error_rate_percent": 1.0,
    "max_p99_latency_ms": 500
  }
}

To change the disk threshold from 10GB to 20GB, you edit only data.json. The Rego files never need to change. This is the single source of truth for policy thresholds.

OPA Never Returns a Bare Boolean

Every OPA decision carries the reasoning behind it:

{
  "allow": false,
  "violations": [
    "Error rate (46.94%) exceeds maximum threshold (1.00%) over the observation window"
  ]
}

The CLI surfaces this directly to the operator — no cryptic error codes, just a plain English explanation of why deployment was blocked.
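
On the CLI side, surfacing that decision then takes only a few lines; a minimal sketch (function name and output format assumed, not the real implementation):

def report_decision(decision: dict, domain: str) -> bool:
    """Print an OPA decision in plain English. Sketch only, names assumed."""
    if decision.get("allow"):
        print(f"+ [OPA/{domain}] Policy passed, proceeding")
        return True
    print(f"x [OPA/{domain}] Policy FAILED, blocked")
    for violation in decision.get("violations", []):
        print(f"    - {violation}")
    return False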

3. The CLI: Gated Lifecycle

Pre-Deploy Check

Before bringing up the stack, swiftdeploy deploy collects host stats and sends them to OPA:

> swiftdeploy deploy
  Checking infrastructure policy...
  > Host -> disk: 328.6 GB | CPU: 0.20 | mem free: 37.4%
  + [OPA/INFRASTRUCTURE] Policy passed — proceeding
  > Bringing up the stack...
  + Stack healthy -> http://localhost:8081

If free disk space on the host dropped below 10 GB, the output would instead show:

  x [OPA/INFRASTRUCTURE] Policy FAILED — blocked
              - Disk free (3.2 GB) is below minimum threshold (10.0 GB)
  x Deployment blocked by policy.

Pre-Promote Check (The Chaos Test)

This is where it gets interesting. Before promoting a canary to stable, the CLI scrapes /metrics, calculates error rate and P99 latency, and sends them to OPA.
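
The error-rate side of that calculation is just arithmetic over the http_requests_total counters. A simplified sketch, assuming 5xx responses count as errors (the real parsing is more involved):

import urllib.request

def canary_error_rate(metrics_url="http://localhost:8081/metrics"):
    """Rough sketch: derive error rate (%) from http_requests_total lines."""
    text = urllib.request.urlopen(metrics_url, timeout=5).read().decode()
    total = errors = 0.0
    for line in text.splitlines():
        if line.startswith("http_requests_total{"):
            labels, value = line.rsplit(" ", 1)
            total += float(value)
            if 'status_code="5' in labels:   # treat 5xx responses as errors
                errors += float(value)
    return 100.0 * errors / total if total else 0.0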

I injected an 80% error rate using the chaos endpoint:

Invoke-RestMethod -Method Post -Uri http://localhost:8081/chaos `
  -ContentType "application/json" `
  -Body '{"mode":"error","rate":0.8}'

Then tried to promote:

> swiftdeploy promote stable
  Checking canary health policy...
  > Canary -> error rate: 46.94% | P99: 10 ms
  x [OPA/CANARY] Policy FAILED — blocked
              - Error rate (46.94%) exceeds maximum threshold (1.00%) over the observation window
  x Promotion blocked — canary is not healthy enough.

The canary policy gate caught a 47x threshold breach and blocked the promotion. This is exactly the kind of automated safety net that prevents bad canaries from reaching production.

4. The Status Dashboard

swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:

------------------------------------------------------------
  SwiftDeploy Status              2026-05-07 09:54:00
------------------------------------------------------------

  Mode: canary   Chaos: none   Uptime: 3420s

  Metric                           Value
  --------------------------------------------
  Throughput (req/s)               2.40
  Error Rate                       0.00%
  P99 Latency                      10 ms

  Policy Compliance
  --------------------------------------------
  [+]  Infra: Disk >= 10 GB
  [+]  Infra: CPU load <= 2.0
  [+]  Infra: Mem free >= 10%
  [+]  Canary: Error rate <= 1%
  [+]  Canary: P99 latency <= 500ms

  Refreshing every 5s — Ctrl+C to exit

Every scrape is appended to history.jsonl — a newline-delimited JSON file that forms the audit trail.
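
The shape of each record is deliberately simple. A hypothetical append helper and scrape record might look like this (field names are illustrative, not the actual schema):

import json
import time

def append_history(event: dict, path="history.jsonl"):
    """Append one event as a single JSON line. Sketch only, schema assumed."""
    event.setdefault("timestamp", time.strftime("%Y-%m-%dT%H:%M:%S"))
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Hypothetical scrape record
append_history({
    "event": "scrape",
    "mode": "canary",
    "error_rate_percent": 0.0,
    "p99_latency_ms": 10,
    "throughput_rps": 2.4,
})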

5. The Audit Report

swiftdeploy audit parses history.jsonl and generates audit_report.md with four sections:

  • Timeline — every deploy, promote, teardown, and policy check with timestamps
  • Mode Changes — when the stack switched between stable and canary
  • Policy Violations — every time a check failed, with the full violation message
  • Metrics Summary — min/max/avg of error rate, P99 latency, and throughput

The report renders perfectly as GitHub Flavored Markdown.

The Windows Challenge: OPA Port Binding

This section is for anyone running Docker Desktop on Windows — I hit a wall that took significant debugging to solve.

The problem: OPA’s port 8181 was correctly configured in docker-compose.yml as "0.0.0.0:8181:8181", and docker inspect confirmed the binding was set. But netstat showed nothing listening on 8181, and curl http://localhost:8181/health failed with connection refused.

This is a known Docker Desktop + WSL2 bug where port forwarding from WSL2 containers to the Windows host is unreliable for certain port ranges.

The solution: Instead of querying OPA via HTTP from the host, I switched to docker exec with the OPA CLI directly inside the container:

import subprocess

# opa_container: name of the OPA service's container
# opa_path: the Rego query to evaluate, e.g. "data.swiftdeploy.infrastructure"
# input_json: collected host or canary stats, serialized as a JSON string
cmd = (
    f'docker exec -i {opa_container} opa eval '
    f'--data /policies '
    f'--stdin-input '
    f'--format json '
    f'"{opa_path}"'
)
r = subprocess.run(cmd, shell=True, input=input_json,
                   capture_output=True, text=True, timeout=10)
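
The JSON that opa eval prints wraps the policy result in a result/expressions envelope, so the decision has to be unwrapped before use. Continuing from the r returned above, a rough sketch (handling of the other failure modes omitted):

import json

# Unwrap opa eval's envelope: result -> expressions -> value
output = json.loads(r.stdout)
result = output.get("result", [])
if not result:
    raise RuntimeError("OPA returned an undefined result")
decision = result[0]["expressions"][0]["value"]
print(decision["allow"], decision.get("violations", []))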

This approach:

  • Bypasses host port binding entirely
  • Works identically on Linux, Mac, and Windows
  • Is actually more reliable — no network stack involved at all
  • Satisfies the isolation requirement (OPA is still on its own network, nginx can’t reach it)

The lesson: when Docker networking misbehaves on Windows, docker exec is your escape hatch.

Lessons Learned

1. Separation of concerns is worth the complexity. Having OPA own all policy decisions and the CLI own only orchestration made both parts easier to test and reason about independently.

2. Thresholds in data, logic in code. Putting OPA thresholds in data.json instead of hardcoding them in Rego files means ops teams can tune policy without touching code or redeploying anything.

3. Every failure mode needs a distinct message. The spec said “every distinct failure mode must produce a different, human-readable outcome.” I ended up with five distinct OPA error states (unreachable, timeout, malformed JSON, undefined result, policy failed) each producing a clear, actionable message.

4. Platform-specific bugs are real. The Docker Desktop port binding issue cost hours. The fix (docker exec) is actually cleaner than HTTP anyway — but you only find that out after hitting the wall.

5. The audit trail is free if you build it from the start. Appending JSON to history.jsonl on every event costs almost nothing at runtime but provides complete forensic history for free.

Conclusion

SwiftDeploy Stage 4b is a deployment tool that can see (metrics), think (OPA policy), remember (audit trail), and refuse (policy gates). The entire stack — from /metrics to audit_report.md — is driven by a single manifest.yaml.

The code is available at: [your GitHub repo URL here]

If you’re building something similar, the key takeaways are: isolate your policy engine, never return bare booleans from policy checks, and always give operators a human-readable reason when you block them.

Built for HNG14 DevOps Track — Stage 4b
