Beyond Dictation: Building Software Just by Talking

tl;dr: Kiro Steering Studio is a voice-powered tool that generates structured Kiro steering files through natural conversation – not dictation. Built on Amazon Nova 2 Sonic’s bidirectional streaming, it routes what you say to the right files, tracks open questions, and produces AI-optimized markdown context for your workspace. This post covers how it’s built, why it’s different from the voice AI tools you already use, and what I learned along the way.

Voice is becoming a first-class interface in developer tooling.
The voice-to-text space has exploded in 2026, with several products competing to make typing obsolete. Some treat it as a faster way to input text, while others use it to drive structured, agentic workflows.

The Voice AI Landscape for Devs in Q1 2026

OpenAI’s Codex supports voice, but the scope is intentionally narrow. The official Codex macOS app, released in February 2026, includes voice commands that let developers speak prompts directly into the agent interface. The VS Code extension similarly supports voice-to-text dictation for entering instructions. In both cases, voice is a prompt delivery mechanism: you speak a task, Codex executes it in an isolated sandbox and proposes a PR. The voice interface doesn’t change the interaction model; it just removes your keyboard from the loop.

Anthropic has introduced an official Voice Mode for the general Claude app (mobile and web), but voice capabilities for Claude Code are largely based on community-developed, third-party integrations.

Cursor introduced official voice support in Cursor 2.0, where Voice Mode lets you control the editor and its AI features with spoken commands – things like “open file app.ts,” “extract function,” or “refactor this to use async/await.” The AI drafts a patch in response. It’s a meaningful step beyond pure dictation: voice connects directly to the agent layer, so spoken instructions can trigger multi-step edits.

SuperWhisper and WisprFlow sit at the other end of the spectrum – general-purpose dictation tools that developers have adopted for everything from crafting prompts to drafting documentation. WisprFlow wins on seamless “flow” with auto-edits that make dictation feel natural. Both integrate via keyboard shortcuts, and are excellent at transcription.

All of these tools validate the same insight: voice is faster and more natural than typing. However, they all act as input mechanisms.

When you use any of these tools to build software, you’re still doing the cognitive work of:

  • Structuring information into the right format
  • Maintaining consistency in terminology and conventions
  • Organizing content into logical sections

You might speak faster than you type, but you’re still manually authoring markdown files.

What Kiro Steering Files Actually Do

Before explaining how Kiro Steering Studio works, it’s worth understanding what steering files are and why they matter. At its core, steering gives Kiro persistent knowledge about your workspace through markdown files. Instead of explaining your conventions in every chat, steering files ensure Kiro consistently follows your established patterns, libraries, and standards.

Kiro Steering Files

Three core files capture project context:

  • product.md defines what you’re building: application one-liner, target users, MVP user journeys and features, non-goals, success metrics, and domain glossary.
  • tech.md defines how to build it: frontend stack, backend approach, authentication, data storage, infrastructure as code, observability, and styling guide.
  • structure.md defines project organization: repository layout, naming conventions, import patterns, architecture patterns, and testing approach.
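
To make this concrete, here is a hypothetical excerpt of what a tech.md steering file might contain. The section names follow the field list above; the specific stack and versions are invented for illustration, not taken from the project:

```markdown
# Technology

## Frontend
- Next.js 14.2 with TypeScript 5.4 (strict mode)
- State: React Query for server state; do NOT add Redux

## Backend
- Node.js 20 with Express; REST only, no GraphQL

## Authentication
- OAuth 2.0 via a hosted identity provider; never store passwords directly
```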

Writing these by hand is tedious. Kiro offers an easy button to auto-generate them if you already have a well-established codebase, but that doesn’t help when you’re building a new application from scratch.

How Steering Studio Is Different

Kiro Steering Studio treats voice as an interface to structured knowledge generation rather than simple transcription. Instead of writing steering files, you talk about your project. The AI asks clarifying questions, probes for details you might have overlooked, and generates properly structured steering files in real time. The conversation becomes the documentation.

Conversational Extraction

Instead of dictating pre-structured content, you have a natural conversation. Tell Kiro Steering Studio “I’m building a task management app for internal engineering teams using React with TypeScript and Node.js.” The AI doesn’t transcribe your words verbatim; it asks clarifying questions you might not have considered:

  • “What’s your state management approach – Redux, React Query, or Context API?”
  • “How do you handle authentication?”
  • “What’s your testing approach?”
  • “Should authentication use OAuth or magic links?”

Each answer updates the appropriate steering file, as the conversation probes for completeness.

Intelligent Routing

The AI understands where information belongs. When you mention “React with TypeScript,” it automatically updates the frontend section of tech.md. When you describe user journeys, they populate product.md. When you explain your directory structure, structure.md gets updated.
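
In the real system the model itself decides which tool to call, but the idea can be sketched with a simple keyword table. The keyword patterns and function names below are my own illustration, not the actual implementation:

```typescript
// Map conversational topics to the steering file they belong in.
type SteeringFile = "product.md" | "tech.md" | "structure.md";

const routes: Array<[RegExp, SteeringFile]> = [
  [/react|typescript|node|auth|database/i, "tech.md"],
  [/user journey|target user|mvp|success metric/i, "product.md"],
  [/directory|folder|naming convention|repo layout/i, "structure.md"],
];

// Return the first steering file whose keywords appear in the utterance.
function routeUtterance(utterance: string): SteeringFile | null {
  for (const [pattern, file] of routes) {
    if (pattern.test(utterance)) return file;
  }
  return null; // nothing steering-related was mentioned
}
```

A keyword table like this is brittle; the tool-calling approach the post describes lets the model do this routing with full conversational context instead.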

Active Gap Detection

The AI tracks what’s missing. If you haven’t specified your frontend stack or naming conventions, it logs open questions and prompts you to resolve them. When you answer, the steering files update automatically. The goal is to actively ensure completeness rather than passively record what you say.
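
An open-questions tracker can be as simple as a keyed map behind the add/resolve tools. This is a rough sketch of the data structure, with names of my own choosing rather than the actual tool implementations:

```typescript
// Track unresolved decisions so the model can circle back to them.
interface OpenQuestion {
  topic: string;
  question: string;
  resolution?: string;
}

class QuestionLog {
  private questions = new Map<string, OpenQuestion>();

  add(topic: string, question: string): void {
    this.questions.set(topic, { topic, question });
  }

  resolve(topic: string, resolution: string): void {
    const q = this.questions.get(topic);
    if (q) q.resolution = resolution;
  }

  // Questions with no recorded decision yet.
  unresolved(): OpenQuestion[] {
    return [...this.questions.values()].filter((q) => !q.resolution);
  }
}
```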

How It’s Built

The architecture splits into four concerns: streaming, session management, steering state, and tool handling.

Architecture Diagram: Kiro Steering Studio

NovaSonicClient: Bidirectional Audio Streaming

At the center of the app is real-time, bidirectional streaming with Amazon Bedrock, specifically with Nova 2 Sonic, Amazon’s speech-to-speech foundation model. Unlike request-response models, where you record speech, send it as one request, wait for a response, then execute tool calls in a batch, Nova 2 Sonic processes audio as you speak and interleaves tool execution with the conversation.

Traditional voice AI flow:

  • Record all speech
  • Send complete audio
  • Wait for response
  • Execute tool calls
  • Send results
  • Wait for final response

Bidirectional streaming flow:

  • Audio streams continuously—no waiting for speech to finish
  • Model responds while you’re still talking
  • Tool calls happen mid-conversation, not after
  • Results flow back immediately, model continues speaking

Audio buffers queue up (max 220 chunks) and are processed in batches of five to prevent overwhelming the stream. When the queue fills under pressure, the oldest chunks are shed to maintain real-time responsiveness. The client handles the session lifecycle (start, audio content, prompts, tool results, and graceful shutdown) through a state machine that tracks which events have been sent.

Tool System: Synchronous Execution, Interleaved with Speech

Tool calls don’t wait until you finish speaking. The model might be mid-sentence describing your project, realize it should update the product steering, emit a toolUse event, get the result back, and continue talking. This happens in the toolEnd event handler, which runs the tool synchronously and streams the result back to the model:

session.onEvent('toolEnd', async (d: unknown) => {
  const toolData = d as ToolEndData;
  const result = runTool(store, toolData.toolName, toolData.toolUseContent);  // Synchronous
  await sonic.sendToolResult(socket.id, toolData.toolUseId, result);          // Send back to model
});

Seven tools control steering files, each with Zod schemas for validation:

  • set_product_steering: App description, user journeys, MVP features, success metrics
  • set_tech_steering: Frontend/backend stack, auth, data, infrastructure, constraints
  • set_structure_steering: Repo layout, naming conventions, architecture patterns
  • add_open_question: Log decisions that need resolution
  • resolve_open_question: Close out questions with documented decisions
  • get_steering_summary: Check what’s missing
  • checkpoint_steering_files: Persist to disk
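
The synchronous runTool seen in the handler earlier can be a plain dispatch over these tool names. This sketch covers three of the seven tools; the store shape and result strings are assumptions for illustration:

```typescript
// Dispatch a tool call by name against the in-memory steering store.
// Returns a string result that is sent back to the model.
interface Store {
  product: Record<string, unknown>;
  tech: Record<string, unknown>;
  openQuestions: string[];
}

function runTool(store: Store, toolName: string, input: Record<string, unknown>): string {
  switch (toolName) {
    case "set_product_steering":
      Object.assign(store.product, input); // merge-style update
      return "product steering updated";
    case "set_tech_steering":
      Object.assign(store.tech, input);
      return "tech steering updated";
    case "add_open_question":
      store.openQuestions.push(String(input.question));
      return "question logged";
    default:
      return `unknown tool: ${toolName}`;
  }
}
```

Everything here is an in-memory mutation, which is what keeps tool execution fast enough to interleave with speech.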

Each tool description guides the AI toward producing content optimized for LLMs: terse bullets, exact versions, anti-patterns, and file purposes. The descriptions are the secret sauce:

const techDescription = `Write in terse bullet-point format. For each field include:
- Exact versions (e.g., "Next.js 14.2" not "Next.js")
- Key conventions to follow
- What NOT to do (anti-patterns)
- Relevant CLI commands where applicable`;

SteeringStore: In-Memory State with Atomic Writes

The store maintains steering state in memory and writes atomically to disk. A merge mode flag (merge vs. replace) controls whether updates extend existing content or overwrite it. Session state persists to a JSON file for recovery:

{
  "version": 1,
  "updatedAt": "2025-01-26T18:30:00.000Z",
  "product": {
    "appOneLiner": "A task management app for remote teams",
    "targetUsers": "Distributed engineering teams"
  },
  "tech": {
    "frontend": "React with TypeScript",
    "backend": "Node.js with Express"
  }
}

Restart the server, and the conversation picks up where you left off.

Design Decisions

A few things I learned building this:

  1. Conversational state is harder than it looks
    The first major challenge was maintaining conversational state – tracking what topics have been covered, what remains outstanding, and storing open questions for later follow-up. The solution was a state file management system combined with Zod-validated tool calling. This lets the AI atomically update steering files mid-conversation while persisting session context to a state.json file that enables recovery across interruptions. The schemas validate structure; the state file captures continuity.

  2. Humans think (long) before they answer
    The second challenge arose from human behavior during technical decision-making – humans need time to think before answering questions about architecture and tech stack choices, which creates longer pauses that could timeout the streaming session. We addressed this with a keep-alive timer mechanism in the session manager that sends periodic signals to maintain the Bedrock connection during extended thinking pauses without prematurely terminating active conversations.

  3. Tool descriptions matter more than schemas
    Early versions had minimal tool descriptions with detailed JSON schemas. The AI would call tools correctly but produce generic content. The [obvious] fix was treating tool descriptions as prompts: specific format guidance, examples of good output, explicit anti-patterns. The schemas validate structure; the descriptions shape quality. Really just good ol’ prompt engineering here, to no surprise!

  4. Tool execution must be fast
    Because our tools execute synchronously and interleaved with speech, slow tools would create audible pauses. All seven steering tools are designed for sub-millisecond execution: in-memory state updates with no blocking I/O. File persistence happens via checkpoint_steering_files(), which the model calls at natural conversation breaks. If you add custom tools, keep them fast or make them async with immediate acknowledgment.
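
The keep-alive logic from point 2 reduces to a small decision function plus a timer around it. The names, the 30-second threshold, and the 5-second check interval below are my own assumptions, not the session manager’s actual values:

```typescript
// Decide whether a keep-alive ping is due: the user has been silent
// longer than the threshold, so the stream needs a signal to stay open.
function keepAliveDue(lastAudioAt: number, now: number, thresholdMs = 30_000): boolean {
  return now - lastAudioAt >= thresholdMs;
}

// Wiring: check periodically and ping the connection during long pauses.
function startKeepAlive(
  getLastAudioAt: () => number,
  ping: () => void
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    if (keepAliveDue(getLastAudioAt(), Date.now())) ping();
  }, 5_000);
}
```

Keeping the decision pure makes the pause-handling behavior easy to unit-test without real timers or a live Bedrock session.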

If you’re building something voice-powered, I want to hear about it – leave a comment below!

Interested in giving Kiro Steering Studio a try? The code is available at our GitHub repo:
https://github.com/aws-samples/sample-kiro-steering-studio
