SayFight – A Real-Time Voice-Controlled Party Game

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

SayFight is a chaotic, real-time, multiplayer party game where players shout funny 3-word phrases to activate their character’s actions in fast-paced mini-games. The game uses AssemblyAI’s Universal-Streaming API to transcribe speech in real time, allowing players to say phrases like “banana punch blast” or “snake sneak roll” to score points or move their team forward.

This project is built for the Real-Time Performance prompt, showcasing how AssemblyAI enables lightning-fast, voice-driven interactions for games and other real-time apps.

⚠️ Note: This is a proof of concept. Due to time constraints, the project focuses on core functionality:

  • Real-time voice command recognition
  • Phrase-to-action triggering
  • Multiplayer game loop
  • AssemblyAI integration via WebSocket streaming

The current version includes two simple game modes, each designed to demonstrate the real-time voice interaction loop. While limited in scope, SayFight lays the foundation for a more fully featured voice-controlled party game in future iterations.

Demo

🎮 Live Demo: https://sayfight.com

📸 Screenshots:

Main Page
Player Join Screen
Host Screen
Wait Screen
Player Screen
Tug of War
Voice Racer

GitHub Repository

🔗 https://github.com/rezelco/sayfight.com

Technical Implementation & AssemblyAI Integration

SayFight uses AssemblyAI’s Universal-Streaming API to transcribe player voice input with low latency. The entire gameplay loop is built around fast transcription, text matching, and synchronized game logic.

🛠 Stack

  • Frontend: Next.js 14 (App Router) with Tailwind CSS + shadcn/ui, HTML5 Canvas for game visuals
  • Backend: Node.js + TypeScript with Socket.io for real-time multiplayer
  • Voice Capture: Web Audio API with MediaStream (direct audio capture, not WebRTC)
  • Voice Streaming: AssemblyAI Universal-Streaming WebSocket API
  • Audio Processing: 16kHz PCM audio format

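As a rough illustration of the capture pipeline, here is a TypeScript sketch of the downsample-and-encode step (the helper names are mine, not from the SayFight source): browsers typically deliver Float32 samples at 44.1 or 48 kHz, which need to be decimated to 16 kHz and packed as 16-bit PCM before being sent to the backend as ArrayBuffers.

```typescript
// Naive decimation by sample-skipping; adequate for speech,
// assumes outRate <= inRate.
function downsample(input: Float32Array, inRate: number, outRate: number): Float32Array {
  const ratio = inRate / outRate;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = input[Math.floor(i * ratio)];
  }
  return out;
}

// Convert [-1, 1] floats to signed 16-bit PCM, clamping out-of-range samples.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// In the browser this would be wired up roughly as:
//   const stream = await navigator.mediaDevices.getUserMedia({
//     audio: { echoCancellation: true },
//   });
//   // ...inside an AudioWorklet/ScriptProcessor callback:
//   socket.emit("audio",
//     floatTo16BitPCM(downsample(chunk, ctx.sampleRate, 16000)).buffer);
```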
🔁 Voice-to-Action Flow

  1. Player joins room and grants microphone permission

    • Browser captures audio using MediaStream API
    • Audio is processed at 16kHz sample rate with echo cancellation
  2. Audio is streamed to backend via Socket.io

    • PCM audio chunks sent as ArrayBuffer
    • No WebRTC involved; chunks travel as raw binary frames over Socket.io
  3. Backend relays audio to AssemblyAI

    • Each player has a dedicated AssemblyAI WebSocket session
    • Audio is forwarded in real-time to AssemblyAI’s streaming endpoint
  4. Transcribed text is received with confidence scores

    • AssemblyAI returns partial and final transcripts
    • Only final transcripts with >0.7 confidence are processed
  5. Command extraction and matching

    • Text is matched against the player’s assigned phrase (e.g., “pandas hack wifi”)
    • Scoring system:
      • Full phrase match: 2× points
      • Partial phrase: 1.5× points
      • Trigger word only: 1× point
    • Total latency target: ~300–400ms
  6. Game state update and broadcast

    • Server updates authoritative game state at 60 FPS
    • State changes broadcast to all clients via Socket.io
    • Voice commands carry inherent latency, so client-side prediction is unnecessary
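The match-and-score logic from steps 4–5 could be sketched like this in TypeScript. The names and the exact partial-match rule (two or more words of the phrase) are my assumptions for illustration, not the project’s actual source:

```typescript
interface MatchResult {
  matched: boolean;
  multiplier: number; // 2 = full phrase, 1.5 = partial, 1 = trigger word only
}

// Final transcripts at or below this confidence are discarded.
const CONFIDENCE_THRESHOLD = 0.7;

// Lowercase, strip punctuation, split into words.
function normalize(text: string): string[] {
  return text.toLowerCase().replace(/[^a-z\s]/g, "").split(/\s+/).filter(Boolean);
}

function matchPhrase(transcript: string, phrase: string, confidence = 1): MatchResult {
  if (confidence <= CONFIDENCE_THRESHOLD) return { matched: false, multiplier: 0 };

  const heard = normalize(transcript);
  const target = normalize(phrase); // e.g. ["pandas", "hack", "wifi"]
  const hits = target.filter((w) => heard.includes(w)).length;

  if (hits === target.length) return { matched: true, multiplier: 2 };   // full phrase
  if (hits >= 2) return { matched: true, multiplier: 1.5 };              // partial phrase
  if (hits === 1) return { matched: true, multiplier: 1 };               // trigger word only
  return { matched: false, multiplier: 0 };
}
```

Keeping this a pure function over the transcript text makes the scoring tiers trivial to tune and unit-test independently of the streaming pipeline.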

🎯 Key Technical Features

  • 🐼 Phrase-based matching: 100+ unique animal-themed phrases with homophone avoidance
  • 🏆 Multi-tier scoring: Rewards for full phrase vs partial or single-word triggers
  • 🧠 Debug visualization: Host UI shows expected vs received transcription for tuning
  • 🔄 Persistent rooms: Players can reconnect and resume if disconnected
  • 👥 Team-based audio: Tug-of-war mode assigns unique phrases to each team member
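The phrase-assignment side of the team-based audio could look roughly like the sketch below. This is a hypothetical illustration (the pool contents and function names are mine): the homophone-avoidance idea is applied at assignment time by never giving two players in the same room the same trigger (first) word.

```typescript
// Illustrative pool; the real game draws from 100+ animal-themed phrases.
const PHRASE_POOL = [
  "pandas hack wifi",
  "banana punch blast",
  "snake sneak roll",
  "otters juggle soap",
];

function assignPhrases(playerIds: string[], pool = PHRASE_POOL): Map<string, string> {
  const assigned = new Map<string, string>();
  const usedTriggers = new Set<string>();
  for (const id of playerIds) {
    // Pick the first unassigned phrase whose trigger word is still unique.
    const phrase = pool.find(
      (p) =>
        !usedTriggers.has(p.split(" ")[0]) &&
        !Array.from(assigned.values()).includes(p),
    );
    if (!phrase) throw new Error("phrase pool exhausted for this room");
    usedTriggers.add(phrase.split(" ")[0]);
    assigned.set(id, phrase);
  }
  return assigned;
}
```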

🛠️ Lessons Learned

  • Streaming audio reliably is tricky: WebRTC wasn’t ideal for this use case; direct capture with Socket.io gave better latency and control.
  • Phrase design matters: Carefully crafted, phonetically distinct phrases dramatically improved transcription accuracy.
  • Fast > perfect: A snappy game loop with ~300ms latency was more fun than waiting for ultra-precise recognition.
  • AssemblyAI impressed: The API was reliable, fast, and handled noisy party environments better than expected.

🚀 Future Plans

If I continue this project, here’s what’s next:

  • 🎮 More game modes like rapid-fire trivia, fighting, or co-op modes
  • 🧠 Flexible NLU with fuzzy matching and intent-based phrase recognition
  • 🌍 Multilingual support via AssemblyAI’s language capabilities
  • 🧩 User-generated content: Let players design their own phrases and game types
  • 📱 Mobile-first UI/UX with animated avatars, emotes, and voice indicators
  • 🧪 Noise resilience and support for simultaneous voice inputs across teams

Thanks to AssemblyAI and DEV for hosting this challenge — building this was a blast!
