Voice Agent — System Architecture
[Pipeline diagram. Stage annotations: mulaw 8kHz (Twilio); Streaming + VAD; Regex→Rasa→SetFit→LLM; + Function Calls; SSML support; mulaw (Twilio)]
Tool Call Feedback Loop
When the LLM invokes a function (e.g. check_order_status), the pipeline executes the tool, feeds results back to the LLM (up to 3 rounds), and disables tools on follow-up rounds to force a natural language response.
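The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_tool_loop`, the `llm` callable and its reply shape, and the `registry` dict are all hypothetical names. It takes one reading of the description, in which tools are offered only on the first round and withheld afterwards.

```python
from typing import Any, Callable, Dict, List, Optional

MAX_TOOL_ROUNDS = 3  # "up to 3 rounds" per the description above

Message = Dict[str, str]
# Hypothetical LLM interface: returns {"text": ...} or
# {"tool_call": {"name": ..., "arguments": {...}}}.
LLM = Callable[[List[Message], Optional[List[dict]]], Dict[str, Any]]


def run_tool_loop(
    llm: LLM,
    tools: List[dict],
    messages: List[Message],
    registry: Dict[str, Callable[..., Any]],
) -> str:
    """Execute tool calls and feed results back until the LLM answers in text."""
    for round_idx in range(MAX_TOOL_ROUNDS):
        # Offer tools on the first round only; follow-up rounds get none,
        # which pushes the model toward a natural-language reply.
        offered = tools if round_idx == 0 else None
        reply = llm(messages, offered)
        call = reply.get("tool_call")
        if call is None:
            return reply["text"]
        result = registry[call["name"]](**call["arguments"])
        messages.append(
            {"role": "tool", "name": call["name"], "content": str(result)}
        )
    # Assumed fallback if the model never produces text within the round limit.
    return "Sorry, I could not complete that request."
```

Withholding tools on later rounds trades flexibility (no chained tool calls) for a guaranteed spoken answer, which matters on a phone call where silence is costly.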
Speech-to-Text Engine
Streaming Deepgram Nova-2 WebSocket integration with VAD events, smart endpointing (800ms), and utterance buffering with debounce logic.
- Class: DeepgramSTTEngine
- Protocol: WebSocket streaming
- Events: Transcript, SpeechStarted, UtteranceEnd
- Formats: mulaw/8kHz (Twilio), linear16/16kHz (browser)
- Features: Smart format, punctuation, filler words, keywords
LLM Engine
Multi-provider streaming LLM with function calling. Supports Gemini, OpenAI, and Anthropic with automatic role merging for Gemini's alternating-role constraint.
- Class: LLMEngine
- Providers: Gemini, OpenAI, Anthropic
- Streaming: Async generator (token, tool_call, done)
- Tools: OpenAI-format function declarations
- Features: system_override, role merging, TTFT logging
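Role merging for an alternating-role constraint might look like the sketch below. The function name and the `{"role", "content"}` message shape are assumptions for illustration; the actual LLMEngine may handle this differently.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_consecutive_roles(messages: List[Message]) -> List[Message]:
    """Collapse adjacent messages that share a role into a single message.

    The result never has two neighbouring messages with the same role,
    which is what an alternating-role API expects.
    """
    merged: List[Message] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same speaker as the previous entry: append the text.
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            # Copy so the caller's message dicts are not mutated.
            merged.append(dict(msg))
    return merged
```

For example, two back-to-back `user` messages (say, a transcript fragment followed by a correction) become one `user` message with both lines.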
NLU / Intent Router
4-tier hybrid intent classification: instant regex patterns, Rasa NLU, SetFit transformer, and LLM fallback. Includes emotion detection and entity extraction.
- Tiers: Regex (0ms) → Rasa (<10ms) → SetFit (<5ms) → LLM (~1s)
- Intents: 17 predefined (order, appointment, transfer, etc.)
- Entities: order_id, date, time, purpose, email, phone
- Emotions: 6 classes (neutral, happy, frustrated, sad, angry, confused)
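A tiered router of this kind can be sketched as a chain of classifiers tried from cheapest to most expensive. Everything named here is hypothetical: the patterns, the `route_intent` function, and the 0.7 confidence threshold are assumptions, and the Rasa, SetFit, and LLM tiers are represented only as pluggable callables.

```python
import re
from typing import Callable, List, Optional, Tuple

# A tier takes the utterance and returns (intent, confidence) or None.
Classifier = Callable[[str], Optional[Tuple[str, float]]]

# Illustrative patterns only; the real rule set is not shown in the source.
REGEX_PATTERNS = {
    "order_status": re.compile(r"\b(where is|track|status of)\b.*\border\b", re.I),
    "transfer": re.compile(r"\b(human|agent|representative)\b", re.I),
}


def regex_tier(text: str) -> Optional[Tuple[str, float]]:
    """Tier 1: instant pattern match, treated as fully confident."""
    for intent, pattern in REGEX_PATTERNS.items():
        if pattern.search(text):
            return intent, 1.0
    return None


def route_intent(
    text: str,
    tiers: List[Tuple[str, Classifier]],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (intent, tier_name) from the first tier that is confident enough."""
    for name, classify in tiers:
        result = classify(text)
        if result is not None and result[1] >= threshold:
            return result[0], name
    return "unknown", "none"
```

Ordering the tiers by latency means most routine utterances never reach the slow LLM fallback.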
Text-to-Speech Engine
Dual-provider TTS with smart routing. ElevenLabs for premium quality, Cartesia Sonic for ultra-low latency. Includes SSML builder and emotion-aware prosody.
- Providers: ElevenLabs, Cartesia Sonic
- Router: TTSRouter (auto, quality, speed, cost)
- SSML: Breaks, emphasis, prosody, say-as, phoneme
- Formats: mulaw_8000, pcm_16000, mp3
- Features: Voice cloning, multilingual, emotion prosody
Conversation Memory
Dual-layer memory: session-level turn tracking with auto-summarization, plus persistent cross-call user profiles with preferences and history.
- Session: ConversationMemory (turn tracking, summarization)
- Persistent: UserMemory (name, tier, preferences, call history)
- Storage: SQLite (user_memory, conversation_turns)
- Feature: Auto-summarize when exceeding max_turns
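The auto-summarize behaviour could work along these lines. This is a sketch under assumptions: `SessionMemorySketch`, `keep_recent`, and the placeholder `naive_summarize` are invented for illustration, and a real summarizer would presumably call the LLM rather than join strings.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def naive_summarize(previous: str, old_turns: List[Turn]) -> str:
    """Placeholder summarizer: concatenates turns instead of calling an LLM."""
    joined = " | ".join(f"{role}: {text}" for role, text in old_turns)
    return f"{previous} | {joined}" if previous else joined


class SessionMemorySketch:
    """Keeps recent turns verbatim and folds older ones into a summary."""

    def __init__(
        self,
        max_turns: int = 6,
        keep_recent: int = 2,
        summarize: Callable[[str, List[Turn]], str] = naive_summarize,
    ) -> None:
        self.max_turns = max_turns
        self.keep_recent = keep_recent
        self.summarize = summarize
        self.summary = ""
        self.turns: List[Turn] = []

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            # Fold everything except the newest keep_recent turns into the summary.
            cut = len(self.turns) - self.keep_recent
            self.summary = self.summarize(self.summary, self.turns[:cut])
            self.turns = self.turns[cut:]
```

Keeping a few recent turns verbatim preserves immediate context (such as a just-mentioned order number) while bounding prompt size.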
RAG Knowledge Engine
Retrieval-Augmented Generation with FAISS vector search (or numpy fallback). Loads documents from a knowledge base directory, chunks text, embeds, and retrieves relevant context.
- Vector Store: FAISS IndexFlatIP (or numpy cosine fallback)
- Embeddings: sentence-transformers (all-MiniLM-L6-v2)
- Formats: MD, TXT, HTML, JSON, CSV
- Chunking: 300-char voice-optimized paragraphs
- Caching: FAISS index persistence + content hash invalidation
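To illustrate the chunking and the numpy fallback path, here is a small sketch. The function names are hypothetical and the packing strategy is an assumption; it only shows that cosine similarity on normalized vectors gives the same ranking an inner-product index would give for unit-length embeddings.

```python
from typing import List, Tuple

import numpy as np


def chunk_paragraphs(text: str, max_chars: int = 300) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole in this sketch.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + 1 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        chunks.append(current)
    return chunks


def top_k_cosine(
    query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3
) -> List[Tuple[int, float]]:
    """Return (row_index, cosine_score) for the k most similar rows."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # inner product of unit vectors equals cosine similarity
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

Short chunks suit a voice agent because retrieved context is eventually spoken, so each passage should fit in a sentence or two.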
Realtime Audio LLM
OpenAI GPT-4o Realtime API integration for direct audio-in/audio-out streaming. Bypasses the separate STT + LLM + TTS pipeline for ultra-low latency.
- Protocol: WebSocket bidirectional audio
- VAD: Server-side voice activity detection
- Tools: Function calling via Realtime API
- Barge-in: Response cancellation support
WebRTC Audio Handler
Browser-to-server audio streaming with format conversion. Resamples 48kHz float32 browser audio to 16kHz int16 for the STT pipeline.
- Input: Float32, 48kHz, mono (browser)
- Output: Int16, 16kHz, mono (STT)
- Resampler: Linear interpolation
- Buffering: 100ms frames
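The conversion listed above can be approximated with `numpy.interp`. This is a sketch with a hypothetical function name, not the handler's actual code. Note that plain linear interpolation applies no low-pass filter, so content above 8kHz may alias.

```python
import numpy as np


def resample_to_int16(
    samples: np.ndarray, src_rate: int = 48_000, dst_rate: int = 16_000
) -> np.ndarray:
    """Resample float32 audio in [-1, 1] to int16 PCM by linear interpolation."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.size == 0:
        return np.zeros(0, dtype=np.int16)
    n_out = int(samples.size * dst_rate / src_rate)
    # Fractional positions in the source signal for each output sample.
    src_pos = np.arange(n_out) * (src_rate / dst_rate)
    resampled = np.interp(src_pos, np.arange(samples.size), samples)
    # Clip before scaling so out-of-range input cannot wrap around in int16.
    clipped = np.clip(resampled, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)
```

A 100ms browser frame at 48kHz holds 4800 samples and comes out as 1600 samples at 16kHz.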
Twilio Handler
Inbound call webhook (TwiML response), WebSocket media stream handler, outbound dialing, and call session lifecycle management.
- Inbound: TwiML + Media Streams WebSocket
- Audio: base64 mulaw encoding/decoding
- DTMF: Digit handling (0 = transfer to human)
- Barge-in: Mark-based audio sync + clear
- State: active_sessions dict (call_sid → CallSession)
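As an illustration of the audio decoding step, the sketch below unpacks one Media Streams `media` message into 16-bit PCM. The function names are hypothetical, and the mu-law expansion is the standard G.711 formula written out by hand so the example has no dependency beyond the standard library.

```python
import base64
import json
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_media_message(raw: str) -> bytes:
    """Return little-endian PCM16 bytes for a 'media' event, else b''."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return b""
    ulaw = base64.b64decode(msg["media"]["payload"])
    pcm = [ulaw_byte_to_pcm16(b) for b in ulaw]
    return struct.pack(f"<{len(pcm)}h", *pcm)
```

Encoding for the return path is the inverse: compress PCM to mu-law, base64-encode it, and wrap it in a `media` message.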
Outbound Campaign Manager
Batch dialing engine with DNC compliance, answering machine detection, TCPA calling hours enforcement, and real-time campaign analytics.
- AMD: Answering machine detection (HUMAN/MACHINE/FAX)
- DNC: Do-Not-Call list management + scrubbing
- TCPA: 9 AM – 9 PM calling hours enforcement
- Concurrency: Semaphore-based rate control
- Analytics: Contact attempt tracking, status filtering
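The DNC scrub, calling-hours check, and semaphore-based concurrency could combine roughly as follows. All names here (`within_calling_hours`, `dial_batch`, the contact dict keys, the status strings) are hypothetical, and the window is assumed to be evaluated in the contact's local time zone.

```python
import asyncio
from datetime import datetime, time, tzinfo
from typing import Awaitable, Callable, Dict, List, Set, Tuple

CALL_WINDOW_START = time(9, 0)   # 9 AM local
CALL_WINDOW_END = time(21, 0)    # 9 PM local


def within_calling_hours(now: datetime, contact_tz: tzinfo) -> bool:
    """True if the timezone-aware `now` falls in the window in contact_tz."""
    local = now.astimezone(contact_tz)
    return CALL_WINDOW_START <= local.time() < CALL_WINDOW_END


async def dial_batch(
    contacts: List[Dict],
    dial: Callable[[str], Awaitable[str]],
    dnc: Set[str],
    now: datetime,
    max_concurrent: int = 5,
) -> List[Tuple[str, str]]:
    """Dial contacts with at most max_concurrent calls in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def attempt(contact: Dict) -> Tuple[str, str]:
        phone = contact["phone"]
        if phone in dnc:
            return phone, "skipped_dnc"
        if not within_calling_hours(now, contact["tz"]):
            return phone, "skipped_hours"
        async with sem:  # waits here while the concurrency limit is reached
            return phone, await dial(phone)

    return list(await asyncio.gather(*(attempt(c) for c in contacts)))
```

The compliance checks run before the semaphore is acquired, so skipped contacts never consume a dialing slot. A `zoneinfo.ZoneInfo` instance can be passed as `contact_tz` where named time zones are available.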
Security Middleware
Twilio signature validation (HMAC-SHA1), PII redaction (SSN, credit cards, DOB, etc.), admin authentication, IP-based rate limiting (200 req/60s).
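For reference, Twilio's signature scheme for form-encoded webhooks can be reproduced with the standard library, as sketched below. The function names are hypothetical; a production service could equally rely on the official helper library's request validator.

```python
import base64
import hashlib
import hmac
from typing import Dict


def compute_twilio_signature(auth_token: str, url: str, params: Dict[str, str]) -> str:
    """Sign the full request URL plus POST params sorted by key (HMAC-SHA1, base64)."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def is_valid_twilio_request(
    auth_token: str, url: str, params: Dict[str, str], header_signature: str
) -> bool:
    """Compare against the X-Twilio-Signature header in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected.encode(), header_signature.encode())
```

The URL must match what Twilio called, including scheme and query string, which is a common source of false rejections behind a reverse proxy.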
Observability
Prometheus-format metrics export, circuit breaker pattern for provider failover (CLOSED → OPEN → HALF_OPEN), and latency tracking across all pipeline stages.
Compliance Auditor
Automated HIPAA/PCI-DSS compliance checking with 20 audit controls, PII scanning (10 pattern types with severity levels), risk assessment, and score calculation.
Redis Session Cache
Distributed session management for horizontal scaling. Redis implementation with pub/sub for cross-instance coordination, plus automatic in-memory fallback.
| Tool Name | Parameters | Returns | Description |
|---|---|---|---|
| check_order_status | order_id | Status, items, tracking, ETA | Look up order by ID, return full shipping details |
| schedule_appointment | date, time, name, purpose | Confirmation #, details | Book a new appointment with conflict checking |
| cancel_appointment | confirmation_number | Cancellation status | Cancel an existing appointment |
| reschedule_appointment | confirmation_number, new_date, new_time | Updated details | Change date/time of existing appointment |
| transfer_to_human | department, reason | Queue position, wait time | Request transfer to human agent |
| look_up_account | identifier (phone/email/ID) | Customer profile, history | Find customer record by any identifier |
| get_business_hours | department (optional) | Hours, current status | Check if open/closed, return schedule |
| collect_feedback | rating, comment | Thank you + feedback ID | Record customer satisfaction rating |
| end_call | reason (optional) | Goodbye message | End the conversation politely |
Source: tools/functions.py · 672 lines · Database tables: orders, appointments, customers, feedback, transfers
| Source | Destination | Protocol | Data |
|---|---|---|---|
| Browser / Phone | Nginx | HTTP / WS | Audio frames, JSON commands |
| Nginx | FastAPI (Uvicorn) | Reverse proxy | IP-hash sticky sessions |
| FastAPI | Deepgram | WebSocket | Raw audio → transcript events |
| FastAPI | Gemini / OpenAI / Claude | HTTPS (streaming) | Prompt + history → token stream |
| FastAPI | ElevenLabs / Cartesia | HTTPS / WS | Text → audio chunks |
| FastAPI | Twilio | REST + WS | TwiML, media stream, outbound dial |
| FastAPI | Redis | TCP | Session state, pub/sub events |
| FastAPI | SQLite | File I/O | Config, call logs, orders, appointments |
| Prometheus | FastAPI /metrics | HTTP scrape | Counter + histogram metrics |
| Grafana | Prometheus | HTTP query | PromQL dashboard queries |
Metrics endpoint: /metrics, scraped every 15s.
Circuit Breaker States
- CLOSED: All requests pass through
- OPEN: Traffic blocked for 30s
- HALF_OPEN: 1 request allowed through
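The three states can be modelled with a small state machine like the one below. It is a sketch, not the project's implementation: the class shape, the `failure_threshold` default, and the injectable `clock` are assumptions. Only the 30s open period and the single trial request come from the description above.

```python
import time
from typing import Callable


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures -> HALF_OPEN after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,        # assumed; not stated in the source
        open_seconds: float = 30.0,        # traffic blocked for 30s
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                return False
            # Cooldown elapsed: let exactly one trial request through.
            self.state = "HALF_OPEN"
            self.probe_in_flight = False
        if self.state == "HALF_OPEN":
            if self.probe_in_flight:
                return False
            self.probe_in_flight = True
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before contacting a provider, fail over to the alternate provider when it returns `False`, and report the outcome with `record_success()` or `record_failure()`.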
Docker Services
- voice-agent: FastAPI server (port 8000) · 2 CPU, 2GB RAM
- redis: Session cache (port 6379) · 256MB, AOF, LRU
- rasa: NLU server (port 5005) · profile: with-rasa
- prometheus: Metrics scraping (port 9090) · profile: monitoring
- grafana: Dashboards (port 3000) · profile: monitoring
- nginx: Load balancer (port 80) · profile: production
CI/CD Pipeline
- Stage 1 (Lint): Ruff check + format + mypy type check
- Stage 2 (Test): pytest + coverage (with Redis service)
- Stage 3 (Security): Safety + Bandit vulnerability scan
- Stage 4 (Docker): Image build + health check test
- Stage 5 (Deploy): Production deploy (main branch only)
Nginx Load Balancer
- Strategy: IP hash (caller affinity / sticky sessions)
- API Rate: 30 req/s per IP (burst 20)
- WS Rate: 10 req/s per IP
- WS Timeout: 3600s (1 hour for long calls)
- Connections: 100 concurrent WebSockets per IP