
Building Real-Time Voice AI Agents with LiveKit


Building a real-time voice AI agent is one of those challenges that sits at the intersection of several hard problems: low-latency audio streaming, language model inference, and multilingual support. Over the past few months at Tahreez, I had the opportunity to build exactly this.

The Architecture Challenge

The core problem is straightforward to describe but tricky to solve: a user speaks into a microphone, and the system needs to understand what they said, process it through an LLM, and respond with synthesized speech, all in real time and with minimal perceived latency.

Traditional speech-to-speech (S2S) solutions handle this as a monolithic pipeline, but we found that a cascaded approach gave us significantly better control over each stage.

Why Cascaded Over S2S?

Our cascaded pipeline breaks the problem into discrete stages:

  1. Speech-to-Text — Transcribe the audio input
  2. Language Detection — Identify Arabic or English
  3. LLM Processing — Generate a contextual response
  4. Text-to-Speech — Synthesize natural-sounding audio

This approach reduced inference costs by 32% and lowered perceived latency by 10% compared to the S2S alternative. The key insight: you can optimize each stage independently. Simplified, the orchestration looks like this:

# Simplified pipeline orchestration: each stage is an independently
# swappable component (STT, language detection, LLM, TTS).
async def process_utterance(audio_chunk: bytes) -> AudioResponse:
    # Stage 1: transcribe the incoming audio
    transcript = await stt_engine.transcribe(audio_chunk)

    # Stage 2: decide whether the user spoke Arabic or English
    language = detect_language(transcript)

    # Stage 3: generate a contextual reply using the conversation history
    response = await llm.generate(
        prompt=transcript,
        language=language,
        context=conversation_history
    )

    # Stage 4: synthesize audio with a voice matching the detected language
    audio = await tts_engine.synthesize(
        text=response,
        voice=get_voice_for_language(language)
    )

    return audio
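
For stage 2, even a lightweight script-based heuristic goes a long way in an Arabic/English setting. Here is a minimal sketch of what a detect_language helper could look like; the 30% threshold and the function itself are illustrative, not the production detector:

import re

# Hypothetical helper: classify a transcript as Arabic or English by the
# share of characters in the Arabic Unicode block. A trained language-ID
# model can replace this, but script detection is a cheap, fast baseline.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")

def detect_language(transcript: str) -> str:
    arabic_count = len(ARABIC_CHARS.findall(transcript))
    # Treat the utterance as Arabic once a meaningful share of it is Arabic script.
    if arabic_count > 0.3 * max(len(transcript), 1):
        return "ar"
    return "en"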

LiveKit and WebRTC

LiveKit handles the real-time communication layer. The framework abstracts away most of the WebRTC complexity while giving you enough control to optimize for your specific use case.
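
To give a rough picture of how an agent plugs in, the livekit-agents Python SDK uses a worker entrypoint that joins a room and subscribes to its audio tracks. The sketch below follows that pattern; exact names can differ between SDK versions, and stt_engine.push_frame stands in for whatever feeds the STT stage:

import asyncio

from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli

async def consume_audio(stream: rtc.AudioStream):
    # Hypothetical glue: feed decoded PCM frames into the STT stage above.
    async for event in stream:
        await stt_engine.push_frame(event.frame)  # stt_engine.push_frame is assumed

async def entrypoint(ctx: JobContext):
    # Join the LiveKit room; only audio tracks matter for a voice agent.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    @ctx.room.on("track_subscribed")
    def on_track(track: rtc.Track, publication, participant):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(consume_audio(rtc.AudioStream(track)))

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))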

The key challenge was managing the audio buffer to minimize the time between when a user stops speaking and when they hear the response. We achieved this by starting TTS synthesis as soon as the first tokens from the LLM were available, streaming the audio back progressively.
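
A rough sketch of that overlap, assuming both engines expose streaming interfaces; stream_generate, stream_synthesize, and output_track are illustrative names rather than a specific vendor API:

async def respond_streaming(transcript: str, language: str) -> None:
    # Assumed streaming interfaces: the LLM yields tokens as they are produced,
    # and the TTS engine consumes text incrementally while yielding audio chunks.
    token_stream = llm.stream_generate(prompt=transcript, language=language)

    async for audio_chunk in tts_engine.stream_synthesize(
        text_stream=token_stream,
        voice=get_voice_for_language(language),
    ):
        # Publish each chunk to the outgoing audio track immediately,
        # so playback starts well before the full response is generated.
        await output_track.write(audio_chunk)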

Lessons Learned

  1. Latency is everything in voice AI. Users notice delays of more than 500ms.
  2. Language switching mid-conversation is common in bilingual contexts. The system needs to handle this gracefully.
  3. Error recovery in streaming systems requires careful state management. When one stage fails, you need to fail gracefully without breaking the audio stream (see the sketch after this list).
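
For that last point, what worked was wrapping the pipeline so a failed stage degrades the current turn instead of killing the stream. A minimal sketch, assuming a canned fallback phrase per language; safe_process_utterance and the fallback text are illustrative:

FALLBACK_TEXT = {
    "ar": "عذراً، حدث خطأ. هل يمكنك إعادة المحاولة؟",
    "en": "Sorry, something went wrong. Could you try again?",
}

async def safe_process_utterance(audio_chunk: bytes, language: str = "en"):
    # If any stage raises, respond with a short apology in the active language
    # instead of going silent or tearing down the audio stream.
    try:
        return await process_utterance(audio_chunk)
    except Exception:
        return await tts_engine.synthesize(
            text=FALLBACK_TEXT[language],
            voice=get_voice_for_language(language),
        )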

Building voice AI agents taught me that the best architectures are the ones that give you control over the things that matter most — in this case, latency and reliability.