← Back to portfolio

TV Voice Assistant, 2025

Internship work — Voice-enabled assistant for connected TV using on-device recognition and cloud AI agents.

Skills: Anthropic API, MCP (Model Context Protocol), Android, Voice Recognition, Kotlin/Java, System Integration.

Context

Overview

Project: Internship project to build an AI-powered voice assistant tailored for TV interactions.

Goal: Provide natural voice navigation and search that feels fast and intuitive on TV devices.

Timeline: Internship period (3 months)

Team

1 Senior Engineer

3 Junior Engineers (including me)

My Role

Platform & Integration: Implemented MCP (Model Context Protocol) servers and connected agent runtimes to manage model context and tool routing.

Voice Recognition: Implemented the Android voice recognition frontend and its integration with the assistant over a low-latency channel.

Problem

TV remote experiences are clunky: voice is underutilized, responses can be slow or inconsistent, and personalization is limited.

Users expect fast, conversational responses to control playback, search for shows, or ask for recommendations. They also expect results tailored to their profile (watch history, preferences) and actions that integrate with device tools (play, enqueue, open apps). Typical voice flows, which either send audio to a cloud endpoint or rely only on on-device models, introduce unpredictable latency or lack the tooling and personalization needed for a rich TV assistant. To address that, we adopted MCP (Model Context Protocol) to manage model context: it enables secure session routing, controlled tool invocation (playback APIs, catalog queries, personalization stores), and edge-hosted agents that balance latency, privacy, and capability.
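To make the tool-invocation idea concrete, below is a minimal sketch of how an agent runtime can route a model-requested tool call to a handler. The names (ToolRegistry, the tool IDs) are illustrative assumptions, not our production code, which spoke the MCP protocol proper.

    // Illustrative tool routing in the style our MCP-hosted agents used.
    // Names and tool IDs are hypothetical; the real servers implement MCP itself.
    class ToolRegistry {
        private val handlers = mutableMapOf<String, (Map<String, String>) -> String>()

        fun register(name: String, handler: (Map<String, String>) -> String) {
            handlers[name] = handler
        }

        // Dispatch a tool call requested by the model to its registered handler.
        fun call(name: String, args: Map<String, String>): String =
            handlers[name]?.invoke(args) ?: error("Unknown tool: $name")
    }

    fun main() {
        val registry = ToolRegistry()
        registry.register("playback.play") { args -> "Playing ${args["title"]}" }
        registry.register("catalog.search") { args -> "Results for ${args["query"]}" }
        println(registry.call("playback.play", mapOf("title" to "a show")))
    }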

Research

We performed a focused competitive analysis of leading voice systems (for example, Google Assistant and platform voice SDKs) and complemented that with small lab prototype sessions to validate latency and conversational expectations. Key insights from this combined analysis:

  1. Latency under ~500ms feels instant on TV; anything higher noticeably degrades trust.
  2. Users prefer short, conversational confirmations rather than long multi-sentence responses.
  3. Privacy matters: users want explicit affordances and the option to keep audio processing local where possible.

Competitive analysis and prototype findings led us to prioritize a hybrid architecture: local voice recognition for capture plus MCP-hosted agents for language understanding, personalization, and tool use (playback APIs, app interactions, catalog queries, weather retrieval, travel search, web search, session memory, etc.). We also experimented with fine-tuned on-device LLMs to keep reasoning local, but end-to-end latency (~5s+) and accuracy were both problematic and degraded the experience. The company preferred hosting its own model and edge servers for tighter control and personalization, which made MCP a natural fit.

Idea

Create a low-latency assistant by pairing Android voice recognition (on-device capture and VAD) with MCP servers that orchestrate agent runtimes and calls to the Anthropic API for reasoning and response generation. MCP-hosted agents maintain short-term model context and invoke tools (playback APIs, app interactions, catalog queries, weather retrieval, flight & hotel search, web search, personalization) on behalf of the device. We used streaming and message-level caching to reduce repeated latency and improve perceived speed. We briefly tested a fine-tuned on-device LLM path, but the combined latency/accuracy trade-offs led us back to MCP-hosted reasoning for production flows.
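As an example of the message-level caching idea, here is a minimal sketch that short-circuits repeated utterances before they reach the hosted agent. The class name, key normalization, and TTL are illustrative assumptions rather than the production design.

    import java.util.concurrent.ConcurrentHashMap

    // Minimal message-level cache sketch: repeated short commands ("pause",
    // "volume up") skip the round trip to the MCP-hosted agent.
    class CachedResponder(
        private val ttlMillis: Long = 30_000,       // assumed TTL, not a measured value
        private val fetch: (String) -> String       // forwards a miss to the hosted agent
    ) {
        private data class Entry(val text: String, val at: Long)
        private val cache = ConcurrentHashMap<String, Entry>()

        fun respond(utterance: String): String {
            val key = utterance.trim().lowercase()   // naive cache-key normalization
            val now = System.currentTimeMillis()
            cache[key]?.let { if (now - it.at < ttlMillis) return it.text }
            return fetch(utterance).also { cache[key] = Entry(it, now) }
        }
    }

A production cache would also key on session and profile so personalized answers are never shared across users; the sketch only shows why repeated commands feel instant.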

Solution

Technical Contributions

  1. MCP servers: Provisioned and configured MCP servers to host agent runtimes that handle conversation state, tool calls, and session routing.
  2. Agent orchestration: Built a lightweight orchestrator to translate UI intents into Anthropic API prompts, manage short-term memory, and apply fallback heuristics (see the orchestrator sketch after this list).
  3. Android voice path: Implemented voice capture using the Android speech recognition APIs with VAD, local transcription when available, and secure upload streams for cases where cloud NLU was needed. We also prototyped a custom on-device voice recognition pipeline but found the Android built-in SDK provided better accuracy and integration in the final product (a simplified capture sketch follows this list).
  4. On-device LLM experiments: Built early fine-tuned on-device model prototypes to evaluate fully-local reasoning; measured end-to-end latency (~5s+) and accuracy, both of which proved unsuitable for the TV experience.
  5. Low-latency streaming: Adopted streaming for partial transcripts and incremental assistant responses to improve perceived speed (both sketches below show this streaming path).
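The capture path, simplified: the built-in SpeechRecognizer with partial results enabled, so transcripts can be streamed while the user is still speaking. This is a condensed sketch; permission handling, error recovery, and the low-latency channel to the assistant are omitted.

    import android.content.Context
    import android.content.Intent
    import android.os.Bundle
    import android.speech.RecognitionListener
    import android.speech.RecognizerIntent
    import android.speech.SpeechRecognizer

    // Condensed capture sketch (requires the RECORD_AUDIO permission).
    fun startListening(context: Context, onPartial: (String) -> Unit, onFinal: (String) -> Unit) {
        val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
        recognizer.setRecognitionListener(object : RecognitionListener {
            override fun onPartialResults(partialResults: Bundle?) {
                partialResults?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull()?.let(onPartial)  // partial transcript, streamed to the UI
            }
            override fun onResults(results: Bundle?) {
                results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull()?.let(onFinal)    // final transcript goes to the orchestrator
            }
            // Remaining callbacks are unused in this sketch.
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onError(error: Int) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
        recognizer.startListening(Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
            putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)  // enables onPartialResults
        })
    }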
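And the orchestrator side, condensed to its essence: a streaming call to the Anthropic Messages API that surfaces text deltas as they arrive. The model alias, the system prompt, and the naive JSON/SSE handling are simplifications for illustration; the production orchestrator also carried session memory and tool routing.

    import java.net.URI
    import java.net.http.HttpClient
    import java.net.http.HttpRequest
    import java.net.http.HttpResponse

    // Streaming request to the Anthropic Messages API; deltas are emitted
    // incrementally so the TV UI can render the reply as it is generated.
    fun streamAssistantReply(utterance: String, onDelta: (String) -> Unit) {
        val apiKey = System.getenv("ANTHROPIC_API_KEY") ?: error("ANTHROPIC_API_KEY not set")
        val body = """
            {"model": "claude-3-5-sonnet-latest",
             "max_tokens": 256,
             "stream": true,
             "system": "You are a TV assistant. Confirm actions in one short line.",
             "messages": [{"role": "user", "content": "$utterance"}]}
        """.trimIndent()  // assumes the utterance is already JSON-escaped
        val request = HttpRequest.newBuilder(URI.create("https://api.anthropic.com/v1/messages"))
            .header("x-api-key", apiKey)
            .header("anthropic-version", "2023-06-01")
            .header("content-type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build()
        HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofLines())
            .body()
            .filter { it.startsWith("data:") && "content_block_delta" in it }
            .forEach { line ->
                // Crude delta extraction for the sketch; real code parses the SSE JSON.
                val text = line.substringAfter("\"text\":\"", "").substringBefore("\"")
                if (text.isNotEmpty()) onDelta(text)
            }
    }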

UX & Product Highlights

  1. Short, one-line confirmations for navigation commands; richer responses for searches with quick actions (play, enqueue, details).
  2. Visible mic state and history to improve user trust and discoverability.

Reflection

This internship taught me the importance of engineering for perceived performance: small changes (partial streaming, faster confirmations, clear mic states) significantly improved usability. Working with MCP and managed agent runtimes clarified how to separate realtime capture from heavier reasoning tasks safely and scalably. The early on-device LLM experiments were invaluable: even fine-tuned, the on-device models suffered from latency (~5s+) and accuracy issues, and our custom voice pipeline underperformed the Android built-in SDK; both lessons shaped the final architecture. I also came away with a clearer view of privacy controls and of designing opt-in cloud features that respect users' expectations.

If I continued this work, I would explore robust on-device LLM options for offline scenarios, add better personalization controls, and build a longer-term telemetry pipeline for production safety monitoring.