We Were Voice AI Skeptics — Until We Watched a Broker Use Our CRM
Two years ago, if someone had told us we would be writing about voice AI with enthusiasm, we would have laughed. We had used Alexa. We had tried Siri. We had watched Google Assistant fail to understand our accents for the hundredth time. Voice assistants, to us, were toys that could set timers and occasionally play the wrong song.
Then we built LeadRegister, our CRM for real estate brokers in India. During a site visit, we watched a broker try to update a lead record while sitting in an auto-rickshaw between property showings. He was typing on a 6-inch screen, in bumpy traffic, switching between Hindi and English, misspelling half the words. It took him four minutes to log what should have been a 15-second update.
That moment changed our perspective entirely. The problem was not that voice AI was bad. The problem was that we had been evaluating it from behind a desk, where we had a keyboard and a large screen. For someone whose hands are busy, whose eyes are on the road, whose primary language is not English — voice is not a convenience. It is the only interface that works.
Where Voice Genuinely Outperforms Screens — And We Mean Genuinely
Voice is not universally better. But there are situations where it is so much better that going back to a screen feels absurd. After more than a decade of building software for field workers, brokers, NGO staff, and warehouse operators across India and the Middle East, we have a clear picture of when voice works.
It comes down to one principle: voice wins when hands and eyes are occupied, or when the user's relationship with text is complicated.
Before recommending voice for any project, we ask one question: "Would your users prefer to talk to your product, or are they already efficient with the screen?" If the answer involves field workers, drivers, regional language speakers, or phone-based support — voice is worth exploring. If the answer involves knowledge workers at desks — save your money.
Where Voice AI Fails — And We Are Speaking From Experience
Voice does not solve everything. There are situations where adding voice to your product is genuinely a waste of money, and we have watched companies learn this the hard way.
Complex data entry. Try dictating a 15-digit account number accurately. Or an address with a specific spelling. Or a spreadsheet. Voice is terrible for precise, structured data input. If getting a single character wrong breaks something, use a keyboard.
Browsing and comparison. You cannot browse a product catalog with voice. You cannot compare three pricing plans side by side. Screens are spatial — everything visible at once. Voice is sequential — one thing at a time. For anything that involves scanning, comparing, or visual pattern recognition, screens win decisively.
Privacy-sensitive environments. Nobody wants to dictate their bank details in a coffee shop. Or discuss medical symptoms on a crowded bus. Or have a sensitive HR conversation where colleagues can hear. If your users cannot speak aloud where they use your product, voice is dead on arrival.
Power users who type faster. Developers, data analysts, financial modelers — anyone who lives at a keyboard will find voice slower, more frustrating, and less precise. Do not force voice on users who already have an efficient workflow. You will annoy your best customers.
Early on, we suggested adding voice input to a dashboard analytics tool. The idea was that executives could ask questions like "show me revenue by region for Q3" instead of clicking through filters. In testing, every single user preferred clicking. The dashboard was already intuitive. Voice added friction, not convenience. The feature was scrapped after two weeks. Voice solves access problems — it does not improve already-good interfaces.
The Technology Stack — What Actually Goes Into a Voice Agent
Building a voice agent is more complex than building a text chatbot because you have two additional layers: understanding speech and generating speech. Each layer introduces latency, and latency kills the conversational illusion.
Speech-to-text (STT). Converts the caller's audio into a transcript. Typical engines: OpenAI Whisper, Google Cloud STT, Deepgram. This layer also handles language detection and noise filtering.

LLM reasoning. Takes the transcript and decides what to say and do: intent understanding, context from conversation history, knowledge retrieval, action decisions, and response generation.

Text-to-speech (TTS). Turns the response back into audio. Typical engines: ElevenLabs, Google Cloud TTS, Amazon Polly, with options like voice cloning and emotion control.
Latency matters more than anything else in that stack. In a text chat, a 2-second delay feels normal. In a voice conversation, a 2-second silence feels like the call dropped. The total round trip, from the user finishing their sentence to the agent starting its response, needs to stay under 1.5 seconds for the conversation to feel natural.
This is where most voice projects struggle. Each layer adds latency. Whisper takes 200-500ms for transcription. The LLM takes 300-800ms for reasoning. TTS takes 100-300ms for speech generation. Add network latency and you are already pushing the limit. The engineering challenge is not making each layer work — it is making them work fast enough together.
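The arithmetic behind that budget is worth making explicit. A minimal sketch, using the per-stage ranges quoted above; the network figure is our own assumption, not a measured number:

```python
# Rough latency-budget check for a voice agent round trip.
# STT, LLM, and TTS ranges are the figures quoted in the text;
# the network overhead is an assumed placeholder.

STAGES_MS = {
    "stt":     (200, 500),   # speech-to-text transcription
    "llm":     (300, 800),   # reasoning and response generation
    "tts":     (100, 300),   # speech synthesis
    "network": (100, 200),   # assumed round-trip network overhead
}

BUDGET_MS = 1500  # silence beyond this starts to feel like a dropped call

def round_trip_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best- and worst-case latency across all pipeline stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

best, worst = round_trip_range(STAGES_MS)
print(f"best case:  {best} ms")   # 700 ms, comfortably under budget
print(f"worst case: {worst} ms")  # 1800 ms, already over the 1500 ms budget
```

Even with optimistic assumptions, the worst case blows the budget, which is why production voice agents lean on streaming at every stage rather than waiting for each layer to finish.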
What It Actually Costs and Takes to Build
Here is the honest breakdown — because the range is enormous and most vendors are deliberately vague.
Quick reference, where voice fits:

Hands and eyes occupied: in the field → voice wins; driving → voice essential; factory floor → voice wins.

Language and literacy: Hindi speakers → voice wins; code-switchers → voice helps; low literacy → voice essential.

Task type: data entry → screens win; status checks → voice wins; comparisons → screens win.

Environment: open floor → maybe not; outdoors → yes; public space → no.
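For teams triaging a backlog of feature requests, the table above can be encoded as a simple lookup. This is purely illustrative; the scenario names and verdicts mirror the table, and the fallback verdict is our own addition:

```python
# Illustrative lookup encoding the voice-vs-screens decision table.
# Verdicts come straight from the table; unknown scenarios fall back
# to the article's actual advice: test with real users.

VOICE_FIT = {
    "in the field":   "voice wins",
    "driving":        "voice essential",
    "factory floor":  "voice wins",
    "hindi speakers": "voice wins",
    "code-switchers": "voice helps",
    "low literacy":   "voice essential",
    "data entry":     "screens win",
    "status checks":  "voice wins",
    "comparisons":    "screens win",
    "open floor":     "maybe not",
    "outdoors":       "yes",
    "public space":   "no",
}

def voice_verdict(scenario: str) -> str:
    """Return the table's verdict for a scenario, or a default of 'test with users'."""
    return VOICE_FIT.get(scenario.strip().lower(), "test with users")

print(voice_verdict("Driving"))      # voice essential
print(voice_verdict("Comparisons"))  # screens win
print(voice_verdict("boardroom"))    # test with users
```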
The India Play That Most Companies Are Missing
We keep coming back to this because it is genuinely one of the biggest untapped opportunities in Indian tech right now.
India has 800 million smartphone users. The majority are more comfortable speaking than typing — not because they cannot type, but because voice in their native language is faster, more natural, and less error-prone than typing in English or transliterated Hindi on a small screen.
We saw this firsthand with Mom's Cuddle — the parenting platform we built. The content is bilingual, English and Hindi. The engagement on Hindi content is dramatically higher, not because the content is better, but because the audience can consume it without the cognitive overhead of reading in their second language.
Now extend that insight to voice interfaces. A voice-enabled CRM in Hindi for Indian brokers. A voice-based data collection tool in regional languages for NGO fieldworkers. A voice customer support system that understands Tamil without forcing callers to speak English. A voice-first health information service for parents in rural India.
Each of these doubles the addressable market for existing products overnight. The technology is ready. The users are waiting. Most competitors have not even started thinking about it.
If the broader question is what teams are actually building with AI agents in 2026 — across voice, chatbots, and autonomous workflows — read the companion piece: AI Agents in 2026: What Businesses Are Actually Building — From Chatbots to Autonomous Workflows.
For the practitioner walkthrough of shipping a production AI agent — architecture, guardrails, lead capture, and the mistakes that teach the most — read the companion piece: How We Built an AI Agent That Knows Our Entire Business — And What We Learned.
And if the question behind the question is whether your business website should have a conversational agent at all — voice or text — read the companion piece: Why Every Business Website Needs an AI Chatbot in 2026.
Voice AI is not a feature you add because it sounds impressive in a pitch deck. It is a design decision that fundamentally changes who can use your product. Build it where it solves a real access problem — hands busy, eyes occupied, language barriers, literacy challenges. Skip it everywhere else. The technology is finally mature enough for production. The question is not whether voice AI works — it is whether your specific use case is right for voice. And the only way to know that is to put a microphone button in front of your actual users and watch what happens.
At Entexis, we build voice-enabled applications and AI agents for businesses across North America, MENA, and India — from IVR replacements for customer support to voice-first CRMs for field teams to multilingual voice interfaces for regional language markets. If you are evaluating whether voice makes sense for your product, or adding a voice layer to something you already have, let us run you through a no-pressure discovery session. Start the conversation with Entexis.