Production AI systems. Real numbers, real customers.

Case studies from live deployments — voice agents taking calls, chatbots deflecting tickets, agents reactivating customers, enterprise RAG over permissioned document estates. Every metric below is from production traffic, not a demo environment. NDAs kept where required; architecture shared openly.

Voice Agents · Chatbots · AI Agents · RAG · Evaluation · HIPAA

Voice Agent · Restaurant · NYC

Milina — AI voice agent for a NYC restaurant at $0.09 per call

50+ reservations a night, bilingual (English + Spanish), sub-700ms response latency. LiveKit + Deepgram + GPT-4o-mini + Cartesia. Callers routinely don't realize they're talking to AI.

LiveKit · Deepgram Nova-2 · GPT-4o-mini · Cartesia · Resy · Toast POS

91% Task completion

$0.09 Per call

+22% Bookings MoM

<700ms p50 latency

Read the Milina case →

Voicemail ASR · Swiss B2B Food · Schwyzerdütsch

Swiss German voicemail → structured orders for a food wholesaler

Three-model ensemble (Whisper Turbo + Gemini 2.5 Pro + FHNW Swiss German) with a Claude 4.7 Opus arbiter. Killed Whisper's looping-hallucination failure mode on 796 telephone voicemails. 99 of 100 previously unusable files recovered.

Whisper Turbo · Gemini 2.5 Pro · FHNW Swiss German · Claude 4.7 Opus · FastAPI · Docker

99/100 Files recovered

0 Looping hallucinations

~$15 All-in, 796 files

3× vs. Swiss fine-tune
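A minimal sketch of how looping output can be screened out before an arbiter sees the candidates. The case summary does not say how loops were detected, so the repeated-trigram heuristic, the 0.5 threshold, and the names `repeated_ngram_ratio` and `drop_looping` are all assumptions for illustration, not the production method.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams that repeat an earlier n-gram in the same text.

    Assumption: a transcript stuck in a loop repeats the same short phrase,
    so most of its n-grams are duplicates.
    """
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def drop_looping(candidates: dict[str, str], threshold: float = 0.5) -> dict[str, str]:
    """Keep only candidate transcripts below the (hypothetical) loop threshold."""
    return {
        model: transcript
        for model, transcript in candidates.items()
        if repeated_ngram_ratio(transcript) < threshold
    }


# Hypothetical candidates from two models in the ensemble.
candidates = {
    "model_a": "zwei kilo brot " * 10,
    "model_b": "ich möchte zwei kilo brot bestellen bitte",
}
survivors = drop_looping(candidates)
```

Here only `model_b` survives, and the arbiter would then choose among the remaining transcripts.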

Read the Swiss German case →

Voice Agent · HIPAA · Dental

CleverAnswerAI — HIPAA dental receptionist, 20+ offices

Self-hosted LiveKit on a BAA-covered stack, live for a year. 100% answer rate. 28% more new-patient bookings. Direct integration with Dentrix, Open Dental, Curve, Eaglesoft.

LiveKit (self-hosted) · Deepgram Enterprise · Azure OpenAI · ElevenLabs Enterprise · Dentrix

100% Answer rate

+28% New bookings

20+ Offices

Read the CleverAnswerAI case →

LLM Evaluation · iGaming

iGaming QA — 66% to 91% with schema-guided reasoning

Took a Tier-1 operator's QA accuracy from 66% to 91% and coverage from 2% to 25%. Rubric-as-code, 1,200-case eval harness, two-model ensemble on regulatory criteria.

GPT-4o · Claude Sonnet 3.5 · LangGraph · LangSmith · Pydantic

66→91% Accuracy

2→25% Coverage

$0.04 Per audit

Read the iGaming QA case →

AI Agent · Retail · Reactivation

Dry cleaning chain — AI reactivation agent, 3.5x ROI

192K customer × category intervals scored daily. LangGraph agent picks channel, message, offer, and timing per customer. 18.7% reactivation across 23 treatment categories.

LangGraph · GPT-4o · Twilio SMS · WhatsApp Business · n8n

3.5x ROI vs. control

18.7% Reactivation

60+ Locations

Read the reactivation case →

Call QA · Sales Ops · B2B SaaS

ConvoTune — AI call transcription & scoring for a 40-seat sales org

3,000+ calls scored per month against a 30-point playbook. 89% agreement with human reviewers. Real-time coaching prompts at <300ms. Entire pipeline in client AWS.

Whisper fine-tuned · Deepgram Nova-2 · Azure OpenAI · LangGraph · Terraform

3,000+ Calls/month

89% Scoring agreement

$34 Per seat/mo

Read the ConvoTune case →

Enterprise RAG · UK Construction · NDA

Corporate RAG — ~500 internal users, permission-scoped

A UK building repair, maintenance and refurbishment group (under NDA). Permission-scoped RAG over SharePoint with AWS Kendra + Bedrock + OpenFGA + Keycloak. Document search collapsed from ~15 minutes to seconds (~150× faster).

AWS Bedrock · AWS Kendra · OpenFGA · Keycloak · NestJS 11 · Lambda

~150× Faster search

~500 Internal users

30+ Use cases shipped
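A minimal sketch of what "permission-scoped" retrieval means in practice: retrieved chunks are filtered by the caller's access rights before anything reaches the model. This is an illustration only. The `Chunk` type, the `permission_scoped` function, and the in-memory `acl` lookup are hypothetical stand-ins; in the deployed stack the authorization decision would come from the authorization service rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float  # retrieval relevance, higher is better


def permission_scoped(
    chunks: Iterable[Chunk],
    user: str,
    can_view: Callable[[str, str], bool],
    top_k: int = 3,
) -> list[Chunk]:
    """Drop chunks the user may not see, then keep the top_k by score."""
    allowed = [c for c in chunks if can_view(user, c.doc_id)]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:top_k]


# Hypothetical access-control data standing in for a real authorization check.
acl = {"alice": {"safety-manual", "site-report"}, "bob": {"site-report"}}


def can_view(user: str, doc_id: str) -> bool:
    return doc_id in acl.get(user, set())


retrieved = [
    Chunk("safety-manual", "Scaffold inspection steps", 0.92),
    Chunk("site-report", "Weekly progress summary", 0.81),
    Chunk("hr-policy", "Salary bands", 0.77),
]
bob_context = permission_scoped(retrieved, "bob", can_view)
```

Filtering before generation, rather than after, means a document the user cannot open never enters the prompt in the first place.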

Read the corporate RAG case →

Multi-tenant RAG · DE · NDA

German technical RAG — when the framework wasn't enough

Two German tenants (under NDA): a concrete-products manufacturer and a regional municipal water utility. We deleted the off-the-shelf RAG framework and wrote a single-orchestrator pipeline (rag2). DIN / EN / DWA norms preserved per chunk.

OpenAI embeddings · Pinecone v6 · bm25s · mmarco-mMiniLMv2 · Docling · Flask

2 Tenants live

0 DAG frameworks

23+ Golden eval cases

Read the German technical RAG case →

Older work

Earlier research and data-platform case studies.

Before the commercial RAG work landed, we published an open RAG benchmark write-up and built data stacks on dbt, Snowflake, and Arabic-optimized analytics platforms. That work still pays the bills for those clients, and we still do selective analytics engineering for existing AI clients, but it is no longer our primary practice.

Want similar results? Let's see if your use case ships.

One 20-minute call. Bring us your call volume, your tech stack, or your current conversion rate — we'll tell you honestly whether we can build it, what the architecture looks like, and what it'll cost.

Book a discovery call →

See pricing