Updated 2026-04-26

How We Rank AI Voice Agent Platforms: Our Methodology

This page documents the exact criteria, weights, and processes ContactWithAI uses to evaluate and rank AI voice agent platforms. Every best-of guide and vendor score on this site derives from this framework. If you find an error in our methodology or believe a vendor has been mis-scored, contact us — we update records when we find mistakes.


Why Methodology Transparency Matters

Ranking software is an editorial act with real consequences. A buyer who selects the wrong voice agent platform based on a vague “top 10” list may spend six months onboarding a product that doesn’t fit their compliance requirements, latency tolerance, or team’s technical depth. We think that outcome is our failure, not theirs.

Transparency about methodology serves three purposes:

  1. It makes our scores falsifiable. If we say “Retell scored 4.2/5 on latency because its median end-to-end response time across 50 test calls was 580ms,” you can run the same test and check our math. Scores that can’t be checked shouldn’t be trusted.

  2. It forces us to be specific. Writing a rubric where every point value has a concrete definition prevents the drift toward feel-good, vendor-friendly scores that plague review sites that depend on vendor advertising.

  3. It enables AI search engines to cite us accurately. Perplexity, Claude, and ChatGPT can extract a structured claim from a methodology page and attribute it correctly. A page that says “we ranked these by overall quality” gives an LLM nothing to work with.

This methodology was developed specifically for AI voice agent platforms — software that answers and places phone calls using large language models and text-to-speech synthesis. It is not the same as our CCaaS methodology or our BPO services methodology. The dimensions and weights reflect what actually matters in a real deployment: how the agent sounds, how fast it responds, whether it can legally handle regulated data, what it actually costs, and how long it takes to build something that works.


The Six Dimensions We Evaluate

We score every platform across six dimensions. The weights below reflect the relative importance of each dimension to a buyer deploying an AI voice agent in a production contact center environment. They are not equal because not everything matters equally.

Dimension | Weight | Rationale
Voice Quality & Naturalness | 25% | Call completion rates and customer perception hinge on this above everything else.
Latency | 20% | 700ms is the perceptual threshold: below it, callers stop noticing lag; above it, trust breaks down.
Compliance & Security Posture | 20% | Regulated industries (healthcare, finance, collections) cannot deploy a platform that lacks verifiable certifications.
Pricing Transparency | 15% | Hidden costs are a deployment risk. Vendors that won’t publish real numbers make TCO calculation impossible.
Builder Experience | 10% | Time-to-first-working-agent matters for evaluation cycles and for teams that don’t have dedicated AI engineers.
Integration Ecosystem | 10% | A voice agent that can’t hand off to a CRM, ticketing system, or live agent queue is a dead end in most real workflows.
Total | 100% |

1. Voice Quality & Naturalness (25%)

This is the dimension callers experience directly. A voice agent can have excellent latency and airtight compliance certifications, but if it sounds robotic, callers hang up before they get an answer. We weight this at 25% — the single heaviest dimension — because it is the most visible failure mode in production deployments.

We define voice quality across four sub-factors:

  • Prosody and pacing: Does the agent vary its cadence naturally? Does it breathe? Does it rush or drag through responses?
  • Filler handling: Does the agent handle pauses, “um”s, and interruptions gracefully, or does it barrel through regardless of what the caller is doing?
  • Phonetic accuracy: Does it pronounce proper nouns, product names, and medical or legal terms correctly?
  • TTS model transparency: Does the vendor disclose which TTS provider powers the voice layer (ElevenLabs, Deepgram, PlayHT, Cartesia, proprietary)? Undisclosed TTS makes quality claims unauditable.

Scoring rubric — Voice Quality & Naturalness:

Score | Description
5 | Responses are indistinguishable from a well-coached human agent. Prosody varies naturally with sentence structure. Filler and interruption handling is seamless. Callers do not comment on the voice in exit surveys.
4 | Responses are clearly synthetic but natural enough that most callers complete the interaction without frustration. Minor pacing anomalies on complex sentences. Filler handling is adequate.
3 | Responses sound like a high-quality text-to-speech engine. Callers notice they’re speaking to a machine, but the quality does not cause abandonment in low-stakes flows.
2 | Noticeable artifacts: flat prosody, unnatural pauses, or TTS glitches on common words. Some callers abandon the interaction specifically due to voice quality.
1 | Voice quality actively harms the call. Callers comment negatively, abandon early, or ask to speak to a human immediately due to the voice, not the content.

2. Latency (20%)

End-to-end latency is the elapsed time from when the caller stops speaking to when the agent begins its audible response. This includes speech-to-text (STT) transcription, LLM inference, and text-to-speech (TTS) synthesis. All three legs contribute.

We have selected 700ms as our key threshold based on published research on conversational timing expectations and our own qualitative observations across hundreds of test calls. Below this threshold, callers generally don’t notice the pause. Above 1,000ms, abandonment rates increase measurably. Above 1,500ms, the call begins to feel broken.

We measure median latency across 50+ test calls per platform using a standardized test script on US East Coast infrastructure. We do not accept vendor-published latency benchmarks as evidence; we run our own tests.
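To make that concrete, here is a minimal sketch of the summary statistics we compute per platform. The function name and the sample values are illustrative, not measurements from any real vendor:

import math
import statistics

def latency_summary(latencies_ms):
    # Median and 95th percentile over a batch of test calls, in milliseconds.
    ordered = sorted(latencies_ms)
    median = statistics.median(ordered)
    # Nearest-rank P95: the smallest value at or above 95% of samples.
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    return {"median_ms": median, "p95_ms": p95, "stdev_ms": statistics.pstdev(ordered)}

# Illustrative values only:
print(latency_summary([480, 510, 540, 560, 590, 610, 620, 655, 700, 1020]))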

Scoring rubric — Latency:

Score | Description
5 | Median end-to-end latency under 500ms across 50+ test calls. P95 under 900ms. Callers experience no perceptible lag.
4 | Median latency 500–700ms. P95 under 1,200ms. Occasional pauses noticeable but not disruptive.
3 | Median latency 700–1,000ms. Pauses are noticeable. Appropriate for lower-stakes informational flows where callers accept some wait.
2 | Median latency 1,000–1,500ms. Callers frequently experience the pause as a glitch or silence. Unsuitable for high-volume or emotionally sensitive calls.
1 | Median latency over 1,500ms. Call feels broken. High abandonment from latency alone.
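For transparency, here is how the measured numbers map onto the rubric. The thresholds come straight from the table above; where the table is silent (for example, a 600ms median with a P95 above 1,200ms), this sketch falls to the next band down, which is our convention rather than a rule stated in the table:

def latency_score(median_ms, p95_ms):
    # Map median/P95 latency (ms) onto the 1-5 rubric above.
    if median_ms < 500 and p95_ms < 900:
        return 5
    if median_ms <= 700 and p95_ms < 1200:
        return 4
    if median_ms <= 1000:
        return 3
    if median_ms <= 1500:
        return 2
    return 1

print(latency_score(580, 1100))  # 4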

3. Compliance & Security Posture (20%)

A voice agent that handles customer PII — and almost all of them do — is a regulated system in most jurisdictions and industries. We score compliance not on vendor self-attestation, but on verifiable source documentation.

The certifications and frameworks we check:

  • SOC 2 Type II: Annual third-party audit of security, availability, and confidentiality controls. Type II (covering a period, typically 12 months) is meaningfully different from Type I (a point-in-time assessment). We only credit Type II.
  • HIPAA: Required for healthcare-adjacent deployments. We check for a signed Business Associate Agreement (BAA) as the minimum threshold, plus documentation of HIPAA-compliant data handling practices.
  • GDPR: Required for any data processing touching EU residents. We check for a Data Processing Agreement (DPA) and documented data residency options.
  • PCI DSS: Required if the agent handles payment card data. We check for formal attestation, not just a blog post claiming compliance.

We check compliance by visiting the vendor’s trust center, security page, or legal documentation page and confirming that the cited document exists and is current. A compliance claim with no verifiable source URL does not count. “We are pursuing SOC 2” does not count. “Contact us for our SOC 2 report” counts only if we receive and review the report.
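The mechanical half of that check can be sketched as follows; the function and its fields are ours for illustration, and a passing check still ends with a human reading the document. This assumes the third-party requests library:

from datetime import date, timedelta
import requests  # third-party: pip install requests

def claim_is_verifiable(source_url, issued_on, max_age_days=365):
    # A compliance claim counts only if the cited document resolves and is current.
    try:
        resp = requests.get(source_url, timeout=10)
    except requests.RequestException:
        return False  # dead link: the claim does not count
    is_current = (date.today() - issued_on) <= timedelta(days=max_age_days)
    return resp.status_code == 200 and is_current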

Scoring rubric — Compliance & Security Posture:

Score | Description
5 | SOC 2 Type II verified. HIPAA BAA available. GDPR DPA available. PCI DSS attestation available (if applicable). All verified via primary source URLs. Penetration test reports available on request.
4 | SOC 2 Type II verified. At least one of HIPAA or GDPR verified via primary source. Minor gaps in documentation.
3 | SOC 2 Type II in progress or SOC 2 Type I completed. HIPAA or GDPR available with some documentation gaps.
2 | No SOC 2 Type II. Some security documentation available. Compliance claims present but not fully verifiable from public sources.
1 | No verifiable compliance documentation. Self-attestation only. Insufficient for regulated-industry deployment.

4. Pricing Transparency (15%)

This dimension measures whether a buyer can calculate a real total cost of ownership before signing a contract. We are not scoring whether a vendor is cheap or expensive — we are scoring whether their pricing is knowable.

The total cost of a voice agent deployment typically has four legs: platform fee, LLM inference cost, text-to-speech synthesis cost, and telephony (SIP trunking or managed phone numbers). Some vendors bundle these; some charge separately for each. We evaluate whether the buyer can see all four legs without a sales call.
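When all four legs are published, the arithmetic is straightforward. A sketch with illustrative per-minute rates that are not any vendor’s actual pricing:

def per_minute_tco(platform, llm, tts, telephony):
    # Total cost per call-minute across the four legs.
    return platform + llm + tts + telephony

monthly_minutes = 10_000
rate = per_minute_tco(platform=0.05, llm=0.02, tts=0.03, telephony=0.01)
print(f"${monthly_minutes * rate:,.2f}/month")  # $1,100.00/month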

We penalize “contact sales for pricing” not because enterprise pricing tiers are unreasonable, but because a vendor that hides all pricing forces every buyer to begin a sales cycle just to answer “can we afford this?” That is a friction cost we document.

We do not accept “$$$” tier indicators as pricing disclosure. We require real numbers — per-minute rates, per-seat fees, or at minimum a published pricing page with tiered rates — stamped with the date we verified them.

5. Builder Experience (10%)

This dimension measures how long it takes a competent but non-specialized person to go from a fresh account to a working AI voice agent. We test two paths separately where both exist: the no-code builder and the developer API.

For the no-code builder test, one of our reviewers (a product manager with no active coding practice) sets up a standardized FAQ bot using only the platform’s visual builder. We record time-to-first-live-call and note where they got stuck.

For the developer API test, one of our reviewers (an engineer with Python and API familiarity but no prior exposure to the platform) follows the quickstart documentation from zero to a working inbound call. We record time-to-first-call and note where the documentation was unclear, incomplete, or incorrect.

Platforms are scored on whichever path better serves their target audience. A platform marketed explicitly to developers is not penalized for a rough no-code builder; a platform marketed to no-code operators is not penalized for requiring a developer for custom LLM routing.

6. Integration Ecosystem (10%)

A voice agent that can’t connect to downstream systems creates a dead end. We evaluate four integration categories:

  • CRM integrations: Native Salesforce, HubSpot, and Zoho integrations (or documented webhook patterns that achieve the same result)
  • Helpdesk integrations: Zendesk, Freshdesk, or equivalent ticketing system connections
  • Telephony flexibility: Can buyers bring their own SIP trunk or carrier, or are they locked into the vendor’s telephony stack? BYO telephony is a meaningful cost lever at scale.
  • LLM flexibility: Can buyers bring their own OpenAI, Anthropic, or Mistral key, or is the LLM layer locked? BYO LLM is a cost lever and a compliance lever for buyers in regulated industries who need data residency guarantees from their model provider.

We spot-check 3–5 integrations per vendor to confirm they function as documented, not just that they are listed.
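The spot-check itself reduces to a set comparison: an integration is confirmed only when it appears on both the vendor’s side and the third-party side. A sketch with hypothetical listings:

def unconfirmed_integrations(claimed, vendor_marketplace, partner_marketplaces):
    # Confirmed = listed by the vendor AND findable on the third-party platform.
    confirmed = vendor_marketplace & partner_marketplaces
    return claimed - confirmed

# Hypothetical listings, not data about any real vendor:
print(unconfirmed_integrations(
    claimed={"Salesforce", "HubSpot", "Zendesk"},
    vendor_marketplace={"Salesforce", "HubSpot", "Zendesk"},
    partner_marketplaces={"Salesforce", "Zendesk"},
))  # {'HubSpot'}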


How We Score Each Dimension

Each dimension is scored on a 1–5 scale. The final score is a weighted average:

Final Score = (VoiceQuality × 0.25) + (Latency × 0.20) + (Compliance × 0.20)
            + (Pricing × 0.15) + (BuilderExperience × 0.10) + (Integrations × 0.10)

Scores are rounded to one decimal place. We do not publish a 4.78 when the underlying measurements don’t support that precision.

A vendor must score at least 2.0 on every individual dimension to appear in a best-of guide. A platform with adequate overall scores but a 1.0 on compliance is not suitable for recommendation to contact center buyers without a clear caveat — we would rather not rank it than bury a 1.0 compliance score in a weighted average.
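Putting the weighted average and the 2.0 floor together, a minimal sketch of the computation (the names are ours):

WEIGHTS = {
    "voice_quality": 0.25, "latency": 0.20, "compliance": 0.20,
    "pricing": 0.15, "builder_experience": 0.10, "integrations": 0.10,
}

def final_score(scores):
    # Excluded from best-of guides if any dimension falls below the 2.0 floor.
    if min(scores.values()) < 2.0:
        return None
    # Weighted average, rounded to one decimal place.
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 1)

print(final_score({"voice_quality": 5, "latency": 4, "compliance": 4,
                   "pricing": 3, "builder_experience": 3, "integrations": 4}))  # 4.0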


Hands-On Testing vs. Vendor Claims

We are explicit about what we test ourselves and what we accept from vendor documentation.

What we test hands-on

Latency. We run 50+ test calls per platform using a standardized inbound FAQ script. Calls are placed from US East Coast infrastructure to the vendor’s US region endpoints. We measure end-to-end time from speech end to first audible response. We record the median and the 95th percentile, and we note the variance. We run tests across three different times of day to catch capacity-related slowdowns.

Voice quality. A panel of three reviewers listens to the same set of recorded test calls and scores voice quality independently on the 1–5 rubric above. We average the three scores. A spread of more than 1.0 point between the highest and lowest reviewer scores triggers a calibration discussion before we finalize the score.
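A minimal sketch of that panel step (the function name is ours):

def panel_voice_score(reviewer_scores):
    # Three independent rubric scores; a spread above 1.0 triggers calibration.
    spread = max(reviewer_scores) - min(reviewer_scores)
    average = round(sum(reviewer_scores) / len(reviewer_scores), 1)
    return average, spread > 1.0

print(panel_voice_score([4.0, 3.5, 4.5]))  # (4.0, False)
print(panel_voice_score([2.0, 3.5, 4.0]))  # (3.2, True)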

No-code builder UX. One non-engineer reviewer completes the standardized FAQ bot task and records the time and friction points. We do not script the reviewer’s behavior — we want to observe where a real user actually gets stuck, not measure a power user’s throughput.

Developer API setup. One engineer reviewer follows the quickstart documentation cold and records time-to-first-working-call. We note every place the documentation was ambiguous, incomplete, or pointed to a broken resource.

What we verify from vendor documentation

Compliance certifications. We verify that the cited source URL — trust center page, audit report, certification page — resolves, is current (within 12 months for SOC 2), and matches the claimed certification. We do not conduct our own audits. We do read the publicly available portions of SOC 2 reports where they are shared.

Pricing. We visit the pricing page and record what we see. We stamp it with the date of verification. We note whether the page shows real numbers or deflects to “contact sales.”

Integration lists. We spot-check 3–5 integrations per vendor by attempting to locate the integration in both the vendor’s app marketplace and the third-party platform’s marketplace. We note discrepancies.


What We Do Not Trust

Some vendor claims we refuse to accept without independent verification, because the incentive to overstate is too strong:

Vendor-published latency benchmarks. When a vendor says “our platform achieves sub-500ms latency,” that is marketing, not measurement. The test conditions, infrastructure, geography, script complexity, and LLM model all affect the number. We run our own tests under controlled, consistent conditions.

“Contact sales for pricing” as pricing disclosure. This is not a pricing disclosure. We record a 1 or 2 on pricing transparency for vendors who provide no self-serve pricing information, regardless of how competitive their actual rates may be. Buyers can’t evaluate what they can’t see.

Compliance claims without a verifiable source URL. “We are HIPAA compliant” on a features page, with no link to a BAA offer or HIPAA-specific security documentation, does not count. We require a clickable path to a verifiable document.

G2 or Capterra review counts as quality signals. Review velocity on horizontal platforms can be gamed. We do not use G2 or Capterra scores as inputs to our scoring model.

Case studies published by the vendor. Case studies are marketing materials. We note when a vendor has published case studies but do not count them as independent verification of outcomes.


Update Cadence

Vendor records are re-verified on the following schedule:

  • Quarterly (every 90 days): Pricing page checks for all published vendors. Pricing changes frequently and silently. We re-verify every published pricing claim every 90 days and update last_verified_at accordingly.
  • Semi-annually (every 180 days): Compliance certification checks for all published vendors. SOC 2 Type II certificates expire annually; we verify they have been renewed before the 180-day mark.
  • Upon major product changes: When a vendor announces a significant product update (new LLM, new TTS provider, new pricing structure, new certification), we re-run the relevant portions of our evaluation. We use public announcements, press releases, and changelog pages to detect these events.
  • Upon user reports: If a reader flags that a published fact is incorrect, we investigate within 14 days and update the record if the report is accurate. We note the correction in the vendor’s record changelog.

A record whose last_verified_at date is older than 180 days is considered stale and is flagged in our internal audit tooling. Stale records are not removed from the site, but they receive a visible freshness warning on the vendor profile page.
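A sketch of that staleness check as our audit tooling might express it; last_verified_at comes from our records schema, and the rest is illustrative:

from datetime import date, timedelta

STALE_AFTER = timedelta(days=180)

def is_stale(last_verified_at, today=None):
    # Stale = last verified more than 180 days ago; flagged, not delisted.
    today = today or date.today()
    return today - last_verified_at > STALE_AFTER

print(is_stale(date(2025, 9, 1), today=date(2026, 4, 26)))  # True (237 days)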


Conflicts of Interest

As of April 2026, ContactWithAI has no financial relationships with any vendor listed on this site. Specifically:

  • No affiliate fees. We receive no referral or affiliate commission from any vendor when a buyer clicks through to their website or signs up for a trial.
  • No paid placements. No vendor has paid to appear in a best-of guide, comparison page, or to improve their ranking position.
  • No sponsored content. No vendor has paid for editorial coverage on this site. All reviews and guides are produced independently.
  • No advertising. There are no display ads on this site in v1.
  • No vendor-supplied access. Where we use free trials or developer tiers for testing, we note this. We have not accepted paid enterprise access from any vendor in exchange for coverage.

If this changes in future phases — editorial sponsorships are one potential monetization path we are evaluating for later — we will update this disclosure and add clear labeling to any sponsored content. Ranking positions will remain independent of commercial relationships regardless of monetization approach. Pay-to-rank is not a model we will adopt.


Changelog

Date | Change
2026-04-26 | Methodology published. Initial weights set: Voice Quality 25%, Latency 20%, Compliance 20%, Pricing 15%, Builder Experience 10%, Integrations 10%.

Questions about this methodology? Reach out at msalwet@gmail.com or open an issue on our GitHub repository. We update this page when we update our process.