QA built for conversational AI

Now in beta

Test your AI agents like you ship real software.

Obodek is the human-in-the-loop QA platform for teams shipping chatbots, voice bots, and AI assistants. Run structured tests, gate releases, and catch what automation misses — before your users do.

Start free trial →

Watch demo

obodek.app / acme-support-bot / sprint-24 / regression

Sprint 24 Regression · 84 cases

Run grid

Test case | Reviewer | Turns | Status
Refund > 30 days, gift purchase | jordan | 7 | Pass
Address change mid-checkout | priya | 4 | Pass
Escalate to human, after hours | mateo | 3 | Review
PII redaction, voice transcript | jordan | 11 | Fail
Multi-language switch (es → en) | — | — | Queued
Politely refuse out-of-scope ask | — | — | Queued

Trusted by QA teams testing AI agents at

Syntara · Voxlab · Meridian AI · Crestline · Northpeak

The Problem

Conversational AI broke your QA stack.

The tools and rituals built for deterministic software don't survive contact with non-deterministic agents.

⚠

Generic tools weren't built for conversation

Test-case managers assume one input, one output. Real agent QA spans multi-turn dialogue, tone, memory, and edge-case recovery.

⚠

Automation misses what humans catch

LLM-as-judge catches the obvious. Humans catch the subtle — passive-aggressive tone, brittle escalation, the answer that's technically right but wrong for your brand.

⚠

Spreadsheets don't enforce QA gates

Tabs full of test results can't block a bad release. You need environment gates, audit trails, and sign-off — not another Google Sheet.

Platform

Everything your QA org needs to ship agents.

From structured test grids to release gates, Obodek replaces six tools with one system of record for agent quality.

🧪

Test Grids

Structured multi-turn cases with reviewer assignment, rubrics, and pass/fail criteria — versioned per agent.

🚦

Environment Gating

Block promotion from staging to production unless required test grids pass at your defined threshold.

🐞

Bug Reporting

Reviewers file bugs from inside a test run — full transcript, trace, and reproduction context attached.

📚

Prep Library

Reusable personas, intents, and seed conversations so new test grids start at 80%, not zero.

📝

Change Log

Every prompt, tool, model, and dataset change is timestamped — so you know exactly what broke when.

🔒

Audit Trail

Immutable record of who reviewed what, when, and what verdict they gave. SOC 2-ready out of the box.

How it works

Three steps from prompt to production.

Wire Obodek into your agent pipeline once. Then every release flows through the same gate.

Set up your agent

Connect via API or SDK. Define environments, reviewers, and the rubric you care about.

Run structured tests

Build grids of multi-turn cases. Mix human review with automated checks for breadth and depth.

Gate and promote

Releases are blocked until quality bars are met. Ship to production with the receipts.
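To make the gate idea concrete, here is a minimal sketch of a threshold check, using the sample cases from the run grid above. It is an illustration only, not the Obodek SDK or API: the names `CaseResult`, `pass_rate`, and `can_promote` are hypothetical, and it assumes that any case not marked Pass (including Review and Queued) counts against the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One row of a test grid (hypothetical shape, not an Obodek type)."""
    name: str
    status: str  # "Pass", "Fail", "Review", or "Queued"


def pass_rate(results):
    """Fraction of cases marked Pass; an empty grid counts as 0.0."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.status == "Pass")
    return passed / len(results)


def can_promote(results, threshold):
    """Allow promotion only when the pass rate meets the threshold.

    Assumption: unreviewed cases (Review, Queued) are treated as not passing.
    """
    return pass_rate(results) >= threshold


grid = [
    CaseResult("Refund > 30 days, gift purchase", "Pass"),
    CaseResult("Address change mid-checkout", "Pass"),
    CaseResult("Escalate to human, after hours", "Review"),
    CaseResult("PII redaction, voice transcript", "Fail"),
    CaseResult("Multi-language switch (es → en)", "Queued"),
    CaseResult("Politely refuse out-of-scope ask", "Queued"),
]

if __name__ == "__main__":
    # The 0.95 threshold is an arbitrary example value.
    print(f"pass rate: {pass_rate(grid):.2f}")
    print("promote" if can_promote(grid, 0.95) else "blocked")
```

Counting unreviewed cases as failures is the conservative choice for a release gate: a half-finished run cannot pass by accident.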

Pricing

Start free. Pay when you ship.

Simple, predictable plans. No per-seat surprises, no per-evaluation gotchas.

Free

$0/forever

For solo builders and weekend agents.

1 agent workspace
50 test runs per month
2 reviewers
7-day history
Community support

Get started

Questions, answered.

How is Obodek different from LLM eval frameworks?

Eval frameworks score outputs. Obodek is a QA platform — it manages reviewers, enforces release gates, tracks bugs against test cases, and gives you an audit trail. We integrate with eval frameworks; we don't replace the scoring, we replace the workflow around it.

Do reviewers need engineering skills?

No. The reviewer experience is built for QA analysts, support leads, and domain experts. Engineers wire up the agent and the gates; everyone else runs grids and files bugs through the UI.

What does Obodek connect to?

Any agent reachable via HTTPS — OpenAI, Anthropic, custom orchestrators, voice stacks like Vapi or Retell. We ship SDKs for TypeScript and Python, plus webhooks for CI integration.

How do you handle voice agents?

Voice runs are captured with synchronized transcript, audio, and tool-call trace. Reviewers can scrub the audio inline while marking turns pass or fail — latency and tone both count.

Is our data used to train models?

Never. Customer conversations, test grids, and reviewer notes are tenant-isolated and never used for model training. Enterprise plans add data residency and customer-managed encryption keys.

Ship agents your team can stand behind.

Set up your first test grid in under 10 minutes.

Create your free account →

Talk to sales