Test your AI agents like you ship real software.
Obodek is the human-in-the-loop QA platform for teams shipping chatbots, voice bots, and AI assistants. Run structured tests, gate releases, and catch what automation misses — before your users do.
The Problem
Conversational AI broke your QA stack.
The tools and rituals built for deterministic software don't survive contact with non-deterministic agents.
Generic tools weren't built for conversation
Test-case managers assume one input, one output. Real agent QA spans multi-turn dialogue, tone, memory, and edge-case recovery.
Automation misses what humans catch
LLM-as-judge catches the obvious. Humans catch the subtle — passive-aggressive tone, brittle escalation, the answer that's technically right but wrong for your brand.
Spreadsheets don't enforce QA gates
Tabs full of test results can't block a bad release. You need environment gates, audit trails, and sign-off — not another Google Sheet.
Platform
Everything your QA org needs to ship agents.
From structured test grids to release gates, Obodek replaces six tools with one system of record for agent quality.
Test Grids
Structured multi-turn cases with reviewer assignment, rubrics, and pass/fail criteria — versioned per agent.
Environment Gating
Block promotion from staging to production unless required test grids pass at your defined threshold.
Bug Reporting
Reviewers file bugs from inside a test run — full transcript, trace, and reproduction context attached.
Prep Library
Reusable personas, intents, and seed conversations so new test grids start at 80%, not zero.
Change Log
Every prompt, tool, model, and dataset change is timestamped — so you know exactly what broke when.
Audit Trail
Immutable record of who reviewed what, when, and what verdict they gave. SOC 2-ready out of the box.
How it works
Three steps from prompt to production.
Wire Obodek into your agent pipeline once. Then every release flows through the same gate.
Set up your agent
Connect via API or SDK. Define environments, reviewers, and the rubric you care about.
Run structured tests
Build grids of multi-turn cases. Mix human review with automated checks for breadth and depth.
Gate and promote
Releases are blocked until quality bars are met. Ship to production with the receipts.
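As a rough illustration of the gating step, the check below allows promotion only when every required grid meets a pass-rate threshold. This is a minimal sketch, not Obodek's SDK or API: the `GridResult` and `GateDecision` shapes, the `evaluateGate` function, the grid names, and the numbers are all hypothetical.

```typescript
// Hypothetical shapes for illustration only; not Obodek's actual SDK types.
interface GridResult {
  grid: string;      // test grid name
  passed: number;    // cases reviewers marked as pass
  total: number;     // cases reviewed
  required: boolean; // whether this grid gates the release
}

interface GateDecision {
  allowed: boolean;
  failing: string[]; // required grids that fell below the threshold
}

// Allow promotion only if every required grid meets the pass-rate
// threshold (a fraction between 0 and 1). A required grid with no
// reviewed cases is treated as failing, since nothing was verified.
function evaluateGate(results: GridResult[], threshold: number): GateDecision {
  const failing = results
    .filter((r) => r.required)
    .filter((r) => r.total === 0 || r.passed / r.total < threshold)
    .map((r) => r.grid);
  return { allowed: failing.length === 0, failing };
}

// Made-up example data: one required grid, one optional grid.
const decision = evaluateGate(
  [
    { grid: "checkout-regression", passed: 76, total: 84, required: true },
    { grid: "tone-exploration", passed: 3, total: 10, required: false },
  ],
  0.95,
);

console.log(
  decision.allowed ? "promote" : `blocked: ${decision.failing.join(", ")}`,
);
```

In a CI job, a script like this could exit with a non-zero status when `allowed` is false, so the pipeline stops before the production deploy. Optional grids are reported elsewhere but never block a release in this sketch.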
Pricing
Start free. Pay when you ship.
Simple, predictable plans. No per-seat surprises, no per-evaluation gotchas.
Free
For solo builders and weekend agents.
- 1 agent workspace
- 50 test runs per month
- 2 reviewers
- 7-day history
- Community support
Pro
For QA teams shipping agents to real users.
- Unlimited test grids & runs
- Environment gating & release blocks
- Up to 20 reviewers
- Bug reporting + change log
- 90-day history & exports
- Priority email support
Enterprise
For regulated industries and large agent fleets.
- SSO, SCIM, role-based access
- Immutable audit trail
- SOC 2 Type II & data residency
- Custom rubrics & integrations
- Dedicated CSM & SLAs
14-day free trial on Pro. No credit card required.
FAQ
Questions, answered.
How is Obodek different from LLM eval frameworks?
Eval frameworks score outputs. Obodek is a QA platform — it manages reviewers, enforces release gates, tracks bugs against test cases, and gives you an audit trail. We integrate with eval frameworks; we don't replace the scoring, we replace the workflow around it.
Do reviewers need engineering skills?
No. The reviewer experience is built for QA analysts, support leads, and domain experts. Engineers wire up the agent and the gates; everyone else runs grids and files bugs through the UI.
What does Obodek connect to?
Any agent reachable via HTTPS — OpenAI, Anthropic, custom orchestrators, voice stacks like Vapi or Retell. We ship SDKs for TypeScript and Python, plus webhooks for CI integration.
How do you handle voice agents?
Voice runs are captured with synchronized transcript, audio, and tool-call trace. Reviewers can scrub the audio inline while marking turns pass or fail — latency and tone both count.
Is our data used to train models?
Never. Customer conversations, test grids, and reviewer notes are tenant-isolated and never used for model training. Enterprise plans add data residency and customer-managed encryption keys.
Ship agents your team can stand behind.
Set up your first test grid in under 10 minutes.