VRUNAI

Evaluate agents beyond output.
Path, tools, and outcome in one run.

The eval framework that catches what output-only testing misses. Define your agent in YAML. Run against any provider. See exactly where it fails — path, tools, and outcome.

$ npm install -g vrunai


vrunai evaluate — Code Review Assistant

zsh

EVAL COMPLETE | gpt-4.1-mini · 2 scenarios · 10 runs

$0.0133

Scenario	path	tool	out	runs	cost
✓ style_violation_pr	100%	100%	100%	5/5	$0.0071
× security_issue_pr	80%	80%	100%	4/5	$0.0062

run 1/5 · 1 failed step · 2.0s total

[fetch] → fetch_diff ✓ 672ms

[security] → check_security ✓ 694ms

[style] → check_style × not called

[review] → post_review ✓ 617ms

Average path 90% tool 90% out 100%

vrunai v1.2.0 node v20.11

Core Features

Beyond output.
Beyond accuracy.

Path Accuracy

Did the agent follow the expected execution path? Catches agents that reach the right answer through the wrong steps.

classify → lookup → escalate → refund
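One way to picture this metric is sketched below. It assumes a run counts as correct only when its steps match the expected path exactly and in order, and that the score is the share of matching runs; `pathMatches` and `pathAccuracy` are hypothetical helper names, not part of VRUNAI's API.

```typescript
// Illustrative sketch only; VRUNAI's real scoring logic may differ.
type Path = string[];

// A run matches only if it visits the same steps in the same order.
function pathMatches(expected: Path, actual: Path): boolean {
  return (
    expected.length === actual.length &&
    expected.every((step, i) => step === actual[i])
  );
}

// Path accuracy for one scenario: share of runs whose path matched.
function pathAccuracy(expected: Path, runs: Path[]): number {
  if (runs.length === 0) return 0;
  const matched = runs.filter((run) => pathMatches(expected, run)).length;
  return matched / runs.length;
}

const expectedPath: Path = ["classify", "lookup", "escalate", "refund"];
const observedPaths: Path[] = [
  ["classify", "lookup", "escalate", "refund"],
  ["classify", "lookup", "escalate", "refund"],
  ["classify", "lookup", "refund"], // skipped "escalate"
  ["classify", "lookup", "escalate", "refund"],
  ["classify", "lookup", "escalate", "refund"],
];

console.log(pathAccuracy(expectedPath, observedPaths)); // 0.8
```

Under this assumption, an agent that still issues the refund but skips the escalation step is counted as a path failure even though its final answer may look correct.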

Tool Accuracy

Were the right tools called in the right order? Detects skipped, hallucinated, or misordered tool calls.

✓ fetch_diff

✓ check_security

× check_style — not called
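The trace above can be reproduced with a small comparison routine. This is a sketch under stated assumptions: each tool appears at most once in the expected list, and `ToolReport` and `compareToolCalls` are hypothetical names rather than VRUNAI's actual output format.

```typescript
// Illustrative sketch only; not VRUNAI's real report format.
interface ToolReport {
  skipped: string[];    // expected but never called
  unexpected: string[]; // called but not in the expected list
  misordered: boolean;  // shared tools called in a different relative order
}

function compareToolCalls(expected: string[], actual: string[]): ToolReport {
  const skipped = expected.filter((tool) => !actual.includes(tool));
  const unexpected = actual.filter((tool) => !expected.includes(tool));
  // Keep only the first call of each expected tool, in call order.
  const calledInOrder = actual.filter(
    (tool, i) => expected.includes(tool) && actual.indexOf(tool) === i
  );
  const expectedInOrder = expected.filter((tool) => actual.includes(tool));
  const misordered = expectedInOrder.some(
    (tool, i) => tool !== calledInOrder[i]
  );
  return { skipped, unexpected, misordered };
}

const expectedTools = ["fetch_diff", "check_security", "check_style", "post_review"];
const calledTools = ["fetch_diff", "check_security", "post_review"];

console.log(JSON.stringify(compareToolCalls(expectedTools, calledTools)));
// {"skipped":["check_style"],"unexpected":[],"misordered":false}
```

Separating skipped, unexpected, and misordered calls keeps the three failure modes named in the description distinguishable in a report.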

Outcome Accuracy

Did the agent produce the correct final output? Classic output evaluation, but now in context with path and tool data.

expected: { success: true }

received: { success: true }
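A minimal sketch of such a check follows. It assumes the outcome is JSON-like and that a run passes when every expected field is present in the received output with an equal value (extra received fields are ignored); that matching rule and the `outcomeMatches` name are assumptions, not documented VRUNAI behavior.

```typescript
// Illustrative sketch only; the matching rule is an assumption.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// True when every value in `expected` appears in `received` with an equal value.
function outcomeMatches(expected: Json, received: Json): boolean {
  if (expected === null || typeof expected !== "object") {
    return expected === received; // primitives and null compare directly
  }
  if (received === null || typeof received !== "object") return false;
  if (Array.isArray(expected)) {
    if (!Array.isArray(received) || expected.length !== received.length) {
      return false;
    }
    const receivedItems: Json[] = received;
    return expected.every((item, i) => outcomeMatches(item, receivedItems[i]));
  }
  if (Array.isArray(received)) return false;
  const expectedFields: { [key: string]: Json } = expected;
  const receivedFields: { [key: string]: Json } = received;
  return Object.keys(expectedFields).every(
    (key) =>
      key in receivedFields &&
      outcomeMatches(expectedFields[key], receivedFields[key])
  );
}

console.log(outcomeMatches({ success: true }, { success: true }));  // true
console.log(outcomeMatches({ success: true }, { success: false })); // false
```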

Consistency Scoring

Run each scenario N times and measure how often the agent takes the same path. Surface non-deterministic behavior before production.
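As a rough illustration, consistency can be read as the share of runs that follow the most common path. That definition and the `consistencyScore` helper are assumptions made for this sketch; the exact formula VRUNAI uses is not stated here.

```typescript
// Illustrative sketch only; assumes consistency = share of runs on the modal path.
function consistencyScore(pathsPerRun: string[][]): number {
  if (pathsPerRun.length === 0) return 0;
  const counts = new Map<string, number>();
  let mostCommon = 0;
  for (const path of pathsPerRun) {
    const key = JSON.stringify(path); // unambiguous key for a sequence of steps
    const seen = (counts.get(key) ?? 0) + 1;
    counts.set(key, seen);
    if (seen > mostCommon) mostCommon = seen;
  }
  return mostCommon / pathsPerRun.length;
}

const runPaths = [
  ["fetch", "security", "style", "review"],
  ["fetch", "security", "review"], // one run skipped the style check
  ["fetch", "security", "style", "review"],
  ["fetch", "security", "style", "review"],
  ["fetch", "security", "style", "review"],
];

console.log(consistencyScore(runPaths)); // 0.8
```

Note that this measures agreement between runs, not correctness: an agent that takes the same wrong path every time would still score 1 here, which is why it sits alongside path accuracy rather than replacing it.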

Cost Tracking

Real-time cost calculation per scenario per provider based on token usage. Model pricing and context window stats included.
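Token-based cost tracking usually reduces to simple arithmetic, sketched below. The `ModelPricing` and `TokenUsage` shapes are hypothetical, and the rates are placeholders chosen for easy arithmetic, not real prices for any model.

```typescript
// Illustrative sketch only; field names and rates are placeholders.
interface ModelPricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of a single run from its token counts.
function runCost(usage: TokenUsage, pricing: ModelPricing): number {
  return (
    (usage.inputTokens / 1_000_000) * pricing.inputPerMillion +
    (usage.outputTokens / 1_000_000) * pricing.outputPerMillion
  );
}

// Cost of a scenario: sum over all of its runs.
function scenarioCost(runs: TokenUsage[], pricing: ModelPricing): number {
  return runs.reduce((sum, usage) => sum + runCost(usage, pricing), 0);
}

const placeholderPricing: ModelPricing = { inputPerMillion: 1, outputPerMillion: 2 };
const usageLog: TokenUsage[] = [
  { inputTokens: 1000, outputTokens: 500 },
  { inputTokens: 2000, outputTokens: 1000 },
];

console.log(scenarioCost(usageLog, placeholderPricing).toFixed(4)); // "0.0060"
```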

YAML-Based Agent Definition Language

Define tools, mock data, conditional flows, and test scenarios in a single YAML spec. No code required to evaluate an agent.

Supported Providers

One spec.
Every provider.

Run the same scenarios against multiple providers simultaneously. Compare results, costs, and traces side by side.

OpenAI

Anthropic

Google

xAI

DeepSeek

Mistral

Ollama

+ Custom

+ Growing model catalog

Continuously adding new models — bring your own too

Workflow

Three steps.
Full visibility.

Define

Write your agent spec in YAML — tools, flows, mock data, and test scenarios using the Agent Definition Language.

agent.yaml

agent:
  name: "support-bot"
  tools: [search, reply]
  scenarios: 12

Evaluate

Run scenarios against any provider. VRUNAI tracks path, tool, and outcome accuracy across every execution.

running evaluation

path accuracy

94%

tool calls

87%

outcomes

91%

Analyze

Compare providers side by side. See exactly where each agent fails — wrong paths, missed tools, bad outputs.

provider comparison

sonnet

✓ ✓ ✓ ✗ ✓

4/5

gpt-4o

✓ ✗ ✓ ✓ ✗

3/5

Quick Start

Up and running
in two commands.

Terminal — quickstart

zsh

1 Install globally

$ npm install -g vrunai

+ vrunai@1.2.0 installed

2 Launch the interactive TUI

$ vrunai

VRUNAI v1.2.0

? What would you like to do?

▶ Evaluate

LLM Providers

Model Catalog

History

3 Define your agent in YAML

customer_support.yml

agent:
  name: "Customer Support Triage"
  instruction: "You are a customer support assistant..."

tools:
  - name: "classify_inquiry"
    input:  { message: "string" }
    output: { type: "string", urgency: "string" }

  - name: "lookup_order"
    input:  { order_id: "string" }
    output: { status: "string", eligible_for_refund: "boolean" }

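# Excerpt: the flow steps and the issue_refund tool referenced below are not shown in this snippet.
scenarios: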
  - name: "late_delivery_auto_refund"
    input: "My order #ORD-8821 hasn't arrived"
    expected_path: ["classify", "lookup", "auto_refund"]
    expected_tools: ["classify_inquiry", "lookup_order", "issue_refund"]

providers:
  - { name: "openai", model: "gpt-4o" }
  - { name: "anthropic", model: "claude-sonnet-4" }

4 Run evaluation

$ vrunai evaluate --spec customer_support.yml

Running 3 scenarios × 2 providers × 3 runs each...

openai/gpt-4o

done

anthropic/claude-sonnet-4

14/18

vrunai v1.2.0 node v20.11

Terminal TUI

Interactive interface

Full-featured terminal UI with screens for evaluation, provider management, model catalog, and history.

▶ Evaluate

▶ Providers

▶ Catalog

▶ History

Web App

React interface

Same evaluation power in the browser. Preview scenarios, visualize execution traces, and compare results.

$ vrunai web (or use the hosted version at app.vrunai.com)

Example Specs

Ready-to-run YAML files

● customer_support.yml

● expense_approval.yml

● security_incident.yml

● +3 more in use_cases/

Web UI

See your results.
In the browser.

Compare providers side by side, visualize execution traces, inspect tool calls, and browse evaluation history — all from a local web interface.

$ vrunai web

Runs locally in your browser

Or try the hosted version

localhost:3000

[Screenshot: VRUNAI Web UI showing model comparison, execution traces, and tool call visualization]

Privacy · Zero backend

Your keys.
Your machine.

The CLI runs entirely on your machine — your keys never leave your terminal. In the web app, API keys are stored only in your browser's localStorage and are sent directly to the provider APIs you configure. Nothing passes through our servers.

No accounts. No backend. No telemetry. Fully client-side.

Open Source · AGPL-3.0

Built in the open.
Fork it. Ship it.

VRUNAI is fully open source. Read every line, contribute fixes, extend the evaluation engine, or build your own provider plugins. No vendor lock-in, no hidden code.


Why VRUNAI

What makes it
different.

Most eval tools only check the final output. VRUNAI evaluates the entire agent execution — path, tools, and outcome.

✓

Path accuracy tracking

Verify agents follow the expected execution path, not just produce the right output.

✓

Tool call validation

Detect skipped, hallucinated, or misordered tool calls across every run.

✓

No backend required

Runs entirely on your machine. Keys never leave your terminal.

✓

YAML-only config

Define tools, flows, mock data, and scenarios — no code required.

✓

Multi-provider comparison

Run the same scenarios against multiple providers side by side.

✓

Consistency scoring

Run each scenario N times to surface non-deterministic behavior.

✓

Built-in cost tracking

Real-time cost per scenario per provider. 26 models with pricing included.

✓

Fully open source

AGPL-3.0 licensed. Read every line, fork it, extend it.

FAQ

Questions.
Answered.

How are accuracy scores calculated?

The composite score is a weighted average of three components: path accuracy (30%), tool call validation (40%), and outcome verification (30%). Every scenario is checked against the assertions defined in your YAML spec.
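The 30/40/30 weighting in this answer can be written out as a short calculation. The weights come from the answer itself; the `compositeScore` function and its field names are illustrative and not VRUNAI's actual API.

```typescript
// Illustrative sketch only; weights are from the FAQ, names are hypothetical.
const WEIGHTS = { path: 0.3, tool: 0.4, outcome: 0.3 };

interface ComponentScores {
  path: number;    // path accuracy, 0 to 100
  tool: number;    // tool accuracy, 0 to 100
  outcome: number; // outcome accuracy, 0 to 100
}

function compositeScore(scores: ComponentScores): number {
  return (
    WEIGHTS.path * scores.path +
    WEIGHTS.tool * scores.tool +
    WEIGHTS.outcome * scores.outcome
  );
}

// Averages from the sample run at the top of the page: path 90%, tool 90%, out 100%.
console.log(compositeScore({ path: 90, tool: 90, outcome: 100 }).toFixed(1)); // "93.0"
```

With those sample averages the arithmetic is 0.3 × 90 + 0.4 × 90 + 0.3 × 100 = 93, so a tool-call miss costs more than an equally sized path or outcome miss.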

Can I run private models?

Yes. VRUNAI supports local inference via Ollama or custom API endpoints through the provider-plugin architecture. Any model accessible via an OpenAI-compatible API works out of the box.

Is there a web interface?

Yes. VRUNAI ships both a terminal TUI and a React web app. Run vrunai web locally or use the hosted version at app.vrunai.com.

Do I need to write code to create evaluations?

No. Everything is defined in YAML using the Agent Definition Language — tools, mock data, conditional flows, and test scenarios. See the use_cases/ directory for ready-to-run examples.

Is VRUNAI free?

Yes, completely. VRUNAI is open source under the AGPL-3.0 license. You only pay for the LLM API calls to the providers you configure — VRUNAI itself has no fees, subscriptions, or usage limits.

Ready to evaluate
your agents?

Install the CLI, define your agent in YAML, and see exactly where it fails — in under two minutes.

$ npm install -g vrunai

