The prompt diff tool for engineers.
Stop guessing whether your prompt changes actually helped. llm-diff gives you a precise, technical breakdown — token deltas, cost savings, latency shifts, and a word-level response diff — all in one command.
Token Delta
See exactly how many input and output tokens were saved or added between prompt versions.
Cost Tracking
Real-time cost calculation based on the latest published pricing for every supported model.
Latency Profiling
Measure time-to-first-token and total latency shifts. Spot regressions before they reach prod.
Word-Level Diff
Colored inline diff of actual model responses — not just stats, but what actually changed.
Multi-Run Averaging
LLM outputs are stochastic. Average over N runs to get statistically reliable comparisons.
Programmatic API
Import runDiff() directly into your eval suite, CI pipeline, or observability stack.
Why llm-diff?
Prompt engineering is still largely vibes-driven. You tweak wording, run it manually, eyeball the output, and guess whether it's better. llm-diff adds rigor to this loop.
Unlike playground UIs or LangSmith traces, llm-diff is designed to live in your terminal and CI pipeline. It speaks the language of engineers: flags, pipes, JSON, exit codes.
Installation
Run instantly with npx (no install)
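Assuming the package is published on npm under the name `llm-diff`, you can run it without installing:

```shell
npx llm-diff -a v1.txt -b v2.txt -m gpt-4o-mini
```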
Install globally
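For example (package name assumed from these docs):

```shell
npm install -g llm-diff
```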
Add to a project
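To keep it as a dev dependency in a project:

```shell
npm install --save-dev llm-diff
```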
Requirements: Node.js ≥ 18. No external dependencies beyond the provider SDKs.
Quick Start
1. Set your API key
Export the key for your chosen provider:
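The exact variable names llm-diff reads are not shown in these docs; the examples below use each provider's conventional environment variable names:

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
export GROQ_API_KEY="gsk_..."
```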
2. Compare two prompt files
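Using the flags documented below (file paths are placeholders):

```shell
llm-diff -a prompts/v1.txt -b prompts/v2.txt -m gpt-4o
```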
3. Or compare inline text
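If the argument is not an existing file path, it is treated as inline text:

```shell
llm-diff -a "Summarize this article." -b "Summarize this article in three bullet points." -m gpt-4o-mini
```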
4. Add a system prompt
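The `--system` flag accepts a file path or inline text, same as `-a` and `-b`:

```shell
llm-diff -a v1.txt -b v2.txt -m claude-sonnet-4-20250514 -s "You are a concise technical writer."
```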
5. Average multiple runs for reliability
Use `--runs 5` or higher any time you're comparing prompts with non-zero temperature. Single-run comparisons can be misleading due to stochastic variation.
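For example, averaging five runs at a non-zero temperature:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --temperature 0.7 --runs 5
```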
Commands
Required flags
| Flag | Description |
|---|---|
| `--a`, `-a` | Prompt A — file path or inline text string |
| `--b`, `-b` | Prompt B — file path or inline text string |
| `--model`, `-m` | Model name. Run `llm-diff --models` for the full list. |
Useful one-liners
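A few combinations of the documented flags (the `jq` pipe assumes you have `jq` installed):

```shell
# List supported models and their pricing
llm-diff --models

# Machine-readable output for scripting
llm-diff -a v1.txt -b v2.txt -m gpt-4o --json | jq .

# Full word-level diff, sequential requests for rate-limited keys
llm-diff -a v1.txt -b v2.txt -m gpt-4o-mini --full --no-parallel
```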
Full Options
| Flag | Default | Description |
|---|---|---|
--a, -a | — | Path to file or inline text for Prompt A. |
--b, -b | — | Path to file or inline text for Prompt B. |
--model, -m | — | Model name (e.g. gpt-4o, claude-sonnet-4-20250514). |
--system, -s | — | System prompt — file path or inline text. |
--temperature | 0 | Sampling temperature (0–2). Use 0 for deterministic runs. |
--max-tokens | 2048 | Maximum output token budget per call. |
--runs | 1 | Average results over N runs. Recommended ≥ 3 at temp > 0. |
--no-parallel | — | Run A and B sequentially instead of in parallel. |
--full | — | Show full word-level inline diff of the response text. |
--json | false | Output results as machine-readable JSON. |
--base-url | — | Gateway URL override (e.g. https://gw.llmhut.com/v1). |
--timeout | 60000 | Request timeout in milliseconds. |
--models | — | Print all supported models and their current pricing, then exit. |
Models & Pricing
llm-diff bundles up-to-date pricing for all major providers. Run llm-diff --models for the live list. Pricing below is indicative — always verify with your provider dashboard.
| Model | Provider | Input / 1M tk | Output / 1M tk | Notes |
|---|---|---|---|---|
| `gpt-4o` | OpenAI | $2.50 | $10.00 | Flagship multimodal model |
| `gpt-4o-mini` | OpenAI | $0.15 | $0.60 | Best cost/quality for most tasks |
| `gpt-4-turbo` | OpenAI | $10.00 | $30.00 | 128K context window |
| `o1` | OpenAI | $15.00 | $60.00 | Extended reasoning; slow |
| `o3-mini` | OpenAI | $1.10 | $4.40 | Reasoning at lower cost |
| `claude-sonnet-4-20250514` | Anthropic | $3.00 | $15.00 | Current Sonnet; strong reasoning |
| `claude-3.5-haiku` | Anthropic | $0.80 | $4.00 | Fast, cheap Anthropic option |
| `claude-3-opus` | Anthropic | $15.00 | $75.00 | Most capable Claude model |
| `gemini-2.0-flash` | Google | $0.10 | $0.40 | Fastest Gemini; great for evals |
| `gemini-1.5-pro` | Google | $1.25 | $5.00 | 2M token context window |
| `llama-3.3-70b` | Groq | $0.59 | $0.79 | Extremely fast inference via Groq |
| `mixtral-8x7b` | Groq | $0.24 | $0.24 | MoE; efficient for classification |
JSON Output
Pipe results into your own scripts, dashboards, or eval pipelines with --json:
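For example, capturing a comparison to a file for downstream tooling:

```shell
llm-diff -a prompts/v1.txt -b prompts/v2.txt -m gpt-4o --json > diff.json
```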
The full JSON schema:
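The exact schema is not reproduced here; the shape below is an illustrative sketch, and every field name in it is an assumption:

```json
{
  "model": "gpt-4o",
  "runs": 1,
  "a": { "inputTokens": 312, "outputTokens": 540, "costUsd": 0.0062, "latencyMs": 2140 },
  "b": { "inputTokens": 224, "outputTokens": 498, "costUsd": 0.0055, "latencyMs": 1980 },
  "delta": { "inputTokens": -88, "outputTokens": -42, "costUsd": -0.0007, "latencyMs": -160 }
}
```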
How It Works
Under the hood, llm-diff coordinates a small pipeline for every comparison:
Resolve model → provider
The model name is matched against the built-in registry to determine the provider, API base URL, and per-token pricing rates.
Read prompts A and B
Each argument is treated as a file path first; if the file doesn't exist, the value is used as inline text. System prompts follow the same logic.
Fire both requests (in parallel by default)
Requests are sent concurrently using Promise.all() to minimize wall-clock time. Use --no-parallel for rate-limited accounts.
Collect usage metadata
Token counts and latency are read directly from the API response object — no client-side estimation. Cost is computed from the live pricing registry.
Compute deltas
Absolute and percentage deltas are calculated for tokens, cost, and latency. With --runs N, all values are averaged across runs.
Generate word-level diff
Response texts are diffed at word granularity using a Myers-style algorithm. Added words are green, removed words are red.
Render to terminal or JSON
Output is formatted using ANSI color codes for terminal display, or serialized to structured JSON with --json.
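To make the diff step concrete, here is a minimal sketch of a word-level diff. It uses a plain LCS dynamic program rather than llm-diff's actual Myers-style implementation (both minimize the same edit script; this O(n·m) version is just easier to read):

```javascript
// Word-level diff via longest-common-subsequence table.
// A Myers-style diff produces the same minimal edit script more efficiently.
function wordDiff(textA, textB) {
  const a = textA.split(/\s+/).filter(Boolean);
  const b = textB.split(/\s+/).filter(Boolean);
  // lcs[i][j] = length of the LCS of a[i:] and b[j:]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table, emitting keep / del / add operations.
  const ops = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { ops.push({ op: "keep", word: a[i] }); i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) { ops.push({ op: "del", word: a[i] }); i++; }
    else { ops.push({ op: "add", word: b[j] }); j++; }
  }
  while (i < a.length) ops.push({ op: "del", word: a[i++] });
  while (j < b.length) ops.push({ op: "add", word: b[j++] });
  return ops;
}

const ops = wordDiff("the quick brown fox", "the slow brown fox jumps");
console.log(ops.map(o =>
  o.op === "del" ? `-${o.word}` : o.op === "add" ? `+${o.word}` : o.word
).join(" "));
// → the -quick +slow brown fox +jumps
```

In the real tool, `del` words would be rendered red and `add` words green via ANSI escape codes.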
Gateway Support
Route requests through any OpenAI-compatible gateway — LLM Hut, LiteLLM, Azure OpenAI, or your own proxy — using --base-url:
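Using the gateway URL from the options table above:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --base-url https://gw.llmhut.com/v1
```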
Azure OpenAI example
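The resource name and endpoint path below are placeholders; check your Azure deployment for the exact OpenAI-compatible URL:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o \
  --base-url https://YOUR-RESOURCE.openai.azure.com/openai/v1
```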
LiteLLM Proxy
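A LiteLLM proxy listens on port 4000 by default:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --base-url http://localhost:4000
```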
Averaging Multiple Runs
LLM responses are stochastic — the same prompt can return different token counts, different response lengths, and different latencies on each call. A single-run comparison can easily mislead.
Use --runs N to fire each prompt N times and average all metrics:
What gets averaged: token counts, cost, and latency. The response text shown in the diff is taken from the final run.
At temperature=0, 1–3 runs is sufficient. At temperature > 0, use at least 5 runs. For critical prompt decisions, consider 10+ runs and pipe results to a JSON aggregator.
Piping averaged results
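For example (the `.delta` field path is illustrative — inspect your own `--json` output for the real key names):

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --runs 10 --json | jq '.delta'
```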
Programmatic API
Import runDiff() directly to integrate llm-diff into eval suites, CI checks, or observability pipelines.
Basic usage
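A minimal sketch — the real `runDiff()` option names and result shape may differ from what is assumed here:

```javascript
import { runDiff } from "llm-diff";

const result = await runDiff({
  a: "prompts/v1.txt",        // file path or inline text, like the CLI
  b: "prompts/v2.txt",
  model: "gpt-4o-mini",
});
console.log(result);
```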
Full options
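The sketch below assumes the options mirror the CLI flags one-to-one; treat every key name as an assumption:

```javascript
import { runDiff } from "llm-diff";

const result = await runDiff({
  a: "prompts/v1.txt",
  b: "prompts/v2.txt",
  model: "claude-sonnet-4-20250514",
  system: "You are a concise technical writer.",
  temperature: 0.7,
  maxTokens: 2048,
  runs: 5,
  parallel: true,                      // --no-parallel sets this false
  baseUrl: "https://gw.llmhut.com/v1",
  timeout: 60000,                      // milliseconds
});
```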
CI integration example (Node.js)
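A hypothetical CI gate that fails the build when the candidate prompt raises cost too much. The result field names (`a.costUsd`, `b.costUsd`) are assumptions, not the documented schema:

```javascript
import { runDiff } from "llm-diff";

const result = await runDiff({
  a: "prompts/main.txt",
  b: "prompts/candidate.txt",
  model: "gpt-4o-mini",
  runs: 3, // average out stochastic variation
});

// Fail the pipeline if cost increases by more than 10%.
const pctChange = (result.b.costUsd - result.a.costUsd) / result.a.costUsd;
if (pctChange > 0.1) {
  console.error(`Prompt candidate raises cost by ${(pctChange * 100).toFixed(1)}%`);
  process.exit(1);
}
```

Because the CLI also sets exit codes, the same gate could be expressed in shell by checking `$?` after an `llm-diff --json` call.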
Roadmap
What's shipping next, and what's on the horizon. Contributions welcome — see CONTRIBUTING.md.
- ✓ Multi-provider support: OpenAI, Anthropic, Google Gemini, Groq all supported with live pricing.
- ✓ JSON output mode: full structured JSON for scripting and pipeline integration.
- ✓ Multi-run averaging: average token counts, cost, and latency across N runs.
- Side-by-side terminal diff view: split-pane terminal layout showing both responses side by side.
- Cross-model comparison: `--model-a gpt-4o --model-b claude-sonnet-4-20250514` — same prompt, different models.
- HTML & PDF report export: shareable visual reports with diff, metrics, and run history.
- Config file support: persist defaults to `.llm-diff.json` so you don't repeat flags every run.
- Streaming with live token counting: watch token counts update in real time as the model streams its response.
- Mistral, Cohere, Together AI: expanding provider coverage to additional popular inference APIs.
- Named experiments & history: track prompt iterations over time with a local SQLite store. Tag, query, and chart history.