The prompt diff tool for engineers.
Stop guessing whether your prompt changes actually helped. llm-diff gives you a precise, technical breakdown — token deltas, cost savings, latency shifts, and a word-level response diff — all in one command.
Token Delta
See exactly how many input and output tokens were saved or added between prompt versions.
Cost Tracking
Real-time cost calculation based on the latest published pricing for every supported model.
Latency Profiling
Measure time-to-first-token and total latency shifts. Spot regressions before they reach prod.
Word-Level Diff
Colored inline diff of actual model responses — not just stats, but what actually changed.
Multi-Run Averaging
LLM outputs are stochastic. Average over N runs to get statistically reliable comparisons.
Programmatic API
Import runDiff() directly into your eval suite, CI pipeline, or observability stack.
Why llm-diff?
Prompt engineering is still largely vibes-driven. You tweak wording, run it manually, eyeball the output, and guess whether it's better. llm-diff adds rigor to this loop.
Unlike playground UIs or LangSmith traces, llm-diff is designed to live in your terminal and CI pipeline. It speaks the language of engineers: flags, pipes, JSON, exit codes.
Installation
Run instantly with npx (no install)
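Assuming the package is published on npm under the name `llm-diff`, you can run it without installing:

```shell
npx llm-diff -a v1.txt -b v2.txt -m gpt-4o-mini
```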
Install globally
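For example (package name assumed from these docs):

```shell
npm install -g llm-diff
```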
Add to a project
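To keep it as a dev dependency in a project:

```shell
npm install --save-dev llm-diff
```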
Requirements: Node.js ≥ 18. No external dependencies beyond the provider SDKs.
Quick Start
1. Set your API key
Export the key for your chosen provider:
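The exact variable names llm-diff reads are not shown in these docs; the examples below use each provider's conventional environment variable names:

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
export GROQ_API_KEY="gsk_..."
```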
2. Compare two prompt files
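Using the flags documented below (file paths are placeholders):

```shell
llm-diff -a prompts/v1.txt -b prompts/v2.txt -m gpt-4o
```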
3. Or compare inline text
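If the argument is not an existing file path, it is treated as inline text:

```shell
llm-diff -a "Summarize this article." -b "Summarize this article in three bullet points." -m gpt-4o-mini
```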
4. Add a system prompt
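The `--system` flag accepts a file path or inline text, same as `-a` and `-b`:

```shell
llm-diff -a v1.txt -b v2.txt -m claude-sonnet-4-20250514 -s "You are a concise technical writer."
```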
5. Average multiple runs for reliability
Use `--runs 5` or higher any time you're comparing prompts with non-zero temperature. Single-run comparisons can be misleading due to stochastic variation.
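For example, averaging five runs at a non-zero temperature:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --temperature 0.7 --runs 5
```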
Commands
Required flags
| Flag | Description |
|---|---|
| `--a`, `-a` | Prompt A — file path or inline text string |
| `--b`, `-b` | Prompt B — file path or inline text string |
| `--model`, `-m` | Model name. Run `llm-diff --models` for the full list. |
Useful one-liners
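A few combinations of the documented flags (the `jq` pipe assumes you have `jq` installed):

```shell
# List supported models and their pricing
llm-diff --models

# Machine-readable output for scripting
llm-diff -a v1.txt -b v2.txt -m gpt-4o --json | jq .

# Full word-level diff, sequential requests for rate-limited keys
llm-diff -a v1.txt -b v2.txt -m gpt-4o-mini --full --no-parallel
```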
Full Options
| Flag | Default | Description |
|---|---|---|
--a, -a | — | Path to file or inline text for Prompt A. |
--b, -b | — | Path to file or inline text for Prompt B. |
--model, -m | — | Model name (e.g. gpt-4o, claude-sonnet-4-20250514). |
--system, -s | — | System prompt — file path or inline text. |
--temperature | 0 | Sampling temperature (0–2). Use 0 for deterministic runs. |
--max-tokens | 2048 | Maximum output token budget per call. |
--runs | 1 | Average results over N runs. Recommended ≥ 3 at temp > 0. |
--no-parallel | — | Run A and B sequentially instead of in parallel. |
--full | — | Show full word-level inline diff of the response text. |
--json | false | Output results as machine-readable JSON. |
--base-url | — | Gateway URL override (e.g. https://gw.llmhut.com/v1). |
--timeout | 60000 | Request timeout in milliseconds. |
--models | — | Print all supported models and their current pricing, then exit. |
Models & Pricing
llm-diff bundles up-to-date pricing for all major providers. Run llm-diff --models for the live list. Pricing below is indicative — always verify with your provider dashboard.
| Model | Provider | Input / 1M tk | Output / 1M tk | Notes |
|---|---|---|---|---|
| `gpt-4o` | OpenAI | $2.50 | $10.00 | Flagship multimodal model |
| `gpt-4o-mini` | OpenAI | $0.15 | $0.60 | Best cost/quality for most tasks |
| `gpt-4-turbo` | OpenAI | $10.00 | $30.00 | 128K context window |
| `o1` | OpenAI | $15.00 | $60.00 | Extended reasoning; slow |
| `o3-mini` | OpenAI | $1.10 | $4.40 | Reasoning at lower cost |
| `claude-sonnet-4-20250514` | Anthropic | $3.00 | $15.00 | Current Sonnet; strong reasoning |
| `claude-3.5-haiku` | Anthropic | $0.80 | $4.00 | Fast, cheap Anthropic option |
| `claude-3-opus` | Anthropic | $15.00 | $75.00 | Most capable Claude model |
| `gemini-2.0-flash` | Google | $0.10 | $0.40 | Fastest Gemini; great for evals |
| `gemini-1.5-pro` | Google | $1.25 | $5.00 | 2M token context window |
| `llama-3.3-70b` | Groq | $0.59 | $0.79 | Extremely fast inference via Groq |
| `mixtral-8x7b` | Groq | $0.24 | $0.24 | MoE; efficient for classification |
JSON Output
Pipe results into your own scripts, dashboards, or eval pipelines with --json:
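For example, capturing a comparison to a file for downstream tooling:

```shell
llm-diff -a prompts/v1.txt -b prompts/v2.txt -m gpt-4o --json > diff.json
```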
The full JSON schema:
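The exact schema is not reproduced here; the shape below is an illustrative sketch, and every field name in it is an assumption:

```json
{
  "model": "gpt-4o",
  "runs": 1,
  "a": { "inputTokens": 312, "outputTokens": 540, "costUsd": 0.0062, "latencyMs": 2140 },
  "b": { "inputTokens": 224, "outputTokens": 498, "costUsd": 0.0055, "latencyMs": 1980 },
  "delta": { "inputTokens": -88, "outputTokens": -42, "costUsd": -0.0007, "latencyMs": -160 }
}
```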
How It Works
Under the hood, llm-diff coordinates a small pipeline for every comparison:
Resolve model → provider
The model name is matched against the built-in registry to determine the provider, API base URL, and per-token pricing rates.
Read prompts A and B
Each argument is treated as a file path first; if the file doesn't exist, the value is used as inline text. System prompts follow the same logic.
Fire both requests (in parallel by default)
Requests are sent concurrently using Promise.all() to minimize wall-clock time. Use --no-parallel for rate-limited accounts.
Collect usage metadata
Token counts and latency are read directly from the API response object — no client-side estimation. Cost is computed from the live pricing registry.
Compute deltas
Absolute and percentage deltas are calculated for tokens, cost, and latency. With --runs N, all values are averaged across runs.
Generate word-level diff
Response texts are diffed at word granularity using a Myers-style algorithm. Added words are green, removed words are red.
Render to terminal or JSON
Output is formatted using ANSI color codes for terminal display, or serialized to structured JSON with --json.
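To make the diff step concrete, here is a minimal sketch of a word-level diff. It uses a plain LCS dynamic program rather than llm-diff's actual Myers-style implementation (both minimize the same edit script; this O(n·m) version is just easier to read):

```javascript
// Word-level diff via longest-common-subsequence table.
// A Myers-style diff produces the same minimal edit script more efficiently.
function wordDiff(textA, textB) {
  const a = textA.split(/\s+/).filter(Boolean);
  const b = textB.split(/\s+/).filter(Boolean);
  // lcs[i][j] = length of the LCS of a[i:] and b[j:]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table, emitting keep / del / add operations.
  const ops = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { ops.push({ op: "keep", word: a[i] }); i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) { ops.push({ op: "del", word: a[i] }); i++; }
    else { ops.push({ op: "add", word: b[j] }); j++; }
  }
  while (i < a.length) ops.push({ op: "del", word: a[i++] });
  while (j < b.length) ops.push({ op: "add", word: b[j++] });
  return ops;
}

const ops = wordDiff("the quick brown fox", "the slow brown fox jumps");
console.log(ops.map(o =>
  o.op === "del" ? `-${o.word}` : o.op === "add" ? `+${o.word}` : o.word
).join(" "));
// → the -quick +slow brown fox +jumps
```

In the real tool, `del` words would be rendered red and `add` words green via ANSI escape codes.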
Gateway Support
Route requests through any OpenAI-compatible gateway — LLM Hut, LiteLLM, Azure OpenAI, or your own proxy — using --base-url:
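Using the gateway URL from the options table above:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --base-url https://gw.llmhut.com/v1
```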
Azure OpenAI example
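The resource name and endpoint path below are placeholders; check your Azure deployment for the exact OpenAI-compatible URL:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o \
  --base-url https://YOUR-RESOURCE.openai.azure.com/openai/v1
```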
LiteLLM Proxy
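A LiteLLM proxy listens on port 4000 by default:

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --base-url http://localhost:4000
```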
Averaging Multiple Runs
LLM responses are stochastic — the same prompt can return different token counts, different response lengths, and different latencies on each call. A single-run comparison can easily mislead.
Use --runs N to fire each prompt N times and average all metrics:
What gets averaged: token counts, cost, and latency. The response text shown in the diff is taken from the final run.
At temperature=0, 1–3 runs is sufficient. At temperature > 0, use at least 5 runs. For critical prompt decisions, consider 10+ runs and pipe results to a JSON aggregator.
Piping averaged results
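For example (the `.delta` field path is illustrative — inspect your own `--json` output for the real key names):

```shell
llm-diff -a v1.txt -b v2.txt -m gpt-4o --runs 10 --json | jq '.delta'
```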
Programmatic API
Import runDiff() directly to integrate llm-diff into eval suites, CI checks, or observability pipelines.
Basic usage
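A minimal sketch — the real `runDiff()` option names and result shape may differ from what is assumed here:

```javascript
import { runDiff } from "llm-diff";

const result = await runDiff({
  a: "prompts/v1.txt",        // file path or inline text, like the CLI
  b: "prompts/v2.txt",
  model: "gpt-4o-mini",
});
console.log(result);
```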
Full options
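The sketch below assumes the options mirror the CLI flags one-to-one; treat every key name as an assumption:

```javascript
import { runDiff } from "llm-diff";

const result = await runDiff({
  a: "prompts/v1.txt",
  b: "prompts/v2.txt",
  model: "claude-sonnet-4-20250514",
  system: "You are a concise technical writer.",
  temperature: 0.7,
  maxTokens: 2048,
  runs: 5,
  parallel: true,                      // --no-parallel sets this false
  baseUrl: "https://gw.llmhut.com/v1",
  timeout: 60000,                      // milliseconds
});
```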
CI integration example (Node.js)
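A hypothetical CI gate that fails the build when the candidate prompt raises cost too much. The result field names (`a.costUsd`, `b.costUsd`) are assumptions, not the documented schema:

```javascript
import { runDiff } from "llm-diff";

const result = await runDiff({
  a: "prompts/main.txt",
  b: "prompts/candidate.txt",
  model: "gpt-4o-mini",
  runs: 3, // average out stochastic variation
});

// Fail the pipeline if cost increases by more than 10%.
const pctChange = (result.b.costUsd - result.a.costUsd) / result.a.costUsd;
if (pctChange > 0.1) {
  console.error(`Prompt candidate raises cost by ${(pctChange * 100).toFixed(1)}%`);
  process.exit(1);
}
```

Because the CLI also sets exit codes, the same gate could be expressed in shell by checking `$?` after an `llm-diff --json` call.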
Roadmap
What's shipping next, and what's on the horizon. Contributions welcome — see CONTRIBUTING.md.
- ✓ Multi-provider support: OpenAI, Anthropic, Google Gemini, Groq all supported with live pricing.
- ✓ JSON output mode: full structured JSON for scripting and pipeline integration.
- ✓ Multi-run averaging: average token counts, cost, and latency across N runs.
- Side-by-side terminal diff view: split-pane terminal layout showing both responses side by side.
- Cross-model comparison: `--model-a gpt-4o --model-b claude-sonnet-4-20250514` — same prompt, different models.
- HTML & PDF report export: shareable visual reports with diff, metrics, and run history.
- Config file support: persist defaults to `.llm-diff.json` so you don't repeat flags every run.
- Streaming with live token counting: watch token counts update in real time as the model streams its response.
- Mistral, Cohere, Together AI: expanding provider coverage to additional popular inference APIs.
- Named experiments & history: track prompt iterations over time with a local SQLite store. Tag, query, and chart history.