Built by LLM Hut · Apache-2.0 · Node.js ≥ 18 · npm / npx

The prompt diff tool for engineers.

Stop guessing whether your prompt changes actually helped. llm-diff gives you a precise, technical breakdown — token deltas, cost savings, latency shifts, and a word-level response diff — all in one command.

bash — llm-diff
$ llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o

llm-diff openai/gpt-4o

tokens    312 → 289        -23 (-7.4%)
  input    45 → 38          -7
  output  267 → 251        -16
cost    $0.0041 → $0.0038  -$0.0003 (-7.3%)
latency  1247ms → 943ms    -304ms (-24.4%)

--- prompt-v1.txt
+++ prompt-v2.txt
  The capital of France is Paris.
- It is located in northern France and has a population of approximately
- 2.1 million people in the city proper…
+ Paris, with ~2.1M residents, serves as the political and cultural
+ center of the country…
Tokens saved: −23 (▼ 7.4% fewer tokens)
Cost delta: −$0.0003 (▼ 7.3% cheaper per call)
Latency saved: −304ms (▼ 24.4% faster response)
📊 Token Delta

See exactly how many input and output tokens were saved or added between prompt versions.

💸 Cost Tracking

Real-time cost calculation based on the latest published pricing for every supported model.

⚡ Latency Profiling

Measure time-to-first-token and total latency shifts. Spot regressions before they reach prod.

🔍 Word-Level Diff

Colored inline diff of actual model responses — not just stats, but what actually changed.

🔁 Multi-Run Averaging

LLM outputs are stochastic. Average over N runs to get statistically reliable comparisons.

🛠️ Programmatic API

Import runDiff() directly into your eval suite, CI pipeline, or observability stack.

Why llm-diff?

Prompt engineering is still largely vibes-driven. You tweak wording, run it manually, eyeball the output, and guess whether it's better. llm-diff adds rigor to this loop.

The core insight: a prompt that saves 20 output tokens per call costs $200 less per million calls on GPT-4o at $10 per 1M output tokens. At scale, prompt quality is a cost-control lever — and you can't optimize what you can't measure.
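That claim is plain per-token arithmetic. A back-of-envelope sketch, assuming the savings are output tokens priced at GPT-4o's indicative $10 per 1M tokens (verify current pricing with your provider):

```javascript
// Tokens trimmed per call, scaled to a million calls.
const tokensSavedPerCall = 20;
const pricePerMillionTokens = 10; // USD per 1M output tokens (indicative GPT-4o rate)
const calls = 1_000_000;

const dollarsSaved = (tokensSavedPerCall * calls / 1_000_000) * pricePerMillionTokens;
console.log(dollarsSaved); // 200
```

The same arithmetic scales linearly: 20 tokens saved at 10M calls is $2,000.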

Unlike playground UIs or LangSmith traces, llm-diff is designed to live in your terminal and CI pipeline. It speaks the language of engineers: flags, pipes, JSON, exit codes.

Installation

Run instantly with npx (no install)

npx llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o

Install globally

npm install -g llm-diff

Add to a project

npm install --save-dev llm-diff

Requirements: Node.js ≥ 18. No external dependencies beyond the provider SDKs.

Quick Start

1. Set your API key

Export the key for your chosen provider:

export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...
# or
export GEMINI_API_KEY=AIza...
# or
export GROQ_API_KEY=gsk_...

2. Compare two prompt files

llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o

3. Or compare inline text

llm-diff -a "Explain gravity" -b "Explain gravity to a child" -m gpt-4o-mini

4. Add a system prompt

llm-diff -a v1.txt -b v2.txt -m claude-sonnet-4-20250514 -s "You are a concise science teacher"

5. Average multiple runs for reliability

llm-diff -a v1.txt -b v2.txt -m gpt-4o --runs 5
Tip: Use --runs 5 or higher any time you're comparing prompts with non-zero temperature. Single-run comparisons can be misleading due to stochastic variation.

Commands

llm-diff --a <prompt-a> --b <prompt-b> --model <model> [options]

Required flags

Flag          Description
--a, -a       Prompt A — file path or inline text string
--b, -b       Prompt B — file path or inline text string
--model, -m   Model name. Run llm-diff --models for the full list.

Useful one-liners

# Get JSON output and pipe to jq
llm-diff -a v1.txt -b v2.txt -m gpt-4o --json | jq '.delta'

# Show full response diff with word-level highlighting
llm-diff -a v1.txt -b v2.txt -m gpt-4o --full

# Run sequentially (not in parallel) for rate-limited accounts
llm-diff -a v1.txt -b v2.txt -m gpt-4o --no-parallel

# Print all supported models with pricing
llm-diff --models

Full Options

Flag            Default   Description
--a, -a                   Path to file or inline text for Prompt A.
--b, -b                   Path to file or inline text for Prompt B.
--model, -m               Model name (e.g. gpt-4o, claude-sonnet-4-20250514).
--system, -s              System prompt — file path or inline text.
--temperature   0         Sampling temperature (0–2). Use 0 for deterministic runs.
--max-tokens    2048      Maximum output token budget per call.
--runs          1         Average results over N runs. Recommended ≥ 3 at temp > 0.
--no-parallel             Run A and B sequentially instead of in parallel.
--full                    Show full word-level inline diff of the response text.
--json          false     Output results as machine-readable JSON.
--base-url                Gateway URL override (e.g. https://gw.llmhut.com/v1).
--timeout       60000     Request timeout in milliseconds.
--models                  Print all supported models and their current pricing, then exit.

Models & Pricing

llm-diff bundles up-to-date pricing for all major providers. Run llm-diff --models for the live list. Pricing below is indicative — always verify with your provider dashboard.

Model                      Provider    Input / 1M tk   Output / 1M tk   Notes
gpt-4o                     OpenAI      $2.50           $10.00           Flagship multimodal model
gpt-4o-mini                OpenAI      $0.15           $0.60            Best cost/quality for most tasks
gpt-4-turbo                OpenAI      $10.00          $30.00           128K context window
o1                         OpenAI      $15.00          $60.00           Extended reasoning; slow
o3-mini                    OpenAI      $1.10           $4.40            Reasoning at lower cost
claude-sonnet-4-20250514   Anthropic   $3.00           $15.00           Current Sonnet; strong reasoning
claude-3.5-haiku           Anthropic   $0.80           $4.00            Fast, cheap Anthropic option
claude-3-opus              Anthropic   $15.00          $75.00           Most capable Claude model
gemini-2.0-flash           Google      $0.10           $0.40            Fastest Gemini; great for evals
gemini-1.5-pro             Google      $1.25           $5.00            2M token context window
llama-3.3-70b              Groq        $0.59           $0.79            Extremely fast inference via Groq
mixtral-8x7b               Groq        $0.24           $0.24            MoE; efficient for classification
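Per-call cost reduces to one formula across all rows: each direction's token count times its per-million rate. A minimal sketch hard-coding the indicative gpt-4o rates from the table (not live pricing; verify with your provider dashboard):

```javascript
// cost = input_tokens/1M * input_rate + output_tokens/1M * output_rate
const GPT4O_INPUT_PER_M = 2.5;   // USD per 1M input tokens (indicative)
const GPT4O_OUTPUT_PER_M = 10.0; // USD per 1M output tokens (indicative)

function callCost(inputTokens, outputTokens) {
  return (inputTokens / 1e6) * GPT4O_INPUT_PER_M
       + (outputTokens / 1e6) * GPT4O_OUTPUT_PER_M;
}

console.log(callCost(1000, 1000)); // ≈ 0.0125 USD per call
```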

JSON Output

Pipe results into your own scripts, dashboards, or eval pipelines with --json:

llm-diff -a v1.txt -b v2.txt -m gpt-4o --json | jq '.delta'

The full JSON schema:

JSON output schema
{
  "model": "gpt-4o",
  "provider": "openai",
  "promptA": {
    "tokens": { "input": 45, "output": 267, "total": 312 },
    "cost": 0.004125,
    "latencyMs": 1247
  },
  "promptB": {
    "tokens": { "input": 38, "output": 251, "total": 289 },
    "cost": 0.003832,
    "latencyMs": 943
  },
  "delta": {
    "totalTokens": -23,
    "totalTokensPct": -7.4,
    "inputTokens": -7,
    "outputTokens": -16,
    "cost": -0.000293,
    "costPct": -7.1,
    "latencyMs": -304,
    "latencyPct": -24.4
  },
  "runs": 1,
  "diff": "--- prompt-v1.txt\n+++ prompt-v2.txt\n The capital…"
}

How It Works

Under the hood, llm-diff coordinates a small pipeline for every comparison:

1. Resolve model → provider

The model name is matched against the built-in registry to determine the provider, API base URL, and per-token pricing rates.

2. Read prompts A and B

Each argument is treated as a file path first; if the file doesn't exist, the value is used as inline text. System prompts follow the same logic.

3. Fire both requests (in parallel by default)

Requests are sent concurrently using Promise.all() to minimize wall-clock time. Use --no-parallel for rate-limited accounts.

4. Collect usage metadata

Token counts and latency are read directly from the API response object — no client-side estimation. Cost is computed from the live pricing registry.

5. Compute deltas

Absolute and percentage deltas are calculated for tokens, cost, and latency. With --runs N, all values are averaged across runs.

6. Generate word-level diff

Response texts are diffed at word granularity using a Myers-style algorithm. Added words are green, removed words are red.

7. Render to terminal or JSON

Output is formatted using ANSI color codes for terminal display, or serialized to structured JSON with --json.
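The "file path first, inline text second" rule in step 2 can be sketched like this (a hypothetical helper for illustration, not llm-diff's actual source):

```javascript
import { existsSync, readFileSync } from 'node:fs';

// Treat the argument as a file path if it exists on disk;
// otherwise fall back to using it as inline prompt text.
function resolvePrompt(arg) {
  if (existsSync(arg)) {
    return readFileSync(arg, 'utf8');
  }
  return arg;
}

console.log(resolvePrompt('Explain gravity')); // no such file → used as inline text
```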

Gateway Support

Route requests through any OpenAI-compatible gateway — LLM Hut, LiteLLM, Azure OpenAI, or your own proxy — using --base-url:

llm-diff -a v1.txt -b v2.txt -m gpt-4o \
  --base-url https://gw.llmhut.com/v1
Gateway auth: When using a gateway, the gateway handles authentication on your behalf. You don't need provider-specific API keys — just your gateway credentials.

Azure OpenAI example

export OPENAI_API_KEY=your-azure-key
llm-diff -a v1.txt -b v2.txt -m gpt-4o \
  --base-url https://my-resource.openai.azure.com/openai/deployments/my-gpt4o

LiteLLM Proxy

llm-diff -a v1.txt -b v2.txt -m claude-sonnet-4-20250514 \
  --base-url http://localhost:8000

Averaging Multiple Runs

LLM responses are stochastic — the same prompt can return different token counts, different response lengths, and different latencies on each call. A single-run comparison can easily mislead.

Use --runs N to fire each prompt N times and average all metrics:

llm-diff -a v1.txt -b v2.txt -m gpt-4o --runs 5 --temperature 0.7

What gets averaged: token counts, cost, and latency. The response text shown in the diff is taken from the final run.

Recommended practice: at temperature=0, 1–3 runs is sufficient. At temperature > 0, use at least 5 runs. For critical prompt decisions, consider 10+ runs and pipe results to a JSON aggregator.
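Conceptually, what --runs N reports is a plain mean of each metric across run records. A hypothetical sketch of that reduction (illustrative only, not the library's internals; the field names are assumptions):

```javascript
// Average each numeric metric across N per-run records.
function averageRuns(runs) {
  const mean = (key) => runs.reduce((sum, r) => sum + r[key], 0) / runs.length;
  return { tokens: mean('tokens'), cost: mean('cost'), latencyMs: mean('latencyMs') };
}

const avg = averageRuns([
  { tokens: 300, cost: 0.0040, latencyMs: 1200 },
  { tokens: 320, cost: 0.0044, latencyMs: 900 },
]);
console.log(avg.tokens, avg.latencyMs); // 310 1050
```

A mean over a handful of runs smooths stochastic variation but is still noisy; for high-stakes comparisons, keep the raw per-run JSON and aggregate yourself.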

Piping averaged results

llm-diff -a v1.txt -b v2.txt -m gpt-4o --runs 10 --json \
  | jq '{avgCostDelta: .delta.cost, avgLatencyDelta: .delta.latencyMs}'

Programmatic API

Import runDiff() directly to integrate llm-diff into eval suites, CI checks, or observability pipelines.

Basic usage

import { runDiff } from 'llm-diff';

const result = await runDiff({
  promptA: 'Summarize the history of Rome',
  promptB: 'Summarize the history of Rome in 2 sentences',
  model: 'gpt-4o-mini',
});

console.log(result.delta.cost);      // -0.0004
console.log(result.delta.latencyMs); // -210
console.log(result.diff);            // word-level diff string

Full options

const result = await runDiff({
  promptA: 'path/to/v1.txt',         // or inline string
  promptB: 'path/to/v2.txt',
  model: 'claude-sonnet-4-20250514',
  system: 'You are a helpful assistant',
  temperature: 0,
  maxTokens: 1024,
  runs: 5,
  parallel: true,
  baseUrl: 'https://gw.llmhut.com/v1',
});

CI integration example (Node.js)

import { runDiff } from 'llm-diff';

const result = await runDiff({
  promptA: 'prompts/system-v1.txt',
  promptB: 'prompts/system-v2.txt',
  model: 'gpt-4o-mini',
  runs: 3,
});

// Fail the CI step if prompt B costs more than A
if (result.delta.cost > 0) {
  console.error(`Prompt B is more expensive by $${result.delta.cost.toFixed(5)}`);
  process.exit(1);
}

console.log(`✓ Prompt B saves ${Math.abs(result.delta.costPct).toFixed(1)}% per call`);
process.exit(0);

Roadmap

What's already shipped, what's shipping next, and what's on the horizon. Contributions welcome — see CONTRIBUTING.md.

  • Multi-provider support
    OpenAI, Anthropic, Google Gemini, Groq all supported with live pricing.
  • JSON output mode
    Full structured JSON for scripting and pipeline integration.
  • Multi-run averaging
    Average token counts, cost, and latency across N runs.
  • Side-by-side terminal diff view
    Split-pane terminal layout showing both responses side by side.
  • Cross-model comparison
    --model-a gpt-4o --model-b claude-sonnet-4-20250514 — same prompt, different models.
  • HTML & PDF report export
    Shareable visual reports with diff, metrics, and run history.
  • Config file support
    Persist defaults to .llm-diff.json so you don't repeat flags every run.
  • Streaming with live token counting
    Watch token counts update in real time as the model streams its response.
  • Mistral, Cohere, Together AI
    Expanding provider coverage to additional popular inference APIs.
  • Named experiments & history
    Track prompt iterations over time with a local SQLite store. Tag, query, and chart history.