Evaluation-Driven Development
Evaluation-Driven Development (EDD) is the AI equivalent of Test-Driven Development (TDD). Instead of writing unit tests before code, you define evaluation criteria before deploying or fine-tuning a model. EvalHub’s agent metadata powers each phase of the cycle — agents discover what to measure, construct valid jobs, and interpret results without hardcoded knowledge.
See also the animated MCP EDD Workflow diagram on the Home page and the edd_workflow MCP prompt.
The EDD cycle
Section titled “The EDD cycle”| Phase | Goal | Agent metadata used |
|---|---|---|
| Define | Choose the right evaluations and set thresholds | evaluates, recommended_when, target_type |
| Measure | Run evaluations and capture scores | hints, collection pass thresholds |
| Iterate | Improve the model and re-evaluate | result_interpretation, score_ranges, complements |
Phase 1: Define criteria
Section titled “Phase 1: Define criteria”Discover evaluations that match the user’s goals. Never assume provider or collection IDs exist — always fetch live metadata from the API.
-
Match user intent to
evaluatestagsTerminal window export EVALHUB_BASE_URL="https://evalhub.apps.my-cluster.example.com"export EVALHUB_TOKEN="$(oc whoami -t)"export EVALHUB_TENANT="eval-test"Filter providers by capability:
import osfrom evalhub import SyncEvalHubClientwith SyncEvalHubClient(base_url=os.environ["EVALHUB_BASE_URL"],auth_token=os.environ["EVALHUB_TOKEN"],tenant=os.environ["EVALHUB_TENANT"],) as client:safety_providers = client.providers.list(evaluates="safety")for p in safety_providers:if p.agent:print(p.resource.id, "—", p.agent.summary)Or via MCP: call
discover_providerswith{"evaluates": ["safety"]}. -
Filter by target type
If you know what the user is evaluating:
agent_providers = client.providers.list(target_type="agent")model_providers = client.providers.list(target_type="model")perf_providers = client.providers.list(target_type="inference_server") -
Confirm with
recommended_whenRead the provider’s
recommended_whenstrings to verify the match. These are natural-language conditions written for LLM consumption. -
Set thresholds
- Read
result_interpretationfor metric direction and baselines (e.g. Garak: lowerattack_success_rateis better; scores above 0.3 indicate vulnerability) - Check benchmark
score_rangeswhen available (e.g."0.0-0.25"= below random chance for 4-choice MCQ) - Start conservatively; tighten thresholds as the model improves
- Read
Phase 2: Measure scores
Section titled “Phase 2: Measure scores”Run the evaluation using IDs discovered in Phase 1. Read hints before constructing the job request.
-
Read operational hints
provider = client.providers.get("garak")if provider.agent:for hint in provider.agent.hints:print(hint)Example hints for Garak:
- The model endpoint must support OpenAI-compatible chat completions
- Use the
quickbenchmark for a fast smoke test (~2 minutes)
-
Submit the evaluation
from evalhub.models.api import (BenchmarkConfig,JobSubmissionRequest,ModelConfig,)job = client.jobs.submit(JobSubmissionRequest(name="edd-safety-baseline",model=ModelConfig(url="http://vllm:8000/v1",name="mistral-7b-instruct",),benchmarks=[BenchmarkConfig(id="quick", provider_id="garak")],))result = client.jobs.wait_for_completion(job.id, timeout=3600)Or submit a collection for a comprehensive assessment:
from evalhub.models.api import CollectionConfig, JobSubmissionRequest, ModelConfigjob = client.jobs.submit(JobSubmissionRequest(name="edd-safety-suite",model=ModelConfig(url="http://vllm:8000/v1", name="mistral-7b-instruct"),collection=CollectionConfig(id="safety-and-fairness-v1"),)) -
Interpret results using agent metadata
After the job completes, use
result_interpretationto explain scores:Metric Value Interpretation attack_success_rate0.15 Lower is better; above 0.3 indicates significant vulnerability Collection aggregate 0.82 Weighted average; pass threshold for Safety & Fairness v1 is 0.758 The MCP
get_job_statustool enriches completed benchmarks withresult_interpretationandcomplementsfrom provider agent metadata.
Phase 3: Iterate
Section titled “Phase 3: Iterate”If thresholds are not met, improve the model and re-evaluate. Tag each run (e.g. edd-iteration-1, edd-iteration-2) so you can compare in MLflow.
Iteration strategies:
- Prompt engineering — adjust system prompts, few-shot examples
- Fine-tuning — additional training on domain-specific data
- Model selection — try a different base model
- RAG improvements — better retrieval, re-ranking, chunking
- Safety guardrails — add or tune content filters
After each iteration, check complements for follow-up evaluations. For example, after a Garak safety scan, complements: [lm_evaluation_harness, guidellm] suggests running accuracy benchmarks and throughput profiling.
Worked example: safety EDD
Section titled “Worked example: safety EDD”Context: A developer is refactoring safety guardrails on a chat model and wants to measure impact before and after.
Define:
Use the edd_workflow MCP prompt with application_type: safety, or discover manually:
providers = client.providers.list(evaluates="safety", target_type="model")# Recommend: garak (red-teaming) + safety-and-fairness-v1 collection (bias, toxicity, ethics)Measure (before):
Run Garak quick benchmark and the Safety & Fairness collection. Record baseline scores.
Iterate:
Developer refactors guardrails.
Measure (after):
Re-run the same evaluations. Compare:
- Garak
attack_success_rate— should not increase - Collection aggregate score — should remain above 0.758
- Individual benchmark scores — check
result_interpretationfor per-metric guidance
Agent response example:
Your guardrail refactor improved the Garak attack success rate from 0.22 to 0.15 (lower is better). The Safety & Fairness collection aggregate rose from 0.76 to 0.82, now above the 0.758 pass threshold. The
toxigenbenchmark improved most — toxicity avoidance is stronger. Consider runninglm_evaluation_harnessnext (listed incomplements) to check whether accuracy regressed.
Using the MCP edd_workflow prompt
Section titled “Using the MCP edd_workflow prompt”The edd_workflow prompt provides structured guidance for each application type:
| Type | Define | Measure | Iterate |
|---|---|---|---|
rag | Retrieval quality and generation accuracy targets | RAG-specific benchmarks | Retrieval pipeline and generation prompts |
agent | Task completion and tool use accuracy | Tool call correctness and task success rate | Agent prompts and guardrails |
safety | Safety requirements and acceptable thresholds | Toxicity, bias, harmful content | Safety guardrails and content filters |
classifier | Per-class accuracy targets | Class imbalance and edge cases | Classification prompts and examples |
Ask your AI agent:
Use the edd_workflow prompt for a safety applicationThe agent receives a structured Define → Measure → Iterate workflow and uses EvalHub tools and agent metadata to execute each phase.