Agent Discoverability

since 0.4.2

EvalHub embeds structured agent metadata on providers, benchmarks, and collections so AI coding agents can discover the right evaluation, construct valid job requests, and interpret results — without hardcoded provider lists or deep API knowledge. The metadata lives inline on existing REST resources as optional JSON (or YAML in provider configs), and is available through the Python SDK, MCP server, and Agent Skills.

Agents working with EvalHub follow a three-step loop:

  1. Discover — match user intent to evaluates tags and recommended_when conditions
  2. Execute — read hints before submitting a job
  3. Interpret — use result_interpretation and benchmark score_ranges to explain scores

The MCP server exposes this loop through tools (discover_providers, submit_evaluation, get_job_status) and resources (evalhub://providers). See Evaluation-Driven Development for the full Define → Measure → Iterate workflow.

Metadata appears as an optional agent object on provider, benchmark, and collection responses.

| Field | Type | Description |
| --- | --- | --- |
| evaluates | string[] | Semantic tags describing what this provider measures (e.g. safety, reasoning, throughput). Agents match these against user intent. |
| recommended_when | string[] | Natural-language conditions under which an agent should suggest this provider. |
| target_type | string | What the provider evaluates: model, agent, or inference_server. |
| summary | string | Concise, action-oriented description (max 200 chars). |
| complements | string[] | Provider IDs that pair well for follow-up evaluations. |
| hints | string[] | Operational guidance for constructing job requests: model naming, secrets, parameter gotchas. |
| result_interpretation | string[] | How to read results: metric direction, baselines, what “good” looks like. |
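
A provider author may want to check an agent block against the fields above before publishing it. The following is a minimal sketch, not EvalHub's own validation: the name validate_agent_block is hypothetical, every field is treated as optional (an assumption), and the rules encode only what the table states.

```python
# Field names, target_type values, and the 200-character limit come from the table above.
LIST_FIELDS = ("evaluates", "recommended_when", "complements", "hints", "result_interpretation")
TARGET_TYPES = {"model", "agent", "inference_server"}
SUMMARY_MAX_CHARS = 200


def validate_agent_block(agent: dict) -> list[str]:
    """Return the problems found; an empty list means the block looks well-formed."""
    problems = []
    for field in LIST_FIELDS:
        value = agent.get(field)
        if value is None:
            continue  # assumption: every field is optional
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")
    target = agent.get("target_type")
    if target is not None and target not in TARGET_TYPES:
        problems.append(f"target_type must be one of {sorted(TARGET_TYPES)}")
    summary = agent.get("summary")
    if summary is not None:
        if not isinstance(summary, str):
            problems.append("summary must be a string")
        elif len(summary) > SUMMARY_MAX_CHARS:
            problems.append(f"summary exceeds {SUMMARY_MAX_CHARS} characters")
    return problems


garak_agent = {
    "evaluates": ["safety", "security", "red_teaming", "toxicity"],
    "target_type": "model",
    "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks",
}
print(validate_agent_block(garak_agent))  # prints []
```

Returning a list of problems rather than raising on the first one lets an author fix a whole block in one pass.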

Benchmarks can override provider defaults with a nested agent block:

| Field | Type | Description |
| --- | --- | --- |
| result_interpretation | string | Benchmark-specific guidance overriding provider defaults. |
| score_ranges | object[] | Structured score bands with semantic meaning (e.g. "0.0-0.25" = below random chance). Each entry has range and meaning. |
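
To illustrate how an agent might use score_ranges, here is a small sketch that maps a score to the meaning of its band. It assumes each range is a "low-high" string with non-negative bounds, treats both bounds as inclusive, and lets the first matching band win; none of that is specified above. The function names and every band except "0.0-0.25" are hypothetical.

```python
from typing import Optional


def parse_range(text: str) -> tuple[float, float]:
    """Split a band such as "0.0-0.25" into (low, high). Assumes non-negative bounds."""
    low, high = text.split("-", 1)
    return float(low), float(high)


def meaning_for(score: float, score_ranges: list[dict]) -> Optional[str]:
    """Return the meaning of the first band that contains score, or None if no band does."""
    for band in score_ranges:
        low, high = parse_range(band["range"])
        if low <= score <= high:  # assumption: bounds are inclusive
            return band["meaning"]
    return None


# Only the first band mirrors the example in the table; the others are illustrative.
example_ranges = [
    {"range": "0.0-0.25", "meaning": "below random chance"},
    {"range": "0.25-0.6", "meaning": "illustrative middle band"},
    {"range": "0.6-1.0", "meaning": "illustrative upper band"},
]
print(meaning_for(0.1, example_ranges))  # prints: below random chance
```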

Collections use the same fields as providers except target_type (collections aggregate benchmarks across providers that may target different types):

| Field | Type | Description |
| --- | --- | --- |
| evaluates | string[] | What dimensions this collection assesses. |
| recommended_when | string[] | When to suggest this collection over individual benchmarks. |
| summary | string | Concise description for agent tool listings. |
| complements | string[] | Collection or provider IDs that pair well. |
| hints | string[] | Operational guidance (duration, resource requirements). |
| result_interpretation | string[] | How to interpret aggregate and per-benchmark scores. |

Full schemas are available in the OpenAPI specification served at /openapi.yaml on any EvalHub instance.

There is no dedicated discovery endpoint. Metadata is returned on existing API routes:

| Endpoint | Agent metadata |
| --- | --- |
| GET /api/v1/evaluations/providers | agent on each provider; nested agent on benchmarks |
| GET /api/v1/evaluations/providers/{id} | Same |
| GET /api/v1/evaluations/collections | agent when configured |
| GET /api/v1/evaluations/collections/{id} | Same |

Provider metadata can be updated at runtime via PATCH /api/v1/evaluations/providers/{id} with paths under /agent.

Set up your environment once:

export EVALHUB_BASE_URL="https://evalhub.apps.my-cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"

Scenario: a developer asks “Is my model safe for production?”

curl -s \
  -H "Authorization: Bearer $EVALHUB_TOKEN" \
  -H "X-Tenant: $EVALHUB_TENANT" \
  "$EVALHUB_BASE_URL/api/v1/evaluations/providers" \
  | jq '[.items[] | select(.agent != null and (.agent.evaluates | index("safety"))) |
        {id: .resource.id, summary: .agent.summary, hints: .agent.hints}]'

Example output:

[
  {
    "id": "garak",
    "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks",
    "hints": [
      "The model endpoint must support OpenAI-compatible chat completions",
      "The 'quick' benchmark runs a single DAN probe for fast smoke testing (~2 min)"
    ]
  }
]

List collections with safety metadata:

curl -s \
  -H "Authorization: Bearer $EVALHUB_TOKEN" \
  -H "X-Tenant: $EVALHUB_TENANT" \
  "$EVALHUB_BASE_URL/api/v1/evaluations/collections" \
  | jq '[.items[] | select(.agent != null and (.agent.evaluates | index("safety"))) |
        {id: .resource.id, summary: .agent.summary}]'
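
The same discovery calls can be made without curl and jq. The sketch below uses only the Python standard library rather than the EvalHub Python SDK, whose interface is not shown here. It assumes the response shape implied by the jq filters (an items array whose entries carry agent and resource.id); the function names are hypothetical.

```python
import json
import os
import urllib.request


def build_request(base_url: str, token: str, tenant: str, path: str) -> urllib.request.Request:
    """Build a GET request carrying the same headers as the curl examples."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}",
        headers={"Authorization": f"Bearer {token}", "X-Tenant": tenant},
    )


def select_by_tag(items: list[dict], tag: str) -> list[dict]:
    """Keep entries whose agent.evaluates contains tag, mirroring the jq filter."""
    selected = []
    for item in items:
        agent = item.get("agent")
        if agent and tag in (agent.get("evaluates") or []):
            selected.append({
                "id": item["resource"]["id"],
                "summary": agent.get("summary"),
                "hints": agent.get("hints", []),
            })
    return selected


def discover(tag: str, path: str = "/api/v1/evaluations/providers") -> list[dict]:
    """Fetch providers (or collections, via path) and filter them by tag."""
    request = build_request(
        os.environ["EVALHUB_BASE_URL"],
        os.environ["EVALHUB_TOKEN"],
        os.environ["EVALHUB_TENANT"],
        path,
    )
    with urllib.request.urlopen(request) as response:
        return select_by_tag(json.load(response)["items"], tag)
```

With the three environment variables set, discover("safety") covers the provider query and discover("safety", "/api/v1/evaluations/collections") covers the collection query.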

User: “I want to check if my model is safe for production”

1. Discover providers

An agent lists providers where "safety" appears in agent.evaluates. Garak matches with recommended_when: "Pre-deployment safety gate". The agent reads summary and presents a recommendation to the user.

2. Discover collections

For a broader assessment, the agent checks collections where "safety" or "fairness" appears in agent.evaluates. The Safety & Fairness v1 collection covers toxicity, bias, truthfulness, and ethics with weighted scoring.

3. Read hints before submitting

Before calling submit_evaluation, the agent reads garak.agent.hints:

  • The model endpoint must support OpenAI-compatible chat completions
  • Use the quick benchmark for a fast smoke test (~2 minutes)

4. Submit and monitor

{
  "name": "pre-deploy-safety-scan",
  "model": {
    "url": "http://vllm:8000/v1",
    "name": "mistral-7b-instruct"
  },
  "benchmarks": [
    { "id": "quick", "provider_id": "garak" }
  ]
}

Poll get_job_status until the job completes.
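
A polling loop for this step could look like the sketch below. The get_status argument is a hypothetical callable wrapping the get_job_status tool and returning a state string. The terminal state names, the interval, and the timeout are assumptions, not documented EvalHub values.

```python
import time
from typing import Callable

# Assumed terminal states; check the actual job status values of your EvalHub instance.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[str], str],
    job_id: str,
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it reports a terminal state, then return that state."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_status(job_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state!r} after {timeout} seconds")
        sleep(interval)
```

Passing sleep as a parameter keeps the loop testable without real delays. The timeout prevents an agent from waiting indefinitely on a stuck job.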

5. Interpret results

The job returns a metric such as the one below, which the agent reads against result_interpretation from Garak’s agent metadata:

{
  "attack_success_rate": 0.15
}

An agent explains: “The attack success rate is 0.15 — lower is better. Scores above 0.3 indicate significant vulnerability, so 0.15 is acceptable but not perfect. Consider running the full owasp_llm_top10 benchmark or the Safety & Fairness collection for a comprehensive assessment.”
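
One way to assemble the guidance an agent needs for this step is sketched below. It prefers a benchmark-level result_interpretation and falls back to the provider's. The benchmarks key and the benchmark id field are assumed response names, and both function names are hypothetical.

```python
def guidance_for(provider: dict, benchmark_id: str) -> list[str]:
    """Benchmark-level result_interpretation wins; otherwise use the provider's."""
    for benchmark in provider.get("benchmarks", []):
        if benchmark.get("id") == benchmark_id:
            override = (benchmark.get("agent") or {}).get("result_interpretation")
            if override:
                # The benchmark-level field is typed as a single string.
                return [override] if isinstance(override, str) else list(override)
    return list((provider.get("agent") or {}).get("result_interpretation", []))


def interpretation_context(provider: dict, benchmark_id: str, results: dict) -> str:
    """Combine raw metrics and guidance into text an agent can reason over."""
    lines = [f"{metric} = {value}" for metric, value in results.items()]
    lines += [f"- {note}" for note in guidance_for(provider, benchmark_id)]
    return "\n".join(lines)


garak = {
    "agent": {
        "result_interpretation": [
            "attack_success_rate measures how often the model was successfully exploited",
            "LOWER is better -- 0.0 means no attacks succeeded",
            "Scores above 0.3 indicate significant vulnerability",
        ]
    },
    "benchmarks": [{"id": "quick"}],  # no benchmark-level override in this example
}
print(interpretation_context(garak, "quick", {"attack_success_rate": 0.15}))
```

The output is plain text rather than a verdict: the guidance is natural language, so the final judgment stays with the agent.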

The complements field suggests follow-up evaluations (lm_evaluation_harness for accuracy, guidellm for throughput).

| Use case | Recommendation |
| --- | --- |
| Broad intent (“evaluate safety”) | Collection (curated weights and pass thresholds) |
| Named benchmark (“run MMLU”) | Individual benchmark |
| Pre-deployment gate | Collection or targeted provider (Garak for red-teaming) |
| Fast iteration | Single benchmark with --num-examples limit (see hints) |

Collections expose the same agent metadata fields as providers, except target_type. Collection agent metadata is available via REST and MCP resources. The Python SDK Collection model does not yet include an agent field, so use REST or MCP for collection discovery until the SDK adds it.

Provider and collection owners add an agent block to YAML configuration:

config/providers/garak.yaml
agent:
  evaluates: [safety, security, red_teaming, toxicity]
  recommended_when:
    - "User asks about model safety or toxicity"
    - "Pre-deployment safety gate"
  target_type: model
  summary: "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks"
  complements: [lm_evaluation_harness, guidellm]
  hints:
    - "The model endpoint must support OpenAI-compatible chat completions"
    - "The 'quick' benchmark runs a single DAN probe for fast smoke testing (~2 min)"
  result_interpretation:
    - "attack_success_rate measures how often the model was successfully exploited"
    - "LOWER is better -- 0.0 means no attacks succeeded"
    - "Scores above 0.3 indicate significant vulnerability"

At runtime, update provider agent metadata via PATCH:

curl -X PATCH \
  -H "Authorization: Bearer $EVALHUB_TOKEN" \
  -H "X-Tenant: $EVALHUB_TENANT" \
  -H "Content-Type: application/json" \
  "$EVALHUB_BASE_URL/api/v1/evaluations/providers/garak" \
  -d '[{"op": "replace", "path": "/agent/summary", "value": "Updated summary for agents"}]'
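
The request body above has the shape of a JSON Patch (RFC 6902) document. When several agent fields change at once, the body can be generated rather than written by hand, as in this sketch. The helper name agent_patch is hypothetical, and only the replace operation is used because it is the only one shown here.

```python
import json


def agent_patch(changes: dict) -> str:
    """Build a JSON Patch body that replaces fields under /agent."""
    operations = []
    for field, value in changes.items():
        # JSON Pointer (RFC 6901) escapes "~" as "~0" and "/" as "~1".
        token = field.replace("~", "~0").replace("/", "~1")
        operations.append({"op": "replace", "path": f"/agent/{token}", "value": value})
    return json.dumps(operations)


print(agent_patch({"summary": "Updated summary for agents"}))
```

The printed string can be passed to curl with -d, or sent as the body of a PATCH request from any HTTP client.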

See the Provider Catalog for registered providers.