
Agent Skills

The eval-hub-skills plugin gives AI coding agents scripted access to EvalHub discovery, job submission, and monitoring. Skills complement the MCP server — use MCP when connected to a cluster; use skills as a fallback or for script-based workflows.

Skills consume the same agent metadata returned by the REST API. All provider and collection knowledge comes from live API responses — never hardcoded IDs.

| Skill | Purpose |
| --- | --- |
| evalhub | Full skill: discovery, evaluation, job lifecycle, and EDD workflows |
| evalhub-discovery | Discover providers, benchmarks, and collections; read agent metadata |
| evalhub-eval | Submit evaluation jobs against benchmarks or collections |
| evalhub-jobs | Monitor, wait on, cancel, and fetch logs for evaluation jobs |

Prerequisites:

  • Python 3.11+
  • uv (scripts use PEP 723 inline metadata for auto-dependency resolution)
  • Network access to an EvalHub service
  • Environment variables (see below)

Install from the Claude Code plugin marketplace:

/plugin marketplace add eval-hub/eval-hub-skills
/plugin install evalhub@evalhub

The skill is then available as /evalhub:evalhub in any Claude Code session.

To install from source instead, clone the repository and symlink the skills:

git clone https://github.com/eval-hub/eval-hub-skills
cd eval-hub-skills
make install-all # symlinks all four skills into ~/.claude/skills/

Changes to skill source are reflected immediately without reinstalling.

When EvalHub exposes an MCP server on your cluster, register it with Claude Code:

export EVALHUB_BASE_URL="https://evalhub.apps.my-cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"
claude mcp add evalhub "$EVALHUB_BASE_URL/mcp" \
--transport http \
--header "Authorization: Bearer $EVALHUB_TOKEN" \
--header "x-tenant: $EVALHUB_TENANT"

| Variable | Purpose | Example |
| --- | --- | --- |
| EVALHUB_BASE_URL | EvalHub API base URL | https://evalhub.apps.cluster.example.com |
| EVALHUB_TOKEN | Bearer token for auth | sha256~... (from oc whoami -t) |
| EVALHUB_TENANT | Namespace / tenant | eval-test |
| EVALHUB_INSECURE | Skip TLS verification | true (for self-signed certs) |
| EVALHUB_MCP_URL | MCP server HTTP URL (optional) | Enables MCP mode in skills |

For example, in a shell session:

export EVALHUB_BASE_URL="https://evalhub.apps.my-cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"
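
A wrapper script might read these variables roughly as follows. This is a minimal sketch, not the plugin's actual code: the `EvalHubConfig` class and `load_config` function are hypothetical names, treating the first three variables as required is an assumption, and so is reusing the `Authorization` and `x-tenant` headers from the MCP registration for REST calls.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EvalHubConfig:
    """Hypothetical container for the environment variables listed above."""

    base_url: str
    token: str
    tenant: str
    verify_tls: bool
    mcp_url: Optional[str]

    def headers(self) -> dict:
        # Assumption: REST calls take the same headers as the MCP registration.
        return {
            "Authorization": f"Bearer {self.token}",
            "x-tenant": self.tenant,
        }


def load_config(env: Mapping[str, str]) -> EvalHubConfig:
    """Build a config from a mapping such as os.environ."""
    # Assumption: these three variables are mandatory.
    required = ("EVALHUB_BASE_URL", "EVALHUB_TOKEN", "EVALHUB_TENANT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))

    # EVALHUB_INSECURE=true disables TLS verification (self-signed certs).
    insecure = env.get("EVALHUB_INSECURE", "").strip().lower() == "true"

    return EvalHubConfig(
        base_url=env["EVALHUB_BASE_URL"].rstrip("/"),
        token=env["EVALHUB_TOKEN"],
        tenant=env["EVALHUB_TENANT"],
        verify_tls=not insecure,
        mcp_url=env.get("EVALHUB_MCP_URL") or None,
    )
```

Passing `os.environ` to `load_config` fails early with a readable message if a variable is unset, which is usually easier to debug than an authentication error from the service.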

The default discovery workflow runs two parallel calls to fetch full agent metadata:

uv run ~/.claude/skills/evalhub/scripts/evalhub_providers.py --agent 2>/dev/null
uv run ~/.claude/skills/evalhub/scripts/evalhub_collections.py --agent 2>/dev/null

The --agent output contains the complete agent metadata block for every provider or collection. Do not fetch individual providers afterwards — everything is already included.
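
When driving the scripts from Python rather than from an agent, the two discovery calls could be run concurrently along these lines. This is a sketch under stated assumptions: `discovery_commands`, `run_json`, and `run_parallel` are hypothetical helper names, and the script paths and `--agent` flag are taken from the commands above.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def discovery_commands(scripts_dir):
    """Return the two discovery commands shown above as argv lists."""
    scripts = Path(scripts_dir)
    return [
        ["uv", "run", str(scripts / "evalhub_providers.py"), "--agent"],
        ["uv", "run", str(scripts / "evalhub_collections.py"), "--agent"],
    ]


def run_json(cmd):
    """Run one command and parse its stdout as JSON.

    check=True raises CalledProcessError on a non-zero exit code.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


def run_parallel(commands):
    """Run the commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run_json, commands))
```

A caller would then unpack the results, for example `providers, collections = run_parallel(discovery_commands(Path.home() / ".claude/skills/evalhub/scripts"))`. Threads are sufficient here because each worker only waits on a child process.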

Filter when user intent is clear:

uv run ~/.claude/skills/evalhub/scripts/evalhub_providers.py --target-type model 2>/dev/null
uv run ~/.claude/skills/evalhub/scripts/evalhub_providers.py --evaluates safety 2>/dev/null
uv run ~/.claude/skills/evalhub/scripts/evalhub_collections.py --evaluates safety 2>/dev/null

List benchmarks:

uv run ~/.claude/skills/evalhub/scripts/evalhub_providers.py --benchmarks 2>/dev/null
uv run ~/.claude/skills/evalhub/scripts/evalhub_providers.py --benchmarks --provider garak 2>/dev/null

All scripts output JSON to stdout. Errors go to stderr with exit code 1.
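
That contract (JSON on stdout, errors on stderr, non-zero exit code on failure) makes the scripts straightforward to wrap. The following is a minimal sketch; `SkillScriptError` and `call_script` are hypothetical names, not part of the plugin. Unlike the commands above, which discard stderr with `2>/dev/null`, the sketch captures stderr so the error text is available to the caller.

```python
import json
import subprocess


class SkillScriptError(RuntimeError):
    """Raised when a script exits non-zero; carries the captured stderr."""

    def __init__(self, returncode, stderr):
        super().__init__(f"script exited with code {returncode}: {stderr.strip()}")
        self.returncode = returncode
        self.stderr = stderr


def call_script(cmd):
    """Run a script and return its parsed JSON output.

    stdout and stderr are captured separately, so error text never
    ends up mixed into the JSON payload.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise SkillScriptError(proc.returncode, proc.stderr)
    return json.loads(proc.stdout)
```

In a CI job, catching `SkillScriptError` and logging `exc.stderr` keeps the failure reason visible without having to rerun the command by hand.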

| Capability | MCP (preferred) | Agent Skills |
| --- | --- | --- |
| Discovery | discover_providers tool or evalhub://providers resource | evalhub_providers.py --agent |
| Submit job | submit_evaluation tool | evalhub_eval.py |
| Monitor job | get_job_status tool | evalhub_status.py --wait |
| EDD workflow | edd_workflow prompt | EDD references in skill docs |
| Setup | claude mcp add or VS Code MCP config | make install-all |

Use MCP when the EvalHub MCP server is connected — it provides structured tool outputs and the discover_providers filter. Use skills when MCP is unavailable, for CI scripts, or when you need direct REST access via the Python SDK.

When MCP is connected, skills prefer MCP resources over Python scripts:

| MCP Resource URI | Replaces |
| --- | --- |
| evalhub://providers | evalhub_providers.py --agent |
| evalhub://providers/{id} | evalhub_providers.py PROVIDER_ID |
| evalhub://benchmarks | evalhub_providers.py --benchmarks |
| evalhub://benchmarks?label=safety | evalhub_providers.py --evaluates safety |
| evalhub://collections | evalhub_collections.py --agent |
| evalhub://collections/{id} | evalhub_collections.py COLLECTION_ID |
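
The preference rule can be pictured as a lookup with a fallback. The sketch below is illustrative only: the request keys and the `pick_source` function are hypothetical, and how a skill actually detects an MCP connection is not specified here (checking whether `EVALHUB_MCP_URL` is set would be one plausible signal).

```python
# Rows copied from the table above:
# hypothetical request key -> (MCP resource URI, equivalent script argv).
SOURCES = {
    "providers": ("evalhub://providers", ["evalhub_providers.py", "--agent"]),
    "benchmarks": ("evalhub://benchmarks", ["evalhub_providers.py", "--benchmarks"]),
    "safety_benchmarks": (
        "evalhub://benchmarks?label=safety",
        ["evalhub_providers.py", "--evaluates", "safety"],
    ),
    "collections": ("evalhub://collections", ["evalhub_collections.py", "--agent"]),
}


def pick_source(request, mcp_connected):
    """Prefer the MCP resource when connected, else fall back to the script."""
    uri, argv = SOURCES[request]
    if mcp_connected:
        return ("mcp", uri)
    return ("script", argv)
```

Keeping both forms in one table means the fallback path requests the same data as the MCP path, so downstream handling of the metadata does not need to know which source was used.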

You: Which providers can evaluate my model for safety?

The skill fetches live metadata and filters by evaluates:

[
  {
    "id": "garak",
    "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks",
    "target_type": "model",
    "evaluates": ["safety", "security", "red_teaming", "toxicity"]
  }
]
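
The filtering step amounts to a membership test on each provider's `evaluates` list, optionally combined with a `target_type` match. A minimal sketch, assuming provider records shaped like the one above; `filter_providers` is a hypothetical helper, and the second sample record is invented purely to show a non-match.

```python
def filter_providers(providers, evaluates=None, target_type=None):
    """Return providers matching the given evaluates label and target type.

    A criterion left as None is not applied.
    """
    matches = []
    for provider in providers:
        if evaluates is not None and evaluates not in provider.get("evaluates", []):
            continue
        if target_type is not None and provider.get("target_type") != target_type:
            continue
        matches.append(provider)
    return matches


SAMPLE_PROVIDERS = [
    {
        "id": "garak",
        "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks",
        "target_type": "model",
        "evaluates": ["safety", "security", "red_teaming", "toxicity"],
    },
    {
        # Hypothetical provider, included only so the filter has something to exclude.
        "id": "example-accuracy-provider",
        "summary": "Illustrative placeholder",
        "target_type": "model",
        "evaluates": ["accuracy"],
    },
]

safety_ids = [p["id"] for p in filter_providers(SAMPLE_PROVIDERS, evaluates="safety", target_type="model")]
```

With the sample data, `safety_ids` contains only `garak`, matching the response shown above.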

You: Run a quick safety scan on my model at http://vllm:8000/v1.

The skill reads hints, submits a job, and monitors until complete — then interprets results using result_interpretation.

See Evaluation-Driven Development for the full before/after workflow.