Quick Start
Run your first evaluation with EvalHub.
1. Start a Model Server

You need an OpenAI-compatible model endpoint. Pick one:

Kubernetes (vLLM):

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: [--model, meta-llama/Llama-3.2-1B-Instruct, --port, "8000"]
        ports:
        - containerPort: 8000
EOF
```

Local (Ollama):

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:1.5b
```

Ollama serves at http://localhost:11434/v1 (OpenAI-compatible).
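To confirm the endpoint is up before moving on, you can query the standard OpenAI-compatible models route. Shown here against the Ollama default; adjust the host and port if you deployed vLLM instead:

```bash
# Lists the models the endpoint serves; a JSON response means the server is ready
curl -s http://localhost:11434/v1/models
```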
2. Install a Client

CLI:

```bash
pip install "eval-hub-sdk[cli]"
```

Python SDK:

```bash
pip install "eval-hub-sdk[client]"
```

REST API: no installation needed; use curl directly against the REST API.
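Whichever client you pick, point it at your EvalHub server. A quick reachability check, assuming the EVALHUB_URL environment variable holds the server's base URL (the same variable used in the examples below):

```bash
# The providers endpoint (used later in this guide) doubles as a connectivity check
curl -s $EVALHUB_URL/api/v1/evaluations/providers | jq .
```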
3. Submit an Evaluation

CLI:

```bash
evalhub eval run \
  --name quickstart-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider guidellm \
  --benchmark quick_perf_test
# Job submitted: eval-a1b2c3d4
```

Python SDK:

```python
import os

from evalhub import SyncEvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkConfig, JobSubmissionRequest

client = SyncEvalHubClient(base_url=os.environ["EVALHUB_URL"])

job = client.jobs.submit(JobSubmissionRequest(
    name="quickstart-eval",
    model=ModelConfig(url="http://localhost:11434/v1", name="qwen2.5:1.5b"),
    benchmarks=[BenchmarkConfig(
        id="quick_perf_test",
        provider_id="guidellm",
        parameters={
            "profile": "constant",
            "rate": 5,
            "max_seconds": 10,
            "max_requests": 20,
        },
    )],
))

print(f"Job submitted: {job.id}")
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "quickstart-eval",
    "model": {"url": "http://localhost:11434/v1", "name": "qwen2.5:1.5b"},
    "benchmarks": [{
      "id": "quick_perf_test",
      "provider_id": "guidellm",
      "parameters": {
        "profile": "constant",
        "rate": 5,
        "max_seconds": 10,
        "max_requests": 20
      }
    }]
  }'
```
4. Wait for Results

CLI:

```bash
# Block until the job completes
evalhub eval run --config eval.yaml --wait

# Or check status separately
evalhub eval status eval-a1b2c3d4

# View results
evalhub eval results eval-a1b2c3d4
```

Python SDK:

```python
result = client.jobs.wait_for_completion(job.id, timeout=120)
print(f"Status: {result.state}")
print(f"Results: {result.results}")
```

Or poll manually:

```python
status = client.jobs.get(job.id)
print(f"Status: {status.state}")
```

cURL:

```bash
# Check job status
curl -s $EVALHUB_URL/api/v1/evaluations/jobs/eval-a1b2c3d4 | jq .

# Get results once completed
curl -s $EVALHUB_URL/api/v1/evaluations/jobs/eval-a1b2c3d4 \
  | jq '.results'
```
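The CLI example above reads its job definition from eval.yaml, which this guide does not show. As a rough sketch only, one plausible shape mirrors the JSON payload from the cURL example; the exact schema accepted by evalhub eval run --config is an assumption here, so check the CLI reference before relying on it:

```yaml
# Hypothetical eval.yaml: field names copied from the REST payload above,
# but the schema of the --config file is not documented in this quick start.
name: quickstart-eval
model:
  url: http://localhost:11434/v1
  name: qwen2.5:1.5b
benchmarks:
  - id: quick_perf_test
    provider_id: guidellm
    parameters:
      profile: constant
      rate: 5
      max_seconds: 10
      max_requests: 20
```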
Explore Further
List Providers and Benchmarks

CLI:

```bash
evalhub providers list
evalhub providers describe lm_evaluation_harness
```

Python SDK:

```python
for provider in client.providers.list():
    print(f"{provider.resource.id}: {provider.name}")

benchmarks = client.benchmarks.list(provider_id="lm_evaluation_harness")
for b in benchmarks:
    print(f"  {b.id}: {b.name}")
```

cURL:

```bash
# List all providers
curl -s $EVALHUB_URL/api/v1/evaluations/providers | jq .

# Get a specific provider with its benchmarks
curl -s $EVALHUB_URL/api/v1/evaluations/providers/lm_evaluation_harness | jq .
```

Submit a Collection
CLI:

```bash
evalhub collections run healthcare_safety_v1 \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --name llama3-healthcare-eval \
  --wait
```

Python SDK:

```python
from evalhub.models.api import CollectionRef

job = client.jobs.submit(JobSubmissionRequest(
    name="llama3-healthcare-eval",
    model=ModelConfig(url="http://localhost:11434/v1", name="qwen2.5:1.5b"),
    collection=CollectionRef(id="healthcare_safety_v1"),
))
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3-healthcare-eval",
    "model": {"url": "http://localhost:11434/v1", "name": "qwen2.5:1.5b"},
    "collection": {"id": "healthcare_safety_v1"}
  }'
```

Use MLflow Tracking
CLI:

```bash
evalhub eval run \
  --name llama3-mlflow-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --experiment my-experiment
```

Python SDK:

```python
from evalhub.models.api import ExperimentConfig

job = client.jobs.submit(JobSubmissionRequest(
    name="llama3-mlflow-eval",
    model=ModelConfig(url="http://localhost:11434/v1", name="qwen2.5:1.5b"),
    benchmarks=[BenchmarkConfig(id="mmlu", provider_id="lm_evaluation_harness")],
    experiment=ExperimentConfig(name="my-experiment"),
))
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3-mlflow-eval",
    "model": {"url": "http://localhost:11434/v1", "name": "qwen2.5:1.5b"},
    "benchmarks": [{"id": "mmlu", "provider_id": "lm_evaluation_harness"}],
    "experiment": {"name": "my-experiment"}
  }'
```

Export Results
CLI:

```bash
# JSON
evalhub eval results eval-a1b2c3d4 --format json > results.json

# CSV
evalhub eval results eval-a1b2c3d4 --format csv > results.csv
```

Python SDK:

```python
result = client.jobs.wait_for_completion(job.id, timeout=3600)

# Access results directly
for r in result.results.benchmarks:
    print(f"{r.id}: {r.metrics}")
```

cURL:

```bash
curl -s $EVALHUB_URL/api/v1/evaluations/jobs/eval-a1b2c3d4 \
  | jq '.results'
```

Export Results to OCI Registry
Section titled “Export Results to OCI Registry”EvalHub can persist evaluation result files as OCI Artifacts to an OCI Registry. Each result file is stored as a separate layer in the OCI Artifact, allowing consumers to selectively pull only the content they need (e.g. a summary JSON, individual adapter output files, or all files).
First, create a Kubernetes Secret with your registry credentials (an "OCI Connection"):

```yaml
kind: Secret
apiVersion: v1
type: kubernetes.io/dockerconfigjson
metadata:
  name: my-oci-credentials
data:
  .dockerconfigjson: <base64-encoded docker config>
```
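If you would rather not hand-encode the docker config, kubectl can generate an equivalent dockerconfigjson Secret; the credentials below are placeholders:

```bash
# Produces a Secret of type kubernetes.io/dockerconfigjson named my-oci-credentials
kubectl create secret docker-registry my-oci-credentials \
  --docker-server=quay.io \
  --docker-username=<username> \
  --docker-password=<password>
```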
Then reference it in the job submission using the exports.oci field:

CLI:

```bash
evalhub eval run \
  --name my-eval-job1 \
  --model-url http://vllm-llama3-8b-instruct-svc.evalhub-test.svc.cluster.local:8000/v1 \
  --model-name meta-llama/Llama-3.2-1B-Instruct \
  --provider demo \
  --benchmark demo_benchmark \
  --oci-host quay.io \
  --oci-repository myorg/myartifact \
  --oci-connection my-oci-credentials
```

Python SDK:

```python
from evalhub.models.api import EvaluationExports, EvaluationExportsOCI, OCICoordinates, OCIConnectionConfig

job = client.jobs.submit(JobSubmissionRequest(
    name="my-eval-job1",
    model=ModelConfig(
        url="http://vllm-llama3-8b-instruct-svc.evalhub-test.svc.cluster.local:8000/v1",
        name="meta-llama/Llama-3.2-1B-Instruct",
    ),
    benchmarks=[BenchmarkConfig(id="demo_benchmark", provider_id="demo")],
    exports=EvaluationExports(
        oci=EvaluationExportsOCI(
            coordinates=OCICoordinates(
                oci_host="quay.io",
                oci_repository="myorg/myartifact",
            ),
            k8s=OCIConnectionConfig(connection="my-oci-credentials"),
        )
    ),
))
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-eval-job1",
    "model": {
      "url": "http://vllm-llama3-8b-instruct-svc.evalhub-test.svc.cluster.local:8000/v1",
      "name": "meta-llama/Llama-3.2-1B-Instruct"
    },
    "benchmarks": [{"id": "demo_benchmark", "provider_id": "demo"}],
    "exports": {
      "oci": {
        "coordinates": {"oci_host": "quay.io", "oci_repository": "myorg/myartifact"},
        "k8s": {"connection": "my-oci-credentials"}
      }
    }
  }'
```

When the evaluation completes, the adapter pushes an OCI Artifact to the specified repository (e.g. quay.io/myorg/myartifact). The k8s.connection field references the name of the Kubernetes Secret containing the registry credentials.
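Once pushed, the artifact can be inspected and fetched with any OCI-aware client. A sketch using the oras CLI; the :latest tag is an assumption, so use whatever reference the completed job reports:

```bash
# Show the layers (one per result file) recorded in the artifact manifest
oras manifest fetch quay.io/myorg/myartifact:latest | jq '.layers'

# Pull every result file into a local directory
oras pull quay.io/myorg/myartifact:latest -o ./results
```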
Troubleshooting
Job stuck in pending: check the server logs with kubectl logs deployment/evalhub-server, or run locally with debug logging.

Model server not responding: verify the model endpoint is reachable from the adapter pod (curl http://model-server:8000/v1/models).
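The same two checks as concrete commands; the deployment name comes from the tip above, while the throwaway pod and model Service host are placeholders for your environment:

```bash
# Inspect recent EvalHub server logs for scheduling errors
kubectl logs deployment/evalhub-server --tail=100

# Run a one-off pod inside the cluster to confirm the model endpoint answers
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://model-server:8000/v1/models
```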