Quick Start
Run your first evaluation with EvalHub.
1. Start a Model Server

You need an OpenAI-compatible model endpoint. Pick one:

Kubernetes (vLLM):

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: [--model, meta-llama/Llama-3.2-1B-Instruct, --port, "8000"]
        ports:
        - containerPort: 8000
EOF
```

Local (Ollama):

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:1.5b
```

Ollama serves at http://localhost:11434/v1 (OpenAI-compatible).
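To confirm the endpoint is up before moving on, you can query the standard OpenAI-compatible models route. Shown here against the Ollama default; adjust the host and port if you deployed vLLM instead:

```bash
# Lists the models the endpoint serves; a JSON response means the server is ready
curl -s http://localhost:11434/v1/models
```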
2. Install a Client

CLI:

```bash
pip install "eval-hub-sdk[cli]"
```

Python SDK:

```bash
pip install "eval-hub-sdk[client]"
```

REST API: no installation needed; use curl directly against the REST API.
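Whichever client you pick, point it at your EvalHub server. A quick reachability check, assuming the EVALHUB_URL environment variable holds the server's base URL (the same variable used in the examples below):

```bash
# The providers endpoint (used later in this guide) doubles as a connectivity check
curl -s $EVALHUB_URL/api/v1/evaluations/providers | jq .
```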
3. Submit an Evaluation

CLI:

```bash
evalhub eval run \
  --name quickstart-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider guidellm \
  --benchmark quick_perf_test
# Job submitted: eval-a1b2c3d4
```

Python SDK:

```python
import os

from evalhub import SyncEvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkConfig, JobSubmissionRequest

client = SyncEvalHubClient(base_url=os.environ["EVALHUB_URL"])

job = client.jobs.submit(JobSubmissionRequest(
    name="quickstart-eval",
    model=ModelConfig(url="http://localhost:11434/v1", name="qwen2.5:1.5b"),
    benchmarks=[BenchmarkConfig(
        id="quick_perf_test",
        provider_id="guidellm",
        parameters={
            "profile": "constant",
            "rate": 5,
            "max_seconds": 10,
            "max_requests": 20,
        },
    )],
))

print(f"Job submitted: {job.id}")
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "quickstart-eval",
    "model": {"url": "http://localhost:11434/v1", "name": "qwen2.5:1.5b"},
    "benchmarks": [{
      "id": "quick_perf_test",
      "provider_id": "guidellm",
      "parameters": {
        "profile": "constant",
        "rate": 5,
        "max_seconds": 10,
        "max_requests": 20
      }
    }]
  }'
```
4. Wait for Results

CLI:

```bash
# Block until the job completes
evalhub eval run --config eval.yaml --wait

# Or check status separately
evalhub eval status eval-a1b2c3d4

# View results
evalhub eval results eval-a1b2c3d4
```

Python SDK:

```python
result = client.jobs.wait_for_completion(job.id, timeout=120)
print(f"Status: {result.state}")
print(f"Results: {result.results}")
```

Or poll manually:

```python
status = client.jobs.get(job.id)
print(f"Status: {status.state}")
```

cURL:

```bash
# Check job status
curl -s $EVALHUB_URL/api/v1/evaluations/jobs/eval-a1b2c3d4 | jq .

# Get results once completed
curl -s $EVALHUB_URL/api/v1/evaluations/jobs/eval-a1b2c3d4 \
  | jq '.results'
```
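The CLI example above reads its job definition from eval.yaml, which this guide does not show. As a rough sketch only, one plausible shape mirrors the JSON payload from the cURL example; the exact schema accepted by evalhub eval run --config is an assumption here, so check the CLI reference before relying on it:

```yaml
# Hypothetical eval.yaml: field names copied from the REST payload above,
# but the schema of the --config file is not documented in this quick start.
name: quickstart-eval
model:
  url: http://localhost:11434/v1
  name: qwen2.5:1.5b
benchmarks:
  - id: quick_perf_test
    provider_id: guidellm
    parameters:
      profile: constant
      rate: 5
      max_seconds: 10
      max_requests: 20
```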
Explore Further
List Providers and Benchmarks

CLI:

```bash
evalhub providers list
evalhub providers describe lm_evaluation_harness
```

Python SDK:

```python
for provider in client.providers.list():
    print(f"{provider.resource.id}: {provider.name}")

benchmarks = client.benchmarks.list(provider_id="lm_evaluation_harness")
for b in benchmarks:
    print(f"  {b.id}: {b.name}")
```

cURL:

```bash
# List all providers
curl -s $EVALHUB_URL/api/v1/evaluations/providers | jq .

# Get a specific provider with its benchmarks
curl -s $EVALHUB_URL/api/v1/evaluations/providers/lm_evaluation_harness | jq .
```

Submit a Collection
CLI:

```bash
evalhub collections run healthcare_safety_v1 \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --name llama3-healthcare-eval \
  --wait
```

Python SDK:

```python
from evalhub.models.api import CollectionRef

job = client.jobs.submit(JobSubmissionRequest(
    name="llama3-healthcare-eval",
    model=ModelConfig(url="http://localhost:11434/v1", name="qwen2.5:1.5b"),
    collection=CollectionRef(id="healthcare_safety_v1"),
))
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3-healthcare-eval",
    "model": {"url": "http://localhost:11434/v1", "name": "qwen2.5:1.5b"},
    "collection": {"id": "healthcare_safety_v1"}
  }'
```

Use MLflow Tracking
CLI:

```bash
evalhub eval run \
  --name llama3-mlflow-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --experiment my-experiment
```

Python SDK:

```python
from evalhub.models.api import ExperimentConfig

job = client.jobs.submit(JobSubmissionRequest(
    name="llama3-mlflow-eval",
    model=ModelConfig(url="http://localhost:11434/v1", name="qwen2.5:1.5b"),
    benchmarks=[BenchmarkConfig(id="mmlu", provider_id="lm_evaluation_harness")],
    experiment=ExperimentConfig(name="my-experiment"),
))
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3-mlflow-eval",
    "model": {"url": "http://localhost:11434/v1", "name": "qwen2.5:1.5b"},
    "benchmarks": [{"id": "mmlu", "provider_id": "lm_evaluation_harness"}],
    "experiment": {"name": "my-experiment"}
  }'
```

Export Results
CLI:

```bash
# JSON
evalhub eval results eval-a1b2c3d4 --format json > results.json

# CSV
evalhub eval results eval-a1b2c3d4 --format csv > results.csv
```

Python SDK:

```python
result = client.jobs.wait_for_completion(job.id, timeout=3600)

# Access results directly
for r in result.results.benchmarks:
    print(f"{r.id}: {r.metrics}")
```

cURL:

```bash
curl -s $EVALHUB_URL/api/v1/evaluations/jobs/eval-a1b2c3d4 \
  | jq '.results'
```

Export Results to OCI Registry
Section titled “Export Results to OCI Registry”EvalHub can persist evaluation result files as OCI Artifacts to an OCI Registry. Each result file is stored as a separate layer in the OCI Artifact, allowing consumers to selectively pull only the content they need (e.g. a summary JSON, individual adapter output files, or all files).
First, create a Kubernetes Secret with your registry credentials (an "OCI Connection"):

```yaml
kind: Secret
apiVersion: v1
type: kubernetes.io/dockerconfigjson
metadata:
  name: my-oci-credentials
data:
  .dockerconfigjson: <base64-encoded docker config>
```
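If you would rather not hand-encode the docker config, kubectl can generate an equivalent dockerconfigjson Secret; the credentials below are placeholders:

```bash
# Produces a Secret of type kubernetes.io/dockerconfigjson named my-oci-credentials
kubectl create secret docker-registry my-oci-credentials \
  --docker-server=quay.io \
  --docker-username=<username> \
  --docker-password=<password>
```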
Then reference it in the job submission using the exports.oci field:

CLI:

```bash
evalhub eval run \
  --name my-eval-job1 \
  --model-url http://vllm-llama3-8b-instruct-svc.evalhub-test.svc.cluster.local:8000/v1 \
  --model-name meta-llama/Llama-3.2-1B-Instruct \
  --provider demo \
  --benchmark demo_benchmark \
  --oci-host quay.io \
  --oci-repository myorg/myartifact \
  --oci-connection my-oci-credentials
```

Python SDK:

```python
from evalhub.models.api import EvaluationExports, EvaluationExportsOCI, OCICoordinates, OCIConnectionConfig

job = client.jobs.submit(JobSubmissionRequest(
    name="my-eval-job1",
    model=ModelConfig(
        url="http://vllm-llama3-8b-instruct-svc.evalhub-test.svc.cluster.local:8000/v1",
        name="meta-llama/Llama-3.2-1B-Instruct",
    ),
    benchmarks=[BenchmarkConfig(id="demo_benchmark", provider_id="demo")],
    exports=EvaluationExports(
        oci=EvaluationExportsOCI(
            coordinates=OCICoordinates(
                oci_host="quay.io",
                oci_repository="myorg/myartifact",
            ),
            k8s=OCIConnectionConfig(connection="my-oci-credentials"),
        )
    ),
))
```

cURL:

```bash
curl -s -X POST $EVALHUB_URL/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-eval-job1",
    "model": {
      "url": "http://vllm-llama3-8b-instruct-svc.evalhub-test.svc.cluster.local:8000/v1",
      "name": "meta-llama/Llama-3.2-1B-Instruct"
    },
    "benchmarks": [{"id": "demo_benchmark", "provider_id": "demo"}],
    "exports": {
      "oci": {
        "coordinates": {"oci_host": "quay.io", "oci_repository": "myorg/myartifact"},
        "k8s": {"connection": "my-oci-credentials"}
      }
    }
  }'
```

When the evaluation completes, the adapter pushes an OCI Artifact to the specified repository (e.g. quay.io/myorg/myartifact). The k8s.connection field references the name of the Kubernetes Secret containing the registry credentials.
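Once pushed, the artifact can be inspected and fetched with any OCI-aware client. A sketch using the oras CLI; the :latest tag is an assumption, so use whatever reference the completed job reports:

```bash
# Show the layers (one per result file) recorded in the artifact manifest
oras manifest fetch quay.io/myorg/myartifact:latest | jq '.layers'

# Pull every result file into a local directory
oras pull quay.io/myorg/myartifact:latest -o ./results
```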
Troubleshooting
Job stuck in pending: check the server logs with kubectl logs deployment/evalhub-server, or run locally with debug logging.

Model server not responding: verify the model endpoint is reachable from the adapter pod (curl http://model-server:8000/v1/models).
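The same two checks as concrete commands; the deployment name comes from the tip above, while the throwaway pod and model Service host are placeholders for your environment:

```bash
# Inspect recent EvalHub server logs for scheduling errors
kubectl logs deployment/evalhub-server --tail=100

# Run a one-off pod inside the cluster to confirm the model endpoint answers
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://model-server:8000/v1/models
```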