Overview¶
EvalHub is an evaluation orchestration platform for Large Language Models, consisting of three integrated components.
The EvalHub Ecosystem¶
EvalHub Server¶
Go REST API service that manages evaluation workflows.
- Versioned REST API (v1) with OpenAPI specification
- Kubernetes Job orchestration and lifecycle management
- Provider registry and benchmark discovery
- MLflow experiment tracking integration
- SQLite (local) or PostgreSQL (production) storage
- Prometheus metrics and OpenTelemetry tracing
EvalHub SDK¶
Python SDK providing three packages:
evalhub.client - Submit evaluations:
```python
from evalhub import SyncEvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkConfig, JobSubmissionRequest

client = SyncEvalHubClient(base_url="http://localhost:8080")
job = client.jobs.submit(JobSubmissionRequest(
    model=ModelConfig(url="http://vllm:8000/v1", name="llama-3-8b"),
    benchmarks=[BenchmarkConfig(id="mmlu", provider_id="lm_evaluation_harness")],
))
```
evalhub.adapter - Build framework adapters:
```python
from evalhub.adapter import FrameworkAdapter, JobSpec, JobResults, JobCallbacks

class MyAdapter(FrameworkAdapter):
    def run_benchmark_job(self, config: JobSpec, callbacks: JobCallbacks) -> JobResults:
        ...
```
evalhub.models - Shared data structures such as ModelConfig, BenchmarkConfig, and JobSubmissionRequest, used by both the client and adapters.
EvalHub Contrib¶
Community-contributed framework adapters packaged as container images.
| Adapter | Image | Purpose |
|---|---|---|
| LightEval | quay.io/eval-hub/community-lighteval:latest | Language model benchmarks |
| GuideLLM | quay.io/eval-hub/community-guidellm:latest | Performance benchmarking |
| MTEB | quay.io/eval-hub/community-mteb:latest | Embedding model evaluation |
System Architecture¶
Data Flow¶
- Client submits evaluation via SDK or REST API
- Server creates Kubernetes Job with adapter container and sidecar
- ConfigMap mounted with JobSpec at /meta/job.json
- Adapter loads JobSpec, runs evaluation, reports progress via callbacks
- Sidecar forwards events to the server via POST /api/v1/evaluations/jobs/{id}/events
- Adapter persists artifacts to OCI registry and logs metrics to MLflow
- Server stores results in PostgreSQL, returns status to client
Core Concepts¶
Providers¶
Evaluation providers represent evaluation frameworks. Each provider exposes a set of benchmarks and the container image that runs them.
Built-in providers: lm_evaluation_harness, garak, guidellm, lighteval. Custom providers can be registered via YAML configuration or the REST API.
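A YAML registration for a custom provider might look like the following sketch; the field names here are assumptions based on the provider model described above (an id, an image, a benchmark list), not the server's actual schema.

```yaml
# Hypothetical provider registration; field names are illustrative.
provider_id: my_framework
image: quay.io/myorg/my-framework-adapter:latest
benchmarks:
  - id: my_benchmark
    description: Custom domain evaluation
```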
Benchmarks¶
Benchmarks are specific evaluation tasks within a provider. Examples: mmlu (lm_evaluation_harness), hellaswag (lighteval), sweep (guidellm).
Collections¶
Curated sets of benchmarks with weighted scoring, enabling domain-specific evaluation with a single API call.
```yaml
collection_id: healthcare_safety_v1
benchmarks:
  - benchmark_id: mmlu_medical
    provider_id: lm_evaluation_harness
    weight: 2.0
  - benchmark_id: toxicity
    provider_id: garak
    weight: 1.0
```
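The weights above imply a weighted aggregate across benchmarks. One plausible scoring rule is a weight-normalized average, sketched below; this formula is an assumption for illustration, and the server may aggregate differently.

```python
def collection_score(results: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (scores assumed in [0, 1])."""
    total_weight = sum(weights[b] for b in results)
    return sum(results[b] * weights[b] for b in results) / total_weight

# With the example collection's weights (mmlu_medical: 2.0, toxicity: 1.0):
score = collection_score(
    {"mmlu_medical": 0.8, "toxicity": 0.6},
    {"mmlu_medical": 2.0, "toxicity": 1.0},
)
# (0.8 * 2.0 + 0.6 * 1.0) / 3.0 ≈ 0.7333
```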
Jobs¶
Kubernetes Jobs that execute evaluations in isolated pods. Each job has an adapter container (runs the framework) and a sidecar (forwards status events to the server).
Adapters¶
Containerised applications implementing the FrameworkAdapter interface from the SDK. Each adapter loads a JobSpec, runs the evaluation framework, reports progress via callbacks, persists artifacts to OCI, and returns structured results.
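The lifecycle above can be sketched end to end. The dataclasses below are simplified stand-ins for the SDK's JobSpec, JobCallbacks, and JobResults (the real types live in evalhub.adapter and carry more fields), so the example is self-contained and everything in it is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    """Stand-in for the SDK's JobSpec, as loaded from /meta/job.json."""
    benchmark_id: str
    model_url: str

@dataclass
class JobCallbacks:
    """Stand-in callbacks object; records progress events the sidecar would forward."""
    events: list = field(default_factory=list)

    def on_progress(self, fraction: float, message: str) -> None:
        self.events.append((fraction, message))

@dataclass
class JobResults:
    """Stand-in for the structured results an adapter returns."""
    benchmark_id: str
    metrics: dict

class EchoAdapter:
    """Toy adapter (a real one subclasses FrameworkAdapter): report start,
    run the framework, report completion, return results."""

    def run_benchmark_job(self, config: JobSpec, callbacks: JobCallbacks) -> JobResults:
        callbacks.on_progress(0.0, f"starting {config.benchmark_id}")
        # ... invoke the evaluation framework and persist artifacts to OCI here ...
        callbacks.on_progress(1.0, "done")
        return JobResults(benchmark_id=config.benchmark_id, metrics={"accuracy": 0.0})
```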
Next Steps¶
- Installation - Install server and SDK components
- Quick Start - Run your first evaluation
- Architecture - Adapter technical architecture
- Python SDK - Client and adapter reference