EvalHub¶
Open source evaluation orchestration platform for Large Language Models.
What is EvalHub?¶
EvalHub is an evaluation orchestration platform for systematic LLM evaluation. It supports local development and Kubernetes-native deployment at scale. The platform consists of three components:
- EvalHub Server: Go REST API service for managing evaluation workflows
- EvalHub SDK: Python SDK for submitting evaluations and building adapters
- EvalHub Contrib: Community-contributed framework adapters
Architecture Overview¶
| Component | Description | Technology |
|---|---|---|
| Server | REST API, job orchestration, provider management | Go, SQLite / PostgreSQL |
| SDK | Client library, adapter framework, data models | Python 3.11+ |
| Contrib | Community framework adapters | Python containers (UBI9) |
| Jobs | Isolated evaluation execution | Kubernetes Jobs |
| Registry | Immutable artifact storage | OCI registries |
Quick Start¶
from evalhub import SyncEvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkConfig, JobSubmissionRequest
client = SyncEvalHubClient(base_url="http://localhost:8080")
job = client.jobs.submit(JobSubmissionRequest(
model=ModelConfig(
url="http://localhost:11434/v1",
name="qwen2.5:1.5b"
),
benchmarks=[
BenchmarkConfig(id="mmlu", provider_id="lm_evaluation_harness")
]
))
result = client.jobs.wait_for_completion(job.id)
Key Features¶
- Versioned REST API (v1) with OpenAPI specification
- Provider registry with benchmark discovery
- Benchmark collections with weighted scoring
- Kubernetes-native job orchestration
- MLflow integration for experiment tracking
- OCI artifact persistence for evaluation results
- Adapter pattern for "Bring Your Own Framework" (BYOF) extensibility
- Prometheus metrics and OpenTelemetry tracing
Available Adapters¶
| Adapter | Provider | Metrics |
|---|---|---|
| LightEval | lighteval |
accuracy, acc_norm, exact_match |
| GuideLLM | guidellm |
TTFT, ITL, throughput, latency |
| MTEB | mteb |
STS, retrieval, classification |
| lm-eval-harness | lm_evaluation_harness |
167 benchmarks across 12 categories |
| Garak | garak |
OWASP Top 10, vulnerability scanning |
Next Steps¶
- Overview - Architecture and core concepts
- Installation - Install server and SDK
- Quick Start - Run your first evaluation
- Model authentication - API key, CA cert, and ServiceAccount token
- Using custom data - Using custom benchmark test data
- Python SDK Reference - Client and adapter API
License¶
Apache 2.0 - see LICENSE.