## System Overview
EvalHub consists of five components that work together to orchestrate LLM evaluations:
| Component | Description | Technology |
|---|---|---|
| Server | REST API, job orchestration, provider management | Go, SQLite / PostgreSQL |
| SDK | Client library, adapter framework, data models | Python 3.11+ |
| Contrib | Community framework adapters | Python containers (UBI9) |
| Jobs | Isolated evaluation execution | Kubernetes Jobs |
| Registry | Immutable artifact storage | OCI registries |
## Data Flow

- Client submits an evaluation via the SDK or REST API (see the sketch after this list)
- Server creates a Kubernetes Job with an adapter container and a sidecar
- ConfigMap is mounted with the JobSpec at `/meta/job.json`
- Adapter loads the JobSpec, runs the evaluation, and reports progress via callbacks
- Sidecar forwards events to the server via `POST /api/v1/evaluations/jobs/{id}/events`
- Adapter persists artifacts to the OCI registry and logs metrics to MLflow
- Server stores results in PostgreSQL and returns status to the client
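To make the submission step concrete, here is a minimal sketch using `requests`. The `/api/v1/evaluations/jobs` submission endpoint and the payload shape are assumptions for illustration (only the events endpoint is documented on this page); the field names are borrowed from the JobSpec described under Key Abstractions.

```python
import requests

# Hypothetical submission sketch: the endpoint path and payload schema are
# assumptions, not the documented EvalHub API contract.
EVALHUB_URL = "http://evalhub.example.com"  # assumed server address

resp = requests.post(
    f"{EVALHUB_URL}/api/v1/evaluations/jobs",  # assumed endpoint
    json={
        "provider_id": "lm_evaluation_harness",
        "benchmark_id": "mmlu",
        "model": {"url": "http://model.example.com/v1", "name": "my-model"},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. the created job id and its initial status
```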
## Core Concepts

### Providers

Evaluation providers represent evaluation frameworks. Each provider has a set of benchmarks and a container image that runs its evaluations.

Built-in providers: `lm_evaluation_harness`, `garak`, `guidellm`, `lighteval`. Custom providers can be registered via YAML configuration or the REST API, as sketched below.
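As an illustration, registering a custom provider over the REST API might look like the following sketch. The `/api/v1/providers` endpoint and the payload fields are hypothetical, shown only to convey the shape of a provider (an identifier, its benchmarks, and an adapter container image):

```python
import requests

# Hypothetical registration sketch: the endpoint and payload schema are
# assumptions, not the documented EvalHub API.
resp = requests.post(
    "http://evalhub.example.com/api/v1/providers",  # assumed endpoint
    json={
        "provider_id": "my_framework",
        "image": "quay.io/example/my-framework-adapter:latest",  # adapter image
        "benchmarks": ["my_benchmark"],
    },
    timeout=30,
)
resp.raise_for_status()
```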
### Benchmarks

Benchmarks are specific evaluation tasks within a provider. Examples: `mmlu` (`lm_evaluation_harness`), `hellaswag` (`lighteval`), `sweep` (`guidellm`).
### Collections

Curated sets of benchmarks with weighted scoring, enabling domain-specific evaluation with a single API call:

```yaml
collection_id: healthcare_safety_v1
benchmarks:
  - benchmark_id: mmlu_medical
    provider_id: lm_evaluation_harness
    weight: 2.0
  - benchmark_id: toxicity
    provider_id: garak
    weight: 1.0
```

### Jobs

Kubernetes Jobs that execute evaluations in isolated pods. Each job has an adapter container (runs the framework) and a sidecar (forwards status events to the server).
### Adapters

Containerised applications implementing the `FrameworkAdapter` interface from the SDK. Each adapter loads a JobSpec, runs the evaluation framework, reports progress via callbacks, persists artifacts to OCI, and returns structured results.
## Adapter Pattern

All adapters implement the `FrameworkAdapter` interface from the SDK:

```python
from evalhub.adapter import FrameworkAdapter, JobSpec, JobResults, JobCallbacks
from evalhub.adapter.models import (
    JobStatusUpdate,
    JobStatus,
    JobPhase,
    MessageInfo,
    OCIArtifactSpec,
)


class MyAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self, config: JobSpec, callbacks: JobCallbacks
    ) -> JobResults:
        # Report progress; the sidecar forwards this event to the server
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.RUNNING_EVALUATION,
            progress=0.5,
            message=MessageInfo(
                message="Running evaluation",
                message_code="running_evaluation",
            ),
        ))

        # Run the framework and parse its raw output into metrics
        raw_results = self._run_evaluation(config)
        metrics = self._parse_results(raw_results)

        # Persist output files to the OCI registry
        oci_result = callbacks.create_oci_artifact(OCIArtifactSpec(
            files_path=self._output_dir,
            coordinates=config.exports.oci.coordinates,
        ))

        return JobResults(
            id=config.id,
            benchmark_id=config.benchmark_id,
            benchmark_index=config.benchmark_index,
            model_name=config.model.name,
            results=metrics,
            overall_score=self._calculate_score(metrics),
            num_examples_evaluated=len(metrics),
            duration_seconds=self._get_duration(),
            oci_artifact=oci_result,
        )
```

The adapter entrypoint creates callbacks and reports results:

```python
from evalhub.adapter import DefaultCallbacks  # import path assumed


def main():
    adapter = MyAdapter()
    callbacks = DefaultCallbacks.from_adapter(adapter)
    results = adapter.run_benchmark_job(adapter.job_spec, callbacks)

    # Optional: persist metrics to MLflow
    run_id = callbacks.mlflow.save(results, adapter.job_spec)
    if run_id:
        results.mlflow_run_id = run_id

    callbacks.report_results(results)


if __name__ == "__main__":
    main()
```

## Component Diagram
## Data Flow

### 1. Initialisation

The Job pod starts; the ConfigMap containing the JobSpec is mounted at `/meta/job.json` and loaded by the adapter.

### 2. Execution

The adapter runs the evaluation framework against the target model, reporting status events through the sidecar.

### 3. Artifact Persistence

The adapter pushes output artifacts to the OCI registry, optionally logs metrics to MLflow, and reports final results to the server.

## Key Abstractions

### JobSpec
Job configuration loaded from `/meta/job.json` (mounted ConfigMap):

```python
class JobSpec:
    id: str
    provider_id: str
    benchmark_id: str
    benchmark_index: int
    model: ModelConfig               # url, name, auth
    parameters: dict                 # adapter-specific parameters
    callback_url: str                # sidecar base URL
    num_examples: int | None
    experiment_name: str | None      # MLflow experiment
    tags: list[dict]
    exports: JobSpecExports | None   # OCI coordinates
```
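For local testing, the spec can be read straight from the mounted file. A minimal sketch, assuming the JSON keys mirror the field names above (the SDK's own loader is not shown on this page):

```python
import json
import os
from pathlib import Path

# Read the ConfigMap-mounted spec; the default path matches the
# EVALHUB_JOB_SPEC_PATH setting documented under AdapterSettings.
spec_path = os.environ.get("EVALHUB_JOB_SPEC_PATH", "/meta/job.json")
spec = json.loads(Path(spec_path).read_text())

print(spec["benchmark_id"], spec["model"]["name"])  # keys assumed from JobSpec
```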
### JobResults

Structured results returned by the adapter:

```python
class JobResults:
    id: str
    benchmark_id: str
    benchmark_index: int
    model_name: str
    results: list[EvaluationResult]
    overall_score: float | None
    num_examples_evaluated: int
    duration_seconds: float
    completed_at: datetime
    oci_artifact: OCIArtifactResult | None
```
### JobCallbacks

Communication interface for the adapter:

```python
class JobCallbacks(ABC):
    def report_status(self, update: JobStatusUpdate) -> None:
        """Send status update via POST /api/v1/evaluations/jobs/{id}/events"""

    def create_oci_artifact(self, spec: OCIArtifactSpec) -> OCIArtifactResult:
        """Push artifacts to OCI registry using oras/olot"""

    def report_results(self, results: JobResults) -> None:
        """Send final results via POST /api/v1/evaluations/jobs/{id}/events"""
```
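Under the hood, a callbacks implementation only needs to POST JSON events to the sidecar. A rough sketch, assuming `callback_url` is the sidecar base URL from the JobSpec and that the event body is the serialised status update (the exact event schema is an assumption):

```python
import requests

def report_status(callback_url: str, job_id: str, update: dict) -> None:
    # POST a status event to the sidecar, which forwards it to the server.
    # Endpoint path is documented above; the payload shape is illustrative.
    resp = requests.post(
        f"{callback_url}/api/v1/evaluations/jobs/{job_id}/events",
        json=update,
        timeout=10,
    )
    resp.raise_for_status()
```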
### DefaultCallbacks

The SDK provides `DefaultCallbacks`, which:

- Sends status events to the sidecar via HTTP POST
- Pushes OCI artifacts using `oras` and `olot`
- Logs metrics to MLflow when `experiment_name` is set
- Handles auth via ServiceAccount tokens or explicit tokens
### AdapterSettings

Environment-based configuration loaded via pydantic-settings:

- `EVALHUB_MODE`: `k8s` or `local` (default `local`)
- `EVALHUB_JOB_SPEC_PATH`: path to the job spec (default `/meta/job.json` in k8s, `meta/job.json` locally)
- `OCI_AUTH_CONFIG_PATH`: Docker config for OCI registry auth
- `OCI_INSECURE`: skip TLS verification for the OCI registry
- `EVALHUB_AUTH_TOKEN_PATH`: path to the ServiceAccount token file
- `EVALHUB_CA_BUNDLE_PATH`: path to the CA bundle for TLS
- `EVALHUB_INSECURE`: skip TLS verification for the EvalHub connection
- `EVALHUB_MLFLOW_BACKEND`: `odh` (default) or `upstream`
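As a sketch of how such a settings class might look with pydantic-settings (field names, types, and defaults here are assumptions mirroring the documented variables, not the SDK's actual definitions):

```python
from pydantic_settings import BaseSettings

class AdapterSettings(BaseSettings):
    # Illustrative sketch only: pydantic-settings matches fields to
    # environment variables case-insensitively by field name.
    evalhub_mode: str = "local"                   # EVALHUB_MODE: "k8s" or "local"
    evalhub_job_spec_path: str = "meta/job.json"  # EVALHUB_JOB_SPEC_PATH
    oci_auth_config_path: str | None = None       # OCI_AUTH_CONFIG_PATH
    oci_insecure: bool = False                    # OCI_INSECURE
    evalhub_auth_token_path: str | None = None    # EVALHUB_AUTH_TOKEN_PATH
    evalhub_ca_bundle_path: str | None = None     # EVALHUB_CA_BUNDLE_PATH
    evalhub_insecure: bool = False                # EVALHUB_INSECURE
    evalhub_mlflow_backend: str = "odh"           # EVALHUB_MLFLOW_BACKEND

settings = AdapterSettings()  # reads values from the environment
```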
## Adapter Container Images

Adapters are built as UBI9 Python containers with a standard layout:

```
adapters/<name>/
├── main.py           # Entrypoint with the FrameworkAdapter implementation
├── requirements.txt  # eval-hub-sdk[adapter] + framework dependencies
├── Containerfile     # UBI9 Python, entrypoint: python main.py
└── meta/
    └── job.json      # Example job spec for local testing
```