Overview

EvalHub is an evaluation orchestration platform for Large Language Models, consisting of three integrated components.

The EvalHub Ecosystem

EvalHub Server

Go REST API service that manages evaluation workflows.

  • Versioned REST API (v1) with OpenAPI specification
  • Kubernetes Job orchestration and lifecycle management
  • Provider registry and benchmark discovery
  • MLflow experiment tracking integration
  • SQLite (local) or PostgreSQL (production) storage
  • Prometheus metrics and OpenTelemetry tracing

EvalHub SDK

Python SDK providing three packages:

evalhub.client - Submit evaluations:

from evalhub import SyncEvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkConfig, JobSubmissionRequest

client = SyncEvalHubClient(base_url="http://localhost:8080")
job = client.jobs.submit(JobSubmissionRequest(
    model=ModelConfig(url="http://vllm:8000/v1", name="llama-3-8b"),
    benchmarks=[BenchmarkConfig(id="mmlu", provider_id="lm_evaluation_harness")]
))

evalhub.adapter - Build framework adapters:

from evalhub.adapter import FrameworkAdapter, JobSpec, JobResults, JobCallbacks

class MyAdapter(FrameworkAdapter):
    def run_benchmark_job(self, config: JobSpec, callbacks: JobCallbacks) -> JobResults:
        ...
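To illustrate the control flow an adapter implements, here is a toy sketch using plain stand-in classes rather than the real SDK types (the stand-in fields, the `on_progress` callback, and the returned score are illustrative assumptions, not the SDK's actual API):

```python
from dataclasses import dataclass

# Stand-ins for the SDK's JobSpec / JobCallbacks / JobResults types
# (illustrative only; the real classes live in evalhub.adapter).
@dataclass
class JobSpec:
    benchmark_id: str
    model_url: str

@dataclass
class JobResults:
    scores: dict

class JobCallbacks:
    def on_progress(self, pct: float, message: str) -> None:
        # A real callback would forward this to the sidecar/server.
        print(f"[{pct:>5.1f}%] {message}")

class EchoAdapter:
    """Toy adapter: 'runs' a benchmark and returns a fixed score."""
    def run_benchmark_job(self, config: JobSpec, callbacks: JobCallbacks) -> JobResults:
        callbacks.on_progress(0.0, f"starting {config.benchmark_id}")
        # ... invoke the actual evaluation framework here ...
        callbacks.on_progress(100.0, "done")
        return JobResults(scores={config.benchmark_id: 0.42})

results = EchoAdapter().run_benchmark_job(
    JobSpec(benchmark_id="mmlu", model_url="http://vllm:8000/v1"),
    JobCallbacks(),
)
print(results.scores)
```

The essential contract is the same as in the snippet above: receive a spec, report progress through callbacks, return structured results.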

evalhub.models - Shared data structures:

from evalhub.models.api import ModelConfig, BenchmarkConfig, EvaluationJob

EvalHub Contrib

Community-contributed framework adapters packaged as container images.

Adapter     Image                                          Purpose
LightEval   quay.io/eval-hub/community-lighteval:latest    Language model benchmarks
GuideLLM    quay.io/eval-hub/community-guidellm:latest     Performance benchmarking
MTEB        quay.io/eval-hub/community-mteb:latest         Embedding model evaluation

System Architecture

[Diagram: EvalHub system architecture]

Data Flow

  1. Client submits evaluation via SDK or REST API
  2. Server creates Kubernetes Job with adapter container and sidecar
  3. A ConfigMap containing the JobSpec is mounted at /meta/job.json
  4. Adapter loads JobSpec, runs evaluation, reports progress via callbacks
  5. Sidecar forwards events to the server via POST /api/v1/evaluations/jobs/{id}/events
  6. Adapter persists artifacts to OCI registry and logs metrics to MLflow
  7. Server stores results in PostgreSQL, returns status to client
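Steps 3–4 can be sketched in a few lines: on startup the adapter simply reads the mounted JobSpec JSON. The snippet below simulates the ConfigMap mount with a temporary file, and the field names in the spec are illustrative assumptions:

```python
import json
import os
import tempfile

# Illustrative JobSpec payload; real field names may differ.
spec = {
    "benchmark_id": "mmlu",
    "provider_id": "lm_evaluation_harness",
    "model": {"url": "http://vllm:8000/v1", "name": "llama-3-8b"},
}

with tempfile.TemporaryDirectory() as meta_dir:
    # In a real pod the ConfigMap is mounted at /meta/job.json;
    # here we stand in for the mount with a temp directory.
    job_path = os.path.join(meta_dir, "job.json")
    with open(job_path, "w") as f:
        json.dump(spec, f)

    # What an adapter does on startup:
    with open(job_path) as f:
        job_spec = json.load(f)

print(job_spec["benchmark_id"])
```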

Core Concepts

Providers

Providers represent evaluation frameworks. Each provider exposes a set of benchmarks and the container image that runs them.

Built-in providers: lm_evaluation_harness, garak, guidellm, lighteval. Custom providers can be registered via YAML configuration or the REST API.
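A custom provider registration in YAML might look roughly like this (the field names below are illustrative assumptions; consult the server's schema for the actual format):

```yaml
provider_id: my_framework
name: My Framework
image: quay.io/example/my-framework-adapter:latest   # image name illustrative
benchmarks:
  - benchmark_id: my_task
    description: Custom evaluation task
```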

Benchmarks

Benchmarks are specific evaluation tasks within a provider. Examples: mmlu (lm_evaluation_harness), hellaswag (lighteval), sweep (guidellm).

Collections

Curated sets of benchmarks with weighted scoring, enabling domain-specific evaluation with a single API call.

collection_id: healthcare_safety_v1
benchmarks:
  - benchmark_id: mmlu_medical
    provider_id: lm_evaluation_harness
    weight: 2.0
  - benchmark_id: toxicity
    provider_id: garak
    weight: 1.0
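With the weights above, a collection score can be computed as a weighted average of per-benchmark scores. The aggregation rule and the example scores below are assumptions for illustration; EvalHub may aggregate differently:

```python
# Per-benchmark scores are hypothetical; weights match the collection above.
benchmarks = [
    {"benchmark_id": "mmlu_medical", "weight": 2.0, "score": 0.80},
    {"benchmark_id": "toxicity",     "weight": 1.0, "score": 0.90},
]

total_weight = sum(b["weight"] for b in benchmarks)
collection_score = sum(b["weight"] * b["score"] for b in benchmarks) / total_weight
print(round(collection_score, 4))  # mmlu_medical counts twice as much as toxicity
```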

Jobs

Kubernetes Jobs that execute evaluations in isolated pods. Each job has an adapter container (runs the framework) and a sidecar (forwards status events to the server).
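The two-container layout can be sketched as a Kubernetes Job manifest (names and images below are illustrative; the sidecar image in particular is an assumption, not a documented artifact):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: evalhub-job-example
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: adapter              # runs the evaluation framework
          image: quay.io/eval-hub/community-lighteval:latest
          volumeMounts:
            - name: meta
              mountPath: /meta       # JobSpec mounted at /meta/job.json
        - name: sidecar              # forwards status events to the server
          image: quay.io/eval-hub/sidecar:latest   # image name illustrative
      volumes:
        - name: meta
          configMap:
            name: evalhub-job-example-spec
```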

Adapters

Containerised applications implementing the FrameworkAdapter interface from the SDK. Each adapter loads a JobSpec, runs the evaluation framework, reports progress via callbacks, persists artifacts to OCI, and returns structured results.

Next Steps