
EvalHub

Open source evaluation orchestration platform for Large Language Models.

What is EvalHub?

EvalHub is an evaluation orchestration platform designed for systematic LLM evaluation. It supports both local development and Kubernetes-native deployment at scale. The platform consists of three components:

  • EvalHub Server: REST API orchestration service for managing evaluation workflows
  • EvalHub SDK: Python SDK for submitting evaluations and building adapters
  • EvalHub Contrib: Community-contributed framework adapters

Core Problem Solved

LLM evaluation involves coordinating multiple frameworks, managing evaluation jobs, tracking results, and handling diverse deployment environments. EvalHub solves this by providing:

  • Unified evaluation API: Submit evaluations across frameworks using a consistent interface (see the sketch after this list)
  • Kubernetes-native orchestration: Automatic job lifecycle management, scaling, and resource isolation
  • Framework adapter pattern: "Bring Your Own Framework" (BYOF) approach with standardised integration
  • Multi-environment support: Deploy locally for development or on OpenShift for production
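
For example, the same SDK call targets different frameworks simply by changing the provider and benchmark identifiers. A minimal sketch reusing the identifiers from the use cases further down this page; the Quick Start section covers the full workflow:

from evalhub.client import EvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkSpec

client = EvalHubClient(base_url="http://localhost:8080")
model = ModelConfig(url="http://vllm-server:8000", name="llama-3-8b")

# Accuracy benchmark through the lm_evaluation_harness provider...
accuracy_job = client.submit_evaluation(
    model=model,
    benchmarks=[BenchmarkSpec(benchmark_id="mmlu", provider_id="lm_evaluation_harness")],
)

# ...and a load test through the guidellm provider, using the same call.
performance_job = client.submit_evaluation(
    model=model,
    benchmarks=[BenchmarkSpec(benchmark_id="performance_test", provider_id="guidellm")],
)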

Architecture Overview

graph TB
    subgraph client["Client Applications"]
        PY[Python SDK Client]
        API[Direct REST API]
    end

    subgraph server["EvalHub Server"]
        REST[REST API Server]
        ORCH[Job Orchestrator]
        PROV[Provider Registry]
    end

    subgraph k8s["Kubernetes / OpenShift"]
        JOB[Evaluation Jobs]
        ADAPT[Framework Adapters]
        SIDE[Sidecar Containers]
    end

    subgraph storage["Storage"]
        OCI[OCI Registry]
        DB[(PostgreSQL)]
    end

    PY --> REST
    API --> REST
    REST --> ORCH
    REST --> PROV
    ORCH --> JOB
    JOB --> ADAPT
    JOB --> SIDE
    ADAPT --> SIDE
    SIDE --> OCI
    REST --> DB

    style server fill:#e3f2fd
    style k8s fill:#fff3e0
    style storage fill:#f3e5f5

Component Responsibilities

| Component | Description | Technology |
|-----------|-------------|------------|
| Server | REST API, job orchestration, provider management | Go; SQLite (local) / PostgreSQL (production) |
| SDK | Client library, adapter framework, data models | Python 3.12+ |
| Contrib | Community framework adapters (LightEval, GuideLLM) | Python containers |
| Jobs | Isolated evaluation execution environments | Kubernetes Jobs |
| Registry | Immutable artifact storage | OCI registries |

Key Features

Server Features

  • REST API: Versioned API (v1) with OpenAPI specification
  • Provider management: Discover evaluation providers and benchmarks (see the REST sketch after this list)
  • Collection management: Curated benchmark collections with weighted scoring
  • Job orchestration: Kubernetes Job lifecycle management
  • Persistent storage: SQLite for local development, PostgreSQL for production
  • Prometheus metrics: Production-ready observability
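
Because the API is versioned and described by an OpenAPI specification, providers can also be listed without the SDK. A minimal sketch using the requests library; the /v1/providers path is an assumption, so consult the OpenAPI document served by your deployment for the authoritative routes:

import requests

BASE_URL = "http://localhost:8080"

# NOTE: the /v1/providers path is an assumption based on the versioned (v1) API;
# check the server's OpenAPI specification for the exact route.
response = requests.get(f"{BASE_URL}/v1/providers", timeout=10)
response.raise_for_status()
print(response.json())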

SDK Features

Client SDK (evalhub.client):

  • Submit evaluations to EvalHub service
  • Discover providers, benchmarks, and collections
  • Monitor job status and retrieve results
  • Async/await support for non-blocking workflows (see the sketch below)
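
The example below shows one way to keep submissions non-blocking by running the synchronous client calls off the event loop with asyncio.to_thread. It is a generic sketch, not the SDK's native async API, and the model endpoints are illustrative:

import asyncio

from evalhub.client import EvalHubClient
from evalhub.models.api import BenchmarkSpec, ModelConfig

async def evaluate(name: str, url: str) -> None:
    # Run the synchronous client call in a worker thread so several
    # submissions can proceed concurrently without blocking the event loop.
    client = EvalHubClient(base_url="http://localhost:8080")
    job = await asyncio.to_thread(
        client.submit_evaluation,
        model=ModelConfig(url=url, name=name),
        benchmarks=[BenchmarkSpec(benchmark_id="mmlu", provider_id="lm_evaluation_harness")],
    )
    print(f"{name}: submitted job {job.job_id}")

async def main() -> None:
    # Hypothetical model endpoints, for illustration only.
    await asyncio.gather(
        evaluate("llama-3-8b", "http://vllm-a:8000"),
        evaluate("mistral-7b", "http://vllm-b:8000"),
    )

asyncio.run(main())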

Adapter SDK (evalhub.adapter):

  • Base class for framework integration
  • Callback interface for status reporting
  • OCI artifact persistence
  • Settings-based configuration

Core Models (evalhub.models):

  • Shared data structures
  • Request/response schemas
  • API model validation (see the sketch after this list)
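
These are the same models used throughout the examples on this page, and malformed requests are rejected before they reach the server. A sketch only: whether provider_id is strictly required and which exception type is raised depend on the models' validation backend, so both are assumptions here:

from evalhub.models.api import BenchmarkSpec, ModelConfig

# Well-formed request models, as passed to submit_evaluation().
model = ModelConfig(url="http://localhost:8000/v1", name="llama-3-8b")
spec = BenchmarkSpec(benchmark_id="gsm8k", provider_id="lm_evaluation_harness")

# Invalid input is rejected at construction time.
# NOTE: the exception type (and whether provider_id is required) is an
# assumption about the validation backend, hence the broad except.
try:
    BenchmarkSpec(benchmark_id="gsm8k")
except Exception as err:
    print(f"validation failed: {err}")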

Contrib Features

  • LightEval: Language model evaluation (HellaSwag, ARC, MMLU, TruthfulQA, GSM8K)
  • GuideLLM: Performance benchmarking, including time to first token (TTFT), inter-token latency (ITL), throughput, and request latency
  • Containerised: Production-ready container images

Quick Start

Submit an Evaluation (Python SDK)

from evalhub.client import EvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkSpec

# Connect to EvalHub service
client = EvalHubClient(base_url="http://localhost:8080")

# Submit evaluation
job = client.submit_evaluation(
    model=ModelConfig(
        url="https://api.openai.com/v1",
        name="gpt-4"
    ),
    benchmarks=[
        BenchmarkSpec(
            benchmark_id="mmlu",
            provider_id="lm_evaluation_harness"
        )
    ]
)

# Monitor progress
status = client.get_job_status(job.job_id)
print(f"Status: {status.status}, Progress: {status.progress}")

Deploy on OpenShift

# Apply EvalHub manifests
oc apply -k config/openshift/

# Verify deployment
oc get pods -n eval-hub
oc logs -f deployment/eval-hub-server
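
Once the server is reachable, for example through an exposed route or a port-forward, point the SDK at it; the hostname below is illustrative:

from evalhub.client import EvalHubClient

# Replace the URL with your cluster's route or port-forwarded address.
client = EvalHubClient(base_url="https://eval-hub.apps.example.com")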

Build a Framework Adapter

from evalhub.adapter import FrameworkAdapter, JobSpec, JobResults, JobCallbacks
from evalhub.adapter.models import JobStatusUpdate, JobStatus, JobPhase, OCIArtifactSpec

class MyAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self,
        config: JobSpec,
        callbacks: JobCallbacks
    ) -> JobResults:
        # Load your evaluation framework (placeholder for your framework's entry point)
        framework = load_your_framework()

        # Report progress
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.RUNNING_EVALUATION,
            progress=0.5,
            message="Running evaluation"
        ))

        # Run evaluation
        results = framework.evaluate(config.benchmark_id)

        # Persist artifacts
        oci_result = callbacks.create_oci_artifact(OCIArtifactSpec(
            files=["results.json", "report.html"],
            job_id=config.job_id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name
        ))

        # Return results
        return JobResults(
            job_id=config.job_id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name,
            results=results.evaluation_results,
            num_examples_evaluated=len(results.evaluation_results),
            duration_seconds=results.duration,
            oci_artifact=oci_result
        )
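
Before packaging the adapter into a container, it can be exercised locally with stubbed callbacks. This is a sketch under several assumptions: that JobCallbacks can be stubbed by subclassing, that JobSpec accepts the fields the adapter reads above (with a ModelConfig for its model field), and that MyAdapter needs no extra settings to instantiate:

from evalhub.models.api import ModelConfig

class PrintCallbacks(JobCallbacks):
    """Stub callbacks that log locally instead of calling the server or an OCI registry."""

    def report_status(self, update: JobStatusUpdate) -> None:
        print(f"[{update.phase}] {update.progress:.0%} {update.message}")

    def create_oci_artifact(self, spec: OCIArtifactSpec):
        print(f"(local run) skipping artifact push for {spec.files}")
        return None  # no artifact is produced in this smoke test

# Assumed JobSpec fields, mirrored from how the adapter reads them above.
spec = JobSpec(
    job_id="local-smoke-test",
    benchmark_id="my_benchmark",
    model=ModelConfig(url="http://localhost:8000/v1", name="my-model"),
)

results = MyAdapter().run_benchmark_job(spec, PrintCallbacks())
print(results.num_examples_evaluated, results.duration_seconds)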

Use Cases

Model Evaluation

Systematically evaluate LLMs across standardised benchmarks:

# Evaluate model on multiple benchmarks
client.submit_evaluation(
    model=ModelConfig(url="...", name="llama-3-8b"),
    benchmarks=[
        BenchmarkSpec(benchmark_id="mmlu", provider_id="lm_evaluation_harness"),
        BenchmarkSpec(benchmark_id="humaneval", provider_id="lm_evaluation_harness"),
        BenchmarkSpec(benchmark_id="gsm8k", provider_id="lm_evaluation_harness"),
    ]
)

Performance Testing

Benchmark inference server performance under load:

# Test server throughput and latency
client.submit_evaluation(
    model=ModelConfig(url="http://vllm-server:8000", name="llama-3-8b"),
    benchmarks=[
        BenchmarkSpec(
            benchmark_id="performance_test",
            provider_id="guidellm",
            config={
                "profile": "constant",
                "rate": 10,
                "max_seconds": 60
            }
        )
    ]
)

Collection-Based Evaluation

Run curated benchmark collections:

# Evaluate using predefined collection
client.submit_evaluation(
    model=ModelConfig(url="...", name="gpt-4"),
    collection_id="healthcare_safety_v1"  # Expands to multiple benchmarks
)
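
Available collection IDs can be discovered through the client's discovery calls. The method name list_collections is hypothetical, used here only for illustration; see the SDK reference for the actual discovery API:

# NOTE: list_collections() is a hypothetical method name for illustration;
# the client SDK exposes discovery of providers, benchmarks, and collections.
for collection in client.list_collections():
    print(collection)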

License

Apache 2.0 - see LICENSE for details.