
Local Mode Tutorial

since 0.4.2

End-to-end walkthrough: run a LightEval evaluation locally with EvalHub, MLflow experiment tracking, and OCI artifact storage — no Kubernetes required.

This tutorial uses the LightEval adapter from eval-hub-contrib. For background on how local mode works, see the Local Mode guide.

Install the following tools before starting:

  • uv — Python package manager
  • podman (or Docker) — for running the OCI registry
  • ollama (or any OpenAI-compatible LLM server) — for serving a local model

  1. Create the project directory and virtual environment

    mkdir local-lighteval && cd local-lighteval
    uv venv --python 3.12
  2. Install dependencies

    Install eval-hub-server and the evalhub CLI:

    uv pip install "eval-hub-sdk[server,cli]"
  3. Download the LightEval adapter

    Download the adapter driver and its requirements from eval-hub-contrib, then install them:

    curl -o main.py https://raw.githubusercontent.com/eval-hub/eval-hub-contrib/main/adapters/lighteval/main.py
    curl -o requirements.txt https://raw.githubusercontent.com/eval-hub/eval-hub-contrib/main/adapters/lighteval/requirements.txt
    uv pip install -r requirements.txt
  4. Start MLflow (optional)

    Only needed if you want experiment tracking. Skip this step if you just want to run evaluations.

    Install MLflow and start the server:

    uv pip install "mlflow>=3.11"

    source .venv/bin/activate
    mlflow server \
      --backend-store-uri sqlite:///mlflow.db \
      --host localhost \
      --port 5000

    Verify from another terminal:

    curl http://localhost:5000/health

    The MLflow UI dashboard is accessible at http://localhost:5000.

  5. Start the OCI registry (optional)

    Only needed if you want to persist evaluation artifacts to an OCI registry. Skip this step if you just want to run evaluations.

    In another terminal, pull the registry image and start it on localhost:5001:

    podman pull docker.io/library/registry:2
    podman run -d -p 5001:5000 \
      --name eval-hub-oci-registry \
      -e REGISTRY_STORAGE_DELETE_ENABLED=true \
      docker.io/library/registry:2
  6. Start the LLM server

    Pull a model:

    ollama pull llama3.2:3b-instruct-q4_K_M

    Verify that Ollama is serving the model through its OpenAI-compatible API:

    curl -s http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2:3b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 100
      }'
  7. Start eval-hub-server

    Download the template config.yaml from the eval-hub repository:

    mkdir -p config
    curl -o config/config.yaml https://raw.githubusercontent.com/eval-hub/eval-hub/main/config/config.yaml

    In another terminal, activate the virtual environment (source .venv/bin/activate) and start the server:

    eval-hub-server --local --configdir ./config

    If MLflow is running, pass the tracking URI:

    MLFLOW_TRACKING_URI=http://localhost:5000 \
      eval-hub-server --local --configdir ./config

    Verify from another terminal:

    evalhub config set base_url http://localhost:8080
    evalhub health

    Expected output (the latency will vary):

    EvalHub service: healthy (2ms)
  8. Register the provider

    Create a provider definition file and register it with the CLI:

    cat > lighteval.yaml << 'EOF'
    name: lighteval
    description: LightEval adapter for evaluation framework
    runtime:
      local:
        command: "python main.py"
        env:
          - name: OCI_INSECURE
            value: "true"
    benchmarks:
      - id: gsm8k
        name: Grade-school math word problems
        description: |-
          Multi-step arithmetic word problems requiring 2-8 reasoning steps
          (8-shot, 1,319 examples).
        category: math
        metrics:
          - exact_match
          - acc
        num_few_shot: 8
        dataset_size: 1319
        tags:
          - math
          - reasoning
          - lighteval
        primary_score:
          metric: acc
          lower_is_better: false
        pass_criteria:
          threshold: 0.25
    EOF
    evalhub providers create --file lighteval.yaml

    See the Local Mode guide — Provider Configuration for details on the runtime.local section.

    Verify the provider was registered:

    evalhub providers list
    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ ID ┃ NAME ┃ DESCRIPTION ┃ BENCHMARKS ┃
    ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ 11031769-2ee5-4433-a129-8c6ad84d7185 │ lighteval │ LightEval adapter for evaluation framework │ 1 │
    └──────────────────────────────────────┴───────────┴────────────────────────────────────────────┴────────────┘

    Use the provider ID from the output above to query its available benchmarks (your ID will differ):

    evalhub providers describe <provider-id>
    Provider: lighteval
    ID: 11031769-2ee5-4433-a129-8c6ad84d7185
    Description: LightEval adapter for evaluation framework
    Benchmarks (1):
    ┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
    ┃ ID ┃ NAME ┃ CATEGORY ┃ METRICS ┃
    ┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
    │ gsm8k │ Grade-school math word problems │ math │ exact_match, acc │
    └───────┴─────────────────────────────────┴──────────┴──────────────────┘

    The benchmark ID gsm8k is defined in the provider YAML above. Use the provider ID and benchmark ID from this output as the --provider and --benchmark values when running evaluations.
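
For scripted setups, the provider ID can be pulled out of the table printed by evalhub providers list. The following is a minimal sketch that assumes the box-drawing layout shown above (data rows delimited by │, with ID and NAME as the first two columns); find_provider_id is a hypothetical helper, not part of the evalhub CLI or SDK. If your CLI version offers a machine-readable output format for this command, that would be a more robust source than parsing the table.

```python
from typing import Optional


def find_provider_id(table_text: str, provider_name: str) -> Optional[str]:
    """Return the ID of the provider called provider_name, or None if absent.

    Assumes the table layout shown in this tutorial: data rows start with
    the light box-drawing bar, and the first two cells are ID and NAME.
    """
    for raw_line in table_text.splitlines():
        line = raw_line.strip()
        # Header rows use the heavy bar and borders use other box characters,
        # so only data rows pass this check.
        if not line.startswith("│"):
            continue
        cells = [cell.strip() for cell in line.strip("│").split("│")]
        if len(cells) >= 2 and cells[1] == provider_name:
            return cells[0]
    return None
```

Feed it the captured output of evalhub providers list and pass the result as the --provider value. Long descriptions may wrap or be truncated in narrow terminals, so treat the parsed value as a convenience and confirm it with evalhub providers describe.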

With all services running, submit a job using the CLI:

evalhub eval run \
  --name my-eval-job \
  --model-url http://localhost:11434/v1 \
  --model-name llama3.2:3b-instruct-q4_K_M \
  --provider <provider-id> \
  --benchmark <benchmark-id> \
  --param num_examples=10 \
  --param num_few_shot=0 \
  --wait

Check results:

evalhub eval results <job-id>
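
If you launch evaluations from a script rather than by hand, building the argument list in code avoids shell quoting mistakes. The sketch below only assembles the flags that appear in this tutorial; build_eval_run_args is a hypothetical helper, and the optional experiment and OCI arguments correspond to the MLflow and OCI flags covered in the next section.

```python
from typing import Dict, List, Optional


def build_eval_run_args(
    name: str,
    model_url: str,
    model_name: str,
    provider_id: str,
    benchmark_id: str,
    params: Optional[Dict[str, object]] = None,
    experiment: Optional[str] = None,
    oci_host: Optional[str] = None,
    oci_repository: Optional[str] = None,
    wait: bool = True,
) -> List[str]:
    """Assemble an argument list for `evalhub eval run`."""
    args = [
        "evalhub", "eval", "run",
        "--name", name,
        "--model-url", model_url,
        "--model-name", model_name,
        "--provider", provider_id,
        "--benchmark", benchmark_id,
    ]
    # Each parameter becomes its own `--param key=value` pair.
    for key, value in (params or {}).items():
        args += ["--param", f"{key}={value}"]
    if experiment:
        args += ["--experiment", experiment]
    # Only add the OCI flags when both host and repository are given.
    if oci_host and oci_repository:
        args += ["--oci-host", oci_host, "--oci-repository", oci_repository]
    if wait:
        args.append("--wait")
    return args


# Example (not executed here):
#   import subprocess
#   subprocess.run(build_eval_run_args(...), check=True)
```

Passing the list to subprocess.run without shell=True means model names containing colons or other special characters need no escaping.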

With MLflow experiments and OCI artifact storage


MLflow experiment tracking and OCI artifact export are optional. If you started MLflow and the OCI registry in the setup steps, add the --experiment and --oci-* flags:

evalhub eval run \
  --name my-eval-job \
  --model-url http://localhost:11434/v1 \
  --model-name llama3.2:3b-instruct-q4_K_M \
  --provider <provider-id> \
  --benchmark <benchmark-id> \
  --param num_examples=10 \
  --param num_few_shot=0 \
  --experiment my-local-experiment \
  --oci-host localhost:5001 \
  --oci-repository local-eval-results \
  --wait

Retrieve the results as JSON:

evalhub eval results <job-id> --format json

Example output:

[
  {
    "id": "gsm8k",
    "provider_id": "11031769-2ee5-4433-a129-8c6ad84d7185",
    "benchmark_index": 0,
    "metrics": {
      "all.extractive_match": 0.7,
      "gsm8k|0.extractive_match": 0.7
    },
    "artifacts": {
      "evalhub.env_card": {
        "aggregate_results": {},
        "autograder_bias": {},
        "capture_completeness": 0.15,
        "confidence_intervals": {},
        "cpu_model": "arm",
        "custom": {},
        "k8s_pod_labels": {},
        "k8s_resource_limits": {},
        "key_packages": {
          "mlflow": "3.12.0",
          "torch": "2.12.0",
          "transformers": "5.9.0"
        },
        "os_info": "macOS-26.5-arm64-arm-64bit",
        "per_task_results": {},
        "python_version": "3.12.12",
        "scorer_ids": []
      },
      "oci_digest": "sha256:31f4da6f8aca49f50d97da92775eef691875701fae0bb81cf72c5b60fb7f9f20",
      "oci_reference": "localhost:5001/local-eval-results:evalhub-bdf6096c95aff921cc5067a2c679829db76541f07ee8d607211666129bbcfef7@sha256:31f4da6f8aca49f50d97da92775eef691875701fae0bb81cf72c5b60fb7f9f20"
    },
    "mlflow_run_id": "0a3505d2c1d24025b13f05d4ec96cb01",
    "logs_path": null
  }
]
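
To post-process this output in a script, a small parser can pull out the metrics, the MLflow run ID, and the parts of the OCI reference. This is a sketch that assumes the JSON shape shown in the example above and an OCI reference of the form host/repository:tag@digest; summarize_results and split_oci_reference are hypothetical helper names, not part of eval-hub-sdk.

```python
import json
from typing import Any, Dict, List, Optional


def split_oci_reference(reference: str) -> Dict[str, Optional[str]]:
    """Split `host/repository:tag@digest` into its components.

    Assumes the first path segment is the registry host, as in the
    example output; tag and digest are None when absent.
    """
    name, _, digest = reference.partition("@")
    registry, _, remainder = name.partition("/")
    repository, sep, tag = remainder.rpartition(":")
    if not sep:  # no tag present
        repository, tag = remainder, None
    return {
        "registry": registry,
        "repository": repository,
        "tag": tag,
        "digest": digest or None,
    }


def summarize_results(results_json: str) -> List[Dict[str, Any]]:
    """Reduce each benchmark entry to its metrics and storage pointers."""
    summaries = []
    for entry in json.loads(results_json):
        artifacts = entry.get("artifacts") or {}
        reference = artifacts.get("oci_reference")
        summaries.append(
            {
                "benchmark": entry["id"],
                "metrics": entry.get("metrics", {}),
                "mlflow_run_id": entry.get("mlflow_run_id"),
                "oci": split_oci_reference(reference) if reference else None,
            }
        )
    return summaries
```

Pass it the text printed by evalhub eval results <job-id> --format json. Runs submitted without the --experiment or --oci-* flags would presumably lack the corresponding fields, which is why the sketch reads them with .get() and tolerates missing values.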

The included evalhub-client.ipynb notebook demonstrates the full evaluation lifecycle using the eval-hub-sdk Python client — submitting jobs, polling status, and retrieving results programmatically.

After completing setup, including the optional MLflow and OCI registry steps, you have four services running on localhost:

Service          URL                      Purpose
eval-hub-server  http://localhost:8080    Evaluation orchestration
MLflow           http://localhost:5000    Experiment tracking dashboard
OCI registry     http://localhost:5001    Artifact storage
Ollama           http://localhost:11434   LLM inference
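
Before submitting a job from a script, it can help to wait until the supporting services respond. The sketch below polls three of the services in the table; it assumes the MLflow /health endpoint used earlier in this tutorial, the standard /v2/ base endpoint of the registry image, and the Ollama root URL. eval-hub-server is left out because its HTTP health path is not shown here; use evalhub health for it. The function and constant names are hypothetical.

```python
import time
import urllib.request
from typing import Callable, Dict, List

# Probe URLs are assumptions based on the ports used in this tutorial.
SERVICE_PROBES = {
    "MLflow": "http://localhost:5000/health",
    "OCI registry": "http://localhost:5001/v2/",
    "Ollama": "http://localhost:11434/",
}


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url yields a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_for_services(
    probes: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 10,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until every service responds; return names still unreachable."""
    pending = dict(probes)
    for attempt in range(attempts):
        # Keep only the services that still fail their probe.
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return sorted(pending)


# Example (not executed here):
#   unreachable = wait_for_services(SERVICE_PROBES)
#   if unreachable:
#       raise SystemExit(f"Not reachable: {', '.join(unreachable)}")
```

If you skipped the optional MLflow or OCI registry steps, remove those entries from the dictionary before polling. The probe and sleep arguments are injectable so the loop can be exercised without real network calls.
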
  • Browse the MLflow UI at http://localhost:5000 to see experiment metrics
  • Read the Local Mode guide for provider configuration details and troubleshooting
  • Try other adapters by registering additional providers with evalhub providers create --file <file>.yaml