Local Mode Tutorial
End-to-end walkthrough: run a LightEval evaluation locally with EvalHub, MLflow experiment tracking, and OCI artifact storage — no Kubernetes required.
This tutorial uses the LightEval adapter from eval-hub-contrib. For background on how local mode works, see the Local Mode guide.
Prerequisites
Install the following tools before starting:
- uv — Python package manager
- podman (or Docker) — for running the OCI registry
- ollama (or any OpenAI-compatible LLM server) — for serving a local model
Create the project directory and virtual environment

    mkdir local-lighteval && cd local-lighteval
    uv venv --python 3.12

Install dependencies
Install eval-hub-server and the evalhub CLI:

    uv pip install "eval-hub-sdk[server,cli]"

Download the LightEval adapter
Download the adapter driver and its requirements from eval-hub-contrib, then install them:

    curl -o main.py https://raw.githubusercontent.com/eval-hub/eval-hub-contrib/main/adapters/lighteval/main.py
    curl -o requirements.txt https://raw.githubusercontent.com/eval-hub/eval-hub-contrib/main/adapters/lighteval/requirements.txt
    uv pip install -r requirements.txt

Start MLflow (optional)
Only needed if you want experiment tracking. Skip this step if you just want to run evaluations.
Install MLflow and start the server:

    uv pip install "mlflow>=3.11"

Activate the virtual environment and start the server:

    source .venv/bin/activate
    mlflow server \
      --backend-store-uri sqlite:///mlflow.db \
      --host localhost \
      --port 5000

Verify from another terminal:

    curl http://localhost:5000/health

The MLflow UI dashboard is accessible at http://localhost:5000.

Start the OCI registry (optional)
Only needed if you want to persist evaluation artifacts to an OCI registry. Skip this step if you just want to run evaluations.
In another terminal, pull the registry image and start it on localhost:5001:

    podman pull docker.io/library/registry:2
    podman run -d -p 5001:5000 \
      --name eval-hub-oci-registry \
      -e REGISTRY_STORAGE_DELETE_ENABLED=true \
      docker.io/library/registry:2

Start the LLM server
Pull a model:

    ollama pull llama3.2:3b-instruct-q4_K_M

Verify it’s running:

    curl -s http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2:3b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 100
      }'

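The same check can be issued from Python with only the standard library. This is a minimal sketch, assuming the OpenAI-compatible endpoint and model name used in the curl command above; `build_chat_request` and `ask` are illustrative helper names, not part of Ollama or EvalHub.

```python
import json
import urllib.request

# Endpoint and model taken from the curl command in this step.
BASE_URL = "http://localhost:11434/v1"
MODEL = "llama3.2:3b-instruct-q4_K_M"


def build_chat_request(prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request mirroring the curl command (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text.

    Requires the LLM server to be running; assumes the usual
    OpenAI-style response shape (choices[0].message.content).
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the model server running, `ask("Why is the sky blue?")` should return a short completion. A connection error usually means the server is not listening on port 11434.
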
Start eval-hub-server
Download the template config.yaml from the eval-hub repository:

    mkdir -p config
    curl -o config/config.yaml https://raw.githubusercontent.com/eval-hub/eval-hub/main/config/config.yaml

In another terminal, start the server:

    eval-hub-server --local --configdir ./config

If MLflow is running, pass the tracking URI:

    MLFLOW_TRACKING_URI=http://localhost:5000 \
      eval-hub-server --local --configdir ./config

Verify from another terminal:

    evalhub config set base_url http://localhost:8080
    evalhub health

Example output:

    EvalHub service: healthy (2ms)

Register the provider
Create a provider definition file and register it with the CLI:

    cat > lighteval.yaml << 'EOF'
    name: lighteval
    description: LightEval adapter for evaluation framework
    runtime:
      local:
        command: "python main.py"
        env:
          - name: OCI_INSECURE
            value: "true"
    benchmarks:
      - id: gsm8k
        name: Grade-school math word problems
        description: |-
          Multi-step arithmetic word problems requiring 2-8 reasoning steps
          (8-shot, 1,319 examples).
        category: math
        metrics:
          - exact_match
          - acc
        num_few_shot: 8
        dataset_size: 1319
        tags:
          - math
          - reasoning
          - lighteval
        primary_score:
          metric: acc
          lower_is_better: false
        pass_criteria:
          threshold: 0.25
    EOF
    evalhub providers create --file lighteval.yaml

See the Provider Configuration section of the Local Mode guide for details on the runtime.local section.

Verify the provider was registered:

    evalhub providers list

Example output:

    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ ID                                   ┃ NAME      ┃ DESCRIPTION                                ┃ BENCHMARKS ┃
    ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ 11031769-2ee5-4433-a129-8c6ad84d7185 │ lighteval │ LightEval adapter for evaluation framework │ 1          │
    └──────────────────────────────────────┴───────────┴────────────────────────────────────────────┴────────────┘

Use the provider ID from the output above to query its available benchmarks (your ID will differ):

    evalhub providers describe <provider-id>

Example output:

    Provider: lighteval
    ID: 11031769-2ee5-4433-a129-8c6ad84d7185
    Description: LightEval adapter for evaluation framework
    Benchmarks (1):
    ┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
    ┃ ID    ┃ NAME                            ┃ CATEGORY ┃ METRICS          ┃
    ┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
    │ gsm8k │ Grade-school math word problems │ math     │ exact_match, acc │
    └───────┴─────────────────────────────────┴──────────┴──────────────────┘

The benchmark ID gsm8k is defined in the provider YAML above. Use the provider ID and benchmark ID from this output as the --provider and --benchmark values when running evaluations.
Run an evaluation
With all services running, submit a job using the CLI:

    evalhub eval run \
      --name my-eval-job \
      --model-url http://localhost:11434/v1 \
      --model-name llama3.2:3b-instruct-q4_K_M \
      --provider <provider-id> \
      --benchmark <benchmark-id> \
      --param num_examples=10 \
      --param num_few_shot=0 \
      --wait

Check results:

    evalhub eval results <job-id>

With MLflow experiments and OCI artifact storage
MLflow experiment tracking and OCI artifact export are optional. If you started MLflow and the OCI registry in the setup steps, add the --experiment and --oci-* flags:

    evalhub eval run \
      --name my-eval-job \
      --model-url http://localhost:11434/v1 \
      --model-name llama3.2:3b-instruct-q4_K_M \
      --provider <provider-id> \
      --benchmark <benchmark-id> \
      --param num_examples=10 \
      --param num_few_shot=0 \
      --experiment my-local-experiment \
      --oci-host localhost:5001 \
      --oci-repository local-eval-results \
      --wait
    evalhub eval results <job-id> --format json

Example output:

    [
      {
        "id": "gsm8k",
        "provider_id": "11031769-2ee5-4433-a129-8c6ad84d7185",
        "benchmark_index": 0,
        "metrics": {
          "all.extractive_match": 0.7,
          "gsm8k|0.extractive_match": 0.7
        },
        "artifacts": {
          "evalhub.env_card": {
            "aggregate_results": {},
            "autograder_bias": {},
            "capture_completeness": 0.15,
            "confidence_intervals": {},
            "cpu_model": "arm",
            "custom": {},
            "k8s_pod_labels": {},
            "k8s_resource_limits": {},
            "key_packages": {
              "mlflow": "3.12.0",
              "torch": "2.12.0",
              "transformers": "5.9.0"
            },
            "os_info": "macOS-26.5-arm64-arm-64bit",
            "per_task_results": {},
            "python_version": "3.12.12",
            "scorer_ids": []
          },
          "oci_digest": "sha256:31f4da6f8aca49f50d97da92775eef691875701fae0bb81cf72c5b60fb7f9f20",
          "oci_reference": "localhost:5001/local-eval-results:evalhub-bdf6096c95aff921cc5067a2c679829db76541f07ee8d607211666129bbcfef7@sha256:31f4da6f8aca49f50d97da92775eef691875701fae0bb81cf72c5b60fb7f9f20"
        },
        "mlflow_run_id": "0a3505d2c1d24025b13f05d4ec96cb01",
        "logs_path": null
      }
    ]

Using the Python SDK
The included evalhub-client.ipynb notebook demonstrates the full evaluation lifecycle using the eval-hub-sdk Python client: submitting jobs, polling status, and retrieving results programmatically.
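Independently of the notebook, the CLI itself can be scripted with the standard library. The sketch below is not the eval-hub-sdk client API. It only assembles the `evalhub eval run` flags shown in this tutorial and flattens the JSON printed by `evalhub eval results <job-id> --format json`. The helper names `build_eval_run_args` and `summarize_metrics` are hypothetical, and the JSON shape is assumed to match the example output in this tutorial.

```python
import json
import shlex


def build_eval_run_args(provider_id, benchmark_id, *, name="my-eval-job",
                        model_url="http://localhost:11434/v1",
                        model_name="llama3.2:3b-instruct-q4_K_M",
                        params=None, experiment=None,
                        oci_host=None, oci_repository=None):
    """Assemble an `evalhub eval run` argument list (hypothetical helper).

    Only flags that appear in this tutorial are used.
    """
    args = ["evalhub", "eval", "run",
            "--name", name,
            "--model-url", model_url,
            "--model-name", model_name,
            "--provider", provider_id,
            "--benchmark", benchmark_id]
    for key, value in (params or {}).items():
        args += ["--param", f"{key}={value}"]
    if experiment:  # optional MLflow experiment tracking
        args += ["--experiment", experiment]
    if oci_host and oci_repository:  # optional OCI artifact export
        args += ["--oci-host", oci_host, "--oci-repository", oci_repository]
    args.append("--wait")
    return args


def summarize_metrics(results_json):
    """Flatten results JSON into (benchmark id, metric name, value) rows."""
    rows = []
    for entry in json.loads(results_json):
        for metric, value in entry.get("metrics", {}).items():
            rows.append((entry["id"], metric, value))
    return rows


# Example with the provider ID from this tutorial (your ID will differ).
args = build_eval_run_args(
    "11031769-2ee5-4433-a129-8c6ad84d7185", "gsm8k",
    params={"num_examples": 10, "num_few_shot": 0},
)
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Building a list rather than a shell string avoids quoting problems with model names and URLs.
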
What’s running
After completing setup, you have up to four services running on localhost (MLflow and the OCI registry are optional):
| Service | URL | Purpose |
|---|---|---|
| eval-hub-server | http://localhost:8080 | Evaluation orchestration |
| MLflow | http://localhost:5000 | Experiment tracking dashboard |
| OCI registry | http://localhost:5001 | Artifact storage |
| Ollama | http://localhost:11434 | LLM inference |
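
A quick way to see which of these services are up is to probe their ports. The following is a minimal sketch using only the standard library. The host and port pairs come from the table above, and `is_listening` and `service_status` are illustrative names. A successful TCP connection only shows that something is listening on the port, not that the service is healthy.

```python
import socket

# Host/port pairs from the table above.
SERVICES = {
    "eval-hub-server": ("localhost", 8080),
    "MLflow": ("localhost", 5000),
    "OCI registry": ("localhost", 5001),
    "Ollama": ("localhost", 11434),
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}
```

Calling `service_status()` returns a dictionary keyed by service name. If you skipped the optional steps, MLflow and the OCI registry are expected to report `False`.
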
Next steps
- Browse the MLflow UI at http://localhost:5000 to see experiment metrics
- Read the Local Mode guide for provider configuration details and troubleshooting
- Try other adapters by registering additional providers with evalhub providers create --file <file>.yaml