Examples
End-to-End RAG Evaluation
This example walks through evaluating a RAG pipeline, from dataset preparation to results retrieval, using the EvalHub SDK CLI.
Prerequisites
- EvalHub server running (local or on OpenShift)
- An OpenAI-compatible model endpoint serving a judge model (e.g. vLLM, Ollama)
- An embeddings endpoint (can be the same server if it supports /v1/embeddings)
- The eval-hub-sdk CLI installed: pip install eval-hub-sdk
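Before submitting a job, it can help to confirm that the judge model endpoint from the list above is reachable. The sketch below assumes the server exposes the standard OpenAI-compatible model listing at /v1/models (vLLM and Ollama both do); the helper names models_url and list_models are hypothetical and not part of the EvalHub SDK.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def list_models(base_url, timeout=5.0):
    """Return the model IDs the server reports.

    Raises urllib.error.URLError (an OSError) if the server is unreachable.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-compatible servers answer with {"object": "list", "data": [{"id": ...}, ...]}.
    return [entry["id"] for entry in payload.get("data", [])]


# Example call (needs a running server; the address is a placeholder):
#     list_models("http://vllm-service:8000")
```

If the judge model is missing from the returned list, the evaluation job is unlikely to succeed, so this is a cheap check to run first.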
Prepare the Dataset
- Create a JSONL dataset with your RAG pipeline outputs. Each line must contain the four RAGAS columns:

    {"user_input": "What is the capital of France?", "response": "The capital of France is Paris.", "retrieved_contexts": ["Paris is the capital and largest city of France. It is situated on the River Seine, in northern France."], "reference": "The capital of France is Paris."}
    {"user_input": "Who wrote Romeo and Juliet?", "response": "Romeo and Juliet was written by William Shakespeare.", "retrieved_contexts": ["Romeo and Juliet is a tragedy written by William Shakespeare early in his career about the romance between two Italian youths from feuding families."], "reference": "William Shakespeare wrote Romeo and Juliet."}
    {"user_input": "What is photosynthesis?", "response": "Photosynthesis is the process by which green plants convert sunlight into chemical energy.", "retrieved_contexts": ["Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that can be stored and later released to fuel the organism's activities."], "reference": "Photosynthesis is the process by which plants convert light energy into chemical energy stored in glucose."}

- Upload the dataset to an S3 bucket accessible from your cluster, or mount it as a volume at /data/dataset.jsonl.

- If your dataset uses different column names, prepare a column_map:

    {
      "column_map": {
        "question": "user_input",
        "answer": "response",
        "contexts": "retrieved_contexts",
        "ground_truth": "reference"
      }
    }
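The column requirements and the column_map renaming above can be checked locally before uploading. The following is a minimal sketch, not part of the EvalHub SDK: the helper names apply_column_map, validate_record, and to_jsonl are hypothetical, and it assumes the map reads "source column name" to "RAGAS column name", as in the example above.

```python
import json

# The four columns the RAGAS benchmarks expect, as listed in this guide.
RAGAS_COLUMNS = ("user_input", "response", "retrieved_contexts", "reference")


def apply_column_map(record, column_map):
    """Return a copy of record with source column names renamed to RAGAS names."""
    return {column_map.get(key, key): value for key, value in record.items()}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing column: {name}" for name in RAGAS_COLUMNS if name not in record]
    contexts = record.get("retrieved_contexts")
    if contexts is not None and not (
        isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)
    ):
        problems.append("retrieved_contexts must be a list of strings")
    return problems


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


column_map = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}

# A record using the alternative column names.
raw = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."],
    "ground_truth": "The capital of France is Paris.",
}

mapped = apply_column_map(raw, column_map)
problems = validate_record(mapped)
jsonl_text = to_jsonl([mapped])
```

Running validate_record over every record before upload catches missing columns early, which is cheaper than discovering them after a job has been scheduled.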
Submit a RAGAS Evaluation Job
CLI:

    # Submit a RAGAS evaluation job
    evalhub job submit \
      --provider ragas \
      --benchmark ragas_rag_default \
      --model-name "Qwen/Qwen2.5-1.5B-Instruct" \
      --model-url "http://vllm-service:8000" \
      --param embedding_model=all-MiniLM-L6-v2 \
      --param embedding_url=http://embeddings-service:8001 \
      --param max_tokens=512 \
      --param temperature=0.1 \
      --test-data-s3 s3://my-bucket/ragas-dataset/dataset.jsonl

Python SDK:

    from evalhub import EvalHubClient

    client = EvalHubClient(base_url="http://localhost:8080")

    job = client.submit_job(
        provider_id="ragas",
        benchmark_id="ragas_rag_default",
        model_name="Qwen/Qwen2.5-1.5B-Instruct",
        model_url="http://vllm-service:8000",
        parameters={
            "embedding_model": "all-MiniLM-L6-v2",
            "embedding_url": "http://embeddings-service:8001",
            "max_tokens": 512,
            "temperature": 0.1,
        },
        test_data_ref={"s3": "s3://my-bucket/ragas-dataset/dataset.jsonl"},
    )

    print(f"Job submitted: {job.id}")

Retrieve Results
    # Check job status
    evalhub job status <job-id>

    # Get results
    evalhub job results <job-id>

Results include per-metric scores:

    {
      "id": "ragas-rag-eval-001",
      "benchmark_id": "ragas_rag_default",
      "overall_score": 0.87,
      "results": [
        {"metric_name": "faithfulness", "metric_value": 0.92, "num_samples": 3},
        {"metric_name": "answer_relevancy", "metric_value": 0.88, "num_samples": 3},
        {"metric_name": "context_precision", "metric_value": 0.85, "num_samples": 3},
        {"metric_name": "context_recall", "metric_value": 0.83, "num_samples": 3}
      ],
      "num_examples_evaluated": 3,
      "duration_seconds": 45.2
    }

Running the Full Metric Suite
To evaluate with all 11 metrics, use the ragas_rag_full benchmark:

    evalhub job submit \
      --provider ragas \
      --benchmark ragas_rag_full \
      --model-name "Qwen/Qwen2.5-1.5B-Instruct" \
      --model-url "http://vllm-service:8000" \
      --param embedding_model=all-MiniLM-L6-v2 \
      --param embedding_url=http://embeddings-service:8001 \
      --test-data-s3 s3://my-bucket/ragas-dataset/dataset.jsonl

Evaluation Collection with RAGAS
EvalHub Evaluation Collections let you run benchmarks from multiple providers in a single evaluation. This example combines RAGAS with LM Evaluation Harness benchmarks to evaluate both the quality of the RAG pipeline and the capabilities of the underlying model.
Define a Collection
    name: healthcare_rag_eval_v1
    # Quoted because the value contains a colon, which unquoted YAML would reject.
    description: "Healthcare RAG pipeline evaluation: RAG quality + medical knowledge"
    providers:
      - provider_id: ragas
        benchmarks:
          - benchmark_id: ragas_rag_default
            weight: 0.6
            pass_criteria:
              threshold: 0.7
      - provider_id: lm_evaluation_harness
        benchmarks:
          - benchmark_id: medqa
            weight: 0.2
            pass_criteria:
              threshold: 0.5
          - benchmark_id: pubmedqa
            weight: 0.2
            pass_criteria:
              threshold: 0.5

Run the Collection
    evalhub collection run healthcare_rag_eval_v1 \
      --model-name "Qwen/Qwen2.5-1.5B-Instruct" \
      --model-url "http://vllm-service:8000" \
      --param ragas.embedding_model=all-MiniLM-L6-v2 \
      --param ragas.embedding_url=http://embeddings-service:8001 \
      --test-data-s3 s3://my-bucket/ragas-dataset/dataset.jsonl

The collection produces a unified report with weighted scores from all providers, making it possible to set quality gates that cover both RAG-specific and general model capabilities.
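To make the weights and thresholds in the collection definition concrete, the sketch below shows one plausible way they could combine into a single score and a pass/fail gate. It is an illustration under stated assumptions, not EvalHub's actual aggregation: it assumes the overall score is a weight-normalized mean and that a benchmark passes when its score is at least its threshold. The BenchmarkResult class and weighted_score function are hypothetical, the RAGAS score is taken from the sample result earlier in this guide, and the medqa and pubmedqa scores are made up.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    benchmark_id: str
    score: float      # benchmark score in [0, 1]
    weight: float     # weight from the collection definition
    threshold: float  # pass_criteria.threshold from the collection definition

    @property
    def passed(self):
        # Assumption: a score equal to the threshold counts as a pass.
        return self.score >= self.threshold


def weighted_score(results):
    """Weight-normalized mean of the benchmark scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(r.score * r.weight for r in results) / total_weight


results = [
    BenchmarkResult("ragas_rag_default", score=0.87, weight=0.6, threshold=0.7),
    BenchmarkResult("medqa", score=0.55, weight=0.2, threshold=0.5),     # made-up score
    BenchmarkResult("pubmedqa", score=0.60, weight=0.2, threshold=0.5),  # made-up score
]

overall = weighted_score(results)
gate_passed = all(r.passed for r in results)
print(f"overall={overall:.3f} gate_passed={gate_passed}")
```

With a 0.6 weight on RAGAS, the RAG quality score dominates the overall number, while the per-benchmark thresholds still prevent a weak medical-knowledge result from being hidden by a strong RAG score.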
Local Development Example
For local testing without Kubernetes:

    # Start a model server (e.g. Ollama)
    ollama serve &
    ollama pull qwen2.5:1.5b

    # Prepare a small test dataset
    cat > /tmp/dataset.jsonl << 'EOF'
    {"user_input": "What is machine learning?", "response": "Machine learning is a subset of AI that enables systems to learn from data.", "retrieved_contexts": ["Machine learning is a branch of artificial intelligence that focuses on building systems that learn from data."], "reference": "Machine learning is a subset of artificial intelligence focused on learning from data."}
    EOF

    # Run the adapter locally
    cd adapters/ragas
    export EVALHUB_MODE=local
    export EVALHUB_JOB_SPEC_PATH=meta/job.json
    python main.py