# Configuration Examples
Complete examples for common GuideLLM benchmarking scenarios.
## Quick Test

Fast validation with minimal samples:

```json
{
  "id": "guidellm-quick-001",
  "benchmark_id": "performance_quick",
  "model": {
    "url": "http://127.0.0.1:8000/v1",
    "name": "Qwen/Qwen2.5-1.5B-Instruct"
  },
  "parameters": {
    "profile": "constant",
    "rate": 5,
    "max_seconds": 10,
    "max_requests": 20,
    "data": "prompt_tokens=50,output_tokens=20",
    "request_type": "chat_completions",
    "warmup": "0",
    "detect_saturation": false
  },
  "experiment_name": "qwen-quick-test",
  "tags": {
    "framework": "guidellm",
    "model_size": "small",
    "evaluation_type": "performance"
  }
}
```

**Duration:** ~10 seconds
**Use case:** Quick validation, CI/CD testing
## Performance Sweep

Automatically explore request rates:

```json
{
  "id": "guidellm-sweep-001",
  "benchmark_id": "performance_sweep",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "Qwen/Qwen2.5-1.5B-Instruct"
  },
  "parameters": {
    "profile": "sweep",
    "max_seconds": 30,
    "max_requests": 100,
    "data": "prompt_tokens=256,output_tokens=128",
    "warmup": "5%",
    "detect_saturation": true
  },
  "experiment_name": "qwen-capacity-discovery",
  "tags": {
    "test_type": "discovery",
    "purpose": "capacity_planning"
  }
}
```

**Duration:** ~30 seconds
**Use case:** Initial capacity discovery, finding a safe operating range
## Constant Load Test

Steady-state performance measurement:

```json
{
  "id": "guidellm-constant-001",
  "benchmark_id": "performance_constant",
  "model": {
    "url": "http://production.example.com/v1",
    "name": "llama-2-7b-chat"
  },
  "parameters": {
    "profile": "constant",
    "rate": 10,
    "max_seconds": 300,
    "max_requests": 3000,
    "data": "hf:abisee/cnn_dailymail",
    "data_args": { "name": "3.0.0" },
    "data_column_mapper": { "text_column": "article" },
    "data_samples": 500,
    "warmup": "5%",
    "cooldown": "5%"
  },
  "experiment_name": "llama-production-baseline",
  "tags": {
    "environment": "production",
    "test_type": "baseline"
  }
}
```

**Duration:** 5 minutes
**Use case:** Production baseline, SLA validation
## Throughput Test

Maximum capacity testing:

```json
{
  "id": "guidellm-throughput-001",
  "benchmark_id": "max_throughput",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "gpt-3.5-turbo"
  },
  "parameters": {
    "profile": "throughput",
    "max_seconds": 60,
    "max_requests": 5000,
    "data": "prompt_tokens=512,output_tokens=256",
    "warmup": "10%",
    "max_error_rate": 0.1
  },
  "experiment_name": "gpt35-max-capacity",
  "tags": {
    "test_type": "stress",
    "purpose": "capacity_limit"
  }
}
```

**Duration:** 1 minute + warmup
**Use case:** Stress testing, capacity planning
## Concurrent Users

Simulate parallel user load:

```json
{
  "id": "guidellm-concurrent-001",
  "benchmark_id": "concurrent_users",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "llama-2-13b"
  },
  "parameters": {
    "profile": "concurrent",
    "rate": 25,
    "max_requests": 500,
    "max_seconds": 120,
    "data": "prompt_tokens=512,output_tokens=256",
    "warmup": "5%"
  },
  "experiment_name": "llama-concurrent-load",
  "tags": {
    "test_type": "concurrency",
    "concurrent_users": 25
  }
}
```

**Duration:** 2 minutes
**Use case:** User simulation, concurrency testing
## Poisson Distribution

Realistic production traffic pattern:

```json
{
  "id": "guidellm-poisson-001",
  "benchmark_id": "poisson_traffic",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "mistral-7b"
  },
  "parameters": {
    "profile": "poisson",
    "rate": 15,
    "max_seconds": 180,
    "data": "hf:openai/gsm8k",
    "data_args": { "name": "main" },
    "data_column_mapper": { "text_column": "question" },
    "data_samples": 1000,
    "warmup": "5%",
    "detect_saturation": true
  },
  "experiment_name": "mistral-realistic-load",
  "tags": {
    "test_type": "realistic",
    "traffic_pattern": "poisson"
  }
}
```

**Duration:** 3 minutes
**Use case:** Production simulation, realistic load testing
## Synchronous Baseline

Single-user minimum latency:

```json
{
  "id": "guidellm-sync-001",
  "benchmark_id": "baseline_latency",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "qwen-1.5b"
  },
  "parameters": {
    "profile": "synchronous",
    "max_requests": 100,
    "data": "prompt_tokens=256,output_tokens=128"
  },
  "experiment_name": "qwen-baseline",
  "tags": {
    "test_type": "baseline",
    "load": "single_user"
  }
}
```

**Duration:** Variable (depends on model speed)
**Use case:** Baseline measurement, minimum latency testing
## Local Testing with Ollama

Configuration for local testing with Ollama:

```sh
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run qwen2.5:1.5b
```

```json
{
  "id": "local-test-001",
  "benchmark_id": "ollama_test",
  "model": {
    "url": "http://localhost:11434/v1",
    "name": "qwen2.5:1.5b"
  },
  "parameters": {
    "profile": "constant",
    "rate": 5,
    "max_seconds": 10,
    "max_requests": 20,
    "data": "prompt_tokens=50,output_tokens=20",
    "warmup": "0"
  }
}
```

```sh
# Set environment
export EVALHUB_MODE=local
export EVALHUB_JOB_SPEC_PATH=meta/job.json
export SERVICE_URL=http://localhost:8080

# Run adapter
python main.py
```

## Error Handling
Configuration with error thresholds:

```json
{
  "id": "guidellm-resilience-001",
  "benchmark_id": "error_tolerance",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "test-model"
  },
  "parameters": {
    "profile": "constant",
    "rate": 10,
    "max_seconds": 60,
    "max_errors": 10,
    "max_error_rate": 0.05,
    "data": "prompt_tokens=256,output_tokens=128"
  },
  "experiment_name": "error-tolerance-test",
  "tags": {
    "test_type": "resilience"
  }
}
```

Stops when:
- 10 total errors occur, OR
- Error rate exceeds 5%
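The two stop conditions above combine as a simple "whichever trips first" check. A minimal sketch of that logic (an illustrative helper, not GuideLLM's actual implementation; the function name and defaults are made up for this example):

```python
def should_stop(errors: int, total: int,
                max_errors: int = 10,
                max_error_rate: float = 0.05) -> bool:
    """Return True once either the absolute or the relative error limit is hit.

    Mirrors the max_errors / max_error_rate parameters in the config above.
    """
    if errors >= max_errors:          # absolute cap: 10 total errors
        return True
    if total > 0 and errors / total > max_error_rate:  # relative cap: > 5%
        return True
    return False
```

With the thresholds from the example config, `should_stop(6, 100)` trips on the rate limit (6% > 5%) even though only 6 errors have occurred.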
## Tips for Writing Configurations

### Choose the Right Profile

- **First test:** Use `sweep` to discover safe rates
- **Repeatable tests:** Use `constant` for consistent results
- **Stress tests:** Use `throughput` to find limits
- **Production simulation:** Use `poisson` for realistic traffic
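The recommendations above can be condensed into a small lookup. This is a hypothetical helper written for this guide (the scenario keys are invented; only the profile values come from GuideLLM):

```python
# Illustrative scenario-to-profile mapping, mirroring the guidance above.
PROFILE_FOR_SCENARIO = {
    "first_test": "sweep",        # discover safe rates
    "repeatable": "constant",     # consistent, comparable runs
    "stress": "throughput",       # find the capacity limit
    "production_sim": "poisson",  # realistic arrival pattern
}

def pick_profile(scenario: str) -> str:
    """Return the recommended profile, defaulting to 'sweep' for unknown cases."""
    return PROFILE_FOR_SCENARIO.get(scenario, "sweep")
```

Defaulting to `sweep` matches the "first test" advice: when in doubt, discover the safe rate range before committing to a fixed one.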
### Set Appropriate Limits

- Always specify at least one of `max_seconds`, `max_requests`
- Use `max_error_rate` to fail fast on issues
- Add `warmup` to exclude cold-start effects
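The first rule above is easy to enforce before submitting a job. A minimal pre-flight check, assuming the job-spec layout used throughout this page (this helper is hypothetical, not part of GuideLLM):

```python
def has_run_limit(parameters: dict) -> bool:
    """Return True if the 'parameters' block sets at least one hard stop."""
    return any(key in parameters for key in ("max_seconds", "max_requests"))
```

A spec like `{"profile": "constant", "rate": 10}` would fail this check and could otherwise run indefinitely.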
### Data Sources

- **Quick tests:** Use synthetic data (`prompt_tokens=N,output_tokens=M`)
- **Realistic tests:** Use HuggingFace datasets (`hf:dataset_name`)
- **Specific scenarios:** Use local files (`file:///path/to/data`)
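The synthetic-data form is a plain `key=value` list. A sketch of how such a string decomposes, purely to illustrate the format (the parser itself is made up for this example, not GuideLLM's code):

```python
def parse_synthetic_spec(spec: str) -> dict:
    """Split a 'key=value,key=value' synthetic-data string into integer fields."""
    return {
        key: int(value)
        for key, value in (pair.split("=") for pair in spec.split(","))
    }
```

So `"prompt_tokens=256,output_tokens=128"` describes requests with 256-token prompts and 128-token completions.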
### Warmup Best Practices

`warmup` accepts either a percentage string or a number of seconds; use one form or the other:

```json
{ "warmup": "5%" }
```

```json
{ "warmup": 10 }
```

**Recommended:** `"5%"` for most tests
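Assuming the percentage form is taken relative to `max_seconds` (an interpretation made for this illustration, so verify against the GuideLLM docs for your version), the two forms resolve like this:

```python
def warmup_seconds(warmup, max_seconds: float) -> float:
    """Resolve a warmup value: '5%' -> fraction of max_seconds, number -> seconds."""
    if isinstance(warmup, str) and warmup.endswith("%"):
        return max_seconds * float(warmup.rstrip("%")) / 100.0
    return float(warmup)
```

For the constant-load example above (`max_seconds: 300`), `"5%"` yields a 15-second warmup window.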
### Tags for Organisation

Use tags to categorise benchmarks:

```json
{
  "tags": {
    "environment": "production|staging|dev",
    "test_type": "baseline|stress|discovery",
    "model_size": "small|medium|large",
    "purpose": "capacity_planning|sla_validation|regression"
  }
}
```
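Consistent tags pay off when you slice results later. A hypothetical post-processing helper (not part of GuideLLM) that selects runs by tag, assuming each run dict carries a `"tags"` mapping like the specs on this page:

```python
def filter_by_tags(runs: list, **required) -> list:
    """Keep runs whose 'tags' dict contains every required key/value pair."""
    return [
        run for run in runs
        if all(run.get("tags", {}).get(k) == v for k, v in required.items())
    ]
```

For example, `filter_by_tags(runs, environment="production", test_type="baseline")` isolates production baselines for regression comparison.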