
Configuration Examples

Complete examples for common GuideLLM benchmarking scenarios.

Fast validation with minimal samples:

{
  "id": "guidellm-quick-001",
  "benchmark_id": "performance_quick",
  "model": {
    "url": "http://127.0.0.1:8000/v1",
    "name": "Qwen/Qwen2.5-1.5B-Instruct"
  },
  "parameters": {
    "profile": "constant",
    "rate": 5,
    "max_seconds": 10,
    "max_requests": 20,
    "data": "prompt_tokens=50,output_tokens=20",
    "request_type": "chat_completions",
    "warmup": "0",
    "detect_saturation": false
  },
  "experiment_name": "qwen-quick-test",
  "tags": {
    "framework": "guidellm",
    "model_size": "small",
    "evaluation_type": "performance"
  }
}

Duration: ~10 seconds. Use case: Quick validation, CI/CD testing.
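The shape of these configs is easy to sanity-check before submission. The snippet below is a hypothetical helper, not part of GuideLLM itself; the "required" keys are simply the ones every example on this page shares.

```python
# Minimal sanity check for a GuideLLM-style benchmark config.
# Illustrative only: the required keys below are inferred from the
# examples on this page, not from an official schema.
import json

REQUIRED_TOP = {"id", "benchmark_id", "model", "parameters"}
REQUIRED_MODEL = {"url", "name"}

def validate_config(raw: str) -> dict:
    cfg = json.loads(raw)
    missing = REQUIRED_TOP - cfg.keys()
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")
    if REQUIRED_MODEL - cfg["model"].keys():
        raise ValueError("model needs both 'url' and 'name'")
    return cfg

cfg = validate_config("""{
  "id": "guidellm-quick-001",
  "benchmark_id": "performance_quick",
  "model": {"url": "http://127.0.0.1:8000/v1", "name": "Qwen/Qwen2.5-1.5B-Instruct"},
  "parameters": {"profile": "constant", "rate": 5}
}""")
print(cfg["parameters"]["profile"])  # constant
```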


Automatically explore request rates:

{
  "id": "guidellm-sweep-001",
  "benchmark_id": "performance_sweep",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "Qwen/Qwen2.5-1.5B-Instruct"
  },
  "parameters": {
    "profile": "sweep",
    "max_seconds": 30,
    "max_requests": 100,
    "data": "prompt_tokens=256,output_tokens=128",
    "warmup": "5%",
    "detect_saturation": true
  },
  "experiment_name": "qwen-capacity-discovery",
  "tags": {
    "test_type": "discovery",
    "purpose": "capacity_planning"
  }
}

Duration: ~30 seconds per tested rate (a sweep benchmarks several rates in sequence). Use case: Initial capacity discovery, finding a safe operating range.


Steady-state performance measurement:

{
  "id": "guidellm-constant-001",
  "benchmark_id": "performance_constant",
  "model": {
    "url": "http://production.example.com/v1",
    "name": "llama-2-7b-chat"
  },
  "parameters": {
    "profile": "constant",
    "rate": 10,
    "max_seconds": 300,
    "max_requests": 3000,
    "data": "hf:abisee/cnn_dailymail",
    "data_args": {"name": "3.0.0"},
    "data_column_mapper": {"text_column": "article"},
    "data_samples": 500,
    "warmup": "5%",
    "cooldown": "5%"
  },
  "experiment_name": "llama-production-baseline",
  "tags": {
    "environment": "production",
    "test_type": "baseline"
  }
}

Duration: 5 minutes. Use case: Production baseline, SLA validation.
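Note how the limits in this example line up (simple arithmetic, not a GuideLLM feature): at 10 requests per second for 300 seconds, the run issues at most 3000 requests, which is exactly the max_requests cap.

```python
rate = 10          # requests per second ("rate" in the config)
max_seconds = 300  # run length ("max_seconds" in the config)

# Upper bound on requests the run can issue at this rate
expected_requests = rate * max_seconds
print(expected_requests)  # 3000, matching "max_requests" above
```

Sizing max_requests to rate × max_seconds means whichever limit trips first ends the run at roughly the same point.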


Maximum capacity testing:

{
  "id": "guidellm-throughput-001",
  "benchmark_id": "max_throughput",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "gpt-3.5-turbo"
  },
  "parameters": {
    "profile": "throughput",
    "max_seconds": 60,
    "max_requests": 5000,
    "data": "prompt_tokens=512,output_tokens=256",
    "warmup": "10%",
    "max_error_rate": 0.1
  },
  "experiment_name": "gpt35-max-capacity",
  "tags": {
    "test_type": "stress",
    "purpose": "capacity_limit"
  }
}

Duration: 1 minute plus warmup. Use case: Stress testing, capacity planning.
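With max_error_rate of 0.1, the run stops once more than 10% of requests fail. Assuming the rate is computed over all requests, at the 5000-request cap that works out to a rough error budget of:

```python
max_requests = 5000
max_error_rate = 0.1  # from the config above

# Rough error budget, assuming the rate is taken over all requests
error_budget = int(max_requests * max_error_rate)
print(error_budget)  # 500 failed requests before the run aborts
```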


Simulate parallel user load:

{
  "id": "guidellm-concurrent-001",
  "benchmark_id": "concurrent_users",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "llama-2-13b"
  },
  "parameters": {
    "profile": "concurrent",
    "rate": 25,
    "max_requests": 500,
    "max_seconds": 120,
    "data": "prompt_tokens=512,output_tokens=256",
    "warmup": "5%"
  },
  "experiment_name": "llama-concurrent-load",
  "tags": {
    "test_type": "concurrency",
    "concurrent_users": 25
  }
}

Duration: 2 minutes. Use case: User simulation, concurrency testing.


Realistic production traffic pattern:

{
  "id": "guidellm-poisson-001",
  "benchmark_id": "poisson_traffic",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "mistral-7b"
  },
  "parameters": {
    "profile": "poisson",
    "rate": 15,
    "max_seconds": 180,
    "data": "hf:openai/gsm8k",
    "data_args": {"name": "main"},
    "data_column_mapper": {"text_column": "question"},
    "data_samples": 1000,
    "warmup": "5%",
    "detect_saturation": true
  },
  "experiment_name": "mistral-realistic-load",
  "tags": {
    "test_type": "realistic",
    "traffic_pattern": "poisson"
  }
}

Duration: 3 minutes. Use case: Production simulation, realistic load testing.
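A Poisson arrival process sends requests with exponentially distributed gaps rather than at a fixed interval, which is why it mimics real traffic bursts. A small standard-library illustration (not GuideLLM code):

```python
import random

random.seed(0)
rate = 15  # mean requests per second (the "rate" parameter above)

# Poisson arrivals <=> exponentially distributed gaps between requests
gaps = [random.expovariate(rate) for _ in range(10_000)]
mean_gap = sum(gaps) / len(gaps)
print(round(mean_gap, 3))  # close to 1/15 ≈ 0.067 s between requests
```

Individual gaps vary widely around that mean, so the server sees short bursts and quiet stretches instead of a metronome-steady load.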


Single-user minimum latency:

{
  "id": "guidellm-sync-001",
  "benchmark_id": "baseline_latency",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "qwen-1.5b"
  },
  "parameters": {
    "profile": "synchronous",
    "max_requests": 100,
    "data": "prompt_tokens=256,output_tokens=128"
  },
  "experiment_name": "qwen-baseline",
  "tags": {
    "test_type": "baseline",
    "load": "single_user"
  }
}

Duration: Variable (depends on model speed). Use case: Baseline measurement, minimum latency testing.


For local testing with Ollama, first install it and start a model:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run qwen2.5:1.5b

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1, so point "model.url" there.

Configuration with error thresholds:

{
  "id": "guidellm-resilience-001",
  "benchmark_id": "error_tolerance",
  "model": {
    "url": "http://localhost:8000/v1",
    "name": "test-model"
  },
  "parameters": {
    "profile": "constant",
    "rate": 10,
    "max_seconds": 60,
    "max_errors": 10,
    "max_error_rate": 0.05,
    "data": "prompt_tokens=256,output_tokens=128"
  },
  "experiment_name": "error-tolerance-test",
  "tags": {
    "test_type": "resilience"
  }
}

Stops when:

  • 10 total errors occur, OR
  • Error rate exceeds 5%
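The two stop conditions combine with OR semantics; the logic can be sketched as a simple predicate (illustrative only, not GuideLLM's actual implementation):

```python
def should_stop(errors: int, total: int, max_errors: int = 10,
                max_error_rate: float = 0.05) -> bool:
    """Stop when the absolute error count OR the error rate
    exceeds its threshold, as described above."""
    if errors >= max_errors:
        return True
    return total > 0 and errors / total > max_error_rate

print(should_stop(errors=10, total=500))  # True: hit max_errors
print(should_stop(errors=3, total=40))    # True: 7.5% > 5%
print(should_stop(errors=2, total=100))   # False: 2% rate, under 10 errors
```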

Choosing a profile:

  • First test: Use sweep to discover safe rates
  • Repeatable tests: Use constant for consistent results
  • Stress tests: Use throughput to find limits
  • Production simulation: Use poisson for realistic traffic

Setting limits:

  • Always specify at least one of max_seconds or max_requests
  • Use max_error_rate to fail fast on issues
  • Add warmup to exclude cold-start effects

Choosing data:

  • Quick tests: Use synthetic data (prompt_tokens=N,output_tokens=M)
  • Realistic tests: Use HuggingFace datasets (hf:dataset_name)
  • Specific scenarios: Use local files (file:///path/to/data)
Warmup accepts either a percentage of the run or an absolute number of seconds; set one form, not both:

{ "warmup": "5%" }   // percentage-based
{ "warmup": 10 }     // time-based (seconds)

Recommended: "5%" for most tests
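A sketch of how the two warmup forms could be normalized against a run's duration (the parsing here is illustrative, not GuideLLM's own):

```python
def warmup_seconds(warmup, max_seconds: float) -> float:
    """Convert a warmup setting to seconds:
    a "5%"-style string is a fraction of max_seconds,
    a bare number is already seconds."""
    if isinstance(warmup, str) and warmup.endswith("%"):
        return max_seconds * float(warmup.rstrip("%")) / 100.0
    return float(warmup)

print(warmup_seconds("5%", 300))  # 15.0 — 5% of a 300 s run
print(warmup_seconds(10, 300))    # 10.0 — absolute seconds
```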

Use tags to categorise benchmarks:

{
  "tags": {
    "environment": "production|staging|dev",
    "test_type": "baseline|stress|discovery",
    "model_size": "small|medium|large",
    "purpose": "capacity_planning|sla_validation|regression"
  }
}
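Consistent tags make result sets easy to slice later. For example, a generic sketch of filtering saved configs by tag, assuming the configs are loaded as plain dicts (the config contents are hypothetical):

```python
configs = [
    {"id": "guidellm-quick-001",
     "tags": {"test_type": "baseline", "environment": "dev"}},
    {"id": "guidellm-throughput-001",
     "tags": {"test_type": "stress"}},
]

def by_tag(configs, **wanted):
    """Keep configs whose tags contain every wanted key/value pair."""
    return [c for c in configs
            if all(c.get("tags", {}).get(k) == v for k, v in wanted.items())]

print([c["id"] for c in by_tag(configs, test_type="stress")])
# ['guidellm-throughput-001']
```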