Search Benchmark Results Schema¶

Reference specification for the JSON output format of scripts/benchmark_search.py.

Results Structure¶

{
  "timestamp": "2025-01-15T10:30:45.123456",
  "config": {
    "backend_url": "http://localhost:8004",
    "iterations": 3,
    "warmup": 2,
    "total_queries": 16,
    "total_variants": 3
  },
  "targets": {
    "hybrid": 300,
    "keyword": 150,
    "vector": 200
  },
  "results": [...],
  "summary": {...}
}

Fields Reference¶

Top-Level Fields¶

Field	Type	Description
`timestamp`	string	ISO 8601 timestamp when benchmark completed
`config`	object	Benchmark configuration and metadata
`targets`	object	Performance targets in milliseconds for each search variant
`results`	array	Detailed results for each query/variant combination
`summary`	object	Aggregated statistics by search variant

Config Object¶

Field	Type	Description
`backend_url`	string	Backend API base URL used for testing
`iterations`	integer	Number of iterations run per query
`warmup`	integer	Number of warmup queries executed
`total_queries`	integer	Total unique queries tested
`total_variants`	integer	Number of search variants (hybrid, keyword, vector)

Targets Object¶

Performance targets in milliseconds (P95 latency):

Field	Type	Description
`hybrid`	integer	Target for hybrid search (thinking mode, α=0.5)
`keyword`	integer	Target for keyword search (rabbit mode, α=0.0)
`vector`	integer	Target for vector search (rabbit mode, α=1.0)

Results Array¶

Each result object contains:

{
  "query": "kredyty frankowe",
  "query_category": "financial",
  "query_language": "pl",
  "search_variant": "hybrid",
  "search_mode": "thinking",
  "search_alpha": 0.5,
  "description": "Hybrid search with AI query enhancement",
  "iterations": 3,
  "error_count": 0,
  "avg_results": 10.0,
  "latency_ms": {
    "p50": 120.5,
    "p95": 145.2,
    "p99": 160.1,
    "min": 115.3,
    "max": 160.1,
    "mean": 128.6
  },
  "raw_latencies": [115.3, 125.8, 160.1]
}

Result Object Fields¶

Field	Type	Description
`query`	string	Search query text
`query_category`	string	Query category (financial, criminal, etc.)
`query_language`	string	Query language code (pl, en)
`search_variant`	string	Search type (hybrid, keyword, vector)
`search_mode`	string	API mode (thinking, rabbit)
`search_alpha`	float	Hybrid search alpha parameter (0.0-1.0)
`description`	string	Human-readable description of search variant
`iterations`	integer	Successful iterations (may be less than config if errors)
`error_count`	integer	Number of failed requests
`avg_results`	float	Average number of documents returned
`latency_ms`	object	Latency percentiles and statistics
`raw_latencies`	array	All measured latencies in milliseconds

Latency Object¶

Field	Type	Description
`p50`	float	Median latency (50th percentile)
`p95`	float	95th percentile latency
`p99`	float	99th percentile latency
`min`	float	Minimum latency observed
`max`	float	Maximum latency observed
`mean`	float	Mean latency

Summary Object¶

Aggregated statistics by search variant:

{
  "hybrid": {
    "queries": [...],
    "total_iterations": 48,
    "total_errors": 0,
    "all_latencies": [120.5, 125.8, ...],
    "passed": 16,
    "failed": 0,
    "overall_percentiles": {
      "p50": 128.5,
      "p95": 165.2,
      "p99": 180.1,
      "min": 95.3,
      "max": 185.1,
      "mean": 135.6
    },
    "target_ms": 300,
    "passes_target": true
  }
}

Summary Variant Fields¶

Field	Type	Description
`queries`	array	Array of result objects for this variant
`total_iterations`	integer	Total successful iterations across all queries
`total_errors`	integer	Total errors across all queries
`all_latencies`	array	All latencies for this variant combined
`passed`	integer	Number of queries that passed the target
`failed`	integer	Number of queries that failed the target
`overall_percentiles`	object	Percentiles calculated from all latencies
`target_ms`	integer	Performance target for this variant
`passes_target`	boolean	Whether overall P95 meets the target

Usage Examples¶

Loading Results¶

import json

with open('data/benchmark_results.json', 'r') as f:
    results = json.load(f)

print(f"Benchmark completed at: {results['timestamp']}")
print(f"Total queries tested: {results['config']['total_queries']}")

# Check if benchmark passed
all_passed = all(
    variant_stats['passes_target']
    for variant_stats in results['summary'].values()
)
print(f"Overall result: {'PASS' if all_passed else 'FAIL'}")

Analyzing Latencies¶

# Get all hybrid search latencies
hybrid_latencies = results['summary']['hybrid']['all_latencies']

# Find slowest queries
slow_queries = [
    result for result in results['results']
    if result['latency_ms']['p95'] > 200  # queries over 200ms P95
]

# Performance by language
pl_queries = [r for r in results['results'] if r['query_language'] == 'pl']
en_queries = [r for r in results['results'] if r['query_language'] == 'en']

Trend Analysis¶

Compare multiple benchmark runs:

import glob
import json
from datetime import datetime

# Load all benchmark files
benchmark_files = glob.glob('data/benchmark_*.json')
benchmarks = []

for file in benchmark_files:
    with open(file, 'r') as f:
        data = json.load(f)
        benchmarks.append({
            'timestamp': datetime.fromisoformat(data['timestamp']),
            'hybrid_p95': data['summary']['hybrid']['overall_percentiles']['p95'],
            'keyword_p95': data['summary']['keyword']['overall_percentiles']['p95'],
            'vector_p95': data['summary']['vector']['overall_percentiles']['p95'],
        })

# Sort by timestamp and analyze trends
benchmarks.sort(key=lambda x: x['timestamp'])

Validation¶

The results schema is validated by the benchmark script. Invalid results indicate:

Missing required fields
Incorrect data types
Negative latencies or invalid percentiles
Inconsistent iteration counts

For schema validation in external tools, check:

All required top-level fields present
Latency percentiles satisfy: min ≤ p50 ≤ p95 ≤ p99 ≤ max
Error counts non-negative
Target passes/fails consistent with measured P95 vs targets