Benchmarks
Benchmarks provide standardized performance measurements for AI models, functions, workflows, and system components. They help establish baselines, track improvements, and compare different implementations.
Overview
The Benchmarks collection offers:
- Standardized performance testing
- Historical performance tracking
- Comparative analysis
- Resource utilization metrics
- Scalability assessments
Benchmark Structure
Each benchmark record contains detailed information about the test and its results:
```typescript
interface Benchmark {
  id: string
  name: string
  description?: string
  type: 'model' | 'function' | 'workflow' | 'system' | 'integration'
  target: {
    id: string
    type: string
    version?: string
  }
  parameters: Record<string, any>
  metrics: {
    latency?: {
      p50: number
      p90: number
      p95: number
      p99: number
      min: number
      max: number
      mean: number
    }
    throughput?: number
    errorRate?: number
    resourceUsage?: {
      cpu?: number
      memory?: number
      gpu?: number
      network?: number
      storage?: number
    }
    customMetrics?: Record<string, number>
  }
  environment: {
    hardware?: string
    os?: string
    runtime?: string
    dependencies?: Record<string, string>
  }
  timestamp: Date
  duration: number
  status: 'completed' | 'failed' | 'aborted'
  baseline?: {
    id: string
    metrics: Record<string, number>
    comparison: Record<string, number>
  }
  metadata?: Record<string, any>
}
```
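For illustration, a completed record for a function benchmark might look like this (all values are hypothetical):

```typescript
// Hypothetical benchmark record conforming to the interface above
const example: Benchmark = {
  id: 'benchmark-123',
  name: 'text-classification throughput',
  type: 'function',
  target: { id: 'function-123', type: 'text-classification', version: '1.2.0' },
  parameters: { iterations: 1000, concurrency: 10 },
  metrics: {
    latency: { p50: 42, p90: 80, p95: 95, p99: 140, min: 18, max: 210, mean: 48 },
    throughput: 230, // requests per second
    errorRate: 0.002, // fraction of failed iterations
  },
  environment: { runtime: 'node-20' },
  timestamp: new Date(),
  duration: 4350, // total run time in milliseconds
  status: 'completed',
}
```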
Benchmark Types
Benchmarks are categorized into several types:
Model Benchmarks
Evaluate AI model performance:
- Accuracy
- Inference latency
- Token throughput
- Memory usage
Function Benchmarks
Measure function performance:
- Execution time
- Resource utilization
- Scalability
- Error rates
Workflow Benchmarks
Assess workflow efficiency:
- End-to-end latency
- Step timing
- Success rates
- Resource consumption
System Benchmarks
Evaluate overall system performance:
- API response times
- Concurrent request handling
- Database operations
- Queue processing
Working with Benchmarks
Running Benchmarks
Benchmarks can be executed on demand or on a schedule:
```typescript
// Example: Run a benchmark on a specific function
const result = await Benchmarks.run({
  type: 'function',
  target: {
    id: 'function-123',
    type: 'text-classification',
  },
  parameters: {
    iterations: 1000,
    concurrency: 10,
    inputSize: 'medium',
  },
})
```
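Assuming the call resolves to a Benchmark record as described above, key metrics can be inspected directly:

```typescript
// Inspect the results; latency is optional, so guard before reading it
if (result.status === 'completed' && result.metrics.latency) {
  console.log(`p95 latency: ${result.metrics.latency.p95} ms`)
}
console.log(`throughput: ${result.metrics.throughput ?? 'n/a'} req/s`)
console.log(`error rate: ${result.metrics.errorRate ?? 'n/a'}`)
```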
Comparing Benchmarks
Compare performance across different versions or implementations:
```typescript
// Example: Compare benchmark results
const comparison = await Benchmarks.compare({
  baseline: 'benchmark-123',
  current: 'benchmark-456',
  metrics: ['latency.p95', 'throughput', 'errorRate'],
})
```
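The return shape of `compare` isn't specified here; the sketch below assumes it resolves to a `Record<string, number>` of relative deltas versus the baseline, mirroring the `baseline.comparison` field in the Benchmark interface:

```typescript
// Assumption: each requested metric path maps to a relative delta,
// e.g. 0.08 means 8% higher than the baseline
for (const [metric, delta] of Object.entries(comparison)) {
  if (Math.abs(delta) > 0.05) {
    console.warn(`${metric} changed ${(delta * 100).toFixed(1)}% vs baseline`)
  }
}
```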
Tracking Performance Over Time
Monitor performance trends:
```typescript
// Example: Get historical performance data
const history = await Benchmarks.getHistory({
  target: {
    id: 'model-789',
    type: 'model',
  },
  metric: 'latency.p95',
  timeRange: {
    start: new Date('2023-01-01'),
    end: new Date('2023-12-31'),
  },
  interval: 'month',
})
```
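The shape of the returned history isn't documented here; as a sketch, assume an array of `{ timestamp, value }` data points:

```typescript
// Assumption: getHistory resolves to Array<{ timestamp: Date; value: number }>
for (const point of history) {
  console.log(`${point.timestamp.toISOString()}: p95 = ${point.value} ms`)
}

// Rough trend check across the requested range
const first = history[0]
const last = history[history.length - 1]
const change = ((last.value - first.value) / first.value) * 100
console.log(`p95 latency changed ${change.toFixed(1)}% over the year`)
```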
Best Practices
- Consistent Environment: Run benchmarks in consistent environments for valid comparisons (see the sketch after this list)
- Comprehensive Metrics: Measure multiple aspects of performance
- Statistical Significance: Ensure sufficient iterations for reliable results
- Realistic Workloads: Use representative data and workloads
- Regular Execution: Run benchmarks regularly to track performance over time
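As a sketch of the consistency and statistical-significance practices, a thin wrapper can pin the load profile and iteration count so runs remain comparable over time (the wrapper name and defaults are hypothetical):

```typescript
// Hypothetical helper: every run uses the same fixed load profile,
// so results are comparable across versions and over time
async function runStandardBenchmark(targetId: string, targetType: string) {
  return Benchmarks.run({
    type: 'function',
    target: { id: targetId, type: targetType },
    parameters: {
      iterations: 1000,    // enough samples for stable percentiles
      concurrency: 10,     // fixed load level
      inputSize: 'medium', // representative workload
    },
  })
}
```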
Integration with Other Observability Tools
Benchmarks integrate with other observability components:
- Metrics: Benchmark results feed into performance metrics
- Alerts: Performance regressions can trigger alerts
- Dashboards: Visualize benchmark trends and comparisons
- CI/CD: Integrate benchmarks into continuous integration pipelines, as sketched below
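For example, a CI step could benchmark the candidate build and fail the pipeline on a significant regression. The 10% threshold and the assumed delta-per-metric return shape of `compare` are illustrative:

```typescript
// Illustrative CI gate: fail the build if p95 latency regresses > 10%
const run = await Benchmarks.run({
  type: 'function',
  target: { id: 'function-123', type: 'text-classification' },
  parameters: { iterations: 1000, concurrency: 10 },
})

const diff = await Benchmarks.compare({
  baseline: 'benchmark-123', // last known-good run
  current: run.id,
  metrics: ['latency.p95'],
})

// Assumption: compare returns relative deltas keyed by metric path
if (diff['latency.p95'] > 0.1) {
  console.error('p95 latency regressed more than 10%; failing the build')
  process.exit(1)
}
```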
API Reference
List Benchmarks
GET /api/benchmarks
Query parameters:
- `type`: Filter by benchmark type
- `target.id`: Filter by target ID
- `from`: Start timestamp
- `to`: End timestamp
- `limit`: Maximum number of benchmarks to return
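For instance, listing recent function benchmarks over HTTP (the base URL is a placeholder for your deployment):

```typescript
// Placeholder host; substitute your deployment's base URL
const res = await fetch('https://example.com/api/benchmarks?type=function&limit=20')
const benchmarks = await res.json()
```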
Get Benchmark by ID
GET /api/benchmarks/:id
Run Benchmark
POST /api/benchmarks/run
Request body:
```json
{
  "type": "function",
  "target": {
    "id": "function-123",
    "type": "text-classification"
  },
  "parameters": {
    "iterations": 1000,
    "concurrency": 10
  }
}
```
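Over HTTP, a hypothetical client call might look like this; the host is a placeholder, and we assume the endpoint responds with the created benchmark record:

```typescript
const res = await fetch('https://example.com/api/benchmarks/run', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'function',
    target: { id: 'function-123', type: 'text-classification' },
    parameters: { iterations: 1000, concurrency: 10 },
  }),
})
const benchmark = await res.json() // assumed: the created Benchmark record
```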
Compare Benchmarks
POST /api/benchmarks/compare
Request body:
```json
{
  "baseline": "benchmark-123",
  "current": "benchmark-456",
  "metrics": ["latency.p95", "throughput"]
}
```