Performance Readiness¶
Use this guide when you need to validate or document performance-sensitive behavior in aion-api.
Current State¶
The repo does not yet ship committed Go microbenchmarks such as BenchmarkXxx.
What it does have today:
- RED-style dashboards for HTTP and GraphQL behavior
- smoke checks for record projection, SSE delivery, ingest flow, and outbox health
- operator diagnostics for backlog and pending-event age
- committed local load scenarios with reviewable thresholds for auth login, derived GraphQL reads, dashboard aggregation, and realtime SSE delivery
That means current performance documentation should be based on observed boundary behavior, not invented numbers.
Validated Local Baseline¶
Validated on 2026-03-28 against the local multi-repo development stack with:
./infrastructure/observability/scripts/setup-improvements.sh
make event-backbone-gate
Observed readiness signals from that run:
- API health endpoint returned healthy after the stack restart
- Prometheus, Grafana, Jaeger, and the OTel collector responded successfully
- Prometheus reported the otel-collector target as up
- Grafana provisioned Prometheus and Jaeger datasources and exposed the RED dashboard
- outbox diagnosis reported pending_count=0, failed_count=0, and oldest_pending_age=n/a
- record projection smoke, realtime SSE smoke, and projection pagination smoke all passed
- ingest smoke passed repeatedly with unique event ids after the smoke payload was made per-run unique
- the integrated dashboard records E2E passed at the end of make event-backbone-gate
Use that baseline as readiness evidence, not as a latency budget or throughput SLA.
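The outbox readiness signal above could be asserted in a script rather than read by eye. The following is a minimal sketch, assuming the diagnosis output can be reduced to key=value pairs like the ones quoted in the list; the actual output format of the outbox diagnosis command is not shown in this guide, and parseFields and outboxIdle are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// parseFields splits text such as "pending_count=0, failed_count=0" into a
// key/value map. Tokens without an "=" are ignored.
// Assumption: the diagnosis output uses key=value pairs separated by commas
// or whitespace.
func parseFields(s string) map[string]string {
	fields := make(map[string]string)
	isSep := func(r rune) bool { return r == ',' || unicode.IsSpace(r) }
	for _, tok := range strings.FieldsFunc(s, isSep) {
		key, value, ok := strings.Cut(tok, "=")
		if ok && key != "" {
			fields[key] = value
		}
	}
	return fields
}

// outboxIdle reports whether both pending_count and failed_count are zero.
// It returns an error when either counter is missing or not an integer.
func outboxIdle(fields map[string]string) (bool, error) {
	for _, key := range []string{"pending_count", "failed_count"} {
		raw, ok := fields[key]
		if !ok {
			return false, fmt.Errorf("missing field %q", key)
		}
		n, err := strconv.Atoi(raw)
		if err != nil {
			return false, fmt.Errorf("field %q is not an integer: %w", key, err)
		}
		if n != 0 {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	fields := parseFields("pending_count=0, failed_count=0, oldest_pending_age=n/a")
	idle, err := outboxIdle(fields)
	fmt.Println("idle:", idle, "err:", err)
}
```

A check like this only confirms readiness; it says nothing about latency or throughput.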
Versioned Local Load Scenarios¶
The repo now ships committed local load scenarios under hack/tools/load-test/.
Current scenarios:
| Scenario | Boundary | Command |
|---|---|---|
| auth-login | public auth HTTP path | make load-test-auth-login |
| record-projections-latest | authenticated GraphQL derived read path | make load-test-record-projections |
| dashboard-snapshot | authenticated GraphQL dashboard aggregation path | make load-test-dashboard-snapshot |
| realtime-record-created | authenticated SSE delivery path for projection-ready record events | make load-test-realtime-record-created |
Combined baseline:
make load-test-baseline
Thresholds live in hack/tools/load-test/thresholds.json.
Treat them as local readiness budgets for this workspace, not as production SLOs.
Recent clean local runs on 2026-03-28:
- auth-login: 60 measured requests, concurrency 6, p50 about 49-50ms, p95 about 55-128ms, error rate 0%
- record-projections-latest: 80 measured requests, concurrency 8, p50 about 6-10ms, p95 about 15-16ms, error rate 0%
- dashboard-snapshot: 60-80 measured requests, concurrency 6-8, p50 about 3-7ms, p95 about 6-26ms, error rate 0%
- realtime-record-created: 20 measured requests, concurrency 4, p50 about 10-12s, p95 about 15-16s, error rate 0%
Run the load baseline while the dev stack is otherwise idle. Hot-reload rebuilds or large doc-generation commands can perturb the auth scenario and produce transport noise that is not representative of the API itself.
The realtime scenario is intentionally different: it measures the full local async chain from record creation through outbox publication, Kafka consumption, projection readiness, and final SSE delivery. Its threshold therefore carries slight headroom above the latest clean local branch measurements rather than pretending the path behaves like a synchronous request.
What To Measure Today¶
| Boundary | Current signals |
|---|---|
| HTTP and GraphQL transport | throughput, error rate, p50/p75/p95 latency from Grafana |
| Record projection pipeline | successful projection materialization, pagination behavior, backlog age, pending count |
| Realtime SSE | delivery continuity, disconnect behavior, dropped-event logs, and end-to-end async delivery latency |
| Outbox publisher | pending count, oldest pending age, publish or reschedule health |
Recommended Validation Flow¶
- Boot the local stack.
make dev
./infrastructure/observability/scripts/setup-improvements.sh
- Confirm dashboards and collectors are healthy.
Read Observability Quickstart.
- Exercise the critical boundaries.
make outbox-diagnose
make record-projection-smoke
make realtime-record-smoke
make record-projection-page-smoke
make load-test-baseline
make event-backbone-gate
- Capture the metrics that matter.
At minimum, record:
- operation or boundary under test
- traffic source or smoke command
- throughput or request volume
- p95 latency when the RED dashboard provides it
- error rate
- any backlog, retry, or dropped-event signal
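One way to keep captured results consistent is to hold them in a small record type that mirrors the list above. This is a sketch only: the Observation type, its field names, and the Missing helper are hypothetical and do not exist in the repo.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Observation is a hypothetical record of one validation run.
type Observation struct {
	Boundary  string        // operation or boundary under test
	Source    string        // traffic source or smoke command
	Requests  int           // throughput or request volume
	P95       time.Duration // zero when the RED dashboard does not provide it
	ErrorRate float64       // fraction between 0 and 1
	Signals   []string      // backlog, retry, or dropped-event signals, if any
}

// Missing lists the required fields that are absent or invalid.
// P95 and Signals are optional, so they are never reported.
func (o Observation) Missing() []string {
	var missing []string
	if strings.TrimSpace(o.Boundary) == "" {
		missing = append(missing, "boundary")
	}
	if strings.TrimSpace(o.Source) == "" {
		missing = append(missing, "source")
	}
	if o.Requests <= 0 {
		missing = append(missing, "requests")
	}
	if o.ErrorRate < 0 || o.ErrorRate > 1 {
		missing = append(missing, "error rate")
	}
	return missing
}

func main() {
	obs := Observation{
		Boundary: "record-projections-latest",
		Source:   "make load-test-record-projections",
		Requests: 80,
	}
	fmt.Println("missing fields:", obs.Missing())
}
```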
When To Add Real Benchmarks¶
Do not document per-function numbers in a README until the repo has a committed benchmark that can reproduce them.
Good first candidates:
- record search and projection query paths
- dashboard snapshot assembly and insight generation
- outbox publish batching
- GraphQL resolver hot paths that fan into record-heavy reads
Use local benchmarks only when:
- the hot path is isolated enough to measure without full-stack noise
- the benchmark input can be kept deterministic
- the result changes an engineering decision, not just curiosity
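As an illustration of those criteria, the sketch below benchmarks an isolated function with a deterministic input. chunkIDs is a hypothetical stand-in for outbox publish batching, not code from the repo. In a real package the benchmark would normally live in a _test.go file and run under go test -bench; here testing.Benchmark is called from main so the example runs on its own.

```go
package main

import (
	"fmt"
	"testing"
)

// chunkIDs splits ids into batches of at most size elements.
// Hypothetical stand-in for an outbox batching hot path.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	batches := make([][]int, 0, (len(ids)+size-1)/size)
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

// BenchmarkChunkIDs measures chunkIDs with a fixed input, so repeated runs
// exercise exactly the same work.
func BenchmarkChunkIDs(b *testing.B) {
	ids := make([]int, 10000)
	for i := range ids {
		ids[i] = i
	}
	b.ReportAllocs()
	b.ResetTimer() // exclude input construction from the measurement
	for i := 0; i < b.N; i++ {
		if got := chunkIDs(ids, 100); len(got) != 100 {
			b.Fatalf("unexpected batch count: %d", len(got))
		}
	}
}

func main() {
	result := testing.Benchmark(BenchmarkChunkIDs)
	fmt.Println(result.String(), result.MemString())
}
```

The numbers such a run prints depend on the machine, so they belong in a README only alongside the committed command that reproduces them.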
Documentation Rule¶
Only put performance notes into a boundary README when they are:
- measurable with a committed command or dashboard
- local to that boundary
- likely to influence future changes
If the note is cross-cutting, keep it here and link from the boundary README instead of duplicating a protocol everywhere.