Checklist · April 11, 2026

LLM Observability Trace Review Checklist

A replay-first checklist for confirming one Elastic-backed AI workflow is observable enough to debug before you scale it.

Use this checklist on one real AI workflow before you build another dashboard.

The goal is simple: confirm that one request can be replayed end to end from telemetry alone.

1. Request identity

Every request has a stable request ID.
Session or conversation ID is available when the workflow spans multiple turns.
Tenant, workspace, or customer identifier is captured where relevant.
The request can be correlated across traces, logs, and metrics.

2. Prompt lineage

Prompt template version or prompt identifier is captured.
System instructions can be tied back to the request.
Prompt changes can be distinguished from model changes.
Sensitive prompt content is redacted or access-controlled where needed.

3. Context and retrieval

Retrieval source IDs or document IDs are attached to the request.
Empty or weak retrieval states are visible.
Operators can see whether context was missing, stale, or low-signal.
Retrieval failure is distinguishable from model failure.

4. Model call

Model name and provider are captured.
Request duration is visible.
Error state is visible when the model call fails.
Retry count or fallback behavior is visible when the path changes.
Token counts are captured for the request.

5. Tool chain

Tool calls are visible as child spans or linked events.
Tool order is reconstructable.
Tool success, failure, and timeout states are visible.
Operators can tell whether the workflow failed in retrieval, the model, or the downstream action.

6. Cost and outcome

Token or usage cost is attributable to the request path.
Final outcome classification is captured, such as answered, degraded, blocked, or failed.
Cost spikes can be connected to workflow, model, or retry behavior.
Latency spikes can be connected to a specific handoff inside the workflow.

7. Operator view

One engineer can click from the parent trace to AI metadata without losing the request path.
One engineer can move from AI metadata to tool spans without switching systems.
Dashboards highlight action-driving questions rather than decorative charts.
Alerts exist for provider failures, cost spikes, latency regressions, and tool failure rates.

8. Incident replay test

Run one replay test and confirm you can answer:

What the user asked
What context the system used
What the model returned
Which tools ran
How long each step took
What the request cost
Where the failure or degradation occurred

If you cannot answer all seven from telemetry alone, fix the instrumentation before expanding the workflow.

Signoff

Before rollout or expansion, record:

Trace-reviewed, replayable, and strong enough for the next production step.

Owner:
Second reviewer:
Date: