## What to expect
The Connected rail, three scenarios end-to-end, the user detail page.
obs-unified is designed around one promise: every signal is reachable from every other in ≤2 clicks. This page walks through what the dashboard actually surfaces once instrumentation is in place.
## The Connected rail
Every detail page in the dashboard mounts a right-side rail with four sections:
```
┌─ Connected — span ─┐
│                    │
│ Up:                │
│   Trace            │
│   Parent trace     │
│                    │
│ Across:            │
│   Other spans      │
│   Logs in trace    │
│   AI calls         │
│                    │
│ Down:              │
│   Profiles         │
│                    │
│ Related:           │
│   Click that       │
│   caused this      │
│   trace            │
│   → click_5        │
│                    │
└────────────────────┘
```

- Up — the parent entity (trace ← span, session ← usage event, etc.)
- Across — sibling signals sharing the same identity key (other spans in the same trace, logs from the same session)
- Down — derived data (pprof profile for a trace, off-CPU profile for a span)
- Related — non-identity-based neighbors (the click that caused this trace, alerts firing on this service)
When a section has no neighbors, the rail renders an informative-absence message explaining why — never a silent empty section. The platform's contract is that "no data" should always tell you what's missing and how to populate it.
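The four directions plus the informative-absence contract can be pictured as a small discriminated type. The sketch below is illustrative, not the dashboard's real API — `RailSection`, `Neighbor`, and `render` are hypothetical names:

```typescript
// Hypothetical sketch of the Connected rail's section model.
// A section either carries neighbors or an informative-absence
// explanation -- never a silent empty array.
type Direction = "up" | "across" | "down" | "related";

interface Neighbor {
  kind: string;  // e.g. "trace", "span", "profile", "click"
  id: string;
  label: string; // what the rail renders, e.g. "Parent trace"
}

type RailSection =
  | { direction: Direction; neighbors: Neighbor[] } // at least one neighbor
  | { direction: Direction; absence: string };      // why it's empty + how to populate

function render(section: RailSection): string {
  if ("neighbors" in section) {
    return section.neighbors.map((n) => n.label).join(", ");
  }
  // The empty case is a distinct variant carrying an explanation.
  return `— ${section.absence}`;
}
```

The design point is that "empty" is its own variant with a mandatory message, so a section can never render as a blank list.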
## Scenario A — alert → trace → flame graph → cohort → session → replay
The headline product test. From a paged alert:
| Step | What you see | What you click | RFCs |
|---|---|---|---|
| 1 | Alert detail with bound Analysis narrative + exemplar traces | Slowest exemplar trace | 0002, 0006 |
| 2 | Trace waterfall, self-time bars, ⚠ UNINSTRUMENTED + 🔥 PROFILES badges | 🔥 badge on the slow span | 0005, 0006, 0007 |
| 3 | Flame graph filtered to this trace's samples (server-side filter, smaller blob) | "Other traces sampled in this profile (243)" | 0007 |
| 4 | Cohort: all traces touched by this profile, with user attribution | A user from the cohort | 0007, 0006 |
| 5 | Session timeline: user's page views, clicks, traces side-by-side | An rrweb event | 0004, 0006 |
| 6 | Replay scrubbed to the click + Connected rail: "Trace caused by this click" | Closes the loop back to step 2's trace | 0004, 0006 |
Six clicks across the entire platform, and the claim is that every neighbor at every step is on the rail.
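Step 3's server-side filter is worth sketching: the flame graph fetches only the samples labeled with the current trace (hence the smaller blob), and the count of other trace IDs in the same profile powers the "Other traces sampled in this profile (N)" pivot. A minimal TypeScript sketch under assumed names — `Sample`, `filterByTrace`, and `otherTraceCount` are hypothetical, not the real code:

```typescript
// Hypothetical profile-sample shape: a stack, a sample value,
// and optional labels including the trace that was executing.
interface Sample {
  stack: string[];
  value: number;
  labels: { traceId?: string };
}

// Server-side: keep only samples attributed to the requested trace,
// so the blob shipped to the flame graph stays small.
function filterByTrace(samples: Sample[], traceId: string): Sample[] {
  return samples.filter((s) => s.labels.traceId === traceId);
}

// Powers the "Other traces sampled in this profile (N)" link.
function otherTraceCount(samples: Sample[], traceId: string): number {
  const others = new Set(
    samples.map((s) => s.labels.traceId).filter((t) => t && t !== traceId)
  );
  return others.size;
}
```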
## Scenario B — AI cost spike → user → session → trace
A different entry point exercising the same identity skeleton:
- AI dashboard shows a cost spike (the `SPANS OVER TIME` chart peaks). The Sessions view ranks the heavy spender at the top by cost.
- Click the `👤 user-id` chip on the heavy spender's row → user detail page.
- User detail page shows the user's `Identity` card + a Connected rail with "Latest session", "Recent traces", "Recent AI calls". The rail surfaces the count-collapsed link for a session with N traces / M AI calls.
- Click "Latest session" → Replay tab scoped to that session, showing the session's interactions linked to their traces.
- Click an interaction → trace waterfall for the trace that click caused. The Connected rail's "Click that caused this trace" closes the loop back to the originating click.
The seed (`pnpm seed`) plants a "Heavy Spender (seed)" user with 8–9 high-cost `claude-3-5-haiku` calls, so this walkthrough is reproducible without generating real AI traffic.
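A hypothetical sketch of the shape the seed plants — field names here are illustrative and only the counts and model match the seed described above:

```typescript
// Illustrative shape for the seeded Scenario B data: one user,
// 8-9 high-cost claude-3-5-haiku calls in a single session.
interface SeededAiCall {
  userId: string;
  sessionId: string;
  model: string;
  costUsd: number;
}

function seedHeavySpender(): SeededAiCall[] {
  const n = 8 + Math.floor(Math.random() * 2); // 8 or 9 calls
  return Array.from({ length: n }, (_, i) => ({
    userId: "heavy-spender-seed",   // hypothetical ID
    sessionId: "session-seed-1",    // hypothetical ID
    model: "claude-3-5-haiku",
    costUsd: 0.5 + i * 0.1,         // high cost relative to other seeded traffic
  }));
}
```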
## Scenario C — futex contention via off-CPU flame graph
Validates the kernel-level layer:
- Trace shows an unexplained pause inside a span (no child spans, on-CPU profile shows little activity).
- Rail's "Down → 🔥 off-CPU profile" leads to an icicle flame graph that surfaces `futex_wait_queue ↑ pthread_mutex_lock ↑ inventory_pool::checkout` taking 84% of off-CPU time.
- Root cause: a single pool-wide mutex serializing every checkout.
This scenario currently runs only against the docker-compose demo with Beyla feeding pprof. The dashboard code paths are live; the synthetic seed doesn't generate pprof blobs.
## Per-tab walkthrough
| Tab | What's there | Key rail pivots |
|---|---|---|
| Health | Tier-0 analysis tiles (error top offenders, latency outliers, log anomaly summary) with optional LLM narrative | Click a tile → Investigations page with the analysis detail |
| Timeline | Per-session lane of usage / span / log events, grouped by interaction_id | Click an event → trace or replay |
| Service Map | Service-to-service edges with SDK / eBPF source filter | Click an edge → traces between those services |
| Logs | Histogram + by-service / by-severity breakdown, filterable | Click a log → log detail with rail surfacing parent trace |
| Investigations | List of analyses + per-analysis detail page with narrative + evidence + connected rail | Rail's "Cited traces" → trace detail |
| Traces | Trace list with inline waterfall expansion, self-time visualization, ⚠ + 🔥 badges, span detail drawer | Click a span row → rail with "Click that caused this trace" |
| Issues | Trace-level issue grouping by error fingerprint | Click an issue → trace |
| AI Calls | Two views — Spans (typed LLM/TOOL/RETRIEVER spans) and Sessions (multi-turn conversation rendering with cost + tokens). User chips are clickable. | Click 👤 user-id → user detail page |
| Replays | Session list + rrweb player + per-session interactions panel | Click an interaction → trace it caused |
| Alerts | Alert rules + recent firings + bound analyses | Click an alert → bound Analysis → exemplar traces |
| Usage | Page views, interactions, top paths, by-country breakdown | Click a session row → timeline |
| Resources | Cloudflare worker resource panels + (when populated) Linux host metrics | Click a host → host detail |
| Projects | Multi-project routing (ingest keys, dashboard auth) | n/a |
## When you should expect informative absence
The rail is honest about what's missing. You'll see explicit "—" messages when:
- No `interaction_id` on a span — the trace wasn't caused by a browser click (cron, queue consumer, retry). The "Originating click" section explains this.
- No pprof profile — the producing service hasn't wired `startProfiler()` or an eBPF agent. The Down section explains how to populate it.
- No rrweb replay — the session had no real browser to capture chunks. The Replay tab tells you to visit `/playground` and click "Start replay" to capture one.
- Alert/analysis topic links — alerts and analyses don't carry identity columns; they relate by topic, not identity. The rail's `Related` section explains this is by design.
These are part of the design — empty data should always be explained, never silent.
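The four cases above could be expressed as a simple lookup — a hypothetical sketch mirroring the bullets, not the dashboard's actual copy table:

```typescript
// Illustrative mapping from absence case to the rail's "—" message.
type AbsenceCase =
  | "no-interaction-id"
  | "no-pprof-profile"
  | "no-rrweb-replay"
  | "topic-linked-only";

const ABSENCE_MESSAGES: Record<AbsenceCase, string> = {
  "no-interaction-id":
    "This trace wasn't caused by a browser click (cron, queue consumer, retry).",
  "no-pprof-profile":
    "The producing service hasn't wired startProfiler() or an eBPF agent.",
  "no-rrweb-replay":
    'No real browser captured chunks. Visit /playground and click "Start replay".',
  "topic-linked-only":
    "Alerts and analyses relate by topic, not identity. This is by design.",
};

function absenceMessage(c: AbsenceCase): string {
  return `— ${ABSENCE_MESSAGES[c]}`;
}
```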
## Production deployment caveats
- The migration runner has a `--remote` mode; first-run on a partially-migrated production DB needs manual backfill (see Installation).
- The every-minute analyses cron uses a 90s claim/lease to prevent overlap on long-running LLM narrative passes (RFC 0002 Stage 4 follow-up).
- The pprof receiver returns 422 on decode failure (corrupted blobs surface to the agent instead of landing silently in R2).
- The connected-routes endpoint returns 400 on unknown entity kinds (catches client-side URL building bugs).
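The claim/lease pattern behind the analyses cron can be sketched in a few lines. This is a minimal illustration, not the real implementation: an in-memory `Map` stands in for the database row, and `tryClaim` is a hypothetical name. In production the compare-and-set must be atomic in the DB (e.g. a conditional UPDATE), not in process memory:

```typescript
// Sketch of a 90s claim/lease guard for an every-minute cron.
// A run that outlives one cron tick keeps its lease, so the next
// tick's claim attempt fails instead of starting an overlapping run.
const LEASE_MS = 90_000;

interface Lease {
  holder: string;
  expiresAt: number;
}

const leases = new Map<string, Lease>(); // stand-in for a DB row

function tryClaim(job: string, holder: string, now: number): boolean {
  const cur = leases.get(job);
  if (cur && cur.expiresAt > now) return false; // a live lease exists
  leases.set(job, { holder, expiresAt: now + LEASE_MS });
  return true;
}
```

With a 60s cron interval and a 90s lease, a narrative pass that runs long simply causes the next tick to skip; the lease expiring also bounds how long a crashed worker can block the job.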