obs-unified

What to expect

The Connected rail, three scenarios end-to-end, the user detail page, and a per-tab walkthrough.

obs-unified is designed around one promise: every signal is reachable from every other in ≤2 clicks. This page walks through what the dashboard actually surfaces once instrumentation is in place.

The Connected rail

Every detail page in the dashboard mounts a right-side rail with four sections:

┌─ Connected — span ─┐
│                    │
│  Up:               │
│    Trace           │
│      Parent trace  │
│                    │
│  Across:           │
│    Other spans     │
│    Logs in trace   │
│    AI calls        │
│                    │
│  Down:             │
│    Profiles        │
│                    │
│  Related:          │
│    Click that      │
│      caused this   │
│      trace         │
│      → click_5     │
│                    │
└────────────────────┘
  • Up — the parent entity (trace ← span, session ← usage event, etc.)
  • Across — sibling signals sharing the same identity key (other spans in the same trace, logs from the same session)
  • Down — derived data (pprof profile for a trace, off-CPU profile for a span)
  • Related — non-identity-based neighbors (the click that caused this trace, alerts firing on this service)

When a section has no neighbors, the rail renders an informative-absence message explaining why — never a silent empty section. The platform's contract is that "no data" should always tell you what's missing and how to populate it.
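A minimal sketch of that contract. The `RailSection` shape and function names here are illustrative assumptions, not the actual obs-unified types; the point is that an empty section always renders a reason and a pointer to the fix, never an empty box.

```typescript
// Hypothetical Connected-rail section shape (names are illustrative).
type RailSection = {
  title: "Up" | "Across" | "Down" | "Related";
  neighbors: { label: string; href: string }[];
  // Shown when `neighbors` is empty: why, and how to populate it.
  absence?: { reason: string; howToPopulate: string };
};

function renderSection(s: RailSection): string {
  if (s.neighbors.length > 0) {
    return `${s.title}:\n` + s.neighbors.map((n) => `  ${n.label}`).join("\n");
  }
  // Contract: never a silent empty section.
  const a = s.absence ?? {
    reason: "no neighbors found",
    howToPopulate: "see instrumentation docs",
  };
  return `${s.title}: — ${a.reason} (${a.howToPopulate})`;
}
```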

Scenario A — alert → trace → flame graph → cohort → session → replay

The headline product test. From a paged alert:

| Step | What you see | What you click | RFCs |
|---|---|---|---|
| 1 | Alert detail with bound Analysis narrative + exemplar traces | Slowest exemplar trace | 0002, 0006 |
| 2 | Trace waterfall, self-time bars, ⚠ UNINSTRUMENTED + 🔥 PROFILES badges | 🔥 badge on the slow span | 0005, 0006, 0007 |
| 3 | Flame graph filtered to this trace's samples (server-side filter, smaller blob) | "Other traces sampled in this profile (243)" | 0007 |
| 4 | Cohort: all traces touched by this profile, with user attribution | A user from the cohort | 0007, 0006 |
| 5 | Session timeline: user's page views, clicks, traces side-by-side | An rrweb event | 0004, 0006 |
| 6 | Replay scrubbed to the click + Connected rail: "Trace caused by this click" | Closes the loop back to step 2's trace | 0004, 0006 |

Six clicks traverse the entire platform, and at every step the next hop is a neighbor on the Connected rail.
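The Scenario A walkthrough can be modeled as a walk over an entity graph. The graph below is an illustrative assumption (the entity names and edges are not the real schema); it just encodes the claim that each of the six hops, including the final loop-close back to the trace, is a single rail click.

```typescript
// Illustrative rail-neighbor graph for Scenario A (assumed, not real schema).
const railNeighbors: Record<string, string[]> = {
  alert: ["trace"],                  // exemplar traces on the alert detail
  trace: ["flamegraph", "session"],  // 🔥 badge on the slow span
  flamegraph: ["cohort"],            // "Other traces sampled in this profile"
  cohort: ["session"],               // a user from the cohort
  session: ["replay"],               // an rrweb event
  replay: ["trace"],                 // "Trace caused by this click" closes the loop
};

const scenarioA = ["alert", "trace", "flamegraph", "cohort", "session", "replay", "trace"];

// True when every consecutive hop in the path is a single rail click.
function isOneClickPath(path: string[]): boolean {
  return path.slice(1).every((node, i) => (railNeighbors[path[i]] ?? []).includes(node));
}
```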

Scenario B — AI cost spike → user → session → trace

A different entry point exercising the same identity skeleton:

  1. AI dashboard shows a cost spike (SPANS OVER TIME chart peaks). The Sessions view ranks the heavy spender at the top by cost.
  2. Click the 👤 user-id chip on the heavy spender's row → user detail page.
  3. User detail page shows the user's Identity card + a Connected rail with "Latest session", "Recent traces", "Recent AI calls". The rail surfaces the count-collapsed link for a session with N traces / M AI calls.
  4. Click "Latest session" → Replay tab scoped to that session, showing the session's interactions linked to their traces.
  5. Click an interaction → trace waterfall for the trace that click caused. Connected rail's "Click that caused this trace" closes the loop back to the originating click.

The seed (pnpm seed) plants a "Heavy Spender (seed)" user with 8–9 high-cost claude-3-5-haiku calls so this walkthrough is reproducible without writing real AI traffic.
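A sketch of what the seed plants for this walkthrough. The record shape and field names below are assumptions for illustration, not the actual seed schema; the invariants are the ones the scenario depends on: 8–9 calls, all attributed to one user, all high-cost so the Sessions view ranks that user first.

```typescript
// Hypothetical seed record shape (field names are illustrative).
type AiCallSeed = {
  userId: string;
  model: string;
  costUsd: number;
  sessionId: string;
};

function seedHeavySpender(sessionId: string): AiCallSeed[] {
  const n = 8 + Math.floor(Math.random() * 2); // 8–9 calls
  return Array.from({ length: n }, () => ({
    userId: "heavy-spender-seed",
    model: "claude-3-5-haiku",
    costUsd: 1.5, // deliberately high so this user tops the cost ranking
    sessionId,
  }));
}
```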

Scenario C — futex contention via off-CPU flame graph

Validates the kernel-level layer:

  1. Trace shows an unexplained pause inside a span (no child spans, on-CPU profile shows little activity).
  2. Rail's "Down → 🔥 off-CPU profile" leads to an icicle flame graph that surfaces futex_wait_queue ← pthread_mutex_lock ← inventory_pool::checkout taking 84% of off-CPU time.
  3. Root cause: a single pool-wide mutex serializing every checkout.

This scenario currently runs only against the docker-compose demo with Beyla feeding pprof. The dashboard code paths are live; the synthetic seed doesn't generate pprof blobs.

Per-tab walkthrough

| Tab | What's there | Key rail pivots |
|---|---|---|
| Health | Tier-0 analysis tiles (error top offenders, latency outliers, log anomaly summary) with optional LLM narrative | Click a tile → Investigations page with the analysis detail |
| Timeline | Per-session lane of usage / span / log events, grouped by interaction_id | Click an event → trace or replay |
| Service Map | Service-to-service edges with SDK / eBPF source filter | Click an edge → traces between those services |
| Logs | Histogram + by-service / by-severity breakdown, filterable | Click a log → log detail with rail surfacing parent trace |
| Investigations | List of analyses + per-analysis detail page with narrative + evidence + connected rail | Rail's "Cited traces" → trace detail |
| Traces | Trace list with inline waterfall expansion, self-time visualization, ⚠ + 🔥 badges, span detail drawer | Click a span row → rail with "Click that caused this trace" |
| Issues | Trace-level issue grouping by error fingerprint | Click an issue → trace |
| AI Calls | Two views: Spans (typed LLM/TOOL/RETRIEVER spans) and Sessions (multi-turn conversation rendering with cost + tokens). User chips are clickable. | Click 👤 user-id → user detail page |
| Replays | Session list + rrweb player + per-session interactions panel | Click an interaction → trace it caused |
| Alerts | Alert rules + recent firings + bound analyses | Click an alert → bound Analysis → exemplar traces |
| Usage | Page views, interactions, top paths, by-country breakdown | Click a session row → timeline |
| Resources | Cloudflare worker resource panels + (when populated) Linux host metrics | Click a host → host detail |
| Projects | Multi-project routing (ingest keys, dashboard auth) | n/a |

When you should expect informative absence

The rail is honest about what's missing. You'll see explicit "—" messages when:

  • No interaction_id on a span — the trace wasn't caused by a browser click (cron, queue consumer, retry). The "Originating click" section explains this.
  • No pprof profile — the producing service hasn't wired startProfiler() or an eBPF agent. The Down section explains how to populate.
  • No rrweb replay — the session had no real browser to capture chunks. The Replay tab tells you to visit /playground and click "Start replay" to capture one.
  • Alert/analysis topic links — alerts and analyses don't carry identity columns; they relate by topic, not identity. The rail's Related section explains this is by design.

These are part of the design — empty data should always be explained, never silent.

Production deployment caveats

  • The migration runner has a --remote mode; first-run on a partially-migrated production DB needs manual backfill (see Installation).
  • The every-minute analyses cron uses a 90s claim/lease to prevent overlap on long-running LLM narrative passes (RFC 0002 Stage 4 follow-up).
  • The pprof receiver returns 422 on decode failure (corrupted blobs surface to the agent instead of landing silently in R2).
  • The connected-routes endpoint returns 400 on unknown entity kinds (catches client-side URL building bugs).
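The 90s claim/lease can be sketched as a compare-and-set against a lease expiry. The storage shape below (a simple key → expiry map) is an assumption for illustration; the real cron presumably claims against its database, but the invariant is the same: a second run within the lease window is a no-op, and an expired lease can be taken over.

```typescript
// Sketch of a 90s claim/lease guard against overlapping cron runs.
// The Map-backed store is an assumed stand-in for the real storage.
const LEASE_MS = 90_000;

type LeaseStore = Map<string, number>; // key -> lease expiry (epoch ms)

function tryClaim(store: LeaseStore, key: string, now: number): boolean {
  const expiry = store.get(key);
  if (expiry !== undefined && expiry > now) return false; // another run holds it
  store.set(key, now + LEASE_MS); // claim, or take over an expired lease
  return true;
}
```

A long-running LLM narrative pass that exceeds 90s loses its lease, so the pattern trades strict mutual exclusion for guaranteed forward progress.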
