
feat(telemetry): instrument connection lifecycle #942

Open
EhabY wants to merge 8 commits into main from feat/issue-905-connection-telemetry


EhabY commented May 7, 2026

Summary

  • instrument SSH process discovery, loss/recovery, and sampled network info telemetry
  • instrument reconnecting WebSocket open/drop/reconnect/state-transition lifecycle telemetry
  • wire telemetry into primary and remote-scoped Coder API clients while leaving throwaway clients opt-in
  • add TestSink assertions for the requested lifecycle events and sampling behavior
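The TestSink assertions come down to recording every emitted event in memory and filtering by name. Below is a minimal sketch of that pattern. It assumes a hypothetical TelemetrySink interface with a single log(name, properties, measurements) method; the extension's real sink and event types may look different.

```typescript
// Hypothetical event shape; the extension's real telemetry types may differ.
interface TelemetryEvent {
  name: string;
  properties: Record<string, string>;
  measurements: Record<string, number>;
}

// Hypothetical sink contract: instrumented code calls log() once per event.
interface TelemetrySink {
  log(
    name: string,
    properties?: Record<string, string>,
    measurements?: Record<string, number>,
  ): void;
}

// In-memory sink that keeps every event so tests can assert on them.
class TestSink implements TelemetrySink {
  readonly events: TelemetryEvent[] = [];

  log(
    name: string,
    properties: Record<string, string> = {},
    measurements: Record<string, number> = {},
  ): void {
    this.events.push({ name, properties, measurements });
  }

  // All recorded events with the given name, in emission order.
  eventsNamed(name: string): TelemetryEvent[] {
    return this.events.filter((e) => e.name === name);
  }
}

// Usage: pass the sink to the code under test, then inspect what it emitted.
const sink = new TestSink();
sink.log("ssh.process.discovered", {}, { attempts: 3 });
sink.log("ssh.process.lost", { cause: "stale_network_info" });

const lost = sink.eventsNamed("ssh.process.lost");
console.log(lost.length, lost[0].properties.cause); // 1 stale_network_info
```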

Closes #905.

Validation

  • pnpm test:extension ./test/unit/remote/sshProcess.test.ts
  • pnpm test:extension ./test/unit/websocket/reconnectingWebSocket.test.ts
  • pnpm typecheck
  • pnpm lint
  • pnpm format:check

Implementation plan and decisions

Confirmed decisions:

  1. Telemetry injection is optional for throwaway CoderApi instances; long-lived primary/remote clients receive telemetry.
  2. connection.open uses the sanitized route (pathname + search) rather than the full URL.
  3. Reconnect aggregation emits telemetry for terminal DISCONNECTED outcomes as well as successful recovery.
  4. SSH loss causes are stale_network_info, missing_network_info, process_changed, and disposed.
  5. The network info derp property uses the preferred_derp string, or an empty string if it is unavailable.
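Decisions 2, 4 and 5 are small enough to sketch directly. The helper names (sanitizeRoute, derpProperty, SSH_LOSS_CAUSES) and the NetworkInfo shape are hypothetical. Only the cause strings, the pathname + search rule, and the empty-string fallback come from the list above.

```typescript
// Decision 4: loss causes as a string-literal union, so a mistyped cause
// fails type checking instead of reaching telemetry.
type SshLossCause =
  | "stale_network_info"
  | "missing_network_info"
  | "process_changed"
  | "disposed";

// Runtime list of the same causes, e.g. for validating strings in tests.
const SSH_LOSS_CAUSES: readonly SshLossCause[] = [
  "stale_network_info",
  "missing_network_info",
  "process_changed",
  "disposed",
];

// Decision 2: keep only pathname + search, dropping the scheme, host,
// credentials, and fragment from the URL recorded on connection.open.
function sanitizeRoute(rawUrl: string): string {
  const url = new URL(rawUrl);
  return url.pathname + url.search;
}

// Hypothetical shape of the SSH network info; only preferred_derp matters here.
interface NetworkInfo {
  preferred_derp?: string;
}

// Decision 5: fall back to an empty string when no DERP region is known.
function derpProperty(info: NetworkInfo | undefined): string {
  return info?.preferred_derp ?? "";
}

const route = sanitizeRoute(
  "wss://user:secret@coder.example.com/api/v2/some/route?watch=true#frag",
);
console.log(route); // /api/v2/some/route?watch=true
console.log(JSON.stringify(derpProperty(undefined))); // ""
console.log(SSH_LOSS_CAUSES.length); // 4
```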

Implemented:

  • SshProcessMonitor: ssh.process.discovered, ssh.process.lost, ssh.process.recovered, and sampled ssh.network.info.
  • ReconnectingWebSocket: connection.state_transition, connection.open, connection.drop, and aggregated connection.reconnect.
  • Unit coverage for requested event behavior and sampling cadence.
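The sampled ssh.network.info event, together with the later commit that short-circuits sampling before allocating a sample, can be sketched as a counter that is checked before any work is done. This sketch assumes a fixed every-Nth cadence and a hypothetical NetworkReading shape. The PR does not state the actual sampling rule.

```typescript
type LogFn = (
  name: string,
  properties: Record<string, string>,
  measurements: Record<string, number>,
) => void;

// Hypothetical reading from the SSH network-info file.
interface NetworkReading {
  preferred_derp?: string;
  latencyMs: number;
}

class NetworkInfoSampler {
  private seen = 0;
  private readonly log: LogFn;
  private readonly every: number;

  // `every` is an assumed cadence: emit one event per `every` readings.
  constructor(log: LogFn, every: number) {
    if (!Number.isInteger(every) || every < 1) {
      throw new RangeError("every must be a positive integer");
    }
    this.log = log;
    this.every = every;
  }

  // Takes a thunk so that skipped readings are never built at all.
  observe(read: () => NetworkReading): void {
    const index = this.seen++;
    if (index % this.every !== 0) {
      return; // short-circuit: nothing allocated for this reading
    }
    const reading = read();
    this.log(
      "ssh.network.info",
      { derp: reading.preferred_derp ?? "" },
      { latencyMs: reading.latencyMs },
    );
  }
}

const emitted: string[] = [];
let reads = 0;
const sampler = new NetworkInfoSampler((name) => emitted.push(name), 3);
for (let i = 0; i < 7; i++) {
  sampler.observe(() => {
    reads++;
    return { preferred_derp: "nyc", latencyMs: 20 };
  });
}
// Readings 0, 3 and 6 are emitted; the other four are skipped unread.
console.log(emitted.length, reads); // 3 3
```

Passing a thunk rather than a ready-made object keeps the skipped path allocation-free, which is the point of checking the counter first.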

Generated by Coder Agents.

EhabY self-assigned this May 7, 2026
EhabY force-pushed the feat/issue-905-connection-telemetry branch from 11b393a to 2903de2 on May 7, 2026 15:05
EhabY and others added 5 commits May 7, 2026 18:07
Type the connection reason/cause strings (ConnectionStateReason,
ConnectionDropCause), have disconnect/scheduleReconnect own drop
emission so the dedup flag and triple pre-call drops can go away,
add an attempts measurement to the ssh discovery span, and extract
the duplicated test telemetry helper.
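The commit above makes disconnect and scheduleReconnect own drop emission. Below is a sketch of why that removes the need for a separate dedup flag. It assumes a simplified three-state socket. The DropEmitter class and LogFn callback are hypothetical, and so are the ConnectionDropCause members shown, because the PR text names the type but not its values.

```typescript
// Hypothetical members; the PR's actual ConnectionDropCause values are not listed.
type ConnectionDropCause = "error" | "close" | "user_disconnect";

type ConnectionState = "CONNECTED" | "RECONNECTING" | "DISCONNECTED";

type LogFn = (name: string, properties: Record<string, string>) => void;

class DropEmitter {
  private state: ConnectionState = "CONNECTED";
  private readonly log: LogFn;

  constructor(log: LogFn) {
    this.log = log;
  }

  get current(): ConnectionState {
    return this.state;
  }

  // Terminal path: stop for good, emitting a drop only if we were connected.
  disconnect(cause: ConnectionDropCause): void {
    this.emitDropIfConnected(cause);
    this.state = "DISCONNECTED";
  }

  // Transient path: the socket failed and a retry should follow.
  scheduleReconnect(cause: ConnectionDropCause): void {
    if (this.state === "DISCONNECTED") {
      return; // already terminal, nothing to reconnect
    }
    this.emitDropIfConnected(cause);
    this.state = "RECONNECTING";
    // A real implementation would arm a backoff timer here.
  }

  // Called once the socket reopens.
  markOpen(): void {
    this.state = "CONNECTED";
  }

  // A drop is a transition out of CONNECTED, so the state itself prevents
  // double emission; callers never emit drops on their own.
  private emitDropIfConnected(cause: ConnectionDropCause): void {
    if (this.state === "CONNECTED") {
      this.log("connection.drop", { cause });
    }
  }
}

const drops: Record<string, string>[] = [];
const conn = new DropEmitter((_name, props) => drops.push(props));
conn.scheduleReconnect("error");
conn.scheduleReconnect("close"); // already RECONNECTING: no second drop
conn.disconnect("user_disconnect"); // leaving RECONNECTING: no drop
console.log(drops.length, conn.current); // 1 DISCONNECTED
```

Because a drop is defined as leaving CONNECTED, the state variable already records whether a drop was reported, and no extra flag is needed.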
EhabY force-pushed the feat/issue-905-connection-telemetry branch from 2903de2 to 5e25405 on May 7, 2026 15:07
EhabY and others added 3 commits May 7, 2026 19:43
The reconnect cycle was modeled as a Promise-backed trace with a NOOP_SPAN
placeholder and synthetic Errors fed in to satisfy the trace contract.
Switch to a plain field plus a single log() call at finish time, since the
cycle is event-driven, not function-driven. Encapsulate the four-step
process_changed handover as recordProcessReplaced, short-circuit network
sampling before allocating a sample, and rename the shadowed internal
options.reason to closeReason.
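The commit above replaces the Promise-backed reconnect trace with a plain field and a single log() call at finish time. Below is a minimal sketch of that shape. It assumes an injectable clock and a log(name, properties, measurements) callback. ReconnectTracker, CycleOutcome, and the attempts/durationMs measurement names are hypothetical. Reporting both outcomes follows decision 3.

```typescript
type LogFn = (
  name: string,
  properties: Record<string, string>,
  measurements: Record<string, number>,
) => void;

// Both outcomes of a cycle are reported (decision 3).
type CycleOutcome = "recovered" | "disconnected";

// Hypothetical per-cycle state, held in a plain field rather than a trace.
interface ReconnectCycle {
  startedAt: number;
  attempts: number;
}

class ReconnectTracker {
  private cycle: ReconnectCycle | undefined;
  private readonly log: LogFn;
  private readonly now: () => number;

  constructor(log: LogFn, now: () => number = Date.now) {
    this.log = log;
    this.now = now;
  }

  // Each retry either opens a new cycle or extends the current one.
  recordAttempt(): void {
    const cycle = this.cycle ?? { startedAt: this.now(), attempts: 0 };
    cycle.attempts++;
    this.cycle = cycle;
  }

  // Called when the socket reopens or gives up; emits one event per cycle.
  finish(outcome: CycleOutcome): void {
    const cycle = this.cycle;
    if (cycle === undefined) {
      return; // no reconnect in progress
    }
    this.cycle = undefined;
    this.log(
      "connection.reconnect",
      { outcome },
      { attempts: cycle.attempts, durationMs: this.now() - cycle.startedAt },
    );
  }
}

let t = 1_000; // fake clock for the example
const logged: Array<Record<string, string | number>> = [];
const tracker = new ReconnectTracker(
  (_name, props, measurements) => logged.push({ ...props, ...measurements }),
  () => t,
);
tracker.recordAttempt(); // cycle opens at t = 1000
t += 500;
tracker.recordAttempt();
t += 250;
tracker.finish("recovered");
tracker.finish("recovered"); // no open cycle: ignored
console.log(JSON.stringify(logged));
// [{"outcome":"recovered","attempts":2,"durationMs":750}]
```

Since the cycle is driven by socket events rather than by one awaited function, a nullable field that is cleared in finish() is enough to guarantee a single event per cycle.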

Development

Successfully merging this pull request may close these issues.

Telemetry: instrument connection lifecycle (SSH process, WebSocket)
