Skip to content

Analytics reliability improvements + bug fixes#193

Open
strickvl wants to merge 5 commits intodevelopfrom
fix/analytics-gaps-from-first-report
Open

Analytics reliability improvements + bug fixes#193
strickvl wants to merge 5 commits intodevelopfrom
fix/analytics-gaps-from-first-report

Conversation

@strickvl
Copy link
Copy Markdown
Collaborator

@strickvl strickvl commented Feb 6, 2026

Summary

Addresses all 8 analytics gaps identified in the first analytics report (Jan 12 – Feb 6, 2026). Also fixes a P0 bug where local server connect options (Docker, port) were silently dropped due to an argument-passing mismatch between TypeScript and Python.

  • Fix local connect args bug (P0): The url variable was uninitialized for local connections but still sent in the args array, causing Python to grab the wrong positional arg. Additionally, getattr() was used on a dict instead of .get(). Docker/port options were always silently ignored.
  • Add server.connection_failed event (P1): Dedicated failure event with privacy-safe error taxonomy (errorKind, errorSource, messageHash) — key signal for diagnosing the Windows connection gap (1.8% vs 11.4% connection rate).
  • Fix "unknown" disconnections (P1): Store lastConnectedType on connect, use it on disconnect instead of re-categorizing the post-disconnect URL (which reverts to sqlite). Add disconnectReason (user_initiated vs unexpected) via disconnect intent signal.
  • Add centralized error.occurred tracking (P1): Emit from LSClient.sendLsClientRequest() at three points (preflight, request, response) with 60s dedupe window.
  • Add Python/ZenML version to common properties (P2): Wire existing version data into AnalyticsService via ENVIRONMENT_INFO_UPDATED EventBus pattern.
  • Add session tracking + extension.deactivated (P2): Generate sessionId on init, include in all events, emit deactivation event with session duration.
  • Add first-run detection (P2): extension.first_activated event via globalState flag, isFirstActivation on extension.activated.
  • Add component CRUD analytics (P3): component.registered, component.updated, component.deleted events.

New file: src/utils/analytics.ts

Privacy-safe error classification utility:

  • ErrorKind / ErrorSource / ErrorPhase taxonomy types
  • sanitizeErrorForAnalytics() — classifies errors + produces hash, never raw messages
  • normalizeForHash() — strips URLs, paths, UUIDs, tokens before hashing
  • isErrorLikeResponse() — type guard for LSP error responses
  • trackEvent() — shared helper used by all command modules
  • extractErrorMessage() — safely extracts message from any error type

Code review fixes (second commit)

  • Fix unused import / add error taxonomy to component register/update: ComponentsForm.ts now returns ComponentOperationResult from registerComponent()/updateComponent(), including sanitizeErrorForAnalytics taxonomy on failure. All three component operations (register, update, delete) now have consistent error tracking.
  • Fix spurious server.disconnected on startup: handleServerStatusChange() now treats the first observed disconnected state as initialization (sets baseline, doesn't emit). Initial connected state still emits server.connected.
  • Widen ConnectServerResponse type: Added SuccessMessageResponse to the union for local/pro message-only success payloads (no access_token).
  • Add dedupe map pruning: LSClient.errorDedupe now has opportunistic pruning — stale entries older than 60s are removed at most once per minute to prevent unbounded growth.
  • Add 54 new tests: Error taxonomy classification (all ErrorKind variants), hash privacy guarantees (URL/path/UUID stripping), server status tracking (initial state, disconnect classification with 10s window, connection type preservation, deduplication), and utility functions (normalizeForHash, sha256Hex, isErrorLikeResponse).

11-agent code review fixes (third commit)

Addressed findings from a comprehensive review by 11 parallel agents (code reviewer, silent failure hunter, comment analyzer, test analyzer, type design analyzer, efficiency reviewer, code quality/reuse/simplification reviewers, and two reviewer personas):

  • Fix catch (error: any) runtime crash risks: Changed to catch (error: unknown) with proper narrowing in LSClient.sendLsClientRequest and disconnectServer. The old code called error.message without checking if error was an Error instance.
  • Fix isErrorLikeResponse false positives: Added truthiness check (&& result.error) to prevent analytics events from { error: null } responses.
  • Fix silent data corruption: Logged globalState.update rejections (first-activated flag and anonymous ID) — failures were silently swallowed, causing extension.first_activated to fire on every activation.
  • Fix event listener leak: Stored 4 listener references in AnalyticsService.registerEventBus() for proper cleanup in dispose(). Previously, anonymous arrow functions were never unsubscribed.
  • Consolidate trackEvent helper: Extracted shared export in analytics.ts, replacing 5 identical local copies across command modules.
  • Extract executeComponentOperation(): Eliminated ~50 lines of copy-paste between registerComponent and updateComponent in ComponentsForm.ts.
  • Export extractErrorMessage(): Shared utility replacing hand-rolled error instanceof Error ? error.message : String(error) pattern.
  • Remove dead 'python_backend' from ErrorSource: No code path ever produced this value.
  • Add logging to bare catch blocks: track() and emitErrorOccurred() now log to console.debug instead of silently swallowing.
  • Remove redundant double-flush on deactivation: dispose() already flushes.
  • Fix sha256Hex JSDoc: Now correctly documents truncation to 16 characters.
  • Update CLAUDE.md analytics instructions: References shared trackEvent import instead of stale local pattern.
  • Add 7 new TS tests: LSClient error dedupe (dedup within window, emit after expiry, cross-operation independence, preflight hard-dedupe).
  • Add 6 new Python tests: Connect args parsing for P0 bug fix (corrected shape, legacy shape, defaults, empty args, missing URL).

Test plan

  • Verify local server connect with Docker=Yes and custom port actually passes options through (was broken before)
  • Verify remote server connect still works (args unchanged for remote path)
  • Check that extension.first_activated fires only on first install (clear globalState to test)
  • Check that extension.deactivated fires with sessionDurationMs on window close
  • Verify server.connection_failed fires with error taxonomy on failed connect
  • Verify error.occurred fires from LSClient on LSP errors (with dedupe)
  • Verify disconnect shows correct connectionType (not "unknown") and disconnectReason
  • Verify component register/update/delete emit analytics events with error taxonomy on failure
  • Verify pythonVersion and zenmlVersion appear in event properties after LSP initialization
  • Run with ZENML_ANALYTICS_VERBOSE=1 ZENML_ANALYTICS_DEBUG=1 to inspect event payloads
  • Confirm no raw URLs, paths, or error messages appear in any analytics payload
  • Run ./scripts/lint.sh — all checks pass
  • Run npm run compile — builds successfully
  • Run npm run test — all 134 tests pass (including 54 new tests)
  • Python connect args tests — 6/6 pass

Fix local connect args bug (P0):
- Fix argument order mismatch between TS and Python for local server
  connections — url was uninitialized but still sent, causing docker/port
  options to be silently dropped
- Fix getattr() on dict in Python wrapper — use dict.get() instead
- Support both legacy and corrected arg shapes for backward compatibility

Add server.connection_failed event (P1):
- Emit dedicated failure event with error taxonomy on connect failure
- Include connectionType, serverUrlCategory, docker, portProvided, and
  privacy-safe error classification (errorKind, errorSource, messageHash)

Fix unknown disconnections (P1):
- Store lastConnectedType on connect, use it on disconnect instead of
  re-categorizing the post-disconnect URL (which reverts to sqlite)
- Add disconnectReason property (user_initiated vs unexpected) using
  SERVER_DISCONNECT_REQUESTED intent signal from disconnect command

Add centralized error.occurred tracking (P1):
- Emit error.occurred from LSClient.sendLsClientRequest at three points:
  client not ready (preflight), request throws (request), error response
- Dedupe identical errors within 60s window; hard-dedupe lsp_not_ready
  to once per session

Add Python/ZenML version to common properties (P2):
- Cache pythonVersion, zenmlVersion, zenmlInstalled in AnalyticsService
- Wire from LSClient (zenml/isInstalled notification) and ZenExtension
  (interpreter change) via ENVIRONMENT_INFO_UPDATED EventBus pattern
- Include in all event common properties

Add session tracking and extension.deactivated (P2):
- Generate sessionId (UUID) on AnalyticsService.initialize()
- Include sessionId in all events as common property
- Emit extension.deactivated with sessionDurationMs before dispose

Add first-run detection (P2):
- Use globalState key to detect first-ever activation
- Emit extension.first_activated once per install
- Add isFirstActivation flag to extension.activated event

Add component CRUD analytics events (P3):
- component.registered and component.updated from ComponentsForm
- component.deleted from components/cmds with error taxonomy on failure

New file: src/utils/analytics.ts
- ErrorKind/ErrorSource/ErrorPhase taxonomy types
- sanitizeErrorForAnalytics() — privacy-safe error classification + hash
- normalizeForHash() — strips URLs, paths, UUIDs, tokens before hashing
- isErrorLikeResponse() — type guard for error responses

Update CLAUDE.md with analytics architecture docs and event table.
Update README.md "What We Collect" section for new events/properties.
- Fix unused import in ComponentsForm.ts by wiring sanitizeErrorForAnalytics
  into register/update failure paths (error taxonomy now on all component events)
- Prevent spurious server.disconnected on startup by treating first observed
  disconnected state as initialization (not a transition)
- Widen ConnectServerResponse type to include SuccessMessageResponse for
  local/pro message-only success payloads
- Add opportunistic pruning to LSClient errorDedupe map to prevent unbounded
  growth in long sessions
- Add 54 new tests: error taxonomy classification, hash privacy guarantees,
  server status tracking, disconnect reason correlation, deduplication
@strickvl strickvl changed the title Close all 8 analytics gaps from first report Analytics reliability improvements + bug fixes Feb 6, 2026
strickvl and others added 3 commits March 31, 2026 06:39
- Fix catch (error: any) in LSClient.sendLsClientRequest and
  server/cmds disconnectServer to use error: unknown with proper
  narrowing via extractErrorMessage(), preventing runtime crashes
  on non-Error thrown values
- Add truthiness check to isErrorLikeResponse usage in LSClient
  to prevent false-positive analytics on { error: null } responses
- Log globalState.update rejections in ZenExtension and
  AnalyticsService to prevent silent first-activation data corruption
- Store event listener references in AnalyticsService for proper
  cleanup in dispose(), preventing listener leaks on host reload
- Add console.debug logging to bare catch blocks in track() and
  emitErrorOccurred for debuggability
- Remove redundant double-flush in extension deactivation
- Extract shared executeComponentOperation() in ComponentsForm,
  eliminating ~50 lines of copy-paste between register and update
- Export extractErrorMessage from analytics.ts for reuse
- Consolidate trackEvent helper into shared export (was duplicated
  across 5 command modules)
- Remove dead 'python_backend' from ErrorSource union type
- Fix sha256Hex JSDoc to mention truncation to 16 characters
- Update CLAUDE.md analytics instructions to reference shared import
- Add 7 TypeScript tests for LSClient error dedupe logic
- Add 6 Python tests for connect() args parsing (P0 bug fix coverage)
@zenml-io zenml-io deleted a comment from coderabbitai Bot Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants