71 changes: 71 additions & 0 deletions docs/architecture/adr/index.md
@@ -0,0 +1,71 @@
---
title: Architecture Decision Records
description: Index of every accepted ADR, with status and one-line summary.
---

# Architecture Decision Records

Architecture Decision Records (ADRs) document **why** a particular architectural choice was made — the context, the alternatives considered, the trade-offs accepted. They're written once and not edited; if a decision changes, a new ADR supersedes the old.

For background on the ADR practice, see Michael Nygard's [original post](https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions). PAL-X follows the standard lightweight pattern: one Markdown file per decision, numbered chronologically, with a status field.

## Status legend

| Status | Meaning |
|---|---|
| `Proposed` | Under discussion. Not yet implemented. |
| `Accepted` | Decided, ratified, implemented. |
| `Deprecated` | Decision still applies historically but a new ADR supersedes the recommendation. |
| `Superseded by ADR-####` | Replaced. The new ADR documents the change. |

No ADRs are currently `Deprecated` or `Superseded`.

## Accepted ADRs

| # | Title | Date | Status | One-line summary |
|---:|---|---|---|---|
| 0001 | [Ratified Deviations from Seed Documentation](0001-deviations-from-seed-docs.md) | 2026-04-23 | Accepted | 12 deviations from the ChatGPT-generated seed docs, ratified at project kickoff: tri-state status (no 0–100 score), declarative comparators (no DSL), content-hash IDs, snake_case fields, `host_context` in v1 schema, Spectre.Console.Cli over System.CommandLine, ScottPlot for charts, and others. |
| 0002 | [Declarative Rule Schema Instead of Custom DSL](0002-declarative-rule-schema.md) | 2026-04-23 | Accepted | Rule conditions are declarative — `metric` + `aggregation` + `operator` + `threshold` + `duration_percent` + optional `window`. No expression parser. Trades expressivity for stability, auditability, and zero parser maintenance. |
| 0003 | [Pack Signing Format and Trust Model](0003-pack-signing-format.md) | 2026-04-27 | Accepted | RSA-PSS-SHA256 with 3072-bit keys, signing raw `pack.yaml` bytes, sidecar file at `pack.yaml.sig`. BCL-only (no NuGet dep). Trust model is consumer-rooted via embedded project key + CLI `--trust-key`. |
| 0004 | [Schema v1.1: Rolling-Window Aggregations (In-Place Enum Bump)](0004-schema-v1.1-rolling-windows.md) | 2026-04-27 | Accepted | Pack schema gains rolling-window aggregations via an additive `window:` field on `Condition`. Schema discriminator is in-place (`schema_version: "pal.pack/v1.1"`); no new JSON Schema file. Validator gates `window:` on the v1.1 version. |

## Reading an ADR

Each ADR follows the same structure:

- **Context** — the problem being solved and the constraints.
- **Decisions** — what was chosen, often broken into sub-decisions.
- **Consequences** — what changed, what we gave up, what's now harder or easier.
- **Alternatives considered** — what we didn't pick and why.

The most important read for new contributors is **[ADR 0001](0001-deviations-from-seed-docs.md)** — it documents every design choice that diverges from the seeded ChatGPT spec, and the diverging choice is the load-bearing one in nearly every case.

## Authoring a new ADR

When a non-trivial architectural decision is made:

1. Number the new ADR sequentially (e.g., `0005-…`).
2. Use the same heading structure as the existing ones.
3. Set status to `Accepted` once the decision is final; don't ship `Proposed` ADRs in production branches.
4. Link to the ADR from any code or doc that implements it — bidirectional references catch drift.
5. Add an entry to this index.
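
As a rough illustration of step 1, the next sequential number can be derived from the file names already in the ADR directory. This is a Python sketch, not project tooling: the helper name `next_adr_number` and the file-name pattern are assumptions based on the four ADR files listed above.

```python
import re

# Assumed naming pattern: four digits, a hyphen, a lowercase slug, ".md".
ADR_NAME = re.compile(r"^(\d{4})-[a-z0-9.\-]+\.md$")

def next_adr_number(filenames):
    """Return the next zero-padded ADR number given the existing file names."""
    numbers = [int(m.group(1)) for name in filenames if (m := ADR_NAME.match(name))]
    return f"{max(numbers, default=0) + 1:04d}"

existing = [
    "index.md",
    "0001-deviations-from-seed-docs.md",
    "0002-declarative-rule-schema.md",
    "0003-pack-signing-format.md",
    "0004-schema-v1.1-rolling-windows.md",
]
print(next_adr_number(existing))  # 0005
```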

If an ADR supersedes an earlier one, update the earlier ADR's status to `Superseded by ADR-####` and link forward.

ADRs are meant to be written when a decision is made, not reconstructed later. If a decision was made informally and you're documenting it after the fact, that's fine, but say so in the Context section. Date the ADR with when it was written; date the decision (in Context) with when it was made.

## Where ADRs are NOT the answer

ADRs are heavyweight. Don't write one for:

- **Bug fixes** — those live in commit messages and PR descriptions.
- **Refactors that preserve external behaviour** — same.
- **Tactical implementation choices** — e.g., "use a `HashSet` here" doesn't need an ADR.
- **Configuration defaults** — those belong in `appsettings.json` and **[Reference — Configuration](../../reference/configuration.md)**.

ADRs are for **decisions that constrain future work** — choices a future contributor needs to know about to avoid relitigating, breaking, or reversing without strong cause.

## Related

- **[Architecture index](../index.md)** — the broader architecture context.
- **[Data flow](../data-flow.md)** / **[Persistence](../persistence.md)** / **[Schema evolution](../schema-evolution.md)** — implementations of these decisions.
10 changes: 10 additions & 0 deletions docs/architecture/adr/toc.yml
@@ -0,0 +1,10 @@
- name: Index
  href: index.md
- name: "0001 — Deviations from seed docs"
  href: 0001-deviations-from-seed-docs.md
- name: "0002 — Declarative rule schema"
  href: 0002-declarative-rule-schema.md
- name: "0003 — Pack signing format"
  href: 0003-pack-signing-format.md
- name: "0004 — Schema v1.1 rolling windows"
  href: 0004-schema-v1.1-rolling-windows.md
215 changes: 215 additions & 0 deletions docs/architecture/data-flow.md
@@ -0,0 +1,215 @@
---
title: Data flow
description: End-to-end — from a counter file on disk to a finding in a report — with the types and components at each hop.
---

# Data flow

This is the runtime story: how a `.csv` or `.blg` becomes a finding in a report. Six hops, two modes (CLI synchronous vs API asynchronous), one engine.

For the per-component reference, see the **[Project map](index.md#project-map)** on the architecture index.

## The engine pipeline (same in all modes)

```text
                ┌─────────────────────────────────────┐
                │              RAW INPUT              │
                │     capture.csv or capture.blg      │
                └──────────────────┬──────────────────┘
               (1) Collector dispatch by file extension
            ┌──────────────────────┴──────────────────────┐
            ▼                                             ▼
  ┌──────────────────┐                         ┌─────────────────────┐
  │   CsvCollector   │                         │    BlgCollector     │
  │  (any platform)  │                         │   (Windows / PDH)   │
  └─────────┬────────┘                         └──────────┬──────────┘
            │                                             │
            └──────────────────────┬──────────────────────┘
                           raw counter paths
       (2) MetricAliasRegistry normalises paths to canonical IDs
                    ┌─────────────────────────────┐
                    │           Dataset           │
                    │  series[], samples, gaps,   │
                    │        host_context         │
                    └──────────────┬──────────────┘
    (3) PackLoader reads YAML; PackValidator gates malformed packs
                    ┌─────────────────────────────┐
                    │      Pack[] in memory       │
                    │    applicability filter     │
                    └──────────────┬──────────────┘
          (4) RuleEngine evaluates conditions against series
                    ┌─────────────────────────────┐
                    │          Finding[]          │
                    │    evidence + statistics    │
                    │   sorted: sev/cat/rule/id   │
                    └──────────────┬──────────────┘
                     (5) Report writers serialise
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
       ┌──────────────┐     ┌──────────────┐   ┌──────────────────┐
       │ JSON report  │     │ HTML report  │   │ Markdown report  │
       │ (canonical)  │     │ (browser UX) │   │    (optional)    │
       └──────────────┘     └──────────────┘   └──────────────────┘
              (6) ScottPlot writes SVG charts (optional)
                    ┌─────────────────────────────┐
                    │        charts/*.svg         │
                    └─────────────────────────────┘
```

## Hop 1 — Collector dispatch

`CollectorFactory.For(path)` looks at the file extension:

- `.csv` → `CsvCollector` (any platform).
- `.blg` → `BlgCollector` (Windows-only, throws `PlatformNotSupportedException` elsewhere with a `relog -f CSV` fallback message).

Both collectors emit the same `Dataset` shape — downstream code can't tell them apart.

The CSV path is text — read line by line, parse perfmon's CSV header for counter paths, parse samples by column. The BLG path is binary — open via PDH (`Pdh.dll`), enumerate counters, fetch samples through `PdhCollectQueryData`.
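
The extension-based dispatch can be sketched in a few lines. This is a Python illustration rather than the project's C# code: the `collector_for` function and the two stand-in classes are hypothetical, and case-insensitive extension matching is an assumption.

```python
import sys
from pathlib import Path

class CsvCollector:
    """Stand-in for the cross-platform CSV collector."""

class BlgCollector:
    """Stand-in for the Windows-only BLG collector."""

def collector_for(path, platform=sys.platform):
    """Pick a collector class from the capture file's extension."""
    ext = Path(path).suffix.lower()  # case-insensitive matching is an assumption
    if ext == ".csv":
        return CsvCollector
    if ext == ".blg":
        if not platform.startswith("win"):
            # Mirrors the documented behaviour: fail with a relog hint.
            raise OSError(
                "BLG requires Windows/PDH; convert first with: "
                "relog capture.blg -f CSV -o capture.csv"
            )
        return BlgCollector
    raise ValueError(f"unsupported capture extension: {ext!r}")

print(collector_for("capture.csv", platform="linux").__name__)  # CsvCollector
```

Passing `platform` explicitly keeps the sketch testable on any OS; the real factory presumably checks the runtime platform itself.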

## Hop 2 — Canonical metric IDs

Raw counter paths look like `\\WEB-01\Processor(_Total)\% Processor Time`. Rules don't reference paths — they reference canonical IDs like `processor.percent_processor_time`. `MetricAliasRegistry.Resolve(path)` runs the path against compiled regex patterns and returns the canonical ID, or `null` if nothing matches (which becomes `unknown.<sanitised>`).

The registry's default entries are built into `Pal.Engine.Normalization.MetricAliasRegistry.BuildDefault()` — see **[Reference — Canonical metric IDs](../reference/metric-ids.md)** for the table. Pack-level `metric_aliases:` extends this registry per analysis.
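
A minimal Python sketch of the resolve-or-fallback behaviour follows. The single alias pattern and the sanitisation rule (lowercase, non-alphanumeric runs collapsed to `_`) are assumptions for illustration; the real patterns live in `BuildDefault()` and may differ.

```python
import re

# Hypothetical alias table: (compiled pattern, canonical ID).
ALIASES = [
    (
        re.compile(r"^\\\\[^\\]+\\Processor\(_Total\)\\% Processor Time$", re.IGNORECASE),
        "processor.percent_processor_time",
    ),
]

def resolve(path):
    """Return the canonical ID for a raw counter path, or None if no alias matches."""
    for pattern, canonical_id in ALIASES:
        if pattern.match(path):
            return canonical_id
    return None

def canonical_or_unknown(path):
    """Apply the unknown.<sanitised> fallback when resolve() finds nothing."""
    resolved = resolve(path)
    if resolved is not None:
        return resolved
    # Assumed sanitisation: lowercase, collapse non-alphanumeric runs to "_".
    sanitised = re.sub(r"[^a-z0-9]+", "_", path.lower()).strip("_")
    return f"unknown.{sanitised}"

print(canonical_or_unknown(r"\\WEB-01\Processor(_Total)\% Processor Time"))
print(canonical_or_unknown(r"\\WEB-01\Foo\Bar Baz"))
```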

## Hop 3 — Pack loading

`PackLoader.Load(yamlPath, signatureRequirement, trustedKeys)`:

1. Reads the YAML file.
2. Parses into the `Pack` model (DTOs in `Pal.Engine.Model`).
3. Optionally verifies the `pack.yaml.sig` sidecar.
4. Hands the parsed pack to `PackValidator.Validate(pack)`.

`PackValidator` is the source of truth for what constitutes a valid pack — every schema constraint (severity enum, aggregation enum, operator enum, window invariants) is enforced here, not at YAML parse time. Validation errors and warnings are returned to the caller; failures surface as exit code `4` from the CLI or `400/422` from the API.
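
To make the "validate after parsing, return errors instead of throwing" idea concrete, here is a small Python sketch of one condition check. The operator names are hypothetical, the aggregation set holds only the values this document mentions, and the version strings other than `pal.pack/v1.1` are assumptions.

```python
# Only the aggregations named in this document; the real enum is presumably larger.
AGGREGATIONS = {"avg", "p95", "trend"}
# Hypothetical operator names; the document does not list the real enum values.
OPERATORS = {"gt", "gte", "lt", "lte"}

def validate_condition(condition, schema_version):
    """Return a list of error strings for one parsed condition (empty means valid)."""
    errors = []
    if condition.get("aggregation") not in AGGREGATIONS:
        errors.append(f"unknown aggregation: {condition.get('aggregation')!r}")
    if condition.get("operator") not in OPERATORS:
        errors.append(f"unknown operator: {condition.get('operator')!r}")
    # The window field is only accepted when the pack declares schema v1.1.
    if "window" in condition and schema_version != "pal.pack/v1.1":
        errors.append("window requires schema_version pal.pack/v1.1")
    return errors

condition = {"aggregation": "avg", "operator": "gt", "threshold": 80, "window": "5m"}
print(validate_condition(condition, "pal.pack/v1"))    # one error about window
print(validate_condition(condition, "pal.pack/v1.1"))  # []
```

Collecting every error in one pass lets the caller report all problems in a pack at once rather than stopping at the first.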

`PackRegistrySyncService` (API only) drives the loader at startup: it walks `Packs:Directory`, loads each `pack.yaml`, and persists the result into Postgres so the API has a database-backed pack registry alongside the disk source.

## Hop 4 — Rule evaluation

The heart of the engine. `RuleEngine.Evaluate(dataset, packs)`:

```text
for each pack:
  if pack.applicability matches dataset:
    for each rule:
      if rule.applies_when matches:
        for each condition:
          select series (canonical_metric + optional instance filter)
          compute aggregation (avg, p95, ..., trend, or window-bounded)
          compare to threshold (number or host_context-resolved)
          check duration_percent
        if all conditions satisfied:
          emit Finding with evidence
sort findings: severity desc, category asc, rule_id asc, finding_id asc
```
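
The innermost step of the loop above (aggregate, compare, check `duration_percent`) might look like the following Python sketch. It handles only the `avg` aggregation, uses hypothetical operator names, and assumes `duration_percent` means "at least this percentage of samples must individually breach the threshold"; the engine's actual definition may differ.

```python
import operator
from statistics import mean

# Hypothetical operator names; the document does not list the real enum values.
COMPARATORS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt, "lte": operator.le}

def condition_met(samples, op, threshold, duration_percent):
    """Evaluate one 'avg' condition against a series of numeric samples."""
    if not samples:
        return False  # assumed behaviour for an empty series
    compare = COMPARATORS[op]
    aggregate = mean(samples)
    # Share of samples that individually breach the threshold (assumed semantics).
    breaching = sum(1 for value in samples if compare(value, threshold))
    breach_percent = 100.0 * breaching / len(samples)
    return compare(aggregate, threshold) and breach_percent >= duration_percent

cpu = [90.0, 95.0, 50.0, 92.0]  # mean 81.75; three of four samples exceed 80
print(condition_met(cpu, "gt", 80.0, duration_percent=50))  # True
print(condition_met(cpu, "gt", 80.0, duration_percent=80))  # False
```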

A few important properties:

- **Determinism.** Two runs against the same dataset with the same packs produce identical findings (modulo `generated_at_utc`, overridable with `--now`). The sort order is total, with `finding_id` (a content hash) as the final tiebreaker.
- **`host_context` is informational-fallback.** If a rule references `host_context.total_physical_memory_mb` and the value is unknown, the rule is skipped and an informational warning is emitted. The run still succeeds.
- **Pack-level `applicability` is a fast skip.** If `requires_any` doesn't match the dataset's metric set, the pack's rules are never evaluated. Rule-level `applies_when` is a per-rule equivalent.
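
The determinism property rests on a total sort order with a content hash as the last key. The Python sketch below illustrates the idea only: the severity names, the fields fed into the hash, and the 16-character truncation are all assumptions, not the engine's actual `finding_id` scheme.

```python
import hashlib
import json

# Hypothetical severity names and ranking.
SEVERITY_RANK = {"critical": 3, "warning": 2, "info": 1}

def finding_id(finding):
    """Derive a stable ID from the finding's content (assumed field set)."""
    payload = json.dumps(
        {key: finding[key] for key in ("rule_id", "category", "severity", "evidence")},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def sort_findings(findings):
    """Severity desc, then category, rule_id and finding_id ascending."""
    with_ids = [{**f, "finding_id": finding_id(f)} for f in findings]
    return sorted(
        with_ids,
        key=lambda f: (-SEVERITY_RANK[f["severity"]], f["category"], f["rule_id"], f["finding_id"]),
    )

a = {"rule_id": "cpu.high", "category": "cpu", "severity": "warning", "evidence": {"avg": 81.75}}
b = {"rule_id": "mem.low", "category": "memory", "severity": "critical", "evidence": {"avg": 120.0}}
c = {"rule_id": "cpu.high", "category": "cpu", "severity": "warning", "evidence": {"avg": 85.0}}

# Input order does not affect the result because the sort key is total.
print(sort_findings([a, b, c]) == sort_findings([c, b, a]))  # True
```

Findings `a` and `c` tie on severity, category, and rule ID, so only the content-derived `finding_id` decides their relative order.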

`Finding` carries everything needed to render the result: rule metadata, category, severity, the resolved evidence (series + statistics + trigger expression), and inlined recommendations from the pack's `recommendations:` map.

## Hop 5 — Report writing

Three writers, one shared shape:

- `JsonReportWriter` — emits `pal.report/v1` JSON. Canonical; downstream consumers read this.
- `HtmlReportWriter` — emits a self-contained HTML page. Derived view; renders the same data with a human-friendly layout.
- `MarkdownReportWriter` — emits GFM tables. Derived view; only invoked when explicitly requested.

All three call `JsonReportWriter.WriteInput(...)` internally to compose the report model, then serialise to their target format. This is why golden-fixture tests work — the writers are deterministic transforms of a fixed-input model.

UTF-8 without BOM is enforced via `new UTF8Encoding(false)` on every write. This is non-negotiable: golden tests are byte-comparison, and a BOM would break them.
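
The effect of a BOM on byte-for-byte comparison is easy to demonstrate. The Python snippet below is only an analogy for the C# `new UTF8Encoding(false)` setting: Python's `utf-8` codec writes no BOM, while `utf-8-sig` prepends one. The sample text is arbitrary.

```python
import codecs

text = '{"status": "ok"}\n'  # arbitrary sample content

without_bom = text.encode("utf-8")      # no BOM, like UTF8Encoding(false)
with_bom = text.encode("utf-8-sig")     # prepends the 3-byte BOM EF BB BF

print(with_bom[:3] == codecs.BOM_UTF8)    # True
print(len(with_bom) - len(without_bom))   # 3
print(without_bom == with_bom)            # False: a byte comparison would fail
```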

## Hop 6 — Chart SVGs (optional)

If `--include-charts` is set (CLI) or charts are otherwise requested, the engine attaches `ChartRef` entries to findings and writes SVGs via `ScottPlot.Plot.Save`. One SVG per (finding × metric) pair, capped by `--chart-limit` (default 20).

Charts are written to `out/charts/<report-name>-<chart-id>.svg`. The HTML report embeds them inline. The JSON report references them by relative path in each finding's `evidence.charts[]`.
Comment on lines +147 to +149

P2: Remove unsupported chart artifact flow claims

This section documents a chart pipeline (`--include-charts`, `evidence.charts[]`, `out/charts/...`) that the current code path does not implement: findings do not carry chart references, and neither the CLI nor the API analysis flow invokes chart rendering or writes chart files. Users and automation following this architecture contract will wait for artifacts that are never produced.


ScottPlot's SVG output is canonicalised by `SvgCanonicalizer` before write — IDs are normalised so two runs produce byte-identical SVGs. Without this step, ScottPlot's gradient IDs include process-local counters that would defeat determinism.

## Two runtime modes share the pipeline

### CLI — synchronous

```text
                    ┌───────────────┐
user typed args ───►│    pal CLI    │
                    │ (synchronous) │
                    └───────┬───────┘
              the 6 hops above, in process
                    writes to ./out/
                 exits with status code
```

`AnalyzeCommand.ExecuteAsync` orchestrates collectors, the engine, the writers. Failures map to `ExitCodes.*` per **[Reference — Exit codes](../reference/exit-codes.md)**.

### API — asynchronous

```text
                    ┌──────────────┐                          ┌────────────────┐
POST /analysis ────►│     HTTP     │──► writes job row ──────►│    Postgres    │
                    │   handler    │                          └────────────────┘
                    │ enqueues Guid│
                    └──────┬───────┘
       Channel<Guid> (in-process, single-reader)
                    ┌──────────────┐
                    │AnalysisWorker│ (BackgroundService)
                    └──────┬───────┘
              the 6 hops above, same code
   writes JSON/HTML to disk + result row to Postgres
       (auto-compare if selectedBaselineId set)
    (policy evaluation → alerts → webhook delivery)
```

The engine pipeline is identical. What's different is the orchestration: HTTP enqueues, the worker dequeues, repositories persist, and additional services (`PolicyEvaluator`, `IAutoCompareService`, `NotificationService`) extend the post-analysis flow with alerting and comparisons.

The in-process `Channel<Guid>` keeps the API simple — no external message broker, no Postgres `LISTEN/NOTIFY`. The trade-off: if the API process crashes, queued-but-not-started jobs are lost (the worker channel is in-memory). Jobs that have started but not finished are detected on restart and marked `failed`. This is documented as a Phase 5 improvement candidate.

P2: Correct queue crash-recovery semantics

This sentence describes restart behavior opposite to the implementation: jobs are persisted in `analysis_jobs` before channel enqueue, `AnalysisWorker.StartAsync` re-enqueues queued jobs from the DB, and `ResetOrphanedJobsAsync` moves running jobs back to `queued` (not `failed`). During an operational incident, this doc would lead readers to treat a restart as lossy or as a final failure and to perform unnecessary manual recovery.
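
The restart behaviour this comment describes can be modelled in a few lines. The Python sketch below is based only on the comment's wording, not on the actual worker code: a dict stands in for the `analysis_jobs` table, a deque stands in for the in-memory channel, the `completed` status name is hypothetical, and re-enqueueing in table order is an assumption.

```python
from collections import deque

def recover_on_startup(jobs):
    """Rebuild the in-memory queue from persisted job statuses.

    jobs maps job ID to status and stands in for the analysis_jobs table.
    """
    # Step 1: jobs left 'running' by a crashed process go back to 'queued'.
    for job_id, status in jobs.items():
        if status == "running":
            jobs[job_id] = "queued"
    # Step 2: every queued job is re-enqueued, so nothing persisted is lost.
    return deque(job_id for job_id, status in jobs.items() if status == "queued")

table = {"job-1": "completed", "job-2": "running", "job-3": "queued"}
channel = recover_on_startup(table)
print(list(channel))    # ['job-2', 'job-3']
print(table["job-2"])   # queued
```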


## Related

- **[Persistence](persistence.md)** — what gets stored after the pipeline completes.
- **[Schema evolution](schema-evolution.md)** — how the input contract evolves.
- **[Reference — Report schema](../reference/report-schema.md)** — output shape.
- **[Reference — Canonical metric IDs](../reference/metric-ids.md)** — the rewrite table for Hop 2.
- **[ADR 0002 — Declarative Rule Schema](adr/0002-declarative-rule-schema.md)** — why Hop 4 doesn't have an expression parser.