Add State Persistence for Crash Recovery (#321)
Conversation
Signed-off-by: bupd <bupdprasanth@gmail.com>
📝 Walkthrough

This change implements state persistence for satellite crash recovery. It adds a persistence layer to save and load satellite state to/from disk, updates satellite initialization to load persisted state on startup, modifies replication to stream descriptors, and introduces E2E crash-recovery tasks and tests to validate restart behavior without full re-replication.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as Satellite Main
    participant Process as FetchAndReplicateStateProcess
    participant Disk as Disk (state.json)
    participant GC as Ground Control

    rect rgba(100, 150, 200, 0.5)
        Note over Main,Disk: Startup - Load Persisted State
        Main->>Process: NewFetchAndReplicateStateProcess(cm, stateFilePath, log)
        Process->>Disk: LoadState(stateFilePath)
        alt State file exists
            Disk-->>Process: PersistedState {ConfigDigest, Groups}
            Process->>Process: Initialize with persisted state
        else State file missing
            Disk-->>Process: nil, nil
            Process->>Process: Start with empty state
        end
    end

    rect rgba(100, 200, 100, 0.5)
        Note over Process,Disk: Runtime - Sync and Persist
        Process->>GC: FetchRemoteConfig()
        GC-->>Process: New configuration
        Process->>Process: Reconcile config changes
        Process->>Disk: SaveState(stateFilePath, stateMap, configDigest)
        Disk-->>Process: ✓ State persisted
    end

    rect rgba(200, 150, 100, 0.5)
        Note over Main,Disk: Crash & Restart - Resume from Checkpoint
        Main-xMain: Crash (SIGKILL)
        Main->>Process: NewFetchAndReplicateStateProcess(cm, stateFilePath, log)
        Process->>Disk: LoadState(stateFilePath)
        Disk-->>Process: PersistedState (from last sync)
        Process->>Process: Resume with checkpoint state
        Note over Process: No full re-replication needed
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: 4 passed, 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Note: Unit test generation is a public access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes.
Codacy's Analysis Summary: 1 new issue (threshold: ≤ 0 issues)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@internal/state/state_persistence.go`:
- Around line 67-79: The unmarshalling error from LoadState is being discarded
by the caller so corrupted state files aren’t logged; update behavior so the
operator sees a warning: either (A) in LoadState (function LoadState) detect
json.Unmarshal errors, call the package logger to emit a warning about a
corrupted state file and then return nil, nil so callers start fresh, or (B)
change the caller NewFetchAndReplicateStateProcess to check for a non-nil error
return from LoadState and log a warning before continuing; pick one approach and
implement it consistently (refer to LoadState and
NewFetchAndReplicateStateProcess in state_persistence.go / state_process.go).
- Around line 48-60: The temp-file write/rename sequence (tmp.Write, tmp.Close,
os.Rename using tmpName → path) lacks an explicit fsync, so data may remain in
page cache and be lost on crash; modify the sequence to call tmp.Sync() after
writing (and before tmp.Close()) and handle any Sync error similar to
Write/Close errors (cleanup tmpName and return a wrapped error) so data is
flushed to stable storage prior to os.Rename.
In `@internal/state/state_process.go`:
- Around line 39-60: The constructor NewFetchAndReplicateStateProcess silently
ignores errors from LoadState; modify it so that when stateFilePath != "" and
LoadState returns a non-nil error you log a warning including the file path and
the error before continuing with an empty state. Implement this by either adding
a logger parameter (e.g., zerolog.Logger) to NewFetchAndReplicateStateProcess
and calling logger.Warn().Err(err).Msgf(...) or (if you cannot change the
signature) using a package-level logger (github.com/rs/zerolog/log) or
fmt.Fprintf(os.Stderr, ...) to emit a clear warning that the persisted state at
stateFilePath was invalid/corrupted and is being ignored; keep the existing
behavior of continuing after logging. Ensure you import the chosen logging
package and include the unique symbols LoadState,
NewFetchAndReplicateStateProcess, stateFilePath and persisted in your change.
In `@taskfiles/e2e.yml`:
- Around line 817-850: Fix the brittle log-string assertions and the typo:
correct "entites" -> "entities" both in the test (verify-crash-recovery) and the
source at state_process.go:167, add a short comment in the test noting it relies
on that source log string, and make the greps less fragile by matching a stable
substring or regex (e.g., match "Old state has zero entit(ies)" and "Total
artifacts to replicate:\\s*0" or otherwise match the prefix "Total artifacts to
replicate" and assert the numeric value) so small rewordings won't break the
check.
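A sketch of what the less fragile assertions might look like. The log lines shown here are assumptions about the satellite's output, and the regexes deliberately tolerate both the current "entites" typo and the corrected spelling until state_process.go:167 is fixed:

```shell
# Hypothetical log lines captured from the satellite container.
logs='time=... msg="Old state has zero entities"
time=... msg="Total artifacts to replicate: 0"'

# Match a stable substring plus the numeric value instead of the exact
# sentence, so small rewordings do not silently bypass the assertion.
echo "$logs" | grep -Eq 'Old state has zero entit(e|ie)s' || { echo "missing zero-entities line"; exit 1; }
echo "$logs" | grep -Eq 'Total artifacts to replicate:[[:space:]]*0' || { echo "expected 0 artifacts"; exit 1; }
echo "crash-recovery assertions passed"
```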
🧹 Nitpick comments (5)
internal/state/state_persistence_test.go (1)
66-76: Consider adding a test for corrupted/malformed state files.

The PR objective mentions handling corrupted state files gracefully. A test that writes invalid JSON to the state file path and verifies `LoadState` returns an appropriate error would strengthen coverage of that requirement.

💡 Suggested test

```go
func TestLoadCorruptedFile(t *testing.T) {
	dir := t.TempDir()
	path := filepath.Join(dir, "state.json")
	if err := os.WriteFile(path, []byte("{invalid json"), 0600); err != nil {
		t.Fatalf("failed to write corrupted file: %v", err)
	}
	loaded, err := LoadState(path)
	if err == nil {
		t.Fatal("LoadState should return error for corrupted file")
	}
	if loaded != nil {
		t.Fatalf("LoadState should return nil for corrupted file, got: %+v", loaded)
	}
}
```

internal/state/state_process.go (3)
372-379: Mutex held during disk I/O in `SaveState`.

The mutex is held while `SaveState` performs file creation, write, and rename (lines 372–379 here and 438–446 in `processGroupState`). Every concurrent `processGroupState` goroutine will block on this mutex while disk I/O completes, serializing what was meant to be parallel group processing.

This is correct for safety and acceptable for the current scale, but worth noting: if the number of groups grows or the filesystem is slow (e.g., network-attached storage), this will become a bottleneck. A possible future optimization is to snapshot the data under the lock and perform the write outside it.
438-446: State is saved on every individual group completion — consider batching.

Each `processGroupState` goroutine calls `SaveState` independently after its own group finishes. With N groups, this means N full-state serializations and atomic file writes in a single sync cycle (plus potentially one more from `reconcileRemoteConfig`). Since `collectResults` already waits for all goroutines to finish, a single `SaveState` call after `collectResults` would be both simpler and more efficient.

That said, the current approach gives stronger durability (each completed group is immediately persisted), so this is a trade-off. If the stronger guarantee is intentional, a brief comment documenting the choice would help future readers.
15-23: Two distinct mutexes protect different things — add clarifying comments.

`f.mu` (line 21) guards `isRunning`/`name`, while the local `mutex` created in `Execute` (line 111) guards `stateMap`/`currentConfigDigest` writes and persistence. The separation is fine, but without any documentation it's easy for a future contributor to conflate them or accidentally use the wrong one. A brief comment on each noting what it protects would improve maintainability.

taskfiles/e2e.yml (1)
814-815: Hardcoded `sleep 15` for satellite startup is fragile.

If the satellite takes longer than 15 seconds (e.g., slow CI, image pulls), the subsequent verification could see incomplete state. A polling loop (similar to the `wait-harbor` and `wait-gc` tasks) that checks the satellite's health endpoint would be more resilient.
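A generic polling helper in the spirit of the existing `wait-harbor`/`wait-gc` tasks might look like this; the health URL and timeout shown in the trailing comment are assumptions, not values from the repository:

```shell
# Retries a check command until it succeeds or the timeout elapses.
wait_for() {
  check="$1"; timeout="${2:-60}"; interval="${3:-2}"
  elapsed=0
  until eval "$check" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $check" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# In the task, instead of `sleep 15` (endpoint is an assumption):
# wait_for "curl -fsS http://localhost:9090/health" 60
```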
4 issues found across 9 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="internal/state/state_persistence.go">
<violation number="1" location="internal/state/state_persistence.go:48">
P1: `SaveState` never fsyncs the temp file before renaming it over the state file. If the process or machine crashes after `SaveState` returns, the kernel may drop the unwritten buffers and leave `state.json` empty or corrupt despite the rename. Add a `tmp.Sync()` (and handle its error) after writing the data so the next rename actually durably persists the state.</violation>
</file>
<file name="internal/state/state_process.go">
<violation number="1" location="internal/state/state_process.go:48">
P2: When `LoadState` returns an error (e.g., corrupted JSON), the error is silently discarded and the satellite starts with empty state. For crash recovery, operators should have visibility into state file corruption. Consider logging a warning when the persisted state cannot be loaded.</violation>
<violation number="2" location="internal/state/state_process.go:374">
P2: Persisted state is written only when a group fetch completes, so when `updateStateMap` removes a group there is no subsequent `SaveState` call and the deletion is never flushed to disk. This leaves stale groups/entities in `state.json`, and after a crash/restart the satellite will try to replicate artifacts that were already removed upstream. Please persist immediately after stateMap shrinks (e.g., right after `updateStateMap` prunes entries or whenever the set of groups changes) so removals are reflected on disk.</violation>
</file>
<file name="taskfiles/e2e.yml">
<violation number="1" location="taskfiles/e2e.yml:823">
P3: Typo in log message assertion: "entites" should be "entities". This typo currently matches the source code, but creates a fragile coupling. If the source typo is fixed, this grep will never match and the test assertion will be silently bypassed.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
✅ Created PR with unit tests: #322
Signed-off-by: bupd <bupdprasanth@gmail.com>
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@internal/state/replicator.go`:
- Around line 90-104: The custom TLS transport returned by buildTLSTransport is
only being applied to pullOpts and not to pushOpts, so pushes use the default
transport; modify the block where transport is appended (inside the else of if
r.useUnsecure) to also append remote.WithTransport(transport) to pushOpts
alongside pullOpts, ensuring that when transport != nil both pullOpts and
pushOpts receive the custom transport so pushes and pulls use the same TLS
config (referencing nameOpts, pullOpts, pushOpts, buildTLSTransport,
remote.WithTransport, and r.useUnsecure).
- Around line 135-136: The current conversion only calls mutate.MediaType(img,
types.OCIManifestSchema1) which leaves the config and layer descriptors using
Docker V2S2 media types; update the code after creating ociImage to also call
mutate.ConfigMediaType(ociImage, types.OCIConfigJSON) and ensure every layer
descriptor's MediaType is set to an OCI type (e.g., types.OCILayer) before
finalizing the image—iterate the manifest's layers (from img/ociImage
descriptors), replace their MediaType fields with types.OCILayer (or appropriate
OCI compressed/uncompressed layer types), and return the fully converted image
so the manifest, config, and layers are consistently OCI media types.
In `@internal/state/state_process.go`:
- Around line 106-114: The code only compares oldLen := len(f.stateMap) to
detect changes after calling f.updateStateMap(satelliteState.States), which
misses swaps where the number of groups stays the same but URLs changed; change
the logic to detect real content changes by comparing the old and new URL sets
(or the full map contents) before calling SaveState: capture a copy of
f.stateMap (or compute a set-hash/serialized representation) before calling
f.updateStateMap, then after the update compare that snapshot to the current
f.stateMap and call SaveState(f.stateFilePath, f.stateMap,
f.currentConfigDigest) whenever they differ (still guarding by f.stateFilePath
!= ""), ensuring the new comparison is used instead of the length-only check so
swaps are persisted immediately and not left to processGroupState.
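The content-based comparison described above could be sketched as below; the `map[string][]string` shape is an illustrative stand-in for the real `stateMap` type, and sorting makes the check insensitive to URL ordering:

```go
package main

import (
	"reflect"
	"sort"
)

// normalized deep-copies the map with each URL slice sorted, so two maps
// holding the same URL sets in different orders compare equal.
func normalized(m map[string][]string) map[string][]string {
	out := make(map[string][]string, len(m))
	for k, v := range m {
		cp := append([]string(nil), v...)
		sort.Strings(cp)
		out[k] = cp
	}
	return out
}

// stateChanged reports whether the group-to-URLs mapping actually differs,
// catching same-size swaps that a length-only check misses.
func stateChanged(oldM, newM map[string][]string) bool {
	return !reflect.DeepEqual(normalized(oldM), normalized(newM))
}
```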
🧹 Nitpick comments (1)
internal/state/replicator.go (1)
122-144: Inconsistent error handling: some errors wrapped, others returned raw.

Reference parsing errors (lines 114, 119) are wrapped with `fmt.Errorf` for context, but descriptor fetch (line 126), image resolve (line 132), and write (line 142) return bare errors after logging. Wrapping all errors consistently makes debugging from call sites easier.

Proposed fix

```diff
 desc, err := remote.Get(src, pullOpts...)
 if err != nil {
-	log.Error().Msgf("Failed to fetch image descriptor: %v", err)
-	return err
+	return fmt.Errorf("fetch descriptor for %s: %w", srcRef, err)
 }

 img, err := desc.Image()
 if err != nil {
-	log.Error().Msgf("Failed to resolve image: %v", err)
-	return err
+	return fmt.Errorf("resolve image for %s: %w", srcRef, err)
 }

 // Lazy OCI conversion - no data materialized
 ociImage := mutate.MediaType(img, types.OCIManifestSchema1)

 if err := remote.Write(dst, ociImage, pushOpts...); err != nil {
-	log.Error().Msgf("Failed to replicate image: %v", err)
-	return err
+	return fmt.Errorf("write image to %s: %w", dstRef, err)
 }
```
Signed-off-by: bupd <bupdprasanth@gmail.com>
Summary
Summary by cubic
Persist satellite state to disk and reload it on startup to recover from crashes without re-replicating. Replication streams layers and skips existing blobs for fast resume; adds tests for layer-level resume, handles group swaps, and fixes TLS on push.
New Features
Bug Fixes
Written for commit 07a0d7c. Summary will update on new commits.
Summary by CodeRabbit
New Features
New Features (CLI/Tasks)
Tests