Commit 12e2754

teknium1 authored and Ryan committed
feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (NousResearch#20709)
Replaces the per-directory shadow-repo design with a single shared shadow git store at ~/.hermes/checkpoints/store/. Object DB is now deduplicated across every working directory the agent has ever touched; a dozen worktrees of the same project cost near-zero in additional disk.

Why
---
Pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/ grow to multi-GB on active machines:

1. Each working directory got its own full shadow git repo, with no object dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the /rollback listing. Loose objects accumulated forever.
3. Defaults: enabled=True, auto_prune=False, so users paid the disk cost without ever asking for /rollback.

Field report on a single workstation: 847 MB across 47 shadow repos, mostly redundant clones of the hermes-agent source tree.

Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>), per-project metadata (store/projects/<hash>.json with workdir + created_at + last_touch). On first v2 init, any pre-v2 per-directory shadow repos are auto-migrated into legacy-<timestamp>/ so the new store starts clean. _prune() now actually rewrites the per-project ref to the last max_snapshots commits and runs git gc --prune=now. New _enforce_size_cap() drops oldest commits round-robin across projects when the store exceeds max_total_size_mb. _drop_oversize_from_index() filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI (status / list / prune / clear / clear-legacy) for managing the store outside a session.
- hermes_cli/config.py: flipped defaults: enabled=False, max_snapshots=20, auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10. Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll, *.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat coverage for the pre-v2 prune path (per-directory shadow repos under base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the shared store, new defaults, migration, and the new CLI; reference/cli-commands.md documents 'hermes checkpoints'.

E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to 7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune --retention-days 0' removes its ref, index, and metadata; GC reclaims the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the CLI without an agent running.

Breaking / migration
--------------------
No in-place data migration: legacy per-directory shadow repos are moved into legacy-<timestamp>/ on first run. Old /rollback history is still accessible by inspecting the archive with git; run 'hermes checkpoints clear-legacy' to reclaim the space when ready. Users relying on /rollback must now set checkpoints.enabled=true (or pass --checkpoints) explicitly.
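The round-robin size cap mentioned under Changes can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `enforce_size_cap`, the `projects` mapping, and the per-commit byte sizes are hypothetical stand-ins, and keeping at least one commit per project is an assumption rather than documented behaviour. The real `_enforce_size_cap()` works on git refs and measures the store on disk, where deduplicated objects mean per-commit sizes do not simply add up.

```python
from typing import Dict, List


def enforce_size_cap(projects: Dict[str, List[int]], max_total_bytes: int) -> int:
    """Drop the oldest commit of each project in turn until under the cap.

    `projects` maps a project id to its commit sizes in bytes, oldest first
    (hypothetical model, not the real store layout). Returns the number of
    commits dropped. A cap of 0 disables the check.
    """
    if max_total_bytes <= 0:
        return 0

    def total() -> int:
        return sum(sum(commits) for commits in projects.values())

    dropped = 0
    while total() > max_total_bytes:
        progressed = False
        # One round-robin pass: at most one commit removed per project.
        for commits in projects.values():
            if total() <= max_total_bytes:
                break
            if len(commits) > 1:  # assumption: always keep the newest commit
                commits.pop(0)
                dropped += 1
                progressed = True
        if not progressed:
            break  # nothing left to drop; avoid looping forever
    return dropped
```

Spreading the drops across projects means a single busy project cannot push every other project's history out of the store in one pass.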
1 parent 5046314 commit 12e2754

10 files changed

Lines changed: 1967 additions & 717 deletions

cli.py

Lines changed: 6 additions & 1 deletion
@@ -987,6 +987,7 @@ def _run_checkpoint_auto_maintenance() -> None:
             retention_days=int(cfg.get("retention_days", 7)),
             min_interval_hours=int(cfg.get("min_interval_hours", 24)),
             delete_orphans=bool(cfg.get("delete_orphans", True)),
+            max_total_size_mb=int(cfg.get("max_total_size_mb", 500)),
         )
     except Exception as exc:
         logger.debug("checkpoint auto-maintenance skipped: %s", exc)
@@ -2273,7 +2274,9 @@ def __init__(
         if isinstance(cp_cfg, bool):
             cp_cfg = {"enabled": cp_cfg}
         self.checkpoints_enabled = checkpoints or cp_cfg.get("enabled", False)
-        self.checkpoint_max_snapshots = cp_cfg.get("max_snapshots", 50)
+        self.checkpoint_max_snapshots = cp_cfg.get("max_snapshots", 20)
+        self.checkpoint_max_total_size_mb = cp_cfg.get("max_total_size_mb", 500)
+        self.checkpoint_max_file_size_mb = cp_cfg.get("max_file_size_mb", 10)
         self.pass_session_id = pass_session_id
         # --ignore-rules: honor either the constructor flag or the env var set
         # by `hermes chat --ignore-rules` in hermes_cli/main.py. When true we
@@ -3845,6 +3848,8 @@ def _init_agent(self, *, model_override: str = None, runtime_override: dict = No
             thinking_callback=self._on_thinking,
             checkpoints_enabled=self.checkpoints_enabled,
             checkpoint_max_snapshots=self.checkpoint_max_snapshots,
+            checkpoint_max_total_size_mb=self.checkpoint_max_total_size_mb,
+            checkpoint_max_file_size_mb=self.checkpoint_max_file_size_mb,
             pass_session_id=self.pass_session_id,
             skip_context_files=self.ignore_rules,
             skip_memory=self.ignore_rules,
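The second cli.py hunk accepts either a bare boolean or a dict for the `checkpoints` config and then falls back to the v2 defaults. A minimal standalone sketch of that normalization follows; `resolve_checkpoint_settings` and `V2_DEFAULTS` are hypothetical names introduced for illustration, and treating a missing config (`None`) as an empty dict is an assumption not shown in the diff.

```python
from typing import Any, Dict, Union

# Defaults taken from the v2 values in this commit.
V2_DEFAULTS: Dict[str, Any] = {
    "enabled": False,
    "max_snapshots": 20,
    "max_total_size_mb": 500,
    "max_file_size_mb": 10,
}


def resolve_checkpoint_settings(
    cp_cfg: Union[bool, Dict[str, Any], None],
    cli_flag: bool = False,
) -> Dict[str, Any]:
    """Normalize a bool-or-dict checkpoints config into a full settings dict."""
    if isinstance(cp_cfg, bool):
        # `checkpoints: true` is shorthand for `checkpoints: {enabled: true}`.
        cp_cfg = {"enabled": cp_cfg}
    cp_cfg = cp_cfg or {}  # assumption: a missing section behaves like {}
    settings = {key: cp_cfg.get(key, default) for key, default in V2_DEFAULTS.items()}
    # The CLI flag can only turn checkpoints on, never off.
    settings["enabled"] = bool(cli_flag or settings["enabled"])
    return settings
```

For example, `resolve_checkpoint_settings({}, cli_flag=True)` enables checkpoints with every other value left at its v2 default.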

gateway/run.py

Lines changed: 1 addition & 0 deletions
@@ -1160,6 +1160,7 @@ def __init__(self, config: Optional[GatewayConfig] = None):
                 retention_days=int(_ckpt_cfg.get("retention_days", 7)),
                 min_interval_hours=int(_ckpt_cfg.get("min_interval_hours", 24)),
                 delete_orphans=bool(_ckpt_cfg.get("delete_orphans", True)),
+                max_total_size_mb=int(_ckpt_cfg.get("max_total_size_mb", 500)),
             )
         except Exception as exc:
             logger.debug("checkpoint auto-maintenance skipped: %s", exc)
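Both startup hooks pass `min_interval_hours`, which limits the automatic sweep to at most one run per interval. The diff does not show how the last-run marker is stored, so the following is only a sketch of the gating arithmetic; `sweep_is_due` and its `last_sweep_ts` argument are hypothetical.

```python
import time
from typing import Optional


def sweep_is_due(
    last_sweep_ts: Optional[float],
    min_interval_hours: int,
    now: Optional[float] = None,
) -> bool:
    """Return True when enough time has passed since the previous sweep.

    `last_sweep_ts` is a Unix timestamp, or None if no sweep has run yet
    (hypothetical representation of the marker).
    """
    if last_sweep_ts is None:
        return True
    now = time.time() if now is None else now
    return (now - last_sweep_ts) >= min_interval_hours * 3600
```

A manual `hermes checkpoints prune` is described as ignoring this marker, so a check like this would apply only to the automatic startup path.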

hermes_cli/checkpoints.py

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
+"""`hermes checkpoints` CLI subcommand.
+
+Gives users direct visibility and control over the filesystem checkpoint
+store at ``~/.hermes/checkpoints/``. Actions:
+
+    hermes checkpoints                # same as `status`
+    hermes checkpoints status         # total size, project count, breakdown
+    hermes checkpoints list           # per-project checkpoint counts + workdir
+    hermes checkpoints prune [opts]   # force a sweep (ignores the 24h marker)
+    hermes checkpoints clear [-f]     # nuke the entire base (asks first)
+    hermes checkpoints clear-legacy   # delete just the legacy-* archives
+
+Examples::
+
+    hermes checkpoints
+    hermes checkpoints prune --retention-days 3 --max-size-mb 200
+    hermes checkpoints clear -f
+
+None of these require the agent to be running. Safe to call any time.
+"""
+
+from __future__ import annotations
+
+import argparse
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Dict
+
+
+def _fmt_bytes(n: int) -> str:
+    units = ("B", "KB", "MB", "GB", "TB")
+    size = float(n or 0)
+    for unit in units:
+        if size < 1024 or unit == units[-1]:
+            if unit == "B":
+                return f"{int(size)} {unit}"
+            return f"{size:.1f} {unit}"
+        size /= 1024
+    return f"{size:.1f} TB"
+
+
+def _fmt_ts(ts: Any) -> str:
+    try:
+        return datetime.fromtimestamp(float(ts)).strftime("%Y-%m-%d %H:%M")
+    except (TypeError, ValueError):
+        return "—"
+
+
+def _fmt_age(ts: Any) -> str:
+    try:
+        age = time.time() - float(ts)
+    except (TypeError, ValueError):
+        return "—"
+    if age < 0:
+        return "now"
+    if age < 60:
+        return f"{int(age)}s ago"
+    if age < 3600:
+        return f"{int(age / 60)}m ago"
+    if age < 86400:
+        return f"{int(age / 3600)}h ago"
+    return f"{int(age / 86400)}d ago"
+
+
+def cmd_status(args: argparse.Namespace) -> int:
+    from tools.checkpoint_manager import store_status
+
+    info = store_status()
+    base = info["base"]
+    print(f"Checkpoint base: {base}")
+    print(f"Total size: {_fmt_bytes(info['total_size_bytes'])}")
+    print(f" store/ {_fmt_bytes(info['store_size_bytes'])}")
+    print(f" legacy-* {_fmt_bytes(info['legacy_size_bytes'])}")
+    print(f"Projects: {info['project_count']}")
+
+    projects = sorted(
+        info["projects"],
+        key=lambda p: (p.get("last_touch") or 0),
+        reverse=True,
+    )
+    if projects:
+        print()
+        print(f" {'WORKDIR':<60} {'COMMITS':>7} {'LAST TOUCH':>12} STATE")
+        for p in projects[: args.limit if hasattr(args, "limit") and args.limit else 20]:
+            wd = p.get("workdir") or "(unknown)"
+            if len(wd) > 60:
+                wd = "…" + wd[-59:]
+            exists = p.get("exists")
+            state = "live" if exists else "orphan"
+            commits = p.get("commits", 0)
+            last = _fmt_age(p.get("last_touch"))
+            print(f" {wd:<60} {commits:>7} {last:>12} {state}")
+
+    legacy = info.get("legacy_archives", [])
+    if legacy:
+        print()
+        print(f"Legacy archives ({len(legacy)}):")
+        for arch in sorted(legacy, key=lambda a: a.get("mtime", 0), reverse=True):
+            print(f" {arch['name']:<40} {_fmt_bytes(arch['size_bytes']):>10}")
+        print()
+        print("Clear with: hermes checkpoints clear-legacy")
+    return 0
+
+
+def cmd_list(args: argparse.Namespace) -> int:
+    # `list` is just a terser status — already covered.
+    return cmd_status(args)
+
+
+def cmd_prune(args: argparse.Namespace) -> int:
+    from tools.checkpoint_manager import prune_checkpoints
+
+    retention_days = args.retention_days
+    max_size_mb = args.max_size_mb
+
+    print("Pruning checkpoint store…")
+    print(f" retention_days: {retention_days}")
+    print(f" delete_orphans: {not args.keep_orphans}")
+    print(f" max_total_size_mb: {max_size_mb}")
+    print()
+
+    result = prune_checkpoints(
+        retention_days=retention_days,
+        delete_orphans=not args.keep_orphans,
+        max_total_size_mb=max_size_mb,
+    )
+    print(f"Scanned: {result['scanned']}")
+    print(f"Deleted orphan: {result['deleted_orphan']}")
+    print(f"Deleted stale: {result['deleted_stale']}")
+    print(f"Errors: {result['errors']}")
+    print(f"Bytes reclaimed: {_fmt_bytes(result['bytes_freed'])}")
+    return 0
+
+
+def _confirm(prompt: str) -> bool:
+    try:
+        resp = input(f"{prompt} [y/N]: ").strip().lower()
+    except (EOFError, KeyboardInterrupt):
+        print()
+        return False
+    return resp in ("y", "yes")
+
+
+def cmd_clear(args: argparse.Namespace) -> int:
+    from tools.checkpoint_manager import CHECKPOINT_BASE, clear_all, store_status
+
+    info = store_status()
+    if info["total_size_bytes"] == 0 and not Path(CHECKPOINT_BASE).exists():
+        print("Nothing to clear — checkpoint base does not exist.")
+        return 0
+
+    print(f"This will delete the ENTIRE checkpoint base at {info['base']}")
+    print(f" size: {_fmt_bytes(info['total_size_bytes'])}")
+    print(f" projects: {info['project_count']}")
+    print(f" legacy dirs: {len(info.get('legacy_archives', []))}")
+    print()
+    print("All /rollback history for every working directory will be lost.")
+    if not args.force and not _confirm("Proceed?"):
+        print("Aborted.")
+        return 1
+
+    result = clear_all()
+    if result["deleted"]:
+        print(f"Cleared. Reclaimed {_fmt_bytes(result['bytes_freed'])}.")
+        return 0
+    print("Could not clear checkpoint base (see logs).")
+    return 2
+
+
+def cmd_clear_legacy(args: argparse.Namespace) -> int:
+    from tools.checkpoint_manager import clear_legacy, store_status
+
+    info = store_status()
+    legacy = info.get("legacy_archives", [])
+    if not legacy:
+        print("No legacy archives to clear.")
+        return 0
+
+    total = sum(a.get("size_bytes", 0) for a in legacy)
+    print(f"Found {len(legacy)} legacy archive(s), total {_fmt_bytes(total)}:")
+    for arch in legacy:
+        print(f" {arch['name']:<40} {_fmt_bytes(arch['size_bytes']):>10}")
+    print()
+    print("Legacy archives hold pre-v2 per-project shadow repos, moved aside")
+    print("during the single-store migration. Delete when you're confident")
+    print("you don't need the old /rollback history.")
+    if not args.force and not _confirm("Delete all legacy archives?"):
+        print("Aborted.")
+        return 1
+
+    result = clear_legacy()
+    print(f"Deleted {result['deleted']} archive(s), reclaimed {_fmt_bytes(result['bytes_freed'])}.")
+    return 0
+
+
+def register_cli(parser: argparse.ArgumentParser) -> None:
+    """Wire subcommands onto the ``hermes checkpoints`` parser."""
+    parser.set_defaults(func=cmd_status)  # bare `hermes checkpoints` → status
+    subs = parser.add_subparsers(dest="checkpoints_command", metavar="COMMAND")
+
+    p_status = subs.add_parser(
+        "status",
+        help="Show total size, project count, and per-project breakdown",
+    )
+    p_status.add_argument("--limit", type=int, default=20,
+                          help="Max projects to list (default 20)")
+    p_status.set_defaults(func=cmd_status)
+
+    p_list = subs.add_parser(
+        "list",
+        help="Alias for 'status'",
+    )
+    p_list.add_argument("--limit", type=int, default=20)
+    p_list.set_defaults(func=cmd_list)
+
+    p_prune = subs.add_parser(
+        "prune",
+        help="Delete orphan/stale checkpoints and GC the store",
+    )
+    p_prune.add_argument("--retention-days", type=int, default=7,
+                         help="Drop projects whose last_touch is older than N days (default 7)")
+    p_prune.add_argument("--max-size-mb", type=int, default=500,
+                         help="After orphan/stale prune, drop oldest commits "
+                              "per project until total size <= this (default 500)")
+    p_prune.add_argument("--keep-orphans", action="store_true",
+                         help="Skip deleting projects whose workdir no longer exists")
+    p_prune.set_defaults(func=cmd_prune)
+
+    p_clear = subs.add_parser(
+        "clear",
+        help="Delete the entire checkpoint base (all /rollback history)",
+    )
+    p_clear.add_argument("-f", "--force", action="store_true",
+                         help="Skip confirmation prompt")
+    p_clear.set_defaults(func=cmd_clear)
+
+    p_legacy = subs.add_parser(
+        "clear-legacy",
+        help="Delete only the legacy-<ts>/ archives from v1 migration",
+    )
+    p_legacy.add_argument("-f", "--force", action="store_true",
+                          help="Skip confirmation prompt")
+    p_legacy.set_defaults(func=cmd_clear_legacy)
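`register_cli` relies on a standard argparse pattern: a `func` default on the parent parser handles the bare command, and each subparser overrides it. The reduced sketch below shows that dispatch in isolation. The stub handlers return strings instead of touching a checkpoint store, and `build_parser` and `run` are hypothetical helpers, not part of the commit.

```python
import argparse
from typing import List


def cmd_status(args: argparse.Namespace) -> str:
    # A bare invocation never defines --limit, so fall back to 20,
    # mirroring the hasattr() guard in the real cmd_status.
    limit = getattr(args, "limit", None) or 20
    return f"status(limit={limit})"


def cmd_prune(args: argparse.Namespace) -> str:
    return f"prune(retention_days={args.retention_days})"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo-checkpoints")
    parser.set_defaults(func=cmd_status)  # used when no subcommand is given
    subs = parser.add_subparsers(dest="checkpoints_command", metavar="COMMAND")

    p_status = subs.add_parser("status")
    p_status.add_argument("--limit", type=int, default=20)
    p_status.set_defaults(func=cmd_status)

    p_prune = subs.add_parser("prune")
    p_prune.add_argument("--retention-days", type=int, default=7)
    p_prune.set_defaults(func=cmd_prune)  # overrides the parent default
    return parser


def run(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Here `run([])` dispatches to the status handler, while `run(["prune", "--retention-days", "3"])` reaches the prune handler with the parsed value.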

hermes_cli/config.py

Lines changed: 32 additions & 14 deletions
@@ -574,21 +574,39 @@ def _ensure_hermes_home_managed(home: Path):
     },
 
     # Filesystem checkpoints — automatic snapshots before destructive file ops.
-    # When enabled, the agent takes a snapshot of the working directory once per
-    # conversation turn (on first write_file/patch call). Use /rollback to restore.
+    # When enabled, the agent takes a snapshot of the working directory once
+    # per conversation turn (on first write_file/patch call). Use /rollback
+    # to restore.
+    #
+    # Defaults changed in v2 (single shared shadow store, real pruning):
+    # - enabled: True -> False (opt-in; most users never use /rollback)
+    # - max_snapshots: 50 -> 20 (now actually enforced via ref rewrite)
+    # - auto_prune: False -> True (orphans/stale pruned automatically)
+    # Opt in via ``hermes chat --checkpoints`` or set enabled=True here.
     "checkpoints": {
-        "enabled": True,
-        "max_snapshots": 50, # Max checkpoints to keep per directory
-        # Auto-maintenance: shadow repos accumulate forever under
-        # ~/.hermes/checkpoints/ (one per cd'd working directory). Field
-        # reports put the typical offender at 1000+ repos / ~12 GB. When
-        # auto_prune is on, hermes sweeps at startup (at most once per
-        # min_interval_hours) and deletes:
-        # * orphan repos: HERMES_WORKDIR no longer exists on disk
-        # * stale repos: newest mtime older than retention_days
-        # Opt-in so users who rely on /rollback against long-ago sessions
-        # never lose data silently.
-        "auto_prune": False,
+        "enabled": False,
+        # Max checkpoints to keep per working directory. Pre-v2 this only
+        # limited the `/rollback` listing; v2 actually rewrites the ref and
+        # garbage-collects older commits.
+        "max_snapshots": 20,
+        # Hard ceiling on total ``~/.hermes/checkpoints/`` size (MB). When
+        # exceeded, the oldest checkpoint per project is dropped in a
+        # round-robin pass until total size falls under the cap.
+        # 0 disables the size cap.
+        "max_total_size_mb": 500,
+        # Skip any single file larger than this when staging a checkpoint.
+        # Prevents accidental snapshotting of datasets, model weights, and
+        # other large generated assets. 0 disables the filter.
+        "max_file_size_mb": 10,
+        # Auto-maintenance: hermes sweeps the checkpoint base at startup
+        # (at most once per ``min_interval_hours``) and:
+        # * deletes project entries whose workdir no longer exists (orphan)
+        # * deletes project entries whose last_touch is older than
+        # ``retention_days``
+        # * GCs the single shared store to reclaim unreachable objects
+        # * enforces ``max_total_size_mb`` across remaining projects
+        # * deletes ``legacy-*`` archives older than ``retention_days``
+        "auto_prune": True,
         "retention_days": 7,
         "delete_orphans": True,
         "min_interval_hours": 24,
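The `max_file_size_mb` comment describes a per-file filter applied while staging a checkpoint, with 0 disabling it. A sketch of that rule on plain `(path, size)` pairs is shown here. `drop_oversize` is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption; the real `_drop_oversize_from_index()` operates on the git index.

```python
from typing import Iterable, List, Tuple


def drop_oversize(files: Iterable[Tuple[str, int]], max_file_size_mb: int) -> List[str]:
    """Return the paths that would stay in the snapshot.

    `files` yields (path, size_in_bytes) pairs. Files strictly larger than
    the limit are skipped; a limit of 0 keeps everything.
    """
    if max_file_size_mb <= 0:
        return [path for path, _size in files]
    limit = max_file_size_mb * 1024 * 1024  # assumption: binary megabytes
    return [path for path, size in files if size <= limit]
```

With a 1 MB limit, a 2 MB `weights.bin` is excluded while small source files are kept, which matches the E2E case in the commit message.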

hermes_cli/main.py

Lines changed: 14 additions & 0 deletions
@@ -9379,6 +9379,20 @@ def main():
     )
     backup_parser.set_defaults(func=cmd_backup)
 
+    # =========================================================================
+    # checkpoints command
+    # =========================================================================
+    checkpoints_parser = subparsers.add_parser(
+        "checkpoints",
+        help="Inspect / prune / clear ~/.hermes/checkpoints/",
+        description="Manage the filesystem checkpoint store — the shadow git "
+                    "repo hermes uses to snapshot working directories before "
+                    "write_file/patch/terminal calls. Lets you see how much "
+                    "space checkpoints occupy, force a prune, or wipe the base.",
+    )
+    from hermes_cli.checkpoints import register_cli as _register_checkpoints_cli
+    _register_checkpoints_cli(checkpoints_parser)
+
     # =========================================================================
     # import command
     # =========================================================================

run_agent.py

Lines changed: 5 additions & 1 deletion
@@ -966,7 +966,9 @@ def __init__(
         fallback_model: Dict[str, Any] = None,
         credential_pool=None,
         checkpoints_enabled: bool = False,
-        checkpoint_max_snapshots: int = 50,
+        checkpoint_max_snapshots: int = 20,
+        checkpoint_max_total_size_mb: int = 500,
+        checkpoint_max_file_size_mb: int = 10,
         pass_session_id: bool = False,
     ):
         """
@@ -1689,6 +1691,8 @@ def __init__(
         self._checkpoint_mgr = CheckpointManager(
             enabled=checkpoints_enabled,
             max_snapshots=checkpoint_max_snapshots,
+            max_total_size_mb=checkpoint_max_total_size_mb,
+            max_file_size_mb=checkpoint_max_file_size_mb,
         )
 
         # SQLite session store (optional -- provided by CLI or gateway)
