Walker stalls silently on macOS under metadata pressure; xattrs read eagerly on dataless cloud files
Version: 0.15.0 (49796ed)
Platform: macOS 26.4 (Apple Silicon), APFS, iCloud Drive Documents enabled
Summary
Two related issues observed during a routine vykar backup on a developer Mac with iCloud-managed ~/Documents:
- Progress display freezes for many seconds at a time with no indication of liveness, even though the process is alive and doing work (or waiting on a single blocking syscall). To the user this is indistinguishable from a deadlock.
- Vykar reads xattrs eagerly on every walked file, including SF_DATALESS placeholders — even ones it is about to discard via CacheResolution::SkipDataless. On iCloud-managed paths, listxattr / getxattr can round-trip through fileproviderd/bird, which adds latency and locking contention to a phase that is already single-threaded.
Issue #118 fixed content hydration on dataless files. This is the next layer: metadata round-trips on dataless files and observability of walker stalls.
Reproduction
Symptoms (from a real run):
Files: 165177, Original: 25.89 GiB, Compressed: 12.39 GiB, Deduplicated: 393.61 MiB,
Errors: 1, Current: <one cloud-only file path>
Progress line frozen on a compressed,dataless placeholder file in iCloud Drive — one of several hundred small files in the same directory (a medical-imaging scan series). Several minutes of no UI movement, then resumed without intervention.
Diagnosis (live)
Sampled the running process (sample $PID 2 -mayDie) while frozen:
- 12 worker threads parked in _dispatch_semaphore_wait_slow (idle, channel recv)
- Consumer (main thread) parked in _dispatch_semaphore_wait_slow
- One walker thread at 100% in a stat syscall for the full sample window
- 3 HTTPS sockets to the R2 backend, all idle (no upload in flight)
fs_usage after the stall cleared showed the walker grinding through Keynote .key packages with lstat64 → listxattr → getxattr → open/getdirentries64/close per file, plus RdMeta[S] (APFS B-tree node reads from a cold cache) on /dev/disk3.
System context at the time: load average 29.67 / 32.84 / 38.76, 6,132 / 2,097,152 free pages (~0.3% free) — heavy memory pressure; the kernel was thrashing. That explains why a single stat() blocked for many seconds, but it doesn't explain why the user is left staring at a frozen progress line with no diagnostic signal.
Root causes
A. Single-threaded stat phase, no per-syscall watchdog
crates/vykar-core/src/commands/backup/walk/inode_walk.rs walks the tree on a single thread inside Stage A of the parallel pipeline (crates/vykar-core/src/commands/backup/pipeline/mod.rs:248-291). Each file goes:
1. readdir → 2. inode-sort → 3. symlink_metadata() → 4. materialize_item() (which reads xattrs) → 5. push into crossbeam_channel.
Any single blocking call in steps 3-4 freezes the whole pipeline. Workers and consumer have nothing to do; BackupProgressEvent::StatsUpdated only fires from the consumer when a file finishes. End result: a frozen progress line for the duration of the stuck syscall.
B. Xattrs read eagerly on dataless files, even when about to skip
crates/vykar-core/src/commands/backup/walk/mod.rs:188-190:
if xattrs_enabled {
    item.xattrs = read_item_xattrs(&walked.abs_path);
}
read_item_xattrs calls xattr::list(path) then xattr::get(path, &name) for every attr. This happens inside materialize_item, which is called before resolve_cache_hit in walked_entry_to_walk_items (walk/mod.rs:469). For a SF_DATALESS placeholder that will end up routed to CacheResolution::SkipDataless (dropped), vykar still pays the xattr cost.
On macOS FileProvider paths, several xattrs are FileProvider-managed (com.apple.fileprovider.fpfs#P, com.apple.metadata:kMDItemUserTags, etc.). getxattr for those values can call into fileproviderd to fetch cloud-side state. iCloud-heavy directories often hold hundreds of small dataless files (image bursts, scan series, document attachment dumps) at hundreds of KiB each — every one of them triggers a listxattr + per-attr getxattr round-trip, multiplying syscall pressure on a phase that is already single-threaded.
is_dataless is already on MetadataSummary — it's just not consulted before xattr reads.
C. No observability of walker liveness
BackupProgressRenderer (crates/vykar-cli/src/progress.rs) only repaints when it receives a StatsUpdated event from the consumer. The walker emits no progress events. A stuck stat or stuck getxattr means the renderer's last_draw is stale and there's no visible heartbeat. Tracing also stays silent on long-running individual syscalls.
Proposed fixes
1. Skip xattr reads on dataless files in materialize_item (small, low-risk)
if xattrs_enabled && !metadata_summary.is_dataless {
    item.xattrs = read_item_xattrs(&walked.abs_path);
}
Justification: dataless files take the SkipDataless or cache-hit path (which already supplies the cached Item's xattrs from the prior snapshot via parent reuse). Reading them again from disk is both expensive (FileProvider round-trip) and pointless (we just discard or replace).
2. Defer xattr reads until after resolve_cache_hit (slightly larger refactor)
For the Miss path, xattr reading still has to happen before we hand the entry to a worker. But we can move it out of materialize_item and into the post-cache-check branch in walked_entry_to_walk_items, so cache hits and dataless skips never pay for xattrs at all. This generalizes (1) to all cache hits, not just dataless ones — speeds up cold-cache incremental backups against an iCloud-quiet but inode-cold tree.
3. Per-syscall watchdog with timeout-and-soft-skip (defense-in-depth)
Wrap symlink_metadata and read_item_xattrs in a timeout (e.g. 30 s default, configurable via limits.metadata_timeout_secs). On timeout: emit tracing::warn! with the path and the syscall name, count the file as a soft error, and continue. Prevents one stuck FileProvider-managed file (or stalled NFS mount) from killing an entire backup.
This needs care on macOS — we cannot reliably interrupt a kernel syscall once the issuing thread is blocked in it. Easiest implementation: do the metadata/xattr reads on a short-lived helper thread joined with recv_timeout. The helper is leaked on timeout (it will resolve eventually); the main walker advances. Memory cost per leak is small, and leaked threads are bounded by the small fraction of paths that hit a stuck daemon.
4. Walker heartbeat events (observability)
Add a BackupProgressEvent::WalkerHeartbeat { current_path: String } emitted every N ms (e.g. 250 ms) from the walker thread, carrying whatever path the walker is currently statting. Consumer-side: render it in the same slot as Current: ... so a stuck walker shows the path that's stuck, not the path the consumer last finished. Also makes it possible for the renderer to draw a spinner / elapsed-on-current-path indicator.
Optional addition: log WARN if a single path's heartbeat exceeds T seconds (e.g. 10 s), so post-hoc users can grep their tracing output for slow paths.
5. (Maybe) parallelize the stat phase on multi-disk APFS
Out of scope for this bug, but worth filing separately. Single-threaded inode-sorted stat is right for HDD; on macOS APFS NVMe SSDs, multiple concurrent stats may be a net win, especially when fileproviderd is the bottleneck rather than disk seek time. Today an iCloud-managed Documents folder with hundreds of thousands of files is bottlenecked at one stat at a time.
Suggested priority
- Fix 1 is one line, no behavior risk — would close the immediate user-visible problem on iCloud-heavy Macs.
- Fix 4 is independent, fixes the "indistinguishable from deadlock" UX failure regardless of root cause.
- Fixes 2, 3, 5 are larger and can be queued.
I can submit a PR for (1) + a regression test that mocks SF_DATALESS via a fake fs layer if there's interest.
Related
- #118: content hydration on dataless files (fixed; this issue covers the metadata layer)