Walker stalls silently on macOS under metadata pressure; xattrs read eagerly on dataless cloud files #133

@hexsprite

Version: 0.15.0 (49796ed)
Platform: macOS 26.4 (Apple Silicon), APFS, iCloud Drive Documents enabled

Summary

Two related issues observed during a routine vykar backup on a developer Mac with iCloud-managed ~/Documents:

  1. Progress display freezes for many seconds at a time with no indication of liveness, even though the process is alive and doing work (or waiting on a single blocking syscall). To the user this is indistinguishable from a deadlock.
  2. Vykar reads xattrs eagerly on every walked file, including SF_DATALESS placeholders — even ones it is about to discard via CacheResolution::SkipDataless. On iCloud-managed paths, listxattr / getxattr can round-trip through fileproviderd/bird, which adds latency and locking contention to a phase that is already single-threaded.

Issue #118 fixed content hydration on dataless files. This is the next layer: metadata round-trips on dataless files and observability of walker stalls.

Reproduction

Symptoms (from a real run):

Files: 165177, Original: 25.89 GiB, Compressed: 12.39 GiB, Deduplicated: 393.61 MiB,
  Errors: 1, Current: <one cloud-only file path>

Progress line frozen on a compressed,dataless placeholder file in iCloud Drive — one of several hundred small files in the same directory (a medical-imaging scan series). Several minutes of no UI movement, then resumed without intervention.

Diagnosis (live)

Sampled the running process (sample $PID 2 -mayDie) while frozen:

  • 12 worker threads parked on _dispatch_semaphore_wait_slow (idle, channel recv)
  • Consumer (main thread) parked on _dispatch_semaphore_wait_slow
  • One walker thread 100% in stat syscall for the full sample window
  • 3 HTTPS sockets to R2 backend, all idle (no upload in flight)

fs_usage after it unstuck showed the walker grinding through Keynote .key packages with lstat64, listxattr, getxattr, and open/getdirentries64/close per file, plus RdMeta[S] (APFS B-tree node reads from cold cache) on /dev/disk3.

System context at the time: load average 29.67 / 32.84 / 38.76, 6,132 / 2,097,152 free pages (~0.3% free) — heavy memory pressure, kernel was thrashing. That explains why a single stat() blocked for many seconds, but it doesn't explain why the user is left staring at a frozen progress line with no diagnostic signal.

Root causes

A. Single-threaded stat phase, no per-syscall watchdog

crates/vykar-core/src/commands/backup/walk/inode_walk.rs walks the tree on a single thread inside Stage A of the parallel pipeline (crates/vykar-core/src/commands/backup/pipeline/mod.rs:248-291). Each file goes:

  1. readdir → 2. inode-sort → 3. symlink_metadata() → 4. materialize_item() (which reads xattrs) → 5. push into crossbeam_channel.

Any single blocking call in steps 3-4 freezes the whole pipeline. Workers and consumer have nothing to do; BackupProgressEvent::StatsUpdated only fires from the consumer when a file finishes. End result: a frozen progress line for the duration of the stuck syscall.
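To make the stall mechanism concrete, here is a minimal model of Stage A, not the actual vykar code. It assumes a hypothetical WalkItem type and a caller-supplied stat closure, and uses std::sync::mpsc in place of crossbeam_channel so the sketch has no external dependencies.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical item type; the real pipeline carries much richer items.
#[derive(Debug, PartialEq)]
struct WalkItem {
    path: String,
    size: u64,
}

/// Model of Stage A: one thread stats each path in order and pushes the
/// result into a bounded channel that workers would drain.
fn spawn_walker<F>(paths: Vec<String>, stat: F) -> Receiver<WalkItem>
where
    F: Fn(&str) -> u64 + Send + 'static,
{
    let (tx, rx) = sync_channel(16);
    thread::spawn(move || {
        for path in paths {
            // A slow `stat` here delays every later path: nothing else
            // feeds the channel while this call is blocked.
            let size = stat(&path);
            if tx.send(WalkItem { path, size }).is_err() {
                break; // receiver dropped, stop walking
            }
        }
    });
    rx
}
```

Because the receiver only sees items after stat returns, any progress reporting driven by received items goes quiet for exactly as long as the slowest call.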

B. Xattrs read eagerly on dataless files, even when about to skip

crates/vykar-core/src/commands/backup/walk/mod.rs:188-190:

if xattrs_enabled {
    item.xattrs = read_item_xattrs(&walked.abs_path);
}

read_item_xattrs calls xattr::list(path) then xattr::get(path, &name) for every attr. This happens inside materialize_item, which is called before resolve_cache_hit in walked_entry_to_walk_items (walk/mod.rs:469). For a SF_DATALESS placeholder that will end up routed to CacheResolution::SkipDataless (dropped), vykar still pays the xattr cost.

On macOS FileProvider paths, several xattrs are FileProvider-managed (com.apple.fileprovider.fpfs#P, com.apple.metadata:kMDItemUserTags, etc.). getxattr for those values can call into fileproviderd to fetch cloud-side state. iCloud-heavy directories often hold hundreds of small dataless files (image bursts, scan series, document attachment dumps) at hundreds of KiB each — every one of them triggers a listxattr + per-attr getxattr round-trip, multiplying syscall pressure on a phase that is already single-threaded.

is_dataless is already on MetadataSummary — it's just not consulted before xattr reads.
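As a rough illustration of the check that is missing, the sketch below derives the dataless bit from st_flags and gates the xattr read on it. The MetadataSummary struct here is a hypothetical one-field stand-in for the real type, and the SF_DATALESS value is taken from the macOS sys/stat.h header as I understand it; verify it against your SDK.

```rust
/// SF_DATALESS from macOS <sys/stat.h> (assumed value; check your SDK headers).
const SF_DATALESS: u32 = 0x4000_0000;

/// Hypothetical stand-in for the real MetadataSummary; only the field used here.
struct MetadataSummary {
    is_dataless: bool,
}

/// True when the BSD file flags mark the file as a dataless placeholder.
fn is_dataless_flags(st_flags: u32) -> bool {
    st_flags & SF_DATALESS != 0
}

/// Decide whether xattrs should be read for this entry at all.
fn should_read_xattrs(xattrs_enabled: bool, summary: &MetadataSummary) -> bool {
    xattrs_enabled && !summary.is_dataless
}
```

On macOS the flags can be obtained from std::os::macos::fs::MetadataExt::st_flags on the result of symlink_metadata, which the walker has already paid for by this point.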

C. No observability of walker liveness

BackupProgressRenderer (crates/vykar-cli/src/progress.rs) only repaints when it receives a StatsUpdated event from the consumer. The walker emits no progress events. A stuck stat or stuck getxattr means the renderer's last_draw is stale and there's no visible heartbeat. Tracing also stays silent on long-running individual syscalls.

Proposed fixes

1. Skip xattr reads on dataless files in materialize_item (small, low-risk)

if xattrs_enabled && !metadata_summary.is_dataless {
    item.xattrs = read_item_xattrs(&walked.abs_path);
}

Justification: dataless files take the SkipDataless or cache-hit path (which already supplies the cached Item's xattrs from the prior snapshot via parent reuse). Reading them again from disk is both expensive (FileProvider round-trip) and pointless (we just discard or replace).

2. Defer xattr reads until after resolve_cache_hit (slightly larger refactor)

For the Miss path, xattr reading still has to happen before we hand the entry to a worker. But we can move it out of materialize_item and into the post-cache-check branch in walked_entry_to_walk_items, so cache hits and dataless skips never pay for xattrs at all. This generalizes (1) to all cache hits, not just dataless ones — speeds up cold-cache incremental backups against an iCloud-quiet but inode-cold tree.
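A sketch of what the deferred read could look like, under the assumption that the cache outcome is known before xattrs are needed. The CacheResolution enum below is a simplified, hypothetical mirror of the real one (only SkipDataless is named in this issue; the Hit and Miss shapes are guesses), and the xattr read is passed as a closure so it runs only on a miss.

```rust
use std::collections::BTreeMap;

type Xattrs = BTreeMap<String, Vec<u8>>;

/// Hypothetical, simplified mirror of the cache outcomes discussed above.
enum CacheResolution {
    Hit { cached_xattrs: Xattrs },
    SkipDataless,
    Miss,
}

/// Returns the xattrs to attach to the item, or None when the entry is dropped.
/// `read_xattrs` is invoked only on the Miss path.
fn xattrs_after_cache_check<F>(resolution: CacheResolution, read_xattrs: F) -> Option<Xattrs>
where
    F: FnOnce() -> Xattrs,
{
    match resolution {
        // Reuse what the prior snapshot recorded; no syscalls issued.
        CacheResolution::Hit { cached_xattrs } => Some(cached_xattrs),
        // Entry is discarded, so there is nothing to attach.
        CacheResolution::SkipDataless => None,
        // Only here do we pay for listxattr/getxattr.
        CacheResolution::Miss => Some(read_xattrs()),
    }
}
```

Passing the read as FnOnce keeps the call site close to today's code: the caller wraps read_item_xattrs(&walked.abs_path) in a closure instead of calling it up front.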

3. Per-syscall watchdog with timeout-and-soft-skip (defense-in-depth)

Wrap symlink_metadata and read_item_xattrs in a timeout (e.g. 30 s default, configurable via limits.metadata_timeout_secs). On timeout: emit tracing::warn! with the path and the syscall name, count the file as a soft error, and continue. Prevents one stuck FileProvider-managed file (or stalled NFS mount) from killing an entire backup.

This needs care on macOS — we cannot interrupt a kernel syscall in the syscall-issuing thread. Easiest implementation: do the metadata/xattr reads on a short-lived helper thread joined with recv_timeout. The helper is leaked on timeout (it'll resolve eventually); the main walker advances. Memory cost per leak is small; events are bounded by the small fraction of paths that hit a stuck daemon.
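The helper-thread approach could be sketched roughly as follows. run_with_timeout is a hypothetical name, and the sketch uses std::sync::mpsc rather than whatever channel type the walker would actually use.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `op` on a helper thread and waits up to `timeout` for its result.
/// Returns None on timeout (or if the helper panicked). On timeout the helper
/// is left detached: it finishes, or stays blocked in the kernel, on its own.
fn run_with_timeout<T, F>(op: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone and this
        // send fails harmlessly.
        let _ = tx.send(op());
    });
    rx.recv_timeout(timeout).ok()
}
```

A call site might look like run_with_timeout(move || std::fs::symlink_metadata(&path), Duration::from_secs(30)), where path is an owned PathBuf because the closure must be 'static. Spawning a thread per file is likely too costly for a tree of this size, so a real implementation would probably reuse a long-lived helper and replace it only after a timeout.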

4. Walker heartbeat events (observability)

Add a BackupProgressEvent::WalkerHeartbeat { current_path: String } emitted every N ms (e.g. 250 ms) from the walker thread, carrying whatever path the walker is currently statting. Consumer-side: render it in the same slot as Current: ... so a stuck walker shows the path that's stuck, not the path the consumer last finished. Also makes it possible for the renderer to draw a spinner / elapsed-on-current-path indicator.

Optional addition: log WARN if the walker stays on a single path for longer than T seconds (e.g. 10 s), so users can grep their tracing output for slow paths after the fact.
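One wrinkle worth noting: a thread blocked inside stat cannot emit its own heartbeats, so the periodic emission has to come from somewhere else. The sketch below assumes the walker records its current path in a shared slot before each metadata call, and a separate ticker thread reads that slot every N ms. WalkerStatus, the elapsed field, and is_slow are all hypothetical names, not existing vykar types.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical payload, modelled on the proposed WalkerHeartbeat event
/// with an added elapsed time.
#[derive(Debug, PartialEq)]
struct WalkerHeartbeat {
    current_path: String,
    elapsed: Duration,
}

/// Shared slot the walker updates before each metadata call.
#[derive(Clone, Default)]
struct WalkerStatus {
    inner: Arc<Mutex<Option<(String, Instant)>>>,
}

impl WalkerStatus {
    /// Called by the walker just before it stats `path`.
    fn enter(&self, path: &str, now: Instant) {
        *self.inner.lock().unwrap() = Some((path.to_string(), now));
    }

    /// Called by the walker when it is idle or finished.
    fn clear(&self) {
        *self.inner.lock().unwrap() = None;
    }

    /// Called from a ticker thread; works even while the walker is blocked,
    /// because the lock is never held across a syscall.
    fn heartbeat(&self, now: Instant) -> Option<WalkerHeartbeat> {
        let guard = self.inner.lock().unwrap();
        guard.as_ref().map(|(path, since)| WalkerHeartbeat {
            current_path: path.clone(),
            elapsed: now.saturating_duration_since(*since),
        })
    }
}

/// True when the walker has been on one path longer than the WARN threshold.
fn is_slow(hb: &WalkerHeartbeat, threshold: Duration) -> bool {
    hb.elapsed > threshold
}
```

The ticker would call heartbeat(Instant::now()) on its interval, forward the result as a progress event, and log a warning when is_slow returns true.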

5. (Maybe) parallelize the stat phase on multi-disk APFS

Out of scope for this bug, but worth filing separately. Single-threaded inode-sorted stat is right for HDD; on macOS APFS NVMe SSDs, multiple concurrent stats may be a net win, especially when fileproviderd is the bottleneck rather than disk seek time. Today an iCloud-managed Documents folder with hundreds of thousands of files is bottlenecked at one stat at a time.

Suggested priority

  • Fix 1 is one line, no behavior risk — would close the immediate user-visible problem on iCloud-heavy Macs.
  • Fix 4 is independent, fixes the "indistinguishable from deadlock" UX failure regardless of root cause.
  • Fixes 2, 3, 5 are larger and can be queued.

I can submit a PR for (1) + a regression test that mocks SF_DATALESS via a fake fs layer if there's interest.
