Walker stalls silently on macOS under metadata pressure; xattrs read eagerly on dataless cloud files
Version: 0.15.0 (49796ed)
Platform: macOS 26.4 (Apple Silicon), APFS, iCloud Drive Documents enabled
Summary
Two related issues observed during a routine vykar backup on a developer Mac with iCloud-managed ~/Documents:
- Progress display freezes for many seconds at a time with no indication of liveness, even though the process is alive and doing work (or waiting on a single blocking syscall). To the user this is indistinguishable from a deadlock.
- Vykar reads xattrs eagerly on every walked file, including SF_DATALESS placeholders — even ones it is about to discard via CacheResolution::SkipDataless. On iCloud-managed paths, listxattr / getxattr can round-trip through fileproviderd/bird, which adds latency and locking contention to a phase that is already single-threaded.
Issue #118 fixed content hydration on dataless files. This is the next layer: metadata round-trips on dataless files and observability of walker stalls.
Reproduction
Symptoms (from a real run):
Files: 165177, Original: 25.89 GiB, Compressed: 12.39 GiB, Deduplicated: 393.61 MiB,
Errors: 1, Current: <one cloud-only file path>
Progress line frozen on a compressed,dataless placeholder file in iCloud Drive — one of several hundred small files in the same directory (a medical-imaging scan series). Several minutes of no UI movement, then resumed without intervention.
Diagnosis (live)
Sampled the running process (sample $PID 2 -mayDie) while frozen:
- 12 worker threads parked in _dispatch_semaphore_wait_slow (idle, channel recv)
- Consumer (main thread) parked in _dispatch_semaphore_wait_slow
- One walker thread at 100% in a stat syscall for the full sample window
- 3 HTTPS sockets to the R2 backend, all idle (no upload in flight)
fs_usage after the stall cleared showed the walker grinding through Keynote .key packages with lstat64 → listxattr → getxattr → open/getdirentries64/close per file, plus RdMeta[S] (APFS B-tree node reads from a cold cache) on /dev/disk3.
System context at the time: load average 29.67 / 32.84 / 38.76, 6,132 / 2,097,152 free pages (~0.3% free) — heavy memory pressure; the kernel was thrashing. That explains why a single stat() blocked for many seconds, but it doesn't explain why the user is left staring at a frozen progress line with no diagnostic signal.
Root causes
A. Single-threaded stat phase, no per-syscall watchdog
crates/vykar-core/src/commands/backup/walk/inode_walk.rs walks the tree on a single thread inside Stage A of the parallel pipeline (crates/vykar-core/src/commands/backup/pipeline/mod.rs:248-291). Each file goes:
1. readdir → 2. inode-sort → 3. symlink_metadata() → 4. materialize_item() (which reads xattrs) → 5. push into crossbeam_channel.
Any single blocking call in steps 3-4 freezes the whole pipeline. Workers and consumer have nothing to do; BackupProgressEvent::StatsUpdated only fires from the consumer when a file finishes. End result: a frozen progress line for the duration of the stuck syscall.
B. Xattrs read eagerly on dataless files, even when about to skip
crates/vykar-core/src/commands/backup/walk/mod.rs:188-190:
if xattrs_enabled {
    item.xattrs = read_item_xattrs(&walked.abs_path);
}
read_item_xattrs calls xattr::list(path) then xattr::get(path, &name) for every attr. This happens inside materialize_item, which is called before resolve_cache_hit in walked_entry_to_walk_items (walk/mod.rs:469). For a SF_DATALESS placeholder that will end up routed to CacheResolution::SkipDataless (dropped), vykar still pays the xattr cost.
On macOS FileProvider paths, several xattrs are FileProvider-managed (com.apple.fileprovider.fpfs#P, com.apple.metadata:kMDItemUserTags, etc.). getxattr for those values can call into fileproviderd to fetch cloud-side state. iCloud-heavy directories often hold hundreds of small dataless files (image bursts, scan series, document attachment dumps) at hundreds of KiB each — every one of them triggers a listxattr + per-attr getxattr round-trip, multiplying syscall pressure on a phase that is already single-threaded.
is_dataless is already on MetadataSummary — it's just not consulted before xattr reads.
C. No observability of walker liveness
BackupProgressRenderer (crates/vykar-cli/src/progress.rs) only repaints when it receives a StatsUpdated event from the consumer. The walker emits no progress events. A stuck stat or stuck getxattr means the renderer's last_draw is stale and there's no visible heartbeat. Tracing also stays silent on long-running individual syscalls.
Proposed fixes
1. Skip xattr reads on dataless files in materialize_item (small, low-risk)
if xattrs_enabled && !metadata_summary.is_dataless {
    item.xattrs = read_item_xattrs(&walked.abs_path);
}
Justification: dataless files take the SkipDataless or cache-hit path (which already supplies the cached Item's xattrs from the prior snapshot via parent reuse). Reading them again from disk is both expensive (FileProvider round-trip) and pointless (we just discard or replace).
2. Defer xattr reads until after resolve_cache_hit (slightly larger refactor)
For the Miss path, xattr reading still has to happen before we hand the entry to a worker. But we can move it out of materialize_item and into the post-cache-check branch in walked_entry_to_walk_items, so cache hits and dataless skips never pay for xattrs at all. This generalizes (1) to all cache hits, not just dataless ones — speeds up cold-cache incremental backups against an iCloud-quiet but inode-cold tree.
3. Per-syscall watchdog with timeout-and-soft-skip (defense-in-depth)
Wrap symlink_metadata and read_item_xattrs in a timeout (e.g. 30 s default, configurable via limits.metadata_timeout_secs). On timeout: emit tracing::warn! with the path and the syscall name, count the file as a soft error, and continue. Prevents one stuck FileProvider-managed file (or stalled NFS mount) from killing an entire backup.
This needs care on macOS — we cannot reliably interrupt a kernel syscall once the issuing thread is blocked in it. Easiest implementation: do the metadata/xattr reads on a short-lived helper thread joined with recv_timeout. The helper is leaked on timeout (it will resolve eventually); the main walker advances. Memory cost per leak is small, and leaked threads are bounded by the small fraction of paths that hit a stuck daemon.
4. Walker heartbeat events (observability)
Add a BackupProgressEvent::WalkerHeartbeat { current_path: String } emitted every N ms (e.g. 250 ms) from the walker thread, carrying whatever path the walker is currently statting. Consumer-side: render it in the same slot as Current: ... so a stuck walker shows the path that's stuck, not the path the consumer last finished. Also makes it possible for the renderer to draw a spinner / elapsed-on-current-path indicator.
Optional addition: log WARN if a single path's heartbeat exceeds T seconds (e.g. 10 s), so post-hoc users can grep their tracing output for slow paths.
5. (Maybe) parallelize the stat phase on multi-disk APFS
Out of scope for this bug, but worth filing separately. Single-threaded inode-sorted stat is right for HDD; on macOS APFS NVMe SSDs, multiple concurrent stats may be a net win, especially when fileproviderd is the bottleneck rather than disk seek time. Today an iCloud-managed Documents folder with hundreds of thousands of files is bottlenecked at one stat at a time.
Suggested priority
- Fix 1 is one line, no behavior risk — would close the immediate user-visible problem on iCloud-heavy Macs.
- Fix 4 is independent, fixes the "indistinguishable from deadlock" UX failure regardless of root cause.
- Fixes 2, 3, 5 are larger and can be queued.
I can submit a PR for (1) + a regression test that mocks SF_DATALESS via a fake fs layer if there's interest.
Related
- #118: content hydration on dataless files (fixed; this issue covers the metadata layer)