macOS: handle dataless (cloud-only) files instead of failing with 'file changed during read' #118

@hexsprite


Summary

On macOS, files backed by Apple's FileProvider framework (iCloud Drive, Dropbox, OneDrive, Box, new Google Drive, etc.) can exist on disk as dataless placeholders: stat() returns correct metadata, but the data blocks aren't materialized until something reads the file. macOS sets st_flags & SF_DATALESS (0x40000000) on such inodes.

Vykar 0.14.1 doesn't recognize this flag. When the walker encounters a dataless file it open()s and read()s normally, which:

  1. Triggers fileproviderd to start asynchronous hydration
  2. The file's size/blocks/ctime change mid-read as bytes stream in
  3. Vykar's atomicity guard fires and emits warning: skipping file '...': file changed during read

Net effect: dataless files are silently never backed up, while every backup attempt unnecessarily downloads them from the cloud provider, then aborts the chunking. Bandwidth is spent and nothing is captured.

This affects every macOS user with any FileProvider-backed sync enabled — iCloud "Desktop & Documents Folders" alone is one click in System Settings and is on by default for many users.

Reproduction

# Confirm dataless files exist in a typical Mac home
find ~/Documents -flags +dataless -type f | head

# Run backup
vykar backup

The activity log shows hundreds to thousands of "file changed during read" warnings, all on iCloud-managed paths. Verified on a 36,624-file / ~96 GB iCloud-managed ~/Documents and ~/Desktop.

Sample log lines (paths sanitized):

[r2] warning: skipping file '/Users/jbb/Documents/.../Awake Loop.mp3': file changed during read: ...
[r2] warning: skipping file '/Users/jbb/Documents/scryent_logo.pdf': file changed during read: ...
[r2] warning: skipping file '/Users/jbb/Documents/drive.dd': file changed during read: ...

Every one of those files has SF_DATALESS set at walk time.

Proposed fix

1. Detect dataless at walk time

In crates/vykar-core/src/commands/backup/walk/inode_walk.rs, branch on metadata.st_flags() & 0x40000000. The flag is documented in <sys/stat.h> as SF_DATALESS and stable since macOS 10.15.
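
A minimal sketch of what the detection could look like, not Vykar's actual walker code. The names flags_are_dataless and is_dataless are hypothetical; the only real API used is std::os::macos::fs::MetadataExt::st_flags() from the Rust standard library. The flag test is split into a pure function so it can be unit-tested on any platform:

```rust
/// Value of SF_DATALESS as quoted in this issue (from <sys/stat.h>).
pub const SF_DATALESS: u32 = 0x4000_0000;

/// Pure check on a raw `st_flags` value; testable on any platform.
pub fn flags_are_dataless(st_flags: u32) -> bool {
    (st_flags & SF_DATALESS) != 0
}

/// macOS-only wrapper (hypothetical helper): lstat the path without
/// opening it, then test the flag on the returned metadata.
#[cfg(target_os = "macos")]
pub fn is_dataless(path: &std::path::Path) -> std::io::Result<bool> {
    use std::os::macos::fs::MetadataExt;
    let meta = std::fs::symlink_metadata(path)?;
    Ok(flags_are_dataless(meta.st_flags()))
}

/// On other platforms nothing is ever reported as dataless.
#[cfg(not(target_os = "macos"))]
pub fn is_dataless(_path: &std::path::Path) -> std::io::Result<bool> {
    Ok(false)
}
```

Because the check relies only on metadata, the file is never opened, so detection by itself should not trigger hydration.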

2. Propagate from parent snapshot when dataless

This is the most important behavior for skip mode and arguably the whole feature.

When the walker encounters a dataless file, check the parent snapshot for an entry at the same path. If (mtime, size, inode, xattr_hash) match, propagate the ChunkRefs forward into the new snapshot without reading the file. No hydration, no warning, no missing data.

Vykar already has parent-reuse infrastructure (the binary contains the string "built parent reuse index for cold-start fallback"). This extends it with one rule: dataless + identity-match ⇒ reuse.

Net behavior:

| File state | Cache/parent match | Result |
|---|---|---|
| warm (currently materialized) | matched | normal cache hit, no read |
| warm | no match | read + chunk normally |
| dataless | matched in parent | propagate ChunkRefs, no hydration |
| dataless | no parent match | mode-dependent: skip warns, hydrate* reads |
| dataless | parent has different mtime/size | identity changed remotely; must hydrate to capture |

Effect: any file warm during any backup window stays in every subsequent snapshot until its identity actually changes, even if the cloud provider evicts it the next morning. This is the right default — snapshot membership shouldn't oscillate based on local cache state.
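
The dataless rows of the table could be sketched roughly as follows. Everything here is an assumption for illustration: Identity, ParentEntry, DatalessAction and plan_dataless are hypothetical names, and Vec<u64> stands in for Vykar's real ChunkRefs type, whose shape is not known from the outside.

```rust
/// The identity tuple proposed above: (mtime, size, inode, xattr_hash).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Identity {
    pub mtime_ns: i64,
    pub size: u64,
    pub inode: u64,
    pub xattr_hash: u64,
}

/// Hypothetical view of a parent-snapshot entry for one path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParentEntry {
    pub identity: Identity,
    pub chunk_refs: Vec<u64>, // stand-in for Vykar's ChunkRefs
}

#[derive(Debug, PartialEq, Eq)]
pub enum DatalessAction {
    /// Identity matches the parent: carry the chunk refs forward, no read.
    Reuse(Vec<u64>),
    /// No parent entry, or identity differs: defer to the configured mode.
    DeferToMode,
}

/// Decide what to do with a file that is dataless at walk time.
pub fn plan_dataless(current: &Identity, parent: Option<&ParentEntry>) -> DatalessAction {
    match parent {
        Some(p) if p.identity == *current => DatalessAction::Reuse(p.chunk_refs.clone()),
        _ => DatalessAction::DeferToMode,
    }
}
```

Note that ctime is deliberately absent from Identity in this sketch, in line with item 4 below.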

3. Configurable handling

Add a per-source (and global default) setting:

sources:
  - path: /Users/jbb
    one_file_system: true
    dataless: skip            # default — recommended

Modes:

| Mode | Behavior |
|---|---|
| skip (default) | lstat only, log "dataless: cloud-only, skipping <path>" once with a tally, do not open. Snapshot omits these files unless they can be propagated from the parent snapshot (step 2). |
| hydrate | Read normally, accept the wait, leave file warm on disk. |
| hydrate-evict | Read, back up, then call NSFileProviderManager.evictItem(identifier:) to return the file to dataless state once the snapshot commits. |
| stub (future) | Record metadata only (path, size, mtime, xattrs). No contents. Useful for inventory snapshots. |

skip should be the default — it preserves user bandwidth, mirrors the de-facto current behavior, and replaces noisy false-positive warnings with a single clear message.
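
One possible way to model the setting in Rust, as a sketch only. DatalessMode and its helper methods are hypothetical names, parsing is done with std's FromStr rather than whatever config layer Vykar actually uses, and the future stub mode is left out:

```rust
use std::str::FromStr;

/// Hypothetical enum for the proposed `dataless:` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DatalessMode {
    /// Proposed default: never open dataless files.
    #[default]
    Skip,
    Hydrate,
    HydrateEvict,
}

impl FromStr for DatalessMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "skip" => Ok(Self::Skip),
            "hydrate" => Ok(Self::Hydrate),
            "hydrate-evict" => Ok(Self::HydrateEvict),
            other => Err(format!(
                "unknown dataless mode {other:?} (expected skip, hydrate, or hydrate-evict)"
            )),
        }
    }
}

impl DatalessMode {
    /// True for the modes that read file contents and so trigger hydration.
    pub fn reads_contents(self) -> bool {
        matches!(self, Self::Hydrate | Self::HydrateEvict)
    }

    /// True only for the mode that returns files to dataless state afterwards.
    pub fn evicts_after(self) -> bool {
        matches!(self, Self::HydrateEvict)
    }
}
```

Rejecting unknown strings outright, rather than falling back to skip, keeps a typo in the config from silently changing what ends up in a snapshot.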

4. File cache must tolerate SF_DATALESS ctime churn

Vykar's file cache key currently appears to include ctime (the binary contains the string "parent snapshot lacks ctime on filesystem files, skipping parent fallback"). Eviction toggles SF_DATALESS, which bumps ctime even though logical content is unchanged.

Without a fix, hydrate-evict would re-hydrate every file on every run — defeating the whole point.

Proposal: when SF_DATALESS is currently set, or the previous snapshot recorded the file as dataless, match cache entries on (mtime, size, inode, xattr_hash) only. ctime is unreliable across hydration transitions.
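
A sketch of the relaxed comparison, assuming a cache key shaped like the struct below. CacheKey and cache_entry_matches are hypothetical names; the real cache layout in Vykar may well differ.

```rust
/// Hypothetical cache key; assumes the current key includes ctime.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheKey {
    pub mtime_ns: i64,
    pub ctime_ns: i64,
    pub size: u64,
    pub inode: u64,
    pub xattr_hash: u64,
}

/// Compare a cached key with the current one. ctime is ignored whenever
/// the file is dataless now or was recorded as dataless in the previous
/// snapshot, because hydration and eviction bump ctime without changing
/// the logical content.
pub fn cache_entry_matches(
    cached: &CacheKey,
    current: &CacheKey,
    dataless_now: bool,
    dataless_in_parent: bool,
) -> bool {
    let stable_fields_match = cached.mtime_ns == current.mtime_ns
        && cached.size == current.size
        && cached.inode == current.inode
        && cached.xattr_hash == current.xattr_hash;

    if dataless_now || dataless_in_parent {
        stable_fields_match
    } else {
        stable_fields_match && cached.ctime_ns == current.ctime_ns
    }
}
```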

5. Improve the "file changed during read" message

Even outside this feature, the error should hint at SF_DATALESS when present at retry time, e.g.:

file changed during read: /Users/jbb/Documents/foo.pdf (cloud-only file, hydration in progress)
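
As a small illustration, the hint could be attached at the point where the message is formatted. The function name and signature are hypothetical; dataless_at_retry is assumed to come from re-checking SF_DATALESS when the guard fires.

```rust
/// Build the warning text, adding a cloud-only hint when the flag is
/// still set at retry time.
pub fn changed_during_read_message(path: &str, dataless_at_retry: bool) -> String {
    if dataless_at_retry {
        format!("file changed during read: {path} (cloud-only file, hydration in progress)")
    } else {
        format!("file changed during read: {path}")
    }
}
```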

Implementation notes / edge cases

Detection scope

  • SF_DATALESS is FileProvider-generic. Covers iCloud Drive, Dropbox, OneDrive, Box, Google Drive (current .gfile model). Not iCloud-only.
  • Old Google Drive File Stream .gdoc placeholder shortcuts are a different mechanism — out of scope here.
  • Symlinks: lstat on the link doesn't reflect the target's flag. If the walker dereferences, stat the target.
  • Hard links: same inode reached via multiple paths — only hydrate once. Existing inode-dedup logic should already cover this.

Eviction safety (hydrate-evict)

  • Track which files vykar itself hydrated this run. Only evict that set. Never evict a file that was already warm — the user may be working on it.
  • Re-stat each tracked file just before evicting: if mtime advanced since hydration, the user edited the file mid-backup — skip eviction.
  • Respect user pins ("Keep Downloaded"). NSFileProviderManager.evictItem is expected to honor NSFileProviderItemCapabilities / pin state, but worth verifying and surfacing per-file errors.
  • Eviction is async and best-effort. Failures (file in use, fileprovider busy, item not found) should log and continue, not fail the backup.
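
The first two bullets above (only evict what this run hydrated, and re-stat before evicting) could be tracked with something like the following. HydrationLog is a hypothetical type, and the actual eviction call through the Objective-C / Swift bridge is out of scope for this sketch. It is slightly stricter than the bullet: any mtime change, not just an advance, blocks eviction.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Records files that this backup run hydrated, keyed by path, with the
/// mtime (in nanoseconds) observed right after hydration.
#[derive(Default)]
pub struct HydrationLog {
    hydrated: HashMap<PathBuf, i64>,
}

impl HydrationLog {
    /// Call once for each file that was dataless before this run read it.
    pub fn record(&mut self, path: impl Into<PathBuf>, mtime_ns: i64) {
        self.hydrated.insert(path.into(), mtime_ns);
    }

    /// Eviction is allowed only if this run hydrated the file and its
    /// mtime is unchanged at re-stat time. Files that were already warm
    /// are never in the map, so they are never evicted.
    pub fn may_evict(&self, path: &Path, mtime_now_ns: i64) -> bool {
        match self.hydrated.get(path) {
            Some(&recorded) => recorded == mtime_now_ns,
            None => false,
        }
    }
}
```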

Disk + bandwidth

  • Pre-flight free-space check vs total dataless bytes in hydrate* modes. Refuse with a clear error rather than ENOSPC mid-backup.
  • A concurrency knob (hydrate_parallel: N) helps avoid head-of-line blocking on a single multi-GB file.
  • Optional future: skip-on-metered-network (macOS exposes reachability flags).
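
The pre-flight check in the first bullet could be as simple as the sketch below. The names are hypothetical, reserve_bytes is an added safety margin that is not part of the proposal, and the free-space figure is assumed to be obtained elsewhere (for example via statvfs).

```rust
/// Returned when hydrating everything would not fit on disk.
#[derive(Debug, PartialEq, Eq)]
pub struct InsufficientSpace {
    pub needed: u64,
    pub available: u64,
}

/// Sum the sizes of all dataless files scheduled for hydration and
/// compare against free space. Saturating arithmetic avoids overflow
/// on pathological inputs.
pub fn preflight_hydration(
    dataless_sizes: &[u64],
    free_bytes: u64,
    reserve_bytes: u64,
) -> Result<u64, InsufficientSpace> {
    let needed = dataless_sizes
        .iter()
        .fold(0u64, |acc, &size| acc.saturating_add(size))
        .saturating_add(reserve_bytes);

    if needed <= free_bytes {
        Ok(needed)
    } else {
        Err(InsufficientSpace { needed, available: free_bytes })
    }
}
```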

Restore semantics

  • Restoring a snapshot of a previously-dataless file produces a normal warm file on disk — not dataless. The user must re-sync via iCloud / their provider if they want it cloud-only again. Document in restore docs.
  • FileProvider xattrs (com.apple.fileprovider.fpfs#P etc.) are present on dataless files. Probably worth stripping on restore — provider-domain context is stale once the snapshot is rehomed.

Snapshot consistency

  • If a file changes between hydrate and evict (e.g. cloud delivers an update mid-backup), the snapshot reflects the moment-of-read state. Acceptable; backup is point-in-time.

Why this matters

This is the difference between vykar quietly missing tens of gigabytes of user data on a default macOS install vs. providing real defense-in-depth backup that's independent of the cloud provider. The detection is one bit; the rest is opt-in mode.

Happy to PR if there's interest. macOS-only code can sit behind #[cfg(target_os = "macos")]; eviction needs a small Objective-C / Swift bridge for NSFileProviderManager.

I'm not certain the specific shape proposed above is the right one — the modes, the parent-propagation semantics, and the eviction tracking all involve tradeoffs I haven't fully thought through. Mostly hoping to stimulate discussion on what the right behavior should look like, since the current behavior (silently dropping dataless files while burning bandwidth re-attempting them) is clearly a bug regardless of what replaces it.
