macOS: handle dataless (cloud-only) files instead of failing with "file changed during read"
Summary
On macOS, files backed by Apple's FileProvider framework (iCloud Drive, Dropbox, OneDrive, Box, new Google Drive, etc.) can exist on disk as dataless placeholders: stat() returns correct metadata, but the data blocks aren't materialized until something reads the file. macOS marks such inodes by setting the SF_DATALESS bit (0x40000000) in st_flags.
Vykar 0.14.1 doesn't recognize this flag. When the walker encounters a dataless file it open()s and read()s normally, with these consequences:
- The read triggers fileproviderd to start asynchronous hydration
- The file's size/blocks/ctime change mid-read as bytes stream in
- Vykar's atomicity guard fires and emits
warning: skipping file '...': file changed during read
Net effect: dataless files are silently never backed up, while every backup attempt unnecessarily downloads them from the cloud provider, then aborts the chunking. Bandwidth is spent and nothing is captured.
This affects every macOS user with any FileProvider-backed sync enabled — iCloud "Desktop & Documents Folders" alone is one click in System Settings and is on by default for many users.
Reproduction
# Confirm dataless files exist in a typical Mac home
find ~/Documents -flags +dataless -type f | head
# Run backup
vykar backup
Activity log shows hundreds to thousands of "file changed during read" warnings, all on iCloud-managed paths. Verified on a 36,624-file / ~96 GB iCloud-managed ~/Documents and ~/Desktop.
Sample log lines (paths sanitized):
[r2] warning: skipping file '/Users/jbb/Documents/.../Awake Loop.mp3': file changed during read: ...
[r2] warning: skipping file '/Users/jbb/Documents/scryent_logo.pdf': file changed during read: ...
[r2] warning: skipping file '/Users/jbb/Documents/drive.dd': file changed during read: ...
Every one of those files has SF_DATALESS set at walk time.
Proposed fix
1. Detect dataless at walk time
In crates/vykar-core/src/commands/backup/walk/inode_walk.rs, branch on metadata.st_flags() & 0x40000000. The flag is documented in <sys/stat.h> as SF_DATALESS and stable since macOS 10.15.
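A minimal sketch of the detection, assuming the walker already holds a std::fs::Metadata for each entry. The names flags_are_dataless and is_dataless are hypothetical and not existing Vykar functions; the constant value is the one quoted above from <sys/stat.h>.

```rust
use std::fs::Metadata;

/// SF_DATALESS as quoted from <sys/stat.h> on macOS.
pub const SF_DATALESS: u32 = 0x4000_0000;

/// Pure check on a raw st_flags value, so it can be unit-tested on any platform.
pub fn flags_are_dataless(st_flags: u32) -> bool {
    (st_flags & SF_DATALESS) != 0
}

/// macOS: read st_flags from the metadata and test the SF_DATALESS bit.
#[cfg(target_os = "macos")]
pub fn is_dataless(meta: &Metadata) -> bool {
    use std::os::macos::fs::MetadataExt;
    flags_are_dataless(meta.st_flags())
}

/// Other platforms have no dataless concept, so nothing is ever reported.
#[cfg(not(target_os = "macos"))]
pub fn is_dataless(_meta: &Metadata) -> bool {
    false
}
```

Keeping the bit test separate from the platform-specific accessor means the non-macOS build needs only the trivial fallback.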
2. Propagate from parent snapshot when dataless
This is the most important behavior for skip mode and arguably the whole feature.
When the walker encounters a dataless file, check the parent snapshot for an entry at the same path. If (mtime, size, inode, xattr_hash) match, propagate the ChunkRefs forward into the new snapshot without reading the file. No hydration, no warning, no missing data.
Vykar already has parent-reuse infrastructure (the binary contains the string "built parent reuse index for cold-start fallback"). This extends it with one rule: dataless + identity-match ⇒ reuse.
Net behavior:
| File state | Cache/parent match | Result |
| --- | --- | --- |
| warm (currently materialized) | matched | normal cache hit, no read |
| warm | no match | read + chunk normally |
| dataless | matched in parent | propagate ChunkRefs, no hydration |
| dataless | no parent match | mode-dependent: skip warns, hydrate* reads |
| dataless | parent has different mtime/size | identity changed remotely; must hydrate to capture |
Effect: any file warm during any backup window stays in every subsequent snapshot until its identity actually changes, even if the cloud provider evicts it the next morning. This is the right default — snapshot membership shouldn't oscillate based on local cache state.
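The decision table can be sketched as a single pure function. This is only an illustration: Identity, Action, and decide are hypothetical names, the field types are guesses, and the cache and parent snapshot are collapsed into one optional "previous" entry. The table leaves open what skip mode should do when the parent entry exists but differs; this sketch assumes it is treated like "no parent match" and stays mode-dependent.

```rust
/// Identity tuple from the proposal; field types are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Identity {
    pub mtime_ns: i64,
    pub size: u64,
    pub inode: u64,
    pub xattr_hash: u64,
}

/// What the walker does with one file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Carry the previous ChunkRefs forward without opening the file.
    ReuseChunks,
    /// Open, read, and chunk (hydrating first if the file is dataless).
    ReadAndChunk,
    /// Leave the file out of this snapshot and log it.
    SkipWithWarning,
}

/// `previous` is the cache or parent-snapshot entry at the same path, if any.
/// `may_hydrate` is true for the hydrate* modes and false for skip.
pub fn decide(
    dataless: bool,
    current: Identity,
    previous: Option<Identity>,
    may_hydrate: bool,
) -> Action {
    let matched = previous == Some(current);
    match (dataless, matched) {
        // Identity match: reuse regardless of warm or dataless state.
        (_, true) => Action::ReuseChunks,
        // Warm file with no match: read normally.
        (false, false) => Action::ReadAndChunk,
        // Dataless with no match: only read if the mode allows hydration.
        (true, false) if may_hydrate => Action::ReadAndChunk,
        (true, false) => Action::SkipWithWarning,
    }
}
```

Note that the first arm is what keeps snapshot membership stable: an evicted file with an unchanged identity never reaches the mode-dependent arms.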
3. Configurable handling
Add a per-source (and global default) setting:
sources:
  - path: /Users/jbb
    one_file_system: true
    dataless: skip  # default, recommended
Modes:
| Mode | Behavior |
| --- | --- |
| skip (default) | lstat only, log "dataless: cloud-only, skipping <path>" once with a tally, do not open. Snapshot omits these files unless they were propagated from the parent snapshot (step 2). |
| hydrate | Read normally, accept the wait, leave file warm on disk. |
| hydrate-evict | Read, back up, then call NSFileProviderManager.evictItem(identifier:) to return the file to dataless state once the snapshot commits. |
| stub (future) | Record metadata only (path, size, mtime, xattrs). No contents. Useful for inventory snapshots. |
skip should be the default — it preserves user bandwidth, mirrors the de-facto current behavior, and replaces noisy false-positive warnings with a single clear message.
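One way the setting could be modeled, as a rough sketch. DatalessMode, effective_mode, and the helper methods are hypothetical names; only the four mode strings and the "per-source overrides global, default is skip" rule come from the proposal above. How Vykar actually parses its config is not assumed here, so the sketch only covers string-to-mode conversion.

```rust
use std::str::FromStr;

/// The four proposed modes; `Skip` is the proposed default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DatalessMode {
    #[default]
    Skip,
    Hydrate,
    HydrateEvict,
    Stub,
}

impl FromStr for DatalessMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "skip" => Ok(Self::Skip),
            "hydrate" => Ok(Self::Hydrate),
            "hydrate-evict" => Ok(Self::HydrateEvict),
            "stub" => Ok(Self::Stub),
            other => Err(format!("unknown dataless mode: {}", other)),
        }
    }
}

impl DatalessMode {
    /// Only the hydrate* modes are allowed to open a dataless file.
    pub fn may_open(self) -> bool {
        matches!(self, Self::Hydrate | Self::HydrateEvict)
    }

    /// Only hydrate-evict returns files to dataless state afterwards.
    pub fn evicts_after_commit(self) -> bool {
        self == Self::HydrateEvict
    }
}

/// Per-source setting wins, then the global default, then `Skip`.
pub fn effective_mode(
    per_source: Option<DatalessMode>,
    global: Option<DatalessMode>,
) -> DatalessMode {
    per_source.or(global).unwrap_or_default()
}
```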
4. File cache must tolerate SF_DATALESS ctime churn
Vykar's file cache key currently appears to include ctime (the binary contains the string "parent snapshot lacks ctime on filesystem files, skipping parent fallback"). Eviction toggles SF_DATALESS, which bumps ctime even though logical content is unchanged.
Without a fix, hydrate-evict would re-hydrate every file on every run — defeating the whole point.
Proposal: when SF_DATALESS is currently set, or the previous snapshot recorded the file as dataless, match cache entries on (mtime, size, inode, xattr_hash) only. ctime is unreliable across hydration transitions.
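A sketch of the relaxed matching rule, under the assumption that a cache entry stores the fields below. FileStamp and cache_hit are hypothetical names, and the real cache layout in Vykar is unknown to me.

```rust
/// What is known about a file at one point in time; the layout is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileStamp {
    pub mtime_ns: i64,
    pub ctime_ns: i64,
    pub size: u64,
    pub inode: u64,
    pub xattr_hash: u64,
    /// SF_DATALESS was set when this stamp was taken.
    pub dataless: bool,
}

/// Returns true when `cached` can be reused for the file observed as `now`.
pub fn cache_hit(cached: &FileStamp, now: &FileStamp) -> bool {
    let identity_matches = cached.mtime_ns == now.mtime_ns
        && cached.size == now.size
        && cached.inode == now.inode
        && cached.xattr_hash == now.xattr_hash;

    // ctime is ignored when either side of the comparison was dataless,
    // because toggling SF_DATALESS bumps ctime without changing content.
    let ignore_ctime = cached.dataless || now.dataless;

    identity_matches && (ignore_ctime || cached.ctime_ns == now.ctime_ns)
}
```

For files that were warm on both runs the check stays as strict as before, so the relaxation only applies across hydration transitions.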
5. Improve the "file changed during read" message
Even outside this feature, the error should hint at SF_DATALESS when present at retry time, e.g.:
file changed during read: /Users/jbb/Documents/foo.pdf (cloud-only file, hydration in progress)
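As a small illustration, the hint could be appended where the warning is formatted. The function name is hypothetical, and the dataless_at_retry flag is assumed to come from a fresh stat taken when the guard fires.

```rust
use std::path::Path;

/// Builds the warning text, adding a hint when the file is cloud-only.
pub fn changed_during_read_message(path: &Path, dataless_at_retry: bool) -> String {
    let mut msg = format!("file changed during read: {}", path.display());
    if dataless_at_retry {
        msg.push_str(" (cloud-only file, hydration in progress)");
    }
    msg
}
```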
Implementation notes / edge cases
Detection scope
- SF_DATALESS is FileProvider-generic. Covers iCloud Drive, Dropbox, OneDrive, Box, Google Drive (current .gfile model). Not iCloud-only.
- Old Google Drive File Stream .gdoc placeholder shortcuts are a different mechanism and out of scope here.
- Symlinks: lstat on the link doesn't reflect the target's flag. If the walker dereferences, stat the target.
- Hard links: same inode reached via multiple paths, so only hydrate once. Existing inode-dedup logic should already cover this.
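For the symlink case, the point is just which metadata call feeds the flag check. A sketch, assuming the walker knows whether it follows symlinks; the function and parameter names are hypothetical.

```rust
use std::fs::{self, Metadata};
use std::io;
use std::path::Path;

/// Picks the metadata whose st_flags should be tested for SF_DATALESS.
pub fn metadata_for_dataless_check(
    path: &Path,
    walker_follows_symlinks: bool,
) -> io::Result<Metadata> {
    if walker_follows_symlinks {
        // stat(): follows the link, so the flags describe the target.
        fs::metadata(path)
    } else {
        // lstat(): describes the link itself, which is never read as file data.
        fs::symlink_metadata(path)
    }
}
```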
Eviction safety (hydrate-evict)
- Track which files vykar itself hydrated this run. Only evict that set. Never evict a file that was already warm — the user may be working on it.
- Re-stat each tracked file just before evicting: if mtime advanced since hydration, the user edited the file mid-backup, so skip eviction.
- Respect user pins ("Keep Downloaded"). NSFileProviderManager.evictItem is expected to honor NSFileProviderItemCapabilities / pin state, but this is worth verifying, and per-file errors should be surfaced.
- Eviction is async and best-effort. Failures (file in use, fileprovider busy, item not found) should log and continue, not fail the backup.
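The tracking and re-stat rules above could look roughly like this. HydrationLog and its methods are hypothetical names; the actual evictItem call is deliberately left out, since it needs the Objective-C / Swift bridge and is not something this sketch can show.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Files that this backup run hydrated, with the mtime seen right after reading.
#[derive(Debug, Default)]
pub struct HydrationLog {
    hydrated: HashMap<PathBuf, SystemTime>,
}

impl HydrationLog {
    /// Call only for files that were dataless before this run read them.
    pub fn record(&mut self, path: &Path, mtime_after_read: SystemTime) {
        self.hydrated.insert(path.to_path_buf(), mtime_after_read);
    }

    /// True only for files this run hydrated whose mtime has not advanced.
    pub fn should_evict(&self, path: &Path, mtime_now: SystemTime) -> bool {
        match self.hydrated.get(path) {
            Some(recorded) => mtime_now <= *recorded,
            // Never tracked means it was already warm: leave it alone.
            None => false,
        }
    }

    /// Re-stats every tracked file; stat failures are skipped, not fatal.
    pub fn eviction_candidates(&self) -> Vec<PathBuf> {
        self.hydrated
            .keys()
            .filter(|p| match fs::metadata(p.as_path()).and_then(|m| m.modified()) {
                Ok(mtime_now) => self.should_evict(p.as_path(), mtime_now),
                Err(_) => false,
            })
            .cloned()
            .collect()
    }
}
```

Pin state is not modeled here; whether the provider rejects eviction of pinned items still needs to be verified against the real API.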
Disk + bandwidth
- Pre-flight free-space check vs total dataless bytes in hydrate* modes. Refuse with a clear error rather than ENOSPC mid-backup.
- A concurrency knob (hydrate_parallel: N) helps avoid head-of-line blocking on a single multi-GB file.
- Optional future: skip-on-metered-network (macOS exposes reachability flags).
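The pre-flight check is simple arithmetic once the inputs are known. In this sketch the free-space figure is passed in, because obtaining it needs a platform call (for example statvfs) that is outside std; preflight_hydration, Preflight, and the reserve_bytes safety margin are all hypothetical.

```rust
/// Result of comparing bytes to hydrate against usable free space.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Preflight {
    Fits,
    Insufficient { needed: u64, available: u64 },
}

/// `dataless_sizes`: st_size of every dataless file the run would hydrate.
/// `free_bytes`: free space on the volume, obtained by the caller.
/// `reserve_bytes`: headroom to keep free (an assumed knob, not in the proposal).
pub fn preflight_hydration(
    dataless_sizes: &[u64],
    free_bytes: u64,
    reserve_bytes: u64,
) -> Preflight {
    // Saturating arithmetic avoids overflow on absurd inputs.
    let needed = dataless_sizes
        .iter()
        .fold(0u64, |acc, size| acc.saturating_add(*size));
    let available = free_bytes.saturating_sub(reserve_bytes);

    if needed <= available {
        Preflight::Fits
    } else {
        Preflight::Insufficient { needed, available }
    }
}
```

Summing all dataless bytes is the conservative choice for hydrate-evict too, since eviction only happens once the snapshot commits.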
Restore semantics
- Restoring a snapshot of a previously-dataless file produces a normal warm file on disk — not dataless. The user must re-sync via iCloud / their provider if they want it cloud-only again. Document in restore docs.
- FileProvider xattrs (com.apple.fileprovider.fpfs#P etc.) are present on dataless files. Probably worth stripping on restore, since provider-domain context is stale once the snapshot is rehomed.
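If stripping turns out to be the right call, it could be a name filter applied before xattrs are written back. This sketch assumes every FileProvider xattr shares the "com.apple.fileprovider." prefix, which is a guess extrapolated from the one name given above; restorable_xattrs is a hypothetical helper.

```rust
/// Assumed common prefix of FileProvider-owned xattrs (unverified).
pub const FILEPROVIDER_XATTR_PREFIX: &str = "com.apple.fileprovider.";

/// Returns the xattr names that should be written back on restore.
pub fn restorable_xattrs<'a>(names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| !name.starts_with(FILEPROVIDER_XATTR_PREFIX))
        .collect()
}
```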
Snapshot consistency
- If a file changes between hydrate and evict (e.g. cloud delivers an update mid-backup), the snapshot reflects the moment-of-read state. Acceptable; backup is point-in-time.
Why this matters
This is the difference between vykar quietly missing tens of gigabytes of user data on a default macOS install vs. providing real defense-in-depth backup that's independent of the cloud provider. The detection is one bit; the rest is opt-in mode.
Happy to PR if there's interest. macOS-only code can sit behind #[cfg(target_os = "macos")]; eviction needs a small Objective-C / Swift bridge for NSFileProviderManager.
I'm not certain the specific shape proposed above is the right one — the modes, the parent-propagation semantics, and the eviction tracking all involve tradeoffs I haven't fully thought through. Mostly hoping to stimulate discussion on what the right behavior should look like, since the current behavior (silently dropping dataless files while burning bandwidth re-attempting them) is clearly a bug regardless of what replaces it.
References
- <sys/stat.h>: #define SF_DATALESS 0x40000000 ("file is dataless object")
- NSFileProviderManager.evictItem(identifier:completionHandler:) (macOS 11+)
- find -flags +dataless enumerates affected files