macOS: handle dataless (cloud-only) files instead of failing with 'file changed during read'

# macOS: handle dataless (cloud-only) files instead of failing with "file changed during read"

## Summary

On macOS, files backed by Apple's FileProvider framework (iCloud Drive, Dropbox, OneDrive, Box, the current Google Drive client, etc.) can exist on disk as **dataless placeholders**: `stat()` returns correct metadata, but the data blocks aren't materialized until something reads the file. macOS marks such inodes by setting the `SF_DATALESS` bit (`0x40000000`) in `st_flags`.

Vykar 0.14.1 doesn't recognize this flag. When the walker encounters a dataless file, it `open()`s and `read()`s it normally, with the following result:

1. The read triggers `fileproviderd` to start asynchronous hydration.
2. The file's size/blocks/ctime change mid-read as bytes stream in.
3. Vykar's atomicity guard fires and emits `warning: skipping file '...': file changed during read`.

Net effect: **dataless files are silently never backed up**, while every backup attempt unnecessarily downloads them from the cloud provider, then aborts the chunking. Bandwidth is spent and nothing is captured.

This affects every macOS user with any FileProvider-backed sync enabled — iCloud "Desktop & Documents Folders" alone is one click in System Settings and is on by default for many users.

## Reproduction

```bash
# Confirm dataless files exist in a typical Mac home
find ~/Documents -flags +dataless -type f | head

# Run backup
vykar backup
```

Activity log shows hundreds-to-thousands of `file changed during read` warnings, all on iCloud-managed paths. Verified on a 36,624-file / ~96 GB iCloud-managed `~/Documents` and `~/Desktop`.

Sample log lines (paths sanitized):

```
[r2] warning: skipping file '/Users/jbb/Documents/.../Awake Loop.mp3': file changed during read: ...
[r2] warning: skipping file '/Users/jbb/Documents/scryent_logo.pdf': file changed during read: ...
[r2] warning: skipping file '/Users/jbb/Documents/drive.dd': file changed during read: ...
```

Every one of those files has `SF_DATALESS` set at walk time.

## Proposed fix

### 1. Detect dataless at walk time

In `crates/vykar-core/src/commands/backup/walk/inode_walk.rs`, branch on `metadata.st_flags() & 0x40000000`. The flag is documented in `<sys/stat.h>` as `SF_DATALESS` and stable since macOS 10.15.
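A minimal sketch of that check, assuming the walker already holds a `std::fs::Metadata` for the entry. The constant value and `st_flags()` (from `std::os::macos::fs::MetadataExt`) come from the description above; the function names `flags_are_dataless` and `is_dataless` are hypothetical and are not taken from vykar's code.

```rust
use std::fs::Metadata;

/// Value of `SF_DATALESS` from macOS `<sys/stat.h>`.
pub const SF_DATALESS: u32 = 0x4000_0000;

/// Pure bit test, kept separate so it can be unit-tested on any platform.
pub fn flags_are_dataless(st_flags: u32) -> bool {
    (st_flags & SF_DATALESS) != 0
}

/// macOS: read `st_flags` from metadata the walker has already fetched.
#[cfg(target_os = "macos")]
pub fn is_dataless(metadata: &Metadata) -> bool {
    use std::os::macos::fs::MetadataExt;
    flags_are_dataless(metadata.st_flags())
}

/// Other platforms have no dataless flag, so this always reports false.
#[cfg(not(target_os = "macos"))]
pub fn is_dataless(_metadata: &Metadata) -> bool {
    false
}
```

Because the check only inspects metadata, it never opens the file and so should not trigger hydration by itself.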

### 2. Propagate from parent snapshot when dataless

This is the most important behavior for `skip` mode and arguably the whole feature.

When the walker encounters a dataless file, check the parent snapshot for an entry at the same path. If `(mtime, size, inode, xattr_hash)` match, **propagate the ChunkRefs forward into the new snapshot without reading the file**. No hydration, no warning, no missing data.

Vykar already has parent-reuse infrastructure (the binary contains `built parent reuse index for cold-start fallback`). This extends it with one rule: dataless + identity-match ⇒ reuse.

Net behavior:

| File state | Cache/parent match | Result |
|---|---|---|
| warm (currently materialized) | matched | normal cache hit, no read |
| warm | no match | read + chunk normally |
| dataless | matched in parent | propagate ChunkRefs, **no hydration** |
| dataless | no parent match | mode-dependent: `skip` warns, `hydrate*` reads |
| dataless | parent has different mtime/size | identity changed remotely — must hydrate to capture |

Effect: any file warm during *any* backup window stays in every subsequent snapshot until its identity actually changes, even if the cloud provider evicts it the next morning. This is the right default — snapshot membership shouldn't oscillate based on local cache state.
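The dataless rows of the table can be sketched as a small decision function. This is an illustration only: `Identity`, `Action`, and `decide_dataless` are hypothetical names, and the sketch assumes one reading of the table in which a parent entry with a different identity is handled like a missing parent entry (mode-dependent).

```rust
/// Identity tuple compared against the parent snapshot (ctime is not part of it).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Identity {
    pub mtime_ns: i128,
    pub size: u64,
    pub inode: u64,
    pub xattr_hash: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Reuse the parent's ChunkRefs; the file is never opened.
    PropagateFromParent,
    /// Open and chunk the file, which hydrates it.
    ReadAndChunk,
    /// Leave the file out of this snapshot and count it in the skip tally.
    SkipWithWarning,
}

/// Decide what to do with a file that is dataless at walk time.
/// `hydrate_allowed` is true for the `hydrate*` modes and false for `skip`.
pub fn decide_dataless(
    current: &Identity,
    parent: Option<&Identity>,
    hydrate_allowed: bool,
) -> Action {
    match parent {
        // Same identity as in the parent snapshot: no read, no hydration.
        Some(p) if p == current => Action::PropagateFromParent,
        // No parent entry, or the identity changed remotely.
        _ if hydrate_allowed => Action::ReadAndChunk,
        _ => Action::SkipWithWarning,
    }
}
```

Warm files are left out on purpose, since they go through the existing cache and read paths.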

### 3. Configurable handling

Add a per-source (and global default) setting:

```yaml
sources:
  - path: /Users/jbb
    one_file_system: true
    dataless: skip            # default — recommended
```

Modes:

| Mode             | Behavior |
|------------------|----------|
| `skip` *(default)* | `lstat` only; do not open. Log `dataless: cloud-only, skipping <path>` once with a tally. Snapshot omits these files unless their ChunkRefs can be propagated from the parent snapshot (section 2). |
| `hydrate`        | Read normally, accept the wait, leave file warm on disk. |
| `hydrate-evict`  | Read, back up, then call `NSFileProviderManager.evictItem(identifier:completionHandler:)` to return the file to dataless state once the snapshot commits. |
| `stub` *(future)* | Record metadata only (path, size, mtime, xattrs). No contents. Useful for inventory snapshots. |

`skip` should be the default — it preserves user bandwidth, mirrors the de-facto current behavior, and replaces noisy false-positive warnings with a single clear message.
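One way the setting could be modelled, as a sketch under assumptions: the enum `DatalessMode` and its helper `may_hydrate` are hypothetical names, and how vykar actually deserializes its YAML config is not known here, so the example uses only `std::str::FromStr`. The future `stub` mode is deliberately left out and is rejected as unknown.

```rust
use std::str::FromStr;

/// Per-source handling of dataless files; `skip` is the proposed default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DatalessMode {
    #[default]
    Skip,
    Hydrate,
    HydrateEvict,
}

impl DatalessMode {
    /// Whether this mode is allowed to open (and therefore hydrate) a dataless file.
    pub fn may_hydrate(self) -> bool {
        matches!(self, DatalessMode::Hydrate | DatalessMode::HydrateEvict)
    }
}

impl FromStr for DatalessMode {
    type Err = String;

    /// Parse the string values used in the YAML example above.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "skip" => Ok(DatalessMode::Skip),
            "hydrate" => Ok(DatalessMode::Hydrate),
            "hydrate-evict" => Ok(DatalessMode::HydrateEvict),
            other => Err(format!("unknown dataless mode: {other:?}")),
        }
    }
}
```

Rejecting unknown values at config-load time means a typo such as `hydrate_evict` would fail loudly instead of silently falling back to `skip`.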

### 4. File cache must tolerate `SF_DATALESS` ctime churn

Vykar's file cache key currently appears to include ctime (the binary contains `parent snapshot lacks ctime on filesystem files, skipping parent fallback`). Eviction toggles `SF_DATALESS`, which bumps ctime even though logical content is unchanged.

Without a fix, `hydrate-evict` would re-hydrate every file on every run — defeating the whole point.

Proposal: when `SF_DATALESS` is currently set, or the previous snapshot recorded the file as dataless, match cache entries on `(mtime, size, inode, xattr_hash)` only. ctime is unreliable across hydration transitions.
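A sketch of the relaxed comparison, assuming the cache key really does carry ctime (inferred above from a string in the binary, not confirmed). `CacheKey` and `cache_entry_matches` are hypothetical names, not vykar's actual types.

```rust
/// Hypothetical cache key; the real layout in vykar is not known.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CacheKey {
    pub mtime_ns: i128,
    pub ctime_ns: i128,
    pub size: u64,
    pub inode: u64,
    pub xattr_hash: u64,
}

/// Compare a cached entry with the current stat result.
/// ctime is ignored when the file is dataless now or was recorded as
/// dataless in the previous snapshot, because hydration and eviction
/// toggle `SF_DATALESS` and bump ctime without changing content.
pub fn cache_entry_matches(
    cached: &CacheKey,
    current: &CacheKey,
    dataless_now: bool,
    dataless_in_parent: bool,
) -> bool {
    let identity_matches = cached.mtime_ns == current.mtime_ns
        && cached.size == current.size
        && cached.inode == current.inode
        && cached.xattr_hash == current.xattr_hash;

    if dataless_now || dataless_in_parent {
        identity_matches
    } else {
        identity_matches && cached.ctime_ns == current.ctime_ns
    }
}
```

Files that were never dataless keep the strict comparison, so the relaxation stays limited to hydration transitions.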

### 5. Improve the "file changed during read" message

Even outside this feature, the error should hint at `SF_DATALESS` when present at retry time, e.g.:

```
file changed during read: /Users/jbb/Documents/foo.pdf (cloud-only file, hydration in progress)
```
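A small sketch of how the hint could be appended, assuming the caller re-checks `SF_DATALESS` at retry time and passes the result in. The function name `changed_during_read_message` is hypothetical.

```rust
use std::path::Path;

/// Build the "file changed during read" message, adding a hint when the
/// file is still flagged dataless at retry time.
pub fn changed_during_read_message(path: &Path, dataless_at_retry: bool) -> String {
    let mut msg = format!("file changed during read: {}", path.display());
    if dataless_at_retry {
        msg.push_str(" (cloud-only file, hydration in progress)");
    }
    msg
}
```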

## Implementation notes / edge cases

**Detection scope**
- `SF_DATALESS` is FileProvider-generic. Covers iCloud Drive, Dropbox, OneDrive, Box, Google Drive (current `.gfile` model). Not iCloud-only.
- Old Google Drive File Stream `.gdoc` placeholder shortcuts are a different mechanism — out of scope here.
- Symlinks: `lstat` on the link doesn't reflect the target's flag. If the walker dereferences, stat the target.
- Hard links: same inode reached via multiple paths — only hydrate once. Existing inode-dedup logic should already cover this.

**Eviction safety (`hydrate-evict`)**
- Track which files vykar itself hydrated this run. Only evict that set. Never evict a file that was already warm — the user may be working on it.
- Re-stat each tracked file just before evicting: if `mtime` advanced since hydration, the user edited the file mid-backup — skip eviction.
- Respect user pins ("Keep Downloaded"). `NSFileProviderManager.evictItem` is expected to honor `NSFileProviderItemCapabilities` / pin state, but worth verifying and surfacing per-file errors.
- Eviction is async and best-effort. Failures (file in use, fileprovider busy, item not found) should log and continue, not fail the backup.
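The first two safety rules above can be sketched as a tracker that only ever offers files this run hydrated and drops any whose `mtime` has advanced. `HydrationTracker` and its methods are hypothetical names; the stat lookup is passed in as a closure so the logic can be exercised without touching the filesystem, and the actual `NSFileProviderManager` call is not shown.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Remembers which files this backup run hydrated, and their mtime at that point.
#[derive(Default)]
pub struct HydrationTracker {
    hydrated: HashMap<PathBuf, SystemTime>,
}

impl HydrationTracker {
    /// Record a file that was dataless at walk time and was read by this run.
    pub fn record(&mut self, path: &Path, mtime_after_read: SystemTime) {
        self.hydrated.insert(path.to_path_buf(), mtime_after_read);
    }

    /// Paths that are safe to evict: hydrated by this run, and with an mtime
    /// that has not advanced since. `current_mtime` re-stats a path and
    /// returns `None` if the stat fails.
    pub fn evictable<F>(&self, current_mtime: F) -> Vec<PathBuf>
    where
        F: Fn(&Path) -> Option<SystemTime>,
    {
        let mut out = Vec::new();
        for (path, recorded) in &self.hydrated {
            match current_mtime(path.as_path()) {
                Some(now) if now <= *recorded => out.push(path.clone()),
                // Edited since hydration, or stat failed: leave the file warm.
                _ => {}
            }
        }
        out.sort();
        out
    }
}
```

In real use the closure could be something like `|p| std::fs::metadata(p).and_then(|m| m.modified()).ok()`. Files that were already warm never enter the tracker, so they can never be evicted by it.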

**Disk + bandwidth**
- Pre-flight free-space check vs total dataless bytes in `hydrate*` modes. Refuse with a clear error rather than ENOSPC mid-backup.
- A concurrency knob (`hydrate_parallel: N`) helps avoid head-of-line blocking on a single multi-GB file.
- Optional future: skip-on-metered-network (macOS exposes reachability flags).
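The pre-flight check could look roughly like this. It is a sketch under assumptions: `preflight_hydration` and `InsufficientSpace` are hypothetical names, `reserve_bytes` is an illustrative safety margin not mentioned above, and obtaining `free_bytes` (for example via `statvfs`) is left to the caller.

```rust
/// Returned when hydrating every dataless file would not fit on disk.
#[derive(Debug, PartialEq, Eq)]
pub struct InsufficientSpace {
    pub needed: u64,
    pub available: u64,
}

/// Check free space against the total size of dataless files before a
/// `hydrate*` run starts. On success, returns the total bytes to hydrate.
pub fn preflight_hydration(
    dataless_sizes: &[u64],
    free_bytes: u64,
    reserve_bytes: u64,
) -> Result<u64, InsufficientSpace> {
    // Saturating arithmetic avoids overflow on pathological inputs.
    let total = dataless_sizes
        .iter()
        .fold(0u64, |acc, size| acc.saturating_add(*size));
    let needed = total.saturating_add(reserve_bytes);

    if needed > free_bytes {
        Err(InsufficientSpace { needed, available: free_bytes })
    } else {
        Ok(total)
    }
}
```

Summing all dataless sizes is the conservative choice here, since in `hydrate-evict` mode files stay warm until the snapshot commits.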

**Restore semantics**
- Restoring a snapshot of a previously-dataless file produces a normal warm file on disk — *not* dataless. The user must re-sync via iCloud / their provider if they want it cloud-only again. Document in restore docs.
- FileProvider xattrs (`com.apple.fileprovider.fpfs#P` etc.) are present on dataless files. Probably worth stripping on restore — provider-domain context is stale once the snapshot is rehomed.

**Snapshot consistency**
- If a file changes between hydrate and evict (e.g. cloud delivers an update mid-backup), the snapshot reflects the moment-of-read state. Acceptable; backup is point-in-time.

## Why this matters

This is the difference between vykar quietly missing tens of gigabytes of user data on a default macOS install vs. providing real defense-in-depth backup that's independent of the cloud provider. The detection is one bit; the rest is opt-in modes.

Happy to PR if there's interest. macOS-only code can sit behind `#[cfg(target_os = "macos")]`; eviction needs a small Objective-C / Swift bridge for `NSFileProviderManager`.

I'm not certain the specific shape proposed above is the right one — the modes, the parent-propagation semantics, and the eviction tracking all involve tradeoffs I haven't fully thought through. Mostly hoping to stimulate discussion on what the right behavior should look like, since the current behavior (silently dropping dataless files while burning bandwidth re-attempting them) is clearly a bug regardless of what replaces it.

## References

- Apple, `<sys/stat.h>`: `#define SF_DATALESS 0x40000000` ("file is dataless object")
- Apple, [`NSFileProviderManager.evictItem(identifier:completionHandler:)`](https://developer.apple.com/documentation/fileprovider/nsfileprovidermanager/3553323-evictitem) (macOS 11+)
- macOS `find -flags +dataless` enumerates affected files


macOS: handle dataless (cloud-only) files instead of failing with 'file changed during read' #118
