📄 FAR - File-Augmented Retrieval

Making Every File Readable to AI Coding Agents

📖 Read the Paper • 🚀 Quick Start • ✨ Features

"RAG performs retrieval at query time. FAR performs augmentation at file time." — FAR Paper, Kelly Peilin Chan, 2026

🚀 Quick Start

Install

In Claude Code:

/plugin marketplace add mr-kelly/far
/plugin install far

Via npx (other AI agents):

npx skills add mr-kelly/far

Manual:

git clone https://github.com/mr-kelly/far.git

Run

# Scan current directory (recursive)
far

# Scan specific directory
far ~/Documents/projects

# Process single file
far report.pdf

# Force regeneration (ignore cache)
far . --force

One Rule for Your Agent

Add this to AGENTS.md or your system prompt. That's all:

When you encounter a binary file you cannot read
(.png, .pdf, .xlsx, .mp4), check for a .meta file
beside it. The .meta contains extracted content as
Markdown. For directory overviews, read .dir.meta.
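
The rule above amounts to a simple path lookup. The following is a minimal sketch of that lookup, not FAR's own code; the helper name `sidecar_for` is hypothetical, and the only conventions taken from this README are the `<name>.meta` suffix and the `.dir.meta` file name.

```python
from pathlib import Path
from typing import Optional


def sidecar_for(path: Path) -> Optional[Path]:
    """Return the sidecar an agent should read instead of `path`, if one exists.

    Hypothetical helper: files map to `<name>.meta` beside them,
    directories map to `.dir.meta` inside them.
    """
    if path.is_dir():
        candidate = path / ".dir.meta"
    else:
        # Append ".meta" to the full file name, e.g. report.pdf -> report.pdf.meta
        candidate = path.with_name(path.name + ".meta")
    return candidate if candidate.is_file() else None
```

An agent wrapper could call `sidecar_for(Path("report.pdf"))` before attempting to open the binary, and fall back to its normal behavior when the result is `None`.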

Configuration (AI Features)

cp skills/far/.env.example skills/far/.env
# Add OPENAI_API_KEY to enable OpenAI vision + transcription
# Optional (macOS):
# FAR_USE_APPLE_VISION=1
# FAR_USE_MACOS_METADATA=1
# FAR_APPLE_VISION_MAX_FRAMES=6

Without API keys, FAR falls back to local tools (Tesseract, ffprobe). On macOS it also uses Apple Vision and Spotlight metadata on-device.

🎯 The Problem

AI coding agents (Claude Code, Codex, GitHub Copilot) can read code — but they're blind to 30–40% of critical context stored in binary formats:

| File | Agent sees |
|------|------------|
| `budget.xlsx` | Opaque bytes |
| `architecture.png` | Nothing |
| `requirements.pdf` | Nothing |
| `standup.mp4` | Nothing |

"An AI agent operating without access to these files is like a developer who can read code but is forbidden from looking at the design docs, the architecture diagrams, or the product requirements."

💡 The Solution

FAR generates a persistent .meta sidecar next to every binary file:

project/
├── budget.xlsx           ← Binary (opaque to AI)
├── budget.xlsx.meta      ← Markdown table (readable by AI)
├── architecture.png      ← Binary
├── architecture.png.meta ← Caption + OCR text
└── standup.mp4.meta      ← Full transcript + topics

No vector database. No embedding service. No runtime pipeline.

✨ Features

📦 Supported Formats

| Format | Extensions | Extractor | Output |
|--------|------------|-----------|--------|
| 📄 PDF | `.pdf` | pdfminer + tabula | Full text, tables as Markdown |
| 📝 Word | `.docx`, `.doc` | python-docx / antiword | Full text |
| 📊 Excel | `.xlsx` | openpyxl | Sheets as Markdown tables |
| 📽️ PowerPoint | `.pptx` | python-pptx | Slide text |
| 🖼️ Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.webp` | Tesseract OCR + Apple Vision (macOS) + GPT-4V | Caption + OCR text + labels |
| 🎬 Video | `.mp4`, `.mov`, `.avi`, `.mkv` | ffmpeg + Tesseract + Apple Vision (macOS) + Whisper | Metadata + OCR + on-device scene summary + transcript |
| 🎵 Audio | `.mp3`, `.wav`, `.m4a`, `.flac` | Whisper | Transcript |
| 📋 CSV | `.csv` | Built-in | Markdown table (up to 100 rows) |
| 📓 Jupyter | `.ipynb` | Built-in | Markdown + code cells + outputs |
| 📚 EPUB | `.epub` | Built-in | Full text from all chapters |
| 🗜️ Archive | `.zip`, `.jar`, `.whl`, `.apk` | Built-in | File listing with sizes |
| 📦 Tar | `.tar`, `.tar.gz`, `.tgz`, `.bz2`, `.xz` | Built-in | File listing with sizes |
| 📧 Email | `.eml`, `.msg` | Built-in | Headers + body + attachment list |
| 📝 RTF | `.rtf` | Built-in | Plain text extraction |
| 🗄️ SQLite | `.db`, `.sqlite`, `.sqlite3` | Built-in | Table schemas + latest 20 rows per table |
| 📊 Parquet | `.parquet` | pyarrow (optional) | Schema + row count (metadata only) |
| 🎨 Design | `.fig`, `.sketch`, `.xd` | Built-in | File size + page count (metadata only) |
| 💻 Code | `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.sh`, ... | Direct mirror | Full content |
| 📋 Text | `.txt`, `.md`, `.json`, `.yml`, `.xml`, `.html`, `.css` | Direct mirror | Full content |
| 📦 Other | `*` | Fallback | MIME type + file metadata |
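
To make the "Built-in" CSV row concrete, here is a rough sketch of how a CSV file could be turned into a capped Markdown table using only the standard library. It is an illustration, not FAR's extractor: the function name `csv_to_markdown` is hypothetical, and treating the 100-row cap as "data rows, not counting the header" is an assumption.

```python
import csv
import io


def csv_to_markdown(text: str, max_rows: int = 100) -> str:
    """Render CSV text as a Markdown table, keeping at most `max_rows` data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:max_rows + 1]
    width = len(header)

    def fmt(cells):
        # Pad or trim ragged rows to the header width, then escape pipes
        # and flatten embedded newlines so each row stays on one line.
        cells = (list(cells) + [""] * width)[:width]
        escaped = [c.replace("|", "\\|").replace("\n", " ") for c in cells]
        return "| " + " | ".join(escaped) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Capping the row count keeps the sidecar small enough for an agent to read in full while still exposing the column names and representative values.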

🍎 macOS On-Device Enhancements

On macOS, the following are enabled by default:

- Apple Vision for images and video frames (OCR, labels, faces, barcodes/QR, body pose)
- Apple Vision feature-print fingerprint hash for image similarity workflows
- Spotlight (mdls) metadata enrichment in .meta files

⚡ Intelligent Caching

Two-layer cache for instant incremental builds:

1. Fast check (mtime + size): skip unchanged files in 0.003s
2. Content check (SHA-256): detect true changes even if the timestamp differs

Only files whose content has actually changed are re-extracted. The rest are instant cache hits.
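
The two layers described above can be sketched as follows. This is a minimal illustration under stated assumptions, not FAR's cache code: `CacheEntry` and `needs_extraction` are hypothetical names, and how FAR actually stores its cache is not specified in this README.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass
class CacheEntry:
    """Hypothetical cache record for one source file."""
    mtime_ns: int
    size: int
    sha256: str


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB blocks so large binaries are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_extraction(path: Path, entry: Optional[CacheEntry]) -> Tuple[bool, CacheEntry]:
    """Return (changed, entry_to_store) for `path` given its previous cache entry."""
    stat = path.stat()
    # Layer 1: identical mtime and size means we skip without reading the file.
    if entry and entry.mtime_ns == stat.st_mtime_ns and entry.size == stat.st_size:
        return False, entry
    # Layer 2: hash the content; a touched-but-identical file is still a cache hit.
    digest = sha256_of(path)
    fresh = CacheEntry(stat.st_mtime_ns, stat.st_size, digest)
    changed = entry is None or entry.sha256 != digest
    return changed, fresh
```

Returning the refreshed entry even on a layer-2 hit lets the next run take the fast path again, since the stored mtime now matches the file.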

.dir.meta is also content-stable: if directory summary content hasn't changed, FAR will not rewrite it (so extract.extracted_at won't churn on every scheduled run).
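
One way to get this content-stable behavior is to compare only the Markdown body, ignoring the front matter that carries the timestamp. The sketch below assumes that approach; `body_of` and `write_dir_meta` are hypothetical names, and the front matter written here is a reduced version of the `.meta` header used purely for illustration.

```python
from pathlib import Path


def body_of(meta_text: str) -> str:
    """Return the text after the leading '---' front matter block, or all of it if none."""
    if meta_text.startswith("---\n"):
        end = meta_text.find("\n---\n", 3)
        if end != -1:
            return meta_text[end + len("\n---\n"):]
    return meta_text


def write_dir_meta(path: Path, body: str, extracted_at: str) -> bool:
    """Write a directory summary only if its body changed; return True if written."""
    if path.is_file() and body_of(path.read_text(encoding="utf-8")) == body:
        # Same summary as last run: leave the file (and its timestamp) untouched.
        return False
    front = f"---\nfar_version: 1\nextract:\n  extracted_at: {extracted_at}\n---\n"
    path.write_text(front + body, encoding="utf-8")
    return True
```

Skipping the write keeps scheduled runs from producing spurious diffs in version control.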

📁 Directory Summaries

Auto-generated .dir.meta files let agents "browse" entire directories without reading every file:

project/.dir.meta       ← "What is this project?"
  src/.dir.meta         ← "What's in src/?"
  docs/.dir.meta        ← "What docs exist?"

🔒 Privacy & Security

- .farignore file (gitignore syntax) to exclude sensitive paths and directories
- Fully offline by default: no file content leaves your machine unless you configure an API key
- Selective extraction: mark directories as "metadata-only" (no content extraction)

📊 Why Not RAG?

RAG chunks documents into 500–1000 token fragments. This destroys structure:

Original table in report.pdf:
| Region  | Revenue | Growth |
| APAC    | $2.3M   | +28%   |   ← complete, meaningful
| NA      | $1.9M   | +12%   |

After RAG chunking:
  Chunk 37: "...APAC $2.3M +28% NA"
  Chunk 38: "$1.9M +12% Europe..."  ← table split, context lost

FAR preserves the full file structure in every .meta. The agent always gets the complete picture.
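
The splitting effect shown above is easy to reproduce. The sketch below uses a deliberately naive fixed-size chunker that counts whitespace-separated words rather than model tokens; that simplification, and the name `chunk_words`, are assumptions for illustration and do not describe any particular RAG library.

```python
from typing import List


def chunk_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# The table from report.pdf, flattened to plain text as a chunker would see it.
flattened = "Region Revenue Growth APAC $2.3M +28% NA $1.9M +12%"
for index, chunk in enumerate(chunk_words(flattened, 7)):
    print(index, chunk)
```

With a chunk size of 7, the `NA` label lands in the first chunk and its figures in the second, so neither chunk alone tells a reader what North America's revenue was.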

| | RAG | FAR |
|---|-----|-----|
| Infrastructure | 3+ always-running services | Zero |
| Content quality | Lossy chunks | Complete file |
| Binary support | Partial | Full |
| Latency | 200–500ms | <10ms |
| Offline | ❌ | ✅ |

🧠 Inspired by Unity Engine

In 2005, Unity faced the same problem — game assets (.png, .fbx, .wav) are binary and opaque to the engine. Their solution: every asset gets a persistent text sidecar.

player.png      →   player.png.meta   (Unity: engine metadata)
report.pdf      →   report.pdf.meta   (FAR: AI-readable content)

Twenty years later, FAR applies the same insight to AI coding agents.

🔌 Ecosystem Compatibility

FAR sits at the file layer of the AI infrastructure stack — complementing, not replacing, existing tools:

| Standard | Scope | Relationship to FAR |
|----------|-------|---------------------|
| `AGENTS.md` | Project instructions | Add one FAR rule |
| `llms.txt` | Site/project summary | FAR is per-file granularity |
| MCP | Tool/resource protocol | FAR can be exposed as MCP resource |
| RAG | Query-time retrieval | FAR provides clean, structured input |

📐 The `.meta` Format

---
far_version: 1
source:
  sha256: a1b2c3d4...
  mime: application/pdf
  size: 129509
extract:
  pipeline: far_gen_v14
  extracted_at: 2026-02-27T10:00:00Z
---
# report.pdf

## Executive Summary
Revenue grew 23% YoY driven by APAC expansion.

## Table 1 - Revenue by Region
| Region       | Q3 2025 | Growth |
|--------------|---------|--------|
| Asia-Pacific | $2.3M   | +28%   |
| N. America   | $1.9M   | +12%   |
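
As a rough illustration of how a sidecar in this shape could be assembled, the sketch below builds the front matter from a source file and prepends it to already-extracted Markdown. It is not FAR's generator: `render_meta` is a hypothetical name, the field names are copied from the example above, and guessing the MIME type from the file extension is an assumption.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def render_meta(source: Path, body_markdown: str, pipeline: str = "far_gen_v14") -> str:
    """Return .meta text for `source`: YAML-style front matter plus a Markdown body."""
    data = source.read_bytes()
    # Extension-based guess; falls back to a generic type when unknown.
    mime = mimetypes.guess_type(source.name)[0] or "application/octet-stream"
    extracted_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header = "\n".join([
        "---",
        "far_version: 1",
        "source:",
        f"  sha256: {hashlib.sha256(data).hexdigest()}",
        f"  mime: {mime}",
        f"  size: {len(data)}",
        "extract:",
        f"  pipeline: {pipeline}",
        f"  extracted_at: {extracted_at}",
        "---",
    ])
    return f"{header}\n# {source.name}\n\n{body_markdown}"
```

Recording the SHA-256 and size of the source in the header is what lets a later run decide whether the sidecar is still current without re-extracting the file.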

📖 Research

File-Augmented Retrieval: Making Every File Readable to Coding Agents via Persistent .meta Sidecars

Kelly Peilin Chan, 2026

📚 Documentation

Full documentation in skills/far/SKILL.md

📄 License

MIT License

Built with ❤️ by Kelly
