Making Every File Readable to AI Coding Agents
"RAG performs retrieval at query time. FAR performs augmentation at file time." - FAR Paper, Kelly Peilin Chan, 2026
In Claude Code:
/plugin marketplace add mr-kelly/far
/plugin install far
Via npx (other AI agents):
npx skills add mr-kelly/far
Manual:
git clone https://github.com/mr-kelly/far.git
# Scan current directory (recursive)
far
# Scan specific directory
far ~/Documents/projects
# Process single file
far report.pdf
# Force regeneration (ignore cache)
far . --force
Add to AGENTS.md or system prompt (that's all):
When you encounter a binary file you cannot read
(.png, .pdf, .xlsx, .mp4), check for a .meta file
beside it. The .meta contains extracted content as
Markdown. For directory overviews, read .dir.meta.
cp skills/far/.env.example skills/far/.env
# Add OPENAI_API_KEY to enable OpenAI vision + transcription
# Optional (macOS):
# FAR_USE_APPLE_VISION=1
# FAR_USE_MACOS_METADATA=1
# FAR_APPLE_VISION_MAX_FRAMES=6
Without API keys, FAR falls back to local tools (Tesseract, FFprobe), and on macOS it also uses Apple Vision + Spotlight metadata on-device.
AI coding agents (Claude Code, Codex, GitHub Copilot) can read code, but they're blind to 30–40% of critical context stored in binary formats:
| File | Agent sees |
|---|---|
| budget.xlsx | Opaque bytes |
| architecture.png | Nothing |
| requirements.pdf | Nothing |
| standup.mp4 | Nothing |
"An AI agent operating without access to these files is like a developer who can read code but is forbidden from looking at the design docs, the architecture diagrams, or the product requirements."
FAR generates a persistent .meta sidecar next to every binary file:
project/
├── budget.xlsx            ← Binary (opaque to AI)
├── budget.xlsx.meta       ← Markdown table (readable by AI)
├── architecture.png       ← Binary
├── architecture.png.meta  ← Caption + OCR text
└── standup.mp4.meta       ← Full transcript + topics
No vector database. No embedding service. No runtime pipeline.
| Format | Extensions | Extractor | Output |
|---|---|---|---|
| PDF | .pdf | pdfminer + tabula | Full text, tables as Markdown |
| Word | .docx, .doc | python-docx / antiword | Full text |
| Excel | .xlsx | openpyxl | Sheets as Markdown tables |
| PowerPoint | .pptx | python-pptx | Slide text |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .webp | Tesseract OCR + Apple Vision (macOS) + GPT-4V | Caption + OCR text + labels |
| Video | .mp4, .mov, .avi, .mkv | ffmpeg + Tesseract + Apple Vision (macOS) + Whisper | Metadata + OCR + on-device scene summary + transcript |
| Audio | .mp3, .wav, .m4a, .flac | Whisper | Transcript |
| CSV | .csv | Built-in | Markdown table (up to 100 rows) |
| Jupyter | .ipynb | Built-in | Markdown + code cells + outputs |
| EPUB | .epub | Built-in | Full text from all chapters |
| Archive | .zip, .jar, .whl, .apk | Built-in | File listing with sizes |
| Tar | .tar, .tar.gz, .tgz, .bz2, .xz | Built-in | File listing with sizes |
| Email | .eml, .msg | Built-in | Headers + body + attachment list |
| RTF | .rtf | Built-in | Plain text extraction |
| SQLite | .db, .sqlite, .sqlite3 | Built-in | Table schemas + latest 20 rows per table |
| Parquet | .parquet | pyarrow (optional) | Schema + row count (metadata only) |
| Design | .fig, .sketch, .xd | Built-in | File size + page count (metadata only) |
| Code | .py, .js, .ts, .go, .rs, .java, .sh, ... | Direct mirror | Full content |
| Text | .txt, .md, .json, .yml, .xml, .html, .css | Direct mirror | Full content |
| Other | * | Fallback | MIME type + file metadata |
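To make one row of the table concrete, here is a rough sketch of what a built-in CSV extractor could look like: it renders the header plus at most 100 data rows as a Markdown table. This is an illustration using only the Python standard library, not FAR's actual extractor. The function name `csv_to_markdown` and its padding and escaping choices are assumptions.

```python
import csv
import io


def csv_to_markdown(text: str, max_rows: int = 100) -> str:
    """Render CSV text as a Markdown table, keeping at most max_rows data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1 : max_rows + 1]
    width = len(header)

    def fmt(cells):
        # Pad or trim each row to the header width and escape pipes in cells.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Capping the row count keeps the sidecar small enough for an agent to read in full while still exposing the column layout.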
On macOS, the following are enabled by default:
- Apple Vision for images and video frames (OCR, labels, faces, barcodes/QR, body pose)
- Apple Vision feature-print fingerprint hash for image similarity workflows
- Spotlight (mdls) metadata enrichment in .meta files
Two-layer cache for instant incremental builds:
- Fast check (mtime + size) → skip unchanged files in 0.003s
- Content check (SHA-256) → detect true changes even if timestamp differs
Only files whose content has actually changed are re-extracted. The rest are instant cache hits.
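The two layers can be sketched as a single decision function. This is a minimal illustration, assuming the previous run stored a per-file record with `mtime_ns`, `size`, and `sha256` keys. That record layout and the function names are hypothetical, not FAR's cache format.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB blocks so large binaries are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_extraction(path: Path, cached: Optional[dict]) -> bool:
    """Decide whether a file must be re-extracted.

    `cached` is a hypothetical record from the previous run with the keys
    "mtime_ns", "size" and "sha256", or None if the file was never seen.
    """
    if cached is None:
        return True
    stat = path.stat()
    # Layer 1: identical mtime and size, so skip without reading the file.
    if stat.st_mtime_ns == cached["mtime_ns"] and stat.st_size == cached["size"]:
        return False
    # Layer 2: metadata differs, so compare content hashes before re-extracting.
    return sha256_of(path) != cached["sha256"]
```

The cheap `stat` comparison handles the common case. The hash is computed only when metadata differs, which is what lets a touched but unmodified file remain a cache hit.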
.dir.meta is also content-stable: if directory summary content hasn't changed, FAR will not rewrite it (so extract.extracted_at won't churn on every scheduled run).
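One way to get this content-stable behavior is to compare only the Markdown body and leave the existing file alone when it matches. The sketch below assumes the sidecar starts with a front-matter block delimited by `---` lines. The helpers `split_body` and `write_if_body_changed` are hypothetical names for illustration.

```python
from pathlib import Path


def split_body(text: str) -> str:
    """Return the Markdown after the front matter, or the whole text if none."""
    if text.startswith("---\n"):
        end = text.find("\n---\n", 4)
        if end != -1:
            return text[end + 5:]
    return text


def write_if_body_changed(path: Path, front_matter: str, body: str) -> bool:
    """Write the sidecar only when its body differs; return True if written."""
    if path.is_file() and split_body(path.read_text(encoding="utf-8")) == body:
        # Same summary: keep the old file, including its old extracted_at.
        return False
    path.write_text(f"---\n{front_matter}\n---\n{body}", encoding="utf-8")
    return True
```

Because the timestamp lives in the front matter and is excluded from the comparison, a scheduled run with an unchanged summary produces no diff in version control.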
Auto-generated .dir.meta files let agents "browse" entire directories without reading every file:
project/.dir.meta → "What is this project?"
src/.dir.meta → "What's in src/?"
docs/.dir.meta → "What docs exist?"
- .farignore file (gitignore syntax) to exclude sensitive paths and directories
- Fully offline when no API keys are configured: no files leave your machine
- Selective extraction: mark directories as "metadata-only" (no content extraction)
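As a rough illustration of ignore-file filtering, the sketch below matches paths against a small subset of gitignore syntax: comments, blank lines, trailing-slash directory patterns, and simple globs. It is an assumption-laden simplification built on Python's `fnmatch`, not FAR's matcher. Full gitignore semantics such as negation with `!` and anchored patterns are not handled.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import List


def load_patterns(text: str) -> List[str]:
    """Keep non-empty lines that are not comments."""
    lines = (ln.strip() for ln in text.splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


def is_ignored(rel_path: str, patterns: List[str]) -> bool:
    """Return True if the path, or one of its components, matches a pattern."""
    parts = PurePosixPath(rel_path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: match any parent directory name.
            if any(fnmatch(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif fnmatch(rel_path, pattern) or any(fnmatch(p, pattern) for p in parts):
            return True
    return False
```

For example, with the patterns `secrets/` and `*.key`, the path `secrets/plan.pdf` is skipped while `docs/report.pdf` is still processed.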
RAG chunks documents into 500–1000 token fragments. This destroys structure:
Original table in report.pdf:
| Region | Revenue | Growth |
| APAC | $2.3M | +28% |   ← complete, meaningful
| NA | $1.9M | +12% |
After RAG chunking:
Chunk 37: "...APAC $2.3M +28% NA"
Chunk 38: "$1.9M +12% Europe..."   ← table split, context lost
FAR preserves the full file structure in every .meta. The agent always gets the complete picture.
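The splitting effect is easy to reproduce. The sketch below uses naive fixed-size chunking over whitespace-separated tokens, with a deliberately tiny chunk size so the boundary falls inside the table. Real pipelines use model tokenizers and far larger chunks, so treat this as a toy illustration. The function name `chunk_tokens` is an assumption.

```python
from typing import List


def chunk_tokens(text: str, size: int) -> List[str]:
    """Split text into chunks of `size` whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


table = (
    "Region Revenue Growth\n"
    "APAC $2.3M +28%\n"
    "NA $1.9M +12%\n"
)

# The NA row is split: its label ends the first chunk and its figures
# start the second, so neither chunk contains the complete row.
chunks = chunk_tokens(table, size=7)
for number, chunk in enumerate(chunks):
    print(number, repr(chunk))
```

A retriever that returns only the second chunk hands the model revenue and growth figures with no region attached. A whole-file sidecar avoids that failure.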
| | RAG | FAR |
|---|---|---|
| Infrastructure | 3+ always-running services | Zero |
| Content quality | Lossy chunks | Complete file |
| Binary support | Partial | Full |
| Latency | 200–500ms | <10ms |
| Offline | No | Yes |
In 2005, Unity faced the same problem: game assets (.png, .fbx, .wav) are binary and opaque to the engine. Their solution: every asset gets a persistent text sidecar.
player.png → player.png.meta (Unity: engine metadata)
report.pdf → report.pdf.meta (FAR: AI-readable content)
Twenty years later, FAR applies the same insight to AI coding agents.
FAR sits at the file layer of the AI infrastructure stack. It complements existing tools rather than replacing them:
| Standard | Scope | Relationship to FAR |
|---|---|---|
| AGENTS.md | Project instructions | Add one FAR rule |
| llms.txt | Site/project summary | FAR is per-file granularity |
| MCP | Tool/resource protocol | FAR can be exposed as MCP resource |
| RAG | Query-time retrieval | FAR provides clean, structured input |
---
far_version: 1
source:
sha256: a1b2c3d4...
mime: application/pdf
size: 129509
extract:
pipeline: far_gen_v14
extracted_at: 2026-02-27T10:00:00Z
---
# report.pdf
## Executive Summary
Revenue grew 23% YoY driven by APAC expansion.
## Table 1 - Revenue by Region
| Region | Q3 2025 | Growth |
|--------------|---------|--------|
| Asia-Pacific | $2.3M | +28% |
| N. America | $1.9M | +12% |
File-Augmented Retrieval: Making Every File Readable to Coding Agents via Persistent .meta Sidecars
Kelly Peilin Chan, 2026
Full documentation in skills/far/SKILL.md
MIT License
Built with ❤️ by Kelly