Crawl

=======================================
                                    888
                                    888
                                    888
 .d8888b888d888 8888b. 888  888  888888
d88P"   888P"      "88b888  888  888888
888     888    .d888888888  888  888888
Y88b.   888    888  888Y88b 888 d88P888
 "Y8888P888    "Y888888 "Y8888888P" 888
=======================================

A fast, concurrent web crawler written in Rust 🦀

Features

🚀 Concurrent crawling with configurable parallelism
🎯 Domain scoping to limit crawls to specific domains or TLDs
🔍 Depth-limited crawling to control how deep to traverse
🛡 Custom User-Agent support
📊 Duplicate detection to avoid revisiting pages
📈 Crawl statistics (total pages, time elapsed, average time per page)
🔇 Quiet mode for piping output (automatic detection or via flag)
💾 File output in JSON or plain text format

Installation

From source

git clone git@github.com:a-mango/crawl.git
cd crawl
cargo build --release

The binary will be available at target/release/crawl. You can move it to a directory in your PATH for easier access:

mv target/release/crawl "$HOME/.local/bin/"

Usage

Basic usage:

crawl -s https://example.com

Options

-s, --start-url <URL> - Required. The URL to start crawling from
-d, --depth <NUM> - Maximum depth to crawl (default: 2)
--scope <DOMAIN> - Domain or TLD suffix to restrict crawling to (e.g., example.com or .com)
--delay <MS> - Delay between requests in milliseconds (default: 1000)
--user-agent <STRING> - Custom User-Agent string (default: "rust-crawler/0.1.0")
--concurrency <NUM> - Maximum number of concurrent requests (default: 10)
-o, --output <FILE> - Output file path (supports .json or .txt extensions)
-q, --quiet - Quiet mode: print only URLs (useful for piping; enabled automatically when output is piped)
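
The README does not spell out how --scope values are matched. The following is a minimal sketch of one plausible rule, not the project's actual implementation: a scope beginning with a dot is treated as a suffix, and any other scope matches the domain itself plus its subdomains. The function name `in_scope` is hypothetical.

```rust
/// Returns true if `host` falls inside `scope`.
/// Assumed rule (not confirmed by the project): a scope starting with '.'
/// matches any host ending in that suffix; otherwise the host must equal
/// the scope or be a subdomain of it.
fn in_scope(host: &str, scope: &str) -> bool {
    // Host names are case-insensitive, so compare in lowercase.
    let host = host.to_ascii_lowercase();
    let scope = scope.to_ascii_lowercase();
    if scope.starts_with('.') {
        // TLD-style scope such as ".com"
        host.ends_with(&scope)
    } else {
        // Exact match, or a subdomain ending in ".<scope>"
        host == scope || host.ends_with(&format!(".{scope}"))
    }
}
```

Under this rule, `in_scope("blog.example.com", "example.com")` is true, while `in_scope("notexample.com", "example.com")` is false because the match has to fall on a label boundary.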

Examples

Crawl a site with depth 3:

crawl -s https://example.com -d 3

Crawl only within a specific domain:

crawl -s https://example.com --scope example.com

Save results to a JSON file:

crawl -s https://example.com -o results.json

Save results to a text file:

crawl -s https://example.com -o results.txt

Crawl all .com sites with custom concurrency:

crawl -s https://example.com --scope .com --concurrency 20

Crawl with a custom delay and user-agent:

crawl -s https://example.com --delay 2000 --user-agent "CustomBot/1.0"

Piping to other programs (quiet mode activates automatically):

# Filter URLs containing "api"
crawl -s https://example.com | grep "/api/"

# Get first 10 URLs
crawl -s https://example.com | head -n 10

Output

Normal Mode (Terminal)

The crawler prints an ASCII banner and the active configuration, then each discovered URL to stdout as it crawls, followed by statistics at the end:

=======================================
                                    888
                                    888
                                    888
 .d8888b888d888 8888b. 888  888  888888
d88P"   888P"      "88b888  888  888888
888     888    .d888888888  888  888888
Y88b.   888    888  888Y88b 888 d88P888
 "Y8888P888    "Y888888 "Y8888888P" 888
=======================================

Starting crawl: https://example.com
Configuration:
  Max depth:    2
  Delay:        1000ms
  User-Agent:   rust-crawler/0.1.0
  Concurrency:  10
========================================

https://example.com
https://example.com/about
https://example.com/contact

========================================
Crawl complete!
Statistics:
  Total pages:  3
  Time elapsed: 2.45s
  Avg per page: 816.67ms
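
The "Avg per page" figure above is the elapsed time divided by the number of pages (2.45 s / 3 ≈ 816.67 ms). A minimal sketch of that arithmetic follows. The function names are hypothetical, and the zero-page guard is an assumption about how an empty crawl might be handled.

```rust
use std::time::Duration;

/// Average time per page in milliseconds; 0.0 when no pages were crawled.
fn avg_per_page_ms(elapsed: Duration, total_pages: usize) -> f64 {
    if total_pages == 0 {
        return 0.0; // avoid dividing by zero on an empty crawl
    }
    elapsed.as_secs_f64() * 1000.0 / total_pages as f64
}

/// Formats the average with two decimals, as in the sample output.
fn format_avg(elapsed: Duration, total_pages: usize) -> String {
    format!("{:.2}ms", avg_per_page_ms(elapsed, total_pages))
}
```

For example, `format_avg(Duration::from_millis(2450), 3)` yields `816.67ms`.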

Quiet Mode (Piping)

When output is piped to another program or when using the -q/--quiet flag, only URLs are printed (one per line):

https://example.com
https://example.com/about
https://example.com/contact

This makes the output easy to chain with standard Unix tools such as grep, sort, and wc.
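
Automatic detection of piped output can be done with the standard library's `IsTerminal` trait (Rust 1.70 or later). The sketch below shows one way to combine it with the flag. Whether the project uses this trait or a separate crate is not stated in the README, and both function names are hypothetical.

```rust
use std::io::IsTerminal;

/// Quiet mode is on when the flag is set or stdout is not a terminal.
fn quiet_mode(quiet_flag: bool, stdout_is_terminal: bool) -> bool {
    quiet_flag || !stdout_is_terminal
}

/// Resolves quiet mode against the real stdout of the current process.
fn detect_quiet(quiet_flag: bool) -> bool {
    quiet_mode(quiet_flag, std::io::stdout().is_terminal())
}
```

Keeping the decision in a pure function (`quiet_mode`) separate from the terminal query makes the rule easy to unit test.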

When using -o/--output, results are saved in the format indicated by the file extension (.json or .txt):

JSON: Structured format with total count and sorted URL list
Text: One URL per line, sorted alphabetically
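
As an illustration of the two formats, here is a small standard-library-only sketch that picks a format from the file extension and renders a sorted URL list. The JSON key names (`total`, `urls`) and all identifiers are assumptions, since the README does not show the actual schema. A real implementation would more likely use a JSON library.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum OutputFormat {
    Json,
    Text,
}

/// Chooses the output format from the file extension, if it is supported.
fn format_for_path(path: &str) -> Option<OutputFormat> {
    match Path::new(path).extension()?.to_str()? {
        "json" => Some(OutputFormat::Json),
        "txt" => Some(OutputFormat::Text),
        _ => None,
    }
}

/// Renders the URLs sorted alphabetically in the requested format.
fn render(urls: &[&str], format: &OutputFormat) -> String {
    let mut sorted: Vec<&str> = urls.to_vec();
    sorted.sort_unstable();
    match format {
        OutputFormat::Text => sorted.join("\n"),
        OutputFormat::Json => {
            // Minimal escaping of backslashes and quotes only.
            let items: Vec<String> = sorted
                .iter()
                .map(|u| format!("\"{}\"", u.replace('\\', "\\\\").replace('"', "\\\"")))
                .collect();
            // Hypothetical schema: {"total": N, "urls": [...]}
            format!("{{\"total\":{},\"urls\":[{}]}}", sorted.len(), items.join(","))
        }
    }
}
```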

Non-HTTP schemes (mailto:, ftp:, etc.) are skipped and logged in normal mode.
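
A scheme filter of this kind can be very small. The sketch below assumes links have already been resolved to absolute URLs. The function name `is_crawlable` is hypothetical and this is not necessarily how the project performs the check.

```rust
/// Returns true only for absolute http:// or https:// URLs.
/// Assumes relative links were resolved to absolute URLs beforehand.
fn is_crawlable(url: &str) -> bool {
    // URL schemes are case-insensitive, so compare in lowercase.
    let lower = url.trim().to_ascii_lowercase();
    lower.starts_with("http://") || lower.starts_with("https://")
}
```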

Architecture

The crawler is built with:

reqwest - Async HTTP client
tokio - Async runtime
scraper - HTML parsing and link extraction
clap - Command-line argument parsing

The codebase is modular:

crawler.rs - Core crawling logic with concurrency control
extractor.rs - HTML parsing and link extraction
main.rs - CLI interface
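
To illustrate how depth limiting and duplicate detection fit together, here is a simplified, synchronous sketch that walks a hypothetical in-memory link graph instead of fetching pages. It is not the code in crawler.rs: it omits concurrency, delays, and HTTP entirely, and it assumes the start URL counts as depth 0.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first traversal of an in-memory link graph.
/// `links` maps a page URL to the URLs it links to, standing in for
/// "fetch the page and extract its links".
fn crawl(start: &str, max_depth: usize, links: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut order = Vec::new();

    visited.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        if depth == max_depth {
            continue; // do not follow links past the depth limit
        }
        for &next in links.get(url.as_str()).into_iter().flatten() {
            // insert() returns false for URLs that were already seen,
            // which is what stops circular links from looping forever.
            if visited.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    order
}
```

With a graph where `a` links to `b` and `c`, and `b` links back to `a` and on to `d`, a call with `max_depth` 1 visits `a`, `b`, and `c`, and the back-link to `a` is ignored.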

Testing

Run the test suite:

cargo test

The project includes:

Unit tests for link extraction and URL normalization
Integration tests for crawling scenarios (circular links, depth limits, scope filtering, etc.)

Limitations

No robots.txt support
No sitemap.xml support
No retry logic for failed requests
No resume/checkpoint functionality

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This software is licensed under the MIT License. See LICENSE for details.

Roadmap

Add retry logic with exponential backoff
Respect robots.txt
Support for sitemap.xml
Resume crawling from checkpoint
Better progress indicators
Handle rate limiting (429 responses)
Add CSV output format
Add option to download crawled content
Store crawled links as a graph structure
