=======================================
888
888
888
.d8888b888d888 8888b. 888 888 888888
d88P" 888P" "88b888 888 888888
888 888 .d888888888 888 888888
Y88b. 888 888 888Y88b 888 d88P888
"Y8888P888 "Y888888 "Y8888888P" 888
=======================================
A fast, concurrent web crawler written in Rust 🦀
- 🚀 Concurrent crawling with configurable parallelism
- 🎯 Domain scoping to limit crawls to specific domains or TLDs
- 🔍 Depth-limited crawling to control how deep to traverse
- 🛡 Custom User-Agent support
- 📊 Duplicate detection to avoid revisiting pages
- 📈 Crawl statistics (total pages, time elapsed, average time per page)
- 🔇 Quiet mode for piping output (automatic detection or via flag)
- 💾 File output in JSON or plain text format
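As a rough illustration of how depth limiting and duplicate detection fit together, here is a minimal sketch that walks an in-memory link graph instead of fetching pages. The names `crawl_order` and `sample_graph` are hypothetical, and treating the start URL as depth 0 is an assumption; the project's actual logic lives in crawler.rs and may differ.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first traversal of an in-memory link graph, standing in for real
/// HTTP fetches. Returns pages in the order they were first visited.
fn crawl_order(links: &HashMap<&str, Vec<&str>>, start: &str, max_depth: usize) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut order = Vec::new();

    // Assumption: the start URL counts as depth 0.
    seen.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        // Pages at the depth limit are reported, but their links are not followed.
        if depth == max_depth {
            continue;
        }
        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // `insert` returns false for URLs already seen, so circular
                // links are never queued twice.
                if seen.insert(child.to_string()) {
                    queue.push_back((child.to_string(), depth + 1));
                }
            }
        }
    }
    order
}

/// A tiny hypothetical site with a cycle between "/" and "/about".
fn sample_graph() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("/", vec!["/about", "/contact"]),
        ("/about", vec!["/", "/team"]),
        ("/team", vec!["/about"]),
    ])
}
```

With this graph, a depth limit of 1 stops before "/team", while a limit of 2 reaches it; the cycle back to "/" is ignored in both cases.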
git clone git@github.com:a-mango/crawl.git
cd crawl
cargo build --release
The binary will be available at target/release/crawl. You can move it to a directory in your PATH for easier access:
mv target/release/crawl $HOME/.local/bin/
Basic usage:
crawl -s https://example.com
- -s, --start-url <URL>: Required. The URL to start crawling from
- -d, --depth <NUM>: Maximum depth to crawl (default: 2)
- --scope <DOMAIN>: Domain to scope crawling to (e.g., example.com, .com)
- --delay <MS>: Delay between requests in milliseconds (default: 1000)
- --user-agent <STRING>: Custom User-Agent string (default: "rust-crawler/0.1.0")
- --concurrency <NUM>: Maximum number of concurrent requests (default: 10)
- -o, --output <FILE>: Output file path (supports .json or .txt extensions)
- -q, --quiet: Quiet mode; only output URLs (useful for piping)
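The --scope option accepts either a domain (example.com) or a TLD-style suffix (.com). One plausible way to interpret those two forms is sketched below. The function `in_scope` is hypothetical, and the matching rules (a leading dot means suffix match, anything else matches the domain and its subdomains) are an assumption rather than the crawler's documented behavior.

```rust
/// Returns true if `host` falls inside `scope`.
/// Assumed semantics: a scope starting with '.' is a plain suffix match
/// (e.g. ".com"); any other scope matches that domain and its subdomains.
fn in_scope(host: &str, scope: &str) -> bool {
    // Host names are case-insensitive, so compare in lowercase.
    let host = host.to_ascii_lowercase();
    let scope = scope.to_ascii_lowercase();
    if scope.starts_with('.') {
        host.ends_with(&scope)
    } else {
        // Require a dot before the scope so "notexample.com" does not match "example.com".
        host == scope || host.ends_with(&format!(".{scope}"))
    }
}
```

Under these rules `in_scope("blog.example.com", "example.com")` holds, while `in_scope("notexample.com", "example.com")` does not.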
Crawl a site with depth 3:
crawl -s https://example.com -d 3
Crawl only within a specific domain:
crawl -s https://example.com --scope example.com
Save results to a JSON file:
crawl -s https://example.com -o results.json
Save results to a text file:
crawl -s https://example.com -o results.txt
Crawl all .com sites with custom concurrency:
crawl -s https://example.com --scope .com --concurrency 20
Crawl with a custom delay and user-agent:
crawl -s https://example.com --delay 2000 --user-agent "CustomBot/1.0"
Piping to other programs (quiet mode activates automatically):
# Filter URLs containing "api"
crawl -s https://example.com | grep "/api/"
# Get first 10 URLs
crawl -s https://example.com | head -n 10
The crawler prints each discovered URL to stdout as it crawls, preceded by an ASCII banner and the active configuration and followed by statistics:
=======================================
888
888
888
.d8888b888d888 8888b. 888 888 888888
d88P" 888P" "88b888 888 888888
888 888 .d888888888 888 888888
Y88b. 888 888 888Y88b 888 d88P888
"Y8888P888 "Y888888 "Y8888888P" 888
=======================================
Starting crawl: https://example.com
Configuration:
Max depth: 2
Delay: 1000ms
User-Agent: rust-crawler/0.1.0
Concurrency: 10
========================================
https://example.com
https://example.com/about
https://example.com/contact
========================================
Crawl complete!
Statistics:
Total pages: 3
Time elapsed: 2.45s
Avg per page: 816.67ms
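The average in the sample above is simply the elapsed time divided by the page count (2.45 s / 3 ≈ 816.67 ms). A minimal sketch of that arithmetic follows; `avg_ms_per_page` and `format_stats` are hypothetical names, and returning `None` for an empty crawl is an assumption about how a zero page count might be handled.

```rust
use std::time::Duration;

/// Average time per page in milliseconds; `None` when no pages were crawled,
/// which avoids a division by zero.
fn avg_ms_per_page(elapsed: Duration, pages: u32) -> Option<f64> {
    if pages == 0 {
        return None;
    }
    Some(elapsed.as_secs_f64() * 1000.0 / f64::from(pages))
}

/// Formats the three statistics lines shown in the sample output.
fn format_stats(elapsed: Duration, pages: u32) -> String {
    let mut out = format!(
        "Total pages: {}\nTime elapsed: {:.2}s",
        pages,
        elapsed.as_secs_f64()
    );
    // The average line is omitted when there is nothing to average.
    if let Some(avg) = avg_ms_per_page(elapsed, pages) {
        out.push_str(&format!("\nAvg per page: {:.2}ms", avg));
    }
    out
}
```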
When output is piped to another program or when using the -q/--quiet flag, only URLs are printed (one per line):
https://example.com
https://example.com/about
https://example.com/contact
This makes the output well suited to Unix-style pipelines with standard tools such as grep, sort, and wc.
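Automatic quiet-mode detection of this kind is commonly done by checking whether stdout is attached to a terminal. The sketch below uses the standard library's `IsTerminal` trait (Rust 1.70 or later); the function names are hypothetical and the project may detect pipes differently.

```rust
use std::io::IsTerminal;

/// Quiet mode is on when the flag is set or stdout is not a terminal
/// (for example, when output goes to a pipe or a file).
fn quiet_mode(quiet_flag: bool, stdout_is_terminal: bool) -> bool {
    quiet_flag || !stdout_is_terminal
}

/// Checks the real stdout; `quiet_flag` would come from the parsed
/// -q/--quiet option.
fn detect_quiet(quiet_flag: bool) -> bool {
    quiet_mode(quiet_flag, std::io::stdout().is_terminal())
}
```

Keeping the decision in a pure function (`quiet_mode`) makes it easy to test without a real terminal.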
When using -o/--output, results are saved in the specified format:
- JSON: Structured format with total count and sorted URL list
- Text: One URL per line, sorted alphabetically
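To make the two formats concrete, here is a minimal, standard-library-only sketch that picks the format from the file extension and renders sorted URLs. The function names and the JSON field names "total" and "urls" are assumptions for illustration, as is falling back to plain text for any extension other than .json; the crawler's real output layout may differ.

```rust
use std::path::Path;

/// Escapes the characters that must be escaped inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            // Control characters are written as \u00XX escapes.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders crawl results for `path`: JSON for a .json extension,
/// one URL per line otherwise. URLs are sorted alphabetically in both cases.
fn render_output(path: &str, urls: &[&str]) -> String {
    let mut sorted: Vec<&str> = urls.to_vec();
    sorted.sort_unstable();

    let is_json = Path::new(path)
        .extension()
        .map_or(false, |ext| ext.eq_ignore_ascii_case("json"));

    if is_json {
        let items: Vec<String> = sorted
            .iter()
            .map(|u| format!("\"{}\"", json_escape(u)))
            .collect();
        // Field names "total" and "urls" are assumed, not taken from the project.
        format!("{{\"total\":{},\"urls\":[{}]}}", sorted.len(), items.join(","))
    } else {
        sorted.join("\n")
    }
}
```

The returned string could then be written to disk with `std::fs::write(path, contents)`.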
Links with non-HTTP schemes (mailto:, ftp:, etc.) are skipped; in normal (non-quiet) mode, each skipped link is logged.
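A scheme filter like this can be very small. The sketch below assumes links have already been resolved to absolute URLs and that both http and https count as crawlable; `is_http_url` is a hypothetical helper, not a function from the codebase.

```rust
/// Returns true for absolute links whose scheme is http or https.
/// Schemes are compared case-insensitively; links without a scheme
/// (such as relative paths) are rejected because they are assumed to
/// have been resolved against the page URL beforehand.
fn is_http_url(link: &str) -> bool {
    match link.split_once(':') {
        Some((scheme, _)) => {
            scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https")
        }
        None => false,
    }
}
```

Because only the text before the first colon is inspected, a URL with an explicit port such as https://example.com:8080/ is still accepted.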
The crawler is built with:
- reqwest - Async HTTP client
- tokio - Async runtime
- scraper - HTML parsing and link extraction
- clap - Command-line argument parsing
The codebase is modular:
- crawler.rs - Core crawling logic with concurrency control
- extractor.rs - HTML parsing and link extraction
- main.rs - CLI interface
Run the test suite:
cargo test
The project includes:
- Unit tests for link extraction and URL normalization
- Integration tests for crawling scenarios (circular links, depth limits, scope filtering, etc.)
Current limitations:
- No robots.txt support
- No sitemap.xml support
- No retry logic for failed requests
- No resume/checkpoint functionality
Contributions are welcome! Please feel free to submit issues or pull requests.
This software is licensed under the MIT License. See LICENSE for details.
Planned improvements:
- Add retry logic with exponential backoff
- Respect robots.txt
- Support for sitemap.xml
- Resume crawling from checkpoint
- Better progress indicators
- Handle rate limiting (429 responses)
- Add CSV output format
- Add option to download crawled content
- Store crawled links as a graph structure