A modern, batteries-included, modular scraping framework for Go.
go get github.com/LQR471814/scavenge
- Minimal footprint: ~1k LOC with minimal reliance on third-party libraries.
- Batteries-included: Domain whitelisting, automatic throttling, retries, response replay, and more are all included.
- Flexible: Near-full control over the details of scraping, without having to deal with those details when you don't need to.
- Extensible: Easily add features you need with simple middleware interfaces.
The example in examples/wikipedia scrapes ~1000 Wikipedia pages in ~20 seconds.
To run it:
cd examples/wikipedia
go run .
- golang.org/x/net - Used in downloader.Response.
- github.com/PuerkitoBio/purell - Used only in middleware.Dedupe and middleware.Replay.
- github.com/gobwas/glob - Used only in middleware.AllowedDomains.
- github.com/zeebo/xxh3 - Used only in middleware.FSReplayStore.
- All the other dependencies are only used in examples.
- Python's Scrapy is a massive influence on this library; many of its design choices are taken straight from it.