scavenge

A modern, batteries-included, modular scraping framework for Go.

Installation

go get github.com/LQR471814/scavenge

Features

  • Minimal footprint: ~1k LOC with minimal reliance on third-party libraries.
  • Batteries-included: Domain whitelisting, automatic throttling, retries, response replay, and more are built in.
  • Flexible: Near-full control over the details of scraping when you need it, with sensible behavior when you don't.
  • Extensible: Easily add features you need with simple middleware interfaces.

Example

Here's an example that scrapes ~1000 Wikipedia pages in ~20 seconds.

To run it:

cd examples/wikipedia
go run .

Dependencies

  • golang.org/x/net - Used in downloader.Response.
  • github.com/PuerkitoBio/purell - Used only in middleware.Dedupe and middleware.Replay.
  • github.com/gobwas/glob - Used only in middleware.AllowedDomains.
  • github.com/zeebo/xxh3 - Used only in middleware.FSReplayStore.
  • All the other dependencies are only used in examples.
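
As a rough illustration of why a URL normalizer is useful for deduplication, the sketch below uses only the standard library. It is an assumption-laden simplification, not the code in middleware.Dedupe and not purell: `normalize`, `Seen`, `NewSeen`, and `FirstVisit` are hypothetical names, and the normalization rules (lowercase host, drop fragment, trim trailing slash) are a small subset of what a real normalizer applies.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// normalize is a simplified stand-in for a URL normalizer such as purell.
// It lowercases the scheme and host, drops the fragment, and trims a
// trailing slash from the path.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// Seen is a hypothetical concurrency-safe set of normalized URLs.
type Seen struct {
	mu   sync.Mutex
	urls map[string]struct{}
}

// NewSeen returns an empty set.
func NewSeen() *Seen {
	return &Seen{urls: make(map[string]struct{})}
}

// FirstVisit reports whether raw has not been seen before, comparing
// URLs by their normalized form, and records it as seen.
func (s *Seen) FirstVisit(raw string) (bool, error) {
	key, err := normalize(raw)
	if err != nil {
		return false, err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.urls[key]; ok {
		return false, nil
	}
	s.urls[key] = struct{}{}
	return true, nil
}

func main() {
	seen := NewSeen()
	for _, raw := range []string{
		"https://EN.wikipedia.org/wiki/Go/",
		"https://en.wikipedia.org/wiki/Go#History",
		"https://en.wikipedia.org/wiki/Rust",
	} {
		first, err := seen.FirstVisit(raw)
		fmt.Println(raw, first, err)
	}
}
```

Without normalization, the first two URLs would be treated as different pages even though they point at the same document.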

Credits

  • Python's Scrapy is a massive influence on this library; many of its design choices are taken straight from Scrapy.
