scavenge

A modern, batteries-included, modular scraping framework for Go.

Installation

go get github.com/LQR471814/scavenge

Features

Minimal footprint: ~1k LOC with minimal reliance on third-party libraries.
Batteries-included: Domain whitelisting, automatic throttling, retries, replaying responses, etc... all included.
Flexible: Near-full control over the details of scraping, but no need to concern yourself with the details if you don't need to.
Extensible: Easily add features you need with simple middleware interfaces.

Example

Here's an example that scrapes ~1000 wikipedia pages in ~20 seconds.

To run it:

cd examples/wikipedia
go run .

Dependencies

golang.org/x/net - Used in downloader.Response.
github.com/PuerkitoBio/purell - Used only in middleware.Dedupe and middleware.Replay.
github.com/gobwas/glob - Used only in middleware.AllowedDomains.
github.com/zeebo/xxh3 - Used only in middleware.FSReplayStore.
All the other dependencies are only used in examples.

Credits

Python's scrapy is a massive influence on this library, many of the design choices in this library are taken straight from here.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
downloader		downloader
examples		examples
items		items
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
go.mod		go.mod
go.sum		go.sum
navigator.go		navigator.go
scavenger.go		scavenger.go
slog.go		slog.go
telemetry.go		telemetry.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

scavenge

Installation

Features

Example

Dependencies

Credits

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

LQR471814/scavenge

Folders and files

Latest commit

History

Repository files navigation

scavenge

Installation

Features

Example

Dependencies

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages