A Chromium-backed DOM scraping service. solimen renders a page in a real
Chromium browser, waits for caller-defined CSS selectors to appear, and returns
the fully-rendered DOM as HTML, Markdown, or PDF over a small HTTP API.
Because it drives an actual browser rather than fetching raw HTML, it captures content produced by JavaScript and single-page apps that a plain HTTP client would miss.

solimen launches a persistent Chromium process loaded with a small embedded
Manifest V3 extension and holds a WebSocket to the extension's background worker.
Each scrape flows through that browser:

- A `POST /scrape` request arrives with a URL and optional CSS-selector triggers.
- The server sends a scrape command (with a per-request UUID) over the WebSocket.
- The extension opens a background tab for the URL.
- The injected content script watches the page with a `MutationObserver` until the configured selectors match (or none were given, in which case it exports immediately).
- The rendered DOM is posted back to the server, correlated by request ID, and returned in the requested formats.

Running --instances N starts N independent Chromium instances and dispatches
requests round-robin. Concurrent requests for the same URL are de-duplicated so
only one tab is opened and all callers share the result.
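
The dispatch behaviour described above can be sketched in Go. This is an illustrative model only, not solimen's actual code: the `Pool` and `call` types, the `pick` and `Scrape` methods, and the `chromium-N` labels are all hypothetical names. It shows a round-robin counter for choosing an instance and an in-flight map so that concurrent requests for the same URL share one result.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight scrape that concurrent callers can share.
type call struct {
	done   chan struct{}
	result string
}

// Pool is a hypothetical stand-in for a set of Chromium instances.
type Pool struct {
	instances []string      // labels standing in for real browser handles
	next      atomic.Uint64 // round-robin counter

	mu       sync.Mutex
	inflight map[string]*call // keyed by URL
}

// NewPool creates a pool with n labelled placeholder instances.
func NewPool(n int) *Pool {
	p := &Pool{inflight: make(map[string]*call)}
	for i := 0; i < n; i++ {
		p.instances = append(p.instances, fmt.Sprintf("chromium-%d", i))
	}
	return p
}

// pick returns the next instance in round-robin order.
func (p *Pool) pick() string {
	i := p.next.Add(1) - 1
	return p.instances[i%uint64(len(p.instances))]
}

// Scrape runs fn once per URL among concurrent callers. A caller that
// arrives while the same URL is in flight waits and shares that result.
func (p *Pool) Scrape(url string, fn func(instance, url string) string) string {
	p.mu.Lock()
	if c, ok := p.inflight[url]; ok {
		p.mu.Unlock()
		<-c.done
		return c.result
	}
	c := &call{done: make(chan struct{})}
	p.inflight[url] = c
	p.mu.Unlock()

	c.result = fn(p.pick(), url)

	p.mu.Lock()
	delete(p.inflight, url)
	p.mu.Unlock()
	close(c.done) // makes c.result visible to waiting callers
	return c.result
}

func main() {
	p := NewPool(2)
	render := func(instance, url string) string {
		return instance + " rendered " + url
	}
	fmt.Println(p.Scrape("https://example.com/a", render))
	fmt.Println(p.Scrape("https://example.com/b", render))
	fmt.Println(p.Scrape("https://example.com/c", render))
}
```

In this sketch the in-flight entry is removed once the scrape completes, so a later request for the same URL triggers a fresh render rather than reusing a cached one.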

- Go 1.25+ to build from source.
- Chromium available on `PATH` as `chromium` to run the binary directly.
- wkhtmltopdf on `PATH`, only if you use the `pdf` output format. It is already included in the Docker image; the pure-Go `pdf-simplified` format needs no external binary.

Alternatively, use the Docker image, which bundles Chromium, wkhtmltopdf, and a virtual display.
With Docker:

    docker compose -f docker/docker-compose.yml up --build

Then scrape a page:

    curl -s -X POST http://127.0.0.1:5011/scrape \
      -H 'Content-Type: application/json' \
      -d '{"url": "https://example.com"}'

The response contains the rendered HTML:

    { "state": "loaded", "html": "<!DOCTYPE html><html>…</html>" }

Build the binary:

    make build   # builds ./solimen
    # or
    go build ./cmd/solimen

Build the Docker image:

    docker compose -f docker/docker-compose.yml build

Run the binary (requires `chromium` on `PATH`):

    ./solimen --port 5011 --instances 1

Every flag can also be set via a `SOLIMEN_`-prefixed environment variable.

| Flag | Env var | Default | Description |
|---|---|---|---|
| `-H, --host` | `SOLIMEN_HOST` | `0.0.0.0` | HTTP listen host |
| `-p, --port` | `SOLIMEN_PORT` | `5011` | HTTP listen port |
| `-n, --instances` | `SOLIMEN_INSTANCES` | `1` | Number of parallel Chromium instances |
| `--ext-dir` | `SOLIMEN_EXT_DIR` | (embedded) | Path to a Chromium extension directory, overriding the embedded one |
| `--use-sandbox` | `SOLIMEN_USE_SANDBOX` | `false` | Enable the Chromium sandbox (off by default; most container environments require it off) |

--instances controls how many independent Chromium processes are launched.
Requests are dispatched across them round-robin. A single instance can already
handle several scrapes at once (one background tab per request), so one is enough
for light or sequential use. Raising the count helps when you expect many
concurrent requests: it spreads the load across separate browsers, so a slow or
heavy page doesn't hold up others. Each instance is a full browser, so the memory
cost scales with the count. When running several in the container, raise the
compose mem_limit (2 GB by default) accordingly.
Request body:

| Field | Type | Description |
|---|---|---|
| `url` | string | Required. Page to scrape. |
| `triggers` | object | Optional CSS-selector triggers (see below). |
| `formats` | string[] | Output formats to return. Defaults to `["html"]`. |

Triggers decide when the page is considered ready. Each list holds CSS selectors, and a state fires only when every selector in the list matches at least one element:

    {
      "loaded": ["#main-content", ".results"],
      "failed": [".error-page"]
    }

- `loaded`: selectors that indicate a successful render.
- `failed`: selectors that indicate the page failed to load.

If loaded is empty, the DOM is exported as soon as the page finishes loading.
Scraping times out after 30 seconds if no trigger matches.
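
The matching rule can be modelled in a few lines of Go. This is a sketch for illustration only: the real check runs in the extension's content script, and the `Triggers` type, the `Evaluate` and `allMatch` functions, and the `matches` callback are hypothetical names. Checking `failed` before `loaded` is also an assumption, since the precedence between the two is not specified here. The empty-`loaded` case (export on page load) is outside this sketch.

```go
package main

import "fmt"

// Triggers mirrors the "triggers" object of a scrape request.
type Triggers struct {
	Loaded []string `json:"loaded"`
	Failed []string `json:"failed"`
}

// allMatch reports whether every selector in the list matches at least one
// element. An empty list never fires in this sketch.
func allMatch(selectors []string, matches func(selector string) bool) bool {
	if len(selectors) == 0 {
		return false
	}
	for _, s := range selectors {
		if !matches(s) {
			return false
		}
	}
	return true
}

// Evaluate returns "failed", "loaded", or "" when no state has fired yet.
// Assumption: "failed" is checked first.
func Evaluate(t Triggers, matches func(selector string) bool) string {
	if allMatch(t.Failed, matches) {
		return "failed"
	}
	if allMatch(t.Loaded, matches) {
		return "loaded"
	}
	return ""
}

func main() {
	// present stands in for the selectors that currently match in the DOM.
	present := map[string]bool{"#main-content": true}
	matches := func(s string) bool { return present[s] }

	trig := Triggers{
		Loaded: []string{"#main-content", ".results"},
		Failed: []string{".error-page"},
	}

	fmt.Printf("%q\n", Evaluate(trig, matches)) // only one of two selectors matches

	present[".results"] = true
	fmt.Printf("%q\n", Evaluate(trig, matches)) // both "loaded" selectors match
}
```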
Formats — any combination of:

| Format | Description |
|---|---|
| `html` | The raw rendered DOM. |
| `markdown` | HTML converted to Markdown, with relative links resolved against the page URL. |
| `pdf` | HTML rendered to PDF with wkhtmltopdf, preserving layout and styling. Requires the wkhtmltopdf binary. |
| `pdf-simplified` | HTML converted to Markdown, then to a clean, de-styled PDF document. |

Response:

| Field | Type | Description |
|---|---|---|
| `state` | string | `"loaded"` or `"failed"`, from the matched trigger. |
| `html` | string | Present when `html` was requested. |
| `markdown` | string | Present when `markdown` was requested. |
| `pdf` | string | Base64-encoded PDF bytes, present when `pdf` was requested. |
| `pdf_simplified` | string | Base64-encoded PDF bytes, present when `pdf-simplified` was requested. |

The `/health` endpoint returns 200 with `{"status":"ok","connected":true}` when at
least one Chromium instance has an active extension connection, or 503 with
`"status":"degraded"` otherwise.
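
A caller can use this endpoint as a readiness gate before sending scrapes, for example while the container is still starting. The sketch below is one way to do that. `Health`, `Ready`, and `WaitReady` are hypothetical names, the use of `GET` is an assumption, and the body of the degraded response beyond its `status` field is not assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// Health mirrors the fields of the health response shown above.
type Health struct {
	Status    string `json:"status"`
	Connected bool   `json:"connected"`
}

// Ready interprets a health response: 200, status "ok", and connected.
func Ready(statusCode int, body []byte) (bool, error) {
	var h Health
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return statusCode == http.StatusOK && h.Status == "ok" && h.Connected, nil
}

// WaitReady polls baseURL + "/health" until the service reports ready or the
// attempts are exhausted. Assumption: the endpoint answers GET requests.
func WaitReady(baseURL string, attempts int, delay time.Duration) error {
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(baseURL + "/health")
		if err == nil {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil {
				if ok, _ := Ready(resp.StatusCode, body); ok {
					return nil
				}
			}
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("service not ready after %d attempts", attempts)
}

func main() {
	if err := WaitReady("http://127.0.0.1:5011", 2, 200*time.Millisecond); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("ready")
}
```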
Plain HTML scrape:

    curl -s -X POST http://127.0.0.1:5011/scrape \
      -H 'Content-Type: application/json' \
      -d '{"url": "https://example.com"}'

Wait for a single-page app to render specific elements:

    curl -s -X POST http://127.0.0.1:5011/scrape \
      -H 'Content-Type: application/json' \
      -d '{
        "url": "https://app.example.com/dashboard",
        "triggers": { "loaded": ["#dashboard", ".widget"] }
      }'

Request Markdown instead of HTML:

    curl -s -X POST http://127.0.0.1:5011/scrape \
      -H 'Content-Type: application/json' \
      -d '{"url": "https://example.com", "formats": ["markdown"]}'

The image runs the binary under supervisord alongside an Xvfb virtual
display, since Chromium needs an X server even when headless. Configuration is
passed through the same `SOLIMEN_*` environment variables:

    SOLIMEN_INSTANCES=2 docker compose -f docker/docker-compose.yml up --build

The compose file drops all Linux capabilities and disables the Chromium sandbox
(SOLIMEN_USE_SANDBOX=false), which is the supported configuration for most
container runtimes.
A commented-out seccomp profile (docker/seccomp/chrome.json) is also included.
The intent was to keep the Chromium sandbox enabled while still running the
container as an unprivileged user, since the sandbox otherwise needs elevated
privileges. That profile does not currently work and is not maintained; it is left
in place only as a starting point for anyone who wants to pursue that setup, and
it is not a supported feature.
A production compose file using the published ghcr.io/ma111e/solimen image is in
docker/prod/docker-compose.prod.yml.

    cmd/solimen/
      main.go            CLI entry point, browser lifecycle, extension extraction
    internal/api/        HTTP server (/scrape, /health)
    internal/converter/  HTML → Markdown / PDF conversion
    extension/           embedded Manifest V3 extension (background + content scripts)
    pkg/chromium/        Chromium scraper and round-robin pool
    pkg/models/          shared types (trigger definitions)
    docker/              Dockerfile, compose files, supervisor config

Licensed under the MIT License — see LICENSE.