Crawls product reviews from G2.com and saves them as structured JSON. Uses your real Chrome browser (via CDP) so anti-bot protections are handled naturally — no proxies, cookies, or stealth tricks needed.
```
┌─────────────────────────────────────────────────────────────────┐
│ 1. You launch Chrome with --remote-debugging-port=9222          │
│ 2. The script connects to Chrome via CDP (DevTools Protocol)    │
│ 3. It navigates Chrome to each G2 review page                   │
│ 4. Chrome renders the page normally (handles DataDome/anti-bot) │
│ 5. The script reads the rendered HTML and parses reviews        │
│ 6. Reviews are deduplicated and saved incrementally to JSON     │
└─────────────────────────────────────────────────────────────────┘
```
```
Script (CDP client)             Chrome Browser
┌──────────────────┐            ┌──────────────────┐
│ Page.navigate    │───────────▶│ loads G2 page    │
│ to each review   │            │ passes anti-bot  │
│ page URL         │            │ renders HTML     │
│                  │◀───────────│                  │
│ reads outerHTML  │            │ port 9222        │
└──────────────────┘            └──────────────────┘
         │
         ▼
┌──────────────────┐
│ Scrapling parser │
│ CSS selectors    │──▶ reviews.json
│ (itemprop based) │
└──────────────────┘
```
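The CDP exchange in the diagram boils down to a few JSON messages sent over a WebSocket to Chrome's debugger endpoint. `Page.navigate` and `Runtime.evaluate` are real CDP methods, but the `cdp_command` helper below is illustrative, not the script's actual code, and the WebSocket transport itself is omitted:

```python
import itertools
import json

# CDP messages carry a monotonically increasing id so replies can be matched.
_next_id = itertools.count(1)

def cdp_command(method, **params):
    """Serialize one CDP message: an id, a method name, and its parameters."""
    return json.dumps({"id": next(_next_id), "method": method, "params": params})

# Step 3: navigate the attached tab to a review page.
nav = cdp_command("Page.navigate", url="https://www.g2.com/products/wati/reviews")

# Step 5: after the page renders, read back the full document HTML.
read = cdp_command(
    "Runtime.evaluate",
    expression="document.documentElement.outerHTML",
    returnByValue=True,
)

print(json.loads(nav)["method"])  # Page.navigate
```

In the real script these strings would be sent over the WebSocket URL that Chrome advertises at `http://localhost:9222/json`, and the HTML arrives back in the `Runtime.evaluate` reply.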
G2 uses DataDome anti-bot protection which blocks headless browsers and HTTP clients. By driving your real Chrome, the script avoids this entirely — Chrome is already a trusted browser with a valid session.
Each review is saved with these fields (see `example_output.json`):
| Field | Description |
|---|---|
| `review_id` | G2's internal review identifier |
| `author` | Reviewer name (as shown on G2) |
| `job_title` | Reviewer's job title |
| `company_size` | Company size bucket |
| `date` | Publication date (ISO format) |
| `rating` | Star rating (1.0–5.0) |
| `title` | Review title |
| `likes` | "What do you like best" section |
| `dislikes` | "What do you dislike" section |
| `problems_solving` | "What problems is it solving" section |
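The fields above map naturally onto a small record type. A sketch of one review as it would land in the output JSON; the field names come from the table, but the `Review` dataclass and the sample values are illustrative:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Review:
    review_id: str        # G2's internal review identifier
    author: str
    job_title: str
    company_size: str
    date: str             # ISO format, e.g. "2024-03-15"
    rating: float         # 1.0–5.0
    title: str
    likes: str
    dislikes: str
    problems_solving: str

# Hypothetical sample record, shaped like one entry in the output file.
r = Review(
    review_id="g2-123456", author="Jane D.", job_title="Support Lead",
    company_size="51-200 employees", date="2024-03-15", rating=4.5,
    title="Solid product", likes="Easy setup", dislikes="Pricing tiers",
    problems_solving="Customer messaging at scale",
)
print(json.dumps(asdict(r), indent=2))
```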
```shell
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Launch Chrome with remote debugging enabled:

```shell
# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222
```

Then run the crawler:

```shell
python crawl_g2_reviews.py
```

The script will:
- Connect to Chrome via CDP on port 9222
- Navigate to each review page (Chrome handles anti-bot naturally)
- Read the rendered HTML and parse reviews with CSS selectors
- Save reviews incrementally to a JSON file (safe to interrupt and resume)
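The actual script parses with Scrapling's CSS selectors; the idea behind the itemprop-based selectors can be illustrated with the standard library alone. G2 review pages expose schema.org microdata (e.g. `itemprop="ratingValue"`), and a parser only needs to collect those attributes. This stdlib sketch is an illustration of the technique, not the script's Scrapling code:

```python
from html.parser import HTMLParser

class ItempropExtractor(HTMLParser):
    """Collect schema.org itemprop values from meta tags and element text."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._current = None  # itemprop name waiting for its text node

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if not prop:
            return
        if tag == "meta":
            # <meta itemprop="ratingValue" content="4.5"> carries its value inline.
            self.values[prop] = attrs.get("content")
        else:
            # Other elements carry the value in their text content.
            self._current = prop

    def handle_data(self, data):
        if self._current and data.strip():
            self.values[self._current] = data.strip()
            self._current = None

# Hypothetical fragment shaped like G2's microdata markup.
html = ('<div itemscope><meta itemprop="ratingValue" content="4.5">'
        '<span itemprop="author">Jane D.</span></div>')
p = ItempropExtractor()
p.feed(html)
print(p.values)  # {'ratingValue': '4.5', 'author': 'Jane D.'}
```

Keying selectors off `itemprop` attributes rather than G2's CSS class names makes the parser more resilient, since microdata names are part of a public vocabulary while class names change with redesigns.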
All settings can be overridden with environment variables:

| Variable | Default | Description |
|---|---|---|
| `G2_PRODUCT_SLUG` | `wati` | G2 product slug from the URL |
| `G2_OUTPUT_FILE` | `{slug}_reviews.json` | Output JSON file path |
| `G2_CDP_PORT` | `9222` | Chrome DevTools Protocol port |
| `G2_PAGE_WAIT` | `4` | Seconds to wait for page render |
```shell
G2_PRODUCT_SLUG=slack python crawl_g2_reviews.py
```

The crawler supports incremental crawling. If interrupted, just run it again: it loads previously saved reviews from the output JSON and skips duplicates based on `review_id`.
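The resume logic above amounts to two operations: load the set of `review_id`s already on disk, and append only unseen reviews. A minimal sketch, assuming a JSON array on disk as in `example_output.json`; the helper names are hypothetical:

```python
import json
import os

def load_seen_ids(path):
    """Return the review_ids already saved, so a re-run skips duplicates."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {r["review_id"] for r in json.load(f)}

def append_review(path, review, seen):
    """Save one review incrementally; return False if it was a duplicate."""
    if review["review_id"] in seen:
        return False
    reviews = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            reviews = json.load(f)
    reviews.append(review)
    # Rewrite the whole array so the file is valid JSON after every review,
    # which is what makes interrupting and resuming safe.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(reviews, f, ensure_ascii=False, indent=2)
    seen.add(review["review_id"])
    return True
```

Writing the file after every review trades a little I/O for crash safety: the output is always complete, parseable JSON, so a re-run can pick up exactly where the last one stopped.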
| File | Tracked | Description |
|---|---|---|
| `crawl_g2_reviews.py` | Yes | Main crawler script |
| `requirements.txt` | Yes | Python dependencies |
| `example_output.json` | Yes | Sample output showing the data format |
| `*_reviews.json` | No | Crawled review data |