32 changes: 32 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,32 @@
name: CI

on:
  push:
    branches: [main, modernize-packaging]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff
      - run: ruff check src/ tests/
      - run: ruff format --check src/ tests/

  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -e ".[dev]"
      - run: pytest
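
A note on CI speed: `actions/setup-python@v5` has built-in pip caching. If dependency installs become slow, the setup step in either job could be extended like this (an optional tweak, not part of the workflow above):

```yaml
- uses: actions/setup-python@v5
  with:
    python-version: ${{ matrix.python-version }}
    cache: "pip"  # cache pip downloads between workflow runs
```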
76 changes: 74 additions & 2 deletions .gitignore
@@ -1,5 +1,77 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
*.egg

# Virtual environments
.venv/
venv/
ENV/
env/
bin/
include/

# Installer logs
pip-log.txt
pip-delete-this-directory.txt
pip-selfcheck.json

# Unit test / coverage
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# mypy / pyright
.mypy_cache/
.pyright/

# ruff
.ruff_cache/

# Environments
*.env
.env

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Project-specific
hashtags/
*.session
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,7 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.8.6
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
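
To activate these hooks locally, the standard pre-commit workflow applies:

```bash
pip install pre-commit
pre-commit install          # run ruff on every commit from now on
pre-commit run --all-files  # one-off pass over the entire repo
```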
150 changes: 129 additions & 21 deletions README.md
@@ -1,35 +1,143 @@
# Instagram Hashtag Crawler

Crawl Instagram hashtags and collect post metadata (likes, comments, captions, user profiles) without a developer account.

Uses [instaloader](https://instaloader.github.io/) under the hood.

## Installation

```bash
pip install .
```

With browser cookie support (auto-extract session from Chrome, Firefox, etc.):

```bash
pip install ".[browser]"
```

For development:

```bash
pip install -e ".[dev,browser]"
```
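
The `browser` and `dev` extras are defined in the project's `pyproject.toml`. As a rough sketch of what such an extras table looks like (the package names below are illustrative assumptions, not copied from the real file):

```toml
[project.optional-dependencies]
# Hypothetical pins -- consult pyproject.toml for the actual dependencies.
browser = ["browser-cookie3"]
dev = ["pytest", "ruff", "pre-commit"]
```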

## Usage

### Crawl hashtags

```bash
# Using browser cookies (recommended — auto-extracts session from your browser)
instagram-hashtag-crawler --browser chrome -t foodporn

# If logged in on a non-default Chrome profile, specify the cookie file
instagram-hashtag-crawler --browser chrome \
  --cookie-file ~/Library/Application\ Support/Google/Chrome/Profile\ 1/Cookies \
  -t foodporn

# Using username/password
instagram-hashtag-crawler -u YOUR_USERNAME -p YOUR_PASSWORD -t foodporn

# Multiple hashtags from a file
instagram-hashtag-crawler --browser chrome -f targets.txt

# With options
instagram-hashtag-crawler --browser chrome -t foodporn \
  --max-posts 500 \
  --output-dir ./data \
  -v
```

### Multi-hashtag AND search

Pass `-t` multiple times to find posts that contain **all** specified hashtags:

```bash
# Posts tagged with BOTH #foodporn AND #pizza
instagram-hashtag-crawler --browser chrome -t foodporn -t pizza

# Three-way AND
instagram-hashtag-crawler --browser chrome -t food -t pizza -t italy
```
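
Internally this amounts to an intersection filter over one hashtag's feed. The project's own implementation isn't reproduced here, but a minimal sketch of the idea using instaloader's `Hashtag` and `Post.caption_hashtags` APIs might look like:

```python
import instaloader

def and_search(tags: list[str], max_posts: int = 100):
    """Yield posts whose captions carry every tag in `tags` (illustrative sketch)."""
    loader = instaloader.Instaloader()
    wanted = {t.lower().lstrip("#") for t in tags}
    # Walk the first tag's feed and keep posts that also mention the others.
    feed = instaloader.Hashtag.from_name(loader.context, tags[0]).get_posts()
    matched = 0
    for post in feed:
        if wanted.issubset(set(post.caption_hashtags)):
            yield post
            matched += 1
            if matched >= max_posts:
                break
```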

Output is saved as `food_AND_pizza.json` (tags sorted alphabetically, joined by `_AND_`).
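
The naming scheme is easy to reproduce:

```python
tags = ["pizza", "food"]
filename = "_AND_".join(sorted(tags)) + ".json"  # -> "food_AND_pizza.json"
```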

You can also run it as a module:

```bash
python -m instagram_hashtag_crawler --browser chrome -t foodporn
```

### Export to CSV

```bash
instagram-hashtag-export --json-dir ./hashtags --csv-dir ./output
```
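
The exporter flattens each JSON array of posts into one CSV per file. If you need a custom export, a self-contained sketch (not the bundled `instagram-hashtag-export` implementation, and with an assumed subset of columns) could look like:

```python
import csv
import json
from pathlib import Path

# Assumed columns; pick whichever post fields you need.
COLUMNS = ["shortcode", "username", "like_count", "comment_count", "date"]

def json_dir_to_csv(json_dir: str, csv_dir: str) -> None:
    """Write each <tag>.json array of post objects out as <tag>.csv."""
    out = Path(csv_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in Path(json_dir).glob("*.json"):
        posts = json.loads(src.read_text(encoding="utf-8"))
        with open(out / f"{src.stem}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(posts)
```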

### Options

| Flag | Description | Default |
|------|-------------|---------|
| `--browser` | Auto-extract session from browser (chrome, firefox, safari, edge, brave, etc.) | — |
| `--cookie-file` | Path to browser cookie file (for non-default profiles) | — |
| `-u`, `--username` | Instagram username (not needed with `--browser`) | — |
| `-p`, `--password` | Instagram password (not needed with `--browser`) | — |
| `-t`, `--target` | Hashtag to crawl (without `#`). Repeat for AND search. | — |
| `-f`, `--targetfile` | File with hashtags, one per line | — |
| `--output-dir` | Directory for JSON output | `./hashtags` |
| `--max-posts` | Max posts per hashtag | `100` |
| `--min-posts` | Min posts required | `1` |
| `--since` | Unix timestamp — only collect newer posts | — |
| `--session-file` | Path to save/load session (with `-u`/`-p`) | — |
| `-v`, `--verbose` | Debug logging | off |

### Target file format

One hashtag per line, no `#` prefix:

```
delicious
dish
foodpornography
```

See [`examples/targets.txt`](examples/targets.txt) for a sample.

## Output

Each hashtag produces a JSON file in the output directory:

```
hashtags/
  delicious.json
  dish.json
  food_AND_pizza.json   # multi-hashtag AND result
```

Each JSON file contains an array of post objects with fields like `shortcode`, `user_id`, `username`, `like_count`, `comment_count`, `caption`, `tags`, `pic_url`, `date`, and profile metadata.
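
For illustration, one post object might look like this (all values invented; the exact field set depends on the crawler version):

```json
{
  "shortcode": "CxYz123AbCd",
  "user_id": 123456789,
  "username": "example_user",
  "like_count": 42,
  "comment_count": 7,
  "caption": "Homemade pizza night! #foodporn #pizza",
  "tags": ["foodporn", "pizza"],
  "pic_url": "https://example.com/p/CxYz123AbCd.jpg",
  "date": "2025-01-15T18:30:00"
}
```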

## Development

```bash
# Install dev dependencies
pip install -e ".[dev,browser]"

# Lint
ruff check src/ tests/
ruff format --check src/ tests/

# Test
pytest

# Pre-commit hooks
pre-commit install
```

## Requirements

- Python 3.10+
- An Instagram account (no developer/API access needed)

## License

MIT
75 changes: 0 additions & 75 deletions __init__.py

This file was deleted.
