README.md (+50 -14): 50 additions & 14 deletions
@@ -1,8 +1,17 @@
# discrawl 🛰️ — Mirror Discord into SQLite; search server history locally

-`discrawl` mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search. It can also import classifiable Discord Desktop cache messages for DM recovery/search without using a user token. Teams can publish that archive as a private Git snapshot repo, so readers get fresh org memory without Discord bot credentials.
+`discrawl` mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search. It can also import classifiable Discord Desktop cache messages for DM recovery/search without using a user token.

-Live guild sync uses real bot tokens. Desktop wiretap mode reads local cache artifacts only; it does not extract credentials or run a selfbot. Data stays local unless you explicitly publish a Git-backed snapshot.
+Teams can publish the archive as a private Git snapshot repo, so readers get fresh org memory without Discord bot credentials.
+
+There are two local archive sources:
+
+- Discord bot API sync for guilds, channels, members, threads, and message history the configured bot can access
+- Discord Desktop cache import for local, classifiable cached messages, including proven DMs under `@me`
+
+Desktop wiretap mode reads local cache artifacts only. It does not extract credentials, use user tokens, call the Discord API as your user, or run a selfbot.
+
+Data stays local unless you explicitly publish a Git-backed snapshot.

## What It Does

@@ -104,7 +113,7 @@ Examples below assume `discrawl` is on `PATH`. If you built from source without

## Quick Start

-Reuse an existing OpenClaw Discord bot config:
+Reuse an existing OpenClaw Discord bot config and refresh both bot-visible guild data and local desktop cache data:
Use `discrawl sync --source wiretap` when you only want the local Discord Desktop cache import and do not want bot-token API sync.
+

Multi-account OpenClaw setup:

```bash
@@ -169,31 +179,48 @@ When OpenClaw config tokens use `${ENV_VAR}` placeholders, `init` and `doctor` r

### `sync`

-Refreshes guild state into SQLite. Run one explicit `--full` pass when you want a complete historical archive; use plain `sync` afterward for frequent latest-message refreshes.
+Refreshes SQLite from one or both archive sources.
+
+By default, `sync` runs both sources:
+
+- Discord bot-token sync for bot-visible guild data
+- local Discord Desktop cache import for classifiable cached messages and proven DMs
+
+Run one explicit `--full` pass when you want a complete historical guild archive. Use plain `sync` afterward for frequent latest-message and desktop-cache refreshes.

```bash
discrawl sync
discrawl sync --full
discrawl sync --full --all
discrawl sync --guild 123456789012345678
discrawl sync --guilds 123,456 --concurrency 8
-discrawl sync --source both
-discrawl sync --source discord
-discrawl sync --source wiretap
+discrawl sync --source both  # default: bot API + desktop cache
+discrawl sync --source discord  # bot API only; aliases: key, bot, api
+| `both` | Discord bot API and local Discord Desktop cache | bot-visible guild data plus classifiable cached desktop messages |
+| `discord` / `key` | Discord bot API | guilds, channels, threads, members, and messages the bot can access |
+| `wiretap` | local Discord Desktop cache files | classifiable cached messages; proven DMs are stored under `@me` |
+
+Sync modes control the Discord bot API side of a run. When `wiretap` is selected, the desktop cache import runs once alongside the chosen bot sync mode.
+
+Bot sync modes:

| Command | Use when | Behavior |
| --- | --- | --- |
| `discrawl sync` | routine refresh | imports any stale Git snapshot first, skips member refreshes, checks live top-level channels plus active threads, and only fetches new messages for channels with a stored latest cursor |
| `discrawl sync --all-channels` | repair pass | broad incremental sweep across every stored channel/thread, including archived threads |
| `discrawl sync --full` | historical backfill | crawls older history until channels are complete; can take a long time on large servers |

-`sync` already uses parallel channel workers. `--concurrency` overrides the default, and the default is auto-sized from `GOMAXPROCS` with a floor of `8` and a cap of `32`.
-`--source` selects what gets refreshed: `both` (default), `discord`/`key` for bot-token API sync only, or `wiretap` for local Discord Desktop cache import only.
+`sync` already uses parallel channel workers for bot API message crawling.
+`--concurrency` overrides the default, and the default is auto-sized from `GOMAXPROCS` with a floor of `8` and a cap of `32`.
`--all` ignores `default_guild_id` and fans out across every discovered guild the bot can access.
`--skip-members` refreshes guild/channel/message data without crawling the full member list, which is useful for frequent Git snapshot publishers that only need latest messages.
`--latest-only` is still accepted for explicit latest-only runs; it is now the default for untargeted `sync`. Use `--all-channels` to opt out of the fast default without doing a full historical crawl.
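
For readers of this hunk, the `--source` aliases and the `--concurrency` default described above can be pictured with a small Go sketch. It is illustrative only and is not taken from the discrawl source: the function names `normalizeSource` and `defaultConcurrency` are hypothetical, treating an empty value as `both` is an assumption based on the documented default, and clamping `GOMAXPROCS` directly (with no scaling factor) is also an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// normalizeSource maps a --source value onto one of the three documented
// sources. The aliases (key, bot, api) come from the README example comment;
// the function itself and its error text are hypothetical.
func normalizeSource(v string) (string, error) {
	switch v {
	case "", "both":
		// Assumption: an unset flag behaves like the documented default.
		return "both", nil
	case "discord", "key", "bot", "api":
		return "discord", nil
	case "wiretap":
		return "wiretap", nil
	}
	return "", fmt.Errorf("unknown source %q", v)
}

// defaultConcurrency clamps a processor count to the documented floor (8)
// and cap (32). Assumption: the base value is GOMAXPROCS itself, unscaled.
func defaultConcurrency(procs int) int {
	const floor, ceiling = 8, 32
	if procs < floor {
		return floor
	}
	if procs > ceiling {
		return ceiling
	}
	return procs
}

func main() {
	for _, v := range []string{"both", "key", "api", "wiretap"} {
		src, err := normalizeSource(v)
		fmt.Println(v, "->", src, err)
	}
	// runtime.GOMAXPROCS(0) reads the current setting without changing it.
	fmt.Println("workers:", defaultConcurrency(runtime.GOMAXPROCS(0)))
}
```

An explicit `--concurrency` value would simply bypass `defaultConcurrency` in this sketch.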
-Imports classifiable Discord Desktop message payloads into the same local SQLite archive. This is the path for searchable DMs because bot tokens cannot read personal direct messages.
+Imports classifiable Discord Desktop message payloads into the same local SQLite archive.
+
+This is the path for searchable DMs because bot tokens cannot read personal direct messages.
+
+`wiretap` is also available through `discrawl sync --source wiretap` and is included in the default `discrawl sync --source both` path.
-- stores only classifiable cache messages in the normal `guilds` / `channels` / `messages` tables
+- stores classifiable cache messages in the same `guilds`, `channels`, and `messages` tables used by bot sync
- stores proven DMs under the synthetic guild id `@me`
-- drops message payloads whose channel cannot be classified from cached channel metadata or Discord route URLs
+- drops message payloads whose channel cannot be classified from cached channel metadata or Discord route URLs; dropped rows are counted as `skipped_messages`
+- imports what Discord Desktop has cached locally, not complete live DM history
- scans local `.ldb`, `.log`, `.json`, and `.txt` artifacts for Discord message JSON
- does not extract, store, or print Discord auth tokens
- `--max-file-bytes` skips unusually large files; default is 64 MiB
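
The import rules in this list can be sketched in Go roughly as follows. This is a simplified illustration under stated assumptions, not discrawl's actual code: the types `cachedMessage`, `channelInfo`, and `importStats` and the functions `shouldScan` and `importMessages` are hypothetical, the in-memory map stands in for the SQLite tables, and "skips unusually large files" is read here as "skips files larger than the limit".

```go
package main

import (
	"fmt"
	"path/filepath"
)

const (
	dmGuildID           = "@me"    // synthetic guild id the README names for proven DMs
	defaultMaxFileBytes = 64 << 20 // 64 MiB, the documented --max-file-bytes default
)

// shouldScan reports whether a cache file would be read. The extension list
// is the one given in the README; the size comparison is an assumption.
func shouldScan(name string, size, maxBytes int64) bool {
	if size > maxBytes {
		return false
	}
	switch filepath.Ext(name) {
	case ".ldb", ".log", ".json", ".txt":
		return true
	}
	return false
}

// cachedMessage and channelInfo are hypothetical, simplified shapes.
type cachedMessage struct {
	ID        string
	ChannelID string
}

type channelInfo struct {
	GuildID string // ignored when IsDM is true
	IsDM    bool
}

type importStats struct {
	Stored          int
	SkippedMessages int // mirrors the skipped_messages counter named in the README
}

// importMessages groups classifiable messages by guild id and counts the rest
// as skipped, instead of storing them under an invented placeholder guild.
func importMessages(msgs []cachedMessage, known map[string]channelInfo) (map[string][]cachedMessage, importStats) {
	byGuild := make(map[string][]cachedMessage)
	var stats importStats
	for _, m := range msgs {
		info, ok := known[m.ChannelID]
		if !ok {
			stats.SkippedMessages++
			continue
		}
		guild := info.GuildID
		if info.IsDM {
			guild = dmGuildID
		}
		byGuild[guild] = append(byGuild[guild], m)
		stats.Stored++
	}
	return byGuild, stats
}

func main() {
	known := map[string]channelInfo{
		"c1": {GuildID: "123"},
		"c2": {IsDM: true},
	}
	msgs := []cachedMessage{{"m1", "c1"}, {"m2", "c2"}, {"m3", "unknown"}}
	byGuild, stats := importMessages(msgs, known)
	fmt.Println(len(byGuild[dmGuildID]), stats.Stored, stats.SkippedMessages)
	fmt.Println(shouldScan("000123.ldb", 1024, defaultMaxFileBytes))
}
```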
@@ -565,6 +597,10 @@ With remote providers, message text is sent during `discrawl embed`, and search
- FTS index rows
- optional local embedding queue metadata and vectors

+Messages imported from Discord Desktop use the same message, attachment, mention, and FTS paths as bot-synced messages.
+
+Proven DMs use `@me` as their guild id. Unclassifiable desktop-cache payloads are skipped instead of being stored as unknown synthetic data.
+
SQLite schema migrations are versioned with `PRAGMA user_version`. Startup now fails fast when a local DB schema is newer than the supported binary.