Commit 03f792b

feat(skills): add use-local-whisper skill package (#702)
Thanks for the great contribution @glifocat! This is a really well-structured skill — clean package, thorough docs, and solid test coverage. Hope to see more skills like this from you!
1 parent 5e3d8b6 commit 03f792b

7 files changed

Lines changed: 394 additions & 0 deletions

.claude/skills/add-voice-transcription/modify/src/channels/whatsapp.test.ts

Lines changed: 4 additions & 0 deletions
```diff
@@ -90,6 +90,10 @@ vi.mock('@whiskeysockets/baileys', () => {
     timedOut: 408,
     restartRequired: 515,
   },
+  fetchLatestWaWebVersion: vi
+    .fn()
+    .mockResolvedValue({ version: [2, 3000, 0] }),
+  normalizeMessageContent: vi.fn((content: unknown) => content),
   makeCacheableSignalKeyStore: vi.fn((keys: unknown) => keys),
   useMultiFileAuthState: vi.fn().mockResolvedValue({
     state: {
```

.claude/skills/add-voice-transcription/modify/src/channels/whatsapp.test.ts.intent.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -8,6 +8,7 @@ Added mock for the transcription module and 3 new test cases for voice message h
 ### Mocks (top of file)
 - Added: `vi.mock('../transcription.js', ...)` with `isVoiceMessage` and `transcribeAudioMessage` mocks
 - Added: `import { transcribeAudioMessage } from '../transcription.js'` for test assertions
+- Updated: Baileys mock to include `fetchLatestWaWebVersion` and `normalizeMessageContent` exports (required by current upstream whatsapp.ts)

 ### Test cases (inside "message handling" describe block)
 - Changed: "handles message with no extractable text (e.g. voice note without caption)" → "transcribes voice messages"
```

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@

---
name: use-local-whisper
description: Use when the user wants local voice transcription instead of OpenAI Whisper API. Switches to whisper.cpp running on Apple Silicon. WhatsApp only for now. Requires voice-transcription skill to be applied first.
---

# Use Local Whisper

Switches voice transcription from OpenAI's Whisper API to local whisper.cpp. Runs entirely on-device — no API key, no network, no cost.

**Channel support:** Currently WhatsApp only. The transcription module (`src/transcription.ts`) uses Baileys types for audio download. Other channels (Telegram, Discord, etc.) would need their own audio-download logic before this skill can serve them.

**Note:** The Homebrew package is `whisper-cpp`, but the CLI binary it installs is `whisper-cli`.
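
To confirm what the formula actually installs on your machine (assumes `whisper-cpp` is already installed via Homebrew):

```bash
# List the binaries shipped by the Homebrew formula
brew list whisper-cpp | grep '/bin/'
```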

## Prerequisites

- `voice-transcription` skill must be applied first (WhatsApp channel)
- macOS with Apple Silicon (M1+) recommended
- `whisper-cpp` installed: `brew install whisper-cpp` (provides the `whisper-cli` binary)
- `ffmpeg` installed: `brew install ffmpeg`
- A GGML model file downloaded to `data/models/`

## Phase 1: Pre-flight

### Check if already applied

Read `.nanoclaw/state.yaml`. If `use-local-whisper` is in `applied_skills`, skip to Phase 3 (Verify).
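
A quick way to check from the shell, assuming `applied_skills` entries appear as plain strings in that file (adjust the pattern if the state file nests them differently):

```bash
# Assumes applied_skills is a flat list in .nanoclaw/state.yaml
grep -q 'use-local-whisper' .nanoclaw/state.yaml && echo "ALREADY_APPLIED" || echo "NOT_APPLIED"
```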

### Check dependencies are installed

```bash
whisper-cli --help >/dev/null 2>&1 && echo "WHISPER_OK" || echo "WHISPER_MISSING"
ffmpeg -version >/dev/null 2>&1 && echo "FFMPEG_OK" || echo "FFMPEG_MISSING"
```

If missing, install via Homebrew:
```bash
brew install whisper-cpp ffmpeg
```

### Check for model file

```bash
ls data/models/ggml-*.bin 2>/dev/null || echo "NO_MODEL"
```

If no model exists, download the base model (148MB, good balance of speed and accuracy):
```bash
mkdir -p data/models
curl -L -o data/models/ggml-base.bin "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"
```

For better accuracy at the cost of speed, use `ggml-small.bin` (466MB) or `ggml-medium.bin` (1.5GB).
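
The larger models follow the same naming scheme in the same Hugging Face repo, so the download is analogous; for example:

```bash
# Same URL pattern as the base model; substitute ggml-medium.bin for the largest option above
curl -L -o data/models/ggml-small.bin "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin"
```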

## Phase 2: Apply Code Changes

```bash
npx tsx scripts/apply-skill.ts .claude/skills/use-local-whisper
```

This modifies `src/transcription.ts` to use the `whisper-cli` binary instead of the OpenAI API.

### Validate

```bash
npm test
npm run build
```

## Phase 3: Verify

### Ensure launchd PATH includes Homebrew

The NanoClaw launchd service runs with a restricted PATH. `whisper-cli` and `ffmpeg` are in `/opt/homebrew/bin/` (Apple Silicon) or `/usr/local/bin/` (Intel), which may not be in the plist's PATH.

Check the current PATH:
```bash
grep -A1 'PATH' ~/Library/LaunchAgents/com.nanoclaw.plist
```

If `/opt/homebrew/bin` is missing, add it to the `<string>` value inside the `PATH` key in the plist. Then reload:
```bash
launchctl unload ~/Library/LaunchAgents/com.nanoclaw.plist
launchctl load ~/Library/LaunchAgents/com.nanoclaw.plist
```
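
If you prefer not to edit the XML by hand, `PlistBuddy` can read and update the value. This sketch assumes the plist keeps PATH under a standard `EnvironmentVariables` dict; adjust the key path if yours differs:

```bash
# Assumes the standard launchd EnvironmentVariables layout in com.nanoclaw.plist
/usr/libexec/PlistBuddy -c 'Print :EnvironmentVariables:PATH' ~/Library/LaunchAgents/com.nanoclaw.plist
/usr/libexec/PlistBuddy -c 'Set :EnvironmentVariables:PATH /opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin' ~/Library/LaunchAgents/com.nanoclaw.plist
```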

### Build and restart

```bash
npm run build
launchctl kickstart -k gui/$(id -u)/com.nanoclaw
```

### Test

Send a voice note in any registered group. The agent should receive it as `[Voice: <transcript>]`.

### Check logs

```bash
tail -f logs/nanoclaw.log | grep -i -E "voice|transcri|whisper"
```

Look for:
- `Transcribed voice message` — successful transcription
- `whisper.cpp transcription failed` — check model path, ffmpeg, or PATH

## Configuration

Environment variables (optional, set in `.env`):

| Variable | Default | Description |
|----------|---------|-------------|
| `WHISPER_BIN` | `whisper-cli` | Path to whisper.cpp binary |
| `WHISPER_MODEL` | `data/models/ggml-base.bin` | Path to GGML model file |
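
For example, a `.env` that points at an explicit binary path and the small model (values here are illustrative):

```bash
# Illustrative .env overrides — adjust paths to your setup
WHISPER_BIN=/opt/homebrew/bin/whisper-cli
WHISPER_MODEL=data/models/ggml-small.bin
```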

## Troubleshooting

**"whisper.cpp transcription failed"**: Ensure both `whisper-cli` and `ffmpeg` are in PATH. The launchd service uses a restricted PATH — see Phase 3 above. Test manually:
```bash
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 1 -f wav -y /tmp/test.wav
whisper-cli -m data/models/ggml-base.bin -f /tmp/test.wav --no-timestamps -nt
```

**Transcription works in dev but not as service**: The launchd plist PATH likely doesn't include `/opt/homebrew/bin`. See "Ensure launchd PATH includes Homebrew" in Phase 3.

**Slow transcription**: The base model processes ~30s of audio in <1s on M1+. If slower, check CPU usage — another process may be competing.

**Wrong language**: whisper.cpp auto-detects language. To force a language, you can set `WHISPER_LANG` and modify `src/transcription.ts` to pass `-l $WHISPER_LANG`.
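
A minimal sketch of that change inside `transcribeWithWhisperCpp` (the `WHISPER_LANG` variable and this argument handling are not part of the shipped code; they are one way to wire it up):

```ts
// Hypothetical addition to src/transcription.ts — not part of this commit.
// WHISPER_BIN, WHISPER_MODEL, tmpWav, and execFileAsync already exist in that file.
const WHISPER_LANG = process.env.WHISPER_LANG || '';

const whisperArgs = ['-m', WHISPER_MODEL, '-f', tmpWav, '--no-timestamps', '-nt'];
if (WHISPER_LANG) {
  whisperArgs.push('-l', WHISPER_LANG); // e.g. 'de' or 'es'; whisper-cli's language flag
}
const { stdout } = await execFileAsync(WHISPER_BIN, whisperArgs, { timeout: 60_000 });
```
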
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@

```yaml
skill: use-local-whisper
version: 1.0.0
description: "Switch voice transcription from OpenAI Whisper API to local whisper.cpp (WhatsApp only)"
core_version: 0.1.0
adds: []
modifies:
  - src/transcription.ts
structured: {}
conflicts: []
depends:
  - voice-transcription
test: "npx vitest run src/channels/whatsapp.test.ts"
```

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@

```ts
import { execFile } from 'child_process';
import fs from 'fs';
import os from 'os';
import path from 'path';
import { promisify } from 'util';

import { downloadMediaMessage, WAMessage, WASocket } from '@whiskeysockets/baileys';

const execFileAsync = promisify(execFile);

const WHISPER_BIN = process.env.WHISPER_BIN || 'whisper-cli';
const WHISPER_MODEL =
  process.env.WHISPER_MODEL ||
  path.join(process.cwd(), 'data', 'models', 'ggml-base.bin');

const FALLBACK_MESSAGE = '[Voice Message - transcription unavailable]';

async function transcribeWithWhisperCpp(
  audioBuffer: Buffer,
): Promise<string | null> {
  const tmpDir = os.tmpdir();
  const id = `nanoclaw-voice-${Date.now()}`;
  const tmpOgg = path.join(tmpDir, `${id}.ogg`);
  const tmpWav = path.join(tmpDir, `${id}.wav`);

  try {
    fs.writeFileSync(tmpOgg, audioBuffer);

    // Convert ogg/opus to 16kHz mono WAV (required by whisper.cpp)
    await execFileAsync('ffmpeg', [
      '-i', tmpOgg,
      '-ar', '16000',
      '-ac', '1',
      '-f', 'wav',
      '-y', tmpWav,
    ], { timeout: 30_000 });

    const { stdout } = await execFileAsync(WHISPER_BIN, [
      '-m', WHISPER_MODEL,
      '-f', tmpWav,
      '--no-timestamps',
      '-nt',
    ], { timeout: 60_000 });

    const transcript = stdout.trim();
    return transcript || null;
  } catch (err) {
    console.error('whisper.cpp transcription failed:', err);
    return null;
  } finally {
    for (const f of [tmpOgg, tmpWav]) {
      try { fs.unlinkSync(f); } catch { /* best effort cleanup */ }
    }
  }
}

export async function transcribeAudioMessage(
  msg: WAMessage,
  sock: WASocket,
): Promise<string | null> {
  try {
    const buffer = (await downloadMediaMessage(
      msg,
      'buffer',
      {},
      {
        logger: console as any,
        reuploadRequest: sock.updateMediaMessage,
      },
    )) as Buffer;

    if (!buffer || buffer.length === 0) {
      console.error('Failed to download audio message');
      return FALLBACK_MESSAGE;
    }

    console.log(`Downloaded audio message: ${buffer.length} bytes`);

    const transcript = await transcribeWithWhisperCpp(buffer);

    if (!transcript) {
      return FALLBACK_MESSAGE;
    }

    console.log(`Transcribed voice message: ${transcript.length} chars`);
    return transcript.trim();
  } catch (err) {
    console.error('Transcription error:', err);
    return FALLBACK_MESSAGE;
  }
}

export function isVoiceMessage(msg: WAMessage): boolean {
  return msg.message?.audioMessage?.ptt === true;
}
```
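
For context, this is roughly how a channel handler is expected to consume these exports (the real call site is `src/channels/whatsapp.ts`, which is not part of this diff; the function below is purely illustrative):

```ts
// Illustrative consumer only — not part of this commit.
import { WAMessage, WASocket } from '@whiskeysockets/baileys';

import { isVoiceMessage, transcribeAudioMessage } from './transcription.js';

// Turn an incoming message into plain text, transcribing voice notes locally.
async function messageToText(msg: WAMessage, sock: WASocket): Promise<string | null> {
  if (isVoiceMessage(msg)) {
    const transcript = await transcribeAudioMessage(msg, sock);
    return transcript ? `[Voice: ${transcript}]` : null;
  }
  return msg.message?.conversation ?? null;
}
```
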
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@

# Intent: src/transcription.ts modifications

## What changed
Replaced the OpenAI Whisper API backend with local whisper.cpp CLI execution. Audio is converted from ogg/opus to 16kHz mono WAV via ffmpeg, then transcribed locally using whisper-cpp. No API key or network required.

## Key sections

### Imports
- Removed: `readEnvFile` from `./env.js` (no API key needed)
- Added: `execFile` from `child_process`, `fs`, `os`, `path`, `promisify` from `util`

### Configuration
- Removed: `TranscriptionConfig` interface and `DEFAULT_CONFIG` (no model/enabled/fallback config)
- Added: `WHISPER_BIN` constant (env `WHISPER_BIN` or `'whisper-cli'`)
- Added: `WHISPER_MODEL` constant (env `WHISPER_MODEL` or `data/models/ggml-base.bin`)
- Added: `FALLBACK_MESSAGE` constant

### transcribeWithWhisperCpp (replaces transcribeWithOpenAI)
- Writes audio buffer to temp .ogg file
- Converts to 16kHz mono WAV via ffmpeg
- Runs whisper-cpp CLI with `--no-timestamps -nt` flags
- Cleans up temp files in finally block
- Returns trimmed stdout or null on error

### transcribeAudioMessage
- Same signature: `(msg: WAMessage, sock: WASocket) => Promise<string | null>`
- Same download logic via `downloadMediaMessage`
- Calls `transcribeWithWhisperCpp` instead of `transcribeWithOpenAI`
- Same fallback behavior on error/null

### isVoiceMessage
- Unchanged: `msg.message?.audioMessage?.ptt === true`

## Invariants (must-keep)
- `transcribeAudioMessage` export signature unchanged
- `isVoiceMessage` export unchanged
- Fallback message strings unchanged: `[Voice Message - transcription unavailable]`
- downloadMediaMessage call pattern unchanged
- Error logging pattern unchanged

0 commit comments