I've investigated doing transcriptions for JB shows in the past, but haven't reached a satisfying conclusion yet.
What I'd expect from a decent transcription would be:
- Good overall correctness of the result.
- Specific terminology and names can be hard to get right; a nice-to-have would be the ability to correct results and feed them back to improve detection
- Speaker diarisation (=recognising who speaks when)
- Detecting sentences/punctuation
A few services I took a look at:
| Service/framework | Quality | Speaker diarisation | Punctuation |
|---|---|---|---|
| YouTube transcription (exported with youtube-dl) | B | ❌ | ❌ |
| IBM Watson | C | ✅ | ❌ |
| SpeechBrain | ? | ✅ | ? |
| DeepSpeech | ? | ❌ | ❌ |
| AssemblyAI | B- | ✅ | ✅ |
| Whisper by OpenAI (medium model) | A | ❌ | ✅ |
| pyannote.audio | ❌ | ✅ | ❌ |
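Whisper and pyannote.audio look complementary here: Whisper gives the best text quality and punctuation but no diarisation, while pyannote.audio only does diarisation. Below is a minimal sketch of combining the two, assuming both packages are installed, that `episode.wav` is a hypothetical local copy of a show's audio, and that the pretrained pyannote pipeline is accessible (newer pyannote releases require a Hugging Face access token):

```python
# Minimal sketch: Whisper for text + punctuation, pyannote.audio for diarisation.
# "episode.wav" is a placeholder for a local audio file.
import whisper
from pyannote.audio import Pipeline

# Transcribe with Whisper's medium model; the result contains timestamped segments.
model = whisper.load_model("medium")
transcription = model.transcribe("episode.wav")

# Run speaker diarisation with pyannote.audio's pretrained pipeline.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("episode.wav")

def speaker_at(t):
    """Return the speaker label active at time t (in seconds), if any."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# Attribute each Whisper segment to the speaker active at its midpoint.
for seg in transcription["segments"]:
    midpoint = (seg["start"] + seg["end"]) / 2
    print(f'{speaker_at(midpoint)}: {seg["text"].strip()}')
```

Attributing each segment by its midpoint is a simplification; a segment spanning a speaker change would need to be split.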
I tested a combination of YouTube and IBM Watson (free tier) in the past: https://gist.github.com/pagdot/3b39187c6e0ca18dedd1f1108338855f
The result was... ok. Not great, but better than nothing.
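For reference, exporting YouTube's auto-generated captions (as used in the gist above) only requires asking youtube-dl for the automatic subtitles without downloading the video. A minimal sketch using its Python API, where `VIDEO_URL` is a placeholder:

```python
# Minimal sketch: fetch YouTube's auto-generated captions with youtube-dl.
import youtube_dl

opts = {
    "writeautomaticsub": True,   # grab the auto-generated captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "skip_download": True,       # captions only, no video
}
with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["VIDEO_URL"])  # placeholder for a show's YouTube link
```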
In my Google Colab notebook, I also tried out DeepSpeech by Mozilla.
If anyone is interested in taking a look as well, Google Colab is a great way to test on a big GPU offered by Google, and there are often example projects for Colab, either by the projects themselves or by the community.
Either way, a platform to run the transcription on in production would be required, and maybe even a way to contribute to their quality.
I could imagine pushing the results to this or another git repository, so that the community can make PRs with fixes.
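As a rough illustration of that idea, the attributed segments could be dumped into one plain-text file per episode, which Git diffs well and the community could fix via PRs. Everything here (the `speaker` key and the output path) is a hypothetical continuation of the sketch above, not existing tooling:

```python
# Hypothetical sketch: write "[MM:SS] SPEAKER: text" lines per episode, so
# corrections can arrive as ordinary PRs against plain-text files.
from pathlib import Path

def write_transcript(segments, path):
    """Write one timestamped, speaker-attributed line per segment."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f'[{minutes:02d}:{seconds:02d}] {seg["speaker"]}: {seg["text"].strip()}')
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n")

# e.g. write_transcript(attributed_segments, "transcripts/episode-001.txt")
```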
Edit:
2022-08-18: Fixed YouTube entry in table (sadly it has no punctuation); added entry for AssemblyAI