I've investigated doing transcriptions for JB shows in the past, but haven't reached a satisfying conclusion yet.
What I'd expect from a decent transcription would be:
- Good overall correctness of the result.
- Specific terminology and names can be hard to get right; a nice-to-have would be the ability to correct results and feed them back to improve detection
- Speaker diarisation (=recognising who speaks when)
- Detecting sentences/punctuation
A few services I took a look at:
| Service/framework | Quality | Speaker diarisation | Punctuation |
|---|---|---|---|
| YouTube transcription (exported with youtube-dl) | B | ❌ | ❌ |
| IBM Watson | C | ✅ | ❌ |
| SpeechBrain | ? | ✅ | ? |
| DeepSpeech | ? | ❌ | ❌ |
| AssemblyAI | B- | ✅ | ✅ |
| Whisper by OpenAI (medium model) | A | ❌ | ✅ |
| pyannote.audio | ❌ | ✅ | ❌ |
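Whisper and pyannote.audio look complementary here: Whisper gives the best text quality and punctuation but no diarisation, while pyannote.audio only does diarisation. Below is a minimal sketch of combining the two, assuming both packages are installed, that `episode.wav` is a hypothetical local copy of a show's audio, and that the pretrained pyannote pipeline is accessible (newer pyannote releases require a Hugging Face access token):

```python
# Minimal sketch: Whisper for text + punctuation, pyannote.audio for diarisation.
# "episode.wav" is a placeholder for a local audio file.
import whisper
from pyannote.audio import Pipeline

# Transcribe with Whisper's medium model; the result contains timestamped segments.
model = whisper.load_model("medium")
transcription = model.transcribe("episode.wav")

# Run speaker diarisation with pyannote.audio's pretrained pipeline.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("episode.wav")

def speaker_at(t):
    """Return the speaker label active at time t (in seconds), if any."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# Attribute each Whisper segment to the speaker active at its midpoint.
for seg in transcription["segments"]:
    midpoint = (seg["start"] + seg["end"]) / 2
    print(f'{speaker_at(midpoint)}: {seg["text"].strip()}')
```

Attributing each segment by its midpoint is a simplification; a segment spanning a speaker change would need to be split.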
I tested a combination of YouTube and IBM Watson (free tier) in the past: https://gist.github.com/pagdot/3b39187c6e0ca18dedd1f1108338855f
The result was... ok. Not great, but better than nothing.
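For reference, exporting YouTube's auto-generated captions (as used in the gist above) only requires asking youtube-dl for the automatic subtitles without downloading the video. A minimal sketch using its Python API, where `VIDEO_URL` is a placeholder:

```python
# Minimal sketch: fetch YouTube's auto-generated captions with youtube-dl.
import youtube_dl

opts = {
    "writeautomaticsub": True,   # grab the auto-generated captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "skip_download": True,       # captions only, no video
}
with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["VIDEO_URL"])  # placeholder for a show's YouTube link
```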
In my Google Colab notebook, I also tried out DeepSpeech by Mozilla.
If anyone is interested in taking a look as well, Google Colab is a great way to test on a big GPU offered by Google, and there are often example projects for Colab, either by the projects themselves or by the community.
Either way, a platform to run the transcription on in production would be required, and maybe even a way to contribute to their quality.
I could imagine pushing the results to this or another git repository, so that the community can make PRs with fixes.
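As a rough illustration of that idea, the attributed segments could be dumped into one plain-text file per episode, which Git diffs well and the community could fix via PRs. Everything here (the `speaker` key and the output path) is a hypothetical continuation of the sketch above, not existing tooling:

```python
# Hypothetical sketch: write "[MM:SS] SPEAKER: text" lines per episode, so
# corrections can arrive as ordinary PRs against plain-text files.
from pathlib import Path

def write_transcript(segments, path):
    """Write one timestamped, speaker-attributed line per segment."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f'[{minutes:02d}:{seconds:02d}] {seg["speaker"]}: {seg["text"].strip()}')
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n")

# e.g. write_transcript(attributed_segments, "transcripts/episode-001.txt")
```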
Edit:
2022-08-18: Fixed YouTube entry in table (sadly it has no punctuation); added entry for AssemblyAI