
Transcriptions #301

@pagdot

I've investigated creating transcriptions for JB shows in the past, but haven't reached a satisfying conclusion yet.

What I'd expect from a decent transcription would be:

  • Reasonably accurate results.
    • Specific terminology and names can be hard to get right. A nice-to-have would be the ability to correct results and feed the corrections back to improve recognition.
  • Speaker diarisation (= recognising who speaks when); see the sketch after this list.
  • Detecting sentences/punctuation
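
For the diarisation requirement, pyannote.audio (also listed in the table below) provides pretrained pipelines. A minimal sketch, assuming `pip install pyannote.audio`, a HuggingFace access token for the gated pretrained pipeline, and a placeholder audio file name:

```python
# Sketch: speaker diarisation with a pyannote.audio pretrained pipeline.
# Assumes: pyannote.audio installed, a HuggingFace token with access to the
# gated "pyannote/speaker-diarization" model, and a local audio file
# (file name and token below are placeholders).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="hf_xxx",  # placeholder token
)

# Run diarisation on an episode (placeholder file name)
diarization = pipeline("episode.wav")

# Print who speaks when
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")
```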

A few services I took a look at:

| Service/framework | Quality | Speaker diarisation | Punctuation |
| --- | --- | --- | --- |
| YouTube transcription (exported with youtube-dl) | B | | no |
| IBM Watson | C | | |
| SpeechBrain | ? | ? | |
| DeepSpeech | ? | | |
| AssemblyAI | B- | | |
| Whisper by OpenAI (medium model) | A | | |
| pyannote.audio | | | |
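
Since Whisper (medium model) scored best in the table above, here is a minimal sketch of what a transcription run with it could look like; it assumes `pip install openai-whisper`, ffmpeg available on the system, and a placeholder file name:

```python
# Sketch: transcription with OpenAI Whisper (medium model).
# Assumes: openai-whisper installed, ffmpeg on the system,
# and a local audio file "episode.mp3" (placeholder name).
import whisper

model = whisper.load_model("medium")
result = model.transcribe("episode.mp3")

# Full transcript text
print(result["text"])

# Timestamped segments (useful for aligning with diarisation output later)
for seg in result["segments"]:
    print(f"{seg['start']:7.1f}s - {seg['end']:7.1f}s  {seg['text'].strip()}")
```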

I tested a combination of YouTube and IBM Watson (free tier) in the past: https://gist.github.com/pagdot/3b39187c6e0ca18dedd1f1108338855f

The result was... OK. Not great, but better than nothing.
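
For reference, a minimal sketch of how exporting YouTube's auto-generated captions with youtube-dl's Python API could look (not necessarily how the gist does it; the video URL is a placeholder):

```python
# Sketch: downloading YouTube's auto-generated captions with youtube-dl,
# without downloading the video itself. The URL is a placeholder.
import youtube_dl

opts = {
    "skip_download": True,       # captions only, no video
    "writeautomaticsub": True,   # auto-generated subtitles
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
}

with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])
```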

In my Google Colab notebook, I also found a test with DeepSpeech by Mozilla.

If anyone is interested in also taking a look, Google Colab is a great way to test things on a big GPU provided by Google, and there are often example projects for Colab, either from the projects themselves or from the community.

Either way, a platform to run the transcription on in production would be required, and maybe even a way to contribute to its quality.
I could imagine pushing the results to this or another git repository, so that the community can make PRs with fixes.
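
As a rough idea of what could be pushed to such a repository, a minimal sketch that writes Whisper-style segments to an SRT file the community could then fix via PRs (the segment shape and file names are assumptions):

```python
# Sketch: writing Whisper-style segments to an .srt file that could be
# committed to a git repository and corrected by the community via PRs.
# Assumes segments shaped like Whisper's output: {"start", "end", "text"}.

def to_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")

# Example usage with the Whisper result from the sketch above
# (placeholder episode name):
# write_srt(result["segments"], "episode.srt")
```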

Edit:

2022-08-18: Fixed YouTube entry in table (sadly it has no punctuation); added entry for AssemblyAI


    Labels

    JB - need information, enhancement, in progress, low priority
