Skip to content

Conversation

bobqianic
Copy link
Collaborator

@bobqianic bobqianic commented Jan 14, 2024

  • Basic functionality
  • Rewrite whisper_wrap_segment
  • Rewrite L5717-L5805
  • Remove print_realtime This is too tricky
  • Remove hallucination by using token_nosp
  • Heuristic hallucination detection (Basic implementation)
  • Disable beam search when $temperature>0$
  • Fix tokenizer
  • Fix audio feature seeking mechanism
  • Use compression ratio instead of entropy Will be addressed in separate PRs
  • Code cleanup

@bobqianic bobqianic added the decoding Decoding related issues label Jan 17, 2024
@bobqianic bobqianic linked an issue Jan 17, 2024 that may be closed by this pull request
@thewh1teagle
Copy link
Contributor

What's the status of this PR? is it safe to use?
I experience decoding issues
thewh1teagle/vibe#34

@jwijffels
Copy link
Contributor

jwijffels commented Apr 5, 2024

I'm thinking about including this pull request in the R wrapper at audio.whisper . There the current approach to handle some of the hallucinations is to use R packages audio.vadwebrtc or audio.vadsilero to detect silences or general non-voiced signals and either

  • instead of looping over different files in the main loop, loop over the detected non-silence sections in the audio.
  • or create a new audio file with only the voiced audio and recompute the timestamps later on by adding what was left out

I haven't looked into the extreme details on this pull request (only skimmed through the logic which was changed in main.cpp and whisper.cpp) but would it make sense already to incorporate this pull request in audio.whisper or are there a lot of changes to be expected here or is this pull request going to be split into a BPE change (#1854) and a change regarding how to handle non-speech?

@ronyfadel
Copy link

@bobqianic are you pursuing this at the moment?

@bobqianic
Copy link
Collaborator Author

@bobqianic are you pursuing this at the moment?

No, at least not in May. I'm really tied up with a lot of things this month.

@bygreencn
Copy link

I'm thinking about including this pull request in the R wrapper at audio.whisper . There the current approach to handle some of the hallucinations is to use R packages audio.vadwebrtc or audio.vadsilero to detect silences or general non-voiced signals and either

  • instead of looping over different files in the main loop, loop over the detected non-silence sections in the audio.
  • or create a new audio file with only the voiced audio and recompute the timestamps later on by adding what was left out

I haven't looked into the extreme details on this pull request (only skimmed through the logic which was changed in main.cpp and whisper.cpp) but would it make sense already to incorporate this pull request in audio.whisper or are there a lot of changes to be expected here or is this pull request going to be split into a BPE change (#1854) and a change regarding how to handle non-speech?

The best way to include Silero Voice Activity into whisper.cpp is to add thirdparty package of onnxruntime1.12.1 dll, then call silero onnx model. My branch had added it. Even VAD, the hallucinations on silent intervals is also happenning.

@IntendedConsequence
Copy link

The best way to include Silero Voice Activity into whisper.cpp is to add thirdparty package of onnxruntime1.12.1 dll, then call silero onnx model. My branch had added it. Even VAD, the hallucinations on silent intervals is also happenning.

I recommend considering a previous Silero VAD version, namely v3.1. The current version v4 (at the moment of writing) often hallucinates speech on lengthy chunks of silent or near-silent audio segments.
snakers4/silero-vad#369
snakers4/silero-vad#396

But you have to add a heavyweight dependency like onnxruntime just to run a 750KB model. The smallest size I could possibly reduce onnxruntime.dll to was about 2.2MB, which is still 3x the size of silero weights, and requires a lengthy custom build of onnxruntime from source with reduced operator set configs and other size reduction options. And prebuilt redistributables are easily 5-9 MB or more.

I have a working Silero v3.1 implementation in pure C, but as much as I would like to suggest it as an option, the code is quite bad, I wrote it as a personal project for learning low level neural nets.

@ziegenberg
Copy link
Contributor

@bobqianic, Could you rebase your changes? I'd like to test those improvements of yours with production data on our setup.

@bobqianic
Copy link
Collaborator Author

@ziegenberg I did some testing, and it LGTM. If the CI is mostly green, you can proceed with your testing now.

@ziegenberg
Copy link
Contributor

I already did some testing and fixed some of the errors on my own. Looks promising. I see less hallucinations, but I need to do some more statistics. I will switch to your branch for the next tests.

Is your PR #1854 also related to this improvement?

@bobqianic
Copy link
Collaborator Author

Is your PR #1854 also related to this improvement?

PR #1854 is a subset of this PR, meaning this PR includes everything in PR #1854.

@ziegenberg
Copy link
Contributor

What data/statistics would you need from my side to consider this PR validated and get it merged?

@bobqianic
Copy link
Collaborator Author

What data/statistics would you need from my side to consider this PR validated and get it merged?

Thank you. If you have the ground truth text, please calculate the WER.

@Makememo
Copy link

I tested the output of anime using medium.en and found a problem with time axis recognition in the middle.

file: https://dropover.cloud/f7

2024-06-27 09 20 40 e020

@ziegenberg
Copy link
Contributor

Hi @Makememo,
was this a singular incident or does this happen regularly?

@Makememo
Copy link

Hi @Makememo,

was this a singular incident or does this happen regularly?

I tested three videos and I had this problem.

The common feature of these videos is that there is a music section that begins to mess up the timeline.

@ronyfadel
Copy link

@ziegenberg @bobqianic just wanted to check in on this: is it still needed? Will it land any time soon?

@ziegenberg
Copy link
Contributor

We now extensively tested this patch.

Summary

We see fewer hallucinations and an overall improvement in accuracy.

Details

Hallucinations still happen if there is a period of time with no spoken words or music without words. If this period of time is longer than 30 seconds, it completely messes up the next couple of minutes, and it hallucinates widely. We "fixed" this by processing the input with Silero VAD first and letting whisper.cpp only analyze the parts where speech was recognized using the --offset-t N and --duration N options. This works astonishingly well!

We have no real statistics to show as we have mostly lecture recordings with heavy use of German and Austrian German dialects. This makes generating a validated transcription very difficult. Whisper.cpp mostly corrects the grammar mistakes from the lecturers, which would result in a higher Word Error Rate, but in reality, the transcription is better.

Conclusion

We will use this patch in production from now on. In my opinion, this can be merged.

@itsthisjustin
Copy link

Will test on our end now too. Anything "special" that needs to be done in order to test this using the Swift package? @ziegenberg ?

@ziegenberg
Copy link
Contributor

I have no experience with Swift, sorry.

@ziegenberg
Copy link
Contributor

Hi @bobqianic, would you be willing to rebase this once more?

@ziegenberg
Copy link
Contributor

Hi @ggerganov,

Is this being merged? Or do the changes now differ too much for this PR to be of any use?

Greetings,
Daniel

@ggerganov
Copy link
Member

@ziegenberg It has diverged too much. No plans to merge.

@ziegenberg
Copy link
Contributor

Thx @ggerganov for the prompt response! Much appreciated.

As we are currently running on a year-old version of whisper.cpp with this patch included, is there anything in this patch you would consider worthwhile to bring forward to the current state of whisper.cpp? If yes, I'd start looking for capable resources to update this patch series.

Greetings,
Daniel

@ggerganov
Copy link
Member

Not sure. Maybe there is some value in the tokenizer changes, but this is mostly for non-english languages so I'm not really sure how to test and verify this.

Overall, I think latest version is safe to use in production even without these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

decoding Decoding related issues research🔬

Projects

None yet