
whisper-base_timestamped broken with chunk_length_s=30 #1358

@jozefchutka


System Info

transformers.js: 3.6.1

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Using chunk_length_s=30 with onnx-community/whisper-base_timestamped produces broken timestamps.

Reproduction

Run the following code with the attached src.pcm (in the .zip) and check the output in the console log:

<script type="module">
const { env, pipeline } = await import("https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.6.1/dist/transformers.min.js");
env.allowLocalModels = false;

const buffer = await (await fetch("src.pcm")).arrayBuffer();
const audio = new Float32Array(buffer);

const pipe = await pipeline("automatic-speech-recognition",
	"onnx-community/whisper-base_timestamped",
	{dtype:{encoder_model:"fp32", decoder_model_merged:"q4"},
	device:"webgpu"});

const result = await pipe(audio, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: "word",
	language: "en"});

console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]} ${chunk.text}`))
</script>

It prints:

"29.98 -> 29.98  every",
"29.98 -> 29.98  day",
"29.98 -> 29.98  style."

The timestamps are invalid, and the audio contains far more speech than these three words.
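For anyone trying to detect this symptom programmatically, here is a minimal sketch. `summarizeTimestamps` is a hypothetical helper, not part of transformers.js; it only assumes the `{ timestamp: [start, end], text }` chunk shape shown in the output above. Treating "every word has zero duration" as the failure signal is my own assumption.

```javascript
// Hypothetical helper (not part of transformers.js): flags word chunks whose
// timestamps carry no usable timing information.
function summarizeTimestamps(chunks) {
  // A chunk is "zero-length" when its start and end timestamps are equal.
  const zeroLength = chunks.filter(({ timestamp: [start, end] }) => start === end);
  // Word start times should never move backwards in a sane transcript.
  const monotonic = chunks.every(
    (chunk, i) => i === 0 || chunk.timestamp[0] >= chunks[i - 1].timestamp[0]
  );
  return {
    total: chunks.length,
    zeroLength: zeroLength.length,
    monotonic,
    // Assumption: call the result degenerate when every word has zero duration.
    degenerate: chunks.length > 0 && zeroLength.length === chunks.length,
  };
}

// The three chunks printed with chunk_length_s: 30 in this report.
const broken = [
  { timestamp: [29.98, 29.98], text: " every" },
  { timestamp: [29.98, 29.98], text: " day" },
  { timestamp: [29.98, 29.98], text: " style." },
];
console.log(summarizeTimestamps(broken));
```

For the broken output above, all three chunks are zero-length, so `degenerate` comes back `true`. Passing `result.chunks` from the reproduction would give the same summary for a real run.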

Changing chunk_length_s to 29 fixes the issue and produces reasonably valid output:

"0 -> 0.42  everyday",
"0.42 -> 0.86  style.",
"1.38 -> 1.56  - True",
"1.56 -> 2  classic",
"2 -> 2.5  delivers",
"2.5 -> 3.02  premium",
"3.02 -> 3.54  essentials",
"3.54 -> 3.84  built",
"3.84 -> 4.08  for",
"4.08 -> 4.42  real",
"4.42 -> 4.9  life.",
"5.4 -> 5.64  Grab",
"5.64 -> 6.04  yours",
"6.04 -> 6.36  at",
"6.36 -> 6.86  Target,",
"7.3 -> 7.78  Costco,",
"8.28 -> 8.3  or",
"8.3 -> 8.5  head",
"8.5 -> 8.7  to",
"8.7 -> 9.34  TrueClassic",
"9.34 -> 10.04 .com",
"10.04 -> 12.28 /p4p.",
"12.86 -> 13.08  Get",
"13.08 -> 13.38  hooked",
"13.38 -> 13.52  up",
"13.52 -> 13.94  today.",
"14.16 -> 14.24  Now",
"14.24 -> 14.46  before",
"14.46 -> 14.62  we",
"14.62 -> 14.82  go,",
"15.1 -> 15.24  just",
"15.24 -> 15.42  wanna",
"15.42 -> 15.56  give",
"15.56 -> 15.68  a",
"15.68 -> 15.86  big",
"15.86 -> 16.1  shout",
"16.1 -> 16.28  out",
"16.28 -> 16.76  to",
"16.76 -> 16.9  the",
"16.9 -> 17.52  CEO",
"17.52 -> 17.86  and",
"17.86 -> 18.32  founder,",
"18.48 -> 18.6  Ryan",
"18.6 -> 18.92  Frouder,",
"18.98 -> 19.06  for",
"19.06 -> 19.22  coming",
"19.22 -> 19.36  on",
"19.36 -> 19.5  our",
"19.5 -> 19.76  show",
"19.76 -> 20.4  and",
"20.4 -> 20.6  just",
"20.6 -> 20.86  showing",
"20.86 -> 21.08  some",
"21.08 -> 21.28  love.",
"21.46 -> 21.62  Now,",
"21.9 -> 22.36  let's",
"22.36 -> 22.46  get",
"22.46 -> 22.7  back",
"22.7 -> 23.06  to",
"23.06 -> 23.22  the",
"23.22 -> 23.54  episode",
"24.32 -> 24.44  I",
"24.44 -> 24.6  mean",
"24.6 -> 25.4  like",
"25.4 -> 25.52  I",
"25.52 -> 25.68  said",
"25.68 -> 26.12  we're",
"26.12 -> 26.34  going",
"26.34 -> 26.58  through",
"26.58 -> 26.82  that",
"26.82 -> 27.34  we're",
"27.34 -> 27.56  losing",
"27.56 -> 28.02  stars",
"28.02 -> 29.26  and",
"29.26 -> 29.38  then",
"29.38 -> 29.56  we",
"29.56 -> 29.84  kind",
"29.84 -> 29.98  of"

Why is 30 broken in this case? Is 29 safer in all cases, or is it just a coincidence?
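To narrow down whether 29 is generally safe or only works for this file, one option is to sweep several chunk lengths over the same audio and compare the results. This is only a sketch: `compareChunkLengths` is a hypothetical helper, and `transcribe` is an assumed caller-supplied function that wraps the `pipe(...)` call from the reproduction.

```javascript
// Hypothetical helper: runs the same audio through several chunk lengths and
// reports, for each one, how many words came back and how many of them have
// zero duration (start === end).
//
// `transcribe` is an assumed caller-supplied async function, for example:
//   (chunk_length_s) => pipe(audio, {
//     chunk_length_s,
//     stride_length_s: 5,
//     return_timestamps: "word",
//     language: "en",
//   })
async function compareChunkLengths(transcribe, chunkLengths) {
  const report = [];
  // Run sequentially so only one inference is in flight at a time.
  for (const chunk_length_s of chunkLengths) {
    const { chunks } = await transcribe(chunk_length_s);
    const zeroLength = chunks.filter((c) => c.timestamp[0] === c.timestamp[1]).length;
    report.push({ chunk_length_s, words: chunks.length, zeroLength });
  }
  return report;
}
```

Inside the module script from the reproduction, this could be called as `console.table(await compareChunkLengths(transcribe, [25, 28, 29, 30]))`. The values in that list are arbitrary picks for illustration.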
