Kaldi bbc speakers segment #93
Conversation
Not working yet; needs a rethink of how to break words into segments.
💡 Idea: segment by speaker segment, then set an arbitrary word-count limit for the paragraph. If a speaker segment's paragraph exceeds that limit, it could be split into a new one(?), unless fewer than some x words remain (e.g. you wouldn't want just two words in the next paragraph). |
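A minimal sketch of the word-count idea above, not the code in this PR. The function name `splitByWordCount` and the parameters `maxWords` and `minTailWords` are hypothetical, and it assumes the words of one speaker segment are already collected in an array. A tail shorter than `minTailWords` is folded back into the previous paragraph, so that paragraph can slightly exceed `maxWords`.

```javascript
// Hypothetical helper (not part of the repository): split one speaker
// segment's words into paragraphs of at most `maxWords` words, folding a
// short tail (fewer than `minTailWords` words) into the previous paragraph.
function splitByWordCount(words, maxWords, minTailWords) {
  const paragraphs = [];
  for (let i = 0; i < words.length; i += maxWords) {
    paragraphs.push(words.slice(i, i + maxWords));
  }
  const last = paragraphs[paragraphs.length - 1];
  // Only merge when there is a previous paragraph to merge into.
  if (paragraphs.length > 1 && last.length < minTailWords) {
    paragraphs.pop();
    paragraphs[paragraphs.length - 1].push(...last);
  }
  return paragraphs;
}
```

The total number of words is unchanged by the split, which makes the result easy to check against the input.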
counts if word count is preserved
To verify the algorithm is working as expected, I'm doing the test manually for now, but adding these notes in case they give ideas 💡 on how to do it programmatically. Added an end time attribute to the segments to make comparison more straightforward, using segments.map((seg)=>{
seg.end = seg.start+seg.duration;
return seg;
}) And got [ { '@type': 'Segment',
start: 0,
duration: 2.74,
bandwidth: 'S',
speaker: { '@id': 'S0', gender: 'F' },
end: 2.74 },
{ '@type': 'Segment',
start: 9.1,
duration: 3.91,
bandwidth: 'S',
speaker: { '@id': 'S1', gender: 'F' },
end: 13.01 },
{ '@type': 'Segment',
start: 13.01,
duration: 6.75,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 19.759999999999998 },
{ '@type': 'Segment',
start: 19.76,
duration: 1.95,
bandwidth: 'S',
speaker: { '@id': 'S22', gender: 'F' },
end: 21.71 },
{ '@type': 'Segment',
start: 21.71,
duration: 2.63,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 24.34 },
{ '@type': 'Segment',
start: 24.41,
duration: 19.28,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 43.69 },
{ '@type': 'Segment',
start: 46.72,
duration: 7.22,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 53.94 },
{ '@type': 'Segment',
start: 55.05,
duration: 9.31,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 64.36 },
{ '@type': 'Segment',
start: 64.61,
duration: 2.03,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 66.64 },
{ '@type': 'Segment',
start: 67.5,
duration: 1.59,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 69.09 },
{ '@type': 'Segment',
start: 69.75,
duration: 8.71,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 78.46000000000001 },
{ '@type': 'Segment',
start: 78.74,
duration: 2.9,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 81.64 },
{ '@type': 'Segment',
start: 81.79,
duration: 4.25,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 86.04 },
{ '@type': 'Segment',
start: 86.43,
duration: 5.68,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 92.11000000000001 },
{ '@type': 'Segment',
start: 94.79,
duration: 6.42,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 101.21000000000001 },
{ '@type': 'Segment',
start: 101.26,
duration: 3.71,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 104.97 },
{ '@type': 'Segment',
start: 106.17,
duration: 4,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 110.17 },
{ '@type': 'Segment',
start: 110.53,
duration: 4.16,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 114.69 },
{ '@type': 'Segment',
start: 115.97,
duration: 2.11,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 118.08 },
{ '@type': 'Segment',
start: 118.22,
duration: 10.48,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 128.7 },
{ '@type': 'Segment',
start: 128.74,
duration: 8.31,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 137.05 },
{ '@type': 'Segment',
start: 139.01,
duration: 19.69,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 158.7 },
{ '@type': 'Segment',
start: 158.7,
duration: 12.29,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 170.98999999999998 },
{ '@type': 'Segment',
start: 174.73,
duration: 8.29,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 183.01999999999998 },
{ '@type': 'Segment',
start: 183.02,
duration: 19.49,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 202.51000000000002 },
{ '@type': 'Segment',
start: 202.97,
duration: 10.4,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 213.37 },
{ '@type': 'Segment',
start: 214.79,
duration: 18.41,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 233.2 },
{ '@type': 'Segment',
start: 233.39,
duration: 12.76,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 246.14999999999998 },
{ '@type': 'Segment',
start: 246.15,
duration: 12.94,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 259.09000000000003 },
{ '@type': 'Segment',
start: 260.42,
duration: 11.37,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 271.79 },
{ '@type': 'Segment',
start: 274.26,
duration: 15.51,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 289.77 },
{ '@type': 'Segment',
start: 289.77,
duration: 7.44,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 297.21 },
{ '@type': 'Segment',
start: 297.38,
duration: 18.04,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 315.42 },
{ '@type': 'Segment',
start: 315.42,
duration: 4.57,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 319.99 },
{ '@type': 'Segment',
start: 322.8,
duration: 4.09,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 326.89 },
{ '@type': 'Segment',
start: 327.38,
duration: 8.83,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 336.21 },
{ '@type': 'Segment',
start: 337.13,
duration: 19.94,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 357.07 },
{ '@type': 'Segment',
start: 358.3,
duration: 5.56,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 363.86 },
{ '@type': 'Segment',
start: 363.93,
duration: 12.64,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 376.57 },
{ '@type': 'Segment',
start: 377.22,
duration: 14.99,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 392.21000000000004 },
{ '@type': 'Segment',
start: 392.21,
duration: 7.62,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 399.83 },
{ '@type': 'Segment',
start: 404.84,
duration: 4.64,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 409.47999999999996 },
{ '@type': 'Segment',
start: 410.67,
duration: 16.53,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 427.20000000000005 },
{ '@type': 'Segment',
start: 427.21,
duration: 8.23,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 435.44 },
{ '@type': 'Segment',
start: 435.44,
duration: 11.12,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 446.56 },
{ '@type': 'Segment',
start: 446.58,
duration: 2.76,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 449.34 },
{ '@type': 'Segment',
start: 449.52,
duration: 12.38,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 461.9 },
{ '@type': 'Segment',
start: 462.32,
duration: 5.45,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 467.77 },
{ '@type': 'Segment',
start: 468.66,
duration: 13.28,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 481.94 },
{ '@type': 'Segment',
start: 482.06,
duration: 3.71,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 485.77 },
{ '@type': 'Segment',
start: 485.91,
duration: 2.01,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 487.92 },
{ '@type': 'Segment',
start: 488.47,
duration: 5.28,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 493.75 },
{ '@type': 'Segment',
start: 494.13,
duration: 15.43,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 509.56 },
{ '@type': 'Segment',
start: 509.85,
duration: 5.05,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 514.9 },
{ '@type': 'Segment',
start: 515.42,
duration: 9.01,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 524.43 },
{ '@type': 'Segment',
start: 525.43,
duration: 10.46,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 535.89 },
{ '@type': 'Segment',
start: 536.18,
duration: 5.89,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 542.0699999999999 },
{ '@type': 'Segment',
start: 542.75,
duration: 3.82,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 546.57 },
{ '@type': 'Segment',
start: 546.57,
duration: 19.63,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 566.2 },
{ '@type': 'Segment',
start: 566.2,
duration: 6.61,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 572.8100000000001 },
{ '@type': 'Segment',
start: 572.81,
duration: 11.85,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 584.66 },
{ '@type': 'Segment',
start: 585.47,
duration: 10.72,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 596.19 },
{ '@type': 'Segment',
start: 596.53,
duration: 5.69,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 602.22 },
{ '@type': 'Segment',
start: 602.97,
duration: 6.28,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 609.25 },
{ '@type': 'Segment',
start: 610.26,
duration: 13.9,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 624.16 },
{ '@type': 'Segment',
start: 625.43,
duration: 6.44,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 631.87 },
{ '@type': 'Segment',
start: 632.46,
duration: 2.16,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 634.62 },
{ '@type': 'Segment',
start: 635.07,
duration: 2.76,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 637.83 },
{ '@type': 'Segment',
start: 642.38,
duration: 13.65,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 656.03 },
{ '@type': 'Segment',
start: 656.03,
duration: 3.54,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 659.5699999999999 },
{ '@type': 'Segment',
start: 659.87,
duration: 3.57,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 663.44 },
{ '@type': 'Segment',
start: 664.65,
duration: 3.09,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 667.74 },
{ '@type': 'Segment',
start: 668.68,
duration: 3.96,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 672.64 },
{ '@type': 'Segment',
start: 674.01,
duration: 10.48,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 684.49 },
{ '@type': 'Segment',
start: 684.49,
duration: 17.09,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 701.58 },
{ '@type': 'Segment',
start: 702.42,
duration: 2.1,
bandwidth: 'S',
speaker: { '@id': 'S10', gender: 'F' },
end: 704.52 },
{ '@type': 'Segment',
start: 706.32,
duration: 4.03,
bandwidth: 'S',
speaker: { '@id': 'S1', gender: 'F' },
end: 710.35 } ] interpolating with words can establish
|
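Several of the computed end values in the output above carry floating-point noise (e.g. 19.759999999999998 where the next segment starts at 19.76), which makes comparing a segment's end with the next segment's start harder. A possible variation, sketched here under the assumption that two decimal places are enough for these timings, rounds the sum and returns copies instead of mutating the segments. The name `addEndTimes` and the `decimals` parameter are hypothetical.

```javascript
// Hypothetical variation on the segments.map snippet: returns new segment
// objects with a rounded `end`, leaving the input segments untouched.
function addEndTimes(segments, decimals = 2) {
  const factor = 10 ** decimals;
  return segments.map((seg) => ({
    ...seg,
    // Round to `decimals` places to drop floating-point noise.
    end: Math.round((seg.start + seg.duration) * factor) / factor,
  }));
}

const example = addEndTimes([{ start: 13.01, duration: 6.75 }]);
console.log(example[0].end); // 19.76
```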
import groupWordsInParagraphsBySpeakers from "./src/lib/Util/adapters/bbc-kaldi/group-words-by-speakers"; { '@type': 'Segment',
start: 0,
duration: 2.74,
bandwidth: 'S',
speaker: { '@id': 'S0', gender: 'F' },
end: 2.74 }, No words { '@type': 'Segment',
start: 9.1,
duration: 3.91,
bandwidth: 'S',
speaker: { '@id': 'S1', gender: 'F' },
end: 13.01 }, No words { '@type': 'Segment',
start: 13.01,
duration: 6.75,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 19.759999999999998 },
{ '@type': 'Segment',
start: 19.76,
duration: 1.95,
bandwidth: 'S',
speaker: { '@id': 'S22', gender: 'F' },
end: 21.71 }, No words { '@type': 'Segment',
start: 21.71,
duration: 2.63,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 24.34 },
Orphan word: That { '@type': 'Segment',
start: 24.41,
duration: 19.28,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 43.69 },
Orphan word:
{ '@type': 'Segment',
start: 46.72,
duration: 7.22,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 53.94 },
... got a little tedious, so stopping here for now, but so far it seems to be holding up fine. |
Surprised to see orphan words, though. An example:
{
"start": 24.2,
"confidence": 1,
"end": 24.6,
"word": "that",
"punct": "that",
"index": 28
}, Looking at: segment before { '@type': 'Segment',
start: 21.71,
duration: 2.63,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 24.34 }, Segment after { '@type': 'Segment',
start: 24.41,
duration: 19.28,
bandwidth: 'S',
speaker: { '@id': 'S12', gender: 'F' },
end: 43.69 },
The problem might be with the segmentation: the word (24.2 to 24.6) seems to start in the segment before, which ends at 24.34, and end in the segment after, which starts at 24.41. |
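One way to handle a word that straddles two segments, sketched here as an assumption rather than what this PR does, is to assign it to the segment it overlaps most in time. The names `overlapSeconds` and `findSegmentIndexForWord` are hypothetical. Segments are expected to carry the `end` attribute added earlier.

```javascript
// Hypothetical helpers (not part of the repository).
// Length in seconds of the time shared by a word and a segment.
function overlapSeconds(word, seg) {
  return Math.max(0, Math.min(word.end, seg.end) - Math.max(word.start, seg.start));
}

// Index of the segment with the largest overlap, or -1 if the word
// overlaps no segment at all (e.g. it falls entirely within a gap).
function findSegmentIndexForWord(word, segments) {
  let bestIndex = -1;
  let bestOverlap = 0;
  segments.forEach((seg, i) => {
    const o = overlapSeconds(word, seg);
    if (o > bestOverlap) {
      bestIndex = i;
      bestOverlap = o;
    }
  });
  return bestIndex;
}
```

For the orphan word above, the overlap is roughly 0.14 s with the earlier segment and 0.19 s with the later one, so this rule would place it in the later segment. Both belong to speaker S12, so the speaker label is the same either way.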
segmentation by previous php algo
|
Ready for review. Other things to consider, for this or for another PR?
|
Turns out it is an artifact of the system
|
Another option, used by @MathieuLoutre in another project, if there is punctuation, is
🙌 |
Needs a bit of a cleanup - there seems to be a few unfinished thoughts in here that maybe you'd want to address?
Other than that, linting and some code-writing could be beneficial.
fixed eslint conflict
done changes, and updated with master, ready to merge. |
Making a note: outstanding for a separate PR is that, within a speaker segment, anything over a minute in length could be split into a new paragraph (with the same speaker) to avoid very long paragraphs (and to keep draftJs performance up, as long paragraphs are less performant with the virtual DOM). TBC, out of scope for this PR. |
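A rough sketch of that outstanding idea, assuming each word has `start` and `end` times in seconds as in the transcript JSON. The function name `splitByDuration` and the `maxSeconds` parameter are hypothetical, and the one-minute default comes from the note above rather than from any tested threshold.

```javascript
// Hypothetical helper (not part of the repository): split one speaker's
// words into paragraphs that each span at most `maxSeconds` of audio.
function splitByDuration(words, maxSeconds = 60) {
  const paragraphs = [];
  let current = [];
  for (const word of words) {
    // Start a new paragraph if adding this word would exceed the limit.
    if (current.length > 0 && word.end - current[0].start > maxSeconds) {
      paragraphs.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) paragraphs.push(current);
  return paragraphs;
}
```

The caller would attach the same speaker to every resulting paragraph. A single word longer than `maxSeconds` still ends up in its own paragraph rather than being dropped.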
I won't hit request changes but can you put the lint rule back in and perhaps change the naming of that function if you agree?
Think the logic could be refactored to be more understandable to others, but I don't want to block. This is a common thing with the PRs 🙃 We need some dedicated, effective refactor time soon.
also, this file: the file naming is absolutely nuts. really need to address it cause it's getting out of hand 😓 |
cleaned out some of the transcript json example files. |
as they are now contained in the adapters
Is your Pull Request related to another issue in this repository?
Optional Speaker segmentation for Kaldi #80
Describe what the PR does
State whether the PR is ready for review or whether it needs extra work
-->Do not merge, still working on it, + a few things to discuss/decide<--
Additional context
Some considerations
TL;DR
It's good to have paragraphs, as @Laurian pointed out, for performance reasons. E.g. in the worst-case scenario, where paragraphs only indicate a speaker change and there is only one speaker, you would be editing all the text as one paragraph, and draftJs performance would struggle on less powerful machines (because of React and the virtual DOM).
There should be paragraphs within the same continuous speaker "segment". I believe Kaldi might identify continuous segments as distinct even if they belong to the same speaker. See the segmentation json in the project repo.
To accommodate this, the parsing of segments should be done differently, with each continuous segment treated as its own paragraph.
Although it would be good to know the logic Kaldi uses to identify the segments as being distinct (pauses? tone change?).
Used wordcounter.net on the text in the current demo, with speaker and timecode info switched off (settings toggles), to check the parsing was producing the right amount of text. (At the moment it is missing the last paragraph.)
Re programmatically identifying speaker boundaries, see the paper Automatic Paragraph Identification: A Study across Languages and Domains.
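The manual word-count check described in the considerations above could also be done in code. This is only a sketch: it assumes each paragraph produced by the parsing is an object with a `words` array, which is an assumed shape and not necessarily what the adapter returns. The function names are hypothetical.

```javascript
// Hypothetical check (not part of the repository). Assumes each paragraph
// has the shape { speaker, words: [...] }.
function countWordsInParagraphs(paragraphs) {
  return paragraphs.reduce((total, paragraph) => total + paragraph.words.length, 0);
}

// True when no word was lost or duplicated while grouping into paragraphs.
function isWordCountPreserved(words, paragraphs) {
  return words.length === countWordsInParagraphs(paragraphs);
}
```

A check like this would flag the missing last paragraph mentioned above, since the paragraph total would come out lower than the number of input words.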
Outstanding