Skip to content

Stt adapter awstranscribe #110

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Gribbs
Copy link

@Gribbs Gribbs commented Mar 13, 2019

Is your Pull Request request related to another issue in this repository ?

Yes. Original issue #108

Describe what the PR does

The PR add an Adapter to support Amazon Transcribe https://aws.amazon.com/transcribe/

State whether the PR is ready for review or whether it needs extra work

Ready to review for merging.

Additional context
The PR doesn't cater for Speakers at this point (although the Transcribe service does support Speakers).
I haven't done an example-usage.js file in there or or .test.js but I do have an example Transcribe json file in there.

Steps to test the changes:

yarn start or npm start. I have added the option 'Amazon Transcribe' in src/select-stt-json-type.js

@pietrop
Copy link
Contributor

pietrop commented Mar 13, 2019

Awesome, thanks @Gribbs ,
I'll find some time to have a look at this by the of the week.

for now I'd say:

  • ok to leave speaker identification for later stage - eg as separate PR.
  • You can probably considerexample-usage.js optional, as that can be added later.
  • But would you be able to add the.test.js test file so that the Travis CI doesn't fail? Let me know if you need any help with that.

cc @chrishutchinson

@chrishutchinson
Copy link

Hey @Gribbs - I'll take a skim through the implementation too, but having run it locally with a test video / transcript from AWS, it seems to be working really well, nice one!

I don't seem to be able to download the resultant edited transcript, but I'm guessing that's either a bug with Firefox, or something else, not specific to this code change. cc @pietrop are you aware of anything? I'll do further digging if not and file a separate issue.

@pietrop
Copy link
Contributor

pietrop commented Mar 13, 2019

Thanks @chrishutchinson ,

haven't had a chance to try it in the PR yet but I did a quick test in the demo app https://bbc.github.io/react-transcript-editor (load demo)

I am able to download draftJs and plain text of current demo in Chrome but not in firefox.

Firefox console says

TypeError: t.refs.timedTextEditor is undefined

which is something to do with getEditorContent in the example/demo app.

Which at the moment is implemented this way

  getEditorContent = (exportFormat) => {
    return this.refs.timedTextEditor.getEditorContent(exportFormat);
  }

But could be refactored as a callback? and not user refs - which is not recommended by React and see if it also fixes it in Firefox? - this could be a separate PR.

Do you get the same error? (I am aware the demo is a different version number compared to master, and compared to the PR)

@pietrop
Copy link
Contributor

pietrop commented Mar 15, 2019

Thanks for this PR @Gribbs!

generateEntitiesRanges
Glad it worked out to use the shared helper function

Test
The amazon transcribe test fails, because the test file amazonTranscribe.sample.js needs this setup step.

In src/lib/Util/adapters/amazon-transcribe/sample/amazonTranscribe.sample.js all the entity ranges are as

"entityRanges": [{
		"start": 13.03,
		"end": 13.22,
		"confidence": 1,
		"text": "There",
		"offset": 0,
		"length": 5,
		"key": "ico58i6"
	}, {

and because the key is randomly generated it cannot be deterministically tested by jest.
So instead the attribute should be just expecting to be a string

"key": expect.any(String)

As explained in the guide, you could do a find and replace in your code editor, if it supports regex for "key": "([a-zA-Z0-9]*)" and replace with the above. if that makes sense?

you should be able to see detailed Travis CI log of the test failing here

Your tests should then pass.

Console errors

I get this error in the console when adding a AWS transcript

index.js:1446 Warning: Received NaN for the `data-start` attribute. If this is expected, cast the value to a string.
    in span (at Word.js:36)

as a quick fix might be enough to do some type casting and add a .toString() to src/lib/TranscriptEditor/TimedTextEditor/Word.js at line 37-38

  <span
        data-start={ data.start.toString() }
        data-end={ data.end.toString() }

Adding transcript first and then media - error
If i test by adding adding media first and then transcript, it works fine but wen adding transcript first and then media it crashes and gives some errors

Uncaught TypeError: Cannot read property 'isPresentInLocalStorage' of null
    at TranscriptEditor._this.ifPresentRetrieveTranscriptFromLocalStorage 

Altho to be fair, might not be specific of this branch - just tested on master and got the same.
Altho it works fine in current demo page - which is a few versions behind, so might need to do some investigating.

I think it might be connected with issue #87 and PR #88 and somehow most likely the bug might have been re-introduced since.

But since it's effecting master as well, I wouldn't necessary aim to figure this out as part of this PR. can be a separate issue/PR.

example usage
You could add this in example usage

import amazonTranscribeToDraft from'./index';
import amazonTranscribeTedTalkTranscript from './sample/autoEdit2TedTalkTranscript.sample.json';

console.log(amazonTranscribeToDraft(amazonTranscribeTedTalkTranscript));

eslint
It be good to re-introduce ESLint in the dev dependencies. did you followed the instructions you got with the error that you copy and pasted up there? do you have a copy of ESLint installed globally?


TL;DR:

would you be able to:

  • fix the test
  • add the example usage
  • restore the the linting

The other errors, since they are across the master branch can look into it with @jamesdools

@Gribbs
Copy link
Author

Gribbs commented Mar 15, 2019

Thanks @pietrop pietrop. Ive added the eslint back in to dev dependencies.

I also was getting sporadic "Received NaN for the data-start attribute" too and couldn't work out what was causing it. I did notice the json returned from amazon transcribe provides the numbers wrapped in quotes like:

"start_time": "13.22"

compared to bbck-kaldi for example, which looks like:

"start": 13.44

and thought this might've been causing it originally even though I was casting to a number. I've had to do some awkward processing with punctuation start and end times since there is no start and end time provided by Transcribe. I've now added an addition .toFixed() method so my decimal places don't go on too long:

start: /punctuation/.test(currentWord.type) ? (parseFloat(previousWord.end_time) + 0.05).toFixed(2) : parseFloat(currentWord.start_time)

For now I'm not getting the error.

For the testing, I've added the "key": expect.any(String) to the amazonTranscribe.sample.js but I still can't get the test to pass. I've tried to debug it but I can't work out how to get any visibility into what is making the test fail (or run forever). I've tried:

yarn test index.test.js

and

yarn test-ci

and

yarn test-ci --verbose

But nothing seems to give me an indication what the problem is. If you can help with that it would be appreciated

@pietrop
Copy link
Contributor

pietrop commented Mar 15, 2019

Thank for the changes @Gribbs ,

eslint
ah, my bad, I think you were right, eslint doesn't need to be installed in create react app as apparently it's there under the hood (cc @jamesdools for some of the other PRs eg #115)

I had missed that in the top part error message - noticed it in the travis error message

There might be a problem with the project dependency tree.
It is likely not a bug in Create React App, but something you need to fix locally.
The react-scripts package provided by Create React App requires a dependency:
  "eslint": "5.12.0"
Don't try to install it manually: your package manager does it automatically.

tests
No worries, I can have a look into it and see what might be the problem. Then I reckon it'd be good to merge to master.

@Gribbs
Copy link
Author

Gribbs commented Mar 15, 2019

Alright sounds good! Thanks again for your help!

@pietrop pietrop changed the base branch from master to stt-adapter-awstranscribe-review March 15, 2019 21:30
@pietrop pietrop merged commit b22dfb0 into bbc:stt-adapter-awstranscribe-review Mar 15, 2019
@pietrop
Copy link
Contributor

pietrop commented Mar 15, 2019

I noticed a bug, seems like punctuation such as commas is added is as it's own word?

It might be more straightforward to append it to the previous word as it doesn't come with start and end time informations.

eg

{
        "alternatives": [
          {
            "confidence": null,
            "content": ","
          }
        ],
        "type": "punctuation"
      },

if double clicking the punctuation, eg , it crashes, as time is not associated to it?

index.js:79 Uncaught TypeError: Failed to set the 'currentTime' property on 'HTMLMediaElement': The provided double value is non-finite.

and just to confirm and I can see it get renders as with start and end time as NaN

<span data-start="NaN" data-end="NaN" data-confidence="high" data-prev-times="" data-entity-key="c0a6o9g" class="Word">
	<span data-offset-key="64k16-22-0"><span data-text="true">,</span>
</span>

What could be an easy fix for this @Gribbs ?

@Gribbs
Copy link
Author

Gribbs commented Mar 15, 2019

I thought I’d tested that. Maybe my last change with toFixed() has caused it? I was accounting for this by using the previous word end-time as the the next punctuation word start time + a tiny amount of time.

@Gribbs
Copy link
Author

Gribbs commented Mar 15, 2019

I’ll take another look. I’ve just left the house so will look at it again this evening

@pietrop
Copy link
Contributor

pietrop commented Mar 15, 2019

ok no worries, with the change below, commas are also included into having start and end time

const groupWordsInParagraphs = (words) => {
  const results = [];
  let paragraph = {
    words: [],
    text: []
  };
  words.forEach((word, index) => {
    // if word type is punctuation
    const content = word.alternatives[0].content;
    let previousWord = {};
    if (word.type === 'punctuation' && /[.?!]/.test(content)) {
      previousWord = words[index - 1]; //assuming here the very first word is never punctuation
      paragraph.words.push(normalizedWord(word, previousWord));
      paragraph.text.push(content);
      results.push(paragraph);
      // reset paragraph
      paragraph = {
        words: [],
        text: []
      };
    } else if (word.type === 'punctuation' && /[,?!]/.test(content)) {
      previousWord = words[index - 1]; //assuming here the very first word is never punctuation
      paragraph.words.push(normalizedWord(word, previousWord));
      paragraph.text.push(content);
    } else {
      paragraph.words.push(normalizedWord(word, previousWord));
      paragraph.text.push(content);
    }
  });

  return results;
};

altho it looks a little off, coz treating them like words there are then extra spaces before commas and full stops

Screen Shot 2019-03-15 at 22 04 10

so there might be a better way to do it, and append punctuation to the previous word.

@pietrop
Copy link
Contributor

pietrop commented Mar 15, 2019

I am going to have closer look next week

@Gribbs
Copy link
Author

Gribbs commented Mar 15, 2019

I think your right. The spaces around the punctuation have been annoying me too. I’ll work on appending it to the previous word. That should simplify the code a bit too

@Gribbs
Copy link
Author

Gribbs commented Mar 17, 2019

Hi @pietrop I've completed the work to append punctuation items to the previous word which seems to work well. I've pushed the changes to this branch. I've also added some additional tests in index.test.js file so it should pass ok. I'm skipping the original problematic test for now with describe.skip until I can work out what the issue with that test was.

@Gribbs
Copy link
Author

Gribbs commented Mar 17, 2019

oh! I see this pull request has now been closed. Would you like me to put it on separate pull request?

@pietrop
Copy link
Contributor

pietrop commented Mar 17, 2019

Hi @Gribbs, yes, was reviewing it in branch stt-adapter-awstranscribe-reviewbut since you made new changes feel free to setup a PR from your fork against that branch. if that works for you?

@Gribbs
Copy link
Author

Gribbs commented Mar 18, 2019

Hi @Gribbs, yes, was reviewing it in branch stt-adapter-awstranscribe-reviewbut since you made new changes feel free to setup a PR from your fork against that branch. if that works for you?

I've created PR #116 for this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants