Stt adapter awstranscribe #110

Gribbs · 2019-03-13T13:31:44Z

Is your Pull Request request related to another issue in this repository ?

Yes. Original issue #108

Describe what the PR does

The PR add an Adapter to support Amazon Transcribe https://aws.amazon.com/transcribe/

State whether the PR is ready for review or whether it needs extra work

Ready to review for merging.

Additional context
The PR doesn't cater for Speakers at this point (although the Transcribe service does support Speakers).
I haven't done an example-usage.js file in there or or .test.js but I do have an example Transcribe json file in there.

Steps to test the changes:

yarn start or npm start. I have added the option 'Amazon Transcribe' in src/select-stt-json-type.js

pietrop · 2019-03-13T14:52:47Z

Awesome, thanks @Gribbs ,
I'll find some time to have a look at this by the of the week.

for now I'd say:

ok to leave speaker identification for later stage - eg as separate PR.
You can probably considerexample-usage.js optional, as that can be added later.
But would you be able to add the.test.js test file so that the Travis CI doesn't fail? Let me know if you need any help with that.

cc @chrishutchinson

package.json

chrishutchinson · 2019-03-13T16:30:15Z

Hey @Gribbs - I'll take a skim through the implementation too, but having run it locally with a test video / transcript from AWS, it seems to be working really well, nice one!

I don't seem to be able to download the resultant edited transcript, but I'm guessing that's either a bug with Firefox, or something else, not specific to this code change. cc @pietrop are you aware of anything? I'll do further digging if not and file a separate issue.

pietrop · 2019-03-13T17:17:54Z

Thanks @chrishutchinson ,

haven't had a chance to try it in the PR yet but I did a quick test in the demo app https://bbc.github.io/react-transcript-editor (load demo)

I am able to download draftJs and plain text of current demo in Chrome but not in firefox.

Firefox console says

TypeError: t.refs.timedTextEditor is undefined

which is something to do with getEditorContent in the example/demo app.

Which at the moment is implemented this way

  getEditorContent = (exportFormat) => {
    return this.refs.timedTextEditor.getEditorContent(exportFormat);
  }

But could be refactored as a callback? and not user refs - which is not recommended by React and see if it also fixes it in Firefox? - this could be a separate PR.

Do you get the same error? (I am aware the demo is a different version number compared to master, and compared to the PR)

pietrop · 2019-03-15T09:23:05Z

Thanks for this PR @Gribbs!

generateEntitiesRanges
Glad it worked out to use the shared helper function

Test
The amazon transcribe test fails, because the test file amazonTranscribe.sample.js needs this setup step.

In src/lib/Util/adapters/amazon-transcribe/sample/amazonTranscribe.sample.js all the entity ranges are as

"entityRanges": [{
		"start": 13.03,
		"end": 13.22,
		"confidence": 1,
		"text": "There",
		"offset": 0,
		"length": 5,
		"key": "ico58i6"
	}, {

and because the key is randomly generated it cannot be deterministically tested by jest.
So instead the attribute should be just expecting to be a string

"key": expect.any(String)

As explained in the guide, you could do a find and replace in your code editor, if it supports regex for "key": "([a-zA-Z0-9]*)" and replace with the above. if that makes sense?

you should be able to see detailed Travis CI log of the test failing here

Your tests should then pass.

Console errors

I get this error in the console when adding a AWS transcript

index.js:1446 Warning: Received NaN for the `data-start` attribute. If this is expected, cast the value to a string.
    in span (at Word.js:36)

as a quick fix might be enough to do some type casting and add a .toString() to src/lib/TranscriptEditor/TimedTextEditor/Word.js at line 37-38

  <span
        data-start={ data.start.toString() }
        data-end={ data.end.toString() }

Adding transcript first and then media - error
If i test by adding adding media first and then transcript, it works fine but wen adding transcript first and then media it crashes and gives some errors

Uncaught TypeError: Cannot read property 'isPresentInLocalStorage' of null
    at TranscriptEditor._this.ifPresentRetrieveTranscriptFromLocalStorage

Altho to be fair, might not be specific of this branch - just tested on master and got the same.
Altho it works fine in current demo page - which is a few versions behind, so might need to do some investigating.

I think it might be connected with issue #87 and PR #88 and somehow most likely the bug might have been re-introduced since.

But since it's effecting master as well, I wouldn't necessary aim to figure this out as part of this PR. can be a separate issue/PR.

example usage
You could add this in example usage

import amazonTranscribeToDraft from'./index';
import amazonTranscribeTedTalkTranscript from './sample/autoEdit2TedTalkTranscript.sample.json';

console.log(amazonTranscribeToDraft(amazonTranscribeTedTalkTranscript));

eslint
It be good to re-introduce ESLint in the dev dependencies. did you followed the instructions you got with the error that you copy and pasted up there? do you have a copy of ESLint installed globally?

TL;DR:

would you be able to:

fix the test
add the example usage
~~restore the the linting~~

The other errors, since they are across the master branch can look into it with @jamesdools

Gribbs · 2019-03-15T21:11:24Z

Thanks @pietrop pietrop. Ive added the eslint back in to dev dependencies.

I also was getting sporadic "Received NaN for the data-start attribute" too and couldn't work out what was causing it. I did notice the json returned from amazon transcribe provides the numbers wrapped in quotes like:

"start_time": "13.22"

compared to bbck-kaldi for example, which looks like:

"start": 13.44

and thought this might've been causing it originally even though I was casting to a number. I've had to do some awkward processing with punctuation start and end times since there is no start and end time provided by Transcribe. I've now added an addition .toFixed() method so my decimal places don't go on too long:

start: /punctuation/.test(currentWord.type) ? (parseFloat(previousWord.end_time) + 0.05).toFixed(2) : parseFloat(currentWord.start_time)

For now I'm not getting the error.

For the testing, I've added the "key": expect.any(String) to the amazonTranscribe.sample.js but I still can't get the test to pass. I've tried to debug it but I can't work out how to get any visibility into what is making the test fail (or run forever). I've tried:

yarn test index.test.js

and

yarn test-ci

and

yarn test-ci --verbose

But nothing seems to give me an indication what the problem is. If you can help with that it would be appreciated

pietrop · 2019-03-15T21:15:24Z

Thank for the changes @Gribbs ,

eslint
ah, my bad, I think you were right, eslint doesn't need to be installed in create react app as apparently it's there under the hood (cc @jamesdools for some of the other PRs eg #115)

I had missed that in the top part error message - noticed it in the travis error message

There might be a problem with the project dependency tree.
It is likely not a bug in Create React App, but something you need to fix locally.
The react-scripts package provided by Create React App requires a dependency:
  "eslint": "5.12.0"
Don't try to install it manually: your package manager does it automatically.

tests
No worries, I can have a look into it and see what might be the problem. Then I reckon it'd be good to merge to master.

Gribbs · 2019-03-15T21:23:11Z

Alright sounds good! Thanks again for your help!

pietrop · 2019-03-15T21:46:41Z

I noticed a bug, seems like punctuation such as commas is added is as it's own word?

It might be more straightforward to append it to the previous word as it doesn't come with start and end time informations.

eg

{
        "alternatives": [
          {
            "confidence": null,
            "content": ","
          }
        ],
        "type": "punctuation"
      },

if double clicking the punctuation, eg , it crashes, as time is not associated to it?

index.js:79 Uncaught TypeError: Failed to set the 'currentTime' property on 'HTMLMediaElement': The provided double value is non-finite.

and just to confirm and I can see it get renders as with start and end time as NaN

<span data-start="NaN" data-end="NaN" data-confidence="high" data-prev-times="" data-entity-key="c0a6o9g" class="Word">
	<span data-offset-key="64k16-22-0"><span data-text="true">,</span>
</span>

What could be an easy fix for this @Gribbs ?

Gribbs · 2019-03-15T21:57:57Z

I thought I’d tested that. Maybe my last change with toFixed() has caused it? I was accounting for this by using the previous word end-time as the the next punctuation word start time + a tiny amount of time.

Gribbs · 2019-03-15T22:04:22Z

I’ll take another look. I’ve just left the house so will look at it again this evening

pietrop · 2019-03-15T22:05:57Z

ok no worries, with the change below, commas are also included into having start and end time

const groupWordsInParagraphs = (words) => {
  const results = [];
  let paragraph = {
    words: [],
    text: []
  };
  words.forEach((word, index) => {
    // if word type is punctuation
    const content = word.alternatives[0].content;
    let previousWord = {};
    if (word.type === 'punctuation' && /[.?!]/.test(content)) {
      previousWord = words[index - 1]; //assuming here the very first word is never punctuation
      paragraph.words.push(normalizedWord(word, previousWord));
      paragraph.text.push(content);
      results.push(paragraph);
      // reset paragraph
      paragraph = {
        words: [],
        text: []
      };
    } else if (word.type === 'punctuation' && /[,?!]/.test(content)) {
      previousWord = words[index - 1]; //assuming here the very first word is never punctuation
      paragraph.words.push(normalizedWord(word, previousWord));
      paragraph.text.push(content);
    } else {
      paragraph.words.push(normalizedWord(word, previousWord));
      paragraph.text.push(content);
    }
  });

  return results;
};

altho it looks a little off, coz treating them like words there are then extra spaces before commas and full stops

so there might be a better way to do it, and append punctuation to the previous word.

pietrop · 2019-03-15T22:10:00Z

I am going to have closer look next week

Gribbs · 2019-03-15T22:12:03Z

I think your right. The spaces around the punctuation have been annoying me too. I’ll work on appending it to the previous word. That should simplify the code a bit too

Gribbs · 2019-03-17T22:57:50Z

Hi @pietrop I've completed the work to append punctuation items to the previous word which seems to work well. I've pushed the changes to this branch. I've also added some additional tests in index.test.js file so it should pass ok. I'm skipping the original problematic test for now with describe.skip until I can work out what the issue with that test was.

Gribbs · 2019-03-17T22:58:32Z

oh! I see this pull request has now been closed. Would you like me to put it on separate pull request?

pietrop · 2019-03-17T23:37:50Z

Hi @Gribbs, yes, was reviewing it in branch stt-adapter-awstranscribe-reviewbut since you made new changes feel free to setup a PR from your fork against that branch. if that works for you?

Gribbs · 2019-03-18T00:16:49Z

Hi @Gribbs, yes, was reviewing it in branch stt-adapter-awstranscribe-reviewbut since you made new changes feel free to setup a PR from your fork against that branch. if that works for you?

I've created PR #116 for this

Gribbs added 9 commits March 10, 2019 15:24

added aws transcribe sample without channel enabled

d327cbf

latest

a71b4af

fixed up offset

e6eefbf

fixed up punctuation errors

ac8c7c3

removed debugger statement

d8baa7f

fixing few things

408b017

Merge branch 'stt-adapter-awstranscribe'

ea3581d

finalized 1st draft

3c2ed9e

removed my redundant generateEntitiesRanges method

bcc8466

Gribbs mentioned this pull request Mar 13, 2019

Clarification on how to create an Adapter for AWS Transcribe #108

Closed

pietrop reviewed Mar 13, 2019

View reviewed changes

package.json Outdated Show resolved Hide resolved

added .test.js file

0022ca4

pietrop mentioned this pull request Mar 15, 2019

Fix 87 initialising transcript media v2 #114

Merged

awsgjoel added 2 commits March 16, 2019 07:18

reintroduced linting, added example usage

d0f5182

tried to fix the test

b22dfb0

pietrop changed the base branch from master to stt-adapter-awstranscribe-review March 15, 2019 21:30

pietrop merged commit b22dfb0 into bbc:stt-adapter-awstranscribe-review Mar 15, 2019

Gribbs mentioned this pull request Mar 18, 2019

Add punctuation changes - AWS Adapter #116

Closed

This was referenced Mar 18, 2019

Stt adapter awstranscribe review #118

Closed

Add punctuation changes #119

Merged

Stt adapter aws transcribe - Gribbs #120

Merged

Stt adapter awstranscribe #110

Stt adapter awstranscribe #110

Uh oh!

Conversation

Gribbs commented Mar 13, 2019

Uh oh!

pietrop commented Mar 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

chrishutchinson commented Mar 13, 2019

Uh oh!

pietrop commented Mar 13, 2019

Uh oh!

pietrop commented Mar 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR:

Uh oh!

Gribbs commented Mar 15, 2019

Uh oh!

pietrop commented Mar 15, 2019

Uh oh!

Gribbs commented Mar 15, 2019

Uh oh!

pietrop commented Mar 15, 2019

Uh oh!

Gribbs commented Mar 15, 2019

Uh oh!

Gribbs commented Mar 15, 2019

Uh oh!

pietrop commented Mar 15, 2019

Uh oh!

pietrop commented Mar 15, 2019

Uh oh!

Gribbs commented Mar 15, 2019

Uh oh!

Gribbs commented Mar 17, 2019

Uh oh!

Gribbs commented Mar 17, 2019

Uh oh!

pietrop commented Mar 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Gribbs commented Mar 18, 2019

Uh oh!

Uh oh!

pietrop commented Mar 13, 2019 •

edited

Loading

pietrop commented Mar 15, 2019 •

edited

Loading

pietrop commented Mar 17, 2019 •

edited

Loading