Stt adapter IBM - andrew d anderson #123

Merged 8 commits on Mar 22, 2019
6 changes: 4 additions & 2 deletions README.md
@@ -88,9 +88,11 @@ import { TranscriptEditor } from "@bbc/react-transcript-editor";

<!-- _High level overview of system architecture_ -->

Uses [`create-component-lib`](https://www.npmjs.com/package/create-component-lib) as explaied in this [blog post](https://hackernoon.com/creating-a-library-of-react-components-using-create-react-app-without-ejecting-d182df690c6b) to setup the environment to develop this React component.
uses [`create-component-lib`](https://www.npmjs.com/package/create-component-lib) as explaied in this [blog post](https://hackernoon.com/creating-a-library-of-react-components-using-create-react-app-without-ejecting-d182df690c6b) to setup the environment to develop this React.

This uses [Create React App 2.0](https://reactjs.org/blog/2018/10/01/create-react-app-v2.html) so we are using [CSS Modules](https://github.com/css-modules/css-modules) to contain the scope of the css for this component.
<!--
Uses CSS grid-layout https://medium.com/samsung-internet-dev/common-responsive-layouts-with-css-grid-and-some-without-245a862f48df -->

> Place everything you want to publish to npm inside `src/lib`.

@@ -175,7 +177,7 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md) guidelines and [CODE_OF_CONDUCT.md](./C

## Licence

See [LICENCE.md](./LICENCE.md)
<!-- mention MIT Licence -->

## Legal Disclaimer

4 changes: 2 additions & 2 deletions docs/features-list.md
@@ -56,10 +56,10 @@ Import Transcript Json - Adapters
- [x] News Labs API - BBC Kaldi
- [x] autoEdit 2
- [x] AWS Transcriber
- [x] IBM Watson STT
- [X] Speechmatics
- [ ] Gentle Transcription
- [ ] Gentle Alignment Json
- [ ] IBM Watson STT
- [X] Speechmatics
- [ ] AssemblyAI
- [ ] Rev
- [ ] Srt
56 changes: 29 additions & 27 deletions src/lib/Util/adapters/amazon-transcribe/index.js
@@ -1,35 +1,41 @@
import generateEntitiesRanges from '../generate-entities-ranges/index.js';

/**
* Helper function to generate draft.js entities,
* see unit test for example data structure
* it adds offset and length to recognise word in draftjs
* Converts AWS Transcribe Json to DraftJs
* see `sample` folder for example of input and output as well as `example-usage.js`
*/

import generateEntitiesRanges from '../generate-entities-ranges/index.js';

export const stripLeadingSpace = word => {
return word.replace(/^\s/, '');
};

/**
* @param {json} words - List of words
* @param {string} wordAttributeName - eg 'punct' or 'text' or etc.
* attribute for the word object containing the text. eg word ={ punct:'helo', ... }
* or eg word ={ text:'helo', ... }
*/

export const getBestAlternativeForWord = (word) => {
export const getBestAlternativeForWord = word => {
if (/punctuation/.test(word.type)) {
return Object.assign(word.alternatives[0], { confidence: 1 }); //Transcribe doesn't provide a confidence for punctuation
}
const wordWithHighestConfidence = word.alternatives.reduce(function(prev, current) {
return (parseFloat(prev.confidence) > parseFloat(current.confidence)) ? prev : current;
const wordWithHighestConfidence = word.alternatives.reduce(function(
prev,
current
) {
return parseFloat(prev.confidence) > parseFloat(current.confidence)
? prev
: current;
});

return wordWithHighestConfidence;
};

/**
Normalizes words so they can be used in
the generic generateEntitiesRanges() method
**/

const normalizeWord = (currentWord, previousWord) => {
* Normalizes words so they can be used in
* the generic generateEntitiesRanges() method
**/
const normalizeWord = currentWord => {
const bestAlternative = getBestAlternativeForWord(currentWord);

return {
@@ -52,7 +58,7 @@ export const appendPunctuationToPreviousWord = (punctuation, previousWord) => {
};
};

export const mapPunctuationItemsToWords = (words) => {
export const mapPunctuationItemsToWords = words => {
const itemsToRemove = [];
const dirtyArray = words.map((word, index) => {
let previousWord = {};
@@ -61,8 +67,7 @@ export const mapPunctuationItemsToWords = (words) => {
previousWord = words[index - 1];

return appendPunctuationToPreviousWord(word, previousWord);
}
else {
} else {
return word;
}
});
@@ -72,17 +77,12 @@ export const mapPunctuationItemsToWords = (words) => {
});
};

export const stripLeadingSpace = (word) => {
return word.replace(/^\s/, '');
};

/**
* groups words list from amazon transcribe transcript based on punctuation.
* @todo To be more accurate, should introduce an honorifics library to do the splitting of the words.
* @param {array} words - array of words opbjects from kaldi transcript
* @param {array} words - array of words objects from kaldi transcript
*/

const groupWordsInParagraphs = (words) => {
const groupWordsInParagraphs = words => {
const results = [];
let paragraph = {
words: [],
@@ -106,11 +106,13 @@ const groupWordsInParagraphs = (words) => {
return results;
};

const amazonTranscribeToDraft = (amazonTranscribeJson) => {
const amazonTranscribeToDraft = amazonTranscribeJson => {
const results = [];
const tmpWords = amazonTranscribeJson.results.items;
const wordsWithRemappedPunctuation = mapPunctuationItemsToWords(tmpWords);
const wordsByParagraphs = groupWordsInParagraphs(wordsWithRemappedPunctuation);
const wordsByParagraphs = groupWordsInParagraphs(
wordsWithRemappedPunctuation
);
wordsByParagraphs.forEach((paragraph, i) => {
const draftJsContentBlockParagraph = {
text: paragraph.text.join(' '),
@@ -122,7 +124,7 @@ const amazonTranscribeToDraft = (amazonTranscribeJson) => {
},
// the entities as ranges are each word in the space-joined text,
// so it needs to be compute for each the offset from the beginning of the paragraph and the length
entityRanges: generateEntitiesRanges(paragraph.words, 'text'), // wordAttributeName
entityRanges: generateEntitiesRanges(paragraph.words, 'text') // wordAttributeName
};
results.push(draftJsContentBlockParagraph);
});
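
The hunks above touch three steps of this adapter: choosing the most confident alternative for each item, folding punctuation items into the preceding word, and grouping words into paragraphs. Because the diff elides most of the function bodies, the following is only a rough sketch of how such a pipeline could look. The helper names `pickBestAlternative`, `attachPunctuation`, and `groupIntoParagraphs` are hypothetical and are not the functions in this PR. The item fields (`type`, `start_time`, `end_time`, `alternatives[].content`) assume the usual Amazon Transcribe output shape, and splitting paragraphs on `.`, `?`, or `!` is an assumption based on the "based on punctuation" comment.

```javascript
// Sketch only: hypothetical helpers, not the code from this PR.

// Pick the alternative with the highest confidence.
// Transcribe reports confidence as a string, hence parseFloat.
const pickBestAlternative = item => {
  if (item.type === 'punctuation') {
    // Assumption: punctuation carries no usable confidence, so treat it as 1.
    return Object.assign({}, item.alternatives[0], { confidence: 1 });
  }
  return item.alternatives.reduce((prev, current) =>
    parseFloat(prev.confidence) > parseFloat(current.confidence) ? prev : current
  );
};

// Fold each punctuation item into the word before it ("world" + "." -> "world.").
const attachPunctuation = items => {
  const words = [];
  items.forEach(item => {
    const best = pickBestAlternative(item);
    if (item.type === 'punctuation') {
      if (words.length > 0) {
        words[words.length - 1].text += best.content;
      }
      return;
    }
    words.push({
      start: parseFloat(item.start_time),
      end: parseFloat(item.end_time),
      text: best.content,
      confidence: parseFloat(best.confidence)
    });
  });
  return words;
};

// Assumption: a paragraph ends at any word finishing with '.', '?' or '!'.
const groupIntoParagraphs = words => {
  const paragraphs = [];
  let current = [];
  words.forEach(word => {
    current.push(word);
    if (/[.?!]$/.test(word.text)) {
      paragraphs.push(current);
      current = [];
    }
  });
  if (current.length > 0) {
    paragraphs.push(current);
  }
  return paragraphs;
};

// Made-up sample items for illustration.
const sampleItems = [
  {
    type: 'pronunciation',
    start_time: '0.0',
    end_time: '0.4',
    alternatives: [
      { confidence: '0.93', content: 'Hello' },
      { confidence: '0.61', content: 'Hallo' }
    ]
  },
  {
    type: 'pronunciation',
    start_time: '0.4',
    end_time: '0.9',
    alternatives: [{ confidence: '0.88', content: 'world' }]
  },
  { type: 'punctuation', alternatives: [{ confidence: null, content: '.' }] },
  {
    type: 'pronunciation',
    start_time: '1.2',
    end_time: '1.5',
    alternatives: [{ confidence: '0.97', content: 'Next' }]
  },
  { type: 'punctuation', alternatives: [{ confidence: null, content: '?' }] }
];

const paragraphs = groupIntoParagraphs(attachPunctuation(sampleItems));
console.log(paragraphs.map(p => p.map(w => w.text).join(' ')));
```

With the sample items this yields two paragraphs, "Hello world." and "Next?". A real adapter would likely need the honorifics handling mentioned in the `@todo`, since "Dr." or "Mr." would otherwise end a paragraph.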
29 changes: 2 additions & 27 deletions src/lib/Util/adapters/autoEdit2/index.js
@@ -1,31 +1,6 @@
/**
* Convert autoEdit2 Json
*
* into
*
```
const blocks = [
{
text: 'Hello',
type: 'paragraph',
data: {
speaker: 'Foo',
},
entityRanges: [],
},
{
text: 'World',
type: 'paragraph',
data: {
speaker: 'Bar',
},
entityRanges: [],
},
];
```
*
* See samples folder and test file
* for reference data structures
* Convert autoEdit2 Json to draftJS
* see `sample` folder for example of input and output as well as `example-usage.js`
*/

import generateEntitiesRanges from '../generate-entities-ranges/index';
116 changes: 2 additions & 114 deletions src/lib/Util/adapters/bbc-kaldi/index.js
@@ -1,118 +1,6 @@
/**
* Convert BBC Kaldi json
```
{
"action": "audio-transcribe",
"retval": {
"status": true,
"wonid": "octo:2692ea33-d595-41d8-bfd5-aa7f2d2f89ee",
"punct": "There is a day. About ten years ago when ...",
"words": [
{
"start": 13.02,
"confidence": 0.68,
"end": 13.17,
"word": "there",
"punct": "There",
"index": 0
},
{
"start": 13.17,
"confidence": 0.61,
"end": 13.38,
"word": "is",
"punct": "is",
"index": 1
},
...
```
*
* into
*
```
const blocks = [
{
"text": "There is a day.",
"type": "paragraph",
"data": {
"speaker": "TBC 0",
"words": [
{
"start": 13.02,
"confidence": 0.68,
"end": 13.17,
"word": "there",
"punct": "There",
"index": 0
},
{
"start": 13.17,
"confidence": 0.61,
"end": 13.38,
"word": "is",
"punct": "is",
"index": 1
},
{
"start": 13.38,
"confidence": 0.99,
"end": 13.44,
"word": "a",
"punct": "a",
"index": 2
},
{
"start": 13.44,
"confidence": 1,
"end": 13.86,
"word": "day",
"punct": "day.",
"index": 3
}
],
"start": 13.02
},
"entityRanges": [
{
"start": 13.02,
"end": 13.17,
"confidence": 0.68,
"text": "There",
"offset": 0,
"length": 5,
"key": "li6c6ld"
},
{
"start": 13.17,
"end": 13.38,
"confidence": 0.61,
"text": "is",
"offset": 6,
"length": 2,
"key": "pcgzkp6"
},
{
"start": 13.38,
"end": 13.44,
"confidence": 0.99,
"text": "a",
"offset": 9,
"length": 1,
"key": "ngomd9"
},
{
"start": 13.44,
"end": 13.86,
"confidence": 1,
"text": "day.",
"offset": 11,
"length": 4,
"key": "sgmfl4f"
}
]
},
...
```
* Convert BBC Kaldi json to draftJs
* see `sample` folder for example of input and output as well as `example-usage.js`
*
*/
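
The docblock removed in this hunk documented the `entityRanges` shape: each word gets an `offset` from the start of the paragraph text and a `length`, with the words joined by single spaces. Since `generateEntitiesRanges` itself is not part of this diff, the following is a minimal sketch of that offset calculation under those assumptions. `sketchEntityRanges` is a hypothetical name, and the random `key` field seen in the removed example is left out.

```javascript
// Hypothetical sketch; the real generateEntitiesRanges in this repo may differ.
// wordAttributeName is the property holding the display text, e.g. 'punct' or 'text'.
const sketchEntityRanges = (words, wordAttributeName) => {
  let offset = 0;
  return words.map(word => {
    const text = word[wordAttributeName];
    const range = {
      start: word.start,
      end: word.end,
      confidence: word.confidence,
      text,
      offset,
      length: text.length
    };
    // +1 accounts for the single space used when joining words into the block text.
    offset += text.length + 1;
    return range;
  });
};

// Word data taken from the removed BBC Kaldi docblock example.
const kaldiWords = [
  { start: 13.02, end: 13.17, confidence: 0.68, punct: 'There' },
  { start: 13.17, end: 13.38, confidence: 0.61, punct: 'is' },
  { start: 13.38, end: 13.44, confidence: 0.99, punct: 'a' },
  { start: 13.44, end: 13.86, confidence: 1, punct: 'day.' }
];

const blockText = kaldiWords.map(w => w.punct).join(' ');
const ranges = sketchEntityRanges(kaldiWords, 'punct');
console.log(blockText, ranges.map(r => [r.offset, r.length]));
```

For this input the offsets come out as 0, 6, 9, and 11, matching the values in the removed example, and slicing `blockText` at each `offset` for `length` characters returns the original word.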

6 changes: 6 additions & 0 deletions src/lib/Util/adapters/ibm/example-usage.js
@@ -0,0 +1,6 @@
import ibmToDraft from './index.js';
import ibmTedTalkTranscript from './sample/ibmTedTalkTranscript.sample.json';

const result = ibmToDraft(ibmTedTalkTranscript);

console.log(JSON.stringify(result, null, 2));