Adapters - Guide Docs #51

Merged
merged 10 commits into from
Dec 18, 2018
1 change: 1 addition & 0 deletions .travis.yml
@@ -1,6 +1,7 @@
language: node_js
node_js:
- "node"
- "10"

install:
- node --version
115 changes: 115 additions & 0 deletions docs/guides/adapters.md
@@ -0,0 +1,115 @@
# Guide: How to Create an Adapter - Draft

_This is a draft. We'd like this guide to be relatively easy to read for newcomers, so [feel free to raise an issue](https://github.com/bbc/react-transcript-editor/issues/new?template=question.md) if anything is unclear and we'd be happy to address it._

Adapters let the `TranscriptEditor` component convert transcripts from various speech-to-text (STT) services into a format draftJs can understand, which in turn provides the data for the `TimedTextEditor`.

## How to create a new adapter
If you want to create an adapter for an STT service that the component does not yet support, we welcome [PRs](https://help.github.com/articles/about-pull-requests/).

[Feel free to begin by raising an issue](https://github.com/bbc/react-transcript-editor/issues/new?template=feature_request.md) so that others are aware that there is active development for that specific STT service and, if needed, we can coordinate the effort.

[Fork the repo](https://help.github.com/articles/fork-a-repo/) and
create a branch with the name of the STT service, eg `stt-adapter-speechmatics`.

<!-- TODO: adjust link -->

## Context

In the larger context, when we call `sttJsonAdapter` with `transcriptData` and an `sttJsonType`, we expect it to return an object with two attributes: `blocks` and `entityMap`.

This object is then used within `TimedTextEditor`, with the help of the draftJs function [convertFromRaw](https://draftjs.org/docs/api-reference-data-conversion#convertfromraw), to create a new content state for the editor.

So, to convert the JSON from an STT service into draftJs JSON, we need to create:
- a data [block](https://draftjs.org/docs/api-reference-content-block#docsNav)
- [entityRanges](https://draftjs.org/docs/advanced-topics-entities)
- `entityMap`

Note that `entityMap` and `entityRanges` will get generated programmatically by dedicated functions.

Check out [a quick side note on how the draftJs `block`, `entityRanges` and `entityMap` work in the context of the TranscriptEditor component](./draftjs-blocks-entityrange-entitmap.md). Feel free to skip it and come back to it later if you are not interested in the underlying implementation.

## Steps

In your branch

- [ ] Create a folder with the name of the STT service - eg `speechmatics`
- [ ] Add an `adapters/${sttServiceName}/sample` folder
- [ ] Add a sample JSON file from the STT service to that folder; it will be useful for testing. Name it `${name of the stt service}.sample.json`
<!-- TODO: we should check these json are excluded from the bundle -->
- [ ] Add an option in [adapters/index.js](adapters/index.js)

In [adapters/index.js](adapters/index.js), add a new `case` to the switch statement of the `sttJsonAdapter` function for the new STT service type, eg `speechmatics`:

<!-- TODO: modify import path if module is moved/refactored -->
```js
import speechmaticsToDraft from './speechmatics/index';

...

case 'speechmatics':
  blocks = speechmaticsToDraft(transcriptData);
  return { blocks, entityMap: createEntityMap(blocks) };
```

- [ ] Add an adapter function.

As shown in the example, you also need to add a function named after the STT provider plus `ToDraft`, eg `speechmaticsToDraft`, that takes in the transcript data.
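
For orientation, the pieces above could fit together roughly as in the sketch below. It is not the project's code: the two stub functions are hypothetical stand-ins for the real adapter and the real `createEntityMap` helper, and the `default` branch is only an assumption about how an unsupported type might be handled.

```js
// Hypothetical stand-ins so the sketch runs on its own; in the project these
// would be the real speechmatics adapter and the real createEntityMap helper.
const speechmaticsToDraft = (transcriptData) => [
  { text: 'placeholder', type: 'paragraph', data: {}, entityRanges: [] }
];
const createEntityMap = (blocks) => ({});

const sttJsonAdapter = (transcriptData, sttJsonType) => {
  let blocks;
  switch (sttJsonType) {
    case 'speechmatics':
      blocks = speechmaticsToDraft(transcriptData);
      return { blocks, entityMap: createEntityMap(blocks) };
    default:
      // Assumption: fail loudly on a type that has no adapter.
      throw new Error(`Unsupported STT json type: ${sttJsonType}`);
  }
};

// The caller receives the two attributes that convertFromRaw expects.
const { blocks, entityMap } = sttJsonAdapter({}, 'speechmatics');
console.log(blocks.length, Object.keys(entityMap).length);
```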

- [ ] create a function to convert the STT data structure into draftJs blocks and entityRanges.

You can see examples from `bbc-kaldi` and `autoEdit2` adapters.

In pseudocode, the recommended approach is:

1. Expose one function that takes in the STT JSON data.
2. Have a helper function `groupWordsInParagraphs` that, as the name suggests, groups the word list from the STT provider's transcript into paragraphs based on punctuation, and returns an array of paragraphs, each holding its word objects.

The underlying details will vary depending on how the provider's STT JSON presents the data, how the attributes are named, and so on.

3. Iterate over the paragraphs to create draftJS content blocks (see `bbc-kaldi` and `autoEdit2` example).

```js
import generateEntitiesRanges from '../generate-entities-ranges/index';

// wordsByParagraphs is the output of groupWordsInParagraphs (step 2)
const results = [];

wordsByParagraphs.forEach((paragraph, i) => {
  const draftJsContentBlockParagraph = {
    text: paragraph.text.join(' '),
    type: 'paragraph',
    data: {
      speaker: `TBC ${ i }`,
      words: paragraph.words,
      start: paragraph.words[0].start
    },
    // each word in the space-joined text is an entity range, so its offset from
    // the beginning of the paragraph and its length need to be computed
    entityRanges: generateEntitiesRanges(paragraph.words, 'text'), // second argument: wordAttributeName
  };
  // console.log(JSON.stringify(draftJsContentBlockParagraph, null, 2))
  results.push(draftJsContentBlockParagraph);
});
```

4. Use the helper function `generateEntitiesRanges` to add the `entityRanges` to each block, as shown above.

5. If you have speaker diarization info, you can also add it to the block data (_optional_).
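
Step 2 is the part that differs most between providers, so here is a minimal sketch of what a `groupWordsInParagraphs` helper could look like. It assumes a flat array of word objects with `start`, `end` and `text` attributes, where `text` carries the punctuation; those attribute names and the sample values are assumptions, and the bundled adapters may split paragraphs differently.

```js
// Assumed input: a flat array of word objects such as
// { start: 13.02, end: 13.17, text: 'There' }; real providers name these differently.
const groupWordsInParagraphs = (words) => {
  const results = [];
  let paragraph = { words: [], text: [] };

  words.forEach((word) => {
    paragraph.words.push(word);
    paragraph.text.push(word.text);
    // Close the current paragraph when the word ends a sentence.
    if (/[.?!]$/.test(word.text)) {
      results.push(paragraph);
      paragraph = { words: [], text: [] };
    }
  });

  // Keep trailing words that were not followed by closing punctuation.
  if (paragraph.words.length > 0) {
    results.push(paragraph);
  }

  return results;
};

const sampleWords = [
  { start: 13.02, end: 13.17, text: 'There' },
  { start: 13.17, end: 13.38, text: 'is' },
  { start: 13.38, end: 13.44, text: 'a' },
  { start: 13.44, end: 13.86, text: 'day.' },
  { start: 14.01, end: 14.2, text: 'Next' }
];

const wordsByParagraphs = groupWordsInParagraphs(sampleWords);
// Two paragraphs: 'There is a day.' and 'Next'
console.log(wordsByParagraphs.map((p) => p.text.join(' ')));
```

Each returned paragraph has the `text` and `words` arrays that the block-building loop in step 3 reads.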


## Tests

This project uses Jest, and once you submit the PR the tests are run by Travis CI. We recommend writing at least some basic tests so that you can see at a glance whether the adapter is working as expected.

To write your tests, you need a `sample` folder containing the transcript data from the STT service and the expected draftJs output, with the file extensions `.sample.json` and `.sample.js` respectively (see the `bbc-kaldi` and `autoEdit2` examples). These extensions keep the stub/example files out of the bundle when the component is packaged for npm.

_If you don't have much experience with automated testing, don't let this put you off. Feel free to raise it as an issue and we can help out._

**Top tip**: the draftJs block `key` attributes are randomly generated, so they cannot be tested deterministically. A well-established workaround is to replace them in the expected output with a type check: instead of expecting a specific value, you just expect a string.

In practice, in Visual Studio Code for instance, you can search using a regex (the `.*` option). So you could search for

```js
"key": "([a-zA-Z0-9]*)"
```
And replace all with
```js
"key": expect.any(String)//"ss8pm4p"
```
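
The same search and replace can be scripted. The sketch below applies the regex to a made-up fragment held in a string; the fragment and variable names are illustrative only. It drops the original key rather than keeping it in a trailing `//` comment, because a line comment would also comment out a comma that follows the key on the same line.

```js
// Made-up fragment of an expected-output sample, held in a string for the demo.
const expectedSnippet = [
  '{',
  '  "offset": 0,',
  '  "length": 5,',
  '  "key": "ss8pm4p"',
  '}'
].join('\n');

// Same pattern as the editor search above, with the global flag to replace every key.
const keyPattern = /"key": "([a-zA-Z0-9]*)"/g;

// Swap each literal key for the Jest matcher that accepts any string.
const withMatchers = expectedSnippet.replace(keyPattern, '"key": expect.any(String)');

console.log(withMatchers);
```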
196 changes: 196 additions & 0 deletions docs/guides/draftjs-blocks-entityrange-entitmap.md
@@ -0,0 +1,196 @@

### DraftJS block, entityRanges and entityMap

A quick side note on how the draftJs `block`, `entityRanges` and `entityMap` work in the context of the TranscriptEditor component. It accompanies the [adapters](./adapters.md) guide.


#### Data Block

TL;DR: a block is a representation of a paragraph (as an Immutable Record) in draftJs and you can have some custom data associated to it.

See the docs notes on [draftjs basics](https://github.com/bbc/react-transcript-editor/blob/master/docs/notes/draftjs/2018-10-01-draftjs-1-basics.md), as well as the official draftJs docs, to better understand the role of content blocks within the editor.

Here's an example of a block. It can contain some custom data, eg the speaker name, the list of words, and the start time (which is the start time of the first word).

```js
[
  {
    "text": "There is a day.", // text
    "type": "paragraph", // type of block
    "data": { // optional custom data
      "speaker": "TBC 0",
      "words": [
        ...
      ],
      "start": 13.02
    },
    "entityRanges": [ // <-- entity ranges
      // PR review comment (Contributor): Don't think you really need this + line 34 as well.
      // PR review reply (Contributor Author): this? as in the custom data? or the example code?
      ...
    ]
  },
  ...
```

It also contains a list of `entityRanges`.

#### Entity Ranges

`entityRanges` are part of individual blocks.

<!-- See the docs notes on [draftjs entity ranges](https://github.com/bbc/react-transcript-editor/blob/master/docs/notes/draftjs/2018-10-02-drafjs-2-entity-range.md) -->

From the draftJs docs on [entities](https://draftjs.org/docs/advanced-topics-entities):

> the Entity system, which Draft uses for annotating ranges of text with metadata. Entities introduce levels of richness beyond styled text. Links, mentions, and embedded content can all be implemented using entities.

This is what we use to identify the words within a list of characters and to associate data with them, such as start and end time information.

It lays the foundation for features such as clicking on a word to jump the player's play-head to the corresponding time for that word.

Here's an example of `entityRanges` in the context of a data block.

The required fields are `offset` and `length`, which are used to identify the entity within the characters of the block's `text` attribute.

Combined with the `entityMap`, this has the advantage that if you type or delete some text before a certain entity, draftJs does the groundwork of adjusting the offsets and keeping this information in sync.

```js
[
  {
    "text": "There is a day.",
    "type": "paragraph",
    "data": {
      ...
    },
    "entityRanges": [
      {
        "start": 13.02, // Custom fields
        "end": 13.17, // Custom fields
        "confidence": 0.68, // Custom fields
        "text": "There", // Custom fields - to detect what has changed
        "offset": 0, // Required by Draft.js to know start of "selection"
        "length": 5, // Required by Draft.js to know end of "selection" - in our case a word
        "key": "ctavu0r" // can also be provided by draftjs if not provided. But providing your own gives more flexibility
      },
      ...
```
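
To make `offset` and `length` concrete, here is a small sketch of how they could be computed for words joined by single spaces. `computeRanges` is a hypothetical helper written for this note, not the project's `generateEntitiesRanges`, and it leaves out the `key` and `confidence` fields.

```js
// Hypothetical helper: one range per word, assuming words are joined by a single space.
const computeRanges = (words, wordAttributeName) => {
  let position = 0;

  return words.map((word) => {
    const text = word[wordAttributeName];
    const range = {
      start: word.start,
      end: word.end,
      text,
      offset: position,
      length: text.length
    };
    // Advance past the word plus the single space that joins it to the next one.
    position += text.length + 1;

    return range;
  });
};

const words = [
  { start: 13.02, end: 13.17, punct: 'There' },
  { start: 13.17, end: 13.38, punct: 'is' },
  { start: 13.38, end: 13.44, punct: 'a' },
  { start: 13.44, end: 13.86, punct: 'day.' }
];

const ranges = computeRanges(words, 'punct');
// Offsets 0, 6, 9, 11 with lengths 5, 2, 1, 4 for "There is a day."
console.log(ranges.map(({ offset, length }) => [ offset, length ]));
```

Slicing the paragraph text at each `offset` for `length` characters gives back the original word, which is how the editor can map a character position to a word and its timings.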

#### Entity Map

`entityMap` defines how to render the entities for the draftJs content state.

See draftJs docs for more on [entities](https://draftjs.org/docs/advanced-topics-entities#introduction)

The `entityMap` is kept in sync with the `entityRanges` through the `offset` and `length` attributes.

Here's an example:
```js
{
  "ayx62lj": {
    "type": "WORD",
    "mutability": "MUTABLE",
    "data": {
      "start": 13.02,
      "end": 13.17,
      "confidence": 0.68,
      "text": "There",
      "offset": 0,
      "length": 5,
      "key": "ayx62lj"
    }
  },
```

In the larger context, when we call `sttJsonAdapter` with `transcriptData` and an `sttJsonType`, we expect it to return an object with two attributes: `blocks` and `entityMap`.

```js
{
  "blocks": [
    {
      "key": "500r2",
      "text": "There is a day.",
      "type": "paragraph",
      "depth": 0,
      "inlineStyleRanges": [],
      "entityRanges": [
        {
          "offset": 0,
          "length": 5,
          "key": 0
        },
        {
          "offset": 6,
          "length": 2,
          "key": 1
        },
        {
          "offset": 9,
          "length": 1,
          "key": 2
        },
        {
          "offset": 11,
          "length": 4,
          "key": 3
        }
      ],
      "data": {
        "speaker": "test4",
        "words": [
          {
            "start": 13.02,
            "confidence": 0.68,
            "end": 13.17,
            "word": "there",
            "punct": "There",
            "index": 0
          },
          {
            "start": 13.17,
            "confidence": 0.61,
            "end": 13.38,
            "word": "is",
            "punct": "is",
            "index": 1
          },
          {
            "start": 13.38,
            "confidence": 0.99,
            "end": 13.44,
            "word": "a",
            "punct": "a",
            "index": 2
          },
          {
            "start": 13.44,
            "confidence": 1,
            "end": 13.86,
            "word": "day",
            "punct": "day.",
            "index": 3
          }
        ],
        "start": 13.02
      }
    },
    ...
  ],
  "entityMap": {
    "0": {
      "type": "WORD",
      "mutability": "MUTABLE",
      "data": {
        "start": 13.02,
        "end": 13.17,
        "confidence": 0.68,
        "text": "There",
        "offset": 0,
        "length": 5,
        "key": "1mgy3gm"
      }
    },
    ...
  }
```


The good news is that, given the blocks and their `entityRanges`, we can generate the `entityMap` programmatically, which means you don't have to worry about creating the `entityMap` when making an adapter.
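
As an illustration of that last point, the sketch below derives an entity map from a list of blocks. `buildEntityMap` is a hypothetical name and this is not the project's `createEntityMap`; it simply assumes each entity range carries its own `key` and becomes a `WORD` entity, as in the examples above. The second key and the sample values are made up.

```js
// Hypothetical sketch: one entityMap entry per entity range, indexed by the range's key.
const buildEntityMap = (blocks) => {
  const entityMap = {};

  blocks.forEach((block) => {
    block.entityRanges.forEach((range) => {
      entityMap[range.key] = {
        type: 'WORD',
        mutability: 'MUTABLE',
        data: range
      };
    });
  });

  return entityMap;
};

const exampleBlocks = [
  {
    text: 'There is',
    type: 'paragraph',
    entityRanges: [
      { start: 13.02, end: 13.17, text: 'There', offset: 0, length: 5, key: 'ayx62lj' },
      { start: 13.17, end: 13.38, text: 'is', offset: 6, length: 2, key: 'b3k9q2x' }
    ]
  }
];

const exampleEntityMap = buildEntityMap(exampleBlocks);
console.log(Object.keys(exampleEntityMap));
```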
2 changes: 1 addition & 1 deletion package.json
@@ -20,7 +20,7 @@
"test": "react-scripts test --env=jsdom",
"eject": "react-scripts eject",
"build:example": "react-scripts build",
-"build:component": "rimraf dist && NODE_ENV=production babel src/lib --out-dir dist --copy-files --ignore __tests__,spec.js,test.js,__snapshots__",
+"build:component": "rimraf dist && NODE_ENV=production babel src/lib --out-dir dist --copy-files --ignore __tests__,spec.js,test.js,__snapshots__,sample.json,sample.js ",
"deploy:ghpages": "npm run build:example && gh-pages -d build",
"test-ci": "CI=true react-scripts test --env=jsdom --verbose",
"lint": "eslint --ignore-path .eslintignore .",
2 changes: 1 addition & 1 deletion src/lib/Util/adapters/autoEdit2/example-usage.js
@@ -1,4 +1,4 @@
const autoEdit2ToDraft = require('./index');
-const autoEdit2TedTalkTranscript = require('./sample/autoEdit2TedTalkTranscript-sample.json');
+const autoEdit2TedTalkTranscript = require('./sample/autoEdit2TedTalkTranscript.sample.json');

console.log(autoEdit2ToDraft(autoEdit2TedTalkTranscript));
10 changes: 6 additions & 4 deletions src/lib/Util/adapters/autoEdit2/index.js
@@ -31,9 +31,9 @@
import generateEntitiesRanges from '../generate-entities-ranges/index';

/**
- * groups words list from kaldi transcript based on punctuation.
+ * groups words list from autoEdit transcript based on punctuation.
  * @todo To be more accurate, should introduce an honorifics library to do the splitting of the words.
- * @param {array} words - array of words opbjects from kaldi transcript
+ * @param {array} words - array of words objects from autoEdit transcript
*/

const groupWordsInParagraphs = (autoEditText) => {
@@ -74,12 +74,14 @@ const autoEdit2ToDraft = (autoEdit2Json) => {
const tmpWords = autoEdit2Json.text;
const wordsByParagraphs = groupWordsInParagraphs(tmpWords);

-wordsByParagraphs.forEach((paragraph) => {
+wordsByParagraphs.forEach((paragraph, i) => {
const draftJsContentBlockParagraph = {
text: paragraph.text.join(' '),
type: 'paragraph',
data: {
-speaker: 'TBC',
+speaker: `TBC ${ i }`,
+words: paragraph.words,
+start: paragraph.words[0].start
},
// the entities as ranges are each word in the space-joined text,
// so it needs to be compute for each the offset from the beginning of the paragraph and the length
4 changes: 2 additions & 2 deletions src/lib/Util/adapters/autoEdit2/index.test.js
@@ -1,7 +1,7 @@
import autoEdit2ToDraft from './index';
// TODO: could make this test run faster by shortning the two sample to one or two paragraphs?
-import draftTranscriptExample from './sample/autoEdit2ToDraft-sample';
-import autoEdit2TedTalkTranscript from './sample/autoEdit2TedTalkTranscript-sample.json';
+import draftTranscriptExample from './sample/autoEdit2ToDraft.sample.js';
+import autoEdit2TedTalkTranscript from './sample/autoEdit2TedTalkTranscript.sample.json';

describe('bbcKaldiToDraft', () => {
const result = autoEdit2ToDraft(autoEdit2TedTalkTranscript, 'text');