Adapters - Guide Docs #51
Merged
10 commits
2a08e4c
Added words to autoEdit blocks
d7cab7b
Adjusted kaldi adapter
86d13ab
added guides for adapters
557db9f
fixed tests for adapters
c442ce8
Removed unecessaery console.log
73e334d
fixed some typos
240ccb6
headings capitalization
46291f7
made changes from comments on PR
f546483
merge master changes
dfe5e65
added node version number travis config
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
language: node_js | ||
node_js: | ||
- "node" | ||
- "10" | ||
|
||
install: | ||
- node --version | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# Guide: How to Create an Adapter - Draft | ||
|
||
_This is a draft. We'd like this guide to be relatively easy to read for newcomers, so [feel free to raise an issue](https://github.com/bbc/react-transcript-editor/issues/new?template=question.md) if you think anything is unclear and we'd be happy to address that._
|
||
Adapters enable the `TranscriptEditor` component to convert transcripts from various speech-to-text (STT) services into a format draftJS can understand, which provides the data for the `TimedTextEditor`.
|
||
## How to create a new adapter | ||
If you want to create a new adapter for a new STT service that is not yet supported by the component, we welcome [PRs](https://help.github.com/articles/about-pull-requests/). | ||
|
||
[Feel free to begin by raising an issue](https://github.com/bbc/react-transcript-editor/issues/new?template=feature_request.md) so that others are aware that there is active development for that specific STT service and, if needed, we can synchronise the effort.
|
||
[Fork the repo](https://help.github.com/articles/fork-a-repo/) and | ||
create a branch with the name of the stt service, eg `stt-adapter-speechmatics`. | ||
|
||
<!-- TODO: adjust link --> | ||
|
||
## Context | ||
|
||
To see this in the larger context: when we call `sttJsonAdapter` with `transcriptData` and a `sttJsonType`, we expect it to return an object with two attributes, `blocks` and `entityMap`.
|
||
This is then used within `TimedTextEditor`, with the help of the draftJs function [convertFromRaw](https://draftjs.org/docs/api-reference-data-conversion#convertfromraw), to create a new content state for the editor.
|
||
So, to convert the JSON from an STT service into draftJs JSON, we need to create:
- a data [block](https://draftjs.org/docs/api-reference-content-block#docsNav) | ||
- [entityRanges](https://draftjs.org/docs/advanced-topics-entities) | ||
- `entityMap` | ||
|
||
Note that `entityMap` and `entityRanges` will get generated programmatically by dedicated functions. | ||
|
||
Check out [a quick side note on how the DraftJS `block`, `entityRanges` and `entityMap` work in the context of the TranscriptEditor component](./draftjs-blocks-entityrange-entitmap.md). If you are not interested in the underlying implementation, feel free to skip this and come back to it later.
|
||
## Steps | ||
|
||
In your branch | ||
|
||
- [ ] Create a folder with the name of the STT service, eg `speechmatics`
- [ ] Add an `adapters/${sttServiceName}/sample` folder
- [ ] Add a sample JSON file from the STT service in this last folder; this will be useful for testing. Name it `${name of the stt service}.sample.json`
<!-- TODO: we should check these json are excluded from the bundle -->
- [ ] Add an option in [adapters/index.js](adapters/index.js)
|
||
In [adapters/index.js](adapters/index.js), in the switch statement of the `sttJsonAdapter` function, add a new `case` for the new STT service type, eg `speechmatics`:
|
||
<!-- TODO: modify import path if module is moved/refactored --> | ||
```js | ||
import speechmaticsToDraft from './speechmatics/index'; | ||
|
||
... | ||
|
||
case 'speechmatics': | ||
blocks = speechmaticsToDraft(transcriptData); | ||
return { blocks, entityMap: createEntityMap(blocks) }; | ||
``` | ||
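As a rough sketch of how the new `case` could sit inside `sttJsonAdapter`, the snippet below wires up a single adapter. The bodies of `speechmaticsToDraft` and `createEntityMap` are placeholder stubs, and both the `blocks` property read by the stub and the error thrown in the `default` branch are assumptions made for illustration, not the project's actual behaviour.

```js
// Placeholder adapter: a real one converts the provider's JSON into draftJs blocks.
// Reading `transcriptData.blocks` is an assumption made only to keep the stub short.
const speechmaticsToDraft = (transcriptData) => transcriptData.blocks || [];

// Placeholder helper: the real one builds the map from the blocks' entityRanges.
const createEntityMap = () => ({});

const sttJsonAdapter = (transcriptData, sttJsonType) => {
  let blocks;
  switch (sttJsonType) {
    case 'speechmatics':
      blocks = speechmaticsToDraft(transcriptData);
      return { blocks, entityMap: createEntityMap(blocks) };
    default:
      // Assumption: fail loudly when the STT type is not supported.
      throw new Error(`Unsupported sttJsonType: ${sttJsonType}`);
  }
};
```

Calling `sttJsonAdapter(transcriptData, 'speechmatics')` then returns the `{ blocks, entityMap }` object described in the Context section.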
|
||
- [ ] add an adapter function. | ||
|
||
As shown in the example, you also need to add a function named after the STT provider plus `ToDraft`, eg `speechmaticsToDraft`, that takes in the transcript data.
|
||
- [ ] create a function to convert the STT data structure into draftJs blocks and entityRanges. | ||
|
||
You can see examples from `bbc-kaldi` and `autoEdit2` adapters. | ||
|
||
In pseudocode, it's recommended to follow this approach:
|
||
1. Expose one function call that takes in the stt json data | ||
2. Have a helper function `groupWordsInParagraphs` that, as the name suggests, groups the list of words from the STT provider transcript into paragraphs based on punctuation, and returns an array of paragraphs, each holding its word objects.
|
||
The underlying details will vary depending on how the provider's STT JSON presents the data, how the attributes are named, and so on.
|
||
3. Iterate over the paragraphs to create draftJS content blocks (see `bbc-kaldi` and `autoEdit2` example). | ||
|
||
```js
const results = [];
wordsByParagraphs.forEach((paragraph, i) => {
  const draftJsContentBlockParagraph = {
    text: paragraph.text.join(' '),
    type: 'paragraph',
    data: {
      speaker: `TBC ${ i }`, // placeholder speaker label
      words: paragraph.words, // original word objects from the STT transcript
      start: paragraph.words[0].start // start time of the first word
    },
    // Each entity range is one word in the space-joined text, so for each word
    // we need to compute its offset from the beginning of the paragraph and its length.
    entityRanges: generateEntitiesRanges(paragraph.words, 'text'), // 'text' is the wordAttributeName
  };
  results.push(draftJsContentBlockParagraph);
});
```
|
||
4. Use the helper function `generateEntitiesRanges` to add the `entityRanges` to each block (see above).
|
||
5. If you have speaker diarization info you can also add this to the block info - _optional_ | ||
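
Step 2 above could be sketched as follows. This assumes each word object carries its punctuated form in a `punct` attribute, like the sample `words` in the DraftJS side note; the attribute name, the sample words, and the choice of `.`, `?` and `!` as paragraph breaks are assumptions and will differ between providers.

```js
// Groups a flat list of words into paragraphs, closing a paragraph
// whenever a word ends with sentence-final punctuation.
const groupWordsInParagraphs = (words) => {
  const paragraphs = [];
  let paragraph = { words: [], text: [] };

  words.forEach((word) => {
    paragraph.words.push(word);
    paragraph.text.push(word.punct);
    if (/[.?!]$/.test(word.punct)) {
      paragraphs.push(paragraph);
      paragraph = { words: [], text: [] };
    }
  });

  // Keep any trailing words that were not closed by punctuation.
  if (paragraph.words.length > 0) {
    paragraphs.push(paragraph);
  }

  return paragraphs;
};

// Illustrative input only.
const words = [
  { start: 13.02, end: 13.17, punct: 'There' },
  { start: 13.17, end: 13.38, punct: 'is' },
  { start: 13.38, end: 13.44, punct: 'a' },
  { start: 13.44, end: 13.86, punct: 'day.' },
  { start: 14.1, end: 14.4, punct: 'Hello' },
];
const wordsByParagraphs = groupWordsInParagraphs(words);
console.log(wordsByParagraphs.map((p) => p.text.join(' ')));
// → [ 'There is a day.', 'Hello' ]
```

Each paragraph keeps both `words` and `text`, which is the shape the `forEach` in step 3 reads from.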
|
||
|
||
## Tests | ||
|
||
This project uses jest, and once you submit the PR the tests are run by TravisCI. It is recommended to write at least some basic tests so that you can see at a glance whether the adapter is working as expected.
|
||
In order to write your tests, you want to have a `sample` folder with transcript data from stt and expected draftJs data output with file extensions `.sample.json` and `.sample.js` - see `bbc-kaldi` and `autoEdit2` example. This is so that those stub/example files are not bundled with the component when packaging for npm. | ||
|
||
_If you don't have much experience with automated testing, don't let this put you off; feel free to raise it as an issue and we can help out._
|
||
**Top tip**: the draftJs block `key` attributes are randomly generated, and therefore cannot be tested in a deterministic way. However, there is a well-established workaround: you can replace them in the JSON with a type matcher. For example, instead of expecting a key to be a specific value, you just expect it to be a string.
|
||
In practice, for instance in Visual Studio Code, you can search using a regex (the `.*` toggle). So you could search for:
|
||
```js | ||
"key": "([a-zA-Z0-9]*)" | ||
``` | ||
And replace all with | ||
```js | ||
"key": expect.any(String)//"ss8pm4p" | ||
``` |
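
If you would rather script that search-and-replace than do it in an editor, a small sketch like the one below applies the same regex to the contents of a sample file. The function name `replaceKeysWithMatcher` and the sample string are hypothetical.

```js
// Replaces every literal "key" value with a jest matcher expression.
const replaceKeysWithMatcher = (jsonText) =>
  jsonText.replace(/"key": "([a-zA-Z0-9]*)"/g, '"key": expect.any(String)');

const sample = '{ "key": "ss8pm4p", "offset": 0 }';
console.log(replaceKeysWithMatcher(sample));
// prints: { "key": expect.any(String), "offset": 0 }
```

Note that the result is no longer valid JSON, which fits with the expected-output files using the `.sample.js` extension rather than `.sample.json`.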
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
|
||
### DraftJS block, entityRanges and entityMap | ||
|
||
A quick side note, for the [adapters](./adapters.md) guide, on how the DraftJS block, entityRanges and entityMap work in the context of the TranscriptEditor component.
|
||
|
||
#### Data Block | ||
|
||
TL;DR: a block is a representation of a paragraph (as an Immutable Record) in draftJs and you can have some custom data associated to it. | ||
|
||
See the docs notes on [draftjs basics](https://github.com/bbc/react-transcript-editor/blob/master/docs/notes/draftjs/2018-10-01-draftjs-1-basics.md), as well as the official draftJs docs, to better understand the role of a content block within the editor.
|
||
Here's an example of a block. You can see it can contain some custom data, eg the speaker name, the list of words, and the start time (which would be the start time of the first word).
|
||
```js | ||
[ | ||
{ | ||
"text": "There is a day.", // text | ||
"type": "paragraph", // type of block | ||
"data": { //optional custom data | ||
"speaker": "TBC 0", | ||
"words": [ | ||
... | ||
], | ||
"start": 13.02 | ||
}, | ||
"entityRanges": [ // <-- entity ranges | ||
... | ||
] | ||
}, | ||
... | ||
``` | ||
|
||
It also contains a list of `entityRanges`. | ||
|
||
#### Entity Ranges
|
||
`entityRanges` are part of individual blocks. | ||
|
||
<!-- See the docs notes on [draftjs entity ranges](https://github.com/bbc/react-transcript-editor/blob/master/docs/notes/draftjs/2018-10-02-drafjs-2-entity-range.md) --> | ||
|
||
From draftJs docs on [entity](https://draftjs.org/docs/advanced-topics-entities) | ||
|
||
> the Entity system, which Draft uses for annotating ranges of text with metadata. Entities introduce levels of richness beyond styled text. Links, mentions, and embedded content can all be implemented using entities. | ||
|
||
This is what we use to identify the words within a list of characters, and to associate data with them, such as start and end time information.
|
||
It lays the foundation for features such as clicking on a word to jump the player's play-head to the corresponding time for that word.
|
||
Here's an example of `entityRanges` in the context of a data block. | ||
|
||
Required fields are the `offset`, and `length`, which are used to identify the entity within the characters of the `text` attribute of the block. | ||
|
||
This, combined with the `entityMap`, has the advantage that if you type or delete some text before a certain entity, draftJs will do the groundwork of adjusting the offsets and keeping this information in sync.
|
||
```js | ||
[ | ||
|
||
{ | ||
"text": "There is a day.", | ||
"type": "paragraph", | ||
"data": { | ||
... | ||
}, | ||
"entityRanges": [ | ||
{ | ||
"start": 13.02, // Custom fields | ||
"end": 13.17, // Custom fields | ||
"confidence": 0.68, // Custom fields | ||
"text": "There", // Custom fields - to detect what has changed | ||
"offset": 0, // Required by Draft.js to know start of "selection" | ||
"length": 5, //Required by Draft.js to know end of "selection" - in our case a word | ||
"key": "ctavu0r" // can also be provided by draftjs if not provided. But providing your own gives more flexibility | ||
}, | ||
... | ||
``` | ||
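
To make the `offset` and `length` bookkeeping concrete, here is a minimal sketch of how a helper like `generateEntitiesRanges` might compute them. It assumes the block text is the words joined by single spaces; the key format is illustrative only, and the project's own helper may differ.

```js
const generateEntitiesRanges = (words, wordAttributeName) => {
  let position = 0; // running offset into the space-joined paragraph text

  return words.map((word) => {
    const text = word[wordAttributeName];
    const range = {
      start: word.start,
      end: word.end,
      confidence: word.confidence,
      text,
      offset: position,
      length: text.length,
      key: Math.random().toString(36).substring(2, 9), // random, so not deterministic
    };
    position += text.length + 1; // +1 for the space that follows each word
    return range;
  });
};

const ranges = generateEntitiesRanges(
  [
    { start: 13.02, end: 13.17, confidence: 0.68, punct: 'There' },
    { start: 13.17, end: 13.38, confidence: 0.61, punct: 'is' },
    { start: 13.38, end: 13.44, confidence: 0.99, punct: 'a' },
    { start: 13.44, end: 13.86, confidence: 1, punct: 'day.' },
  ],
  'punct'
);
console.log(ranges.map((r) => [r.offset, r.length]));
// offset/length pairs: [0, 5], [6, 2], [9, 1], [11, 4]
```

These pairs line up with the words in "There is a day." when the text is indexed from zero.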
|
||
#### Entity Map
|
||
`entityMap` defines how to render the entities for the draftJs content state. | ||
|
||
See draftJs docs for more on [entities](https://draftjs.org/docs/advanced-topics-entities#introduction) | ||
|
||
It also keeps the `entityRanges` in sync through the `offset` and `length` attributes.
|
||
Here's an example | ||
```js | ||
{ | ||
"ayx62lj": { | ||
"type": "WORD", | ||
"mutability": "MUTABLE", | ||
"data": { | ||
"start": 13.02, | ||
"end": 13.17, | ||
"confidence": 0.68, | ||
"text": "There", | ||
"offset": 0, | ||
"length": 5, | ||
"key": "ayx62lj" | ||
} | ||
}, | ||
``` | ||
|
||
To see this in the larger context: when we call `sttJsonAdapter` with `transcriptData` and a `sttJsonType`, we expect it to return an object with two attributes, `blocks` and `entityMap`.
|
||
```js | ||
{ | ||
"blocks": [ | ||
{ | ||
"key": "500r2", | ||
"text": "There is a day.", | ||
"type": "paragraph", | ||
"depth": 0, | ||
"inlineStyleRanges": [], | ||
"entityRanges": [ | ||
{ | ||
"offset": 0, | ||
"length": 5, | ||
"key": 0 | ||
}, | ||
{ | ||
"offset": 6, | ||
"length": 2, | ||
"key": 1 | ||
}, | ||
{ | ||
"offset": 9, | ||
"length": 1, | ||
"key": 2 | ||
}, | ||
{ | ||
"offset": 11, | ||
"length": 4, | ||
"key": 3 | ||
} | ||
], | ||
"data": { | ||
"speaker": "test4", | ||
"words": [ | ||
{ | ||
"start": 13.02, | ||
"confidence": 0.68, | ||
"end": 13.17, | ||
"word": "there", | ||
"punct": "There", | ||
"index": 0 | ||
}, | ||
{ | ||
"start": 13.17, | ||
"confidence": 0.61, | ||
"end": 13.38, | ||
"word": "is", | ||
"punct": "is", | ||
"index": 1 | ||
}, | ||
{ | ||
"start": 13.38, | ||
"confidence": 0.99, | ||
"end": 13.44, | ||
"word": "a", | ||
"punct": "a", | ||
"index": 2 | ||
}, | ||
{ | ||
"start": 13.44, | ||
"confidence": 1, | ||
"end": 13.86, | ||
"word": "day", | ||
"punct": "day.", | ||
"index": 3 | ||
} | ||
], | ||
"start": 13.02 | ||
} | ||
}, | ||
... | ||
], | ||
"entityMap": { | ||
"0": { | ||
"type": "WORD", | ||
"mutability": "MUTABLE", | ||
"data": { | ||
"start": 13.02, | ||
"end": 13.17, | ||
"confidence": 0.68, | ||
"text": "There", | ||
"offset": 0, | ||
"length": 5, | ||
"key": "1mgy3gm" | ||
} | ||
}, | ||
.... | ||
} | ||
``` | ||
|
||
|
||
The good news is that, given the blocks and the entityRanges, we can programmatically generate the entityMap, which means you don't have to worry about creating the entityMap when making an adapter.
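
As an illustration of why this is possible, a helper along these lines could build the `entityMap` by indexing every entity range by its `key`. This is a sketch under the assumption that each range already has a unique `key`; the sample block and keys are made up, and the project's own `createEntityMap` may differ in its details.

```js
// Builds an entityMap with one WORD entity per entity range, keyed by range.key.
const createEntityMap = (blocks) => {
  const entityMap = {};
  blocks.forEach((block) => {
    block.entityRanges.forEach((range) => {
      entityMap[range.key] = {
        type: 'WORD',
        mutability: 'MUTABLE',
        data: range,
      };
    });
  });
  return entityMap;
};

// Illustrative input only.
const blocks = [
  {
    text: 'There is',
    type: 'paragraph',
    entityRanges: [
      { text: 'There', offset: 0, length: 5, key: 'ayx62lj' },
      { text: 'is', offset: 6, length: 2, key: 'b7k2m9q' },
    ],
  },
];
const entityMap = createEntityMap(blocks);
console.log(Object.keys(entityMap));
// → [ 'ayx62lj', 'b7k2m9q' ]
```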
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
const autoEdit2ToDraft = require('./index'); | ||
- const autoEdit2TedTalkTranscript = require('./sample/autoEdit2TedTalkTranscript-sample.json');
+ const autoEdit2TedTalkTranscript = require('./sample/autoEdit2TedTalkTranscript.sample.json');
|
||
console.log(autoEdit2ToDraft(autoEdit2TedTalkTranscript)); |
File renamed without changes.
Don't think you really need this + line 34 as well.
this? as in the custom data? or the example code?