bbc · pietrop · Dec 18, 2018 · Dec 14, 2018 · Dec 14, 2018 · Dec 14, 2018
diff --git a/.travis.yml b/.travis.yml
@@ -1,6 +1,7 @@
 language: node_js
 node_js:
  - "node"
+ - "10"
 
 install:
   - node --version

diff --git a/docs/guides/adapters.md b/docs/guides/adapters.md
@@ -0,0 +1,115 @@
+# Guide: How to Create an Adapter - Draft
+
+_This is a draft. We'd like this guide to be relatively easy to read for newcomers, so [feel free to raise an issue](https://github.com/bbc/react-transcript-editor/issues/new?template=question.md) if you think anything is unclear and we'd be happy to address it._
+
+Adapters let the `TranscriptEditor` component convert transcripts from various speech-to-text (STT) services into a format draftJS can understand, which in turn provides the data for the `TimedTextEditor`.
+
+## How to create a new adapter
+If you want to create an adapter for an STT service that is not yet supported by the component, we welcome [PRs](https://help.github.com/articles/about-pull-requests/).
+
+[Feel free to begin by raising an issue](https://github.com/bbc/react-transcript-editor/issues/new?template=feature_request.md) so that others are aware of active development for that specific STT service and, if needed, we can synchronise the effort.
+
+[Fork the repo](https://help.github.com/articles/fork-a-repo/) and create a branch with the name of the STT service, eg `stt-adapter-speechmatics`.
+
+<!-- TODO: adjust link -->
+
+## Context
+
+To see this in the larger context: when we call `sttJsonAdapter` with `transcriptData` and a `sttJsonType`, we expect it to return an object with two attributes, `blocks` and `entityMap`.
+
+This is then used within `TimedTextEditor`, with the help of the draftJs function [convertFromRaw](https://draftjs.org/docs/api-reference-data-conversion#convertfromraw), to create a new content state for the editor.
+
+So in order to convert the json from an STT service to draftJs json we need to create:
+- a data [block](https://draftjs.org/docs/api-reference-content-block#docsNav)
+- [entityRanges](https://draftjs.org/docs/advanced-topics-entities) 
+- `entityMap` 
+
+Note that `entityMap` and `entityRanges` are generated programmatically by dedicated functions.
+
+Check out [a quick side note on how the DraftJS `block`, `entityRanges` and `entityMap` work in the context of the TranscriptEditor component](./draftjs-blocks-entityrange-entitmap.md). Or feel free to skip this and come back to it later if you are not interested in the underlying implementation.
+
+## Steps
+
+In your branch:
+
+- [ ] create a folder with the name of the STT service, eg `speechmatics`
+- [ ] add an `adapters/${sttServiceName}/sample` folder
+- [ ] add a sample json file from the STT service in this last folder, which will be useful for testing. Name it `${sttServiceName}.sample.json`
+<!-- TODO: we should check these json are excluded from the bundle -->
+- [ ] add an option for the new service in [adapters/index.js](adapters/index.js)
+
+In [adapters/index.js](adapters/index.js), add a new `case` to the switch statement of the `sttJsonAdapter` function for the new STT service type, eg `speechmatics`.
+
+<!-- TODO: modify import path if module is moved/refactored -->
+```js
+import speechmaticsToDraft from './speechmatics/index';
+
+...
+
+case 'speechmatics':
+      blocks = speechmaticsToDraft(transcriptData);
+      return { blocks, entityMap: createEntityMap(blocks) };
+```
+
+- [ ] add an adapter function.
+
+As shown in the example, you also need to add a function named after the STT provider plus `ToDraft`, eg `speechmaticsToDraft`, that takes in the transcript data.
+
+- [ ] create a function to convert the STT data structure into draftJs blocks and entityRanges.
+
+You can see examples from `bbc-kaldi` and `autoEdit2` adapters.
+
+In pseudocode it's recommended to follow this approach:
+
+1. Expose one function that takes in the STT json data.
+2. Have a helper function `groupWordsInParagraphs` that, as the name suggests, groups the list of words from the STT provider transcript into paragraphs based on punctuation, and returns an array of paragraphs, each with its `words` and `text`.
+
+The underlying details will vary depending on how the provider's STT json presents the data, how the attributes are named, etc.
+
+3. Iterate over the paragraphs to create draftJS content blocks (see the `bbc-kaldi` and `autoEdit2` examples).
+
+```js
+wordsByParagraphs.forEach((paragraph, i) => {
+    const draftJsContentBlockParagraph = {
+      text: paragraph.text.join(' '),
+      type: 'paragraph',
+      data: {
+        speaker: `TBC ${ i }`,
+        words: paragraph.words,
+        start: paragraph.words[0].start
+      },
+      // each entity range is a word in the space-joined text, so for each word
+      // we need to compute its offset from the beginning of the paragraph and its length
+      entityRanges: generateEntitiesRanges(paragraph.words, 'text'), // second argument: wordAttributeName
+    };
+    // console.log(JSON.stringify(draftJsContentBlockParagraph,null,2))
+    results.push(draftJsContentBlockParagraph);
+  });
+
+```
+
+4. Use the helper function `generateEntitiesRanges` to add the `entityRanges` to each block, as in the example above.
+
+5. If you have speaker diarization info you can also add this to the block info - _optional_
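+
+The steps above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code of an existing adapter: the word shape (`text`, `start`, `end`) and the name `exampleSttToDraft` are assumptions made for this example, and `entityRanges` is left empty where a real adapter would call `generateEntitiesRanges(paragraph.words, 'text')`.
+
+```js
+// Hypothetical input: a flat list of words, each with `text`, `start` and `end`.
+// Groups words into paragraphs, starting a new one after sentence-ending punctuation.
+const groupWordsInParagraphs = (words) => {
+  const paragraphs = [];
+  let current = { words: [], text: [] };
+  words.forEach((word) => {
+    current.words.push(word);
+    current.text.push(word.text);
+    if (/[.?!]$/.test(word.text)) {
+      paragraphs.push(current);
+      current = { words: [], text: [] };
+    }
+  });
+  // keep trailing words that did not end with punctuation
+  if (current.words.length > 0) {
+    paragraphs.push(current);
+  }
+
+  return paragraphs;
+};
+
+// Hypothetical adapter entry point, named after the provider + `ToDraft` convention.
+const exampleSttToDraft = (words) =>
+  groupWordsInParagraphs(words).map((paragraph, i) => ({
+    text: paragraph.text.join(' '),
+    type: 'paragraph',
+    data: {
+      speaker: `TBC ${ i }`,
+      words: paragraph.words,
+      start: paragraph.words[0].start
+    },
+    // placeholder: a real adapter would use generateEntitiesRanges here
+    entityRanges: []
+  }));
+```
+
+How the words are split will depend on the provider: some services return punctuation as separate tokens, in which case the regex test above would need adjusting.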
+
+
+## Tests
+
+This project uses jest, and once you submit the PR the tests are run by TravisCI. It is recommended to write at least some basic tests so that you can see at a glance whether the adapter is working as expected.
+
+To write your tests, you'll want a `sample` folder with transcript data from the STT service and the expected draftJs output, using the file extensions `.sample.json` and `.sample.js` (see the `bbc-kaldi` and `autoEdit2` examples). These extensions ensure that the stub/example files are not bundled with the component when packaging for npm.
+
+_If you don't have much experience with automated testing, don't let this put you off. Feel free to raise it as an issue and we can help out._
+
+**Top tip**: the draftJs block key attributes are randomly generated, and therefore cannot be tested in a deterministic way. However, there is a well established workaround: you can replace them in the json with a type check, eg instead of expecting the key to be a specific value, you just expect it to be a string.
+
+In practice, in Visual Studio Code for instance, you can search using a regex (option `.*`). So you could search for
+
+```js
+"key": "([a-zA-Z0-9]*)"
+```
+And replace all with 
+```js
+"key": expect.any(String) // was eg "ss8pm4p"
+```
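+
+The same replacement could also be scripted instead of done in the editor. Here is a minimal sketch in plain JavaScript, assuming the expected output is available as a string; the function name `replaceKeysWithMatchers` is made up for this example. Note that the result is no longer valid JSON, so it would need to live in a `.sample.js` file rather than a `.json` one.
+
+```js
+// Hypothetical helper: swaps every randomly generated string key for a jest matcher.
+// Numeric keys (eg "key": 0) are left untouched because they are not quoted.
+const replaceKeysWithMatchers = (jsonString) =>
+  jsonString.replace(/"key": "([a-zA-Z0-9]*)"/g, '"key": expect.any(String)');
+
+const expected = JSON.stringify({ key: 'ss8pm4p', text: 'There is a day.' }, null, 2);
+console.log(replaceKeysWithMatchers(expected));
+```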
diff --git a/docs/guides/draftjs-blocks-entityrange-entitmap.md b/docs/guides/draftjs-blocks-entityrange-entitmap.md
@@ -0,0 +1,196 @@
+
+### DraftJS block, entityRanges and entityMap
+
+A quick side note on how the DraftJS block, entityRanges and entityMap work in the context of the TranscriptEditor component. This accompanies the [adapters](./adapters.md) guide.
+
+
+### Data Block
+
+TL;DR: a block is a representation of a paragraph (as an Immutable Record) in draftJs, and you can have some custom data associated with it.
+
+See the docs notes on [draftjs basics](https://github.com/bbc/react-transcript-editor/blob/master/docs/notes/draftjs/2018-10-01-draftjs-1-basics.md), as well as the official draftJs docs, to better understand the role of content blocks within the editor.
+
+Here's an example of a block. It can contain some custom data, eg speaker name, list of words, and start time (which would be the start time of the first word).
+
+```js
+[
+  {
+    "text": "There is a day.", // text 
+    "type": "paragraph", // type of block 
+    "data": { //optional custom data
+      "speaker": "TBC 0",
+      "words": [
+       ...
+      ],
+      "start": 13.02
+    },
+    "entityRanges": [ // <-- entity ranges
+   ...
+    ]
+  },
+  ...
+```
+
+It also contains a list of `entityRanges`.
+
+### Entity Ranges
+
+`entityRanges` are part of individual blocks. 
+
+<!-- See the docs notes on [draftjs entity ranges](https://github.com/bbc/react-transcript-editor/blob/master/docs/notes/draftjs/2018-10-02-drafjs-2-entity-range.md) -->
+
+From the draftJs docs on [entities](https://draftjs.org/docs/advanced-topics-entities):
+
+> the Entity system, which Draft uses for annotating ranges of text with metadata. Entities introduce levels of richness beyond styled text. Links, mentions, and embedded content can all be implemented using entities.
+
+This is what we use to identify words within a list of characters and to associate data with them, such as start and end time information.
+
+It lays the foundation for features such as clicking on a word to jump the player's play-head to the corresponding time for that word.
+
+Here's an example of `entityRanges` in the context of a data block.
+
+The required fields are `offset` and `length`, which are used to identify the entity within the characters of the `text` attribute of the block.
+
+This, combined with the `entityMap`, has the advantage that if you type or delete some text before a certain entity, draftJs will do the groundwork of adjusting the offsets and keeping this information in sync.
+
+```js
+[
+  {
+    "text": "There is a day.",
+    "type": "paragraph",
+    "data": {
+      ...
+    },
+    "entityRanges": [
+      {
+        "start": 13.02, // Custom fields
+        "end": 13.17, // Custom fields
+        "confidence": 0.68, // Custom fields
+        "text": "There", // Custom fields - to detect what has changed
+        "offset": 0, // Required by Draft.js to know the start of the "selection"
+        "length": 5, // Required by Draft.js to know the end of the "selection", in our case a word
+        "key": "ctavu0r" // generated by draftjs if not provided, but providing your own gives more flexibility
+      },
+      ...
+```
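+
+To make the `offset` and `length` arithmetic concrete, here is a small sketch of how the ranges for "There is a day." could be computed. It only mirrors the idea behind the project's `generateEntitiesRanges` helper and is not its actual implementation; the name `sketchEntityRanges` and the random key generation are assumptions made for this example.
+
+```js
+// Sketch only: walks the words and tracks the character position in the space-joined text.
+const sketchEntityRanges = (words, wordAttributeName) => {
+  let position = 0;
+
+  return words.map((word) => {
+    const text = word[wordAttributeName];
+    const range = {
+      start: word.start,
+      end: word.end,
+      confidence: word.confidence,
+      text,
+      offset: position,
+      length: text.length,
+      // random key for illustration, any unique string would do
+      key: Math.random().toString(36).substring(2, 9)
+    };
+    // move past this word and the single space that joins it to the next one
+    position += text.length + 1;
+
+    return range;
+  });
+};
+
+const ranges = sketchEntityRanges(
+  [
+    { start: 13.02, end: 13.17, confidence: 0.68, punct: 'There' },
+    { start: 13.17, end: 13.38, confidence: 0.61, punct: 'is' },
+    { start: 13.38, end: 13.44, confidence: 0.99, punct: 'a' },
+    { start: 13.44, end: 13.86, confidence: 1, punct: 'day.' }
+  ],
+  'punct'
+);
+console.log(ranges.map((r) => [ r.offset, r.length ]));
+// offsets and lengths: [0, 5], [6, 2], [9, 1], [11, 4]
+```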
+
+### Entity Map 
+
+`entityMap` defines how to render the entities for the draftJs content state.
+
+See draftJs docs for more on [entities](https://draftjs.org/docs/advanced-topics-entities#introduction)
+
+The `entityMap` is kept in sync with the `entityRanges` through the `offset` and `length` attributes.
+
+Here's an example:
+```js
+{
+  "ayx62lj": {
+    "type": "WORD",
+    "mutability": "MUTABLE",
+    "data": {
+      "start": 13.02,
+      "end": 13.17,
+      "confidence": 0.68,
+      "text": "There",
+      "offset": 0,
+      "length": 5,
+      "key": "ayx62lj"
+    }
+  },
+```
+
+To see this in the larger context: when we call `sttJsonAdapter` with `transcriptData` and a `sttJsonType`, we expect it to return an object with two attributes, `blocks` and `entityMap`.
+
+```js
+{
+  "blocks": [
+    {
+      "key": "500r2",
+      "text": "There is a day.",
+      "type": "paragraph",
+      "depth": 0,
+      "inlineStyleRanges": [],
+      "entityRanges": [
+        {
+          "offset": 0,
+          "length": 5,
+          "key": 0
+        },
+        {
+          "offset": 6,
+          "length": 2,
+          "key": 1
+        },
+        {
+          "offset": 9,
+          "length": 1,
+          "key": 2
+        },
+        {
+          "offset": 11,
+          "length": 4,
+          "key": 3
+        }
+      ],
+      "data": {
+        "speaker": "test4",
+        "words": [
+          {
+            "start": 13.02,
+            "confidence": 0.68,
+            "end": 13.17,
+            "word": "there",
+            "punct": "There",
+            "index": 0
+          },
+          {
+            "start": 13.17,
+            "confidence": 0.61,
+            "end": 13.38,
+            "word": "is",
+            "punct": "is",
+            "index": 1
+          },
+          {
+            "start": 13.38,
+            "confidence": 0.99,
+            "end": 13.44,
+            "word": "a",
+            "punct": "a",
+            "index": 2
+          },
+          {
+            "start": 13.44,
+            "confidence": 1,
+            "end": 13.86,
+            "word": "day",
+            "punct": "day.",
+            "index": 3
+          }
+        ],
+        "start": 13.02
+      }
+    },
+...
+  ],
+  "entityMap": {
+    "0": {
+      "type": "WORD",
+      "mutability": "MUTABLE",
+      "data": {
+        "start": 13.02,
+        "end": 13.17,
+        "confidence": 0.68,
+        "text": "There",
+        "offset": 0,
+        "length": 5,
+        "key": "1mgy3gm"
+      }
+    },
+....
+}
+```
+
+
+The good news is that, given the blocks and the entityRanges, we can programmatically generate the entityMap. This means you don't have to worry about creating the entityMap when making an adapter.
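+
+As an illustration of that last point, here is one way an `entityMap` could be derived from the blocks. This is a sketch under assumptions, not the project's own `createEntityMap`: the name `sketchEntityMap` is made up, and the `WORD` / `MUTABLE` values are taken from the example above.
+
+```js
+// Sketch only: flattens the entityRanges of every block into a single map,
+// keyed by each range's key.
+const sketchEntityMap = (blocks) => {
+  const entityMap = {};
+  blocks.forEach((block) => {
+    block.entityRanges.forEach((range) => {
+      entityMap[range.key] = {
+        type: 'WORD',
+        mutability: 'MUTABLE',
+        data: range
+      };
+    });
+  });
+
+  return entityMap;
+};
+
+const blocks = [
+  {
+    text: 'There is',
+    type: 'paragraph',
+    entityRanges: [
+      { start: 13.02, end: 13.17, text: 'There', offset: 0, length: 5, key: 'ayx62lj' },
+      { start: 13.17, end: 13.38, text: 'is', offset: 6, length: 2, key: 'b1k9d2e' }
+    ]
+  }
+];
+console.log(Object.keys(sketchEntityMap(blocks)));
+```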
diff --git a/package.json b/package.json
@@ -20,7 +20,7 @@
     "test": "react-scripts test --env=jsdom",
     "eject": "react-scripts eject",
     "build:example": "react-scripts build",
-    "build:component": "rimraf dist && NODE_ENV=production babel src/lib --out-dir dist --copy-files --ignore __tests__,spec.js,test.js,__snapshots__",
+    "build:component": "rimraf dist && NODE_ENV=production babel src/lib --out-dir dist --copy-files --ignore __tests__,spec.js,test.js,__snapshots__,sample.json,sample.js ",
     "deploy:ghpages": "npm run build:example && gh-pages -d build",
     "test-ci": "CI=true react-scripts test --env=jsdom --verbose",
     "lint": "eslint --ignore-path .eslintignore .",

diff --git a/src/lib/Util/adapters/autoEdit2/example-usage.js b/src/lib/Util/adapters/autoEdit2/example-usage.js
@@ -1,4 +1,4 @@
 const autoEdit2ToDraft = require('./index');
-const autoEdit2TedTalkTranscript = require('./sample/autoEdit2TedTalkTranscript-sample.json');
+const autoEdit2TedTalkTranscript = require('./sample/autoEdit2TedTalkTranscript.sample.json');
 
 console.log(autoEdit2ToDraft(autoEdit2TedTalkTranscript));
diff --git a/src/lib/Util/adapters/autoEdit2/index.js b/src/lib/Util/adapters/autoEdit2/index.js
@@ -31,9 +31,9 @@
 import generateEntitiesRanges from '../generate-entities-ranges/index';
 
 /**
- * groups words list from kaldi transcript based on punctuation.
+ * groups words list from autoEdit transcript based on punctuation.
  * @todo To be more accurate, should introduce an honorifics library to do the splitting of the words.
- * @param {array} words - array of words opbjects from kaldi transcript
+ * @param {array} words - array of words objects from autoEdit transcript
  */
 
 const groupWordsInParagraphs = (autoEditText) => {
@@ -74,12 +74,14 @@ const autoEdit2ToDraft = (autoEdit2Json) => {
   const tmpWords = autoEdit2Json.text;
   const wordsByParagraphs = groupWordsInParagraphs(tmpWords);
 
-  wordsByParagraphs.forEach((paragraph) => {
+  wordsByParagraphs.forEach((paragraph, i) => {
     const draftJsContentBlockParagraph = {
       text: paragraph.text.join(' '),
       type: 'paragraph',
       data: {
-        speaker: 'TBC',
+        speaker: `TBC ${ i }`,
+        words: paragraph.words, 
+        start: paragraph.words[0].start
       },
       // the entities as ranges are each word in the space-joined text,
       // so it needs to be compute for each the offset from the beginning of the paragraph and the length

diff --git a/src/lib/Util/adapters/autoEdit2/index.test.js b/src/lib/Util/adapters/autoEdit2/index.test.js
@@ -1,7 +1,7 @@
 import autoEdit2ToDraft from './index';
 // TODO: could make this test run faster by shortning the two sample to one or two paragraphs?
-import draftTranscriptExample from './sample/autoEdit2ToDraft-sample';
-import autoEdit2TedTalkTranscript from './sample/autoEdit2TedTalkTranscript-sample.json';
+import draftTranscriptExample from './sample/autoEdit2ToDraft.sample.js';
+import autoEdit2TedTalkTranscript from './sample/autoEdit2TedTalkTranscript.sample.json';
 
 describe('bbcKaldiToDraft', () => {
   const result = autoEdit2ToDraft(autoEdit2TedTalkTranscript, 'text');

diff --git a/...le/autoEdit2TedTalkTranscript-sample.json → ...le/autoEdit2TedTalkTranscript.sample.json b/...le/autoEdit2TedTalkTranscript-sample.json → ...le/autoEdit2TedTalkTranscript.sample.json