[CBR-471] Fix Restoration Performance Regression by KtorZ · Pull Request #3786 · input-output-hk/cardano-sl

KtorZ · 2018-10-24T18:11:58Z

Description

~~I am still testing this (I had an issue with the very last block application causing the checkpoints to diverge in a non expected way). I believe it's now fixed but I have to wait full restoration.~~
edit: Now tested, everything seems fine (see comment below)

We use to perform a db transaction per block which turns out to slow down the process quite a lot. Instead, we can apply k (.e.g 1000) updates inside one transaction and move on to the next batch. There's a bit of fiddling to do to get things working, but it turns out to be pretty efficient (~600 block/s on my setup, vs 80 block/s without the patch).

The main trick is located in the new applyHistoricalBlocks which applies update in sequence, stopping at the first error encountered.

Besides, we should have cought the performance regression earlier. We can't realistically perform restoration as part of our unit or main integration tests because of the time it takes. However, it is perfectly realistic to have it ran nightly, and monitor its output to be inside accepted boundaries.

Linked issue

[CBR-471]

Type of change

🐞 Bug fix (non-breaking change which fixes an issue)
🛠 New feature (non-breaking change which adds functionality)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🏭 Refactoring that does not change existing functionality but does improve things like code readability, structure etc
🔨 New or improved tests for existing code
⛑ git-flow chore (backport, hotfix, etc)

Developer checklist

I have read the style guide document, and my code follows the code style of this project.
If my code deals with exceptions, it follows the guidelines.
I have updated any documentation accordingly, if needed. Documentation changes can be reflected in opening a PR on cardanodocs.com, amending the inline Haddock comments, any relevant README file or one of the document listed in the docs directory.
CHANGELOG entry has been added and is linked to the correct PR on GitHub.

Testing checklist

I have added tests to cover my changes.
All new and existing tests passed.

QA Steps

nix-build -A acceptanceTests.mainnet.full -o acceptance-tests && ./acceptance-tests

... then go for a long coffee ☕

Screenshots (if available)

Launching wallet node: /nix/store/19ls1ww691cq0z32196alq82js761pj4-wallet-connect-to-mainnet
22:05:09 Blockchain syncing...
22:05:19 Blockchain syncing: 81%  1399744/1709971 blocks
22:05:32 Blockchain syncing: 82%  1403775/1709971 blocks
22:05:43 Blockchain syncing: 82%  1409791/1709972 blocks
...
22:15:55 Blockchain syncing: 98%  1691570/1710002 blocks
22:16:05 Blockchain syncing: 99%  1697522/1710003 blocks
22:16:15 Blockchain syncing: 99%  1703218/1710003 blocks
22:16:25 Blockchain syncing: 99%  1705969/1710004 blocks
Blockchain synced: 100%  1710004 blocks

Blockchain sync complete!
Blockchain height: 1710004 blocks
Elapsed time: 686 seconds

Going to restore a wallet
[[{"data":{"id":"Ae2tdPwUPEZ1UMga1bnFFD7FoFN9ZFoPLCWLaGcqNnfovcrcT3oMSuAz8DD","name":"Acceptance Wallet","balance":0,"hasSpendingPassword":false,"spendingPasswordLastUpdate":"2018-10-24T20:16:35.570007","createdAt":"2018-10-24T20:16:50.106258","assuranceLevel":"normal","syncState":{"tag":"restoring","data":{"throughput":{"quantity":0,"unit":"blocksPerSecond"},"percentage":{"quantity":0,"unit":"percent"},"estimatedCompletionTime":{"quantity":1710454000,"unit":"milliseconds"}}},"type":"regular"},"status":"success","meta":{"pagination":{"totalPages":1,"page":1,"perPage":1,"totalEntries":1}}}

22:16:50 Wallet restoring: 0% (0 blocks/s)
22:17:00 Wallet restoring: 0% (415 blocks/s)
22:17:10 Wallet restoring: 1% (622 blocks/s)
22:17:20 Wallet restoring: 1% (627.4 blocks/s)
...
22:56:46 Wallet restoring: 97% (715.975 blocks/s)
22:56:56 Wallet restoring: 98% (716.302 blocks/s)
22:57:06 Wallet restoring: 98% (716.7 blocks/s)
22:57:16 Wallet restoring: 99% (717.754 blocks/s)
22:57:26 Wallet restoring: 99% (718.192 blocks/s)
22:57:36 Wallet restoring: null% (715.272 blocks/s)
Restoration complete!
Elapsed time: 2484 seconds

Note: There was a typo in the script when I ran it causing it not to terminate properly once synced. I've verified the log file and no errors related to sync (some lookup network failures but nothing unusual).
Plus, checking the wallet manually at the end:

{
   "data" : [
      {
         "id" : "Ae2tdPwUPEZ1UMga1bnFFD7FoFN9ZFoPLCWLaGcqNnfovcrcT3oMSuAz8DD",
         "spendingPasswordLastUpdate" : "2018-10-24T20:16:50.106258",
         "hasSpendingPassword" : false,
         "type" : "regular",
         "syncState" : {
            "data" : null,
            "tag" : "synced"
         },
         "createdAt" : "2018-10-24T20:16:50.106258",
         "balance" : 0,
         "name" : "Acceptance Wallet",
         "assuranceLevel" : "normal"
      }
   ],
   "meta" : {
      "pagination" : {
         "totalEntries" : 1,
         "totalPages" : 1,
         "page" : 1,
         "perPage" : 10
      }
   },
   "status" : "success"
}

dcoutts

Looks good to me, though I've not fully checked everything. Some suggestions below.

dcoutts · 2018-10-24T22:17:49Z

    wId :: WalletId
    wId = WalletIdHdRnd rootId

    -- Process the restoration of the block with the given 'HeaderHash'.


This comment would now be out of date right? If I followed correctly, it's restore starting from the block with the given hash.

100% right.

dcoutts · 2018-10-24T22:22:31Z

+                    lift (nextHistoricalHash currentHash) >>= \case
+                        Nothing  -> throwM (RestorationFinishUnreachable tgtHash currentHash)
+                        Just hh' -> put (hh', False)
+                lift $ getPrefilteredBlockOrThrow genesisHash currentHash


The use of a state monad here seems a bit odd to me. This is just keeping the currentHash as the running value right? I think I'd just write this as a local recursive action with the currentHash as a local arg.

I've tried both approach actually and found the StateMonad the most readable of the two. The "edge" case makes the recursive-function approach a bit ugly :/

erikd · 2018-10-25T01:50:53Z

tried running the Nix scipt myself and first off, got a few locale related errors:

$ nix-build -A acceptanceTests.mainnet.full -o acceptance-tests && ./acceptance-tests
warning: dumping very large path (> 256 MiB); this may run out of memory
/nix/store/73pfw2r6c9g9v5rnyzwg38fm2ap3j8fd-acceptance-tests-mainnet
/nix/store/0pc9pzaqdbygxr2nyng5h71mkwbwv2fc-bash-4.4-p12/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_AU.UTF-8)
./acceptance-tests: line 23: warning: setlocale: LC_ALL: cannot change locale (en_GB.UTF-8): No such file or directory
Launching wallet node: /nix/store/g5m0r33dxw9zjv88zypbbgb437jfixam-wallet-connect-to-mainnet

but it did seem to sync:

....
12:37:16 Blockchain syncing: 97%  1673651/1710959 blocks
12:37:29 Blockchain syncing: 98%  1684402/1710959 blocks
12:37:40 Blockchain syncing: 98%  1692786/1710959 blocks
12:37:50 Blockchain syncing: 99%  1704754/1710959 blocks
Blockchain synced: 100%  1710959 blocks

Blockchain sync complete!
Blockchain height: 1710959 blocks
Elapsed time: 1846 seconds

but then during wallet restore I got:

Going to restore a wallet
{"data":{"id":"Ae2tdPwUPEZ1UMga1bnFFD7FoFN9ZFoPLCWLaGcqNnfovcrcT3oMSuAz8DD","name":"Acceptance Wallet","balance":0,"hasSpendingPassword":false,"spendingPasswordLastUpdate":"2018-10-25T01:38:00.112482","createdAt":"2018-10-25T01:38:07.108056","assuranceLevel":"normal","syncState":{"tag":"restoring","data":{"throughput":{"quantity":0,"unit":"blocksPerSecond"},"percentage":{"quantity":0,"unit":"percent"},"estimatedCompletionTime":{"quantity":1711414000,"unit":"milliseconds"}}},"type":"regular"},"status":"success","meta":{"pagination":{"totalPages":1,"page":1,"perPage":1,"totalEntries":1}}}jq: error (at <stdin>:1): Cannot index string with string "data"
jq: error (at <stdin>:1): Cannot index string with string "data"
12:38:07 Wallet restoring...
jq: error (at <stdin>:1): Cannot index string with string "data"
jq: error (at <stdin>:1): Cannot index string with string "data"

KtorZ · 2018-10-25T06:27:09Z

@erikd Yes indeed, my little patch yesterday just made it worse 😅 ... Now fixed properly.

edsko · 2018-10-25T06:42:21Z

+
+        -- Flush to Acid-State & SQLite
+        k <- getSecurityParameter (wallet ^. walletNode)
+        let (blocks, txMetas, slotIds) = unzip3 (catMaybes updates)


I think I'd rewrite this code slightly; after discussing this code with @KtorZ , the catMaybes here has a dual purpose: any EBBs will be Nothing, and once we reach the final block halfway the batch, all the elements after that final block will also be Nothing. I find the style a bit confusing, and would probably prefer an explicit recursion style with an accumulator. That said, I don't think it's wrong what's here; but I think it does at least need a comment.

adinapoli-iohk

LGTM -- I don't have anything major to add on top of what Edsko & Duncan already said. Nice work!

adinapoli-iohk · 2018-10-25T06:54:27Z

+    curl --silent --cacert ${stateDir}/tls/client/ca.crt --cert ${stateDir}/tls/client/client.pem -H "Content-Type: application/json" https://${wallet.walletListen}/api/v1/wallets \
+      -d '{
+      "operation": "restore",
+      "backupPhrase": ["session", "ring", "phone", "arrange", "notice", "gap", "media", "olympic", "water", "road", "spider", "rate"],


Hopefully people won't try to use this one for production wallets 😉

(I always feel like we should have some mechanism for preventing mainnet to accept any of the mnemonics we use in tests etc).

Haha, I thought the same things 😶 ... I hope it's hidden enough in the nix entrails that no one will dig this out... right ? 😨

adinapoli-iohk · 2018-10-25T06:56:21Z

        Nothing -> withNode $ getFirstGenesisBlockHash genesisHash
        Just sh -> nextHistoricalHash sh >>= maybe (throwM $ RestorationSuccessorNotFound sh) pure
-    restore genesisHash startingPoint NoTimingData
+    restore genesisHash startingPoint 1000


Minor one -- might be nice to explicitly have a where clause for this "magic number", to clarify its purpose.

KtorZ · 2018-10-25T07:07:22Z

    getBatch genesisHash hh batchSize = go [] batchSize hh
        where
            go !updates !n !currentHash | n <= 0 =
                return (updates, currentHash)

            go !updates _ !currentHash | currentHash == tgtHash =
                getPrefilteredBlockOrThrow genesisHash currentHash >>= \case
                    Nothing -> go updates 0 currentHash
                    Just u  -> go (updates ++ [u]) 0 currentHash

            go !updates !n !currentHash = do
                nextHash <- nextHistoricalHash currentHash >>= \case
                    Nothing  -> throwM (RestorationFinishUnreachable tgtHash currentHash)
                    Just hh' -> return hh'

                getPrefilteredBlockOrThrow genesisHash currentHash >>= \case
                    Nothing -> go updates (n - 1) nextHash
                    Just u  -> go (updates ++ [u]) (n - 1) nextHash

Currently trying out this. I found a nice way to put this. A bit afraid about the boundary case so I'll go through a full restoration once again and see :)

KtorZ · 2018-10-25T08:11:28Z

Addressed comments above; went for the algorithm described above with accumulators and recursion instead of the StateMonad. I've re-ran the entire script, caught another error in the acceptance bash script (god I hate bash) and fixed it as well.

We use to perform a db transaction per block which turns out to slow down the process quite a lot. Instead, we can apply k (.e.g 1000) updates inside one transaction and move on to the next batch. There's a bit of fiddling to do to get things working, but it turns out to be pretty efficient. The main trick is located in the new `applyHistoricalBlocks` which applies update in sequence, stopping at the first error encountered.

We should have cought the performance regression earlier. We can't realistically perform restoration as part of our unit or main integration tests because of the time it takes. However, it is perfectly realistic to have it ran nightly, and monitor its output to be inside accepted boundaries.

dcoutts

LGTM now.

rvl

Looks good. Bash is bash.
The build timeouts are set in .buildkite/nightly.yml.

KtorZ · 2018-10-25T13:27:54Z

Integration tests still failing for an unrelated reason:

  integration/TransactionSpecs.hs:102:22: 
  1) Transactions asset-locked wallets can receive funds and transactions are confirmed in index
       Falsifiable (after 1 test):
       
       not expected: 0

Merging anyway 👍

…ration-perf-regression [CBR-471] Fix Restoration Performance Regression

…hk/KtorZ/CBR-471/fix-restoration-perf-regression [CBR-471] Fix Restoration Performance Regression

KtorZ requested review from dcoutts, edsko, erikd and karknu October 24, 2018 18:11

KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch 2 times, most recently from 1a89260 to 96965af Compare October 24, 2018 21:57

dcoutts reviewed Oct 24, 2018

View reviewed changes

KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 96965af to 257f610 Compare October 25, 2018 06:13

KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 257f610 to 6fdffdd Compare October 25, 2018 06:29

edsko approved these changes Oct 25, 2018

View reviewed changes

edsko requested a review from adinapoli-iohk October 25, 2018 06:43

adinapoli-iohk approved these changes Oct 25, 2018

View reviewed changes

KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 6fdffdd to 9cb1a58 Compare October 25, 2018 08:09

KtorZ added 2 commits October 25, 2018 10:49

KtorZ force-pushed the KtorZ/CBR-471/fix-restoration-perf-regression branch from 9cb1a58 to 80c4b5b Compare October 25, 2018 08:49

KtorZ removed the request for review from karknu October 25, 2018 09:58

dcoutts approved these changes Oct 25, 2018

View reviewed changes

rvl approved these changes Oct 25, 2018

View reviewed changes

KtorZ merged commit 150fb11 into develop Oct 25, 2018

KtorZ deleted the KtorZ/CBR-471/fix-restoration-perf-regression branch October 25, 2018 13:28

KtorZ added a commit that referenced this pull request Nov 9, 2018

Merge pull request #3786 from input-output-hk/KtorZ/CBR-471/fix-resto…

13545a7

…ration-perf-regression [CBR-471] Fix Restoration Performance Regression

KtorZ added a commit to input-output-hk/cardano-wallet-legacy that referenced this pull request Nov 9, 2018

Merge pull request input-output-hk/cardano-sl#3786 from input-output-…

3642171

…hk/KtorZ/CBR-471/fix-restoration-perf-regression [CBR-471] Fix Restoration Performance Regression

KtorZ mentioned this pull request Nov 23, 2018

Restoration Performance Regression: 20x Longer Than Byron Data-Layer input-output-hk/cardano-wallet-legacy#92

Closed

Conversation

KtorZ commented Oct 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Linked issue

Type of change

Developer checklist

Testing checklist

QA Steps

Screenshots (if available)

Uh oh!

dcoutts left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

erikd commented Oct 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KtorZ commented Oct 25, 2018

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adinapoli-iohk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KtorZ commented Oct 25, 2018

Uh oh!

KtorZ commented Oct 25, 2018

Uh oh!

dcoutts left a comment

Choose a reason for hiding this comment

Uh oh!

rvl left a comment

Choose a reason for hiding this comment

Uh oh!

KtorZ commented Oct 25, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

KtorZ commented Oct 24, 2018 •

edited

Loading

erikd commented Oct 25, 2018 •

edited

Loading