Skip to content

Why does dgraph not respond to any requests with "Roll up " or " LOG Compact" logs every so often when I perform an insert test #4552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wangweidonghb opened this issue Jan 13, 2020 · 3 comments
Labels
kind/question Something requiring a response.

Comments

@wangweidonghb
Copy link

wangweidonghb commented Jan 13, 2020

Dgraph version : v1.1.1
Dgraph SHA-256 : 20a10e9db316b565464e0a04cea09657eebbaea3b91460ab651997ada63243cc
Commit SHA-1 : 8994a57
Commit timestamp : 2019-12-16 18:24:50 -0800
Branch : HEAD
Go version : go1.13.5

OS INFO:
8Cores, 15GB Memory, Linux ss1.bessystem.com 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

mutation client:
Upsert data: 1 billion. Thread number: 1. One batch for upsert: 100

When I perform a performance test on Dgraph, I start a client to continuously write data to dgraph。 After a period of time(10~20minutes)。The client is blocked by dgraph server with a long time and the server of "Alpha" output the following logs


I0111 15:41:20.345895   18569 log.go:34] Rolling up Created batch of size: 11 MB in 2.088280354s.
I0111 15:41:22.001260   18569 log.go:34] Rolling up Created batch of size: 6.6 MB in 1.641390541s.
I0111 15:41:22.001338   18569 log.go:34] Rolling up Sent 1796032 keys
I0111 15:41:22.084420   18569 draft.go:1076] Rolled up 1796003 keys. Done
I0111 15:41:22.084526   18569 draft.go:425] List rollup at Ts 163069: OK.
I0111 15:41:22.087491   18569 predicate_move.go:191] Was instructed to delete tablet: FriendTest
I0111 15:41:22.089882   18569 index.go:1000] Dropping predicate: [FriendTest]
I0111 15:41:22.089948   18569 log.go:34] Writes flushed. Stopping compactions now...
W0111 15:41:32.085107   18569 draft.go:1107] While sending membership to Zero. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0111 15:41:43.357438   18569 log.go:34] Got compaction priority: {level:0 score:1.74 dropPrefix:[0 0 10 70 114 105 101 110 100 84 101 115 116]}
I0111 15:41:43.357609   18569 log.go:34] Running for level: 0
I0111 15:45:50.311154   18569 log.go:34] LOG Compact 0->1, del 7 tables, add 3 tables, took 4m6.95344879s
I0111 15:45:50.311272   18569 log.go:34] Compaction for level: 0 DONE
I0111 15:46:14.950379   18569 draft.go:1168] Found 1 old transactions. Acting to abort them.
I0111 15:46:14.955593   18569 draft.go:1129] TryAbort 1 txns with start ts. Error: <nil>
I0111 15:46:14.955694   18569 draft.go:1152] TryAbort selectively proposing only aborted txns: txns:<start_ts:163406 >
I0111 15:47:14.950493   18569 draft.go:1168] Found 1 old transactions. Acting to abort them.
I0111 15:47:14.955148   18569 draft.go:1129] TryAbort 1 txns with start ts. Error: <nil>
I0111 15:47:14.955191   18569 draft.go:1152] TryAbort selectively proposing only aborted txns: txns:<start_ts:163406 >
I0111 15:47:17.256523   18569 log.go:34] LOG Compact 1->1, del 1 tables, add 1 tables, took 1m26.945152819s

and the server of "zero" output the following logs

I0111 15:18:25.568738   17864 oracle.go:107] Purged below ts:155885, len(o.commits):211, len(o.rowCommit):148400
W0111 15:18:25.568943   17864 raft.go:729] Raft.Ready took too long to process: Timer Total: 618ms. Breakdown: [{proposals 616ms} {sync 2ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
I0111 15:18:26.320982   17864 raft.go:616] Writing snapshot at index: 76710, applied mark: 76939
I0111 15:21:22.039188   17864 oracle.go:107] Purged below ts:163069, len(o.commits):34, len(o.rowCommit):24500
W0111 15:21:22.039719   17864 raft.go:729] Raft.Ready took too long to process: Timer Total: 1.041s. Breakdown: [{proposals 1.035s} {sync 6ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
I0111 15:21:26.328691   17864 raft.go:616] Writing snapshot at index: 80340, applied mark: 80399
I0111 15:21:45.202544   17864 zero.go:696] Tablet: FriendTest does not belong to group: 1. Sending delete instruction.
W0111 15:21:55.196113   17864 zero.go:660] While deleting predicates: rpc error: code = Canceled desc = context canceled
I0111 15:41:22.085781   17864 zero.go:696] Tablet: FriendTest does not belong to group: 1. Sending delete instruction.
W0111 15:41:32.085464   17864 zero.go:660] While deleting predicates: rpc error: code = Canceled desc = context canceled

During this time, Clients for performance testing cannot continue to insert data and are in a blocked state. Server is not processing any requests as if it is hung.
After a few minutes, when the server of "Aplha" output the following log

alpha-stdout-stderr.log

I0111 15:47:17.256605   18569 log.go:34] DropPrefix done
I0111 15:47:17.256697   18569 log.go:34] Resuming writes
I0111 15:47:17.256764   18569 schema.go:79] Deleting schema for predicate: [FriendTest]

The server resumes from the suspended state to the normal state and starts processing insert data requests.
My question is why the server is in a suspended state for a long time? Is it a problem I use with dgraph? The following commands are my commands to start dgraph

nohup ./dgraph zero >> zero.log &
nohup ./dgraph alpha --lru_mb 4096 --zero localhost:5080 >> alpha.log &

When the above problem appears, I try to modify the commands like follows

nohup ./dgraph zero >> zero.log
nohup ./dgraph alpha --max_retries 10 --pending_proposals 20 --snapshot_after 5000 --lru_mb 4096 --zero localhost:5080 >> alpha.log &

The problem also exists, How should i solve this problem? It seems like a serious problem with server hung.

@wangweidonghb
Copy link
Author

It seems like that the server does not response any requests of mutations client is related on the parameter of "--snapshot_after ". when I start "alpha" server and set "--snapshot_after " parameter with a large value to 500000

nohup ./dgraph alpha --max_retries 3 --pending_proposals 20 --snapshot_after 5000000 --lru_mb 4096 --zero localhost:5080 >> alpha.log &

The problem that the server does not response any requests for a long time still exists but the probability of occurrence is reduced

I think that the server can reduce the speed of processing requests, but it cannot fail to respond to any requests for a long time and eventually cause the client's requests to fail.

@wangweidonghb
Copy link
Author

@danielmai

@MichelDiz MichelDiz added the kind/question Something requiring a response. label Jan 23, 2020
@MichelDiz
Copy link
Contributor

I'm closing this issue due the discussion is happening at https://discuss.dgraph.io/t/why-does-dgraph-not-respond-to-any-requests-with-roll-up-or-log-compact-logs-every-so-often-when-i-perform-an-insert-test/5832
And it seems that the problem is already being tracked and fixed.

If there are new facts, we can reopen the issue.

Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/question Something requiring a response.
Development

No branches or pull requests

2 participants