Description
First of all, thank you for making Litestream. It really helped us get our product off the ground with a very simple setup 🙂
Unfortunately, I'm starting to see some very strange behavior in Litestream. Disclaimer: our application connects to ~18 DBs, all of them replicated.
It started when the DB client reported "DB Timeout" errors whenever users tried to log in (or perform any write operation). In a panic, I restarted the app container. The new container tried to restore the databases but got stuck.
Checking S3 showed me that some WAL directories contained thousands of WAL files of only 91 B each, which Litestream was creating every second. I stopped the container, deleted all the "empty" WAL files, started the container again, and everything went back to normal.
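A minimal sketch of how such a check could be scripted, for anyone wanting to look for the same symptom in their own bucket. It assumes the listing is a sequence of entries with `Key` and `Size` fields (the shape of the `Contents` entries in an S3 ListObjectsV2 response) and that WAL segments live under a `/wal/` prefix. The function names and the sample keys are hypothetical and not part of Litestream.

```python
from collections import Counter

# Size in bytes of the "empty" WAL files observed in this report.
SUSPECT_SIZE = 91


def find_suspect_wal_objects(objects, size=SUSPECT_SIZE):
    """Return the keys of WAL objects whose size is exactly `size` bytes.

    `objects` is any iterable of dicts with "Key" and "Size" entries.
    Only keys containing "/wal/" are considered (assumed layout).
    """
    return [
        obj["Key"]
        for obj in objects
        if "/wal/" in obj["Key"] and obj["Size"] == size
    ]


def count_by_directory(keys):
    """Count keys per parent prefix (everything before the last slash)."""
    return Counter(key.rsplit("/", 1)[0] for key in keys)


# Hypothetical listing, made up for illustration only.
listing = [
    {"Key": "db1/generations/aaaa/wal/00000001_00000000.wal.lz4", "Size": 4152},
    {"Key": "db1/generations/aaaa/wal/00000002_00000000.wal.lz4", "Size": 91},
    {"Key": "db1/generations/aaaa/wal/00000003_00000000.wal.lz4", "Size": 91},
    {"Key": "db1/generations/aaaa/snapshots/00000000.snapshot.lz4", "Size": 91},
]

suspects = find_suspect_wal_objects(listing)
for directory, count in count_by_directory(suspects).items():
    print(directory, count)
```

The sketch only reports candidates and deletes nothing. Whether a 91 B segment is safe to remove is not established here, so any cleanup should be reviewed by hand.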
I also noticed that DBs that did not receive writes while in this condition did not have those empty WAL files. Only after I tried to write something to those DBs did Litestream start generating them. This also ruled out my speculation that some loop was constantly writing to the DB.
The container had been running for 28 days, so I thought that might be one reason for this behavior, but now it has happened again on a staging container that was created 2 days ago. Right now I have no idea how to reproduce the issue.
My current Litestream version is 0.3.8, and I will upgrade to 0.3.9 now.
Is this a known bug? Can I provide any more information that would help you investigate this?
Here is my Litestream config:
access-key-id: XXXXXXXXXXXXXXX
secret-access-key: XXXXXXXXXXXXXXXX
addr: :9090
dbs:
  - path: /XXXXXXX.db
    snapshot-interval: 2h
    rentention: 2d
    replicas:
      - url: s3://XXXXX
  # there are 18 more dbs with the same snapshot options in this config