Resurrecting old @ptabor idea that we had discussed some time ago but didn't have time to implement.
Problem
- Bbolt page management still works like Windows 95. To be efficient and it requires to be periodically defragmented by calling a bbolt compact function. This is problematic from both reliability and maintenance perspective. Defragmenting etcd blocks writes, which in large clusters can take in tens of seconds. In high churn clusters users need to manage a separate process that periodically defragments etcd
Goal
- Reduce the need for heavy defragment operation by doing small incremental work during transactions. This should should allow us to maintain good enough fragmentation level without a need for a periodic "stop the world" operation.
Proposal:
During each transaction we make a decision whether we want do additional work. If storage overhead (file size / active page size) is more then 20% we do additional operations during a transaction.
The naive concept is that we look into the free-pages map and find the last allocated block. I think we currently maintain the lists of free blocks sorted from the beginning to the end for different sizes of the 'continuous' space. We might need to maintain a link to the last used page... Maybe we maintain a few links for not only 'the' last... but also to the last not too big page.
On each transaction we rewrite the last page to the first 'suitable' position from the free list... and move the reference to the last page down… The challenge is when the last page is to huge... and we don't have continuous space in the 'lower' part of the file to accept it... in such case there is heuristic needed to ignore it... and keep going up to the beginning of the file (hoping that it will generate more space for the 'bigger' chunk eventually)... And after some time (number of transactions) start the process from the end again.
The assumption (but it requires studies), is that even the biggest page is still relatively small in relationship to the whole file for k8s use-cases -> so it will work reasonably well.
Additional notes:
This is a note dump of my discussion with @ptabor. High level idea should stand, however I expect that there might be already existing continuous defrag algorithms that are better, then naive one presented here.
Expect that this can be implemented as a configurable optional behavior
Resurrecting old @ptabor idea that we had discussed some time ago but didn't have time to implement.
Problem
Goal
Proposal:
During each transaction we make a decision whether we want do additional work. If storage overhead (file size / active page size) is more then 20% we do additional operations during a transaction.
The naive concept is that we look into the free-pages map and find the last allocated block. I think we currently maintain the lists of free blocks sorted from the beginning to the end for different sizes of the 'continuous' space. We might need to maintain a link to the last used page... Maybe we maintain a few links for not only 'the' last... but also to the last not too big page.
On each transaction we rewrite the last page to the first 'suitable' position from the free list... and move the reference to the last page down… The challenge is when the last page is to huge... and we don't have continuous space in the 'lower' part of the file to accept it... in such case there is heuristic needed to ignore it... and keep going up to the beginning of the file (hoping that it will generate more space for the 'bigger' chunk eventually)... And after some time (number of transactions) start the process from the end again.
The assumption (but it requires studies), is that even the biggest page is still relatively small in relationship to the whole file for k8s use-cases -> so it will work reasonably well.
Additional notes:
This is a note dump of my discussion with @ptabor. High level idea should stand, however I expect that there might be already existing continuous defrag algorithms that are better, then naive one presented here.
Expect that this can be implemented as a configurable optional behavior