This commit reorders the bucket struct fields and moves the mu acquisition earlier
in order to improve CPU cache locality. Previously, mu and the atomic
counters were on different CPU cache lines, which noticeably hurt
performance on older CPUs.
The following benchmark shows up to a 30% performance increase:
go test -run=^$ -bench=BenchmarkCache
goos: linux
goarch: amd64
pkg: github.com/VictoriaMetrics/fastcache
cpu: AMD EPYC 7B12
              │    master    │               opt_v3                │
              │    sec/op    │   sec/op     vs base                │
CacheSet-8       4.666m ±  2%   4.674m ± 2%        ~ (p=0.739 n=10)
CacheGet-8       2.364m ±  4%   1.732m ± 7%  -26.74% (p=0.000 n=10)
CacheHas-8       2.171m ±  6%   1.652m ± 2%  -23.89% (p=0.000 n=10)
CacheSetGet-8    11.65m ± 11%   12.03m ± 3%        ~ (p=0.481 n=10)
geomean          4.087m         3.562m       -12.84%

              │    master     │               opt_v3                │
              │     B/s       │     B/s       vs base               │
CacheSet-8      13.39Mi ±  2%   13.37Mi ± 2%        ~ (p=0.755 n=10)
CacheGet-8      26.44Mi ±  5%   36.09Mi ± 7%  +36.50% (p=0.000 n=10)
CacheHas-8      28.80Mi ±  6%   37.83Mi ± 2%  +31.38% (p=0.000 n=10)
CacheSetGet-8   10.73Mi ± 10%   10.39Mi ± 3%        ~ (p=0.469 n=10)
geomean         18.19Mi         20.87Mi       +14.73%

              │    master     │               opt_v3                │
              │     B/op      │     B/op      vs base               │
CacheSet-8      19.04Ki ± 10%   19.11Ki ±  7%        ~ (p=0.542 n=10)
CacheGet-8      9.598Ki ± 20%   7.006Ki ±  5%  -27.00% (p=0.000 n=10)
CacheHas-8      8.700Ki ± 23%   6.499Ki ±  3%  -25.31% (p=0.000 n=10)
CacheSetGet-8   41.59Ki ±  9%   40.87Ki ± 10%        ~ (p=0.362 n=10)
geomean         16.04Ki         13.73Ki        -14.36%

              │   master    │               opt_v3                │
              │  allocs/op  │  allocs/op   vs base                │
CacheSet-8      32.00 ±  9%   32.50 ±  8%        ~ (p=0.418 n=10)
CacheGet-8      16.00 ± 19%   12.00 ±  8%  -25.00% (p=0.000 n=10)
CacheHas-8      15.00 ± 20%   11.00 ±  9%  -26.67% (p=0.000 n=10)
CacheSetGet-8   71.00 ± 10%   70.00 ± 10%        ~ (p=0.396 n=10)
geomean         27.17         23.41        -13.85%
Signed-off-by: f41gh7 <nik@victoriametrics.com>
Pull Request Overview
This PR improves CPU cache locality by reordering fields in the bucket struct and changing the mutex acquisition order, which speeds up bucket.Get requests. According to the benchmark results, CacheGet and CacheHas run about 24 to 27% faster per operation (31 to 37% higher throughput), while CacheSet and CacheSetGet show no statistically significant change.
Key changes:
- Reordered atomic counter fields (getCalls, setCalls, misses) to be adjacent to the mutex
- Moved mutex acquisition before the access to the chunks field in the Set and Get methods
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##           master      #92   +/-   ##
=======================================
  Coverage   76.68%   76.68%
=======================================
  Files           4        4
  Lines         549      549
=======================================
  Hits          421      421
  Misses         73       73
  Partials       55       55
rtm0 approved these changes on Aug 5, 2025.
makasim approved these changes on Aug 5, 2025.