[6.12] Track btrfs patches #36
base: base-6.12
Conversation
Add the following flags to give a hint about which chunk type should be allocated on which disk. The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA - data chunks preferred, but metadata chunks allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA - metadata chunks preferred, but data chunks allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY - only metadata chunks allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY - only data chunks allowed
Signed-off-by: Goffredo Baroncelli <[email protected]>
Signed-off-by: Goffredo Baroncelli <[email protected]>
Signed-off-by: Kai Krakow <[email protected]>
When this mode is enabled, the chunk allocation policy is modified as follows. Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for chunks of type X (the other type may be allowed when space is low)
- ALLOCATION_X_ONLY means that the disk is used *only* for chunks of type X. This also makes it a preferred choice.
Each time the allocator allocates a chunk of type X, it first takes the disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space is not enough, it also uses the disks tagged as ALLOCATION_METADATA_ONLY; if the space is still not enough, it also uses the other disks, with the exception of the one marked as ALLOCATION_PREFERRED_Y, where Y is the other chunk type (i.e. not X). Signed-off-by: Goffredo Baroncelli <[email protected]>
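For illustration, here is a minimal, hypothetical sketch of how such per-device tags could be mapped to allocation passes; the names and exact pass ordering are illustrative, not the actual btrfs code:

```c
/* Hypothetical sketch of the allocation-hint idea; illustrative only. */
enum alloc_hint {
	ALLOC_PREFERRED_DATA = 0,	/* default */
	ALLOC_PREFERRED_METADATA,
	ALLOC_METADATA_ONLY,
	ALLOC_DATA_ONLY,
};

/* Lower return value = tried in an earlier pass for the given chunk type;
 * the highest value means the device is never used for that type. */
static int alloc_pass(enum alloc_hint hint, int is_metadata)
{
	if (is_metadata) {
		switch (hint) {
		case ALLOC_METADATA_ONLY:
		case ALLOC_PREFERRED_METADATA:	return 0; /* first choice */
		case ALLOC_PREFERRED_DATA:	return 1; /* only when space is low */
		case ALLOC_DATA_ONLY:		return 2; /* never */
		}
	} else {
		switch (hint) {
		case ALLOC_DATA_ONLY:
		case ALLOC_PREFERRED_DATA:	return 0;
		case ALLOC_PREFERRED_METADATA:	return 1;
		case ALLOC_METADATA_ONLY:	return 2;
		}
	}
	return 2;
}
```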
This is useful where you want to prevent new allocations of chunks on a disk which is going to be removed from the pool anyway, e.g. due to bad blocks or because it's slow. Signed-off-by: Kai Krakow <[email protected]>
This is useful where you want to prevent new allocations of chunks to a set of multiple disks which are going to be removed from the pool. This acts like `btrfs dev remove` on steroids: it can remove multiple disks in parallel without moving data to disks which would be removed in the next round. In such cases, it avoids moving the same data multiple times, and thus avoids placing it on potentially bad disks. Thanks to @Zygo for the explanation and suggestion. Link: kdave/btrfs-progs#907 (comment) Signed-off-by: Kai Krakow <[email protected]>
Hi. What's the status of these patches? Are these something that's going to be upstream in a reasonable amount of time, or a long-term external patch series? |
These won't go into the kernel as-is and may be replaced by some different implementation in the kernel sooner or later. But I keep those safe to use - aka they don't create incompatibilities with future kernels and can just be dropped from your kernel without posing any danger to your btrfs. @Forza-tng has some explanations why those patches won't go into the kernel: https://wiki.tnonline.net/w/Btrfs/Allocator_Hints |
Yes, I think this is intentional behavior of the initial version of the patches: The type numbers are generally used as a priority sort with the I'm not sure if it would be useful to put data on I think my idea of using chunk size classes for tiering may be more useful than this side-effect (what I mentioned in a report over at btrfs-todo). But in theory, type 0 and type 3 should be treated equally as soon as the remaining unallocated space is identical... Did you reach that point? (looks like your loop dev example did exactly that if I followed correctly) But in the end: Well, "preferred" means "preferred", doesn't it? ;-) |
The loop test did the opposite of this: the type 0 device was filled before the type 3 device, even though it was smaller and had less unallocated space. I had expected types 3 and 0 to be treated equally, but we see that this isn't the case? It isn't wrong or bad, just something I hadn't thought would happen.
Indeed 😁 |
Refactor the logic in btrfs_read_policy_show() to streamline the formatting of read policies output. Streamline the space and bracket handling around the active policy without altering the functional output. This is in preparation to add more methods. Signed-off-by: Anand Jain <[email protected]>
Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(), but this occurs too late for find_live_mirror(), which is invoked by load_super_root() much earlier than btrfs_init_devices_late(). Fix this by moving the initialization to open_ctree(), before load_super_root(). Reviewed-by: Naohiro Aota <[email protected]> Signed-off-by: Anand Jain <[email protected]>
…store Introduce the `btrfs_read_policy_to_enum` helper function to simplify the conversion of a string read policy to its corresponding enum value. This reduces duplication and improves code clarity in `btrfs_read_policy_store`. The `btrfs_read_policy_store` function has been refactored to use the new helper. The parameter is copied locally to allow modification, enabling the separation of the method and its value. This prepares for the addition of more functionality in subsequent patches. Signed-off-by: Anand Jain <[email protected]>
Added RAID1 read balance patches, see PR description. |
@Forza-tng Looking forward to some benchmark numbers if you want to do them. :-) |
I made a different suggestion when those first came out. The current approach with arbitrary, use-case-based names ("data only", "data preferred", "metadata only", "metadata preferred", and "none") is going to be very confusing, especially with the implied sorting they have to do. I suggest that the hints should be a bitmask of the kinds of allocation that would be allowed on the device. When allocation fails with the first preference, the order in which new drives are added to the free space search should be specified separately. Splitting it into two parts gives us clean options for "what is allowed" and "when is it allowed".
What is allowed
It's much clearer what to expect when the options are expressed this way: you get metadata on a device, or you don't. There's no possibility of data spilling onto a device that you didn't ask for.
When it is allowed
To get the "preferred" options, there must be multiple allocation passes, one for each preference level. We need to specify priority for each device within each pass for each allocation type. So we expand the above to more than one bit. e.g. with two bits for each type, you have 4 levels for each:
The allocator would then run multiple passes, with each pass adding more drives from the next preference level to search for free space. This loop would stop one level before the lowest, so any device at the lowest preference level would never be used, giving us the "no data" or "no metadata" cases.
Implementation notes
Use cases
An array of SSDs and HDDs split into metadata and data respectively
A pair of SSD and HDD (as you might find in a laptop) with bidirectional overflow
A pair of SSD and HDD which allows metadata to overflow to HDD, but no data to overflow to SSD
A multi-device remove with no overflow allowed to removed devices
This config prevents any new data or metadata from being allocated on the to-remove devices even if that would result in ENOSPC, e.g. if the devices are being removed because they are failing.
A multi-device remove with overflow allowed to removed devices
This config allows allocations to fall back to to-be-removed devices if other devices run out of space, making it a deliberate safety valve to prevent ENOSPC in case the user failed to estimate space correctly.
Multiple tiers
If the slowest drive is the largest one, the default allocator behavior will try to put all the metadata on the slowest drive. This config flips the allocation order in that scenario.
Future considerations
Why do we have more than one level between "always" and "never"? We only need 3 to support the proposal for "preferred", "only", and "none". 3 levels require at least 2 bits, but 2 bits give us 4 values, so we get a second middle level for free. We can lean into that: If the user has some complex multi-level tiered storage, or a mashup of old drives with various performance and reliability, or they're doing a complicated reshape, then that second middle level -- or a third bit to extend to a total of 6 levels + always + never -- could be useful. If we expanded to 8 bits, we'd be able to provide drive-by-drive customized allocation order. I can't think of a use case for this off the top of my head, but on the other hand, I didn't know about the "none" use case until I found myself in immediate need of it. Maybe someone else will run into it, or it will become part of a more general tiered storage solution. |
@Zygo I like this bitfield suggestion, and it should still be easy to use for simple ordering: Just mask out the bits not used for a request, then compare the remaining. And as you already pointed out, we should rather use 3 bits for tiering: There may be much more complex scenarios of different drives (5400, 7200, 10k rpm) or maybe even network storage (iSCSI, DRBD) involved which have very different performance characteristics. So we could have RDDDRMMM (reserved, data, metadata) in an 8-bit field. I'd still need to figure out why we need multiple passes or how that is different from the current implementation, and I probably need some time for it. I think your multipass idea differs in how free space is considered. We should also look into how we could properly migrate the old settings to the new ones automatically. And instead of writing raw decimal numbers to the |
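To make the RDDDRMMM idea concrete, a purely hypothetical encoding (nothing like this exists in the patches yet) could look like this:

```c
/* Hypothetical RDDDRMMM layout: bit 7 and bit 3 reserved,
 * bits 6-4 = data preference level, bits 2-0 = metadata preference level.
 * Level 0 = never allocate, 7 = first choice (example convention only). */
#define ALLOC_DATA_SHIFT	4
#define ALLOC_META_SHIFT	0
#define ALLOC_LEVEL_MASK	0x7

static inline unsigned char alloc_hint_encode(unsigned int data_level,
					       unsigned int meta_level)
{
	return ((data_level & ALLOC_LEVEL_MASK) << ALLOC_DATA_SHIFT) |
	       ((meta_level & ALLOC_LEVEL_MASK) << ALLOC_META_SHIFT);
}

static inline unsigned int alloc_hint_level(unsigned char hint, int is_metadata)
{
	unsigned int shift = is_metadata ? ALLOC_META_SHIFT : ALLOC_DATA_SHIFT;

	return (hint >> shift) & ALLOC_LEVEL_MASK;
}
```

Sorting candidates for, say, a data allocation would then just compare `alloc_hint_level(hint, 0)` across devices.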
Currently we have two semantic preference levels: one for "only", then another for "preferred". IIRC there's no explicit nested loop in the current code--we're just doing a sort by size and preference level, then we loop once through the sorted list of drives, and we cut off the search at two different points to get the preference levels. So my proposed change is to make the outer loop explicit, and run it for as many iterations as there are preference levels in use (or do it in a single loop, but order it properly so it has the same effect as a nested loop).

Making the outer loop explicit might also help clean up some weirdness that currently happens with out-of-size-order preferences and striped profile (e.g. raid5 or raid10) allocations when the device sizes don't line up with preferences the right way, e.g. you can get narrow raid5 stripes if some devices are "preferred" and some "only", because there's no way to separate metadata order from data order in the current implementation--metadata order is strictly the opposite of data order, and that's not always what we want.

The preference data type doesn't have to be bitfields, either in storage or interface. Ordinary integers (one for data, one for metadata, for each drive) will work fine. Thinking of it as generalizations layered on a single-bit yes/no preference concept may be helpful...or it may not.

Currently the patches store everything in a single integer field. One commenter the first time around suggested moving the whole thing into the filesystem tree, e.g. BTRFS_PERSISTENT_ITEM_KEY with an objectid dedicated to allocation preferences, and the offset of each key identifying the device the preferences apply to. That would allow for versioning of the parameter schema and indefinite extension of the fields (e.g. to add a migration policy). Currently we're using the dev type field in the device superblocks because it's simpler, not because we need to bootstrap allocation directly from superblocks. btrfs has to load the trees before it can allocate anything, so the trees will be available in time to retrieve the allocation preferences. |
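A minimal sketch of what the explicit outer loop could look like, assuming hypothetical structure and field names rather than the actual allocator code:

```c
/* Hypothetical sketch: widen the candidate set one preference level at a
 * time until enough free space is found; the lowest level is never used. */
struct hinted_device {
	unsigned long long	free_bytes;
	unsigned int		level;	/* per-type preference level */
};

static unsigned long long pick_devices(const struct hinted_device *devs,
					int ndevs, unsigned int max_level,
					unsigned long long needed)
{
	unsigned long long found = 0;
	unsigned int level;
	int i;

	/* stop one level above the lowest: level 0 means "never allocate" */
	for (level = max_level; level > 0 && found < needed; level--) {
		for (i = 0; i < ndevs; i++) {
			if (devs[i].level == level)
				found += devs[i].free_bytes;
		}
	}
	return found;
}
```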
@Zygo Thanks, that helped me get the idea... |
First conclusion using latency vs round-robin: My system uses bcache (mdraid NVMe) backed by four 7200rpm HDDs. Turns out that latency is at an advantage here. I found that the last drive in the setup is never used for reads in latency mode. Overall, latency mode gives a slightly higher throughput in game loading screens (probably due to bcache, not because I apparently use fewer disks for reading). Investigating the behavior, I found that this last drive only claims to be 7200rpm; in reality it is a 5400rpm drive. fio clearly shows results typical for 5400rpm drives:
fio --rw=randread --name=IOPS-read --bs=4k --direct=1 --filename=/dev/DEV --numjobs=1 --ioengine=libaio --iodepth=1 --refill_buffers --group_reporting --runtime=60 --time_based
Tested:
Due to the latency balancer excluding the last disk from most read operations, bcache can be used more effectively because that last disk will be avoided for bcache read caching. So in a scenario with bcache and/or varying disk types, latency is a clear winner. Without bcache, it would still provide better latency. But in a scenario where throughput matters, you should probably be using round-robin. I wonder if we can make a hybrid balancer which uses round-robin but weighted/grouped by latency... Because the latency balancer will clearly fail in scenarios where disk latency is only slightly off between the members: It would then prefer to read from fewer devices than it should. |
Added a new read balancer `latency-rr`. It tries to combine round-robin and latency into one hybrid approach by using round-robin across a set of stripes within a 120% margin of the best latency. I am currently testing this and have not yet discovered the benefits or downsides, but in theory it should prefer the fastest stripes for small requests while it switches over to using all stripes for large continuous requests. Note: The latency calculation currently uses an average of the full history of requests only - which is bad because it will cancel out changing variations over time. A better approach would be to use an EMA (exponential moving average) with an alpha of 1/8 or 1/16. This requires sampling individual bio latency and thus requires changing structs and code in other parts of btrfs. I'm not very familiar with all the internal structures yet, and the feature is still guarded by `CONFIG_BTRFS_EXPERIMENTAL`. This is also why I won't try to eliminate the code duplication yet (to avoid double calculations).
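For reference, a minimal sketch of the kind of integer EMA update meant here (an assumed helper, not code from the patches):

```c
/* Hypothetical EMA update with alpha = 1/8: new = old + (sample - old) / 8.
 * A signed delta keeps the math correct when latency goes down again. */
static inline void latency_ema_update(unsigned long long *ema_ns,
				      unsigned long long sample_ns)
{
	long long delta = (long long)sample_ns - (long long)*ema_ns;

	*ema_ns += delta / 8;	/* alpha = 1/8; use /16 for alpha = 1/16 */
}
```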
Add fs_devices::read_cnt_blocks to track read blocks, initialize it in open_fs_devices() and clean it up in close_fs_devices(). btrfs_submit_dev_bio() increments it for reads when stats tracking is enabled. Stats tracking is disabled by default and is enabled through fs_devices::fs_stats when required. The code is not under the EXPERIMENTAL define, as stats can be expanded to include write counts and other performance counters, with the user interface independent of its internal use. This is an in-memory-only feature, different to the dev error stats. Signed-off-by: Anand Jain <[email protected]>
CONFIG_BTRFS_EXPERIMENTAL is needed by the RAID1 balancing patches but we don't want to use the full scope of the 6.13 patch because it also affects features currently masked via CONFIG_BTRFS_DEBUG. TODO: Drop during rebase to 6.13 or later. Original-author: Qu Wenruo <[email protected]> Signed-off-by: Kai Krakow <[email protected]>
050cdab to 2df0dd2
Rebased to newer read policy patchset; my initial merge used an old version from Jan '25. Important: The module parameter |
f1d7497 to e51ca31
Signed-off-by: Kai Krakow <[email protected]>
Signed-off-by: Kai Krakow <[email protected]>
To get some more insights, we can count how often a stripe has been ignored relative to its neighbors. We simply increase the counter for all candidates, then decrease it again for the selected stripe. This should show how evenly distributed one of the read balancing algorithms is. Signed-off-by: Kai Krakow <[email protected]>
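A tiny sketch of that bookkeeping (hypothetical names, not the patch itself):

```c
/* Every candidate's "ignored" counter goes up by one, then the chosen
 * stripe gets its increment taken back, so only the stripes that lost
 * the selection accumulate a positive count. */
static void account_ignored(long long *ignored, int num_candidates, int selected)
{
	int i;

	for (i = 0; i < num_candidates; i++)
		ignored[i]++;
	ignored[selected]--;
}
```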
Signed-off-by: Kai Krakow <[email protected]>
Select the preferred stripe based on a mirror with the least in-flight requests. Signed-off-by: Kai Krakow <[email protected]>
Link: #36 (comment) Signed-off-by: Kai Krakow <[email protected]>
Signed-off-by: Kai Krakow <[email protected]>
@Forza-tng I think this is rather a 32 bit integer and it wraps quite easily for you. Making it unsigned will not prevent it from wrapping but at least it won't become negative. I'll test the revised patchset now and will be pushing it after a reboot. Great testing, thanks. |
Signed int * 4k is 8TiB, which seems plausible. However, why not make total_reads a s64/u64? |
I want to stay as close to the original upstream patches as possible to make future rebases to next LTS easier. |
That makes sense. The fix (if it is), should rather be done upstream. But let's test this and see how it goes first :)
Server is using ECC and rasdaemon has not logged any errors so far. |
The fix is using unsigned ints (which upstream does but I used an outdated patch). We are working on block boundaries: it should make no difference if it wraps at 32 bit or 64 bit. It will probably make a difference if you use values like "3 blocks" or "7 blocks" for |
What happens if |
7f297c4 to 740955e
HEADS UP: Here's an important update which fixes an integer wrap to negative resulting in out-of-bounds access if more than 8 TB of data have been read. It also adds a slight optimization to lower
@Forza-tng It will just be clipped to the lowest 32 bits. Since the calculation is only about cycling through the stripes (e.g. read 4 blocks in a row before switching to the next stripe), this will be perfectly okay unless you use very large numbers for I think the idea here is to let the CPU use calculations within one native word instead of, e.g., forcing a 32 bit CPU to do 64 bit math. It can do that but that requires more cycles. |
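To make the wrap argument concrete, here is a small hypothetical sketch of the kind of cycling calculation meant here (illustrative only; the real code differs in details):

```c
/* Hypothetical sketch: pick a stripe by counting reads in groups of
 * min_contig blocks (min_contig > 0). With an unsigned 32-bit counter,
 * division and modulo stay well-defined across a wrap, so the wrap only
 * shifts the rotation phase once every 2^32 blocks instead of breaking
 * the selection. */
static int rr_pick_stripe(unsigned int total_reads, unsigned int min_contig,
			  int num_stripes)
{
	return (total_reads / min_contig) % num_stripes;
}
```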
Thinking about this: The cycle counter works only for single-threaded sequential readers anyway, because random reads from other threads would randomly advance the cycle. This is probably why the benchmarks observe worse performance with round-robin during concurrent random IO. This is one more reason to migrate this to a more intelligent selection like mdraid1 does. Also, if you have 3 stripes, the cycle will be "wrong" once every wrap. So it's not a big issue compared to the above behavior. Adding my latency selection into round-robin ( |
@kakra It seems that using
Btrfs Read Policy Benchmark Results.
HDD RAID1
SSD RAID10
|
@Forza-tng Thanks, I edited the table to better fit into the text column, and I marked the best values. I think you did those without any concurrent background workload? It's really interesting to see how well On SSD, So we can probably say that |
I was surprised we couldn't reach around 500MiB/s (2x single device) read speeds. I also did a rw test on the SSD setup. Did not use the |
Either let it defragment the whole pool, or create a data subdir inside the benchmark directory and add some substantial data to it. I didn't want to spoil my btrfs root so I created a subdir and added 50 GB of random data via dd for defrag.
Yep, that doesn't work on btrfs and that's why I see no point in using raid10: It just makes head movement more pronounced and thus lowers throughput and increases latency, but btrfs doesn't read stripes in parallel. OTOH, I'm not sure if mdraid10 would be better here. I think this needs some tuning of the round-robin size to match the read size of your sequential readers and enough queue depth, so it alternates one and the other mirror/stripe each request. It probably works better with multiple parallel readers, and that's why

Part of the problem may be how latency works for HDD because it has two kinds of latency: rotational latency (the device needs to wait half a disk revolution on average, which is around 4.16ms for 7200rpm) and seek latency (which seems to be around 2-3ms for my disks according to fio latency tests showing 11ms for 99%, which should contain the worst cases including full rotation). Manufacturers either show one or the other latency in their specs; Seagate IronWolf Pro seems to spec the rotational latency of 4.16ms. Theoretically, there's also data transfer latency but I think we can ignore that.

Louder HDDs tend to seek faster. If your HDD supports different power profiles, setting a higher/louder profile may speed up head movement and reduce seek time. I'm not sure how my HGST Deskstar works: It mostly behaves like a 5400rpm drive but sometimes it easily beats Seagate Ironwolf Pro and WD Black, especially in seeky tests. That's very strange. Maybe it has two independent head mounts? |
The HDD benchmark was RAID1 while the SSD was RAID10. I will see if I can set up a HDD RAID10 array later for more tests. |
At least when I tested raid10 a few years back, it performed worse than simple raid1 or even single spanning multiple devices. Of course, that was with pid policy back then: Processes bothered just one disk mostly, and left the others alone, causing less inter-process disturbance and thus lower latency. |
@kakra I think using Has the issue and fix been submitted to the mailing list? |
The issue has already been resolved in the patch merged into the kernel. I just happened to base my work off an older version of the patch, so I can still have the latency patch (which was dropped before the merge). My updated patch here aligns with upstream kernel. |
FWIW, someone mentioned that there are some patches on the linux-btrfs mailing list: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles. |
I've spent far too many hours fooling around with this exact issue on OpenZFS. With a small group of mirrors I always hit a wall somewhere around 1.6-1.7x the sum total of the individual HDD throughputs. Unlike a true/classic RAID-0 where one just reads all drives at max speed and "zippers" the data together, long seq. reads from mirrors require data hopscotch. There's unpredictable interactions with HDD internal read-ahead, how much data sits on a given cylinder, rotational latency, "blown" revolutions because you didn't quite seek to the new cylinder in time to grab the next LBA, etc. I did discover that how the data is written and distributed across spindles matters when reading it back. If there are any knobs to influence the write allocator "pseudo stripe size" (for lack of a better term) it might be worth twisting them. I love this kind of stuff and wish I had a btrfs mirror to join the fun. Hopefully soon... |
You're welcome :-) |
> kernel: rcu: INFO: rcu_sched self-detected stall on CPU > kernel: rcu: 10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052 > kernel: rcu: (t=2100 jiffies g=358306033 q=2241752 ncpus=16) > kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1 > kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > kernel: RIP: 0010:btrfs_get_64+0x65/0x110 > kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48 > kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202 > kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200 > kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20 > kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015 > kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a > kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800 > kernel: FS: 000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000 > kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0 > kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? rcu_dump_cpu_stacks+0xd3/0x100 > kernel: ? rcu_sched_clock_irq+0x4ff/0x920 > kernel: ? update_process_times+0x6c/0xa0 > kernel: ? tick_nohz_handler+0x82/0x110 > kernel: ? tick_do_update_jiffies64+0xd0/0xd0 > kernel: ? __hrtimer_run_queues+0x10b/0x190 > kernel: ? hrtimer_interrupt+0xf1/0x200 > kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50 > kernel: ? sysvec_apic_timer_interrupt+0x60/0x80 > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > kernel: ? btrfs_get_64+0x65/0x110 > kernel: find_parent_nodes+0x1b84/0x1dc0 > kernel: btrfs_find_all_leafs+0x31/0xd0 > kernel: ? queued_write_lock_slowpath+0x30/0x70 > kernel: iterate_extent_inodes+0x6f/0x370 > kernel: ? update_share_count+0x60/0x60 > kernel: ? extent_from_logical+0x139/0x190 > kernel: ? release_extent_buffer+0x96/0xb0 > kernel: iterate_inodes_from_logical+0xaa/0xd0 > kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150 > kernel: __x64_sys_ioctl+0x84/0xc0 > kernel: do_syscall_64+0x47/0x100 > kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53 > kernel: RIP: 0033:0x55d07617eaaf > kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 > kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf > kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003 > kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000 > kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 > kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0 > kernel: </TASK> The RCU stall could be because there's a large number of backrefs for some extents and we're spending too much time looping over them without ever yielding the cpu. 
Link: https://lore.kernel.org/linux-btrfs/CAMthOuP_AE9OwiTQCrh7CK73xdTZvHsLTB1JU2WBK6cCc05JYg@mail.gmail.com/T/#md2e3504a1885c63531f8eefc70c94cff571b7a72 Signed-off-by: Kai Krakow <[email protected]>
Added a test patch that may fix an RCU stall logged to dmesg during heavy meta data operations (e.g. snapshot cleanup during backups). |
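For context, the usual remedy for "spending too much time looping without yielding the cpu" is to periodically yield inside the long backref iteration. This is only a sketch of that general technique, not necessarily what the test patch does:

```c
/* Sketch of the general anti-stall technique (illustrative only):
 * periodically give up the CPU while walking a very large number of
 * backrefs, so RCU and other tasks can make progress. */
#include <linux/sched.h>

static void walk_backrefs(unsigned long nr_refs)
{
	unsigned long i;

	for (i = 0; i < nr_refs; i++) {
		/* ... process one backref ... */
		if ((i & 1023) == 0)
			cond_resched();	/* let the scheduler run other work */
	}
}
```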
Export patch series: https://github.com/kakra/linux/pull/36.patch
Here's a good guide by @Forza-tng: https://wiki.tnonline.net/w/Btrfs/Allocator_Hints. Please leave them a nice comment. Thanks. :-)
Allocator hints
To make use of the allocator hints, add these patches to your kernel. Then run

`btrfs device usage /path/to/btrfs`

and take note of which device IDs are SSDs and which are HDDs. Go to `/sys/fs/btrfs/BTRFS-UUID/devinfo` and run:

- `echo 0 | sudo tee HDD-ID/type` to prefer writing data to this device (btrfs will then prefer allocating data chunks from this device before considering other devices) - recommended for HDDs, set by default
- `echo 1 | sudo tee SSD-ID/type` to prefer writing meta-data to this device (btrfs will then prefer allocating meta-data chunks from this device before considering other devices) - recommended for SSDs
- `echo 4 | sudo tee LEGACY-ID/type`
- `echo 5 | sudo tee LEGACY-ID/type`

Important note: It is recommended to use at least two independent SSDs so the btrfs meta-data raid1 requirement is still satisfied. You can, however, create two partitions on the same SSD, but then it's no longer protected against hardware faults - it's essentially dup-quality meta-data then, not raid1. Before sizing the partitions, look at `btrfs device usage` to find the amount of meta-data, and use at least double that amount when sizing your meta-data partitions.

This can be combined with bcache by directly using the meta-data partitions as native SSD partitions for btrfs, and only routing the data partitions through bcache. This also takes a lot of meta-data pressure off bcache, making it more efficient and less write-wearing as a result.

Real-world example

In this example, `sde` is a 1 TB SSD with two meta-data partitions (2x 128 GB) and the remaining space dedicated to a single bcache partition attached to my btrfs pool devices:

A curious reader may notice that `sde1` and `sde3` are missing: these are my EFI boot partition (sde1) and swap space (sde3).

Read Policies aka "RAID1 read balancer"

To use the balancer, `CONFIG_BTRFS_EXPERIMENTAL=y` is needed while building the kernel. The balancer offers six modes: `pid`, `round-robin`, `latency`, `latency-rr`, `queue`, and `devid:#`.

Combined with bcache, performance is unstable with both `round-robin` and `latency` because latency and throughput depend on whether data is cached or not - but it is still overall better than the old PID balancer.

Unexpectedly, `queue` performs exceptionally well on my mixed device setup. YMMV with identical member devices. It outperforms all other policies in each discipline and benchmark.

To use the balancer, use `btrfs.read_policy=<pid,round-robin,latency,latency-rr,queue,devid:#>` on the kernel cmdline. There's also a sysfs interface at `/sys/fs/btrfs/<UUID>/read_policy` to switch balancers on demand (e.g., for benchmarks). See `modinfo btrfs` for more information.

Benchmark results: https://gist.github.com/kakra/ce99896e5915f9b26d13c5637f56ff37

Note 1: The latency calculation currently uses an average of the full history of requests only - which is bad because it will cancel out changing variations over time. A better approach would be to use an EMA (exponential moving average) with an alpha of 1/8 or 1/16. This requires sampling individual bio latency and thus requires changing structs and code in other parts of btrfs. I'm not very familiar with all the internal structures yet, and the feature is still guarded by `CONFIG_BTRFS_EXPERIMENTAL`, making that approach more complex. OTOH, having a permanent EMA right in the bio structures of btrfs could prove useful in other areas of btrfs.

Note 2: In theory, both latency modes should automatically prefer faster zones of HDDs and properly switch stripes automatically. In practice, this is probably overruled by note 1 unless most of your data happens to sit in specific zones, in which case the average would properly hold some sort of "zone performance".

Note 3: With high CPU core counts, `queue` might have a measurable CPU overhead due to the queue length calculation (per-core counters have to be summed for each request).

Real-world example

Some simple tests have shown that both the round-robin and latency balancers can increase throughput while loading games from 200-300 MB/s to 500-700 MB/s when combined with bcache.

Important note: This will be officially available with kernel 6.15 or later, excluding the latency balancer. I've included it because I think it can provide better performance in edge cases, e.g. asymmetric RAID or bcache. It may also provide better performance on the desktop because there, latency is more important than throughput. The latency balancer is thus an experiment and may go away. But I will keep it until at least the next LTS kernel.

Description / instructions for balancing

(AI generated after training with some stats, observations and incremental development steps)

Interpreting the Btrfs `read_stats` Sysfs Output

The `/sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats` file, enhanced by these patches, offers valuable insights into the dynamic read balancing behavior and performance of individual devices within a Btrfs RAID1/10/1C3/4 setup. Here's a breakdown of the fields:

- `cumulative ios %lu`: The total count of read I/O operations completed on this specific device since the filesystem was mounted.
- `cumulative wait %llu`: The total time (in nanoseconds) accumulated waiting for all cumulative read IOs on this device.
- `cumulative avg %llu`: The long-term average read latency (`cumulative wait / cumulative ios`) in nanoseconds. This represents the device's average performance over its entire operational history within the current mount. It changes very slowly and can be heavily influenced by caching layers (like bcache or the page cache) if present.
- `checkpoint ios %ld`: The number of read IOs completed since the last checkpoint. A checkpoint is established when a device undergoes "rehabilitation" - meaning its `age` counter reached the `BTRFS_DEVICE_LATENCY_CHECKPOINT_AGE` threshold, triggering a read probe and a reset of these checkpoint statistics. For devices that have never been rehabilitated, this value will equal `cumulative ios`.
- `checkpoint wait %lld`: The total time (in nanoseconds) accumulated waiting for reads since the last checkpoint.
- `checkpoint avg %llu`: The average read latency (`checkpoint wait / checkpoint ios`) calculated only using the IOs since the last checkpoint. This is a key metric reflecting recent performance. It's much more responsive to current conditions than the cumulative average, especially after a period of being ignored.
- `age %lld`: This counter tracks how "stale" the device is in terms of read selection. It increments each time a read balancing decision is made for a stripe group containing this device, but another device from that group is chosen.
  - `0`: The device was selected for a read very recently (in the last relevant balancing decision).
  - `> 0`: The device has been ignored for this many consecutive selection events where it was a candidate. A high value indicates it's consistently considered slower or less preferred than its peers.
  - `< 0`: The device has just been rehabilitated (hit the `age` threshold). It is now in a "burst IO" probation period (e.g., starting at -100 and incrementing towards 0). During this negative age phase, its reported latency is forced to 0 to guarantee it receives reads.
- `count %llu`: The number of times this device has triggered the rehabilitation mechanism by reaching the `age` threshold. A high count suggests the device is frequently deemed too slow by the latency policy or is subject to other selection biases (like non-balancing metadata reads).
- `ignored %lld`: A counter incremented every time this device was a candidate for a read, but the balancing policy ultimately selected a different device from the same stripe group. This provides insight into how often the policy actively chooses a peer over this device, indicating relative preference or "fairness" of the algorithm.

How to Use These Stats:

- A high `checkpoint avg` (compared to peers) and a high, frequently reset `age` indicate likely performance bottlenecks under the current load.
- Compare `cumulative avg` and `checkpoint avg`. A large difference after rehabilitation (`count > 0`) shows the policy is adapting to performance changes more quickly than the cumulative average would suggest.
- A low `checkpoint avg` but a persistently high `age` and `ignored` count (like NVMe metadata mirrors sometimes exhibit) points towards a selection bias not based purely on latency.
- The `age` and `count` values help evaluate the `AGE_THRESHOLD` and `IO_BURST` parameters. If `age` hits the threshold very frequently, it might be too low. If `checkpoint ios` barely increases after a reset, the `IO_BURST` might be too short (or the device becomes slow again immediately).

These enhanced statistics provide a powerful diagnostic tool for understanding and fine-tuning Btrfs's read balancing behavior in complex, real-world storage environments.