Skip to content

Commit 3c3dc5e

Browse files
josefbacikkdave
authored andcommitted
btrfs: check delayed refs when we're checking if a ref exists
In the patch 78c52d9 ("btrfs: check for refs on snapshot delete resume") I added some code to handle file systems that had been corrupted by a bug that incorrectly skipped updating the drop progress key while dropping a snapshot. This code would check to see if we had already deleted our reference for a child block, and skip the deletion if we had already. Unfortunately there is a bug, as the check would only check the on-disk references. I made an incorrect assumption that blocks in an already deleted snapshot that was having the deletion resume on mount wouldn't be modified. If we have 2 pending deleted snapshots that share blocks, we can easily modify the rules for a block. Take the following example subvolume a exists, and subvolume b is a snapshot of subvolume a. They share references to block 1. Block 1 will have 2 full references, one for subvolume a and one for subvolume b, and it belongs to subvolume a (btrfs_header_owner(block 1) == subvolume a). When deleting subvolume a, we will drop our full reference for block 1, and because we are the owner we will drop our full reference for all of block 1's children, convert block 1 to FULL BACKREF, and add a shared reference to all of block 1's children. Then we will start the snapshot deletion of subvolume b. We look up the extent info for block 1, which checks delayed refs and tells us that FULL BACKREF is set, so sets parent to the bytenr of block 1. However because this is a resumed snapshot deletion, we call into check_ref_exists(). Because check_ref_exists() only looks at the disk, it doesn't find the shared backref for the child of block 1, and thus returns 0 and we skip deleting the reference for the child of block 1 and continue. This orphans the child of block 1. The fix is to lookup the delayed refs, similar to what we do in btrfs_lookup_extent_info(). However we only care about whether the reference exists or not. If we fail to find our reference on disk, go look up the bytenr in the delayed refs, and if it exists look for an existing ref in the delayed ref head. If that exists then we know we can delete the reference safely and carry on. If it doesn't exist we know we have to skip over this block. This bug has existed since I introduced this fix, however requires having multiple deleted snapshots pending when we unmount. We noticed this in production because our shutdown path stops the container on the system, which deletes a bunch of subvolumes, and then reboots the box. This gives us plenty of opportunities to hit this issue. Looking at the history we've seen this occasionally in production, but we had a big spike recently thanks to faster machines getting jobs with multiple subvolumes in the job. Chris Mason wrote a reproducer which does the following mount /dev/nvme4n1 /btrfs btrfs subvol create /btrfs/s1 simoop -E -f 4k -n 200000 -z /btrfs/s1 while(true) ; do btrfs subvol snap /btrfs/s1 /btrfs/s2 simoop -f 4k -n 200000 -r 10 -z /btrfs/s2 btrfs subvol snap /btrfs/s2 /btrfs/s3 btrfs balance start -dusage=80 /btrfs btrfs subvol del /btrfs/s2 /btrfs/s3 umount /btrfs btrfsck /dev/nvme4n1 || exit 1 mount /dev/nvme4n1 /btrfs done On the second loop this would fail consistently, with my patch it has been running for hours and hasn't failed. I also used dm-log-writes to capture the state of the failure so I could debug the problem. Using the existing failure case to test my patch validated that it fixes the problem. Fixes: 78c52d9 ("btrfs: check for refs on snapshot delete resume") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
1 parent 6ece62a commit 3c3dc5e

File tree

3 files changed

+114
-6
lines changed

3 files changed

+114
-6
lines changed

fs/btrfs/delayed-ref.c

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1134,6 +1134,73 @@ btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 byt
11341134
return find_ref_head(delayed_refs, bytenr, false);
11351135
}
11361136

1137+
static int find_comp(struct btrfs_delayed_ref_node *entry, u64 root, u64 parent)
1138+
{
1139+
int type = parent ? BTRFS_SHARED_BLOCK_REF_KEY : BTRFS_TREE_BLOCK_REF_KEY;
1140+
1141+
if (type < entry->type)
1142+
return -1;
1143+
if (type > entry->type)
1144+
return 1;
1145+
1146+
if (type == BTRFS_TREE_BLOCK_REF_KEY) {
1147+
if (root < entry->ref_root)
1148+
return -1;
1149+
if (root > entry->ref_root)
1150+
return 1;
1151+
} else {
1152+
if (parent < entry->parent)
1153+
return -1;
1154+
if (parent > entry->parent)
1155+
return 1;
1156+
}
1157+
return 0;
1158+
}
1159+
1160+
/*
1161+
* Check to see if a given root/parent reference is attached to the head. This
1162+
* only checks for BTRFS_ADD_DELAYED_REF references that match, as that
1163+
* indicates the reference exists for the given root or parent. This is for
1164+
* tree blocks only.
1165+
*
1166+
* @head: the head of the bytenr we're searching.
1167+
* @root: the root objectid of the reference if it is a normal reference.
1168+
* @parent: the parent if this is a shared backref.
1169+
*/
1170+
bool btrfs_find_delayed_tree_ref(struct btrfs_delayed_ref_head *head,
1171+
u64 root, u64 parent)
1172+
{
1173+
struct rb_node *node;
1174+
bool found = false;
1175+
1176+
lockdep_assert_held(&head->mutex);
1177+
1178+
spin_lock(&head->lock);
1179+
node = head->ref_tree.rb_root.rb_node;
1180+
while (node) {
1181+
struct btrfs_delayed_ref_node *entry;
1182+
int ret;
1183+
1184+
entry = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
1185+
ret = find_comp(entry, root, parent);
1186+
if (ret < 0) {
1187+
node = node->rb_left;
1188+
} else if (ret > 0) {
1189+
node = node->rb_right;
1190+
} else {
1191+
/*
1192+
* We only want to count ADD actions, as drops mean the
1193+
* ref doesn't exist.
1194+
*/
1195+
if (entry->action == BTRFS_ADD_DELAYED_REF)
1196+
found = true;
1197+
break;
1198+
}
1199+
}
1200+
spin_unlock(&head->lock);
1201+
return found;
1202+
}
1203+
11371204
void __cold btrfs_delayed_ref_exit(void)
11381205
{
11391206
kmem_cache_destroy(btrfs_delayed_ref_head_cachep);

fs/btrfs/delayed-ref.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -389,6 +389,8 @@ void btrfs_dec_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info);
389389
int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
390390
enum btrfs_reserve_flush_enum flush);
391391
bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
392+
bool btrfs_find_delayed_tree_ref(struct btrfs_delayed_ref_head *head,
393+
u64 root, u64 parent);
392394

393395
static inline u64 btrfs_delayed_ref_owner(struct btrfs_delayed_ref_node *node)
394396
{

fs/btrfs/extent-tree.c

Lines changed: 45 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5472,23 +5472,62 @@ static int check_ref_exists(struct btrfs_trans_handle *trans,
54725472
struct btrfs_root *root, u64 bytenr, u64 parent,
54735473
int level)
54745474
{
5475+
struct btrfs_delayed_ref_root *delayed_refs;
5476+
struct btrfs_delayed_ref_head *head;
54755477
struct btrfs_path *path;
54765478
struct btrfs_extent_inline_ref *iref;
54775479
int ret;
5480+
bool exists = false;
54785481

54795482
path = btrfs_alloc_path();
54805483
if (!path)
54815484
return -ENOMEM;
5482-
5485+
again:
54835486
ret = lookup_extent_backref(trans, path, &iref, bytenr,
54845487
root->fs_info->nodesize, parent,
54855488
btrfs_root_id(root), level, 0);
5489+
if (ret != -ENOENT) {
5490+
/*
5491+
* If we get 0 then we found our reference, return 1, else
5492+
* return the error if it's not -ENOENT;
5493+
*/
5494+
btrfs_free_path(path);
5495+
return (ret < 0 ) ? ret : 1;
5496+
}
5497+
5498+
/*
5499+
* We could have a delayed ref with this reference, so look it up while
5500+
* we're holding the path open to make sure we don't race with the
5501+
* delayed ref running.
5502+
*/
5503+
delayed_refs = &trans->transaction->delayed_refs;
5504+
spin_lock(&delayed_refs->lock);
5505+
head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
5506+
if (!head)
5507+
goto out;
5508+
if (!mutex_trylock(&head->mutex)) {
5509+
/*
5510+
* We're contended, means that the delayed ref is running, get a
5511+
* reference and wait for the ref head to be complete and then
5512+
* try again.
5513+
*/
5514+
refcount_inc(&head->refs);
5515+
spin_unlock(&delayed_refs->lock);
5516+
5517+
btrfs_release_path(path);
5518+
5519+
mutex_lock(&head->mutex);
5520+
mutex_unlock(&head->mutex);
5521+
btrfs_put_delayed_ref_head(head);
5522+
goto again;
5523+
}
5524+
5525+
exists = btrfs_find_delayed_tree_ref(head, root->root_key.objectid, parent);
5526+
mutex_unlock(&head->mutex);
5527+
out:
5528+
spin_unlock(&delayed_refs->lock);
54865529
btrfs_free_path(path);
5487-
if (ret == -ENOENT)
5488-
return 0;
5489-
if (ret < 0)
5490-
return ret;
5491-
return 1;
5530+
return exists ? 1 : 0;
54925531
}
54935532

54945533
/*

0 commit comments

Comments
 (0)