Commit 1a7a92c8 authored by Josef Bacik, committed by David Sterba

btrfs: add a comment explaining the data flush steps

The data flushing steps are not obvious to people other than myself and
Chris.  Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.
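To make the ordering concrete, here is a minimal, simplified model of the data flush loop. It assumes each step either makes the pending reservation fit or does not. The `flush_state` values copy the stage names from the comment in the diff, but `run_data_flush()`, `flush_step_fn`, `struct flush_trace` and `trace_step()` are hypothetical names used only for illustration; the kernel's real reclaim path is more involved and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical copy of the stage names; not the kernel's enum. */
enum flush_state {
	FLUSH_DELALLOC_WAIT,
	RUN_DELAYED_IPUTS,
	FLUSH_DELAYED_REFS,
	COMMIT_TRANS,
};

/* Same order as the stages described in the comment. */
static const enum flush_state data_flush_order[] = {
	FLUSH_DELALLOC_WAIT,
	RUN_DELAYED_IPUTS,
	FLUSH_DELAYED_REFS,
	COMMIT_TRANS,
};

#define NR_DATA_FLUSH_STATES \
	(sizeof(data_flush_order) / sizeof(data_flush_order[0]))

/* One flush step; returns true if the reservation now fits. */
typedef bool (*flush_step_fn)(enum flush_state state, void *ctx);

/*
 * Run the stages in order and stop at the first one after which the
 * reservation fits.  Returns how many steps ran, or -1 if none helped.
 */
static int run_data_flush(flush_step_fn step, void *ctx)
{
	for (size_t i = 0; i < NR_DATA_FLUSH_STATES; i++) {
		if (step(data_flush_order[i], ctx))
			return (int)(i + 1);
	}
	return -1;
}

/* Hypothetical test double: records each stage and succeeds at stop_at. */
struct flush_trace {
	enum flush_state visited[NR_DATA_FLUSH_STATES];
	size_t count;
	enum flush_state stop_at;
};

static bool trace_step(enum flush_state state, void *ctx)
{
	struct flush_trace *t = ctx;

	assert(t->count < NR_DATA_FLUSH_STATES);
	t->visited[t->count++] = state;
	return state == t->stop_at;
}
```

With `stop_at = FLUSH_DELAYED_REFS`, the loop runs the first three stages and returns 3 without ever reaching COMMIT_TRANS.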

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
parent 57056740
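The compression case under FLUSH_DELALLOC_WAIT comes down to counter arithmetic. The sketch below models it with a hypothetical `struct space_model` whose field names borrow from the `->bytes_reserved` and `->bytes_may_use` counters mentioned in the comment, but which is not the kernel structure. `add_reserved_bytes()` is a stand-in written in the spirit of `btrfs_add_reserved_bytes()` as the comment describes it, not its real signature, and the free-space formula is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a few space counters, in bytes. */
struct space_model {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

/* Simplified notion of free space (assumption, not the kernel formula). */
static uint64_t space_free(const struct space_model *s)
{
	return s->total_bytes - s->bytes_used - s->bytes_reserved -
	       s->bytes_may_use;
}

/* Delalloc write: reserve the uncompressed size up front. */
static void reserve_delalloc(struct space_model *s, uint64_t ram_bytes)
{
	s->bytes_may_use += ram_bytes;
}

/*
 * At COW time: charge the actual on-disk extent length to bytes_reserved
 * and drop the whole up-front reservation from bytes_may_use.
 */
static void add_reserved_bytes(struct space_model *s, uint64_t ram_bytes,
			       uint64_t num_bytes)
{
	assert(s->bytes_may_use >= ram_bytes);
	s->bytes_reserved += num_bytes;
	s->bytes_may_use -= ram_bytes;
}
```

Reserving 128 KiB for a delalloc range that compresses to 32 KiB leaves 96 KiB more free space once the pages are COWed, with no transaction commit needed.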
+47 −0
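The FLUSH_DELAYED_REFS and COMMIT_TRANS explanations can be illustrated with the comment's own example: an extent with two references, only one of which is dropped. In this hypothetical model (`struct extent_model`, `pinned_estimate()`, `pinned_after_run()` and `should_commit()` are illustrative names, not kernel functions), the estimate that looks only at the delayed ref counts the extent as freed, while the view after running delayed refs does not, and that changes the commit decision. The real decision is made in `may_commit_transaction()`, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extent: on-disk refcount plus the queued delayed-ref delta. */
struct extent_model {
	uint64_t bytes;
	uint64_t disk_refs;     /* references recorded on disk */
	int64_t delayed_delta;  /* net change queued as delayed refs */
};

/*
 * Estimate before running delayed refs: only the delayed ref is consulted,
 * so any net drop is assumed to free the extent (the overestimate).
 */
static uint64_t pinned_estimate(const struct extent_model *e)
{
	return e->delayed_delta < 0 ? e->bytes : 0;
}

/*
 * After running delayed refs the on-disk refcount is taken into account,
 * so the extent only counts as freed if no references remain.
 */
static uint64_t pinned_after_run(const struct extent_model *e)
{
	int64_t refs = (int64_t)e->disk_refs + e->delayed_delta;

	return refs <= 0 ? e->bytes : 0;
}

/* Commit only if free space plus pinned space could cover the request. */
static bool should_commit(uint64_t free_bytes, uint64_t pinned_bytes,
			  uint64_t reservation)
{
	return free_bytes + pinned_bytes >= reservation;
}
```

With 16 KiB free and a 64 KiB request, the pre-run estimate for the shared extent would justify a commit that frees nothing; once the refs are run, the sketch correctly declines to commit.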
@@ -998,6 +998,53 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
	} while (flush_state <= COMMIT_TRANS);
}

/*
 * FLUSH_DELALLOC_WAIT:
 *   Space is freed from flushing delalloc in one of two ways.
 *
 *   1) compression is on and we allocate less space than we reserved
 *   2) we are overwriting existing space
 *
 *   For #1 that extra space is reclaimed as soon as the delalloc pages are
 *   COWed, by way of btrfs_add_reserved_bytes() which adds the actual extent
 *   length to ->bytes_reserved, and subtracts the reserved space from
 *   ->bytes_may_use.
 *
 *   For #2 this is trickier.  Once the ordered extent runs we will drop the
 *   extent in the range we are overwriting, which creates a delayed ref for
 *   that freed extent.  This space, however, is not reclaimed until the
 *   transaction commits, which is what the following stages are for.
 *
 * RUN_DELAYED_IPUTS:
 *   If we are freeing inodes, we want to make sure all delayed iputs have
 *   completed, because they could have been on an inode with i_nlink == 0, and
 *   thus have been truncated and freed up space.  But again this space is not
 *   immediately reusable: it comes in the form of a delayed ref, which must be
 *   run, and then the transaction must be committed.
 *
 * FLUSH_DELAYED_REFS:
 *   The above two cases generate delayed refs that will affect
 *   ->total_bytes_pinned.  However this counter can be inconsistent with
 *   reality if there are outstanding delayed refs.  This is because we adjust
 *   the counter based solely on the current set of delayed refs and disregard
 *   any on-disk state which might include more refs.  So for example, if we
 *   have an extent with 2 references, but we only drop 1, we'll see that there
 *   is a negative delayed ref count for the extent and assume that the space
 *   will be freed, and thus increase ->total_bytes_pinned.
 *
 *   Running the delayed refs gives us an accurate view of what will be freed
 *   at transaction commit time.  This stage does not free any space itself;
 *   it only makes sure that may_commit_transaction() has all of the
 *   information it needs to make the right decision.
 *
 * COMMIT_TRANS:
 *   This is where we reclaim all of the pinned space generated by the previous
 *   stages.  We will not commit the transaction unless we think it is likely
 *   to satisfy our request: if our current free space + total_bytes_pinned <
 *   reservation, we will not commit.  This is why the previous stages matter;
 *   they make sure we know whether committing the transaction will actually
 *   allow us to make progress.
 */
static const enum btrfs_flush_state data_flush_states[] = {
	FLUSH_DELALLOC_WAIT,
	RUN_DELAYED_IPUTS,