Commit 96485e44 authored by Linus Torvalds
Pull ext4 updates from Ted Ts'o:
 "The significant new ext4 feature this time around is Harshad's new
  fast_commit mode.

  In addition, thanks to Mauricio for fixing a race where mmap'ed pages
  that are being changed in parallel with a data=journal transaction
  commit could result in bad checksums, which in turn could cause
  journal replays to fail.

  Also notable is Ritesh's buffered write optimization which can result
  in significant improvements on parallel write workloads. (The kernel
  test robot reported a 330.6% improvement on fio.write_iops on a 96
  core system using DAX)

  Besides that, we have the usual miscellaneous cleanups and bug fixes"

Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
  ext4: fix invalid inode checksum
  ext4: add fast commit stats in procfs
  ext4: add a mount opt to forcefully turn fast commits on
  ext4: fast commit recovery path
  jbd2: fast commit recovery path
  ext4: main fast-commit commit path
  jbd2: add fast commit machinery
  ext4 / jbd2: add fast commit initialization
  ext4: add fast_commit feature and handling for extended mount options
  doc: update ext4 and journalling docs to include fast commit feature
  ext4: Detect already used quota file early
  jbd2: avoid transaction reuse after reformatting
  ext4: use the normal helper to get the actual inode
  ext4: fix bs < ps issue reported with dioread_nolock mount opt
  ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  ext4: data=journal: fixes for ext4_page_mkwrite()
  jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
  ext4: use ext4_sb_bread() instead of sb_bread()
  ...
parents f56e65df 13221811
+66 −0
@@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, Ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, Ext4 stores only the minimal delta needed to recreate the
affected metadata in a fast commit area that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, Ext4 performs a traditional full
commit. A full commit invalidates all the fast commits that happened
before it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.
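
The fallback rule in the paragraph above can be sketched as a small userspace model. This is only an illustration of the described policy, not kernel code: ``struct fc_area_model``, ``fc_model_commit()`` and the block counts are hypothetical names and units chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the policy described above; not kernel code. */
enum commit_kind { COMMIT_FAST, COMMIT_FULL };

struct fc_area_model {
	size_t capacity;	/* blocks reserved for fast commits */
	size_t used;		/* blocks consumed since the last full commit */
};

/*
 * Decide which kind of commit to perform. A full commit is chosen when the
 * delta does not fit in the remaining area, when the change cannot be
 * expressed as a fast commit, or when the commit timer has expired.
 */
static enum commit_kind fc_model_commit(struct fc_area_model *a,
					size_t delta_blocks,
					bool fc_eligible, bool timer_expired)
{
	assert(a->used <= a->capacity);

	if (!fc_eligible || timer_expired ||
	    delta_blocks > a->capacity - a->used) {
		/* A full commit invalidates all earlier fast commits. */
		a->used = 0;
		return COMMIT_FULL;
	}
	a->used += delta_blocks;
	return COMMIT_FAST;
}
```

In this model a full commit simply resets ``used`` to zero, which mirrors the statement that a full commit empties the fast commit area.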

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
@@ -609,3 +620,58 @@ bytes long (but uses a full block):
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV)
entries. Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag
and the length of the entire field, followed by a variable-length,
tag-specific value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and extent to be added in this inode
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offset range from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry

   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.

   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag terminates
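
To make the TLV layout concrete, here is a minimal userspace sketch of walking such a log. It rests on several assumptions that are not taken from the patch: the header is modelled as a 16-bit tag followed by a 16-bit length, the length is taken to count only the value bytes, fields are read in host byte order, and the tag numbers are invented. The names ``demo_fc_tl`` and ``demo_fc_count_entries()`` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag numbers, for illustration only. */
enum { DEMO_TAG_PAD = 1, DEMO_TAG_TAIL = 2 };

/* Assumed header layout: 16-bit tag, then 16-bit value length. */
struct demo_fc_tl {
	uint16_t tag;
	uint16_t len;	/* assumed: bytes of value following the header */
};

/*
 * Walk a TLV log and count entries up to and including the first tail tag.
 * Returns the entry count, or -1 if the log is truncated or has no tail.
 */
static int demo_fc_count_entries(const uint8_t *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	while (size - off >= sizeof(struct demo_fc_tl)) {
		struct demo_fc_tl tl;

		memcpy(&tl, buf + off, sizeof(tl));	/* avoid unaligned reads */
		off += sizeof(tl);
		if (tl.len > size - off)
			return -1;	/* value runs past the buffer */
		off += tl.len;
		n++;
		if (tl.tag == DEMO_TAG_TAIL)
			return n;
	}
	return -1;
}
```

A real replay path would dispatch on each tag instead of only counting, and would verify the CRC stored in the tail before trusting the entries.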
+33 −0
@@ -132,6 +132,39 @@ The opportunities for abuse and DOS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.

Fast commits
~~~~~~~~~~~~

JBD2 also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you first need to call
:c:func:`jbd2_fc_init` and tell it how many blocks at the end of the journal
area should be reserved for fast commits. Along with that, you will also need
to set the following callbacks that perform the corresponding work:

`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.

`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
blocks.

The file system is free to perform fast commits as and when it wants, as long
as it gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
commit immediately after stopping the fast commit, it can do so by calling
:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if the fast commit
operation fails for some reason and the only way to guarantee consistency is
for JBD2 to perform the full traditional commit.
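
The begin/end/fallback protocol described above can be illustrated with a userspace mock. The functions below are stand-ins with hypothetical names and simplified signatures; they do not reproduce the real JBD2 prototypes and only model the ordering of calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for a journal; every name here is hypothetical. */
struct mock_journal {
	bool fc_in_progress;
	int fast_commits;
	int full_commits;
};

/* Stand-in for jbd2_fc_begin_commit(): grant permission if none is running. */
static int mock_fc_begin(struct mock_journal *j)
{
	if (j->fc_in_progress)
		return -1;
	j->fc_in_progress = true;
	return 0;
}

/* Stand-in for jbd2_fc_end_commit(): the fast commit succeeded. */
static void mock_fc_end(struct mock_journal *j)
{
	assert(j->fc_in_progress);
	j->fc_in_progress = false;
	j->fast_commits++;
}

/* Stand-in for jbd2_fc_end_commit_fallback(): stop, then do a full commit. */
static void mock_fc_end_fallback(struct mock_journal *j)
{
	assert(j->fc_in_progress);
	j->fc_in_progress = false;
	j->full_commits++;
}

/*
 * File-system side: ask for permission, write the delta, then either finish
 * the fast commit or fall back to a full commit if writing failed.
 * Returns 0 on fast commit, 1 on fallback, -1 if permission was refused.
 */
static int fs_commit(struct mock_journal *j, int (*write_delta)(void))
{
	if (mock_fc_begin(j) != 0)
		return -1;
	if (write_delta() != 0) {
		mock_fc_end_fallback(j);
		return 1;
	}
	mock_fc_end(j);
	return 0;
}

/* Hypothetical delta writers used to exercise both paths. */
static int delta_ok(void) { return 0; }
static int delta_fail(void) { return -1; }
```

Calling ``fs_commit(&j, delta_ok)`` exercises the normal path, while ``fs_commit(&j, delta_fail)`` exercises the fallback path in which the journal performs a full commit instead.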

JBD2 also provides helper functions to manage fast commit buffers. The file
system can use :c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()`
to allocate fast commit buffers and wait on their IO completion.

Currently, only Ext4 implements fast commits. For details of its implementation
of fast commits, please refer to the top level comments in
fs/ext4/fast_commit.c.

Summary
~~~~~~~

+1 −1
@@ -10,7 +10,7 @@ ext4-y := balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
		indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
		mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
		super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
-		xattr_user.o
+		xattr_user.o fast_commit.o

ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
+2 −0
@@ -242,6 +242,7 @@ retry:
	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
	if (IS_ERR(handle))
		return PTR_ERR(handle);
	ext4_fc_start_update(inode);

	if ((type == ACL_TYPE_ACCESS) && acl) {
		error = posix_acl_update_mode(inode, &mode, &acl);
@@ -259,6 +260,7 @@ retry:
	}
out_stop:
	ext4_journal_stop(handle);
	ext4_fc_stop_update(inode);
	if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
		goto retry;
	return error;
+9 −5
@@ -368,7 +368,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
				      struct buffer_head *bh)
{
	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);

	if (buffer_verified(bh))
		return 0;
@@ -495,10 +500,9 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group,
	 */
	set_buffer_new(bh);
	trace_ext4_read_block_bitmap_load(sb, block_group, ignore_locked);
-	bh->b_end_io = ext4_end_bitmap_read;
-	get_bh(bh);
-	submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO |
-		  (ignore_locked ? REQ_RAHEAD : 0), bh);
+	ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO |
+			    (ignore_locked ? REQ_RAHEAD : 0),
+			    ext4_end_bitmap_read);
	return bh;
verify:
	err = ext4_validate_block_bitmap(sb, desc, block_group, bh);