Commit 68cd4492 authored by Theodore Ts'o

Enable ext4 support for per-file/directory dax operations

This adds the same per-file/per-directory DAX support for ext4 as was
done for xfs, now that we finally have consensus over what the
interface should be.
parents 6b8ed620 15ee6567
+139 −3
@@ -20,8 +20,144 @@ Usage
If you have a block device which supports DAX, you can make a filesystem
on it as usual.  The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
size when creating the filesystem.  When mounting it, use the "-o dax"
option on the command line or add 'dax' to the options in /etc/fstab.
size when creating the filesystem.

Currently 3 filesystems support DAX: ext2, ext4 and xfs.  The way DAX is
enabled differs between them.

Enabling DAX on ext2
--------------------

When mounting the filesystem, use the "-o dax" option on the command line or
add 'dax' to the options in /etc/fstab.  This works to enable DAX on all files
within the filesystem.  It is equivalent to the '-o dax=always' behavior below.


Enabling DAX on xfs and ext4
----------------------------

Summary
-------

 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
    the statx flag STATX_ATTR_DAX.  See the manpage for statx(2) for details
    about this access mode.

 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
    files and directories. This advisory flag can be set or cleared at any
    time, but doing so does not immediately affect the S_DAX state.

 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
    be inherited by all regular files and subdirectories that are subsequently
    created in this directory. Files and subdirectories that exist at the time
    this flag is set or cleared on the parent directory are not affected by
    the change.

 4. There exist dax mount options which can override FS_XFLAG_DAX in the
    setting of the S_DAX flag.  Given underlying storage which supports DAX the
    following hold:

    "-o dax=inode"  means "follow FS_XFLAG_DAX" and is the default.

    "-o dax=never"  means "never set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax=always" means "always set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax"        is a legacy option which is an alias for "dax=always".
		    This may be removed in the future so "-o dax=always" is
		    the preferred method for specifying this behavior.

    NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
    the same even when the filesystem is mounted with a dax option.  However,
    in-core inode state (S_DAX) will be overridden until the filesystem is
    remounted with dax=inode and the inode is evicted from kernel memory.

 5. The S_DAX policy can be changed via:

    a) Setting the parent directory FS_XFLAG_DAX as needed before files are
       created

    b) Setting the appropriate dax="foo" mount option

    c) Changing the FS_XFLAG_DAX flag on existing regular files and
       directories.  This has runtime constraints and limitations that are
       described in 6) below.

 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
    the change in behaviour for existing regular files may not occur
    immediately.  If the change must take effect immediately, the administrator
    needs to:

    a) stop the application so there are no active references to the data set
       the policy change will affect

    b) evict the data set from kernel caches so it will be re-instantiated when
       the application is restarted. This can be achieved by:

       i. drop-caches
       ii. a filesystem unmount and mount cycle
       iii. a system reboot
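
The mount-option rules in item 4 can be written down as a small decision
function.  The following is only a userspace model of those rules, not the
kernel's implementation: the enum, parse_dax_option() and effective_s_dax()
are hypothetical names introduced here, and the model ignores any other
per-inode restrictions a filesystem may apply.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical names for the dax mount modes listed in item 4. */
enum dax_mount_mode {
	DAX_MODE_INODE,		/* "-o dax=inode" (default): follow FS_XFLAG_DAX */
	DAX_MODE_NEVER,		/* "-o dax=never": never set S_DAX */
	DAX_MODE_ALWAYS,	/* "-o dax=always", or the legacy "-o dax" */
};

/* Map one mount option string to a mode.  Returns 0, or -1 if unknown. */
int parse_dax_option(const char *opt, enum dax_mount_mode *mode)
{
	if (!strcmp(opt, "dax") || !strcmp(opt, "dax=always"))
		*mode = DAX_MODE_ALWAYS;
	else if (!strcmp(opt, "dax=never"))
		*mode = DAX_MODE_NEVER;
	else if (!strcmp(opt, "dax=inode"))
		*mode = DAX_MODE_INODE;
	else
		return -1;
	return 0;
}

/*
 * Model of the S_DAX decision taken when an inode is instantiated.
 * Only regular files on DAX-capable storage can ever get S_DAX.
 */
bool effective_s_dax(bool media_supports_dax, bool is_regular_file,
		     enum dax_mount_mode mode, bool xflag_dax)
{
	if (!media_supports_dax || !is_regular_file)
		return false;

	switch (mode) {
	case DAX_MODE_NEVER:
		return false;		/* FS_XFLAG_DAX is ignored */
	case DAX_MODE_ALWAYS:
		return true;		/* FS_XFLAG_DAX is ignored */
	case DAX_MODE_INODE:
	default:
		return xflag_dax;	/* follow the persistent flag */
	}
}
```

For instance, a regular file with FS_XFLAG_DAX set gets S_DAX under
dax=inode but not under dax=never, while the persistent flag itself stays
untouched in both cases.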


Details
-------

There are 2 per-file dax flags.  One is a persistent inode setting (FS_XFLAG_DAX)
and the other is a volatile flag indicating the active state of the feature
(S_DAX).

FS_XFLAG_DAX is preserved within the filesystem.  This persistent config
setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
(see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'.
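
As an illustration of the ioctl route, here is a minimal userspace sketch
that toggles FS_XFLAG_DAX on an already opened file or directory.  It assumes
a kernel uapi header <linux/fs.h> that defines FS_XFLAG_DAX; the helper name
set_dax_xflag() is made up for this example.

```c
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_DAX */
#include <sys/ioctl.h>

/*
 * Set (enable != 0) or clear (enable == 0) FS_XFLAG_DAX on fd.
 * Returns 0 on success, -1 with errno set by ioctl() on failure.
 */
int set_dax_xflag(int fd, int enable)
{
	struct fsxattr fsx;

	/* Read the current attributes first so other xflags are preserved. */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	if (enable)
		fsx.fsx_xflags |= FS_XFLAG_DAX;
	else
		fsx.fsx_xflags &= ~(__u32)FS_XFLAG_DAX;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		return -1;
	return 0;
}
```

A caller would typically open() the target, call set_dax_xflag(fd, 1) and
close the descriptor.  A successful call only changes the persistent flag;
the S_DAX state of an inode already in memory does not change until that
inode is evicted and instantiated again.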

New files and directories automatically inherit FS_XFLAG_DAX from
their parent directory _when_ _created_.  Therefore, setting FS_XFLAG_DAX at
directory creation time can be used to set a default behavior for an entire
sub-tree.

To clarify inheritance, here are 3 examples:

Example A:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a
mkdir a/b/c/d
mkdir a/e

	dax: a,e
	no dax: b,c,d

Example B:

mkdir a
xfs_io -c 'chattr +x' a
mkdir -p a/b/c/d

	dax: a,b,c,d
	no dax:

Example C:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a/b/c
mkdir a/b/c/d

	dax: c,d
	no dax: a,b
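
The inheritance rule behind these examples is "copy the parent's flag once,
at creation time".  The sketch below replays Example A against a toy
in-memory model; struct dir_node, make_dir() and set_dax() are hypothetical
stand-ins for directories, mkdir and "chattr +x", not filesystem code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a directory and its persistent FS_XFLAG_DAX flag. */
struct dir_node {
	const struct dir_node *parent;
	bool xflag_dax;
};

/* Models mkdir: the new directory copies the parent's flag when created. */
struct dir_node make_dir(const struct dir_node *parent)
{
	struct dir_node d = { parent, parent ? parent->xflag_dax : false };
	return d;
}

/* Models "chattr +x": only the named directory changes, not its children. */
void set_dax(struct dir_node *d)
{
	d->xflag_dax = true;
}

/* Replays Example A and returns the flags of a, b, c, d, e as bits 0..4. */
unsigned int replay_example_a(void)
{
	struct dir_node a = make_dir(NULL);	/* mkdir -p a/b/c */
	struct dir_node b = make_dir(&a);
	struct dir_node c = make_dir(&b);

	set_dax(&a);				/* chattr +x a */

	struct dir_node d = make_dir(&c);	/* mkdir a/b/c/d */
	struct dir_node e = make_dir(&a);	/* mkdir a/e */

	return (unsigned int)a.xflag_dax |
	       (unsigned int)b.xflag_dax << 1 |
	       (unsigned int)c.xflag_dax << 2 |
	       (unsigned int)d.xflag_dax << 3 |
	       (unsigned int)e.xflag_dax << 4;
}
```

Only bits 0 and 4 are set in the result, matching "dax: a,e" in Example A:
b and c existed before the flag was set on a, and d inherits from c, which
never had the flag.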


The current enabled state (S_DAX) is set when a file inode is instantiated in
memory by the kernel.  It is set based on the underlying media support, the
value of FS_XFLAG_DAX and the filesystem's dax mount option.

statx can be used to query S_DAX.  NOTE that only regular files will ever have
S_DAX set and therefore statx will never indicate that S_DAX is set on
directories.
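
A minimal sketch of such a query from userspace follows.  It assumes glibc
2.28 or later for the statx() wrapper; the fallback definition of
STATX_ATTR_DAX mirrors the value in the kernel's uapi <linux/stat.h> and is
only used when the installed headers predate it.  The helper name
file_is_dax() is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* needed for statx() in glibc */
#endif
#include <fcntl.h>		/* AT_FDCWD */
#include <sys/stat.h>		/* statx(), struct statx */

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000	/* from uapi linux/stat.h */
#endif

/*
 * Returns 1 if S_DAX is active on path, 0 if it is not, and -1 if statx()
 * fails or the filesystem does not report the attribute at all.
 */
int file_is_dax(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) < 0)
		return -1;

	/* stx_attributes_mask says which attribute bits are meaningful. */
	if (!(stx.stx_attributes_mask & STATX_ATTR_DAX))
		return -1;

	return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}
```

Because the result reflects the in-core S_DAX state, calling this right
after changing FS_XFLAG_DAX on a file that is still cached may return the
old value.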

The FS_XFLAG_DAX flag is set (explicitly or through inheritance) even if the
underlying media does not support dax and/or the flag is overridden by a
mount option.



Implementation Tips for Block Driver Writers
+3 −0
@@ -39,3 +39,6 @@ is encrypted as well as the data itself.

Verity files cannot have blocks allocated past the end of the verity
metadata.

Verity and DAX are not compatible and attempts to set both of these flags
on a file will fail.
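
The incompatibility can be pictured as a simple mask test.  The sketch below
copies the flag values and the EXT4_DAX_MUT_EXCL mask from ext4.h;
dax_flags_conflict() is a hypothetical helper written for illustration and
is not the function the kernel uses to enforce the rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inode flag values as defined in ext4.h. */
#define EXT4_ENCRYPT_FL		0x00000800
#define EXT4_JOURNAL_DATA_FL	0x00004000
#define EXT4_VERITY_FL		0x00100000
#define EXT4_DAX_FL		0x02000000

/* Flags which are mutually exclusive to DAX. */
#define EXT4_DAX_MUT_EXCL (EXT4_VERITY_FL | EXT4_ENCRYPT_FL |\
			   EXT4_JOURNAL_DATA_FL)

/* True if flags request DAX together with any flag that excludes it. */
bool dax_flags_conflict(uint32_t flags)
{
	return (flags & EXT4_DAX_FL) && (flags & EXT4_DAX_MUT_EXCL);
}
```

A flag-setting path could run a test like this on the requested flag word
and reject the request when it returns true.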
+3 −3
@@ -634,7 +634,7 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)

static inline void loop_update_dio(struct loop_device *lo)
{
	__loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
	__loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) |
				lo->use_dio);
}

@@ -1028,7 +1028,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
		blk_queue_write_cache(lo->lo_queue, true, false);

	if (io_is_direct(lo->lo_backing_file) && inode->i_sb->s_bdev) {
	if ((lo->lo_backing_file->f_flags & O_DIRECT) && inode->i_sb->s_bdev) {
		/* In case of direct I/O, match underlying block size */
		unsigned short bsize = bdev_logical_block_size(
			inode->i_sb->s_bdev);
+19 −0
@@ -647,6 +647,10 @@ static inline bool retain_dentry(struct dentry *dentry)
		if (dentry->d_op->d_delete(dentry))
			return false;
	}

	if (unlikely(dentry->d_flags & DCACHE_DONTCACHE))
		return false;

	/* retain; LRU fodder */
	dentry->d_lockref.count--;
	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
@@ -656,6 +660,21 @@ static inline bool retain_dentry(struct dentry *dentry)
	return true;
}

void d_mark_dontcache(struct inode *inode)
{
	struct dentry *de;

	spin_lock(&inode->i_lock);
	hlist_for_each_entry(de, &inode->i_dentry, d_u.d_alias) {
		spin_lock(&de->d_lock);
		de->d_flags |= DCACHE_DONTCACHE;
		spin_unlock(&de->d_lock);
	}
	inode->i_state |= I_DONTCACHE;
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_mark_dontcache);

/*
 * Finish off a dentry we've decided to kill.
 * dentry->d_lock must be held, returns with it unlocked.
+20 −7
@@ -426,13 +426,16 @@ struct flex_groups {
#define EXT4_VERITY_FL			0x00100000 /* Verity protected inode */
#define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
/* 0x00400000 was formerly EXT4_EOFBLOCKS_FL */

#define EXT4_DAX_FL			0x02000000 /* Inode is DAX */

#define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
#define EXT4_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
#define EXT4_CASEFOLD_FL		0x40000000 /* Casefolded directory */
#define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */

#define EXT4_FL_USER_VISIBLE		0x705BDFFF /* User visible flags */
#define EXT4_FL_USER_MODIFIABLE		0x604BC0FF /* User modifiable flags */
#define EXT4_FL_USER_VISIBLE		0x725BDFFF /* User visible flags */
#define EXT4_FL_USER_MODIFIABLE		0x624BC0FF /* User modifiable flags */

/* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
#define EXT4_FL_XFLAG_VISIBLE		(EXT4_SYNC_FL | \
@@ -440,14 +443,16 @@ struct flex_groups {
					 EXT4_APPEND_FL | \
					 EXT4_NODUMP_FL | \
					 EXT4_NOATIME_FL | \
					 EXT4_PROJINHERIT_FL)
					 EXT4_PROJINHERIT_FL | \
					 EXT4_DAX_FL)

/* Flags that should be inherited by new inodes from their parent. */
#define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
			   EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
			   EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL)
			   EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL |\
			   EXT4_DAX_FL)

/* Flags that are appropriate for regular files (all but dir-specific ones). */
#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL | EXT4_CASEFOLD_FL |\
@@ -459,6 +464,10 @@ struct flex_groups {
/* The only flags that should be swapped */
#define EXT4_FL_SHOULD_SWAP (EXT4_HUGE_FILE_FL | EXT4_EXTENTS_FL)

/* Flags which are mutually exclusive to DAX */
#define EXT4_DAX_MUT_EXCL (EXT4_VERITY_FL | EXT4_ENCRYPT_FL |\
			   EXT4_JOURNAL_DATA_FL)

/* Mask out flags that are inappropriate for the given type of inode. */
static inline __u32 ext4_mask_flags(umode_t mode, __u32 flags)
{
@@ -499,6 +508,7 @@ enum {
	EXT4_INODE_VERITY	= 20,	/* Verity protected inode */
	EXT4_INODE_EA_INODE	= 21,	/* Inode used for large EA */
/* 22 was formerly EXT4_INODE_EOFBLOCKS */
	EXT4_INODE_DAX		= 25,	/* Inode is DAX */
	EXT4_INODE_INLINE_DATA	= 28,	/* Data in inode. */
	EXT4_INODE_PROJINHERIT	= 29,	/* Create with parents projid */
	EXT4_INODE_CASEFOLD	= 30,	/* Casefolded directory */
@@ -1135,9 +1145,9 @@ struct ext4_inode_info {
#define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
#ifdef CONFIG_FS_DAX
#define EXT4_MOUNT_DAX			0x00200	/* Direct Access */
#define EXT4_MOUNT_DAX_ALWAYS		0x00200	/* Direct Access */
#else
#define EXT4_MOUNT_DAX			0
#define EXT4_MOUNT_DAX_ALWAYS		0
#endif
#define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
@@ -1180,6 +1190,8 @@ struct ext4_inode_info {
						      blocks */
#define EXT4_MOUNT2_HURD_COMPAT		0x00000004 /* Support HURD-castrated
						      file systems */
#define EXT4_MOUNT2_DAX_NEVER		0x00000008 /* Do not allow Direct Access */
#define EXT4_MOUNT2_DAX_INODE		0x00000010 /* For printing options only */

#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
						specified journal checksum */
@@ -1991,6 +2003,7 @@ static inline bool ext4_has_incompat_features(struct super_block *sb)
 */
#define EXT4_FLAGS_RESIZING	0
#define EXT4_FLAGS_SHUTDOWN	1
#define EXT4_FLAGS_BDEV_IS_DAX	2

static inline int ext4_forced_shutdown(struct ext4_sb_info *sbi)
{
@@ -2704,7 +2717,7 @@ extern int ext4_can_truncate(struct inode *inode);
extern int ext4_truncate(struct inode *);
extern int ext4_break_layouts(struct inode *);
extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_set_inode_flags(struct inode *, bool init);
extern int ext4_alloc_da_blocks(struct inode *inode);
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);