docs: fs: convert docs without extension to ReST (ec23eb54)

Documentation/filesystems/directory-locking→Documentation/filesystems/directory-locking.rst

+25 −15

Original line number	Diff line number	Diff line
		=================
		Directory Locking
		=================


		Locking scheme used for directory operations is based on two
		kinds of locks - per-inode (->i_rwsem) and per-filesystem
		(->s_vfs_rename_mutex).
		@@ -27,14 +32,17 @@ NB: we might get away with locking the the source (and target in exchange
		case) shared.

		5) link creation. Locking rules:

		* lock parent
		* check that source is not a directory
		* lock source
		* call the method.

		All locks are exclusive.

		6) cross-directory rename. The trickiest in the whole bunch. Locking
		rules:

		* lock the filesystem
		* lock parents in "ancestors first" order.
		* find source and target.
		@@ -46,6 +54,7 @@ rules:
		* If the target exists, lock it. If the source is a non-directory,
		lock it. If we need to lock both, do so in inode pointer order.
		* call the method.

		All ->i_rwsem are taken exclusive. Again, we might get away with locking
		the source (and target in exchange case) shared.
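The "lock both in inode pointer order" rule above can be illustrated with a small userspace sketch. This is not kernel code: `struct toy_inode`, its `atomic_flag` spinlock (standing in for an exclusively held ->i_rwsem) and all function names are assumptions made for illustration. The point it shows is that two callers locking the same pair always agree on which lock is taken first, so neither can hold one lock while waiting for the other in the opposite order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; only the lock matters here. */
struct toy_inode {
	atomic_flag held;	/* toy exclusive lock, plays the role of ->i_rwsem */
};

static void toy_inode_init(struct toy_inode *inode)
{
	atomic_flag_clear(&inode->held);
}

static void toy_lock(struct toy_inode *inode)
{
	/* Spin until the flag was previously clear, i.e. we took the lock. */
	while (atomic_flag_test_and_set_explicit(&inode->held,
						 memory_order_acquire))
		;
}

static void toy_unlock(struct toy_inode *inode)
{
	atomic_flag_clear_explicit(&inode->held, memory_order_release);
}

/*
 * Lock two distinct objects in ascending address order. Returns the
 * object that was locked first, so callers can observe the ordering.
 */
static struct toy_inode *lock_two_in_pointer_order(struct toy_inode *a,
						   struct toy_inode *b)
{
	struct toy_inode *first = a, *second = b;

	assert(a != b);
	if ((uintptr_t)a > (uintptr_t)b) {
		first = b;
		second = a;
	}
	toy_lock(first);
	toy_lock(second);
	return first;
}

static void unlock_two(struct toy_inode *a, struct toy_inode *b)
{
	toy_unlock(a);
	toy_unlock(b);
}
```

Whichever order the two arguments are passed in, the lower-addressed object is locked first; that consistency is the property the deadlock-freedom argument relies on for this step.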

		@@ -54,6 +63,7 @@ read, modified or removed by method will be locked by caller.


		If no directory is its own ancestor, the scheme above is deadlock-free.

		Proof:

		First of all, at any moment we have a partial ordering of the

Documentation/filesystems/index.rst

+2 −0

		@@ -20,6 +20,8 @@ algorithms work.
		path-lookup
		api-summary
		splice
		locking
		directory-locking

		Filesystem support layers
		=========================

Documentation/filesystems/Locking→Documentation/filesystems/locking.rst

+174 −85

		=======
		Locking
		=======

		The text below describes the locking rules for VFS-related methods.
		It is (believed to be) up-to-date. Please, if you change anything in
		prototypes or locking protocols - update this file. And update the relevant
		@@ -5,10 +9,14 @@ instances in the tree, don't leave that to maintainers of filesystems/devices/
		etc. At the very least, put the list of dubious cases at the end of this file.
		Don't turn it into a log - maintainers of out-of-tree code are supposed to
		be able to use diff(1).

		Thing currently missing here: socket operations. Alexey?

		--------------------------- dentry_operations --------------------------
		prototypes:
		dentry_operations
		=================

		prototypes::

			int (*d_revalidate)(struct dentry *, unsigned int);
			int (*d_weak_revalidate)(struct dentry *, unsigned int);
			int (*d_hash)(const struct dentry *, struct qstr *);
		@@ -24,7 +32,10 @@ prototypes:
			struct dentry *(*d_real)(struct dentry *, const struct inode *);

		locking rules:
		rename_lock ->d_lock may block rcu-walk

		================== =========== ======== ============== ========
		ops                rename_lock ->d_lock may block      rcu-walk
		================== =========== ======== ============== ========
		d_revalidate:      no          no       yes (ref-walk) maybe
		d_weak_revalidate: no          no       yes            no
		d_hash             no          no       no             maybe
		@@ -38,9 +49,13 @@ d_dname: no no no no
		d_automount:       no          no       yes            no
		d_manage:          no          no       yes (ref-walk) maybe
		d_real             no          no       yes            no
		================== =========== ======== ============== ========

		inode_operations
		================

		prototypes::

		--------------------------- inode_operations ---------------------------
		prototypes:
			int (*create) (struct inode *,struct dentry *,umode_t, bool);
			struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
			int (*link) (struct dentry *,struct inode *,struct dentry *);
		@@ -68,7 +83,10 @@ prototypes:

		locking rules:
		all may block
		i_rwsem(inode)

		============ =============================================
		ops          i_rwsem(inode)
		============ =============================================
		lookup:      shared
		create:      exclusive
		link:        exclusive (both)
		@@ -89,17 +107,21 @@ fiemap: no
		update_time: no
		atomic_open: exclusive
		tmpfile:     no
		============ =============================================


		Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
		exclusive on victim.
		cross-directory ->rename() has (per-superblock) ->s_vfs_rename_mutex.

		See Documentation/filesystems/directory-locking for more detailed discussion
		See Documentation/filesystems/directory-locking.rst for more detailed discussion
		of the locking scheme for directory operations.

		----------------------- xattr_handler operations -----------------------
		prototypes:
		xattr_handler operations
		========================

		prototypes::

			bool (*list)(struct dentry *dentry);
			int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
				   struct inode *inode, const char *name, void *buffer,
		@@ -110,13 +132,20 @@ prototypes:

		locking rules:
		all may block
		i_rwsem(inode)

		===== ==============
		ops   i_rwsem(inode)
		===== ==============
		list: no
		get:  no
		set:  exclusive
		===== ==============

		super_operations
		================

		prototypes::

		--------------------------- super_operations ---------------------------
		prototypes:
			struct inode *(*alloc_inode)(struct super_block *sb);
			void (*free_inode)(struct inode *);
			void (*destroy_inode)(struct inode *);
		@@ -138,7 +167,10 @@ prototypes:

		locking rules:
		All may block [not true, see below]
		s_umount

		====================== ============ ========================
		ops                    s_umount     note
		====================== ============ ========================
		alloc_inode:
		free_inode:                         called from RCU callback
		destroy_inode:
		@@ -157,6 +189,7 @@ show_options: no (namespace_sem)
		quota_read:            no           (see below)
		quota_write:           no           (see below)
		bdev_try_to_free_page: no           (see below)
		====================== ============ ========================

		->statfs() has s_umount (shared) when called by ustat(2) (native or
		compat), but that's an accident of bad API; s_umount is used to pin
		@@ -164,31 +197,44 @@ the superblock down when we only have dev_t given us by userland to
		identify the superblock. Everything else (statfs(), fstatfs(), etc.)
		doesn't hold it when calling ->statfs() - superblock is pinned down
		by resolving the pathname passed to syscall.

		->quota_read() and ->quota_write() functions are both guaranteed to
		be the only ones operating on the quota file by the quota code (via
		dqio_sem) (unless an admin really wants to screw up something and
		writes to quota files with quotas on). For other details about locking
		see also dquot_operations section.

		->bdev_try_to_free_page is called from the ->releasepage handler of
		the block device inode. See there for more details.

		--------------------------- file_system_type ---------------------------
		prototypes:
		file_system_type
		================

		prototypes::

			struct dentry *(*mount) (struct file_system_type *, int,
				       const char *, void *);
			void (*kill_sb) (struct super_block *);

		locking rules:
		may block

		======= =========
		ops     may block
		======= =========
		mount   yes
		kill_sb yes
		======= =========

		->mount() returns ERR_PTR or the root dentry; its superblock should be locked
		on return.

		->kill_sb() takes a write-locked superblock, does all shutdown work on it,
		unlocks and drops the reference.

		--------------------------- address_space_operations --------------------------
		prototypes:
		address_space_operations
		========================
		prototypes::

			int (*writepage)(struct page *page, struct writeback_control *wbc);
			int (*readpage)(struct file *, struct page *);
			int (*writepages)(struct address_space *, struct writeback_control *);
		@@ -218,7 +264,9 @@ prototypes:
		locking rules:
		All except set_page_dirty and freepage may block

		PageLocked(page) i_rwsem
		====================== ======================== =========
		ops                    PageLocked(page)         i_rwsem
		====================== ======================== =========
		writepage:             yes, unlocks (see below)
		readpage:              yes, unlocks
		writepages:
		@@ -239,6 +287,7 @@ is_partially_uptodate: yes
		error_remove_page:     yes
		swap_activate:         no
		swap_deactivate:       no
		====================== ======================== =========

		->write_begin(), ->write_end() and ->readpage() may be called from
		the request handler (/dev/loop).
		@@ -299,10 +348,10 @@ in the filesystem like having dirty inodes at umount and losing written data.

		->writepages() is used for periodic writeback and for syscall-initiated
		sync operations. The address_space should start I/O against at least
		nr_to_write pages. nr_to_write must be decremented for each page which is
		written. The address_space implementation may write more (or less) pages
		than *nr_to_write asks for, but it should try to be reasonably close. If
		nr_to_write is NULL, all dirty pages must be written.
		``nr_to_write`` pages. ``nr_to_write`` must be decremented for each page
		which is written. The address_space implementation may write more (or less)
		pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
		If nr_to_write is NULL, all dirty pages must be written.

		writepages should _only_ write pages which are present on
		mapping->io_pages.
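The nr_to_write accounting described above can be sketched in userspace as follows. Everything here is an assumption made for illustration: `struct toy_wbc` is a much-reduced stand-in for struct writeback_control, the "mapping" is just an array of dirty flags, and `toy_writepages()` is a hypothetical name. The sketch shows only the budget handling (one decrement per page written, stop when the budget is used up); it does not model the case where all dirty pages must be written regardless of budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct writeback_control. */
struct toy_wbc {
	long nr_to_write;	/* pages the caller would like written */
};

/*
 * Walk a toy "mapping" (an array of dirty flags), cleaning dirty pages
 * and charging each one against wbc->nr_to_write. Stops once the budget
 * is used up. Returns the number of pages written.
 */
static long toy_writepages(bool *dirty, size_t npages, struct toy_wbc *wbc)
{
	long written = 0;

	assert(wbc != NULL);
	for (size_t i = 0; i < npages && wbc->nr_to_write > 0; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;	/* pretend I/O was started on this page */
		wbc->nr_to_write--;	/* decrement once per page written */
		written++;
	}
	return written;
}
```

A real implementation may overshoot or undershoot the budget slightly, as the text allows; the toy version stops exactly at zero only to keep the accounting easy to follow.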
		@@ -344,23 +393,34 @@ address space operations.
		->swap_deactivate() will be called in the sys_swapoff()
		path after ->swap_activate() returned success.

		----------------------- file_lock_operations ------------------------------
		prototypes:
		file_lock_operations
		====================

		prototypes::

			void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
			void (*fl_release_private)(struct file_lock *);


		locking rules:
		inode->i_lock may block

		=================== ============= =========
		ops                 inode->i_lock may block
		=================== ============= =========
		fl_copy_lock:       yes           no
		fl_release_private: maybe         maybe[1]
		fl_release_private: maybe         maybe[1]_
		=================== ============= =========

		[1]: ->fl_release_private for flock or POSIX locks is currently allowed
		.. [1]:
		->fl_release_private for flock or POSIX locks is currently allowed
		to block. Leases however can still be freed while the i_lock is held and
		so fl_release_private called on a lease should not block.

		----------------------- lock_manager_operations ---------------------------
		prototypes:
		lock_manager_operations
		=======================

		prototypes::

			void (*lm_notify)(struct file_lock *);  /* unblock callback */
			int (*lm_grant)(struct file_lock *, struct file_lock *, int);
			void (*lm_break)(struct file_lock *); /* break_lease callback */
		@@ -368,24 +428,33 @@ prototypes:

		locking rules:

		inode->i_lock blocked_lock_lock may block
		========== ============= ================= =========
		ops        inode->i_lock blocked_lock_lock may block
		========== ============= ================= =========
		lm_notify: yes           yes               no
		lm_grant:  no            no                no
		lm_break:  yes           no                no
		lm_change  yes           no                no
		========== ============= ================= =========

		buffer_head
		===========

		prototypes::

		--------------------------- buffer_head -----------------------------------
		prototypes:
			void (*b_end_io)(struct buffer_head *bh, int uptodate);

		locking rules:

		Called from interrupts. In other words, extreme care is needed here.
		bh is locked, but that is the only guarantee we have here. Currently only RAID1,
		highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
		call this method upon the IO completion.

		--------------------------- block_device_operations -----------------------
		prototypes:
		block_device_operations
		=======================
		prototypes::

			int (*open) (struct block_device *, fmode_t);
			int (*release) (struct gendisk *, fmode_t);
			int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
		@@ -399,7 +468,10 @@ prototypes:
			void (*swap_slot_free_notify) (struct block_device *, unsigned long);

		locking rules:
		bd_mutex

		======================= ===================
		ops                     bd_mutex
		======================= ===================
		open:                   yes
		release:                yes
		ioctl:                  no
		@@ -410,6 +482,7 @@ unlock_native_capacity: no
		revalidate_disk:        no
		getgeo:                 no
		swap_slot_free_notify:  no (see below)
		======================= ===================

		media_changed, unlock_native_capacity and revalidate_disk are called only from
		check_disk_change().
		@@ -418,8 +491,11 @@ swap_slot_free_notify is called with swap_lock and sometimes the page lock
		held.


		--------------------------- file_operations -------------------------------
		prototypes:
		file_operations
		===============

		prototypes::

			loff_t (*llseek) (struct file *, loff_t, int);
			ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
			ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
		@@ -455,7 +531,6 @@ prototypes:
					size_t, unsigned int);
			int (*setlease)(struct file *, long, struct file_lock **, void **);
			long (*fallocate)(struct file *, int, loff_t, loff_t);
		};

		locking rules:
		All may block.
		@@ -490,8 +565,11 @@ in sys_read() and friends.
		the lease within the individual filesystem to record the result of the
		operation

		--------------------------- dquot_operations -------------------------------
		prototypes:
		dquot_operations
		================

		prototypes::

			int (*write_dquot) (struct dquot *);
			int (*acquire_dquot) (struct dquot *);
			int (*release_dquot) (struct dquot *);
		@@ -503,20 +581,26 @@ a proper locking wrt the filesystem and call the generic quota operations.

		What filesystem should expect from the generic quota functions:

		FS recursion Held locks when called
		============== ============ =========================
		ops            FS recursion Held locks when called
		============== ============ =========================
		write_dquot:   yes          dqonoff_sem or dqptr_sem
		acquire_dquot: yes          dqonoff_sem or dqptr_sem
		release_dquot: yes          dqonoff_sem or dqptr_sem
		mark_dirty:    no           -
		write_info:    yes          dqonoff_sem
		============== ============ =========================

		FS recursion means calling ->quota_read() and ->quota_write() from superblock
		operations.

		More details about quota locking can be found in fs/quota/dquot.c.

		--------------------------- vm_operations_struct -----------------------------
		prototypes:
		vm_operations_struct
		====================

		prototypes::

			void (*open)(struct vm_area_struct*);
			void (*close)(struct vm_area_struct*);
			vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
		@@ -525,7 +609,10 @@ prototypes:
			int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);

		locking rules:
		mmap_sem PageLocked(page)

		============= ======== ===========================
		ops           mmap_sem PageLocked(page)
		============= ======== ===========================
		open:         yes
		close:        yes
		fault:        yes      can return with page locked
		@@ -533,6 +620,7 @@ map_pages: yes
		page_mkwrite: yes      can return with page locked
		pfn_mkwrite:  yes
		access:       yes
		============= ======== ===========================

		->fault() is called when a previously not present pte is about
		to be faulted in. The filesystem must find and return the page associated
		@@ -569,7 +657,8 @@ access_process_vm(), typically used to debug a process through
		/proc/pid/mem or ptrace. This function is needed only for
		VM_IO \| VM_PFNMAP VMAs.

		================================================================================
		--------------------------------------------------------------------------------

		Dubious stuff

		(if you break something or notice that it is broken and do not fix it yourself

Documentation/filesystems/nfs/Exporting→Documentation/filesystems/nfs/exporting.rst

+18 −13

		:orphan:

		Making Filesystems Exportable
		=============================
		@@ -42,9 +43,9 @@ filehandle fragment, there is no automatic creation of a path prefix
		for the object. This leads to two related but distinct features of
		the dcache that are not needed for normal filesystem access.

		1/ The dcache must sometimes contain objects that are not part of the
		1. The dcache must sometimes contain objects that are not part of the
		proper prefix, i.e. that are not connected to the root.
		2/ The dcache must be prepared for a newly found (via ->lookup) directory
		2. The dcache must be prepared for a newly found (via ->lookup) directory
		to already have a (non-connected) dentry, and must be able to move
		that dentry into place (based on the parent and name in the
		->lookup). This is particularly needed for directories as
		@@ -52,7 +53,7 @@ the dcache that are not needed for normal filesystem access.

		To implement these features, the dcache has:

		a/ A dentry flag DCACHE_DISCONNECTED which is set on
		a. A dentry flag DCACHE_DISCONNECTED which is set on
		any dentry that might not be part of the proper prefix.
		This is set when anonymous dentries are created, and cleared when a
		dentry is noticed to be a child of a dentry which is in the proper
		@@ -71,19 +72,23 @@ a/ A dentry flag DCACHE_DISCONNECTED which is set on
		dentries. That guarantees that we won't need to hunt them down upon
		umount.

		b/ A primitive for creation of secondary roots - d_obtain_root(inode).
		b. A primitive for creation of secondary roots - d_obtain_root(inode).
		Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the
		per-superblock list (->s_roots), so they can be located at umount
		time for eviction purposes.

		c/ Helper routines to allocate anonymous dentries, and to help attach
		c. Helper routines to allocate anonymous dentries, and to help attach
		loose directory dentries at lookup time. They are:

		d_obtain_alias(inode) will return a dentry for the given inode.
		If the inode already has a dentry, one of those is returned.

		If it doesn't, a new anonymous (IS_ROOT and
		DCACHE_DISCONNECTED) dentry is allocated and attached.

		In the case of a directory, care is taken that only one dentry
		can ever be attached.

		d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
		either the passed-in dentry or a preexisting alias for the given inode
		(such as an anonymous one created by d_obtain_alias), if appropriate.
		@@ -95,17 +100,17 @@ Filesystem Issues

		For a filesystem to be exportable it must:

		1/ provide the filehandle fragment routines described below.
		2/ make sure that d_splice_alias is used rather than d_add
		1. provide the filehandle fragment routines described below.
		2. make sure that d_splice_alias is used rather than d_add
		when ->lookup finds an inode for a given parent and name.

		If inode is NULL, d_splice_alias(inode, dentry) is equivalent to
		If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::

		d_add(dentry, inode), NULL

		Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)

		Typically the ->lookup routine will simply end with a:
		Typically the ->lookup routine will simply end with a::

		return d_splice_alias(inode, dentry);
		}
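The two return-value conventions stated above (a NULL inode behaves like d_add() followed by returning NULL, and an ERR_PTR inode is passed straight through) can be modelled in userspace as below. This is only a toy: every `toy_*` name and type is an assumption made for illustration, the error-pointer helpers merely imitate the kernel's convention of encoding small negative errno values in the top of the address range, and the case where a preexisting alias is spliced in and returned is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095	/* assumed size of the error range */

/* Hypothetical stand-ins for struct inode and struct dentry. */
struct toy_inode { int ino; };
struct toy_dentry { struct toy_inode *inode; };

/* Encode a negative errno value as a pointer. */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if the pointer falls in the reserved error range. */
static int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/*
 * Toy model of the two documented cases plus the plain "attach" case:
 *   - error pointer in, same error out, dentry untouched;
 *   - NULL inode: dentry becomes negative, NULL is returned;
 *   - valid inode with no alias: dentry is attached, NULL is returned.
 */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
					   struct toy_dentry *dentry)
{
	if (toy_is_err(inode))
		return toy_err_ptr(toy_ptr_err(inode));
	dentry->inode = inode;	/* NULL here means a negative dentry */
	return NULL;		/* the passed-in dentry was used */
}
```

Because all three outcomes come back through a single return value, a lookup routine written against this convention can end with one `return` statement and needs no separate error branch.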

Documentation/filesystems/vfs.rst

+1 −1

		@@ -20,7 +20,7 @@ kernel which allows different filesystem implementations to coexist.

		VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
		are called from a process context. Filesystem locking is described in
		the document Documentation/filesystems/Locking.
		the document Documentation/filesystems/locking.rst.


		Directory Entry Cache (dcache)
