XFS
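XFS is a high-performance journaling file system created by Silicon Graphics, Inc. XFS is particularly proficient at parallel IO thanks to its allocation group based design. When the file system spans multiple storage devices, this design allows the number of IO threads, the file system bandwidth, and the file and file system sizes to scale extremely well.
To use the XFS user-space utilities, install the xfsprogs package. It contains the tools necessary to manage an XFS file system.
To create a new file system on device, use: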
# mkfs.xfs device
Sample output:
meta-data=/dev/device isize=256 agcount=4, agsize=3277258 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=13109032, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=6400, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
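- The -L label option can be used to assign a label to the file system.
- When using mkfs.xfs on a block device that already contains a file system, add the -f option to overwrite it.[3] This destroys all data in the old file system!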
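xfsprogs 3.2.0 introduced a new on-disk format (v5) that includes a metadata checksum scheme called Self-Describing Metadata.
Based on CRC32, it provides additional protection against metadata corruption (for example, after an unexpected power loss). Checksums are enabled by default when using xfsprogs 3.2.3 or later. If the file system must be mountable read-write by older kernels, the feature can be disabled by passing -m crc=0 to mkfs.xfs(8):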
# mkfs.xfs -m crc=0 /dev/target_partition
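The XFS v5 on-disk format is considered stable for production workloads starting with Linux kernel 3.15.
Starting with Linux 3.16, XFS has a B+ tree for indexing free inodes. It is equivalent to the existing B+ tree that indexes allocated inodes, except that the free inode B+ tree tracks only inode chunks containing at least one free inode. The purpose is to speed up the search for free inode clusters during inode allocation. It improves performance on aged file systems, i.e. those where millions of files have been written and deleted over months or years. Using this feature does not affect overall file system reliability or recoverability.
This feature relies on the v5 on-disk format, which has been considered stable for production use since Linux kernel 3.15. It does not change existing on-disk structures, but adds a new one that must be kept consistent with the inode allocation B+ tree; for this reason, older kernels can only mount file systems with the free inode B+ tree read-only.
The feature is enabled by default when using xfsprogs 3.2.3 or later. If you need a file system that older kernels can write to, it can be disabled with the finobt=0 switch when formatting an XFS partition. It needs to be combined with crc=0: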
# mkfs.xfs -m crc=0,finobt=0 /dev/target_partition
Or, in short (finobt depends on crc):
# mkfs.xfs -m crc=0 /dev/target_partition
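The choice between the default format and the compatibility options can be sketched as a small shell helper. This is only an illustration: the function names version_lt and mkfs_meta_opts are hypothetical, the 3.16 threshold is taken from the text above, the helper conservatively disables both features for any older kernel, and it assumes a sort that supports -V (as GNU sort does).

```shell
#!/bin/sh
# Hypothetical helpers, not part of xfsprogs.

# Succeeds if version $1 is strictly older than version $2.
# Assumption: sort supports -V (version sort), as GNU sort does.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print the mkfs.xfs -m options for the oldest kernel that must be able
# to mount the file system read-write. Prints nothing useful (an empty
# line) when the mkfs.xfs defaults are fine.
mkfs_meta_opts() {
    kernel=$1
    if version_lt "$kernel" 3.16; then
        # Kernel predates the free inode B+ tree: disable it together
        # with CRCs, as in the command shown in the text.
        echo "-m crc=0,finobt=0"
    else
        echo ""
    fi
}
```

For example, `mkfs.xfs $(mkfs_meta_opts 3.10.0) /dev/target_partition` would expand to the compatibility command shown earlier, while a recent kernel version leaves the defaults untouched.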
The reverse mapping btree is at its core:
- a secondary index of storage space usage that effectively provides a redundant copy of primary space usage metadata. This adds some overhead to filesystem operations, but its inclusion in a filesystem makes cross-referencing very fast. It is an essential feature for repairing filesystems online because we can rebuild damaged primary metadata from the secondary copy.
- The feature graduated from EXPERIMENTAL status in Linux 4.16 and is production ready. However, online filesystem checking and repair is (so far) the only use case for this feature, so it will remain opt-in at least until online checking graduates to production readiness.
From mkfs.xfs(8) § OPTIONS:
- The reverse mapping btree maps filesystem blocks to the owner of the filesystem block. Most of the mappings will be to an inode number and an offset, though there will also be mappings to filesystem metadata. This secondary metadata can be used to validate the primary metadata or to pinpoint exactly which data has been lost when a disk error occurs.
See also [5] and [6] for more information.
This feature is enabled by default for new filesystems as of xfsprogs 6.5.0.
Starting in Linux 5.10, XFS supports using refactored "timestamp and inode encoding functions to handle timestamps as a 64-bit nanosecond counter and bit shifting to increase the effective size. This now allows XFS to run well past the Year 2038 problem to now the Year 2486. Making a new XFS file-system with bigtime enabled allows a timestamp range from December 1901 to July 2486 rather than December 1901 to January 2038." The feature will also allow quota timer expirations from January 1970 to July 2486 rather than January 1970 to February 2106.
Big timestamps are enabled by default for new filesystems as of xfsprogs 5.15.
Use xfs_info(8) to check whether big timestamps are enabled on an existing file system:
# xfs_info / | grep bigtime
... bigtime=0 ...
With xfsprogs 5.11 and newer, an existing unmounted file system can be upgraded using xfs_admin(8):
# xfs_admin -O bigtime=1 device
Or with xfs_repair(8):
# xfs_repair -c bigtime=1 device
You may also want to enable inobtcount at the same time (another new default).
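The check above can be wrapped in a small helper when scripting an upgrade decision. This is a minimal sketch: the function name bigtime_enabled is hypothetical, and the sample line only imitates the style of xfs_info output for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads xfs_info output on stdin and succeeds only
# if it reports bigtime=1. Output without a bigtime field (as from old
# xfsprogs versions) is treated as "not enabled".
bigtime_enabled() {
    grep -q 'bigtime=1'
}

# Illustrative sample line, not captured from a real system.
sample='         =                       reflink=1    bigtime=0 inobtcount=0'

if printf '%s\n' "$sample" | bigtime_enabled; then
    echo "bigtime is already enabled"
else
    echo "bigtime is disabled; the file system could be upgraded while unmounted"
fi
```

On a real system the helper would be fed from the tool itself, e.g. `xfs_info / | bigtime_enabled`.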
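From the XFS FAQ:
- The default values are already optimised for performance. mkfs.xfs detects the difference between single-disk and MD/DM RAID setups and changes the file system defaults accordingly.
- In general, the only case where you need to specify the stripe unit and width to mkfs.xfs is on hardware RAID.
(See #Stripe size and width for details.)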
The largeio and swalloc values, together with logbsize and allocsize values larger than the defaults, may improve performance. The following articles provide more details:
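- For mount options, the only one that measurably affects metadata performance is logbsize. Increasing logbsize reduces the number of journal IOs for a given workload, but more of the most recent modifications may be missing after recovery if the system crashes while many changes are in flight.
- As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.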
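The scheduler in use can be verified from the contents of /sys/block/nvme*n*/queue/scheduler. In most cases, following #Creation is therefore enough to obtain optimal performance.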
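If the file system resides on a striped RAID, specifying the stripe size in the mkfs.xfs(8) command can yield a significant performance gain.
XFS can sometimes detect the geometry under software RAID, but if you reshape the array or use hardware RAID, see how to calculate the correct sunit and swidth values for optimal performance.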
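As a rough sketch of that calculation, the helper below derives the su/sw form of the mkfs.xfs -d stripe options from the RAID chunk size and disk count. The function names data_disks and stripe_opts are hypothetical, and the data-disk counts are the usual rules of thumb (all disks for RAID 0, one parity disk for RAID 5, two for RAID 6, half the disks for a two-copy RAID 10); unusual layouts need to be worked out by hand.

```shell
#!/bin/sh
# Hypothetical helpers for deriving mkfs.xfs stripe geometry options.

# Print the number of data-bearing disks for a RAID level and disk count.
# Assumption: RAID 10 keeps two copies of every block.
data_disks() {
    level=$1
    total=$2
    case $level in
        0)  echo "$total" ;;
        5)  echo $((total - 1)) ;;
        6)  echo $((total - 2)) ;;
        10) echo $((total / 2)) ;;
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

# Print the -d option for a chunk size in KiB, a RAID level and a disk count.
# su is the stripe unit (chunk size), sw the number of data disks.
stripe_opts() {
    n=$(data_disks "$2" "$3") || return 1
    echo "-d su=${1}k,sw=${n}"
}

stripe_opts 64 5 4   # prints: -d su=64k,sw=3
```

The printed option would then be passed along when creating the file system, e.g. `mkfs.xfs -d su=64k,sw=3 device`.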
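On some file systems, performance can be improved by adding the noatime mount option in /etc/fstab. For XFS, the default atime behaviour is relatime, which has almost no overhead compared to noatime while still maintaining sane atime values. All Linux file systems now use this as the default (since around 2.6.30), but XFS has had relatime-like behaviour since 2006, so there is no need to use noatime on XFS for performance reasons.[7]
See Fstab#atime options for more information.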
Despite XFS supporting async discard[8] since kernel 4.7[9][10], xfs(5) still recommends "that you use the fstrim application to discard unused blocks rather than the discard mount option because the performance impact of this option is quite severe."
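See Solid state drive#Periodic TRIM.
Although XFS is extent-based by nature and its delayed allocation strategy greatly improves its resistance to fragmentation, XFS still provides a defragmentation utility (xfs_fsr, short for XFS filesystem reorganizer) that can defragment files on a mounted and active XFS file system. It is also useful to check XFS fragmentation periodically.
xfs_fsr(8) improves the organisation of mounted file systems. The reorganisation algorithm operates on one file at a time, compacting it or otherwise improving the layout of its extents (contiguous blocks of file data).
To see how much fragmentation the file system currently has: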
# xfs_db -c frag -r /dev/partition
To start defragmentation, use the xfs_fsr(8) command:
# xfs_fsr /dev/partition
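For periodic checks it can be handy to extract just the percentage from the frag report. The sketch below assumes the report contains a line of the form "actual N, ideal M, fragmentation factor P%"; both that format and the function name frag_factor are assumptions, so verify them against the output of your xfs_db version.

```shell
#!/bin/sh
# Hypothetical helper: read the frag report on stdin and print only the
# fragmentation percentage. Assumed line format:
#   actual 51, ideal 34, fragmentation factor 33.33%
# Lines that do not match are ignored.
frag_factor() {
    sed -n 's/.*fragmentation factor \([0-9.]*\)%.*/\1/p'
}

# Illustrative sample line, not captured from a real system.
echo 'actual 51, ideal 34, fragmentation factor 33.33%' | frag_factor   # prints 33.33
```

On a real system the helper would be used as `xfs_db -c frag -r /dev/partition | frag_factor`.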
The reflink feature, available since kernel version 4.9 and enabled by default since mkfs.xfs version 5.1.0, allows creating fast reflink'ed copies of files as well as deduplication after the fact, in the same way as btrfs:
Reflink copies initially use no additional space:
$ cp --reflink bigfile1 bigfile2
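Additional space is only used once either file is edited and a copy-on-write takes place. This can be very useful for creating snapshots of (large) files.
Existing file systems can be deduplicated with duperemove or with hardlink(1) from util-linux.
Using an external log (metadata journal), for instance on an SSD, may improve performance.[11] See mkfs.xfs(8) for details about the logdev parameter.
To reserve an external journal of a specified size when creating an XFS file system, pass the -l logdev=device,size=size option to mkfs.xfs. If the size parameter is omitted, a journal size based on the size of the file system is used. To mount the file system so that it uses the external journal, pass the -o logdev=device option to mount.
XFS has a dedicated sysctl variable for setting the writeback interval. The default value is 3000 (in centiseconds, i.e. 30 seconds).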
/etc/sysctl.d/20-xfs-sync-interval.conf
fs.xfs.xfssyncd_centisecs = 10000
XFS can be resized online using xfs_growfs(8):
# xfs_growfs -D size /path/to/mnt/point
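If the -D size argument is omitted, the file system is automatically grown to the largest size possible (i.e. the size of the partition).
“A file system with only 1 AG cannot be shrunk, and a file system cannot be shrunk down to 1 AG, where AG stands for allocation group.”
xfs_scrub asks the kernel to examine all metadata objects in an XFS file system. Metadata records are scanned for obviously bad values and then cross-referenced against other metadata. The goal is to establish reasonable confidence in the consistency of the overall file system by checking individual metadata records against the rest of the metadata. Damaged metadata can be rebuilt from other metadata if a complete redundant data structure exists.
Enable/start xfs_scrub_all.timer to periodically check the metadata of all XFS file systems online.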
From Checking and Repairing an XFS File System (emphasis ours):
- If you can't mount an XFS file system, you can use the xfs_repair -n command to check its consistency. Typically, you would only run this command on the device file of an unmounted file system that you believe has a problem. The xfs_repair -n command displays output to indicate changes that would be made to the file system in the case where it would need to complete a repair operation, but doesn't modify the file system directly.
- If you can mount the file system and you don't have a suitable backup, you can use the xfsdump command to back up the existing file system data. However, note that the command might fail if the file system's metadata has become corrupted.
- You can use the xfs_repair command to attempt to repair an XFS file system specified by its device file. The command replays the journal log to fix any inconsistencies that might have resulted from the file system not being cleanly unmounted. Unless the file system has an inconsistency, you typically don't need to use the following command, as the journal is replayed every time that you mount an XFS file system.
# xfs_repair device
- If the journal log has become corrupted, you can reset the log by specifying the -L option to xfs_repair.
- Warning:
- The xfs_repair utility cannot repair an XFS file system with a dirty log. To clear the log, mount and unmount the XFS file system. If the log is corrupt and cannot be replayed, use the -L option ("force log zeroing") to clear the log, that is, xfs_repair -L /dev/device. Be aware that this may result in further corruption or data loss.[13]
- Resetting the log can leave the file system in an inconsistent state, resulting in data loss and data corruption. Unless you're experienced with debugging and repairing XFS file systems by using xfs_db, it is recommended that you instead recreate the file system and restore its contents from a backup.[14]
- If you can't mount the file system or you don't have a suitable backup, running xfs_repair is the only viable option, unless you're experienced in using the xfs_db command.
- xfs_db provides an internal command set that allows you to debug and repair an XFS file system manually. The commands enable you to perform scans on the file system, and navigate and display its data structures. If you specify the -x option to enable expert mode, you can modify the data structures.
# xfs_db [-x] device
- For more information, see the xfs_db(8) and xfs_repair(8) manual pages, and the help command within xfs_db.
See also Which factors influence the memory usage of xfs_repair? and XFS Repair.
Even when being mounted read-only with mount -o ro an XFS file system's log will be replayed if it has not been unmounted cleanly.
There may be situations where a compromised XFS file system on a damaged storage device should be mounted read-only, so that files may be copied off it hopefully without causing further damage, yet it cannot be mounted because it has not been unmounted cleanly and is damaged to such an extent that the log cannot be replayed. Also, consider that replaying the log means writing to the compromised file system, which might be a bad idea in itself.
To mount an XFS file system without writing to it in any way and without replaying the log, use mount -o ro,norecovery.
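xfs_undelete-git (AUR) can recover deleted files from an unmounted or read-only mounted XFS file system, subject to some limitations. See https://github.com/ianka/xfs_undelete for more information.
XFS quota mount options (uquota, gquota, prjquota, etc.) fail during re-mount of the file system. To enable quota for the root file system, the mount option must be passed to the initramfs as the kernel parameter rootflags=. After that, it does not need to be listed again among the mount options for the root (/) file system in /etc/fstab.
When xfs_scrub_all runs, it launches xfs_scrub@.service for each mounted XFS file system. The service runs as user nobody, so if nobody cannot enter the directory, the command fails with the following errors: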
xfs_scrub@mountpoint.service: Changing to the requested working directory failed: Permission denied
xfs_scrub@mountpoint.service: Failed at step CHDIR spawning /usr/bin/xfs_scrub: Permission denied
xfs_scrub@mountpoint.service: Main process exited, code=exited, status=200/CHDIR
To allow the service to run, change the permissions of the mountpoint so that user nobody has execute permission.
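A quick way to spot affected mountpoints is to look at the execute (search) bit for "others" in the directory mode. The sketch below is a simplification: the function name others_can_enter is hypothetical, it assumes a stat that accepts -c %a (as GNU coreutils does), and it inspects only the mode of the directory itself, ignoring ACLs and the permissions of parent directories.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the "others" class has execute
# (search) permission on the given directory, judged from its octal
# mode alone. Assumption: stat supports -c %a (GNU coreutils).
others_can_enter() {
    mode=$(stat -c %a "$1") || return 2
    other=${mode#"${mode%?}"}     # last octal digit = permissions for others
    [ $((other & 1)) -eq 1 ]      # bit 1 is the execute/search bit
}
```

If the check fails for a mountpoint, one possible fix is `chmod o+x /path/to/mnt/point`; whether granting search permission to all users is acceptable depends on the system.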
When using a mkinitcpio-generated systemd based initramfs without the base hook, you will see the following messages in the journal:
systemd-fsck[288]: fsck: /usr/bin/fsck.xfs: execute failed: No such file or directory
systemd-fsck[286]: fsck failed with exit status 8.
systemd-fsck[286]: Ignoring error.
This is because fsck.xfs(8) is a shell script and requires /bin/sh to execute. /usr/bin/sh is provided by the base hook, so the solution is to prepend it to the HOOKS array in /etc/mkinitcpio.conf. E.g.:
HOOKS=(base systemd ... )
- XFS wiki (archive)
- XFS FAQ
- Improving Metadata Performance By Reducing Journal Overhead
- XFS Wikipedia Entry
- XFS User Guide [dead link 2024-03-03]: the guide no longer exists at this address, but the page links to its git repository