XFS
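XFS is a high-performance journaling file system created by Silicon Graphics, Inc. Its allocation group based design makes XFS particularly proficient at parallel IO. When the file system spans multiple storage devices, this design allows the number of IO threads, the file system bandwidth, and the file and file system sizes to scale extremely well.
To use the XFS user-space utilities, install the xfsprogs package. It contains the tools necessary to manage an XFS file system.
To create a new file system on device, use: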
# mkfs.xfs device
Sample output:
meta-data=/dev/device            isize=256    agcount=4, agsize=3277258 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=13109032, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=6400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
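The numbers in the sample output can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming only the values printed above; the function names are hypothetical helpers, not part of xfsprogs. In this sample agcount × agsize equals the reported block count exactly, although in general the last allocation group may be shorter than the others.

```shell
#!/bin/sh
# Sketch: recompute the data-section size from the geometry values
# shown in the sample mkfs.xfs output. Helper names are hypothetical.

# xfs_data_blocks AGCOUNT AGSIZE -> blocks covered by the allocation groups
xfs_data_blocks() {
    echo $(( $1 * $2 ))
}

# xfs_data_bytes BLOCKS BSIZE -> size of the data section in bytes
xfs_data_bytes() {
    echo $(( $1 * $2 ))
}

blocks=$(xfs_data_blocks 4 3277258)      # agcount=4, agsize=3277258
bytes=$(xfs_data_bytes "$blocks" 4096)   # bsize=4096
echo "blocks=$blocks bytes=$bytes"       # blocks=13109032 bytes=53694595072
```

The result, 53694595072 bytes, is just over 50 GiB, which matches the blocks=13109032 figure on the data line.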
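- You can assign a label to the file system with the -L label option.
- When using mkfs.xfs on a block device that already contains a file system, add the -f option to overwrite the existing file system.[3] This destroys all data in the old file system!
xfsprogs 3.2.0 introduced a new on-disk format (v5) that includes a metadata checksum scheme called Self-Describing Metadata.
Based on CRC32, it provides additional protection against metadata corruption (e.g. on unexpected power loss). Checksums are enabled by default when using xfsprogs 3.2.3 or later. If you need to mount the XFS file system read-write on older kernels, checksums can be disabled by passing -m crc=0 to mkfs.xfs(8):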
# mkfs.xfs -m crc=0 /dev/target_partition
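The XFS v5 on-disk format has been considered stable for production workloads since Linux kernel 3.15.
Starting with Linux 3.16, XFS gained a B+ tree that indexes free inodes. It is equivalent to the existing B+ tree that indexes allocated inodes, except that the free inode B+ tree tracks inode chunks containing at least one free inode. The purpose is to speed up the search for free inode clusters during inode allocation. It improves performance on aged file systems, for example after you have created and removed millions of files over months or years. Using this feature does not affect overall file system reliability or recoverability.
This feature relies on the new v5 on-disk format, which has been considered stable for production workloads since Linux kernel 3.15. It does not change existing on-disk structures, but adds a new one that must be kept consistent with the inode allocation B+ tree; for this reason, older kernels can only mount file systems with the free inode B+ tree feature in read-only mode.
The feature is enabled by default when using xfsprogs 3.2.3 or later. If you need a file system that older kernels can write to, it can be disabled with the finobt=0 switch when formatting an XFS partition. It must be used together with crc=0: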
# mkfs.xfs -m crc=0,finobt=0 /dev/target_partition
or in short form (because finobt depends on crc):
# mkfs.xfs -m crc=0 /dev/target_partition
The reverse mapping btree is at its core:
- a secondary index of storage space usage that effectively provides a redundant copy of primary space usage metadata. This adds some overhead to filesystem operations, but its inclusion in a filesystem makes cross-referencing very fast. It is an essential feature for repairing filesystems online because we can rebuild damaged primary metadata from the secondary copy.
- The feature graduated from EXPERIMENTAL status in Linux 4.16 and is production ready. However, online filesystem checking and repair is (so far) the only use case for this feature, so it will remain opt-in at least until online checking graduates to production readiness.
From mkfs.xfs(8) § OPTIONS:
- The reverse mapping btree maps filesystem blocks to the owner of the filesystem block. Most of the mappings will be to an inode number and an offset, though there will also be mappings to filesystem metadata. This secondary metadata can be used to validate the primary metadata or to pinpoint exactly which data has been lost when a disk error occurs.
See also [5] and [6] for more information.
This feature is enabled by default for new filesystems as of xfsprogs 6.5.0.
Starting in Linux 5.10, XFS supports using refactored "timestamp and inode encoding functions to handle timestamps as a 64-bit nanosecond counter and bit shifting to increase the effective size. This now allows XFS to run well past the Year 2038 problem to now the Year 2486. Making a new XFS file-system with bigtime enabled allows a timestamp range from December 1901 to July 2486 rather than December 1901 to January 2038." The feature will also allow quota timer expirations from January 1970 to July 2486 rather than January 1970 to February 2106.
Big timestamps are enabled by default for new filesystems as of xfsprogs 5.15.
Use xfs_info(8) to check whether big timestamps are enabled on an existing file system:
# xfs_info / | grep bigtime
... bigtime=0 ...
With xfsprogs 5.11 or newer, xfs_admin(8) can upgrade an existing, unmounted file system:
# xfs_admin -O bigtime=1 device
or with xfs_repair(8):
# xfs_repair -c bigtime=1 device
You may also want to enable inobtcount at the same time (another new default).
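The check above can be wrapped in a small helper that reads xfs_info-style text and reports the state of a named feature flag. This is a minimal sketch: xfs_feature_state is a hypothetical function, and the sample line is illustrative rather than captured from a real system, so that the snippet runs without an XFS file system.

```shell
#!/bin/sh
# xfs_feature_state NAME
# Read xfs_info-style text on stdin and print the value that follows
# "NAME=" (for example 0 or 1), or "unknown" if the flag is not listed.
# Hypothetical helper, not part of xfsprogs.
xfs_feature_state() {
    # Split on spaces, tabs and commas so every key=value pair gets its own line.
    value=$(tr ' ,\t' '\n\n\n' | sed -n "s/^$1=//p" | head -n 1)
    echo "${value:-unknown}"
}

# Illustrative input; on a real system use: xfs_info / | xfs_feature_state bigtime
sample='         =           reflink=1    bigtime=0 inobtcount=0'
printf '%s\n' "$sample" | xfs_feature_state bigtime
printf '%s\n' "$sample" | xfs_feature_state inobtcount
```

A result of 0 means the flag is present but disabled, while unknown suggests that the installed xfsprogs does not report the flag at all.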
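From the XFS FAQ:
- The default values are already optimised for best performance. mkfs.xfs detects the difference between single-disk and MD/DM RAID setups and changes the default values automatically.
- Basically, you only need to specify the stripe unit and width to mkfs.xfs when using hardware RAID.
(See #Stripe size and width for details.)
Performance may be improved by using the largeio and swalloc values, together with logbsize and allocsize values larger than the defaults. The following articles provide more details:
- For mount options, the only one that changes metadata performance considerably is logbsize. Increasing logbsize reduces the number of journal IOs for a given workload, but more modifications may be lost after recovery if the system crashes while making many changes.
- As of kernel 3.2.12, the default I/O scheduler, CFQ, defeats much of the parallelization in XFS.
The scheduler in use can be verified by checking the contents of /sys/block/nvme*n*/queue/scheduler. Therefore, following #Creation is basically all that is needed for optimal performance.
If the file system resides on a striped RAID, you can gain significant performance improvements by specifying the stripe size to the mkfs.xfs(8) command.
XFS can sometimes detect the geometry under software RAID, but if you reshape the array or use hardware RAID, see how to calculate the correct sunit and swidth values for optimal performance.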
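The stripe calculation can be sketched as follows. This is an illustration under stated assumptions, not a definitive recipe: both function names are hypothetical, the RAID levels handled are only the common ones, and raid10 is assumed to keep two copies of each block. The stripe unit is the per-disk chunk size and the stripe width is the number of data-bearing disks; sunit and swidth express the same geometry in 512-byte sectors.

```shell
#!/bin/sh
# xfs_stripe_opts LEVEL DISKS CHUNK_KIB
# Print mkfs.xfs data-section options for a striped array.
#   su = stripe unit (per-disk chunk size)
#   sw = stripe width as a multiple of su (number of data-bearing disks)
# Hypothetical helper; raid10 is assumed to use two copies.
xfs_stripe_opts() {
    level=$1
    disks=$2
    chunk_kib=$3
    case $level in
        raid0)  data_disks=$disks ;;
        raid5)  data_disks=$(( disks - 1 )) ;;   # one disk's worth of parity
        raid6)  data_disks=$(( disks - 2 )) ;;   # two disks' worth of parity
        raid10) data_disks=$(( disks / 2 )) ;;   # assumption: two copies
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
    echo "-d su=${chunk_kib}k,sw=${data_disks}"
}

# xfs_sunit_swidth DATA_DISKS CHUNK_KIB
# Print the same geometry as sunit/swidth, both in 512-byte sectors.
xfs_sunit_swidth() {
    sunit=$(( $2 * 1024 / 512 ))
    echo "sunit=${sunit},swidth=$(( sunit * $1 ))"
}

xfs_stripe_opts raid5 4 64    # prints: -d su=64k,sw=3
xfs_sunit_swidth 3 64         # prints: sunit=128,swidth=384
```

The printed options would then be passed to mkfs.xfs together with the target device, after confirming the chunk size against the RAID controller or array configuration.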
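On some file systems, performance can be increased by adding the noatime mount option in /etc/fstab. For XFS, the default access time behaviour is relatime, which has almost no overhead compared to noatime while still maintaining sane access times. All Linux file systems now use this as the default (since around 2.6.30), but XFS has had relatime-like behaviour since 2006, so there is no need to use noatime on XFS for performance reasons.[7]
See Fstab#atime options for more information.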
Despite XFS supporting async discard[8] since kernel 4.7[9][10], xfs(5) still recommends "that you use the fstrim application to discard unused blocks rather than the discard mount option because the performance impact of this option is quite severe."
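See Solid state drive#Periodic TRIM.
Although XFS is extent-based by nature and its delayed allocation strategy greatly improves its resistance to fragmentation, XFS still provides a defragmentation utility (xfs_fsr, short for XFS filesystem reorganizer) that can defragment files on a mounted and active XFS file system. It can also be useful to check XFS fragmentation periodically.
xfs_fsr(8) improves the organization of mounted file systems. The reorganization algorithm operates on one file at a time, compacting the file or otherwise improving the layout of its extents (contiguous blocks of file data).
To see how much fragmentation the file system currently has: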
# xfs_db -c frag -r /dev/partition
To start defragmentation, use the xfs_fsr(8) command:
# xfs_fsr /dev/partition
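To make the fragmentation figure easier to interpret, here is a minimal sketch of how such a factor can be derived from an actual and an ideal extent count. It assumes the factor is computed as (actual − ideal) / actual; that formula is an assumption about how xfs_db arrives at its number and is not stated in the text above. frag_factor is a hypothetical helper and the input values are made up.

```shell
#!/bin/sh
# frag_factor ACTUAL IDEAL
# Print a fragmentation percentage with two decimals, assuming the
# factor is (ACTUAL - IDEAL) * 100 / ACTUAL. Hypothetical helper.
frag_factor() {
    awk -v actual="$1" -v ideal="$2" 'BEGIN {
        if (actual <= 0) { print "0.00"; exit }   # avoid division by zero
        printf "%.2f\n", (actual - ideal) * 100 / actual
    }'
}

frag_factor 1200 1000   # 200 excess extents out of 1200 -> 16.67
```

Under this reading, a file system whose files each occupy the ideal number of extents scores 0, and the score approaches 100 as the number of extents grows far beyond the ideal.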
The reflink feature, available since kernel version 4.9 and enabled by default since mkfs.xfs version 5.1.0, allows creating fast reflink'ed copies of files as well as deduplication after the fact, in the same way as btrfs:
Reflink copies initially use no additional space:
$ cp --reflink bigfile1 bigfile2
Until either file is edited, and a copy-on-write takes place. This can be very useful to create snapshots of (large) files.
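Existing file systems can be deduplicated with duperemove or with the hardlink(1) tool from util-linux.
Using an external log (metadata journal), for instance on an SSD, may help to improve performance.[11] See mkfs.xfs(8) for details about the logdev parameter.
To reserve an external journal of a specified size when creating an XFS file system, pass the -l logdev=device,size=size option to mkfs.xfs. If the size parameter is omitted, a journal size based on the size of the file system is used. To make the XFS file system use the external journal when mounting it, pass the -o logdev=device option to mount.
XFS has a dedicated sysctl variable for setting the "writeback interval", with a default value of 3000.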
/etc/sysctl.d/20-xfs-sync-interval.conf
fs.xfs.xfssyncd_centisecs = 10000
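As the variable name suggests, the value is given in centiseconds. A minimal sketch of the conversion follows, assuming the usual mapping of the sysctl name to a path under /proc/sys; centisecs_to_secs is a hypothetical helper.

```shell
#!/bin/sh
# centisecs_to_secs N: convert hundredths of a second to whole seconds.
# Hypothetical helper for reading fs.xfs.xfssyncd_centisecs values.
centisecs_to_secs() {
    echo $(( $1 / 100 ))
}

centisecs_to_secs 3000     # the default: 30 seconds
centisecs_to_secs 10000    # the value from the file above: 100 seconds

# Show the live value if the XFS sysctl tree is present on this machine.
f=/proc/sys/fs/xfs/xfssyncd_centisecs
if [ -r "$f" ]; then
    echo "current interval: $(centisecs_to_secs "$(cat "$f")") s"
fi
```

With the configuration file above in place, the interval changes from 30 seconds to 100 seconds once the sysctl settings are reloaded.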
XFS can be resized online with xfs_growfs(8):
# xfs_growfs -D size /path/to/mnt/point
If the -D size option is omitted, the file system is automatically grown to the largest size possible (i.e. the size of the partition).
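According to xfs_growfs(8), the value given to -D is expressed in file system blocks rather than bytes. The following sketch converts a target size in GiB to a block count; gib_to_fs_blocks is a hypothetical helper, and the 4096-byte block size is an assumption that should be checked against the bsize value reported by xfs_info.

```shell
#!/bin/sh
# gib_to_fs_blocks GIB BSIZE
# Convert a target size in GiB to a count of file system blocks of
# BSIZE bytes, suitable as the argument of xfs_growfs -D.
# Hypothetical helper, not part of xfsprogs.
gib_to_fs_blocks() {
    echo $(( $1 * 1024 * 1024 * 1024 / $2 ))
}

gib_to_fs_blocks 100 4096   # 100 GiB with 4096-byte blocks -> 26214400
```

The resulting number would then be used as size in the xfs_growfs command shown above.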
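"A filesystem with only 1 AG cannot be shrunk further, and a filesystem cannot be shrunk to the point where it would only have 1 AG", where AG stands for allocation group.
xfs_scrub asks the kernel to examine all metadata objects in an XFS file system. The kernel scans metadata records for obviously bad values and then cross-references them against other metadata. The goal is to establish reasonable confidence in the consistency of the whole file system by checking individual metadata records against the other metadata in the file system. Where intact redundant data structures exist, damaged metadata can be rebuilt from the other metadata.
Enable/start xfs_scrub_all.timer to periodically check the metadata of all XFS file systems online.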
From Checking and Repairing an XFS File System (emphasis ours):
- If you can't mount an XFS file system, you can use the xfs_repair -n command to check its consistency. Typically, you would only run this command on the device file of an unmounted file system that you believe has a problem. The xfs_repair -n command displays output to indicate changes that would be made to the file system in the case where it would need to complete a repair operation, but doesn't modify the file system directly.
- If you can mount the file system and you don't have a suitable backup, you can use the xfsdump command to back up the existing file system data. However, note that the command might fail if the file system's metadata has become corrupted.
- You can use the xfs_repair command to attempt to repair an XFS file system specified by its device file. The command replays the journal log to fix any inconsistencies that might have resulted from the file system not being cleanly unmounted. Unless the file system has an inconsistency, you typically don't need to use the following command, as the journal is replayed every time that you mount an XFS file system.
# xfs_repair device
- If the journal log has become corrupted, you can reset the log by specifying the -L option to xfs_repair.
- Warning:
  - The xfs_repair utility cannot repair an XFS file system with a dirty log. To clear the log, mount and unmount the XFS file system. If the log is corrupt and cannot be replayed, use the -L option ("force log zeroing") to clear the log, that is, xfs_repair -L /dev/device. Be aware that this may result in further corruption or data loss.[13]
  - Resetting the log can leave the file system in an inconsistent state, resulting in data loss and data corruption. Unless you're experienced with debugging and repairing XFS file systems by using xfs_db, it is recommended that you instead recreate the file system and restore its contents from a backup.[14]
- If you can't mount the file system or you don't have a suitable backup, running xfs_repair is the only viable option, unless you're experienced in using the xfs_db command.
- xfs_db provides an internal command set that allows you to debug and repair an XFS file system manually. The commands enable you to perform scans on the file system, and navigate and display its data structures. If you specify the -x option to enable expert mode, you can modify the data structures.
# xfs_db [-x] device
- For more information, see the xfs_db(8) and xfs_repair(8) manual pages, and the help command within xfs_db.
See also Which factors influence the memory usage of xfs_repair? and XFS Repair.
Even when being mounted read-only with mount -o ro an XFS file system's log will be replayed if it has not been unmounted cleanly.
There may be situations where a compromised XFS file system on a damaged storage device should be mounted read-only, so that files may be copied off it hopefully without causing further damage, yet it cannot be mounted because it has not been unmounted cleanly and is damaged to such an extent that the log cannot be replayed. Also, consider that replaying the log means writing to the compromised file system, which might be a bad idea in itself.
To mount an XFS file system without writing to it in any way and without replaying the log, use mount -o ro,norecovery.
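xfs_undelete-git (AUR) can recover deleted files from an unmounted or read-only mounted XFS file system, with some limitations. See https://github.com/ianka/xfs_undelete for more information.
XFS quota mount options (uquota, gquota, prjquota, etc.) fail when the file system is remounted. To enable quotas on the root file system, the mount option must be passed to the initramfs as the kernel parameter rootflags=. After that, the option does not need to be listed again among the mount options for the root (/) file system in /etc/fstab.
When xfs_scrub_all runs, it starts xfs_scrub@.service for each mounted XFS file system. The service runs as user nobody, so if nobody cannot enter the directory, the command fails with the following errors:
xfs_scrub@mountpoint.service: Changing to the requested working directory failed: Permission denied
xfs_scrub@mountpoint.service: Failed at step CHDIR spawning /usr/bin/xfs_scrub: Permission denied
xfs_scrub@mountpoint.service: Main process exited, code=exited, status=200/CHDIR
To allow the service to run, change the permissions of the mount point so that user nobody has execute permission.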
When using a mkinitcpio-generated systemd based initramfs without the base hook, you will see the following messages in the journal:
systemd-fsck[288]: fsck: /usr/bin/fsck.xfs: execute failed: No such file or directory
systemd-fsck[286]: fsck failed with exit status 8.
systemd-fsck[286]: Ignoring error.
This is because fsck.xfs(8) is a shell script and requires /bin/sh to execute. /usr/bin/sh is provided by the base hook, so the solution is to prepend it to the HOOKS array in /etc/mkinitcpio.conf. E.g.:
HOOKS=(base systemd ... )
- XFS wiki (archive)
- XFS FAQ
- Improving Metadata Performance By Reducing Journal Overhead
- XFS Wikipedia Entry
- XFS User Guide [dead link 2024-03-03] (the XFS User Guide no longer exists, but the page links to the git repository)