
oasis-root

Compiled tree of Oasis Linux based on own branch at <https://hacktivis.me/git/oasis/>

git clone https://anongit.hacktivis.me/git/oasis-root.git

zfs.4 (116126B)


  1. .\"
  2. .\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
  3. .\" Copyright (c) 2019, 2021 by Delphix. All rights reserved.
  4. .\" Copyright (c) 2019 Datto Inc.
  5. .\" Copyright (c) 2023, 2024 Klara, Inc.
  6. .\" The contents of this file are subject to the terms of the Common Development
  7. .\" and Distribution License (the "License"). You may not use this file except
  8. .\" in compliance with the License. You can obtain a copy of the license at
  9. .\" usr/src/OPENSOLARIS.LICENSE or https://opensource.org/licenses/CDDL-1.0.
  10. .\"
  11. .\" See the License for the specific language governing permissions and
  12. .\" limitations under the License. When distributing Covered Code, include this
  13. .\" CDDL HEADER in each file and include the License file at
  14. .\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
  15. .\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
  16. .\" own identifying information:
  17. .\" Portions Copyright [yyyy] [name of copyright owner]
  18. .\"
  19. .\" Copyright (c) 2024, Klara, Inc.
  20. .\"
  21. .Dd November 1, 2024
  22. .Dt ZFS 4
  23. .Os
  24. .
  25. .Sh NAME
  26. .Nm zfs
  27. .Nd tuning of the ZFS kernel module
  28. .
  29. .Sh DESCRIPTION
  30. The ZFS module supports these parameters:
  31. .Bl -tag -width Ds
  32. .It Sy dbuf_cache_max_bytes Ns = Ns Sy UINT64_MAX Ns B Pq u64
  33. Maximum size in bytes of the dbuf cache.
  34. The target size is the MIN of this value and
  35. .No 1/2^ Ns Sy dbuf_cache_shift Pq 1/32nd
  36. of the target ARC size.
  37. The behavior of the dbuf cache and its associated settings
  38. can be observed via the
  39. .Pa /proc/spl/kstat/zfs/dbufstats
  40. kstat.
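.Pp
As a usage sketch (assuming a Linux system where this kstat is exposed), the
current cache behavior can be inspected directly, e.g.:
.Dl # cat /proc/spl/kstat/zfs/dbufstats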
  41. .
  42. .It Sy dbuf_metadata_cache_max_bytes Ns = Ns Sy UINT64_MAX Ns B Pq u64
  43. Maximum size in bytes of the metadata dbuf cache.
  44. The target size is the MIN of this value and
  45. .No 1/2^ Ns Sy dbuf_metadata_cache_shift Pq 1/64th
  46. of the target ARC size.
  47. The behavior of the metadata dbuf cache and its associated settings
  48. can be observed via the
  49. .Pa /proc/spl/kstat/zfs/dbufstats
  50. kstat.
  51. .
  52. .It Sy dbuf_cache_hiwater_pct Ns = Ns Sy 10 Ns % Pq uint
  53. The percentage over
  54. .Sy dbuf_cache_max_bytes
  55. when dbufs must be evicted directly.
  56. .
  57. .It Sy dbuf_cache_lowater_pct Ns = Ns Sy 10 Ns % Pq uint
  58. The percentage below
  59. .Sy dbuf_cache_max_bytes
  60. when the evict thread stops evicting dbufs.
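.Pp
As an illustrative example (assuming a
.Sy dbuf_cache_max_bytes
of 128 MiB), the default 10% water marks mean direct eviction begins once the
cache exceeds about 140.8 MiB and the evict thread stops once it drops below
about 115.2 MiB.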
  61. .
  62. .It Sy dbuf_cache_shift Ns = Ns Sy 5 Pq uint
  63. Set the size of the dbuf cache
  64. .Pq Sy dbuf_cache_max_bytes
  65. to a log2 fraction of the target ARC size.
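.Pp
As an illustrative example (assuming a target ARC size of 4 GiB), the default
shift of 5 yields a dbuf cache target of
.Dl 4 GiB / 2^5 = 128 MiB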
  66. .
  67. .It Sy dbuf_metadata_cache_shift Ns = Ns Sy 6 Pq uint
  68. Set the size of the dbuf metadata cache
  69. .Pq Sy dbuf_metadata_cache_max_bytes
  70. to a log2 fraction of the target ARC size.
  71. .
  72. .It Sy dbuf_mutex_cache_shift Ns = Ns Sy 0 Pq uint
  73. Set the size of the mutex array for the dbuf cache.
  74. When set to
  75. .Sy 0
  76. the array is dynamically sized based on total system memory.
  77. .
  78. .It Sy dmu_object_alloc_chunk_shift Ns = Ns Sy 7 Po 128 Pc Pq uint
  79. dnode slots allocated in a single operation as a power of 2.
  80. The default value minimizes lock contention for the bulk operation performed.
  81. .
  82. .It Sy dmu_ddt_copies Ns = Ns Sy 3 Pq uint
  83. Controls the number of copies stored for DeDup Table
  84. .Pq DDT
  85. objects.
  86. Reducing the number of copies to 1 from the previous default of 3
  87. can reduce the write inflation caused by deduplication.
  88. This assumes redundancy for this data is provided by the vdev layer.
  89. If the DDT is damaged, space may be leaked
  90. .Pq not freed
  91. when the DDT can not report the correct reference count.
  92. .
  93. .It Sy dmu_prefetch_max Ns = Ns Sy 134217728 Ns B Po 128 MiB Pc Pq uint
  94. Limit the amount of data that can be prefetched with one call to this many bytes.
  95. This helps to limit the amount of memory that can be used by prefetching.
  96. .
  97. .It Sy ignore_hole_birth Pq int
  98. Alias for
  99. .Sy send_holes_without_birth_time .
  100. .
  101. .It Sy l2arc_feed_again Ns = Ns Sy 1 Ns | Ns 0 Pq int
  102. Turbo L2ARC warm-up.
  103. When the L2ARC is cold the fill interval will be set as fast as possible.
  104. .
  105. .It Sy l2arc_feed_min_ms Ns = Ns Sy 200 Pq u64
  106. Min feed interval in milliseconds.
  107. Requires
  108. .Sy l2arc_feed_again Ns = Ns Ar 1
  109. and only applicable in related situations.
  110. .
  111. .It Sy l2arc_feed_secs Ns = Ns Sy 1 Pq u64
  112. Seconds between L2ARC writing.
  113. .
  114. .It Sy l2arc_headroom Ns = Ns Sy 8 Pq u64
  115. How far through the ARC lists to search for L2ARC cacheable content,
  116. expressed as a multiplier of
  117. .Sy l2arc_write_max .
  118. ARC persistence across reboots can be achieved with persistent L2ARC
  119. by setting this parameter to
  120. .Sy 0 ,
  121. allowing the full length of ARC lists to be searched for cacheable content.
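.Pp
As an illustrative example (assuming the default
.Sy l2arc_write_max
of 32 MiB), the default headroom of 8 means roughly
.Dl 8 \(mu 32 MiB = 256 MiB
of the ARC lists is scanned for eligible buffers on each feed cycle.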
  122. .
  123. .It Sy l2arc_headroom_boost Ns = Ns Sy 200 Ns % Pq u64
  124. Scales
  125. .Sy l2arc_headroom
  126. by this percentage when L2ARC contents are being successfully compressed
  127. before writing.
  128. A value of
  129. .Sy 100
  130. disables this feature.
  131. .
  132. .It Sy l2arc_exclude_special Ns = Ns Sy 0 Ns | Ns 1 Pq int
  133. Controls whether buffers present on special vdevs are eligible for caching
  134. into L2ARC.
  135. If set to 1, exclude dbufs on special vdevs from being cached to L2ARC.
  136. .
  137. .It Sy l2arc_mfuonly Ns = Ns Sy 0 Ns | Ns 1 Ns | Ns 2 Pq int
  138. Controls whether only MFU metadata and data are cached from ARC into L2ARC.
  139. This may be desired to avoid wasting space on L2ARC when reading/writing large
  140. amounts of data that are not expected to be accessed more than once.
  141. .Pp
  142. The default is 0,
  143. meaning both MRU and MFU data and metadata are cached.
  144. When turning off this feature (setting it to 0), some MRU buffers will
  145. still be present in ARC and eventually cached on L2ARC.
  146. .No If Sy l2arc_noprefetch Ns = Ns Sy 0 ,
  147. some prefetched buffers will be cached to L2ARC, and those might later
  148. transition to MRU, in which case the
  149. .Sy l2arc_mru_asize No arcstat will not be Sy 0 .
  150. .Pp
  151. Setting it to 1 means to L2 cache only MFU data and metadata.
  152. .Pp
  153. Setting it to 2 means to L2 cache all metadata (MRU+MFU) but
  154. only MFU data (i.e. MRU data are not cached). This can be the right setting
  155. to cache as much metadata as possible even with high data turnover.
  156. .Pp
  157. Regardless of
  158. .Sy l2arc_noprefetch ,
  159. some MFU buffers might be evicted from ARC,
  160. accessed later on as prefetches and transition to MRU as prefetches.
  161. If accessed again they are counted as MRU and the
  162. .Sy l2arc_mru_asize No arcstat will not be Sy 0 .
  163. .Pp
  164. The ARC status of L2ARC buffers when they were first cached in
  165. L2ARC can be seen in the
  166. .Sy l2arc_mru_asize , Sy l2arc_mfu_asize , No and Sy l2arc_prefetch_asize
  167. arcstats when importing the pool or onlining a cache
  168. device if persistent L2ARC is enabled.
  169. .Pp
  170. The
  171. .Sy evict_l2_eligible_mru
  172. arcstat does not take into account if this option is enabled as the information
  173. provided by the
  174. .Sy evict_l2_eligible_m[rf]u
  175. arcstats can be used to decide if toggling this option is appropriate
  176. for the current workload.
  177. .
  178. .It Sy l2arc_meta_percent Ns = Ns Sy 33 Ns % Pq uint
  179. Percent of ARC size allowed for L2ARC-only headers.
  180. Since L2ARC buffers are not evicted on memory pressure,
  181. too many headers on a system with an irrationally large L2ARC
  182. can render it slow or unusable.
  183. This parameter limits L2ARC writes and rebuilds to achieve the target.
  184. .
  185. .It Sy l2arc_trim_ahead Ns = Ns Sy 0 Ns % Pq u64
  186. Trims ahead of the current write size
  187. .Pq Sy l2arc_write_max
  188. on L2ARC devices by this percentage of write size if we have filled the device.
  189. If set to
  190. .Sy 100
  191. we TRIM twice the space required to accommodate upcoming writes.
  192. A minimum of
  193. .Sy 64 MiB
  194. will be trimmed.
  195. It also enables TRIM of the whole L2ARC device upon creation
  196. or addition to an existing pool or if the header of the device is
  197. invalid upon importing a pool or onlining a cache device.
  198. A value of
  199. .Sy 0
  200. disables TRIM on L2ARC altogether and is the default as it can put significant
  201. stress on the underlying storage devices.
  202. This will vary depending on how well the specific device handles these commands.
  203. .
  204. .It Sy l2arc_noprefetch Ns = Ns Sy 1 Ns | Ns 0 Pq int
  205. Do not write buffers to L2ARC if they were prefetched but not used by
  206. applications.
  207. In case there are prefetched buffers in L2ARC and this option
  208. is later set, we do not read the prefetched buffers from L2ARC.
  209. Unsetting this option is useful for caching sequential reads from the
  210. disks to L2ARC and serving those reads from L2ARC later on.
  211. This may be beneficial in case the L2ARC device is significantly faster
  212. in sequential reads than the disks of the pool.
  213. .Pp
  214. Use
  215. .Sy 1
  216. to disable and
  217. .Sy 0
  218. to enable caching/reading prefetches to/from L2ARC.
  219. .
  220. .It Sy l2arc_norw Ns = Ns Sy 0 Ns | Ns 1 Pq int
  221. No reads during writes.
  222. .
  223. .It Sy l2arc_write_boost Ns = Ns Sy 33554432 Ns B Po 32 MiB Pc Pq u64
  224. Cold L2ARC devices will have
  225. .Sy l2arc_write_max
  226. increased by this amount while they remain cold.
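.Pp
For example, with the default values of 32 MiB each, a cold cache device may
receive up to
.Dl 32 MiB + 32 MiB = 64 MiB
per feed interval until it has warmed up.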
  227. .
  228. .It Sy l2arc_write_max Ns = Ns Sy 33554432 Ns B Po 32 MiB Pc Pq u64
  229. Max write bytes per interval.
  230. .
  231. .It Sy l2arc_rebuild_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  232. Rebuild the L2ARC when importing a pool (persistent L2ARC).
  233. This can be disabled if there are problems importing a pool
  234. or attaching an L2ARC device (e.g. the L2ARC device is slow
  235. in reading stored log metadata, or the metadata
  236. has become somehow fragmented/unusable).
  237. .
  238. .It Sy l2arc_rebuild_blocks_min_l2size Ns = Ns Sy 1073741824 Ns B Po 1 GiB Pc Pq u64
  239. Minimum size of an L2ARC device required in order to write log blocks in it.
  240. The log blocks are used upon importing the pool to rebuild the persistent L2ARC.
  241. .Pp
  242. For L2ARC devices less than 1 GiB, the amount of data
  243. .Fn l2arc_evict
  244. evicts is significant compared to the amount of restored L2ARC data.
  245. In this case, do not write log blocks in L2ARC in order not to waste space.
  246. .
  247. .It Sy metaslab_aliquot Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
  248. Metaslab granularity, in bytes.
  249. This is roughly similar to what would be referred to as the "stripe size"
  250. in traditional RAID arrays.
  251. In normal operation, ZFS will try to write this amount of data to each disk
  252. before moving on to the next top-level vdev.
  253. .
  254. .It Sy metaslab_bias_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  255. Enable metaslab group biasing based on their vdevs' over- or under-utilization
  256. relative to the pool.
  257. .
  258. .It Sy metaslab_force_ganging Ns = Ns Sy 16777217 Ns B Po 16 MiB + 1 B Pc Pq u64
  259. Make some blocks above a certain size be gang blocks.
  260. This option is used by the test suite to facilitate testing.
  261. .
  262. .It Sy metaslab_force_ganging_pct Ns = Ns Sy 3 Ns % Pq uint
  263. For blocks that could be forced to be a gang block (due to
  264. .Sy metaslab_force_ganging ) ,
  265. force this percentage of them to be gang blocks.
  266. .
  267. .It Sy brt_zap_prefetch Ns = Ns Sy 1 Ns | Ns 0 Pq int
  268. Controls prefetching BRT records for blocks which are going to be cloned.
  269. .
  270. .It Sy brt_zap_default_bs Ns = Ns Sy 12 Po 4 KiB Pc Pq int
  271. Default BRT ZAP data block size as a power of 2. Note that changing this after
  272. creating a BRT on the pool will not affect existing BRTs, only newly created
  273. ones.
  274. .
  275. .It Sy brt_zap_default_ibs Ns = Ns Sy 12 Po 4 KiB Pc Pq int
  276. Default BRT ZAP indirect block size as a power of 2. Note that changing this
  277. after creating a BRT on the pool will not affect existing BRTs, only newly
  278. created ones.
  279. .
  280. .It Sy ddt_zap_default_bs Ns = Ns Sy 15 Po 32 KiB Pc Pq int
  281. Default DDT ZAP data block size as a power of 2. Note that changing this after
  282. creating a DDT on the pool will not affect existing DDTs, only newly created
  283. ones.
  284. .
  285. .It Sy ddt_zap_default_ibs Ns = Ns Sy 15 Po 32 KiB Pc Pq int
  286. Default DDT ZAP indirect block size as a power of 2. Note that changing this
  287. after creating a DDT on the pool will not affect existing DDTs, only newly
  288. created ones.
  289. .
  290. .It Sy zfs_default_bs Ns = Ns Sy 9 Po 512 B Pc Pq int
  291. Default dnode block size as a power of 2.
  292. .
  293. .It Sy zfs_default_ibs Ns = Ns Sy 17 Po 128 KiB Pc Pq int
  294. Default dnode indirect block size as a power of 2.
  295. .
  296. .It Sy zfs_dio_enabled Ns = Ns Sy 0 Ns | Ns 1 Pq int
  297. Enable Direct I/O.
  298. If this setting is 0, then all I/O requests will be directed through the ARC
  299. acting as though the dataset property
  300. .Sy direct
  301. was set to
  302. .Sy disabled .
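.Pp
As a usage sketch (assuming Linux, where module parameters are typically
exposed under
.Pa /sys/module/zfs/parameters ) ,
the setting could be changed at runtime with, e.g.:
.Dl # echo 1 > /sys/module/zfs/parameters/zfs_dio_enabled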
  303. .
  304. .It Sy zfs_history_output_max Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
  305. When attempting to log an output nvlist of an ioctl in the on-disk history,
  306. the output will not be stored if it is larger than this size (in bytes).
  307. This must be less than
  308. .Sy DMU_MAX_ACCESS Pq 64 MiB .
  309. This applies primarily to
  310. .Fn zfs_ioc_channel_program Pq cf. Xr zfs-program 8 .
  311. .
  312. .It Sy zfs_keep_log_spacemaps_at_export Ns = Ns Sy 0 Ns | Ns 1 Pq int
  313. Prevent log spacemaps from being destroyed during pool exports and destroys.
  314. .
  315. .It Sy zfs_metaslab_segment_weight_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  316. Enable/disable segment-based metaslab selection.
  317. .
  318. .It Sy zfs_metaslab_switch_threshold Ns = Ns Sy 2 Pq int
  319. When using segment-based metaslab selection, continue allocating
  320. from the active metaslab until this option's
  321. worth of buckets have been exhausted.
  322. .
  323. .It Sy metaslab_debug_load Ns = Ns Sy 0 Ns | Ns 1 Pq int
  324. Load all metaslabs during pool import.
  325. .
  326. .It Sy metaslab_debug_unload Ns = Ns Sy 0 Ns | Ns 1 Pq int
  327. Prevent metaslabs from being unloaded.
  328. .
  329. .It Sy metaslab_fragmentation_factor_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  330. Enable use of the fragmentation metric in computing metaslab weights.
  331. .
  332. .It Sy metaslab_df_max_search Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
  333. Maximum distance to search forward from the last offset.
  334. Without this limit, fragmented pools can see
  335. .Em >100`000
  336. iterations and
  337. .Fn metaslab_block_picker
  338. becomes the performance limiting factor on high-performance storage.
  339. .Pp
  340. With the default setting of
  341. .Sy 16 MiB ,
  342. we typically see less than
  343. .Em 500
  344. iterations, even with very fragmented
  345. .Sy ashift Ns = Ns Sy 9
  346. pools.
  347. The maximum number of iterations possible is
  348. .Sy metaslab_df_max_search / 2^(ashift+1) .
  349. With the default setting of
  350. .Sy 16 MiB
  351. this is
  352. .Em 16*1024 Pq with Sy ashift Ns = Ns Sy 9
  353. or
  354. .Em 2*1024 Pq with Sy ashift Ns = Ns Sy 12 .
  355. .
  356. .It Sy metaslab_df_use_largest_segment Ns = Ns Sy 0 Ns | Ns 1 Pq int
  357. If not searching forward (due to
  358. .Sy metaslab_df_max_search , metaslab_df_free_pct ,
  359. .No or Sy metaslab_df_alloc_threshold ) ,
  360. this tunable controls which segment is used.
  361. If set, we will use the largest free segment.
  362. If unset, we will use a segment of at least the requested size.
  363. .
  364. .It Sy zfs_metaslab_max_size_cache_sec Ns = Ns Sy 3600 Ns s Po 1 hour Pc Pq u64
  365. When we unload a metaslab, we cache the size of the largest free chunk.
  366. We use that cached size to determine whether or not to load a metaslab
  367. for a given allocation.
  368. As more frees accumulate in that metaslab while it's unloaded,
  369. the cached max size becomes less and less accurate.
  370. After a number of seconds controlled by this tunable,
  371. we stop considering the cached max size and start
  372. considering only the histogram instead.
  373. .
  374. .It Sy zfs_metaslab_mem_limit Ns = Ns Sy 25 Ns % Pq uint
  375. When we are loading a new metaslab, we check the amount of memory being used
  376. to store metaslab range trees.
  377. If it is over a threshold, we attempt to unload the least recently used metaslab
  378. to prevent the system from clogging all of its memory with range trees.
  379. This tunable sets the percentage of total system memory that is the threshold.
  380. .
  381. .It Sy zfs_metaslab_try_hard_before_gang Ns = Ns Sy 0 Ns | Ns 1 Pq int
  382. .Bl -item -compact
  383. .It
  384. If unset, we will first try normal allocation.
  385. .It
  386. If that fails then we will do a gang allocation.
  387. .It
  388. If that fails then we will do a "try hard" gang allocation.
  389. .It
  390. If that fails then we will have a multi-layer gang block.
  391. .El
  392. .Pp
  393. .Bl -item -compact
  394. .It
  395. If set, we will first try normal allocation.
  396. .It
  397. If that fails then we will do a "try hard" allocation.
  398. .It
  399. If that fails we will do a gang allocation.
  400. .It
  401. If that fails we will do a "try hard" gang allocation.
  402. .It
  403. If that fails then we will have a multi-layer gang block.
  404. .El
  405. .
  406. .It Sy zfs_metaslab_find_max_tries Ns = Ns Sy 100 Pq uint
  407. When not trying hard, we only consider this number of the best metaslabs.
  408. This improves performance, especially when there are many metaslabs per vdev
  409. and the allocation can't actually be satisfied
  410. (so we would otherwise iterate all metaslabs).
  411. .
  412. .It Sy zfs_vdev_default_ms_count Ns = Ns Sy 200 Pq uint
  413. When a vdev is added, target this number of metaslabs per top-level vdev.
  414. .
  415. .It Sy zfs_vdev_default_ms_shift Ns = Ns Sy 29 Po 512 MiB Pc Pq uint
  416. Default lower limit for metaslab size.
  417. .
  418. .It Sy zfs_vdev_max_ms_shift Ns = Ns Sy 34 Po 16 GiB Pc Pq uint
  419. Default upper limit for metaslab size.
  420. .
  421. .It Sy zfs_vdev_max_auto_ashift Ns = Ns Sy 14 Pq uint
  422. Maximum ashift used when optimizing for logical \[->] physical sector size on
  423. new
  424. top-level vdevs.
  425. May be increased up to
  426. .Sy ASHIFT_MAX Po 16 Pc ,
  427. but this may negatively impact pool space efficiency.
  428. .
  429. .It Sy zfs_vdev_direct_write_verify Ns = Ns Sy Linux 1 | FreeBSD 0 Pq uint
  430. If non-zero, then a Direct I/O write's checksum will be verified every
  431. time the write is issued and before it is committed to the block pointer.
  432. In the event the checksum is not valid then the I/O operation will return EIO.
  433. This module parameter can be used to detect if the
  434. contents of the user's buffer have changed in the process of doing a Direct I/O
  435. write.
  436. It can also help to identify if reported checksum errors are tied to Direct I/O
  437. writes.
  438. Each verify error causes a
  439. .Sy dio_verify_wr
  440. zevent.
  441. Direct Write I/O checksum verify errors can be seen with
  442. .Nm zpool Cm status Fl d .
  443. The default value for this is 1 on Linux, but is 0 for
  444. .Fx
  445. because user pages can be placed under write protection in
  446. .Fx
  447. before the Direct I/O write is issued.
  448. .
  449. .It Sy zfs_vdev_min_auto_ashift Ns = Ns Sy ASHIFT_MIN Po 9 Pc Pq uint
  450. Minimum ashift used when creating new top-level vdevs.
  451. .
  452. .It Sy zfs_vdev_min_ms_count Ns = Ns Sy 16 Pq uint
  453. Minimum number of metaslabs to create in a top-level vdev.
  454. .
  455. .It Sy vdev_validate_skip Ns = Ns Sy 0 Ns | Ns 1 Pq int
  456. Skip label validation steps during pool import.
  457. Changing is not recommended unless you know what you're doing
  458. and are recovering a damaged label.
  459. .
  460. .It Sy zfs_vdev_ms_count_limit Ns = Ns Sy 131072 Po 128k Pc Pq uint
  461. Practical upper limit of total metaslabs per top-level vdev.
  462. .
  463. .It Sy metaslab_preload_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  464. Enable metaslab group preloading.
  465. .
  466. .It Sy metaslab_preload_limit Ns = Ns Sy 10 Pq uint
  467. Maximum number of metaslabs per group to preload
  468. .
  469. .It Sy metaslab_preload_pct Ns = Ns Sy 50 Pq uint
  470. Percentage of CPUs to run a metaslab preload taskq
  471. .
  472. .It Sy metaslab_lba_weighting_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  473. Give more weight to metaslabs with lower LBAs,
  474. assuming they have greater bandwidth,
  475. as is typically the case on a modern constant angular velocity disk drive.
  476. .
  477. .It Sy metaslab_unload_delay Ns = Ns Sy 32 Pq uint
  478. After a metaslab is used, we keep it loaded for this many TXGs, to attempt to
  479. reduce unnecessary reloading.
  480. Note that both this many TXGs and
  481. .Sy metaslab_unload_delay_ms
  482. milliseconds must pass before unloading will occur.
  483. .
  484. .It Sy metaslab_unload_delay_ms Ns = Ns Sy 600000 Ns ms Po 10 min Pc Pq uint
  485. After a metaslab is used, we keep it loaded for this many milliseconds,
  486. to attempt to reduce unnecessary reloading.
  487. Note, that both this many milliseconds and
  488. .Sy metaslab_unload_delay
  489. TXGs must pass before unloading will occur.
  490. .
  494. .It Sy raidz_expand_max_copy_bytes Ns = Ns Sy 160MB Pq ulong
  495. Max amount of memory to use for RAID-Z expansion I/O.
  496. This limits how much I/O can be outstanding at once.
  497. .
  498. .It Sy raidz_expand_max_reflow_bytes Ns = Ns Sy 0 Pq ulong
  499. For testing, pause RAID-Z expansion when reflow amount reaches this value.
  500. .
  501. .It Sy raidz_io_aggregate_rows Ns = Ns Sy 4 Pq ulong
  502. For expanded RAID-Z, aggregate reads that have more rows than this.
  503. .
  504. .It Sy reference_history Ns = Ns Sy 3 Pq int
  505. Maximum reference holders being tracked when reference_tracking_enable is
  506. active.
  507. .
  508. .It Sy reference_tracking_enable Ns = Ns Sy 0 Ns | Ns 1 Pq int
  509. Track reference holders to
  510. .Sy refcount_t
  511. objects (debug builds only).
  512. .
  513. .It Sy send_holes_without_birth_time Ns = Ns Sy 1 Ns | Ns 0 Pq int
  514. When set, the
  515. .Sy hole_birth
  516. optimization will not be used, and all holes will always be sent during a
  517. .Nm zfs Cm send .
  518. This is useful if you suspect your datasets are affected by a bug in
  519. .Sy hole_birth .
  520. .
  521. .It Sy spa_config_path Ns = Ns Pa /etc/zfs/zpool.cache Pq charp
  522. SPA config file.
  523. .
  524. .It Sy spa_asize_inflation Ns = Ns Sy 24 Pq uint
  525. Multiplication factor used to estimate actual disk consumption from the
  526. size of data being written.
  527. The default value is a worst case estimate,
  528. but lower values may be valid for a given pool depending on its configuration.
  529. Pool administrators who understand the factors involved
  530. may wish to specify a more realistic inflation factor,
  531. particularly if they operate close to quota or capacity limits.
  532. .
  533. .It Sy spa_load_print_vdev_tree Ns = Ns Sy 0 Ns | Ns 1 Pq int
  534. Whether to print the vdev tree in the debugging message buffer during pool
  535. import.
  536. .
  537. .It Sy spa_load_verify_data Ns = Ns Sy 1 Ns | Ns 0 Pq int
  538. Whether to traverse data blocks during an "extreme rewind"
  539. .Pq Fl X
  540. import.
  541. .Pp
  542. An extreme rewind import normally performs a full traversal of all
  543. blocks in the pool for verification.
  544. If this parameter is unset, the traversal skips non-metadata blocks.
  545. It can be toggled once the
  546. import has started to stop or start the traversal of non-metadata blocks.
  547. .
  548. .It Sy spa_load_verify_metadata Ns = Ns Sy 1 Ns | Ns 0 Pq int
  549. Whether to traverse blocks during an "extreme rewind"
  550. .Pq Fl X
  551. pool import.
  552. .Pp
  553. An extreme rewind import normally performs a full traversal of all
  554. blocks in the pool for verification.
  555. If this parameter is unset, the traversal is not performed.
  556. It can be toggled once the import has started to stop or start the traversal.
  557. .
  558. .It Sy spa_load_verify_shift Ns = Ns Sy 4 Po 1/16th Pc Pq uint
  559. Sets the maximum number of bytes to consume during pool import to the log2
  560. fraction of the target ARC size.
  561. .
  562. .It Sy spa_slop_shift Ns = Ns Sy 5 Po 1/32nd Pc Pq int
  563. Normally, we don't allow the last
  564. .Sy 3.2% Pq Sy 1/2^spa_slop_shift
  565. of space in the pool to be consumed.
  566. This ensures that we don't run the pool completely out of space,
  567. due to unaccounted changes (e.g. to the MOS).
  568. It also limits the worst-case time to allocate space.
  569. If we have less than this amount of free space,
  570. most ZPL operations (e.g. write, create) will return
  571. .Sy ENOSPC .
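.Pp
As an illustrative example (assuming a 1 TiB pool), the default shift of 5
reserves roughly
.Dl 1 TiB / 2^5 = 32 GiB
as slop space that normal ZPL operations cannot consume.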
  572. .
  573. .It Sy spa_num_allocators Ns = Ns Sy 4 Pq int
  574. Determines the number of block allocators to use per spa instance.
  575. Capped by the number of actual CPUs in the system via
  576. .Sy spa_cpus_per_allocator .
  577. .Pp
  578. Note that setting this value too high could result in performance
  579. degradation and/or excess fragmentation.
  580. The set value only applies to pools imported or created afterwards.
  581. .
  582. .It Sy spa_cpus_per_allocator Ns = Ns Sy 4 Pq int
  583. Determines the minimum number of CPUs in the system per block allocator
  584. of a spa instance.
  585. The set value only applies to pools imported or created afterwards.
  586. .
  587. .It Sy spa_upgrade_errlog_limit Ns = Ns Sy 0 Pq uint
  588. Limits the number of on-disk error log entries that will be converted to the
  589. new format when enabling the
  590. .Sy head_errlog
  591. feature.
  592. The default is to convert all log entries.
  593. .
  594. .It Sy vdev_removal_max_span Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq uint
  595. During top-level vdev removal, chunks of data are copied from the vdev
  596. which may include free space in order to trade bandwidth for IOPS.
  597. This parameter determines the maximum span of free space, in bytes,
  598. which will be included as "unnecessary" data in a chunk of copied data.
  599. .Pp
  600. The default value here was chosen to align with
  601. .Sy zfs_vdev_read_gap_limit ,
  602. which is a similar concept when doing
  603. regular reads (but there's no reason it has to be the same).
  604. .
  605. .It Sy vdev_file_logical_ashift Ns = Ns Sy 9 Po 512 B Pc Pq u64
  606. Logical ashift for file-based devices.
  607. .
  608. .It Sy vdev_file_physical_ashift Ns = Ns Sy 9 Po 512 B Pc Pq u64
  609. Physical ashift for file-based devices.
  610. .
  611. .It Sy zap_iterate_prefetch Ns = Ns Sy 1 Ns | Ns 0 Pq int
  612. If set, when we start iterating over a ZAP object,
  613. prefetch the entire object (all leaf blocks).
  614. However, this is limited by
  615. .Sy dmu_prefetch_max .
  616. .
  617. .It Sy zap_micro_max_size Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq int
  618. Maximum micro ZAP size.
  619. A "micro" ZAP is upgraded to a "fat" ZAP once it grows beyond the specified
  620. size.
  621. Sizes higher than 128KiB will be clamped to 128KiB unless the
  622. .Sy large_microzap
  623. feature is enabled.
  624. .
  625. .It Sy zap_shrink_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  626. If set, adjacent empty ZAP blocks will be collapsed, reducing disk space.
  627. .
  628. .It Sy zfetch_min_distance Ns = Ns Sy 4194304 Ns B Po 4 MiB Pc Pq uint
  629. Min bytes to prefetch per stream.
  630. Prefetch distance starts from the demand access size and quickly grows to
  631. this value, doubling on each hit.
  632. After that it may grow further by 1/8 per hit, but only if some prefetches
  633. since the last time have not completed in time to satisfy the demand request,
  634. i.e. the prefetch depth did not cover the read latency or the pool got saturated.
  635. .
  636. .It Sy zfetch_max_distance Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq uint
  637. Max bytes to prefetch per stream.
  638. .
  639. .It Sy zfetch_max_idistance Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq uint
  640. Max bytes to prefetch indirects for per stream.
  641. .
  642. .It Sy zfetch_max_reorder Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
  643. Requests within this byte distance from the current prefetch stream position
  644. are considered parts of the stream, reordered due to parallel processing.
  645. Such requests do not advance the stream position immediately unless
  646. .Sy zfetch_hole_shift
  647. fill threshold is reached, but are saved to fill holes in the stream later.
  648. .
  649. .It Sy zfetch_max_streams Ns = Ns Sy 8 Pq uint
  650. Max number of streams per zfetch (prefetch streams per file).
  651. .
  652. .It Sy zfetch_min_sec_reap Ns = Ns Sy 1 Pq uint
  653. Minimum time before an inactive prefetch stream can be reclaimed.
  654. .
  655. .It Sy zfetch_max_sec_reap Ns = Ns Sy 2 Pq uint
  656. Maximum time before an inactive prefetch stream can be deleted.
  657. .
  658. .It Sy zfs_abd_scatter_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  659. Enables the ARC to use scatter/gather lists; when disabled, all allocations
  660. are forced to be linear in kernel memory.
  661. Disabling can improve performance in some code paths
  662. at the expense of fragmented kernel memory.
  663. .
  664. .It Sy zfs_abd_scatter_max_order Ns = Ns Sy MAX_ORDER\-1 Pq uint
  665. Maximum number of consecutive memory pages allocated in a single block for
  666. scatter/gather lists.
  667. .Pp
  668. The value of
  669. .Sy MAX_ORDER
  670. depends on kernel configuration.
  671. .
  672. .It Sy zfs_abd_scatter_min_size Ns = Ns Sy 1536 Ns B Po 1.5 KiB Pc Pq uint
  673. This is the minimum allocation size that will use scatter (page-based) ABDs.
  674. Smaller allocations will use linear ABDs.
  675. .
  676. .It Sy zfs_arc_dnode_limit Ns = Ns Sy 0 Ns B Pq u64
  677. When the number of bytes consumed by dnodes in the ARC exceeds this number of
  678. bytes, try to unpin some of it in response to demand for non-metadata.
  679. This value acts as a ceiling to the amount of dnode metadata, and defaults to
  680. .Sy 0 ,
  681. which indicates that the limit is instead determined as a percentage
  682. .Pq Sy zfs_arc_dnode_limit_percent
  683. of the ARC meta buffers that may be used for dnodes.
  684. .It Sy zfs_arc_dnode_limit_percent Ns = Ns Sy 10 Ns % Pq u64
  685. Percentage that can be consumed by dnodes of ARC meta buffers.
  686. .Pp
  687. See also
  688. .Sy zfs_arc_dnode_limit ,
  689. which serves a similar purpose but has a higher priority if nonzero.
  690. .
  691. .It Sy zfs_arc_dnode_reduce_percent Ns = Ns Sy 10 Ns % Pq u64
  692. Percentage of ARC dnodes to try to scan in response to demand for non-metadata
  693. when the number of bytes consumed by dnodes exceeds
  694. .Sy zfs_arc_dnode_limit .
  695. .
  696. .It Sy zfs_arc_average_blocksize Ns = Ns Sy 8192 Ns B Po 8 KiB Pc Pq uint
  697. The ARC's buffer hash table is sized based on the assumption of an average
  698. block size of this value.
  699. This works out to roughly 1 MiB of hash table per 1 GiB of physical memory
  700. with 8-byte pointers.
  701. For configurations with a known larger average block size,
  702. this value can be increased to reduce the memory footprint.
  703. .
  704. .It Sy zfs_arc_eviction_pct Ns = Ns Sy 200 Ns % Pq uint
  705. When
  706. .Fn arc_is_overflowing ,
  707. .Fn arc_get_data_impl
  708. waits for this percent of the requested amount of data to be evicted.
  709. For example, by default, for every
  710. .Em 2 KiB
  711. that's evicted,
  712. .Em 1 KiB
  713. of it may be "reused" by a new allocation.
  714. Since this is above
  715. .Sy 100 Ns % ,
  716. it ensures that progress is made towards getting
  717. .Sy arc_size No under Sy arc_c .
  718. Since this is finite, it ensures that allocations can still happen,
  719. even during the potentially long time that
  720. .Sy arc_size No is more than Sy arc_c .
  721. .
  722. .It Sy zfs_arc_evict_batch_limit Ns = Ns Sy 10 Pq uint
  723. Number of ARC headers to evict per sub-list before proceeding to another sub-list.
  724. This batch-style operation prevents entire sub-lists from being evicted at once
  725. but comes at a cost of additional unlocking and locking.
  726. .
  727. .It Sy zfs_arc_grow_retry Ns = Ns Sy 0 Ns s Pq uint
  728. If set to a non-zero value, it will replace the
  729. .Sy arc_grow_retry
  730. value with this value.
  731. The
  732. .Sy arc_grow_retry
  733. .No value Pq default Sy 5 Ns s
  734. is the number of seconds the ARC will wait before
  735. trying to resume growth after a memory pressure event.
  736. .
  737. .It Sy zfs_arc_lotsfree_percent Ns = Ns Sy 10 Ns % Pq int
  738. Throttle I/O when free system memory drops below this percentage of total
  739. system memory.
  740. Setting this value to
  741. .Sy 0
  742. will disable the throttle.
  743. .
  744. .It Sy zfs_arc_max Ns = Ns Sy 0 Ns B Pq u64
  745. Max size of ARC in bytes.
  746. If
  747. .Sy 0 ,
  748. then the max size of ARC is determined by the amount of system memory installed.
  749. The larger of
  750. .Sy all_system_memory No \- Sy 1 GiB
  751. and
  752. .Sy 5/8 No \(mu Sy all_system_memory
  753. will be used as the limit.
  754. This value must be at least
  755. .Sy 67108864 Ns B Pq 64 MiB .
  756. .Pp
  757. This value can be changed dynamically, with some caveats.
  758. It cannot be set back to
  759. .Sy 0
  760. while running, and reducing it below the current ARC size will not cause
  761. the ARC to shrink without memory pressure to induce shrinking.
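.Pp
As an illustrative example (assuming 16 GiB of installed memory and this
parameter left at 0), the default limit works out to
.Dl max(16 GiB \- 1 GiB, 5/8 \(mu 16 GiB) = max(15 GiB, 10 GiB) = 15 GiB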
  762. .
  763. .It Sy zfs_arc_meta_balance Ns = Ns Sy 500 Pq uint
  764. Balance between metadata and data on ghost hits.
  765. Values above 100 increase metadata caching by proportionally reducing effect
  766. of ghost data hits on target data/metadata rate.
  767. .
  768. .It Sy zfs_arc_min Ns = Ns Sy 0 Ns B Pq u64
  769. Min size of ARC in bytes.
  770. .No If set to Sy 0 , arc_c_min
  771. will default to consuming the larger of
  772. .Sy 32 MiB
  773. and
  774. .Sy all_system_memory No / Sy 32 .
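.Pp
As an illustrative example (assuming 64 GiB of installed memory and this
parameter left at 0),
.Sy arc_c_min
would default to
.Dl max(32 MiB, 64 GiB / 32) = 2 GiB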
  775. .
  776. .It Sy zfs_arc_min_prefetch_ms Ns = Ns Sy 0 Ns ms Ns Po Ns ≡ Ns 1s Pc Pq uint
  777. Minimum time prefetched blocks are locked in the ARC.
  778. .
  779. .It Sy zfs_arc_min_prescient_prefetch_ms Ns = Ns Sy 0 Ns ms Ns Po Ns ≡ Ns 6s Pc Pq uint
  780. Minimum time "prescient prefetched" blocks are locked in the ARC.
  781. These blocks are meant to be prefetched fairly aggressively ahead of
  782. the code that may use them.
  783. .
  784. .It Sy zfs_arc_prune_task_threads Ns = Ns Sy 1 Pq int
  785. Number of arc_prune threads.
  786. .Fx
  787. does not need more than one.
  788. Linux may theoretically use one per mount point, up to the number of CPUs,
  789. but that has not been proven to be useful.
  790. .
  791. .It Sy zfs_max_missing_tvds Ns = Ns Sy 0 Pq int
  792. Number of missing top-level vdevs which will be allowed during
  793. pool import (only in read-only mode).
  794. .
  795. .It Sy zfs_max_nvlist_src_size Ns = Ns Sy 0 Pq u64
  796. Maximum size in bytes allowed to be passed as
  797. .Sy zc_nvlist_src_size
  798. for ioctls on
  799. .Pa /dev/zfs .
  800. This prevents a user from causing the kernel to allocate
  801. an excessive amount of memory.
  802. When the limit is exceeded, the ioctl fails with
  803. .Sy EINVAL
  804. and a description of the error is sent to the
  805. .Pa zfs-dbgmsg
  806. log.
  807. This parameter should not need to be touched under normal circumstances.
  808. If
  809. .Sy 0 ,
  810. equivalent to a quarter of the user-wired memory limit under
  811. .Fx
  812. and to
  813. .Sy 134217728 Ns B Pq 128 MiB
  814. under Linux.
  815. .
  816. .It Sy zfs_multilist_num_sublists Ns = Ns Sy 0 Pq uint
  817. To allow more fine-grained locking, each ARC state contains a series
  818. of lists for both data and metadata objects.
  819. Locking is performed at the level of these "sub-lists".
  820. This parameter controls the number of sub-lists per ARC state,
  821. and also applies to other uses of the multilist data structure.
  822. .Pp
  823. If
  824. .Sy 0 ,
  825. equivalent to the greater of the number of online CPUs and
  826. .Sy 4 .
  827. .
  828. .It Sy zfs_arc_overflow_shift Ns = Ns Sy 8 Pq int
  829. The ARC size is considered to be overflowing if it exceeds the current
  830. ARC target size
  831. .Pq Sy arc_c
  832. by thresholds determined by this parameter.
  833. Exceeding by
  834. .Sy ( arc_c No >> Sy zfs_arc_overflow_shift ) No / Sy 2
  835. starts the ARC reclamation process.
  836. If that appears insufficient, exceeding by
  837. .Sy ( arc_c No >> Sy zfs_arc_overflow_shift ) No \(mu Sy 1.5
  838. blocks new buffer allocation until the reclaim thread catches up.
  839. The started reclamation process continues until the ARC size returns below the
  840. target size.
  841. .Pp
  842. The default value of
  843. .Sy 8
  844. causes the ARC to start reclamation if it exceeds the target size by
  845. .Em 0.2%
  846. of the target size, and block allocations by
  847. .Em 0.6% .
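.Pp
As an illustrative example (assuming an
.Sy arc_c
of 4 GiB), the default shift of 8 starts reclamation once the ARC exceeds the
target by
.Dl (4 GiB >> 8) / 2 = 8 MiB
and blocks new allocations once it exceeds the target by
.Dl (4 GiB >> 8) \(mu 1.5 = 24 MiB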
  848. .
  849. .It Sy zfs_arc_shrink_shift Ns = Ns Sy 0 Pq uint
  850. If nonzero, this will update
  851. .Sy arc_shrink_shift Pq default Sy 7
  852. with the new value.
  853. .
  854. .It Sy zfs_arc_pc_percent Ns = Ns Sy 0 Ns % Po off Pc Pq uint
  855. Percent of pagecache to reclaim ARC to.
  856. .Pp
  857. This tunable allows the ZFS ARC to play more nicely
  858. with the kernel's LRU pagecache.
  859. It can guarantee that the ARC size won't collapse under scanning
  860. pressure on the pagecache, yet still allows the ARC to be reclaimed down to
  861. .Sy zfs_arc_min
  862. if necessary.
  863. This value is specified as percent of pagecache size (as measured by
  864. .Sy NR_FILE_PAGES ) ,
  865. where that percent may exceed
  866. .Sy 100 .
  867. This
  868. only operates during memory pressure/reclaim.
  869. .
  870. .It Sy zfs_arc_shrinker_limit Ns = Ns Sy 0 Pq int
  871. This is a limit on how many pages the ARC shrinker makes available for
  872. eviction in response to one page allocation attempt.
  873. Note that in practice, the kernel's shrinker can ask us to evict
  874. up to about four times this for one allocation attempt.
  875. To reduce OOM risk, this limit is applied for kswapd reclaims only.
  876. .Pp
  877. For example, a value of
  878. .Sy 10000 Pq in practice, Em 160 MiB No per allocation attempt with 4 KiB pages
  879. limits the amount of time spent attempting to reclaim ARC memory to
  880. less than 100 ms per allocation attempt,
  881. even with a small average compressed block size of ~8 KiB.
  882. .Pp
  883. The parameter can be set to 0 (zero) to disable the limit,
  884. and only applies on Linux.
  885. .
  886. .It Sy zfs_arc_shrinker_seeks Ns = Ns Sy 2 Pq int
  887. Relative cost of ARC eviction on Linux, AKA the number of seeks needed to
  888. restore an evicted page.
  889. Bigger values make the ARC more precious and evictions smaller, compared to
  890. other kernel subsystems.
  891. A value of 4 means parity with the page cache.
  892. .
  893. .It Sy zfs_arc_sys_free Ns = Ns Sy 0 Ns B Pq u64
  894. The target number of bytes the ARC should leave as free memory on the system.
  895. If zero, equivalent to the bigger of
  896. .Sy 512 KiB No and Sy all_system_memory/64 .
  897. .
  898. .It Sy zfs_autoimport_disable Ns = Ns Sy 1 Ns | Ns 0 Pq int
  899. Disable pool import at module load by ignoring the cache file
  900. .Pq Sy spa_config_path .
  901. .
  902. .It Sy zfs_checksum_events_per_second Ns = Ns Sy 20 Ns /s Pq uint
  903. Rate limit checksum events to this many per second.
  904. Note that this should not be set below the ZED thresholds
  905. (currently 10 checksums over 10 seconds)
  906. or else the daemon may not trigger any action.
  907. .
  908. .It Sy zfs_commit_timeout_pct Ns = Ns Sy 10 Ns % Pq uint
  909. This controls the amount of time that a ZIL block (lwb) will remain "open"
  910. when it isn't "full", and it has a thread waiting for it to be committed to
  911. stable storage.
  912. The timeout is scaled based on a percentage of the last lwb
  913. latency to avoid significantly impacting the latency of each individual
  914. transaction record (itx).
  915. .
  916. .It Sy zfs_condense_indirect_commit_entry_delay_ms Ns = Ns Sy 0 Ns ms Pq int
  917. Vdev indirection layer (used for device removal) sleeps for this many
  918. milliseconds during mapping generation.
  919. Intended for use with the test suite to throttle vdev removal speed.
  920. .
  921. .It Sy zfs_condense_indirect_obsolete_pct Ns = Ns Sy 25 Ns % Pq uint
  922. Minimum percent of obsolete bytes in vdev mapping required to attempt to
  923. condense
  924. .Pq see Sy zfs_condense_indirect_vdevs_enable .
  925. Intended for use with the test suite
  926. to facilitate triggering condensing as needed.
  927. .
  928. .It Sy zfs_condense_indirect_vdevs_enable Ns = Ns Sy 1 Ns | Ns 0 Pq int
  929. Enable condensing indirect vdev mappings.
  930. When set, attempt to condense indirect vdev mappings
  931. if the mapping uses more than
  932. .Sy zfs_condense_min_mapping_bytes
  933. bytes of memory and if the obsolete space map object uses more than
  934. .Sy zfs_condense_max_obsolete_bytes
  935. bytes on-disk.
  936. The condensing process is an attempt to save memory by removing obsolete
  937. mappings.
  938. .
  939. .It Sy zfs_condense_max_obsolete_bytes Ns = Ns Sy 1073741824 Ns B Po 1 GiB Pc Pq u64
  940. Only attempt to condense indirect vdev mappings if the on-disk size
  941. of the obsolete space map object is greater than this number of bytes
  942. .Pq see Sy zfs_condense_indirect_vdevs_enable .
  943. .
  944. .It Sy zfs_condense_min_mapping_bytes Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq u64
  945. Minimum size vdev mapping to attempt to condense
  946. .Pq see Sy zfs_condense_indirect_vdevs_enable .
  947. .
  948. .It Sy zfs_dbgmsg_enable Ns = Ns Sy 1 Ns | Ns 0 Pq int
  949. Internally ZFS keeps a small log to facilitate debugging.
  950. The log is enabled by default, and can be disabled by unsetting this option.
  951. The contents of the log can be accessed by reading
  952. .Pa /proc/spl/kstat/zfs/dbgmsg .
  953. Writing
  954. .Sy 0
  955. to the file clears the log.
  956. .Pp
  957. This setting does not influence debug prints due to
  958. .Sy zfs_flags .
  959. .
  960. .It Sy zfs_dbgmsg_maxsize Ns = Ns Sy 4194304 Ns B Po 4 MiB Pc Pq uint
  961. Maximum size of the internal ZFS debug log.
  962. .
  963. .It Sy zfs_dbuf_state_index Ns = Ns Sy 0 Pq int
  964. Historically used for controlling what reporting was available under
  965. .Pa /proc/spl/kstat/zfs .
  966. No effect.
  967. .
  968. .It Sy zfs_deadman_checktime_ms Ns = Ns Sy 60000 Ns ms Po 1 min Pc Pq u64
  969. Check time in milliseconds.
  970. This defines the frequency at which we check for hung I/O requests
  971. and potentially invoke the
  972. .Sy zfs_deadman_failmode
  973. behavior.
  974. .
  975. .It Sy zfs_deadman_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  976. When a pool sync operation takes longer than
  977. .Sy zfs_deadman_synctime_ms ,
  978. or when an individual I/O operation takes longer than
  979. .Sy zfs_deadman_ziotime_ms ,
  980. then the operation is considered to be "hung".
  981. If
  982. .Sy zfs_deadman_enabled
  983. is set, then the deadman behavior is invoked as described by
  984. .Sy zfs_deadman_failmode .
  985. By default, the deadman is enabled and set to
  986. .Sy wait
  987. which results in "hung" I/O operations only being logged.
  988. The deadman is automatically disabled when a pool gets suspended.
  989. .
  990. .It Sy zfs_deadman_events_per_second Ns = Ns Sy 1 Ns /s Pq int
  991. Rate limit deadman zevents (which report hung I/O operations) to this many per
  992. second.
  993. .
  994. .It Sy zfs_deadman_failmode Ns = Ns Sy wait Pq charp
  995. Controls the failure behavior when the deadman detects a "hung" I/O operation.
  996. Valid values are:
  997. .Bl -tag -compact -offset 4n -width "continue"
  998. .It Sy wait
  999. Wait for a "hung" operation to complete.
  1000. For each "hung" operation a "deadman" event will be posted
  1001. describing that operation.
  1002. .It Sy continue
  1003. Attempt to recover from a "hung" operation by re-dispatching it
  1004. to the I/O pipeline if possible.
  1005. .It Sy panic
  1006. Panic the system.
  1007. This can be used to facilitate automatic fail-over
  1008. to a properly configured fail-over partner.
  1009. .El
  1010. .
  1011. .It Sy zfs_deadman_synctime_ms Ns = Ns Sy 600000 Ns ms Po 10 min Pc Pq u64
  1012. Interval in milliseconds after which the deadman is triggered and also
  1013. the interval after which a pool sync operation is considered to be "hung".
  1014. Once this limit is exceeded the deadman will be invoked every
  1015. .Sy zfs_deadman_checktime_ms
  1016. milliseconds until the pool sync completes.
  1017. .
  1018. .It Sy zfs_deadman_ziotime_ms Ns = Ns Sy 300000 Ns ms Po 5 min Pc Pq u64
  1019. Interval in milliseconds after which the deadman is triggered and an
  1020. individual I/O operation is considered to be "hung".
  1021. As long as the operation remains "hung",
  1022. the deadman will be invoked every
  1023. .Sy zfs_deadman_checktime_ms
  1024. milliseconds until the operation completes.
  1025. .
  1026. .It Sy zfs_dedup_prefetch Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1027. Enable prefetching dedup-ed blocks which are going to be freed.
  1028. .
  1029. .It Sy zfs_dedup_log_flush_passes_max Ns = Ns Sy 8 Ns Pq uint
  1030. Maximum number of dedup log flush passes (iterations) each transaction.
  1031. .Pp
  1032. At the start of each transaction, OpenZFS will estimate how many entries it
  1033. needs to flush out to keep up with the change rate, taking the amount and time
  1034. taken to flush on previous txgs into account (see
  1035. .Sy zfs_dedup_log_flush_flow_rate_txgs ) .
  1036. It will spread this amount into a number of passes.
  1037. At each pass, it will use the amount already flushed and the total time taken
  1038. by flushing and by other IO to recompute how much it should do for the remainder
  1039. of the txg.
  1040. .Pp
  1041. Reducing the max number of passes will make flushing more aggressive, flushing
  1042. out more entries on each pass.
  1043. This can be faster, but also more likely to compete with other IO.
  1044. Increasing the max number of passes will put fewer entries onto each pass,
  1045. keeping the overhead of dedup changes to a minimum but possibly causing a large
  1046. number of changes to be dumped on the last pass, which can blow out the txg
  1047. sync time beyond
  1048. .Sy zfs_txg_timeout .
  1049. .
  1050. .It Sy zfs_dedup_log_flush_min_time_ms Ns = Ns Sy 1000 Ns Pq uint
  1051. Minimum time to spend on dedup log flush each transaction.
  1052. .Pp
  1053. At least this long will be spent flushing dedup log entries each transaction,
  1054. up to
  1055. .Sy zfs_txg_timeout .
  1056. This occurs even if doing so would delay the transaction, that is, other IO
  1057. completes under this time.
  1058. .
  1059. .It Sy zfs_dedup_log_flush_entries_min Ns = Ns Sy 1000 Ns Pq uint
  1060. Flush at least this many entries each transaction.
  1061. .Pp
  1062. OpenZFS will estimate how many entries it needs to flush each transaction to
  1063. keep up with the ingest rate (see
  1064. .Sy zfs_dedup_log_flush_flow_rate_txgs ) .
  1065. This sets the minimum for that estimate.
  1066. Raising it can force OpenZFS to flush more aggressively, keeping the log small
  1067. and so reducing pool import times, but can make it less able to back off if
  1068. log flushing would compete with other IO too much.
  1069. .
  1070. .It Sy zfs_dedup_log_flush_flow_rate_txgs Ns = Ns Sy 10 Ns Pq uint
  1071. Number of transactions to use to compute the flow rate.
  1072. .Pp
  1073. OpenZFS will estimate how many entries it needs to flush each transaction by
  1074. monitoring the number of entries changed (ingest rate), number of entries
  1075. flushed (flush rate) and time spent flushing (flush time rate) and combining
  1076. these into an overall "flow rate".
  1077. It will use an exponential weighted moving average over some number of recent
  1078. transactions to compute these rates.
  1079. This sets the number of transactions to compute these averages over.
  1080. Setting it higher can help to smooth out the flow rate in the face of spiky
  1081. workloads, but will take longer for the flow rate to adjust to a sustained
  1082. change in the ingress rate.
  1083. .
  1084. .It Sy zfs_dedup_log_txg_max Ns = Ns Sy 8 Ns Pq uint
  1085. Max transactions to accumulate before starting to flush dedup logs.
  1086. .Pp
  1087. OpenZFS maintains two dedup logs, one receiving new changes, one flushing.
  1088. If there is nothing to flush, it will accumulate changes for no more than this
  1089. many transactions before switching the logs and starting to flush entries out.
  1090. .
  1091. .It Sy zfs_dedup_log_mem_max Ns = Ns Sy 0 Ns Pq u64
  1092. Max memory to use for dedup logs.
  1093. .Pp
  1094. OpenZFS will spend no more than this much memory on maintaining the in-memory
  1095. dedup log.
  1096. Flushing will begin when around half this amount is being spent on logs.
  1097. The default value of
  1098. .Sy 0
  1099. will cause it to be set by
  1100. .Sy zfs_dedup_log_mem_max_percent
  1101. instead.
  1102. .
  1103. .It Sy zfs_dedup_log_mem_max_percent Ns = Ns Sy 1 Ns % Pq uint
  1104. Max memory to use for dedup logs, as a percentage of total memory.
  1105. .Pp
  1106. If
  1107. .Sy zfs_dedup_log_mem_max
  1108. is not set, it will be initialised as a percentage of the total memory in the
  1109. system.
  1110. .
  1111. .It Sy zfs_delay_min_dirty_percent Ns = Ns Sy 60 Ns % Pq uint
  1112. Start to delay each transaction once there is this amount of dirty data,
  1113. expressed as a percentage of
  1114. .Sy zfs_dirty_data_max .
  1115. This value should be at least
  1116. .Sy zfs_vdev_async_write_active_max_dirty_percent .
  1117. .No See Sx ZFS TRANSACTION DELAY .
  1118. .
  1119. .It Sy zfs_delay_scale Ns = Ns Sy 500000 Pq int
  1120. This controls how quickly the transaction delay approaches infinity.
  1121. Larger values cause longer delays for a given amount of dirty data.
  1122. .Pp
  1123. For the smoothest delay, this value should be about 1 billion divided
  1124. by the maximum number of operations per second.
  1125. This will smoothly handle between ten times and a tenth of this number.
  1126. .No See Sx ZFS TRANSACTION DELAY .
  1127. .Pp
  1128. .Sy zfs_delay_scale No \(mu Sy zfs_dirty_data_max Em must No be smaller than Sy 2^64 .
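.Pp
As an illustrative example (assuming a pool that can sustain roughly 2000
write operations per second), the smoothest delay would be obtained with
.Dl zfs_delay_scale = 10^9 / 2000 = 500000
which is the default.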
  1129. .
  1130. .It Sy zfs_dio_write_verify_events_per_second Ns = Ns Sy 20 Ns /s Pq uint
  1131. Rate limit Direct I/O write verify events to this many per second.
  1132. .
  1133. .It Sy zfs_disable_ivset_guid_check Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1134. Disables requirement for IVset GUIDs to be present and match when doing a raw
  1135. receive of encrypted datasets.
  1136. Intended for users whose pools were created with
  1137. OpenZFS pre-release versions and now have compatibility issues.
  1138. .
  1139. .It Sy zfs_key_max_salt_uses Ns = Ns Sy 400000000 Po 4*10^8 Pc Pq ulong
  1140. Maximum number of uses of a single salt value before generating a new one for
  1141. encrypted datasets.
  1142. The default value is also the maximum.
  1143. .
  1144. .It Sy zfs_object_mutex_size Ns = Ns Sy 64 Pq uint
  1145. Size of the znode hashtable used for holds.
  1146. .Pp
  1147. Due to the need to hold locks on objects that may not exist yet, kernel mutexes
  1148. are not created per-object and instead a hashtable is used where collisions
  1149. will result in objects waiting when there is not actually contention on the
  1150. same object.
  1151. .
  1152. .It Sy zfs_slow_io_events_per_second Ns = Ns Sy 20 Ns /s Pq int
  1153. Rate limit delay zevents (which report slow I/O operations) to this many per
  1154. second.
  1155. .
  1156. .It Sy zfs_unflushed_max_mem_amt Ns = Ns Sy 1073741824 Ns B Po 1 GiB Pc Pq u64
  1157. Upper-bound limit for unflushed metadata changes to be held by the
  1158. log spacemap in memory, in bytes.
  1159. .
  1160. .It Sy zfs_unflushed_max_mem_ppm Ns = Ns Sy 1000 Ns ppm Po 0.1% Pc Pq u64
  1161. Part of overall system memory that ZFS allows to be used
  1162. for unflushed metadata changes by the log spacemap, in millionths.
  1163. .
  1164. .It Sy zfs_unflushed_log_block_max Ns = Ns Sy 131072 Po 128k Pc Pq u64
  1165. Describes the maximum number of log spacemap blocks allowed for each pool.
  1166. The default value means that the space in all the log spacemaps
  1167. can add up to no more than
  1168. .Sy 131072
  1169. blocks (which means
  1170. .Em 16 GiB
  1171. of logical space before compression and ditto blocks,
  1172. assuming that blocksize is
  1173. .Em 128 KiB ) .
  1174. .Pp
  1175. This tunable is important because it involves a trade-off between import
  1176. time after an unclean export and the frequency of flushing metaslabs.
  1177. The higher this number is, the more log blocks we allow when the pool is
  1178. active which means that we flush metaslabs less often and thus decrease
  1179. the number of I/O operations for spacemap updates per TXG.
  1180. At the same time though, that means that in the event of an unclean export,
  1181. there will be more log spacemap blocks for us to read, inducing overhead
  1182. in the import time of the pool.
  1183. The lower the number, the more flushing occurs, destroying log
  1184. blocks more quickly as they become obsolete faster, which leaves fewer blocks
  1185. to be read during import time after a crash.
  1186. .Pp
  1187. Each log spacemap block existing during pool import leads to approximately
  1188. one extra logical I/O issued.
  1189. This is the reason why this tunable is exposed in terms of blocks rather
  1190. than space used.
  1191. .
  1192. .It Sy zfs_unflushed_log_block_min Ns = Ns Sy 1000 Pq u64
  1193. If the number of metaslabs is small and our incoming rate is high,
  1194. we could get into a situation that we are flushing all our metaslabs every TXG.
  1195. Thus we always allow at least this many log blocks.
  1196. .
  1197. .It Sy zfs_unflushed_log_block_pct Ns = Ns Sy 400 Ns % Pq u64
  1198. Tunable used to determine the number of blocks that can be used for
  1199. the spacemap log, expressed as a percentage of the total number of
  1200. unflushed metaslabs in the pool.
  1201. .
  1202. .It Sy zfs_unflushed_log_txg_max Ns = Ns Sy 1000 Pq u64
  1203. Tunable limiting maximum time in TXGs any metaslab may remain unflushed.
  1204. It effectively limits the maximum number of unflushed per-TXG spacemap logs
  1205. that need to be read after unclean pool export.
  1206. .
  1207. .It Sy zfs_unlink_suspend_progress Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  1208. When enabled, files will not be asynchronously removed from the list of pending
  1209. unlinks and the space they consume will be leaked.
  1210. Once this option has been disabled and the dataset is remounted,
  1211. the pending unlinks will be processed and the freed space returned to the pool.
  1212. This option is used by the test suite.
  1213. .
  1214. .It Sy zfs_delete_blocks Ns = Ns Sy 20480 Pq ulong
  1215. This is used to define a large file for the purposes of deletion.
  1216. Files containing more than
  1217. .Sy zfs_delete_blocks
  1218. will be deleted asynchronously, while smaller files are deleted synchronously.
  1219. Decreasing this value will reduce the time spent in an
  1220. .Xr unlink 2
  1221. system call, at the expense of a longer delay before the freed space is
  1222. available.
  1223. This only applies on Linux.
  1224. .
  1225. .It Sy zfs_dirty_data_max Ns = Pq int
  1226. Determines the dirty space limit in bytes.
  1227. Once this limit is exceeded, new writes are halted until space frees up.
  1228. This parameter takes precedence over
  1229. .Sy zfs_dirty_data_max_percent .
  1230. .No See Sx ZFS TRANSACTION DELAY .
  1231. .Pp
  1232. Defaults to
  1233. .Sy physical_ram/10 ,
  1234. capped at
  1235. .Sy zfs_dirty_data_max_max .
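.Pp
As a worked example, on a hypothetical system with 64 GiB of RAM the default
resolves as follows, and the value may later be changed through the module
parameter interface (the value written below is only an example):
.Bd -literal -compact
physical_ram/10                  = 6.4 GiB
cap: min(physical_ram/4, 4 GiB)  = 4 GiB
effective default                = 4 GiB

# echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max
.Ed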
  1236. .
  1237. .It Sy zfs_dirty_data_max_max Ns = Pq int
  1238. Maximum allowable value of
  1239. .Sy zfs_dirty_data_max ,
  1240. expressed in bytes.
  1241. This limit is only enforced at module load time, and will be ignored if
  1242. .Sy zfs_dirty_data_max
  1243. is later changed.
  1244. This parameter takes precedence over
  1245. .Sy zfs_dirty_data_max_max_percent .
  1246. .No See Sx ZFS TRANSACTION DELAY .
  1247. .Pp
  1248. Defaults to
  1249. .Sy min(physical_ram/4, 4GiB) ,
  1250. or
  1251. .Sy min(physical_ram/4, 1GiB)
  1252. for 32-bit systems.
  1253. .
  1254. .It Sy zfs_dirty_data_max_max_percent Ns = Ns Sy 25 Ns % Pq uint
  1255. Maximum allowable value of
  1256. .Sy zfs_dirty_data_max ,
  1257. expressed as a percentage of physical RAM.
  1258. This limit is only enforced at module load time, and will be ignored if
  1259. .Sy zfs_dirty_data_max
  1260. is later changed.
  1261. The parameter
  1262. .Sy zfs_dirty_data_max_max
  1263. takes precedence over this one.
  1264. .No See Sx ZFS TRANSACTION DELAY .
  1265. .
  1266. .It Sy zfs_dirty_data_max_percent Ns = Ns Sy 10 Ns % Pq uint
  1267. Determines the dirty space limit, expressed as a percentage of all memory.
  1268. Once this limit is exceeded, new writes are halted until space frees up.
  1269. The parameter
  1270. .Sy zfs_dirty_data_max
  1271. takes precedence over this one.
  1272. .No See Sx ZFS TRANSACTION DELAY .
  1273. .Pp
  1274. Subject to
  1275. .Sy zfs_dirty_data_max_max .
  1276. .
  1277. .It Sy zfs_dirty_data_sync_percent Ns = Ns Sy 20 Ns % Pq uint
  1278. Start syncing out a transaction group if there's at least this much dirty data
  1279. .Pq as a percentage of Sy zfs_dirty_data_max .
  1280. This should be less than
  1281. .Sy zfs_vdev_async_write_active_min_dirty_percent .
  1282. .
  1283. .It Sy zfs_wrlog_data_max Ns = Pq int
  1284. The upper limit of write-transaction zil log data size in bytes.
  1285. Write operations are throttled when approaching the limit until log data is
  1286. cleared out after transaction group sync.
1287. Because of some overhead, it should be set to at least 2 times the size of
  1288. .Sy zfs_dirty_data_max
  1289. .No to prevent harming normal write throughput .
  1290. It also should be smaller than the size of the slog device if slog is present.
  1291. .Pp
  1292. Defaults to
1293. .Sy zfs_dirty_data_max*2 .
  1294. .
  1295. .It Sy zfs_fallocate_reserve_percent Ns = Ns Sy 110 Ns % Pq uint
  1296. Since ZFS is a copy-on-write filesystem with snapshots, blocks cannot be
  1297. preallocated for a file in order to guarantee that later writes will not
  1298. run out of space.
  1299. Instead,
  1300. .Xr fallocate 2
  1301. space preallocation only checks that sufficient space is currently available
  1302. in the pool or the user's project quota allocation,
  1303. and then creates a sparse file of the requested size.
  1304. The requested space is multiplied by
  1305. .Sy zfs_fallocate_reserve_percent
  1306. to allow additional space for indirect blocks and other internal metadata.
  1307. Setting this to
  1308. .Sy 0
  1309. disables support for
  1310. .Xr fallocate 2
  1311. and causes it to return
  1312. .Sy EOPNOTSUPP .
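.Pp
For example, with the default of 110%, a hypothetical
.Xr fallocate 2
request for 1 GiB only succeeds if roughly this much space is currently
available:
.Bd -literal -compact
1 GiB * 110% = 1.1 GiB
.Ed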
  1313. .
  1314. .It Sy zfs_fletcher_4_impl Ns = Ns Sy fastest Pq string
  1315. Select a fletcher 4 implementation.
  1316. .Pp
  1317. Supported selectors are:
  1318. .Sy fastest , scalar , sse2 , ssse3 , avx2 , avx512f , avx512bw ,
  1319. .No and Sy aarch64_neon .
  1320. All except
  1321. .Sy fastest No and Sy scalar
  1322. require instruction set extensions to be available,
  1323. and will only appear if ZFS detects that they are present at runtime.
  1324. If multiple implementations of fletcher 4 are available, the
  1325. .Sy fastest
  1326. will be chosen using a micro benchmark.
  1327. Selecting
  1328. .Sy scalar
  1329. results in the original CPU-based calculation being used.
  1330. Selecting any option other than
  1331. .Sy fastest No or Sy scalar
  1332. results in vector instructions
  1333. from the respective CPU instruction set being used.
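.Pp
For example, the current selection can be inspected and changed at runtime
through the module parameter interface; the selector shown is only an
illustration and must be supported on the running system:
.Bd -literal -compact
# cat /sys/module/zfs/parameters/zfs_fletcher_4_impl
# echo avx2 > /sys/module/zfs/parameters/zfs_fletcher_4_impl
.Ed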
  1334. .
  1335. .It Sy zfs_bclone_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1336. Enables access to the block cloning feature.
  1337. If this setting is 0, then even if feature@block_cloning is enabled,
  1338. using functions and system calls that attempt to clone blocks will act as
  1339. though the feature is disabled.
  1340. .
  1341. .It Sy zfs_bclone_wait_dirty Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1342. When set to 1 the FICLONE and FICLONERANGE ioctls wait for dirty data to be
  1343. written to disk.
  1344. This allows the clone operation to reliably succeed when a file is
  1345. modified and then immediately cloned.
  1346. For small files this may be slower than making a copy of the file.
  1347. Therefore, this setting defaults to 0 which causes a clone operation to
  1348. immediately fail when encountering a dirty block.
  1349. .
  1350. .It Sy zfs_blake3_impl Ns = Ns Sy fastest Pq string
  1351. Select a BLAKE3 implementation.
  1352. .Pp
  1353. Supported selectors are:
  1354. .Sy cycle , fastest , generic , sse2 , sse41 , avx2 , avx512 .
  1355. All except
  1356. .Sy cycle , fastest No and Sy generic
  1357. require instruction set extensions to be available,
  1358. and will only appear if ZFS detects that they are present at runtime.
  1359. If multiple implementations of BLAKE3 are available, the
1360. .Sy fastest No will be chosen using a micro benchmark.
1361. You can see the benchmark results by reading this kstat file:
  1362. .Pa /proc/spl/kstat/zfs/chksum_bench .
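.Pp
For example, assuming the selector is exposed through the module parameter
interface like the other implementation selectors (the value shown is only
an illustration and must be supported on the running system):
.Bd -literal -compact
# cat /proc/spl/kstat/zfs/chksum_bench
# echo sse41 > /sys/module/zfs/parameters/zfs_blake3_impl
.Ed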
  1363. .
  1364. .It Sy zfs_free_bpobj_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1365. Enable/disable the processing of the free_bpobj object.
  1366. .
  1367. .It Sy zfs_async_block_max_blocks Ns = Ns Sy UINT64_MAX Po unlimited Pc Pq u64
  1368. Maximum number of blocks freed in a single TXG.
  1369. .
  1370. .It Sy zfs_max_async_dedup_frees Ns = Ns Sy 100000 Po 10^5 Pc Pq u64
  1371. Maximum number of dedup blocks freed in a single TXG.
  1372. .
  1373. .It Sy zfs_vdev_async_read_max_active Ns = Ns Sy 3 Pq uint
  1374. Maximum asynchronous read I/O operations active to each device.
  1375. .No See Sx ZFS I/O SCHEDULER .
  1376. .
  1377. .It Sy zfs_vdev_async_read_min_active Ns = Ns Sy 1 Pq uint
  1378. Minimum asynchronous read I/O operation active to each device.
  1379. .No See Sx ZFS I/O SCHEDULER .
  1380. .
  1381. .It Sy zfs_vdev_async_write_active_max_dirty_percent Ns = Ns Sy 60 Ns % Pq uint
  1382. When the pool has more than this much dirty data, use
  1383. .Sy zfs_vdev_async_write_max_active
  1384. to limit active async writes.
  1385. If the dirty data is between the minimum and maximum,
  1386. the active I/O limit is linearly interpolated.
  1387. .No See Sx ZFS I/O SCHEDULER .
  1388. .
  1389. .It Sy zfs_vdev_async_write_active_min_dirty_percent Ns = Ns Sy 30 Ns % Pq uint
  1390. When the pool has less than this much dirty data, use
  1391. .Sy zfs_vdev_async_write_min_active
  1392. to limit active async writes.
  1393. If the dirty data is between the minimum and maximum,
  1394. the active I/O limit is linearly
  1395. interpolated.
  1396. .No See Sx ZFS I/O SCHEDULER .
  1397. .
  1398. .It Sy zfs_vdev_async_write_max_active Ns = Ns Sy 10 Pq uint
  1399. Maximum asynchronous write I/O operations active to each device.
  1400. .No See Sx ZFS I/O SCHEDULER .
  1401. .
  1402. .It Sy zfs_vdev_async_write_min_active Ns = Ns Sy 2 Pq uint
  1403. Minimum asynchronous write I/O operations active to each device.
  1404. .No See Sx ZFS I/O SCHEDULER .
  1405. .Pp
  1406. Lower values are associated with better latency on rotational media but poorer
  1407. resilver performance.
  1408. The default value of
  1409. .Sy 2
  1410. was chosen as a compromise.
  1411. A value of
  1412. .Sy 3
  1413. has been shown to improve resilver performance further at a cost of
  1414. further increasing latency.
  1415. .
  1416. .It Sy zfs_vdev_initializing_max_active Ns = Ns Sy 1 Pq uint
  1417. Maximum initializing I/O operations active to each device.
  1418. .No See Sx ZFS I/O SCHEDULER .
  1419. .
  1420. .It Sy zfs_vdev_initializing_min_active Ns = Ns Sy 1 Pq uint
  1421. Minimum initializing I/O operations active to each device.
  1422. .No See Sx ZFS I/O SCHEDULER .
  1423. .
  1424. .It Sy zfs_vdev_max_active Ns = Ns Sy 1000 Pq uint
  1425. The maximum number of I/O operations active to each device.
  1426. Ideally, this will be at least the sum of each queue's
  1427. .Sy max_active .
  1428. .No See Sx ZFS I/O SCHEDULER .
  1429. .
  1430. .It Sy zfs_vdev_open_timeout_ms Ns = Ns Sy 1000 Pq uint
  1431. Timeout value to wait before determining a device is missing
  1432. during import.
  1433. This is helpful for transient missing paths due
  1434. to links being briefly removed and recreated in response to
  1435. udev events.
  1436. .
  1437. .It Sy zfs_vdev_rebuild_max_active Ns = Ns Sy 3 Pq uint
  1438. Maximum sequential resilver I/O operations active to each device.
  1439. .No See Sx ZFS I/O SCHEDULER .
  1440. .
  1441. .It Sy zfs_vdev_rebuild_min_active Ns = Ns Sy 1 Pq uint
  1442. Minimum sequential resilver I/O operations active to each device.
  1443. .No See Sx ZFS I/O SCHEDULER .
  1444. .
  1445. .It Sy zfs_vdev_removal_max_active Ns = Ns Sy 2 Pq uint
  1446. Maximum removal I/O operations active to each device.
  1447. .No See Sx ZFS I/O SCHEDULER .
  1448. .
  1449. .It Sy zfs_vdev_removal_min_active Ns = Ns Sy 1 Pq uint
  1450. Minimum removal I/O operations active to each device.
  1451. .No See Sx ZFS I/O SCHEDULER .
  1452. .
  1453. .It Sy zfs_vdev_scrub_max_active Ns = Ns Sy 2 Pq uint
  1454. Maximum scrub I/O operations active to each device.
  1455. .No See Sx ZFS I/O SCHEDULER .
  1456. .
  1457. .It Sy zfs_vdev_scrub_min_active Ns = Ns Sy 1 Pq uint
  1458. Minimum scrub I/O operations active to each device.
  1459. .No See Sx ZFS I/O SCHEDULER .
  1460. .
  1461. .It Sy zfs_vdev_sync_read_max_active Ns = Ns Sy 10 Pq uint
  1462. Maximum synchronous read I/O operations active to each device.
  1463. .No See Sx ZFS I/O SCHEDULER .
  1464. .
  1465. .It Sy zfs_vdev_sync_read_min_active Ns = Ns Sy 10 Pq uint
  1466. Minimum synchronous read I/O operations active to each device.
  1467. .No See Sx ZFS I/O SCHEDULER .
  1468. .
  1469. .It Sy zfs_vdev_sync_write_max_active Ns = Ns Sy 10 Pq uint
  1470. Maximum synchronous write I/O operations active to each device.
  1471. .No See Sx ZFS I/O SCHEDULER .
  1472. .
  1473. .It Sy zfs_vdev_sync_write_min_active Ns = Ns Sy 10 Pq uint
  1474. Minimum synchronous write I/O operations active to each device.
  1475. .No See Sx ZFS I/O SCHEDULER .
  1476. .
  1477. .It Sy zfs_vdev_trim_max_active Ns = Ns Sy 2 Pq uint
  1478. Maximum trim/discard I/O operations active to each device.
  1479. .No See Sx ZFS I/O SCHEDULER .
  1480. .
  1481. .It Sy zfs_vdev_trim_min_active Ns = Ns Sy 1 Pq uint
  1482. Minimum trim/discard I/O operations active to each device.
  1483. .No See Sx ZFS I/O SCHEDULER .
  1484. .
  1485. .It Sy zfs_vdev_nia_delay Ns = Ns Sy 5 Pq uint
  1486. For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
  1487. the number of concurrently-active I/O operations is limited to
  1488. .Sy zfs_*_min_active ,
  1489. unless the vdev is "idle".
  1490. When there are no interactive I/O operations active (synchronous or otherwise),
  1491. and
  1492. .Sy zfs_vdev_nia_delay
  1493. operations have completed since the last interactive operation,
  1494. then the vdev is considered to be "idle",
  1495. and the number of concurrently-active non-interactive operations is increased to
  1496. .Sy zfs_*_max_active .
  1497. .No See Sx ZFS I/O SCHEDULER .
  1498. .
  1499. .It Sy zfs_vdev_nia_credit Ns = Ns Sy 5 Pq uint
1500. Some HDDs tend to prioritize sequential I/O so strongly that concurrent
  1501. random I/O latency reaches several seconds.
  1502. On some HDDs this happens even if sequential I/O operations
  1503. are submitted one at a time, and so setting
  1504. .Sy zfs_*_max_active Ns = Sy 1
  1505. does not help.
  1506. To prevent non-interactive I/O, like scrub,
  1507. from monopolizing the device, no more than
1508. .Sy zfs_vdev_nia_credit No operations can be sent
  1509. while there are outstanding incomplete interactive operations.
  1510. This enforced wait ensures the HDD services the interactive I/O
  1511. within a reasonable amount of time.
  1512. .No See Sx ZFS I/O SCHEDULER .
  1513. .
  1514. .It Sy zfs_vdev_queue_depth_pct Ns = Ns Sy 1000 Ns % Pq uint
  1515. Maximum number of queued allocations per top-level vdev expressed as
  1516. a percentage of
  1517. .Sy zfs_vdev_async_write_max_active ,
  1518. which allows the system to detect devices that are more capable
  1519. of handling allocations and to allocate more blocks to those devices.
  1520. This allows for dynamic allocation distribution when devices are imbalanced,
  1521. as fuller devices will tend to be slower than empty devices.
  1522. .Pp
  1523. Also see
  1524. .Sy zio_dva_throttle_enabled .
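.Pp
With the default values this works out to roughly:
.Bd -literal -compact
1000% * 10 (zfs_vdev_async_write_max_active)
    = 100 queued allocations per top-level vdev
.Ed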
  1525. .
  1526. .It Sy zfs_vdev_def_queue_depth Ns = Ns Sy 32 Pq uint
  1527. Default queue depth for each vdev IO allocator.
  1528. Higher values allow for better coalescing of sequential writes before sending
  1529. them to the disk, but can increase transaction commit times.
  1530. .
  1531. .It Sy zfs_vdev_failfast_mask Ns = Ns Sy 1 Pq uint
  1532. Defines if the driver should retire on a given error type.
  1533. The following options may be bitwise-ored together:
  1534. .TS
  1535. box;
  1536. lbz r l l .
  1537. Value Name Description
  1538. _
  1539. 1 Device No driver retries on device errors
  1540. 2 Transport No driver retries on transport errors.
  1541. 4 Driver No driver retries on driver errors.
  1542. .TE
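.Pp
For example, to disable driver retries for both device and transport errors,
the corresponding values are bitwise-ored together:
.Bd -literal -compact
1 | 2 = 3
.Ed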
  1543. .
  1544. .It Sy zfs_vdev_disk_max_segs Ns = Ns Sy 0 Pq uint
  1545. Maximum number of segments to add to a BIO (min 4).
  1546. If this is higher than the maximum allowed by the device queue or the kernel
  1547. itself, it will be clamped.
  1548. Setting it to zero will cause the kernel's ideal size to be used.
  1549. This parameter only applies on Linux.
  1550. This parameter is ignored if
  1551. .Sy zfs_vdev_disk_classic Ns = Ns Sy 1 .
  1552. .
  1553. .It Sy zfs_vdev_disk_classic Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  1554. If set to 1, OpenZFS will submit IO to Linux using the method it used in 2.2
  1555. and earlier.
  1556. This "classic" method has known issues with highly fragmented IO requests and
  1557. is slower on many workloads, but it has been in use for many years and is known
  1558. to be very stable.
1559. If you set this parameter, please also open a bug report explaining why you did so,
  1560. including the workload involved and any error messages.
  1561. .Pp
  1562. This parameter and the classic submission method will be removed once we have
  1563. total confidence in the new method.
  1564. .Pp
  1565. This parameter only applies on Linux, and can only be set at module load time.
  1566. .
  1567. .It Sy zfs_expire_snapshot Ns = Ns Sy 300 Ns s Pq int
  1568. Time before expiring
  1569. .Pa .zfs/snapshot .
  1570. .
  1571. .It Sy zfs_admin_snapshot Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1572. Allow the creation, removal, or renaming of entries in the
  1573. .Sy .zfs/snapshot
  1574. directory to cause the creation, destruction, or renaming of snapshots.
  1575. When enabled, this functionality works both locally and over NFS exports
  1576. which have the
  1577. .Em no_root_squash
  1578. option set.
  1579. .
  1580. .It Sy zfs_snapshot_no_setuid Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1581. Whether to disable
  1582. .Em setuid/setgid
  1583. support for snapshot mounts triggered by access to the
  1584. .Sy .zfs/snapshot
  1585. directory by setting the
  1586. .Em nosuid
  1587. mount option.
  1588. .
  1589. .It Sy zfs_flags Ns = Ns Sy 0 Pq int
  1590. Set additional debugging flags.
  1591. The following flags may be bitwise-ored together:
  1592. .TS
  1593. box;
  1594. lbz r l l .
  1595. Value Name Description
  1596. _
  1597. 1 ZFS_DEBUG_DPRINTF Enable dprintf entries in the debug log.
  1598. * 2 ZFS_DEBUG_DBUF_VERIFY Enable extra dbuf verifications.
  1599. * 4 ZFS_DEBUG_DNODE_VERIFY Enable extra dnode verifications.
  1600. 8 ZFS_DEBUG_SNAPNAMES Enable snapshot name verification.
  1601. * 16 ZFS_DEBUG_MODIFY Check for illegally modified ARC buffers.
  1602. 64 ZFS_DEBUG_ZIO_FREE Enable verification of block frees.
  1603. 128 ZFS_DEBUG_HISTOGRAM_VERIFY Enable extra spacemap histogram verifications.
  1604. 256 ZFS_DEBUG_METASLAB_VERIFY Verify space accounting on disk matches in-memory \fBrange_trees\fP.
  1605. 512 ZFS_DEBUG_SET_ERROR Enable \fBSET_ERROR\fP and dprintf entries in the debug log.
  1606. 1024 ZFS_DEBUG_INDIRECT_REMAP Verify split blocks created by device removal.
  1607. 2048 ZFS_DEBUG_TRIM Verify TRIM ranges are always within the allocatable range tree.
  1608. 4096 ZFS_DEBUG_LOG_SPACEMAP Verify that the log summary is consistent with the spacemap log
  1609. and enable \fBzfs_dbgmsgs\fP for metaslab loading and flushing.
  1610. .TE
  1611. .Sy \& * No Requires debug build .
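.Pp
For example, to enable both
.Sy ZFS_DEBUG_DPRINTF
and
.Sy ZFS_DEBUG_SET_ERROR ,
the flag values are bitwise-ored together and written to the module parameter
(values shown are only an example):
.Bd -literal -compact
1 | 512 = 513
# echo 513 > /sys/module/zfs/parameters/zfs_flags
.Ed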
  1612. .
  1613. .It Sy zfs_btree_verify_intensity Ns = Ns Sy 0 Pq uint
  1614. Enables btree verification.
1615. The following settings are cumulative:
  1616. .TS
  1617. box;
  1618. lbz r l l .
  1619. Value Description
  1620. 1 Verify height.
  1621. 2 Verify pointers from children to parent.
  1622. 3 Verify element counts.
  1623. 4 Verify element order. (expensive)
  1624. * 5 Verify unused memory is poisoned. (expensive)
  1625. .TE
  1626. .Sy \& * No Requires debug build .
  1627. .
  1628. .It Sy zfs_free_leak_on_eio Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1629. If destroy encounters an
  1630. .Sy EIO
  1631. while reading metadata (e.g. indirect blocks),
  1632. space referenced by the missing metadata can not be freed.
  1633. Normally this causes the background destroy to become "stalled",
  1634. as it is unable to make forward progress.
  1635. While in this stalled state, all remaining space to free
  1636. from the error-encountering filesystem is "temporarily leaked".
  1637. Set this flag to cause it to ignore the
  1638. .Sy EIO ,
  1639. permanently leak the space from indirect blocks that can not be read,
  1640. and continue to free everything else that it can.
  1641. .Pp
  1642. The default "stalling" behavior is useful if the storage partially
  1643. fails (i.e. some but not all I/O operations fail), and then later recovers.
  1644. In this case, we will be able to continue pool operations while it is
  1645. partially failed, and when it recovers, we can continue to free the
  1646. space, with no leaks.
  1647. Note, however, that this case is actually fairly rare.
  1648. .Pp
  1649. Typically pools either
  1650. .Bl -enum -compact -offset 4n -width "1."
  1651. .It
  1652. fail completely (but perhaps temporarily,
  1653. e.g. due to a top-level vdev going offline), or
  1654. .It
  1655. have localized, permanent errors (e.g. disk returns the wrong data
  1656. due to bit flip or firmware bug).
  1657. .El
  1658. In the former case, this setting does not matter because the
  1659. pool will be suspended and the sync thread will not be able to make
  1660. forward progress regardless.
  1661. In the latter, because the error is permanent, the best we can do
  1662. is leak the minimum amount of space,
  1663. which is what setting this flag will do.
  1664. It is therefore reasonable for this flag to normally be set,
  1665. but we chose the more conservative approach of not setting it,
  1666. so that there is no possibility of
  1667. leaking space in the "partial temporary" failure case.
  1668. .
  1669. .It Sy zfs_free_min_time_ms Ns = Ns Sy 1000 Ns ms Po 1s Pc Pq uint
  1670. During a
  1671. .Nm zfs Cm destroy
  1672. operation using the
  1673. .Sy async_destroy
  1674. feature,
  1675. a minimum of this much time will be spent working on freeing blocks per TXG.
  1676. .
  1677. .It Sy zfs_obsolete_min_time_ms Ns = Ns Sy 500 Ns ms Pq uint
  1678. Similar to
  1679. .Sy zfs_free_min_time_ms ,
  1680. but for cleanup of old indirection records for removed vdevs.
  1681. .
  1682. .It Sy zfs_immediate_write_sz Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq s64
  1683. Largest data block to write to the ZIL.
  1684. Larger blocks will be treated as if the dataset being written to had the
  1685. .Sy logbias Ns = Ns Sy throughput
  1686. property set.
  1687. .
  1688. .It Sy zfs_initialize_value Ns = Ns Sy 16045690984833335022 Po 0xDEADBEEFDEADBEEE Pc Pq u64
  1689. Pattern written to vdev free space by
  1690. .Xr zpool-initialize 8 .
  1691. .
  1692. .It Sy zfs_initialize_chunk_size Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
  1693. Size of writes used by
  1694. .Xr zpool-initialize 8 .
  1695. This option is used by the test suite.
  1696. .
  1697. .It Sy zfs_livelist_max_entries Ns = Ns Sy 500000 Po 5*10^5 Pc Pq u64
  1698. The threshold size (in block pointers) at which we create a new sub-livelist.
  1699. Larger sublists are more costly from a memory perspective but the fewer
  1700. sublists there are, the lower the cost of insertion.
  1701. .
  1702. .It Sy zfs_livelist_min_percent_shared Ns = Ns Sy 75 Ns % Pq int
  1703. If the amount of shared space between a snapshot and its clone drops below
  1704. this threshold, the clone turns off the livelist and reverts to the old
  1705. deletion method.
1706. This is in place because livelists no longer give us a benefit
  1707. once a clone has been overwritten enough.
  1708. .
  1709. .It Sy zfs_livelist_condense_new_alloc Ns = Ns Sy 0 Pq int
  1710. Incremented each time an extra ALLOC blkptr is added to a livelist entry while
  1711. it is being condensed.
  1712. This option is used by the test suite to track race conditions.
  1713. .
  1714. .It Sy zfs_livelist_condense_sync_cancel Ns = Ns Sy 0 Pq int
  1715. Incremented each time livelist condensing is canceled while in
  1716. .Fn spa_livelist_condense_sync .
  1717. This option is used by the test suite to track race conditions.
  1718. .
  1719. .It Sy zfs_livelist_condense_sync_pause Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1720. When set, the livelist condense process pauses indefinitely before
  1721. executing the synctask \(em
  1722. .Fn spa_livelist_condense_sync .
  1723. This option is used by the test suite to trigger race conditions.
  1724. .
  1725. .It Sy zfs_livelist_condense_zthr_cancel Ns = Ns Sy 0 Pq int
  1726. Incremented each time livelist condensing is canceled while in
  1727. .Fn spa_livelist_condense_cb .
  1728. This option is used by the test suite to track race conditions.
  1729. .
  1730. .It Sy zfs_livelist_condense_zthr_pause Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1731. When set, the livelist condense process pauses indefinitely before
  1732. executing the open context condensing work in
  1733. .Fn spa_livelist_condense_cb .
  1734. This option is used by the test suite to trigger race conditions.
  1735. .
  1736. .It Sy zfs_lua_max_instrlimit Ns = Ns Sy 100000000 Po 10^8 Pc Pq u64
  1737. The maximum execution time limit that can be set for a ZFS channel program,
  1738. specified as a number of Lua instructions.
  1739. .
  1740. .It Sy zfs_lua_max_memlimit Ns = Ns Sy 104857600 Po 100 MiB Pc Pq u64
  1741. The maximum memory limit that can be set for a ZFS channel program, specified
  1742. in bytes.
  1743. .
  1744. .It Sy zfs_max_dataset_nesting Ns = Ns Sy 50 Pq int
  1745. The maximum depth of nested datasets.
  1746. This value can be tuned temporarily to
  1747. fix existing datasets that exceed the predefined limit.
  1748. .
  1749. .It Sy zfs_max_log_walking Ns = Ns Sy 5 Pq u64
  1750. The number of past TXGs that the flushing algorithm of the log spacemap
  1751. feature uses to estimate incoming log blocks.
  1752. .
  1753. .It Sy zfs_max_logsm_summary_length Ns = Ns Sy 10 Pq u64
  1754. Maximum number of rows allowed in the summary of the spacemap log.
  1755. .
  1756. .It Sy zfs_max_recordsize Ns = Ns Sy 16777216 Po 16 MiB Pc Pq uint
  1757. We currently support block sizes from
  1758. .Em 512 Po 512 B Pc No to Em 16777216 Po 16 MiB Pc .
  1759. The benefits of larger blocks, and thus larger I/O,
  1760. need to be weighed against the cost of COWing a giant block to modify one byte.
  1761. Additionally, very large blocks can have an impact on I/O latency,
  1762. and also potentially on the memory allocator.
  1763. Therefore, we formerly forbade creating blocks larger than 1M.
1764. Larger blocks could be created by changing this tunable,
  1765. and pools with larger blocks can always be imported and used,
  1766. regardless of this setting.
  1767. .Pp
  1768. Note that it is still limited by default to
  1769. .Ar 1 MiB
  1770. on x86_32, because Linux's
  1771. 3/1 memory split doesn't leave much room for 16M chunks.
  1772. .
  1773. .It Sy zfs_allow_redacted_dataset_mount Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1774. Allow datasets received with redacted send/receive to be mounted.
  1775. Normally disabled because these datasets may be missing key data.
  1776. .
  1777. .It Sy zfs_min_metaslabs_to_flush Ns = Ns Sy 1 Pq u64
  1778. Minimum number of metaslabs to flush per dirty TXG.
  1779. .
  1780. .It Sy zfs_metaslab_fragmentation_threshold Ns = Ns Sy 70 Ns % Pq uint
  1781. Allow metaslabs to keep their active state as long as their fragmentation
  1782. percentage is no more than this value.
  1783. An active metaslab that exceeds this threshold
  1784. will no longer keep its active status allowing better metaslabs to be selected.
  1785. .
  1786. .It Sy zfs_mg_fragmentation_threshold Ns = Ns Sy 95 Ns % Pq uint
  1787. Metaslab groups are considered eligible for allocations if their
  1788. fragmentation metric (measured as a percentage) is less than or equal to
  1789. this value.
  1790. If a metaslab group exceeds this threshold then it will be
  1791. skipped unless all metaslab groups within the metaslab class have also
  1792. crossed this threshold.
  1793. .
  1794. .It Sy zfs_mg_noalloc_threshold Ns = Ns Sy 0 Ns % Pq uint
  1795. Defines a threshold at which metaslab groups should be eligible for allocations.
  1796. The value is expressed as a percentage of free space
  1797. beyond which a metaslab group is always eligible for allocations.
  1798. If a metaslab group's free space is less than or equal to the
  1799. threshold, the allocator will avoid allocating to that group
  1800. unless all groups in the pool have reached the threshold.
  1801. Once all groups have reached the threshold, all groups are allowed to accept
  1802. allocations.
  1803. The default value of
  1804. .Sy 0
  1805. disables the feature and causes all metaslab groups to be eligible for
  1806. allocations.
  1807. .Pp
  1808. This parameter allows one to deal with pools having heavily imbalanced
  1809. vdevs such as would be the case when a new vdev has been added.
  1810. Setting the threshold to a non-zero percentage will stop allocations
  1811. from being made to vdevs that aren't filled to the specified percentage
  1812. and allow lesser filled vdevs to acquire more allocations than they
  1813. otherwise would under the old
  1814. .Sy zfs_mg_alloc_failures
  1815. facility.
  1816. .
  1817. .It Sy zfs_ddt_data_is_special Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1818. If enabled, ZFS will place DDT data into the special allocation class.
  1819. .
  1820. .It Sy zfs_user_indirect_is_special Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1821. If enabled, ZFS will place user data indirect blocks
  1822. into the special allocation class.
  1823. .
  1824. .It Sy zfs_multihost_history Ns = Ns Sy 0 Pq uint
  1825. Historical statistics for this many latest multihost updates will be available
  1826. in
  1827. .Pa /proc/spl/kstat/zfs/ Ns Ao Ar pool Ac Ns Pa /multihost .
  1828. .
  1829. .It Sy zfs_multihost_interval Ns = Ns Sy 1000 Ns ms Po 1 s Pc Pq u64
  1830. Used to control the frequency of multihost writes which are performed when the
  1831. .Sy multihost
  1832. pool property is on.
  1833. This is one of the factors used to determine the
  1834. length of the activity check during import.
  1835. .Pp
  1836. The multihost write period is
  1837. .Sy zfs_multihost_interval No / Sy leaf-vdevs .
  1838. On average a multihost write will be issued for each leaf vdev
  1839. every
  1840. .Sy zfs_multihost_interval
  1841. milliseconds.
  1842. In practice, the observed period can vary with the I/O load
  1843. and this observed value is the delay which is stored in the uberblock.
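.Pp
As a worked example, with the default interval and a hypothetical pool of
8 leaf vdevs:
.Bd -literal -compact
write period = 1000 ms / 8 leaf vdevs = 125 ms
.Ed
.Pp
That is, some leaf vdev receives a multihost write roughly every 125 ms,
while each individual leaf vdev is written roughly once per second.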
  1844. .
  1845. .It Sy zfs_multihost_import_intervals Ns = Ns Sy 20 Pq uint
  1846. Used to control the duration of the activity test on import.
  1847. Smaller values of
  1848. .Sy zfs_multihost_import_intervals
  1849. will reduce the import time but increase
  1850. the risk of failing to detect an active pool.
  1851. The total activity check time is never allowed to drop below one second.
  1852. .Pp
  1853. On import the activity check waits a minimum amount of time determined by
  1854. .Sy zfs_multihost_interval No \(mu Sy zfs_multihost_import_intervals ,
  1855. or the same product computed on the host which last had the pool imported,
  1856. whichever is greater.
  1857. The activity check time may be further extended if the value of MMP
  1858. delay found in the best uberblock indicates actual multihost updates happened
  1859. at longer intervals than
  1860. .Sy zfs_multihost_interval .
  1861. A minimum of
  1862. .Em 100 ms
  1863. is enforced.
  1864. .Pp
  1865. .Sy 0 No is equivalent to Sy 1 .
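.Pp
For example, with both tunables at their defaults, the activity check on
import waits at least:
.Bd -literal -compact
1000 ms * 20 = 20000 ms = 20 s
.Ed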
  1866. .
  1867. .It Sy zfs_multihost_fail_intervals Ns = Ns Sy 10 Pq uint
  1868. Controls the behavior of the pool when multihost write failures or delays are
  1869. detected.
  1870. .Pp
  1871. When
  1872. .Sy 0 ,
  1873. multihost write failures or delays are ignored.
1874. The failures will still be reported to the ZED, which, depending on
1875. its configuration, may take action such as suspending the pool or offlining a
  1876. device.
  1877. .Pp
  1878. Otherwise, the pool will be suspended if
  1879. .Sy zfs_multihost_fail_intervals No \(mu Sy zfs_multihost_interval
  1880. milliseconds pass without a successful MMP write.
  1881. This guarantees the activity test will see MMP writes if the pool is imported.
  1882. .Sy 1 No is equivalent to Sy 2 ;
  1883. this is necessary to prevent the pool from being suspended
  1884. due to normal, small I/O latency variations.
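.Pp
For example, with the defaults the pool is suspended if no MMP write
succeeds for:
.Bd -literal -compact
10 * 1000 ms = 10000 ms = 10 s
.Ed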
  1885. .
  1886. .It Sy zfs_no_scrub_io Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1887. Set to disable scrub I/O.
  1888. This results in scrubs not actually scrubbing data and
  1889. simply doing a metadata crawl of the pool instead.
  1890. .
  1891. .It Sy zfs_no_scrub_prefetch Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1892. Set to disable block prefetching for scrubs.
  1893. .
  1894. .It Sy zfs_nocacheflush Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1895. Disable cache flush operations on disks when writing.
  1896. Setting this will cause pool corruption on power loss
  1897. if a volatile out-of-order write cache is enabled.
  1898. .
  1899. .It Sy zfs_nopwrite_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1900. Allow no-operation writes.
  1901. The occurrence of nopwrites will further depend on other pool properties
  1902. .Pq i.a. the checksumming and compression algorithms .
  1903. .
  1904. .It Sy zfs_dmu_offset_next_sync Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1905. Enable forcing TXG sync to find holes.
1906. When enabled, this forces ZFS to sync data when the
1907. .Sy SEEK_HOLE No or Sy SEEK_DATA
1908. flags are used, allowing holes in a file to be accurately reported.
1909. When disabled, holes will not be reported in recently dirtied files.
  1910. .
  1911. .It Sy zfs_pd_bytes_max Ns = Ns Sy 52428800 Ns B Po 50 MiB Pc Pq int
  1912. The number of bytes which should be prefetched during a pool traversal, like
  1913. .Nm zfs Cm send
  1914. or other data crawling operations.
  1915. .
  1916. .It Sy zfs_traverse_indirect_prefetch_limit Ns = Ns Sy 32 Pq uint
1917. The number of blocks pointed to by an indirect (non-L0) block which should be
  1918. prefetched during a pool traversal, like
  1919. .Nm zfs Cm send
  1920. or other data crawling operations.
  1921. .
  1922. .It Sy zfs_per_txg_dirty_frees_percent Ns = Ns Sy 30 Ns % Pq u64
  1923. Control percentage of dirtied indirect blocks from frees allowed into one TXG.
  1924. After this threshold is crossed, additional frees will wait until the next TXG.
  1925. .Sy 0 No disables this throttle .
  1926. .
  1927. .It Sy zfs_prefetch_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1928. Disable predictive prefetch.
  1929. Note that it leaves "prescient" prefetch
  1930. .Pq for, e.g., Nm zfs Cm send
  1931. intact.
  1932. Unlike predictive prefetch, prescient prefetch never issues I/O
  1933. that ends up not being needed, so it can't hurt performance.
  1934. .
  1935. .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1936. Disable QAT hardware acceleration for SHA256 checksums.
  1937. May be unset after the ZFS modules have been loaded to initialize the QAT
  1938. hardware as long as support is compiled in and the QAT driver is present.
  1939. .
  1940. .It Sy zfs_qat_compress_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1941. Disable QAT hardware acceleration for gzip compression.
  1942. May be unset after the ZFS modules have been loaded to initialize the QAT
  1943. hardware as long as support is compiled in and the QAT driver is present.
  1944. .
  1945. .It Sy zfs_qat_encrypt_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1946. Disable QAT hardware acceleration for AES-GCM encryption.
  1947. May be unset after the ZFS modules have been loaded to initialize the QAT
  1948. hardware as long as support is compiled in and the QAT driver is present.
  1949. .
  1950. .It Sy zfs_vnops_read_chunk_size Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
  1951. Bytes to read per chunk.
  1952. .
  1953. .It Sy zfs_read_history Ns = Ns Sy 0 Pq uint
  1954. Historical statistics for this many latest reads will be available in
  1955. .Pa /proc/spl/kstat/zfs/ Ns Ao Ar pool Ac Ns Pa /reads .
  1956. .
  1957. .It Sy zfs_read_history_hits Ns = Ns Sy 0 Ns | Ns 1 Pq int
1958. Include cache hits in read history.
  1959. .
  1960. .It Sy zfs_rebuild_max_segment Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
  1961. Maximum read segment size to issue when sequentially resilvering a
  1962. top-level vdev.
  1963. .
  1964. .It Sy zfs_rebuild_scrub_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  1965. Automatically start a pool scrub when the last active sequential resilver
  1966. completes in order to verify the checksums of all blocks which have been
  1967. resilvered.
  1968. This is enabled by default and strongly recommended.
  1969. .
  1970. .It Sy zfs_rebuild_vdev_limit Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq u64
  1971. Maximum amount of I/O that can be concurrently issued for a sequential
  1972. resilver per leaf device, given in bytes.
  1973. .
  1974. .It Sy zfs_reconstruct_indirect_combinations_max Ns = Ns Sy 4096 Pq int
  1975. If an indirect split block contains more than this many possible unique
  1976. combinations when being reconstructed, consider it too computationally
  1977. expensive to check them all.
  1978. Instead, try at most this many randomly selected
  1979. combinations each time the block is accessed.
  1980. This allows all segment copies to participate fairly
  1981. in the reconstruction when all combinations
  1982. cannot be checked and prevents repeated use of one bad copy.
  1983. .
  1984. .It Sy zfs_recover Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1985. Set to attempt to recover from fatal errors.
  1986. This should only be used as a last resort,
  1987. as it typically results in leaked space, or worse.
  1988. .
  1989. .It Sy zfs_removal_ignore_errors Ns = Ns Sy 0 Ns | Ns 1 Pq int
  1990. Ignore hard I/O errors during device removal.
  1991. When set, if a device encounters a hard I/O error during the removal process
  1992. the removal will not be cancelled.
  1993. This can result in a normally recoverable block becoming permanently damaged
  1994. and is hence not recommended.
  1995. This should only be used as a last resort when the
  1996. pool cannot be returned to a healthy state prior to removing the device.
  1997. .
  1998. .It Sy zfs_removal_suspend_progress Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  1999. This is used by the test suite so that it can ensure that certain actions
  2000. happen while in the middle of a removal.
  2001. .
  2002. .It Sy zfs_remove_max_segment Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
  2003. The largest contiguous segment that we will attempt to allocate when removing
  2004. a device.
  2005. If there is a performance problem with attempting to allocate large blocks,
  2006. consider decreasing this.
  2007. The default value is also the maximum.
  2008. .
  2009. .It Sy zfs_resilver_disable_defer Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2010. Ignore the
  2011. .Sy resilver_defer
  2012. feature, causing an operation that would start a resilver to
  2013. immediately restart the one in progress.
  2014. .
  2015. .It Sy zfs_resilver_defer_percent Ns = Ns Sy 10 Ns % Pq uint
  2016. If the ongoing resilver progress is below this threshold, a new resilver will
  2017. restart from scratch instead of being deferred after the current one finishes,
  2018. even if the
  2019. .Sy resilver_defer
  2020. feature is enabled.
  2021. .
  2022. .It Sy zfs_resilver_min_time_ms Ns = Ns Sy 3000 Ns ms Po 3 s Pc Pq uint
  2023. Resilvers are processed by the sync thread.
  2024. While resilvering, it will spend at least this much time
  2025. working on a resilver between TXG flushes.
  2026. .
  2027. .It Sy zfs_scan_ignore_errors Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2028. If set, remove the DTL (dirty time list) upon completion of a pool scan (scrub),
  2029. even if there were unrepairable errors.
  2030. Intended to be used during pool repair or recovery to
  2031. stop resilvering when the pool is next imported.
  2032. .
  2033. .It Sy zfs_scrub_after_expand Ns = Ns Sy 1 Ns | Ns 0 Pq int
  2034. Automatically start a pool scrub after a RAIDZ expansion completes
  2035. in order to verify the checksums of all blocks which have been
  2036. copied during the expansion.
  2037. This is enabled by default and strongly recommended.
  2038. .
  2039. .It Sy zfs_scrub_min_time_ms Ns = Ns Sy 1000 Ns ms Po 1 s Pc Pq uint
  2040. Scrubs are processed by the sync thread.
  2041. While scrubbing, it will spend at least this much time
  2042. working on a scrub between TXG flushes.
  2043. .
  2044. .It Sy zfs_scrub_error_blocks_per_txg Ns = Ns Sy 4096 Pq uint
2045. Error blocks to be scrubbed in one TXG.
  2046. .
  2047. .It Sy zfs_scan_checkpoint_intval Ns = Ns Sy 7200 Ns s Po 2 hour Pc Pq uint
  2048. To preserve progress across reboots, the sequential scan algorithm periodically
  2049. needs to stop metadata scanning and issue all the verification I/O to disk.
  2050. The frequency of this flushing is determined by this tunable.
  2051. .
  2052. .It Sy zfs_scan_fill_weight Ns = Ns Sy 3 Pq uint
  2053. This tunable affects how scrub and resilver I/O segments are ordered.
  2054. A higher number indicates that we care more about how filled in a segment is,
  2055. while a lower number indicates we care more about the size of the extent without
  2056. considering the gaps within a segment.
  2057. This value is only tunable upon module insertion.
  2058. Changing the value afterwards will have no effect on scrub or resilver
  2059. performance.
  2060. .
  2061. .It Sy zfs_scan_issue_strategy Ns = Ns Sy 0 Pq uint
  2062. Determines the order that data will be verified while scrubbing or resilvering:
  2063. .Bl -tag -compact -offset 4n -width "a"
  2064. .It Sy 1
  2065. Data will be verified as sequentially as possible, given the
  2066. amount of memory reserved for scrubbing
  2067. .Pq see Sy zfs_scan_mem_lim_fact .
  2068. This may improve scrub performance if the pool's data is very fragmented.
  2069. .It Sy 2
  2070. The largest mostly-contiguous chunk of found data will be verified first.
  2071. By deferring scrubbing of small segments, we may later find adjacent data
  2072. to coalesce and increase the segment size.
  2073. .It Sy 0
  2074. .No Use strategy Sy 1 No during normal verification
  2075. .No and strategy Sy 2 No while taking a checkpoint .
  2076. .El
  2077. .
  2078. .It Sy zfs_scan_legacy Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2079. If unset, indicates that scrubs and resilvers will gather metadata in
  2080. memory before issuing sequential I/O.
  2081. Otherwise indicates that the legacy algorithm will be used,
  2082. where I/O is initiated as soon as it is discovered.
  2083. Unsetting will not affect scrubs or resilvers that are already in progress.
  2084. .
  2085. .It Sy zfs_scan_max_ext_gap Ns = Ns Sy 2097152 Ns B Po 2 MiB Pc Pq int
  2086. Sets the largest gap in bytes between scrub/resilver I/O operations
  2087. that will still be considered sequential for sorting purposes.
  2088. Changing this value will not
  2089. affect scrubs or resilvers that are already in progress.
  2090. .
  2091. .It Sy zfs_scan_mem_lim_fact Ns = Ns Sy 20 Ns ^-1 Pq uint
  2092. Maximum fraction of RAM used for I/O sorting by sequential scan algorithm.
  2093. This tunable determines the hard limit for I/O sorting memory usage.
  2094. When the hard limit is reached we stop scanning metadata and start issuing
  2095. data verification I/O.
  2096. This is done until we get below the soft limit.
  2097. .
  2098. .It Sy zfs_scan_mem_lim_soft_fact Ns = Ns Sy 20 Ns ^-1 Pq uint
2099. The fraction of the hard limit used to determine the soft limit for I/O sorting
  2100. by the sequential scan algorithm.
  2101. When we cross this limit from below no action is taken.
  2102. When we cross this limit from above it is because we are issuing verification
  2103. I/O.
  2104. In this case (unless the metadata scan is done) we stop issuing verification I/O
  2105. and start scanning metadata again until we get to the hard limit.
  2106. .
  2107. .It Sy zfs_scan_report_txgs Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2108. When reporting resilver throughput and estimated completion time use the
  2109. performance observed over roughly the last
  2110. .Sy zfs_scan_report_txgs
  2111. TXGs.
  2112. When set to zero performance is calculated over the time between checkpoints.
  2113. .
  2114. .It Sy zfs_scan_strict_mem_lim Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2115. Enforce tight memory limits on pool scans when a sequential scan is in progress.
  2116. When disabled, the memory limit may be exceeded by fast disks.
  2117. .
  2118. .It Sy zfs_scan_suspend_progress Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2119. Freezes a scrub/resilver in progress without actually pausing it.
  2120. Intended for testing/debugging.
  2121. .
  2122. .It Sy zfs_scan_vdev_limit Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq int
  2123. Maximum amount of data that can be concurrently issued at once for scrubs and
  2124. resilvers per leaf device, given in bytes.
  2125. .
  2126. .It Sy zfs_send_corrupt_data Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2127. Allow sending of corrupt data (ignore read/checksum errors when sending).
  2128. .
  2129. .It Sy zfs_send_unmodified_spill_blocks Ns = Ns Sy 1 Ns | Ns 0 Pq int
  2130. Include unmodified spill blocks in the send stream.
  2131. Under certain circumstances, previous versions of ZFS could incorrectly
  2132. remove the spill block from an existing object.
  2133. Including unmodified copies of the spill blocks creates a backwards-compatible
  2134. stream which will recreate a spill block if it was incorrectly removed.
  2135. .
  2136. .It Sy zfs_send_no_prefetch_queue_ff Ns = Ns Sy 20 Ns ^\-1 Pq uint
  2137. The fill fraction of the
  2138. .Nm zfs Cm send
  2139. internal queues.
  2140. The fill fraction controls the timing with which internal threads are woken up.
  2141. .
  2142. .It Sy zfs_send_no_prefetch_queue_length Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq uint
  2143. The maximum number of bytes allowed in
  2144. .Nm zfs Cm send Ns 's
  2145. internal queues.
  2146. .
  2147. .It Sy zfs_send_queue_ff Ns = Ns Sy 20 Ns ^\-1 Pq uint
  2148. The fill fraction of the
  2149. .Nm zfs Cm send
  2150. prefetch queue.
  2151. The fill fraction controls the timing with which internal threads are woken up.
  2152. .
  2153. .It Sy zfs_send_queue_length Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
  2154. The maximum number of bytes allowed that will be prefetched by
  2155. .Nm zfs Cm send .
  2156. This value must be at least twice the maximum block size in use.
  2157. .
  2158. .It Sy zfs_recv_queue_ff Ns = Ns Sy 20 Ns ^\-1 Pq uint
  2159. The fill fraction of the
  2160. .Nm zfs Cm receive
  2161. queue.
  2162. The fill fraction controls the timing with which internal threads are woken up.
  2163. .
  2164. .It Sy zfs_recv_queue_length Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
  2165. The maximum number of bytes allowed in the
  2166. .Nm zfs Cm receive
  2167. queue.
  2168. This value must be at least twice the maximum block size in use.
  2169. .
  2170. .It Sy zfs_recv_write_batch_size Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq uint
  2171. The maximum amount of data, in bytes, that
  2172. .Nm zfs Cm receive
  2173. will write in one DMU transaction.
  2174. This is the uncompressed size, even when receiving a compressed send stream.
  2175. This setting will not reduce the write size below a single block.
  2176. Capped at a maximum of
  2177. .Sy 32 MiB .
  2178. .
  2179. .It Sy zfs_recv_best_effort_corrective Ns = Ns Sy 0 Pq int
2180. When this variable is set to non-zero, a corrective receive:
  2181. .Bl -enum -compact -offset 4n -width "1."
  2182. .It
  2183. Does not enforce the restriction of source & destination snapshot GUIDs
  2184. matching.
  2185. .It
2186. If there is an error during healing, the healing receive is not
2187. terminated; instead it moves on to the next record.
  2188. .El
  2189. .
  2190. .It Sy zfs_override_estimate_recordsize Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2191. Setting this variable overrides the default logic for estimating block
  2192. sizes when doing a
  2193. .Nm zfs Cm send .
  2194. The default heuristic is that the average block size
  2195. will be the current recordsize.
  2196. Override this value if most data in your dataset is not of that size
  2197. and you require accurate zfs send size estimates.
  2198. .
  2199. .It Sy zfs_sync_pass_deferred_free Ns = Ns Sy 2 Pq uint
  2200. Flushing of data to disk is done in passes.
  2201. Defer frees starting in this pass.
  2202. .
  2203. .It Sy zfs_spa_discard_memory_limit Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq int
  2204. Maximum memory used for prefetching a checkpoint's space map on each
  2205. vdev while discarding the checkpoint.
  2206. .
  2207. .It Sy zfs_special_class_metadata_reserve_pct Ns = Ns Sy 25 Ns % Pq uint
  2208. Only allow small data blocks to be allocated on the special and dedup vdev
  2209. types when the available free space percentage on these vdevs exceeds this
  2210. value.
  2211. This ensures reserved space is available for pool metadata as the
  2212. special vdevs approach capacity.
  2213. .
  2214. .It Sy zfs_sync_pass_dont_compress Ns = Ns Sy 8 Pq uint
  2215. Starting in this sync pass, disable compression (including of metadata).
  2216. With the default setting, in practice, we don't have this many sync passes,
  2217. so this has no effect.
  2218. .Pp
  2219. The original intent was that disabling compression would help the sync passes
  2220. to converge.
  2221. However, in practice, disabling compression increases
2222. the average number of sync passes, because when we turn compression off,
2223. many blocks' sizes will change, and thus we have to re-allocate
  2224. (not overwrite) them.
  2225. It also increases the number of
  2226. .Em 128 KiB
  2227. allocations (e.g. for indirect blocks and spacemaps)
  2228. because these will not be compressed.
  2229. The
  2230. .Em 128 KiB
  2231. allocations are especially detrimental to performance
  2232. on highly fragmented systems, which may have very few free segments of this
  2233. size,
  2234. and may need to load new metaslabs to satisfy these allocations.
  2235. .
  2236. .It Sy zfs_sync_pass_rewrite Ns = Ns Sy 2 Pq uint
  2237. Rewrite new block pointers starting in this pass.
  2238. .
  2239. .It Sy zfs_trim_extent_bytes_max Ns = Ns Sy 134217728 Ns B Po 128 MiB Pc Pq uint
  2240. Maximum size of TRIM command.
  2241. Larger ranges will be split into chunks no larger than this value before
  2242. issuing.
  2243. .
  2244. .It Sy zfs_trim_extent_bytes_min Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq uint
  2245. Minimum size of TRIM commands.
  2246. TRIM ranges smaller than this will be skipped,
  2247. unless they're part of a larger range which was chunked.
  2248. This is done because it's common for these small TRIMs
  2249. to negatively impact overall performance.
  2250. .
  2251. .It Sy zfs_trim_metaslab_skip Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2252. Skip uninitialized metaslabs during the TRIM process.
  2253. This option is useful for pools constructed from large thinly-provisioned
  2254. devices
  2255. where TRIM operations are slow.
  2256. As a pool ages, an increasing fraction of the pool's metaslabs
  2257. will be initialized, progressively degrading the usefulness of this option.
  2258. This setting is stored when starting a manual TRIM and will
  2259. persist for the duration of the requested TRIM.
  2260. .
  2261. .It Sy zfs_trim_queue_limit Ns = Ns Sy 10 Pq uint
  2262. Maximum number of queued TRIMs outstanding per leaf vdev.
  2263. The number of concurrent TRIM commands issued to the device is controlled by
  2264. .Sy zfs_vdev_trim_min_active No and Sy zfs_vdev_trim_max_active .
  2265. .
  2266. .It Sy zfs_trim_txg_batch Ns = Ns Sy 32 Pq uint
  2267. The number of transaction groups' worth of frees which should be aggregated
  2268. before TRIM operations are issued to the device.
  2269. This setting represents a trade-off between issuing larger,
  2270. more efficient TRIM operations and the delay
  2271. before the recently trimmed space is available for use by the device.
  2272. .Pp
  2273. Increasing this value will allow frees to be aggregated for a longer time.
2274. This will result in larger TRIM operations and potentially increased memory
  2275. usage.
  2276. Decreasing this value will have the opposite effect.
  2277. The default of
  2278. .Sy 32
  2279. was determined to be a reasonable compromise.
  2280. .
  2281. .It Sy zfs_txg_history Ns = Ns Sy 100 Pq uint
  2282. Historical statistics for this many latest TXGs will be available in
  2283. .Pa /proc/spl/kstat/zfs/ Ns Ao Ar pool Ac Ns Pa /TXGs .
  2284. .
  2285. .It Sy zfs_txg_timeout Ns = Ns Sy 5 Ns s Pq uint
2286. Flush dirty data to disk at least once every this many seconds (maximum TXG
  2287. duration).
  2288. .
  2289. .It Sy zfs_vdev_aggregation_limit Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq uint
  2290. Max vdev I/O aggregation size.
  2291. .
  2292. .It Sy zfs_vdev_aggregation_limit_non_rotating Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq uint
  2293. Max vdev I/O aggregation size for non-rotating media.
  2294. .
  2295. .It Sy zfs_vdev_mirror_rotating_inc Ns = Ns Sy 0 Pq int
2296. A number by which the balancing algorithm increments the load calculation,
2297. for the purpose of selecting the least busy mirror member,
2298. when an I/O operation immediately follows its predecessor
2299. on rotational vdevs.
  2300. .
  2301. .It Sy zfs_vdev_mirror_rotating_seek_inc Ns = Ns Sy 5 Pq int
  2302. A number by which the balancing algorithm increments the load calculation for
  2303. the purpose of selecting the least busy mirror member when an I/O operation
  2304. lacks locality as defined by
  2305. .Sy zfs_vdev_mirror_rotating_seek_offset .
2306. Operations within this distance that do not immediately follow the previous
2307. operation are incremented by half of this value.
  2308. .
  2309. .It Sy zfs_vdev_mirror_rotating_seek_offset Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq int
  2310. The maximum distance for the last queued I/O operation in which
  2311. the balancing algorithm considers an operation to have locality.
  2312. .No See Sx ZFS I/O SCHEDULER .
  2313. .
  2314. .It Sy zfs_vdev_mirror_non_rotating_inc Ns = Ns Sy 0 Pq int
  2315. A number by which the balancing algorithm increments the load calculation for
  2316. the purpose of selecting the least busy mirror member on non-rotational vdevs
  2317. when I/O operations do not immediately follow one another.
  2318. .
  2319. .It Sy zfs_vdev_mirror_non_rotating_seek_inc Ns = Ns Sy 1 Pq int
  2320. A number by which the balancing algorithm increments the load calculation for
  2321. the purpose of selecting the least busy mirror member when an I/O operation
  2322. lacks
  2323. locality as defined by the
  2324. .Sy zfs_vdev_mirror_rotating_seek_offset .
2325. Operations within this distance that do not immediately follow the previous
2326. operation are incremented by half of this value.
  2327. .
  2328. .It Sy zfs_vdev_read_gap_limit Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq uint
  2329. Aggregate read I/O operations if the on-disk gap between them is within this
  2330. threshold.
  2331. .
  2332. .It Sy zfs_vdev_write_gap_limit Ns = Ns Sy 4096 Ns B Po 4 KiB Pc Pq uint
  2333. Aggregate write I/O operations if the on-disk gap between them is within this
  2334. threshold.
  2335. .
  2336. .It Sy zfs_vdev_raidz_impl Ns = Ns Sy fastest Pq string
  2337. Select the raidz parity implementation to use.
  2338. .Pp
  2339. Variants that don't depend on CPU-specific features
  2340. may be selected on module load, as they are supported on all systems.
  2341. The remaining options may only be set after the module is loaded,
  2342. as they are available only if the implementations are compiled in
  2343. and supported on the running system.
  2344. .Pp
  2345. Once the module is loaded,
  2346. .Pa /sys/module/zfs/parameters/zfs_vdev_raidz_impl
  2347. will show the available options,
  2348. with the currently selected one enclosed in square brackets.
  2349. .Pp
  2350. .TS
  2351. lb l l .
  2352. fastest selected by built-in benchmark
  2353. original original implementation
  2354. scalar scalar implementation
  2355. sse2 SSE2 instruction set 64-bit x86
  2356. ssse3 SSSE3 instruction set 64-bit x86
  2357. avx2 AVX2 instruction set 64-bit x86
  2358. avx512f AVX512F instruction set 64-bit x86
  2359. avx512bw AVX512F & AVX512BW instruction sets 64-bit x86
  2360. aarch64_neon NEON Aarch64/64-bit ARMv8
  2361. aarch64_neonx2 NEON with more unrolling Aarch64/64-bit ARMv8
  2362. powerpc_altivec Altivec PowerPC
  2363. .TE
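.Pp
For example, the available implementations can be listed and one selected
through the module parameter interface; the selector shown is only an
illustration and must be supported on the running system:
.Bd -literal -compact
# cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
# echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
.Ed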
  2364. .
  2365. .It Sy zfs_vdev_scheduler Pq charp
  2366. .Sy DEPRECATED .
  2367. Prints warning to kernel log for compatibility.
  2368. .
  2369. .It Sy zfs_zevent_len_max Ns = Ns Sy 512 Pq uint
  2370. Max event queue length.
  2371. Events in the queue can be viewed with
  2372. .Xr zpool-events 8 .
  2373. .
  2374. .It Sy zfs_zevent_retain_max Ns = Ns Sy 2000 Pq int
  2375. Maximum recent zevent records to retain for duplicate checking.
  2376. Setting this to
  2377. .Sy 0
  2378. disables duplicate detection.
  2379. .
  2380. .It Sy zfs_zevent_retain_expire_secs Ns = Ns Sy 900 Ns s Po 15 min Pc Pq int
  2381. Lifespan for a recent ereport that was retained for duplicate checking.
  2382. .
  2383. .It Sy zfs_zil_clean_taskq_maxalloc Ns = Ns Sy 1048576 Pq int
  2384. The maximum number of taskq entries that are allowed to be cached.
  2385. When this limit is exceeded transaction records (itxs)
  2386. will be cleaned synchronously.
  2387. .
  2388. .It Sy zfs_zil_clean_taskq_minalloc Ns = Ns Sy 1024 Pq int
  2389. The number of taskq entries that are pre-populated when the taskq is first
  2390. created and are immediately available for use.
  2391. .
  2392. .It Sy zfs_zil_clean_taskq_nthr_pct Ns = Ns Sy 100 Ns % Pq int
  2393. This controls the number of threads used by
  2394. .Sy dp_zil_clean_taskq .
  2395. The default value of
  2396. .Sy 100%
  2397. will create a maximum of one thread per CPU.
  2398. .
  2399. .It Sy zil_maxblocksize Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq uint
  2400. This sets the maximum block size used by the ZIL.
  2401. On very fragmented pools, lowering this
  2402. .Pq typically to Sy 36 KiB
  2403. can improve performance.
  2404. .
  2405. .It Sy zil_maxcopied Ns = Ns Sy 7680 Ns B Po 7.5 KiB Pc Pq uint
  2406. This sets the maximum number of write bytes logged via WR_COPIED.
  2407. It tunes the tradeoff between an additional memory copy (and possibly worse
  2408. log space efficiency) and an additional range lock/unlock.
  2409. .
  2410. .It Sy zil_nocacheflush Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2411. Disable the cache flush commands that are normally sent to disk by
  2412. the ZIL after an LWB write has completed.
  2413. Setting this will cause ZIL corruption on power loss
  2414. if a volatile out-of-order write cache is enabled.
  2415. .
  2416. .It Sy zil_replay_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2417. Disable intent logging replay.
  2418. Can be disabled for recovery from corrupted ZIL.
  2419. .
  2420. .It Sy zil_slog_bulk Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq u64
  2421. Limit SLOG write size per commit executed with synchronous priority.
  2422. Any writes above that will be executed with lower (asynchronous) priority
  2423. to limit potential SLOG device abuse by a single active ZIL writer.
  2424. .
  2425. .It Sy zfs_zil_saxattr Ns = Ns Sy 1 Ns | Ns 0 Pq int
  2426. Setting this tunable to zero disables ZIL logging of new
  2427. .Sy xattr Ns = Ns Sy sa
  2428. records if the
  2429. .Sy org.openzfs:zilsaxattr
  2430. feature is enabled on the pool.
  2431. This would only be necessary to work around bugs in the ZIL logging or replay
  2432. code for this record type.
  2433. The tunable has no effect if the feature is disabled.
  2434. .
  2435. .It Sy zfs_embedded_slog_min_ms Ns = Ns Sy 64 Pq uint
  2436. Usually, one metaslab from each normal-class vdev is dedicated for use by
  2437. the ZIL to log synchronous writes.
  2438. However, if there are fewer than
  2439. .Sy zfs_embedded_slog_min_ms
  2440. metaslabs in the vdev, this functionality is disabled.
  2441. This ensures that we don't set aside an unreasonable amount of space for the
  2442. ZIL.
  2443. .
  2444. .It Sy zstd_earlyabort_pass Ns = Ns Sy 1 Pq uint
  2445. Whether the heuristic for detecting incompressible data with zstd levels >= 3,
  2446. using LZ4 and zstd-1 passes, is enabled.
  2447. .
  2448. .It Sy zstd_abort_size Ns = Ns Sy 131072 Pq uint
  2449. Minimum uncompressed size (inclusive) of a record for the early abort
  2450. heuristic to be attempted.
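.Pp
As an illustration of how
.Sy zstd_earlyabort_pass
and
.Sy zstd_abort_size
interact, the following simplified C sketch outlines the decision; the
function and variable layout are hypothetical and it is not the OpenZFS
implementation:
.Bd -literal -offset indent
#include <stdbool.h>
#include <stddef.h>

static unsigned int zstd_earlyabort_pass = 1;   /* heuristic enabled */
static size_t zstd_abort_size = 131072;         /* 128 KiB, the default */

/*
 * Decide whether to skip the expensive zstd compression of one record.
 * lz4_compressed and zstd1_compressed report whether the cheap LZ4 and
 * zstd-1 passes achieved any compression on that record.
 */
bool
zstd_should_early_abort(size_t size, int zstd_level,
    bool lz4_compressed, bool zstd1_compressed)
{
        if (zstd_earlyabort_pass == 0 || zstd_level < 3 ||
            size < zstd_abort_size)
                return (false);         /* heuristic not applicable */

        /* If either cheap pass found redundancy, the data is compressible. */
        if (lz4_compressed || zstd1_compressed)
                return (false);

        /* Neither pass helped: treat the record as incompressible. */
        return (true);
}
.Ed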
  2451. .
  2452. .It Sy zio_deadman_log_all Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2453. If non-zero, the zio deadman will produce debugging messages
  2454. .Pq see Sy zfs_dbgmsg_enable
  2455. for all zios, rather than only for leaf zios possessing a vdev.
  2456. This is meant to be used by developers to gain
  2457. diagnostic information for hang conditions which don't involve a mutex
  2458. or other locking primitive: typically conditions in which a thread in
  2459. the zio pipeline is looping indefinitely.
  2460. .
  2461. .It Sy zio_slow_io_ms Ns = Ns Sy 30000 Ns ms Po 30 s Pc Pq int
  2462. When an I/O operation takes more than this much time to complete,
  2463. it's marked as slow.
  2464. Each slow operation causes a delay zevent.
  2465. Slow I/O counters can be seen with
  2466. .Nm zpool Cm status Fl s .
  2467. .
  2468. .It Sy zio_dva_throttle_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
  2469. Throttle block allocations in the I/O pipeline.
  2470. This allows for dynamic allocation distribution when devices are imbalanced.
  2471. When enabled, the maximum number of pending allocations per top-level vdev
  2472. is limited by
  2473. .Sy zfs_vdev_queue_depth_pct .
  2474. .
  2475. .It Sy zfs_xattr_compat Ns = Ns 0 Ns | Ns 1 Pq int
  2476. Control the naming scheme used when setting new xattrs in the user namespace.
  2477. If
  2478. .Sy 0
  2479. .Pq the default on Linux ,
  2480. user namespace xattr names are prefixed with the namespace, to be backwards
  2481. compatible with previous versions of ZFS on Linux.
  2482. If
  2483. .Sy 1
  2484. .Pq the default on Fx ,
  2485. user namespace xattr names are not prefixed, to be backwards compatible with
  2486. previous versions of ZFS on illumos and
  2487. .Fx .
  2488. .Pp
  2489. Either naming scheme can be read on this and future versions of ZFS, regardless
  2490. of this tunable, but legacy ZFS on illumos or
  2491. .Fx
  2492. is unable to read user namespace xattrs written in the Linux format, and
  2493. legacy versions of ZFS on Linux are unable to read user namespace xattrs written
  2494. in the legacy ZFS format.
  2495. .Pp
  2496. An existing xattr with the alternate naming scheme is removed when overwriting
  2497. the xattr so as to not accumulate duplicates.
  2498. .
  2499. .It Sy zio_requeue_io_start_cut_in_line Ns = Ns Sy 0 Ns | Ns 1 Pq int
  2500. Prioritize requeued I/O.
  2501. .
  2502. .It Sy zio_taskq_batch_pct Ns = Ns Sy 80 Ns % Pq uint
  2503. Percentage of online CPUs which will run a worker thread for I/O.
  2504. These workers are responsible for I/O work such as compression, encryption,
  2505. checksum and parity calculations.
  2506. A fractional number of CPUs will be rounded down.
  2507. .Pp
  2508. The default value of
  2509. .Sy 80%
  2510. was chosen to avoid using all CPUs which can result in
  2511. latency issues and inconsistent application performance,
  2512. especially when slower compression and/or checksumming is enabled.
  2513. Set value only applies to pools imported/created after that.
  2514. .
  2515. .It Sy zio_taskq_batch_tpq Ns = Ns Sy 0 Pq uint
  2516. Number of worker threads per taskq.
  2517. Higher values improve I/O ordering and CPU utilization,
  2518. while lower values reduce lock contention.
  2519. Set value only applies to pools imported/created after that.
  2520. .Pp
  2521. If
  2522. .Sy 0 ,
  2523. generate a system-dependent value close to 6 threads per taskq.
  2524. Set value only applies to pools imported/created after that.
  2525. .
  2526. .It Sy zio_taskq_write_tpq Ns = Ns Sy 16 Pq uint
  2527. Determines the minimum number of threads per write issue taskq.
  2528. Higher values improve CPU utilization at high throughput,
  2529. while lower values reduce taskq lock contention at high IOPS.
  2530. Set value only applies to pools imported/created after that.
  2531. .
  2532. .It Sy zio_taskq_read Ns = Ns Sy fixed,1,8 null scale null Pq charp
  2533. Set the queue and thread configuration for the I/O read queues.
  2534. This is an advanced debugging parameter.
  2535. Don't change this unless you understand what it does.
  2536. Set values only apply to pools imported/created after that.
  2537. .
  2538. .It Sy zio_taskq_write Ns = Ns Sy sync null scale null Pq charp
  2539. Set the queue and thread configuration for the I/O write queues.
  2540. This is an advanced debugging parameter.
  2541. Don't change this unless you understand what it does.
  2542. Set values only apply to pools imported/created after that.
  2543. .
  2544. .It Sy zvol_inhibit_dev Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2545. Do not create zvol device nodes.
  2546. This may slightly improve startup time on
  2547. systems with a very large number of zvols.
  2548. .
  2549. .It Sy zvol_major Ns = Ns Sy 230 Pq uint
  2550. Major number for zvol block devices.
  2551. .
  2552. .It Sy zvol_max_discard_blocks Ns = Ns Sy 16384 Pq long
  2553. Discard (TRIM) operations done on zvols will be done in batches of this
  2554. many blocks, where block size is determined by the
  2555. .Sy volblocksize
  2556. property of a zvol.
  2557. .
  2558. .It Sy zvol_prefetch_bytes Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq uint
  2559. When adding a zvol to the system, prefetch this many bytes
  2560. from the start and end of the volume.
  2561. Prefetching these regions of the volume is desirable,
  2562. because they are likely to be accessed immediately by
  2563. .Xr blkid 8
  2564. or the kernel partitioner.
  2565. .
  2566. .It Sy zvol_request_sync Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2567. When processing I/O requests for a zvol, submit them synchronously.
  2568. This effectively limits the queue depth to
  2569. .Em 1
  2570. for each I/O submitter.
  2571. When unset, requests are handled asynchronously by a thread pool.
  2572. The number of requests which can be handled concurrently is controlled by
  2573. .Sy zvol_threads .
  2574. .Sy zvol_request_sync
  2575. is ignored when running on a kernel that supports block multiqueue
  2576. .Pq Li blk-mq .
  2577. .
  2578. .It Sy zvol_num_taskqs Ns = Ns Sy 0 Pq uint
  2579. Number of zvol taskqs.
  2580. If
  2581. .Sy 0
  2582. (the default) then scaling is done internally to prefer 6 threads per taskq.
  2583. This only applies on Linux.
  2584. .
  2585. .It Sy zvol_threads Ns = Ns Sy 0 Pq uint
  2586. The number of system wide threads to use for processing zvol block IOs.
  2587. If
  2588. .Sy 0
  2589. (the default) then internally set
  2590. .Sy zvol_threads
  2591. to the number of CPUs present or 32 (whichever is greater).
  2592. .
  2593. .It Sy zvol_blk_mq_threads Ns = Ns Sy 0 Pq uint
  2594. The number of threads per zvol to use for queuing IO requests.
  2595. This parameter will only appear if your kernel supports
  2596. .Li blk-mq
  2597. and is only read and assigned to a zvol at zvol load time.
  2598. If
  2599. .Sy 0
  2600. (the default) then internally set
  2601. .Sy zvol_blk_mq_threads
  2602. to the number of CPUs present.
  2603. .
  2604. .It Sy zvol_use_blk_mq Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2605. Set to
  2606. .Sy 1
  2607. to use the
  2608. .Li blk-mq
  2609. API for zvols.
  2610. Set to
  2611. .Sy 0
  2612. (the default) to use the legacy zvol APIs.
  2613. This setting can give better or worse zvol performance depending on
  2614. the workload.
  2615. This parameter will only appear if your kernel supports
  2616. .Li blk-mq
  2617. and is only read and assigned to a zvol at zvol load time.
  2618. .
  2619. .It Sy zvol_blk_mq_blocks_per_thread Ns = Ns Sy 8 Pq uint
  2620. If
  2621. .Sy zvol_use_blk_mq
  2622. is enabled, then process this number of
  2623. .Sy volblocksize Ns -sized blocks per zvol thread.
  2624. This tunable can be used to favor better performance for zvol reads (lower
  2625. values) or writes (higher values).
  2626. If set to
  2627. .Sy 0 ,
  2628. then the zvol layer will process the maximum number of blocks
  2629. per thread that it can.
  2630. This parameter will only appear if your kernel supports
  2631. .Li blk-mq
  2632. and is only applied at each zvol's load time.
  2633. .
  2634. .It Sy zvol_blk_mq_queue_depth Ns = Ns Sy 0 Pq uint
  2635. The queue_depth value for the zvol
  2636. .Li blk-mq
  2637. interface.
  2638. This parameter will only appear if your kernel supports
  2639. .Li blk-mq
  2640. and is only applied at each zvol's load time.
  2641. If
  2642. .Sy 0
  2643. (the default) then use the kernel's default queue depth.
  2644. Values are clamped to the kernel's
  2645. .Dv BLKDEV_MIN_RQ
  2646. and
  2647. .Dv BLKDEV_MAX_RQ Ns / Ns Dv BLKDEV_DEFAULT_RQ
  2648. limits.
  2649. .
  2650. .It Sy zvol_volmode Ns = Ns Sy 1 Pq uint
  2651. Defines the behaviour of zvol block devices when
  2652. .Sy volmode Ns = Ns Sy default :
  2653. .Bl -tag -compact -offset 4n -width "a"
  2654. .It Sy 1
  2655. .No equivalent to Sy full
  2656. .It Sy 2
  2657. .No equivalent to Sy dev
  2658. .It Sy 3
  2659. .No equivalent to Sy none
  2660. .El
  2661. .
  2662. .It Sy zvol_enforce_quotas Ns = Ns Sy 0 Ns | Ns 1 Pq uint
  2663. Enable strict ZVOL quota enforcement.
  2664. Strict quota enforcement may have a performance impact.
  2665. .El
  2666. .
  2667. .Sh ZFS I/O SCHEDULER
  2668. ZFS issues I/O operations to leaf vdevs in order to satisfy and complete I/O requests.
  2669. The scheduler determines when and in what order those operations are issued.
  2670. The scheduler divides operations into five I/O classes,
  2671. prioritized in the following order: sync read, sync write, async read,
  2672. async write, and scrub/resilver.
  2673. Each queue defines the minimum and maximum number of concurrent operations
  2674. that may be issued to the device.
  2675. In addition, the device has an aggregate maximum,
  2676. .Sy zfs_vdev_max_active .
  2677. Note that the sum of the per-queue minima must not exceed the aggregate maximum.
  2678. If the sum of the per-queue maxima exceeds the aggregate maximum,
  2679. then the number of active operations may reach
  2680. .Sy zfs_vdev_max_active ,
  2681. in which case no further operations will be issued,
  2682. regardless of whether all per-queue minima have been met.
  2683. .Pp
  2684. For many physical devices, throughput increases with the number of
  2685. concurrent operations, but latency typically suffers.
  2686. Furthermore, physical devices typically have a limit
  2687. at which more concurrent operations have no
  2688. effect on throughput or can actually cause it to decrease.
  2689. .Pp
  2690. The scheduler selects the next operation to issue by first looking for an
  2691. I/O class whose minimum has not been satisfied.
  2692. Once all are satisfied and the aggregate maximum has not been hit,
  2693. the scheduler looks for classes whose maximum has not been satisfied.
  2694. Iteration through the I/O classes is done in the order specified above.
  2695. No further operations are issued
  2696. if the aggregate maximum number of concurrent operations has been hit,
  2697. or if there are no operations queued for an I/O class that has not hit its
  2698. maximum.
  2699. Every time an I/O operation is queued or an operation completes,
  2700. the scheduler looks for new operations to issue.
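.Pp
The selection order described above can be summarized by the following
simplified C sketch; the type and function names are hypothetical and this is
an illustration of the two-pass search only, not the OpenZFS vdev queue code:
.Bd -literal -offset indent
enum io_class {
        SYNC_READ, SYNC_WRITE, ASYNC_READ, ASYNC_WRITE, SCRUB,
        NUM_CLASSES
};

struct io_class_state {
        unsigned int active;       /* operations issued to the device */
        unsigned int queued;       /* operations waiting in this class */
        unsigned int min_active;   /* per-class minimum */
        unsigned int max_active;   /* per-class maximum */
};

/* Return the class to issue from next, or -1 if nothing may be issued. */
int
pick_next_class(const struct io_class_state c[NUM_CLASSES],
    unsigned int total_active, unsigned int zfs_vdev_max_active)
{
        int i;

        if (total_active >= zfs_vdev_max_active)
                return (-1);            /* aggregate maximum reached */

        /* First pass: classes that have not yet met their minimum. */
        for (i = 0; i < NUM_CLASSES; i++)
                if (c[i].queued > 0 && c[i].active < c[i].min_active)
                        return (i);

        /* Second pass: classes that have not yet hit their maximum. */
        for (i = 0; i < NUM_CLASSES; i++)
                if (c[i].queued > 0 && c[i].active < c[i].max_active)
                        return (i);

        return (-1);                    /* all queues empty or at maximum */
}
.Ed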
  2701. .Pp
  2702. In general, smaller
  2703. .Sy max_active Ns s
  2704. will lead to lower latency of synchronous operations.
  2705. Larger
  2706. .Sy max_active Ns s
  2707. may lead to higher overall throughput, depending on underlying storage.
  2708. .Pp
  2709. The ratio of the queues'
  2710. .Sy max_active Ns s
  2711. determines the balance of performance between reads, writes, and scrubs.
  2712. For example, increasing
  2713. .Sy zfs_vdev_scrub_max_active
  2714. will cause the scrub or resilver to complete more quickly,
  2715. but will also cause reads and writes to have higher latency and lower throughput.
  2716. .Pp
  2717. All I/O classes have a fixed maximum number of outstanding operations,
  2718. except for the async write class.
  2719. Asynchronous writes represent the data that is committed to stable storage
  2720. during the syncing stage for transaction groups.
  2721. Transaction groups enter the syncing state periodically,
  2722. so the number of queued async writes will quickly burst up
  2723. and then bleed down to zero.
  2724. Rather than servicing them as quickly as possible,
  2725. the I/O scheduler changes the maximum number of active async write operations
  2726. according to the amount of dirty data in the pool.
  2727. Since both throughput and latency typically increase with the number of
  2728. concurrent operations issued to physical devices, reducing the
  2729. burstiness in the number of simultaneous operations also stabilizes the
  2730. response time of operations from other queues, in particular synchronous ones.
  2731. In broad strokes, the I/O scheduler will issue more concurrent operations
  2732. from the async write queue as there is more dirty data in the pool.
  2733. .
  2734. .Ss Async Writes
  2735. The number of concurrent operations issued for the async write I/O class
  2736. follows a piece-wise linear function defined by a few adjustable points:
  2737. .Bd -literal
  2738.        |              o---------| <-- \fBzfs_vdev_async_write_max_active\fP
  2739.   ^    |             /^         |
  2740.   |    |            / |         |
  2741. active |           /  |         |
  2742.  I/O   |          /   |         |
  2743. count  |         /    |         |
  2744.        |        /     |         |
  2745.        |-------o      |         | <-- \fBzfs_vdev_async_write_min_active\fP
  2746.       0|_______^______|_________|
  2747.        0%      |      |       100% of \fBzfs_dirty_data_max\fP
  2748.                |      |
  2749.                |      `-- \fBzfs_vdev_async_write_active_max_dirty_percent\fP
  2750.                `--------- \fBzfs_vdev_async_write_active_min_dirty_percent\fP
  2751. .Ed
  2752. .Pp
  2753. Until the amount of dirty data exceeds a minimum percentage of the dirty
  2754. data allowed in the pool, the I/O scheduler will limit the number of
  2755. concurrent operations to the minimum.
  2756. As that threshold is crossed, the number of concurrent operations issued
  2757. increases linearly to the maximum at the specified maximum percentage
  2758. of the dirty data allowed in the pool.
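.Pp
For illustration, the function above can be written as the following C
sketch; the parameter names mirror the tunables, but this is a simplified
example rather than the OpenZFS implementation:
.Bd -literal -offset indent
#include <stdint.h>

unsigned int
async_write_max_active(uint64_t dirty, uint64_t zfs_dirty_data_max,
    unsigned int min_active, unsigned int max_active,
    unsigned int min_dirty_pct, unsigned int max_dirty_pct)
{
        uint64_t dirty_pct = dirty * 100 / zfs_dirty_data_max;

        if (dirty_pct <= min_dirty_pct)
                return (min_active);    /* flat minimum segment */
        if (dirty_pct >= max_dirty_pct)
                return (max_active);    /* flat maximum segment */

        /* Linear ramp between the two breakpoints. */
        return (min_active + (max_active - min_active) *
            (dirty_pct - min_dirty_pct) / (max_dirty_pct - min_dirty_pct));
}
.Ed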
  2759. .Pp
  2760. Ideally, the amount of dirty data on a busy pool will stay in the sloped
  2761. part of the function between
  2762. .Sy zfs_vdev_async_write_active_min_dirty_percent
  2763. and
  2764. .Sy zfs_vdev_async_write_active_max_dirty_percent .
  2765. If it exceeds the maximum percentage,
  2766. this indicates that the rate of incoming data is
  2767. greater than the rate that the backend storage can handle.
  2768. In this case, we must further throttle incoming writes,
  2769. as described in the next section.
  2770. .
  2771. .Sh ZFS TRANSACTION DELAY
  2772. We delay transactions when we've determined that the backend storage
  2773. isn't able to accommodate the rate of incoming writes.
  2774. .Pp
  2775. If there is already a transaction waiting, we delay relative to when
  2776. that transaction will finish waiting.
  2777. This way the calculated delay time
  2778. is independent of the number of threads concurrently executing transactions.
  2779. .Pp
  2780. If we are the only waiter, wait relative to when the transaction started,
  2781. rather than the current time.
  2782. This credits the transaction for "time already served",
  2783. e.g. reading indirect blocks.
  2784. .Pp
  2785. The minimum time for a transaction to take is calculated as
  2786. .D1 min_time = min( Ns Sy zfs_delay_scale No \(mu Po Sy dirty No \- Sy min Pc / Po Sy max No \- Sy dirty Pc , 100ms)
  2787. .Pp
  2788. The delay has two degrees of freedom that can be adjusted via tunables.
  2789. The percentage of dirty data at which we start to delay is defined by
  2790. .Sy zfs_delay_min_dirty_percent .
  2791. This should typically be at or above
  2792. .Sy zfs_vdev_async_write_active_max_dirty_percent ,
  2793. so that we only start to delay after writing at full speed
  2794. has failed to keep up with the incoming write rate.
  2795. The scale of the curve is defined by
  2796. .Sy zfs_delay_scale .
  2797. Roughly speaking, this variable determines the amount of delay at the midpoint
  2798. of the curve.
  2799. .Bd -literal
  2800. delay
  2801.  10ms +-------------------------------------------------------------*+
  2802.       |                                                             *|
  2803.   9ms +                                                             *+
  2804.       |                                                             *|
  2805.   8ms +                                                             *+
  2806.       |                                                            * |
  2807.   7ms +                                                            * +
  2808.       |                                                            * |
  2809.   6ms +                                                            * +
  2810.       |                                                            * |
  2811.   5ms +                                                            * +
  2812.       |                                                            * |
  2813.   4ms +                                                            * +
  2814.       |                                                            * |
  2815.   3ms +                                                            * +
  2816.       |                                                            * |
  2817.   2ms +                                              (midpoint)    * +
  2818.       |                                               |         **   |
  2819.   1ms +                                               v      ***     +
  2820.       |             \fBzfs_delay_scale\fP ---------->      ********        |
  2821.     0 +-------------------------------------*********----------------+
  2822.       0%                    <- \fBzfs_dirty_data_max\fP ->              100%
  2823. .Ed
  2824. .Pp
  2825. Note that, since the delay is added to the outstanding time remaining on the
  2826. most recent transaction, it is effectively the inverse of IOPS.
  2827. Here, the midpoint of
  2828. .Em 500 us
  2829. translates to
  2830. .Em 2000 IOPS .
  2831. The shape of the curve
  2832. was chosen such that small changes in the amount of accumulated dirty data
  2833. in the first three quarters of the curve yield relatively small differences
  2834. in the amount of delay.
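.Pp
At the midpoint, where the amount of dirty data sits exactly halfway between
the delay threshold and the limit, the two differences in the formula above
cancel and the delay equals
.Sy zfs_delay_scale
itself.
The following C sketch restates the formula; it assumes, for the sake of the
example, that the scale is expressed in nanoseconds, and it is an illustration
only, not the OpenZFS implementation:
.Bd -literal -offset indent
#include <stdint.h>

#define MSEC2NSEC(m)    ((uint64_t)(m) * 1000000ULL)

/* dirty, min_dirty and max_dirty are amounts of dirty data in bytes. */
uint64_t
tx_delay_ns(uint64_t dirty, uint64_t min_dirty, uint64_t max_dirty,
    uint64_t zfs_delay_scale)
{
        uint64_t delay;

        if (dirty <= min_dirty)
                return (0);                     /* no throttling yet */
        if (dirty >= max_dirty)
                return (MSEC2NSEC(100));        /* clamp at the 100 ms cap */

        /* min_time = zfs_delay_scale * (dirty - min) / (max - dirty) */
        delay = zfs_delay_scale * (dirty - min_dirty) / (max_dirty - dirty);

        return (delay < MSEC2NSEC(100) ? delay : MSEC2NSEC(100));
}
.Ed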
  2835. .Pp
  2836. The effects can be easier to understand when the amount of delay is
  2837. represented on a logarithmic scale:
  2838. .Bd -literal
  2839. delay
  2840. 100ms +-------------------------------------------------------------++
  2841.       +                                                              +
  2842.       |                                                              |
  2843.       +                                                             *+
  2844.  10ms +                                                             *+
  2845.       +                                                           ** +
  2846.       |                                               (midpoint) **  |
  2847.       +                                                |       **    +
  2848.   1ms +                                                v   ****      +
  2849.       +             \fBzfs_delay_scale\fP ---------->      *****           +
  2850.       |                                          ****                |
  2851.       +                                      ****                    +
  2852. 100us +                                    **                        +
  2853.       +                                   *                          +
  2854.       |                                  *                           |
  2855.       +                                  *                           +
  2856.  10us +                                  *                           +
  2857.       +                                                              +
  2858.       |                                                              |
  2859.       +                                                              +
  2860.       +--------------------------------------------------------------+
  2861.       0%                    <- \fBzfs_dirty_data_max\fP ->              100%
  2862. .Ed
  2863. .Pp
  2864. Note here that only as the amount of dirty data approaches its limit does
  2865. the delay start to increase rapidly.
  2866. The goal of a properly tuned system should be to keep the amount of dirty data
  2867. out of that range by first ensuring that the appropriate limits are set
  2868. for the I/O scheduler to reach optimal throughput on the back-end storage,
  2869. and then by changing the value of
  2870. .Sy zfs_delay_scale
  2871. to increase the steepness of the curve.