- .\"
- .\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
- .\" Copyright (c) 2019, 2021 by Delphix. All rights reserved.
- .\" Copyright (c) 2019 Datto Inc.
- .\" Copyright (c) 2023, 2024 Klara, Inc.
- .\" The contents of this file are subject to the terms of the Common Development
- .\" and Distribution License (the "License"). You may not use this file except
- .\" in compliance with the License. You can obtain a copy of the license at
- .\" usr/src/OPENSOLARIS.LICENSE or https://opensource.org/licenses/CDDL-1.0.
- .\"
- .\" See the License for the specific language governing permissions and
- .\" limitations under the License. When distributing Covered Code, include this
- .\" CDDL HEADER in each file and include the License file at
- .\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
- .\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
- .\" own identifying information:
- .\" Portions Copyright [yyyy] [name of copyright owner]
- .\"
- .\" Copyright (c) 2024, Klara, Inc.
- .\"
- .Dd November 1, 2024
- .Dt ZFS 4
- .Os
- .
- .Sh NAME
- .Nm zfs
- .Nd tuning of the ZFS kernel module
- .
- .Sh DESCRIPTION
- The ZFS module supports these parameters:
- .Bl -tag -width Ds
- .It Sy dbuf_cache_max_bytes Ns = Ns Sy UINT64_MAX Ns B Pq u64
- Maximum size in bytes of the dbuf cache.
- The target size is the lesser of this value and
- .No 1/2^ Ns Sy dbuf_cache_shift Pq 1/32nd
- of the target ARC size.
- The behavior of the dbuf cache and its associated settings
- can be observed via the
- .Pa /proc/spl/kstat/zfs/dbufstats
- kstat.
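- .Pp
- As a minimal illustration of the sizing rule above (not code from the ZFS
- sources; the 8 GiB ARC target is an assumed example value):
- .Bd -literal -compact
- # Sketch: dbuf cache target = MIN(dbuf_cache_max_bytes, ARC target >> shift)
- arc_c_max = 8 * 2**30              # assumed target ARC size: 8 GiB
- dbuf_cache_max_bytes = 2**64 - 1   # default: UINT64_MAX (effectively no cap)
- dbuf_cache_shift = 5               # 1/32nd of the ARC target
- target = min(dbuf_cache_max_bytes, arc_c_max >> dbuf_cache_shift)
- print(target)                      # 268435456 bytes = 256 MiB
- .Ed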
- .
- .It Sy dbuf_metadata_cache_max_bytes Ns = Ns Sy UINT64_MAX Ns B Pq u64
- Maximum size in bytes of the metadata dbuf cache.
- The target size is the lesser of this value and
- .No 1/2^ Ns Sy dbuf_metadata_cache_shift Pq 1/64th
- of the target ARC size.
- The behavior of the metadata dbuf cache and its associated settings
- can be observed via the
- .Pa /proc/spl/kstat/zfs/dbufstats
- kstat.
- .
- .It Sy dbuf_cache_hiwater_pct Ns = Ns Sy 10 Ns % Pq uint
- The percentage over
- .Sy dbuf_cache_max_bytes
- when dbufs must be evicted directly.
- .
- .It Sy dbuf_cache_lowater_pct Ns = Ns Sy 10 Ns % Pq uint
- The percentage below
- .Sy dbuf_cache_max_bytes
- when the evict thread stops evicting dbufs.
- .
- .It Sy dbuf_cache_shift Ns = Ns Sy 5 Pq uint
- Set the size of the dbuf cache
- .Pq Sy dbuf_cache_max_bytes
- to a log2 fraction of the target ARC size.
- .
- .It Sy dbuf_metadata_cache_shift Ns = Ns Sy 6 Pq uint
- Set the size of the dbuf metadata cache
- .Pq Sy dbuf_metadata_cache_max_bytes
- to a log2 fraction of the target ARC size.
- .
- .It Sy dbuf_mutex_cache_shift Ns = Ns Sy 0 Pq uint
- Set the size of the mutex array for the dbuf cache.
- When set to
- .Sy 0
- the array is dynamically sized based on total system memory.
- .
- .It Sy dmu_object_alloc_chunk_shift Ns = Ns Sy 7 Po 128 Pc Pq uint
- Number of dnode slots allocated in a single operation, expressed as a power of 2.
- The default value minimizes lock contention for the bulk operation performed.
- .
- .It Sy dmu_ddt_copies Ns = Ns Sy 3 Pq uint
- Controls the number of copies stored for DeDup Table
- .Pq DDT
- objects.
- Reducing the number of copies to 1 from the previous default of 3
- can reduce the write inflation caused by deduplication.
- This assumes redundancy for this data is provided by the vdev layer.
- If the DDT is damaged, space may be leaked
- .Pq not freed
- when the DDT can not report the correct reference count.
- .
- .It Sy dmu_prefetch_max Ns = Ns Sy 134217728 Ns B Po 128 MiB Pc Pq uint
- Limit the amount we can prefetch with one call to this amount in bytes.
- This helps to limit the amount of memory that can be used by prefetching.
- .
- .It Sy ignore_hole_birth Pq int
- Alias for
- .Sy send_holes_without_birth_time .
- .
- .It Sy l2arc_feed_again Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Turbo L2ARC warm-up.
- When the L2ARC is cold the fill interval will be set as fast as possible.
- .
- .It Sy l2arc_feed_min_ms Ns = Ns Sy 200 Pq u64
- Min feed interval in milliseconds.
- Requires
- .Sy l2arc_feed_again Ns = Ns Ar 1
- and only applicable in related situations.
- .
- .It Sy l2arc_feed_secs Ns = Ns Sy 1 Pq u64
- Seconds between L2ARC writing.
- .
- .It Sy l2arc_headroom Ns = Ns Sy 8 Pq u64
- How far through the ARC lists to search for L2ARC cacheable content,
- expressed as a multiplier of
- .Sy l2arc_write_max .
- ARC persistence across reboots can be achieved with persistent L2ARC
- by setting this parameter to
- .Sy 0 ,
- allowing the full length of ARC lists to be searched for cacheable content.
- .
- .It Sy l2arc_headroom_boost Ns = Ns Sy 200 Ns % Pq u64
- Scales
- .Sy l2arc_headroom
- by this percentage when L2ARC contents are being successfully compressed
- before writing.
- A value of
- .Sy 100
- disables this feature.
- .
- .It Sy l2arc_exclude_special Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Controls whether buffers present on special vdevs are eligible for caching
- into L2ARC.
- If set to 1, exclude dbufs on special vdevs from being cached to L2ARC.
- .
- .It Sy l2arc_mfuonly Ns = Ns Sy 0 Ns | Ns 1 Ns | Ns 2 Pq int
- Controls whether only MFU metadata and data are cached from ARC into L2ARC.
- This may be desired to avoid wasting space on L2ARC when reading/writing large
- amounts of data that are not expected to be accessed more than once.
- .Pp
- The default is 0,
- meaning both MRU and MFU data and metadata are cached.
- When turning off this feature (setting it to 0), some MRU buffers will
- still be present in ARC and eventually cached on L2ARC.
- .No If Sy l2arc_noprefetch Ns = Ns Sy 0 ,
- some prefetched buffers will be cached to L2ARC, and those might later
- transition to MRU, in which case the
- .Sy l2arc_mru_asize No arcstat will not be Sy 0 .
- .Pp
- Setting it to 1 means to L2 cache only MFU data and metadata.
- .Pp
- Setting it to 2 means to L2 cache all metadata (MRU+MFU) but
- only MFU data (i.e. MRU data are not cached).
- This can be the right setting to cache as much metadata as possible
- even under high data turnover.
- .Pp
- Regardless of
- .Sy l2arc_noprefetch ,
- some MFU buffers might be evicted from ARC,
- accessed later on as prefetches and transition to MRU as prefetches.
- If accessed again they are counted as MRU and the
- .Sy l2arc_mru_asize No arcstat will not be Sy 0 .
- .Pp
- The ARC status of L2ARC buffers when they were first cached in
- L2ARC can be seen in the
- .Sy l2arc_mru_asize , Sy l2arc_mfu_asize , No and Sy l2arc_prefetch_asize
- arcstats when importing the pool or onlining a cache
- device if persistent L2ARC is enabled.
- .Pp
- The
- .Sy evict_l2_eligible_mru
- arcstat does not take into account if this option is enabled as the information
- provided by the
- .Sy evict_l2_eligible_m[rf]u
- arcstats can be used to decide if toggling this option is appropriate
- for the current workload.
- .
- .It Sy l2arc_meta_percent Ns = Ns Sy 33 Ns % Pq uint
- Percent of ARC size allowed for L2ARC-only headers.
- Since L2ARC buffers are not evicted on memory pressure,
- too many headers on a system with an irrationally large L2ARC
- can render it slow or unusable.
- This parameter limits L2ARC writes and rebuilds to achieve the target.
- .
- .It Sy l2arc_trim_ahead Ns = Ns Sy 0 Ns % Pq u64
- Trims ahead of the current write size
- .Pq Sy l2arc_write_max
- on L2ARC devices by this percentage of write size if we have filled the device.
- If set to
- .Sy 100
- we TRIM twice the space required to accommodate upcoming writes.
- A minimum of
- .Sy 64 MiB
- will be trimmed.
- It also enables TRIM of the whole L2ARC device upon creation
- or addition to an existing pool or if the header of the device is
- invalid upon importing a pool or onlining a cache device.
- A value of
- .Sy 0
- disables TRIM on L2ARC altogether and is the default as it can put significant
- stress on the underlying storage devices.
- This will vary depending on how well the specific device handles these commands.
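- .Pp
- A rough sketch of the trim-ahead arithmetic described above, assuming the
- default
- .Sy l2arc_write_max
- of 32 MiB; the in-kernel logic has additional conditions:
- .Bd -literal -compact
- # Sketch only: how much space is trimmed ahead of an upcoming write.
- l2arc_write_max = 32 * 2**20      # 32 MiB (default write size)
- l2arc_trim_ahead = 100            # percent of the write size
- trim = l2arc_write_max * (1 + l2arc_trim_ahead / 100)
- trim = max(trim, 64 * 2**20)      # at least 64 MiB is trimmed
- print(int(trim))                  # 67108864 bytes = 64 MiB, twice the write
- .Ed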
- .
- .It Sy l2arc_noprefetch Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Do not write buffers to L2ARC if they were prefetched but not used by
- applications.
- In case there are prefetched buffers in L2ARC and this option
- is later set, we do not read the prefetched buffers from L2ARC.
- Unsetting this option is useful for caching sequential reads from the
- disks to L2ARC and serving those reads from L2ARC later on.
- This may be beneficial in case the L2ARC device is significantly faster
- in sequential reads than the disks of the pool.
- .Pp
- Use
- .Sy 1
- to disable and
- .Sy 0
- to enable caching/reading prefetches to/from L2ARC.
- .
- .It Sy l2arc_norw Ns = Ns Sy 0 Ns | Ns 1 Pq int
- No reads during writes.
- .
- .It Sy l2arc_write_boost Ns = Ns Sy 33554432 Ns B Po 32 MiB Pc Pq u64
- Cold L2ARC devices will have
- .Sy l2arc_write_max
- increased by this amount while they remain cold.
- .
- .It Sy l2arc_write_max Ns = Ns Sy 33554432 Ns B Po 32 MiB Pc Pq u64
- Max write bytes per interval.
- .
- .It Sy l2arc_rebuild_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Rebuild the L2ARC when importing a pool (persistent L2ARC).
- This can be disabled if there are problems importing a pool
- or attaching an L2ARC device (e.g. the L2ARC device is slow
- in reading stored log metadata, or the metadata
- has become somehow fragmented/unusable).
- .
- .It Sy l2arc_rebuild_blocks_min_l2size Ns = Ns Sy 1073741824 Ns B Po 1 GiB Pc Pq u64
- Minimum size of an L2ARC device required in order to write log blocks in it.
- The log blocks are used upon importing the pool to rebuild the persistent L2ARC.
- .Pp
- For L2ARC devices less than 1 GiB, the amount of data
- .Fn l2arc_evict
- evicts is significant compared to the amount of restored L2ARC data.
- In this case, do not write log blocks in L2ARC in order not to waste space.
- .
- .It Sy metaslab_aliquot Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
- Metaslab granularity, in bytes.
- This is roughly similar to what would be referred to as the "stripe size"
- in traditional RAID arrays.
- In normal operation, ZFS will try to write this amount of data to each disk
- before moving on to the next top-level vdev.
- .
- .It Sy metaslab_bias_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable metaslab group biasing based on their vdevs' over- or under-utilization
- relative to the pool.
- .
- .It Sy metaslab_force_ganging Ns = Ns Sy 16777217 Ns B Po 16 MiB + 1 B Pc Pq u64
- Make some blocks above a certain size be gang blocks.
- This option is used by the test suite to facilitate testing.
- .
- .It Sy metaslab_force_ganging_pct Ns = Ns Sy 3 Ns % Pq uint
- For blocks that could be forced to be a gang block (due to
- .Sy metaslab_force_ganging ) ,
- force this many of them to be gang blocks.
- .
- .It Sy brt_zap_prefetch Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Controls prefetching BRT records for blocks which are going to be cloned.
- .
- .It Sy brt_zap_default_bs Ns = Ns Sy 12 Po 4 KiB Pc Pq int
- Default BRT ZAP data block size as a power of 2.
- Note that changing this after creating a BRT on the pool will not affect
- existing BRTs, only newly created ones.
- .
- .It Sy brt_zap_default_ibs Ns = Ns Sy 12 Po 4 KiB Pc Pq int
- Default BRT ZAP indirect block size as a power of 2.
- Note that changing this after creating a BRT on the pool will not affect
- existing BRTs, only newly created ones.
- .
- .It Sy ddt_zap_default_bs Ns = Ns Sy 15 Po 32 KiB Pc Pq int
- Default DDT ZAP data block size as a power of 2.
- Note that changing this after creating a DDT on the pool will not affect
- existing DDTs, only newly created ones.
- .
- .It Sy ddt_zap_default_ibs Ns = Ns Sy 15 Po 32 KiB Pc Pq int
- Default DDT ZAP indirect block size as a power of 2.
- Note that changing this after creating a DDT on the pool will not affect
- existing DDTs, only newly created ones.
- .
- .It Sy zfs_default_bs Ns = Ns Sy 9 Po 512 B Pc Pq int
- Default dnode block size as a power of 2.
- .
- .It Sy zfs_default_ibs Ns = Ns Sy 17 Po 128 KiB Pc Pq int
- Default dnode indirect block size as a power of 2.
- .
- .It Sy zfs_dio_enabled Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Enable Direct I/O.
- If this setting is 0, then all I/O requests will be directed through the ARC
- acting as though the dataset property
- .Sy direct
- was set to
- .Sy disabled .
- .
- .It Sy zfs_history_output_max Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
- When attempting to log an output nvlist of an ioctl in the on-disk history,
- the output will not be stored if it is larger than this size (in bytes).
- This must be less than
- .Sy DMU_MAX_ACCESS Pq 64 MiB .
- This applies primarily to
- .Fn zfs_ioc_channel_program Pq cf. Xr zfs-program 8 .
- .
- .It Sy zfs_keep_log_spacemaps_at_export Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Prevent log spacemaps from being destroyed during pool exports and destroys.
- .
- .It Sy zfs_metaslab_segment_weight_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable/disable segment-based metaslab selection.
- .
- .It Sy zfs_metaslab_switch_threshold Ns = Ns Sy 2 Pq int
- When using segment-based metaslab selection, continue allocating
- from the active metaslab until this option's
- worth of buckets have been exhausted.
- .
- .It Sy metaslab_debug_load Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Load all metaslabs during pool import.
- .
- .It Sy metaslab_debug_unload Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Prevent metaslabs from being unloaded.
- .
- .It Sy metaslab_fragmentation_factor_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable use of the fragmentation metric in computing metaslab weights.
- .
- .It Sy metaslab_df_max_search Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
- Maximum distance to search forward from the last offset.
- Without this limit, fragmented pools can see
- .Em >100`000
- iterations and
- .Fn metaslab_block_picker
- becomes the performance limiting factor on high-performance storage.
- .Pp
- With the default setting of
- .Sy 16 MiB ,
- we typically see less than
- .Em 500
- iterations, even with very fragmented
- .Sy ashift Ns = Ns Sy 9
- pools.
- The maximum number of iterations possible is
- .Sy metaslab_df_max_search / 2^(ashift+1) .
- With the default setting of
- .Sy 16 MiB
- this is
- .Em 16*1024 Pq with Sy ashift Ns = Ns Sy 9
- or
- .Em 2*1024 Pq with Sy ashift Ns = Ns Sy 12 .
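- .Pp
- The iteration bound quoted above follows directly from the formula; a small
- worked check using the same values as the text:
- .Bd -literal -compact
- # Maximum iterations = metaslab_df_max_search / 2^(ashift+1)
- metaslab_df_max_search = 16 * 2**20        # 16 MiB
- print(metaslab_df_max_search // 2**(9+1))  # 16384 = 16*1024 (ashift=9)
- print(metaslab_df_max_search // 2**(12+1)) # 2048  = 2*1024  (ashift=12)
- .Ed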
- .
- .It Sy metaslab_df_use_largest_segment Ns = Ns Sy 0 Ns | Ns 1 Pq int
- If not searching forward (due to
- .Sy metaslab_df_max_search , metaslab_df_free_pct ,
- .No or Sy metaslab_df_alloc_threshold ) ,
- this tunable controls which segment is used.
- If set, we will use the largest free segment.
- If unset, we will use a segment of at least the requested size.
- .
- .It Sy zfs_metaslab_max_size_cache_sec Ns = Ns Sy 3600 Ns s Po 1 hour Pc Pq u64
- When we unload a metaslab, we cache the size of the largest free chunk.
- We use that cached size to determine whether or not to load a metaslab
- for a given allocation.
- As more frees accumulate in that metaslab while it's unloaded,
- the cached max size becomes less and less accurate.
- After a number of seconds controlled by this tunable,
- we stop considering the cached max size and start
- considering only the histogram instead.
- .
- .It Sy zfs_metaslab_mem_limit Ns = Ns Sy 25 Ns % Pq uint
- When we are loading a new metaslab, we check the amount of memory being used
- to store metaslab range trees.
- If it is over a threshold, we attempt to unload the least recently used metaslab
- to prevent the system from clogging all of its memory with range trees.
- This tunable sets the percentage of total system memory that is the threshold.
- .
- .It Sy zfs_metaslab_try_hard_before_gang Ns = Ns Sy 0 Ns | Ns 1 Pq int
- .Bl -item -compact
- .It
- If unset, we will first try normal allocation.
- .It
- If that fails then we will do a gang allocation.
- .It
- If that fails then we will do a "try hard" gang allocation.
- .It
- If that fails then we will have a multi-layer gang block.
- .El
- .Pp
- .Bl -item -compact
- .It
- If set, we will first try normal allocation.
- .It
- If that fails then we will do a "try hard" allocation.
- .It
- If that fails we will do a gang allocation.
- .It
- If that fails we will do a "try hard" gang allocation.
- .It
- If that fails then we will have a multi-layer gang block.
- .El
- .
- .It Sy zfs_metaslab_find_max_tries Ns = Ns Sy 100 Pq uint
- When not trying hard, we only consider this number of the best metaslabs.
- This improves performance, especially when there are many metaslabs per vdev
- and the allocation can't actually be satisfied
- (so we would otherwise iterate all metaslabs).
- .
- .It Sy zfs_vdev_default_ms_count Ns = Ns Sy 200 Pq uint
- When a vdev is added, target this number of metaslabs per top-level vdev.
- .
- .It Sy zfs_vdev_default_ms_shift Ns = Ns Sy 29 Po 512 MiB Pc Pq uint
- Default lower limit for metaslab size.
- .
- .It Sy zfs_vdev_max_ms_shift Ns = Ns Sy 34 Po 16 GiB Pc Pq uint
- Default upper limit for metaslab size.
- .
- .It Sy zfs_vdev_max_auto_ashift Ns = Ns Sy 14 Pq uint
- Maximum ashift used when optimizing for logical \[->] physical sector size on
- new
- top-level vdevs.
- May be increased up to
- .Sy ASHIFT_MAX Po 16 Pc ,
- but this may negatively impact pool space efficiency.
- .
- .It Sy zfs_vdev_direct_write_verify Ns = Ns Sy Linux 1 | FreeBSD 0 Pq uint
- If non-zero, then a Direct I/O write's checksum will be verified every
- time the write is issued and before it is committed to the block pointer.
- In the event the checksum is not valid then the I/O operation will return EIO.
- This module parameter can be used to detect if the
- contents of the user's buffer have changed in the process of doing a Direct I/O
- write.
- It can also help to identify if reported checksum errors are tied to Direct I/O
- writes.
- Each verify error causes a
- .Sy dio_verify_wr
- zevent.
- Direct I/O write checksum verify errors can be seen with
- .Nm zpool Cm status Fl d .
- The default value for this is 1 on Linux, but is 0 for
- .Fx
- because user pages can be placed under write protection in
- .Fx
- before the Direct I/O write is issued.
- .
- .It Sy zfs_vdev_min_auto_ashift Ns = Ns Sy ASHIFT_MIN Po 9 Pc Pq uint
- Minimum ashift used when creating new top-level vdevs.
- .
- .It Sy zfs_vdev_min_ms_count Ns = Ns Sy 16 Pq uint
- Minimum number of metaslabs to create in a top-level vdev.
- .
- .It Sy vdev_validate_skip Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Skip label validation steps during pool import.
- Changing is not recommended unless you know what you're doing
- and are recovering a damaged label.
- .
- .It Sy zfs_vdev_ms_count_limit Ns = Ns Sy 131072 Po 128k Pc Pq uint
- Practical upper limit of total metaslabs per top-level vdev.
- .
- .It Sy metaslab_preload_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable metaslab group preloading.
- .
- .It Sy metaslab_preload_limit Ns = Ns Sy 10 Pq uint
- Maximum number of metaslabs per group to preload.
- .
- .It Sy metaslab_preload_pct Ns = Ns Sy 50 Pq uint
- Percentage of CPUs on which to run the metaslab preload taskq.
- .
- .It Sy metaslab_lba_weighting_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Give more weight to metaslabs with lower LBAs,
- assuming they have greater bandwidth,
- as is typically the case on a modern constant angular velocity disk drive.
- .
- .It Sy metaslab_unload_delay Ns = Ns Sy 32 Pq uint
- After a metaslab is used, we keep it loaded for this many TXGs, to attempt to
- reduce unnecessary reloading.
- Note that both this many TXGs and
- .Sy metaslab_unload_delay_ms
- milliseconds must pass before unloading will occur.
- .
- .It Sy metaslab_unload_delay_ms Ns = Ns Sy 600000 Ns ms Po 10 min Pc Pq uint
- After a metaslab is used, we keep it loaded for this many milliseconds,
- to attempt to reduce unnecessary reloading.
- Note, that both this many milliseconds and
- .Sy metaslab_unload_delay
- TXGs must pass before unloading will occur.
- .
- .It Sy raidz_expand_max_copy_bytes Ns = Ns Sy 160MB Pq ulong
- Max amount of memory to use for RAID-Z expansion I/O.
- This limits how much I/O can be outstanding at once.
- .
- .It Sy raidz_expand_max_reflow_bytes Ns = Ns Sy 0 Pq ulong
- For testing, pause RAID-Z expansion when reflow amount reaches this value.
- .
- .It Sy raidz_io_aggregate_rows Ns = Ns Sy 4 Pq ulong
- For expanded RAID-Z, aggregate reads that have more rows than this.
- .
- .It Sy reference_history Ns = Ns Sy 3 Pq int
- Maximum reference holders being tracked when reference_tracking_enable is
- active.
- .
- .It Sy reference_tracking_enable Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Track reference holders to
- .Sy refcount_t
- objects (debug builds only).
- .
- .It Sy send_holes_without_birth_time Ns = Ns Sy 1 Ns | Ns 0 Pq int
- When set, the
- .Sy hole_birth
- optimization will not be used, and all holes will always be sent during a
- .Nm zfs Cm send .
- This is useful if you suspect your datasets are affected by a bug in
- .Sy hole_birth .
- .
- .It Sy spa_config_path Ns = Ns Pa /etc/zfs/zpool.cache Pq charp
- SPA config file.
- .
- .It Sy spa_asize_inflation Ns = Ns Sy 24 Pq uint
- Multiplication factor used to estimate actual disk consumption from the
- size of data being written.
- The default value is a worst case estimate,
- but lower values may be valid for a given pool depending on its configuration.
- Pool administrators who understand the factors involved
- may wish to specify a more realistic inflation factor,
- particularly if they operate close to quota or capacity limits.
- .
- .It Sy spa_load_print_vdev_tree Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Whether to print the vdev tree in the debugging message buffer during pool
- import.
- .
- .It Sy spa_load_verify_data Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Whether to traverse data blocks during an "extreme rewind"
- .Pq Fl X
- import.
- .Pp
- An extreme rewind import normally performs a full traversal of all
- blocks in the pool for verification.
- If this parameter is unset, the traversal skips non-metadata blocks.
- It can be toggled once the
- import has started to stop or start the traversal of non-metadata blocks.
- .
- .It Sy spa_load_verify_metadata Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Whether to traverse blocks during an "extreme rewind"
- .Pq Fl X
- pool import.
- .Pp
- An extreme rewind import normally performs a full traversal of all
- blocks in the pool for verification.
- If this parameter is unset, the traversal is not performed.
- It can be toggled once the import has started to stop or start the traversal.
- .
- .It Sy spa_load_verify_shift Ns = Ns Sy 4 Po 1/16th Pc Pq uint
- Sets the maximum number of bytes to consume during pool import to the log2
- fraction of the target ARC size.
- .
- .It Sy spa_slop_shift Ns = Ns Sy 5 Po 1/32nd Pc Pq int
- Normally, we don't allow the last
- .Sy 3.2% Pq Sy 1/2^spa_slop_shift
- of space in the pool to be consumed.
- This ensures that we don't run the pool completely out of space,
- due to unaccounted changes (e.g. to the MOS).
- It also limits the worst-case time to allocate space.
- If we have less than this amount of free space,
- most ZPL operations (e.g. write, create) will return
- .Sy ENOSPC .
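- .Pp
- As a sketch of the reserved slop space implied by this shift (ignoring any
- additional clamping the implementation may apply; the 10 TiB pool size is an
- assumed example):
- .Bd -literal -compact
- # Sketch: reserved slop space = pool size / 2^spa_slop_shift
- pool_size = 10 * 2**40               # assumed 10 TiB pool
- spa_slop_shift = 5                   # default: 1/32nd, about 3.2%
- print(pool_size >> spa_slop_shift)   # 343597383680 bytes = 320 GiB
- .Ed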
- .
- .It Sy spa_num_allocators Ns = Ns Sy 4 Pq int
- Determines the number of block allocators to use per spa instance.
- Capped by the number of actual CPUs in the system via
- .Sy spa_cpus_per_allocator .
- .Pp
- Note that setting this value too high could result in performance
- degradation and/or excess fragmentation.
- The set value only applies to pools imported or created afterwards.
- .
- .It Sy spa_cpus_per_allocator Ns = Ns Sy 4 Pq int
- Determines the minimum number of CPUs in the system required per block
- allocator for each spa instance.
- The set value only applies to pools imported or created afterwards.
- .
- .It Sy spa_upgrade_errlog_limit Ns = Ns Sy 0 Pq uint
- Limits the number of on-disk error log entries that will be converted to the
- new format when enabling the
- .Sy head_errlog
- feature.
- The default is to convert all log entries.
- .
- .It Sy vdev_removal_max_span Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq uint
- During top-level vdev removal, chunks of data are copied from the vdev
- which may include free space in order to trade bandwidth for IOPS.
- This parameter determines the maximum span of free space, in bytes,
- which will be included as "unnecessary" data in a chunk of copied data.
- .Pp
- The default value here was chosen to align with
- .Sy zfs_vdev_read_gap_limit ,
- which is a similar concept when doing
- regular reads (but there's no reason it has to be the same).
- .
- .It Sy vdev_file_logical_ashift Ns = Ns Sy 9 Po 512 B Pc Pq u64
- Logical ashift for file-based devices.
- .
- .It Sy vdev_file_physical_ashift Ns = Ns Sy 9 Po 512 B Pc Pq u64
- Physical ashift for file-based devices.
- .
- .It Sy zap_iterate_prefetch Ns = Ns Sy 1 Ns | Ns 0 Pq int
- If set, when we start iterating over a ZAP object,
- prefetch the entire object (all leaf blocks).
- However, this is limited by
- .Sy dmu_prefetch_max .
- .
- .It Sy zap_micro_max_size Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq int
- Maximum micro ZAP size.
- A "micro" ZAP is upgraded to a "fat" ZAP once it grows beyond the specified
- size.
- Sizes higher than 128 KiB will be clamped to 128 KiB unless the
- .Sy large_microzap
- feature is enabled.
- .
- .It Sy zap_shrink_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- If set, adjacent empty ZAP blocks will be collapsed, reducing disk space.
- .
- .It Sy zfetch_min_distance Ns = Ns Sy 4194304 Ns B Po 4 MiB Pc Pq uint
- Min bytes to prefetch per stream.
- Prefetch distance starts from the demand access size and quickly grows to
- this value, doubling on each hit.
- After that it may grow further by 1/8 per hit, but only if some prefetches
- since the last time have not completed in time to satisfy the demand request,
- i.e. the prefetch depth did not cover the read latency or the pool got saturated.
- .
- .It Sy zfetch_max_distance Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq uint
- Max bytes to prefetch per stream.
- .
- .It Sy zfetch_max_idistance Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq uint
- Max bytes to prefetch indirects for per stream.
- .
- .It Sy zfetch_max_reorder Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
- Requests within this byte distance from the current prefetch stream position
- are considered parts of the stream, reordered due to parallel processing.
- Such requests do not advance the stream position immediately unless the
- .Sy zfetch_hole_shift
- fill threshold is reached, but are saved to fill holes in the stream later.
- .
- .It Sy zfetch_max_streams Ns = Ns Sy 8 Pq uint
- Max number of streams per zfetch (prefetch streams per file).
- .
- .It Sy zfetch_min_sec_reap Ns = Ns Sy 1 Pq uint
- Minimum time in seconds before an inactive prefetch stream can be reclaimed.
- .
- .It Sy zfetch_max_sec_reap Ns = Ns Sy 2 Pq uint
- Maximum time in seconds before an inactive prefetch stream can be deleted.
- .
- .It Sy zfs_abd_scatter_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enables the ARC to use scatter/gather lists.
- Disabling this forces all allocations to be linear in kernel memory,
- which can improve performance in some code paths
- at the expense of fragmented kernel memory.
- .
- .It Sy zfs_abd_scatter_max_order Ns = Ns Sy MAX_ORDER\-1 Pq uint
- Maximum number of consecutive memory pages allocated in a single block for
- scatter/gather lists.
- .Pp
- The value of
- .Sy MAX_ORDER
- depends on kernel configuration.
- .
- .It Sy zfs_abd_scatter_min_size Ns = Ns Sy 1536 Ns B Po 1.5 KiB Pc Pq uint
- This is the minimum allocation size that will use scatter (page-based) ABDs.
- Smaller allocations will use linear ABDs.
- .
- .It Sy zfs_arc_dnode_limit Ns = Ns Sy 0 Ns B Pq u64
- When the number of bytes consumed by dnodes in the ARC exceeds this number of
- bytes, try to unpin some of it in response to demand for non-metadata.
- This value acts as a ceiling to the amount of dnode metadata, and defaults to
- .Sy 0 ,
- which indicates that a percentage of the ARC meta buffers, determined by
- .Sy zfs_arc_dnode_limit_percent ,
- may be used for dnodes.
- .
- .It Sy zfs_arc_dnode_limit_percent Ns = Ns Sy 10 Ns % Pq u64
- Percentage that can be consumed by dnodes of ARC meta buffers.
- .Pp
- See also
- .Sy zfs_arc_dnode_limit ,
- which serves a similar purpose but has a higher priority if nonzero.
- .
- .It Sy zfs_arc_dnode_reduce_percent Ns = Ns Sy 10 Ns % Pq u64
- Percentage of ARC dnodes to try to scan in response to demand for non-metadata
- when the number of bytes consumed by dnodes exceeds
- .Sy zfs_arc_dnode_limit .
- .
- .It Sy zfs_arc_average_blocksize Ns = Ns Sy 8192 Ns B Po 8 KiB Pc Pq uint
- The ARC's buffer hash table is sized based on the assumption of an average
- block size of this value.
- This works out to roughly 1 MiB of hash table per 1 GiB of physical memory
- with 8-byte pointers.
- For configurations with a known larger average block size,
- this value can be increased to reduce the memory footprint.
- .
- .It Sy zfs_arc_eviction_pct Ns = Ns Sy 200 Ns % Pq uint
- When
- .Fn arc_is_overflowing ,
- .Fn arc_get_data_impl
- waits for this percent of the requested amount of data to be evicted.
- For example, by default, for every
- .Em 2 KiB
- that's evicted,
- .Em 1 KiB
- of it may be "reused" by a new allocation.
- Since this is above
- .Sy 100 Ns % ,
- it ensures that progress is made towards getting
- .Sy arc_size No under Sy arc_c .
- Since this is finite, it ensures that allocations can still happen,
- even during the potentially long time that
- .Sy arc_size No is more than Sy arc_c .
- .
- .It Sy zfs_arc_evict_batch_limit Ns = Ns Sy 10 Pq uint
- Number of ARC headers to evict per sub-list before proceeding to another sub-list.
- This batch-style operation prevents entire sub-lists from being evicted at once
- but comes at a cost of additional unlocking and locking.
- .
- .It Sy zfs_arc_grow_retry Ns = Ns Sy 0 Ns s Pq uint
- If set to a non-zero value, it will replace the
- .Sy arc_grow_retry
- value with this value.
- The
- .Sy arc_grow_retry
- .No value Pq default Sy 5 Ns s
- is the number of seconds the ARC will wait before
- trying to resume growth after a memory pressure event.
- .
- .It Sy zfs_arc_lotsfree_percent Ns = Ns Sy 10 Ns % Pq int
- Throttle I/O when free system memory drops below this percentage of total
- system memory.
- Setting this value to
- .Sy 0
- will disable the throttle.
- .
- .It Sy zfs_arc_max Ns = Ns Sy 0 Ns B Pq u64
- Max size of ARC in bytes.
- If
- .Sy 0 ,
- then the max size of ARC is determined by the amount of system memory installed.
- The larger of
- .Sy all_system_memory No \- Sy 1 GiB
- and
- .Sy 5/8 No \(mu Sy all_system_memory
- will be used as the limit.
- This value must be at least
- .Sy 67108864 Ns B Pq 64 MiB .
- .Pp
- This value can be changed dynamically, with some caveats.
- It cannot be set back to
- .Sy 0
- while running, and reducing it below the current ARC size will not cause
- the ARC to shrink without memory pressure to induce shrinking.
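- .Pp
- A minimal sketch of the default limit computation described above, assuming
- an example system with 16 GiB of memory:
- .Bd -literal -compact
- # Sketch: default ARC max = larger of (all memory - 1 GiB) and 5/8 of memory
- all_system_memory = 16 * 2**30        # assumed 16 GiB
- limit = max(all_system_memory - 2**30, all_system_memory * 5 // 8)
- print(limit)                          # 16106127360 bytes = 15 GiB
- .Ed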
- .
- .It Sy zfs_arc_meta_balance Ns = Ns Sy 500 Pq uint
- Balance between metadata and data on ghost hits.
- Values above 100 increase metadata caching by proportionally reducing effect
- of ghost data hits on target data/metadata rate.
- .
- .It Sy zfs_arc_min Ns = Ns Sy 0 Ns B Pq u64
- Min size of ARC in bytes.
- .No If set to Sy 0 , arc_c_min
- will default to consuming the larger of
- .Sy 32 MiB
- and
- .Sy all_system_memory No / Sy 32 .
- .
- .It Sy zfs_arc_min_prefetch_ms Ns = Ns Sy 0 Ns ms Ns Po Ns ≡ Ns 1s Pc Pq uint
- Minimum time prefetched blocks are locked in the ARC.
- .
- .It Sy zfs_arc_min_prescient_prefetch_ms Ns = Ns Sy 0 Ns ms Ns Po Ns ≡ Ns 6s Pc Pq uint
- Minimum time "prescient prefetched" blocks are locked in the ARC.
- These blocks are meant to be prefetched fairly aggressively ahead of
- the code that may use them.
- .
- .It Sy zfs_arc_prune_task_threads Ns = Ns Sy 1 Pq int
- Number of arc_prune threads.
- .Fx
- does not need more than one.
- Linux may theoretically use one per mount point up to number of CPUs,
- but that was not proven to be useful.
- .
- .It Sy zfs_max_missing_tvds Ns = Ns Sy 0 Pq int
- Number of missing top-level vdevs which will be allowed during
- pool import (only in read-only mode).
- .
- .It Sy zfs_max_nvlist_src_size Ns = Ns Sy 0 Pq u64
- Maximum size in bytes allowed to be passed as
- .Sy zc_nvlist_src_size
- for ioctls on
- .Pa /dev/zfs .
- This prevents a user from causing the kernel to allocate
- an excessive amount of memory.
- When the limit is exceeded, the ioctl fails with
- .Sy EINVAL
- and a description of the error is sent to the
- .Pa zfs-dbgmsg
- log.
- This parameter should not need to be touched under normal circumstances.
- If
- .Sy 0 ,
- equivalent to a quarter of the user-wired memory limit under
- .Fx
- and to
- .Sy 134217728 Ns B Pq 128 MiB
- under Linux.
- .
- .It Sy zfs_multilist_num_sublists Ns = Ns Sy 0 Pq uint
- To allow more fine-grained locking, each ARC state contains a series
- of lists for both data and metadata objects.
- Locking is performed at the level of these "sub-lists".
- This parameter controls the number of sub-lists per ARC state,
- and also applies to other uses of the multilist data structure.
- .Pp
- If
- .Sy 0 ,
- equivalent to the greater of the number of online CPUs and
- .Sy 4 .
- .
- .It Sy zfs_arc_overflow_shift Ns = Ns Sy 8 Pq int
- The ARC size is considered to be overflowing if it exceeds the current
- ARC target size
- .Pq Sy arc_c
- by thresholds determined by this parameter.
- Exceeding by
- .Sy ( arc_c No >> Sy zfs_arc_overflow_shift ) No / Sy 2
- starts the ARC reclamation process.
- If that appears insufficient, exceeding by
- .Sy ( arc_c No >> Sy zfs_arc_overflow_shift ) No \(mu Sy 1.5
- blocks new buffer allocation until the reclaim thread catches up.
- Once started, the reclamation process continues until the ARC size returns
- below the target size.
- .Pp
- The default value of
- .Sy 8
- causes the ARC to start reclamation if it exceeds the target size by
- .Em 0.2%
- of the target size, and block allocations by
- .Em 0.6% .
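- .Pp
- The percentages quoted above follow from the shift; a small worked sketch
- with an assumed target size of 8 GiB:
- .Bd -literal -compact
- # Sketch: ARC overflow thresholds derived from zfs_arc_overflow_shift
- arc_c = 8 * 2**30                   # assumed ARC target size
- overflow = arc_c >> 8               # zfs_arc_overflow_shift = 8
- soft = overflow // 2                # reclamation starts past this excess
- hard = overflow * 3 // 2            # allocations block past this excess
- print(soft / arc_c * 100)           # about 0.195%, i.e. the quoted 0.2%
- print(hard / arc_c * 100)           # about 0.586%, i.e. the quoted 0.6%
- .Ed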
- .
- .It Sy zfs_arc_shrink_shift Ns = Ns Sy 0 Pq uint
- If nonzero, this will update
- .Sy arc_shrink_shift Pq default Sy 7
- with the new value.
- .
- .It Sy zfs_arc_pc_percent Ns = Ns Sy 0 Ns % Po off Pc Pq uint
- Percent of pagecache to reclaim ARC to.
- .Pp
- This tunable allows the ZFS ARC to play more nicely
- with the kernel's LRU pagecache.
- It can guarantee that the ARC size won't collapse under scanning
- pressure on the pagecache, yet still allows the ARC to be reclaimed down to
- .Sy zfs_arc_min
- if necessary.
- This value is specified as percent of pagecache size (as measured by
- .Sy NR_FILE_PAGES ) ,
- where that percent may exceed
- .Sy 100 .
- This
- only operates during memory pressure/reclaim.
- .
- .It Sy zfs_arc_shrinker_limit Ns = Ns Sy 0 Pq int
- This is a limit on how many pages the ARC shrinker makes available for
- eviction in response to one page allocation attempt.
- Note that in practice, the kernel's shrinker can ask us to evict
- up to about four times this for one allocation attempt.
- To reduce OOM risk, this limit is applied for kswapd reclaims only.
- .Pp
- For example a value of
- .Sy 10000 Pq in practice, Em 160 MiB No per allocation attempt with 4 KiB pages
- limits the amount of time spent attempting to reclaim ARC memory to
- less than 100 ms per allocation attempt,
- even with a small average compressed block size of ~8 KiB.
- .Pp
- The parameter can be set to 0 (zero) to disable the limit,
- and only applies on Linux.
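- .Pp
- The figure in the example above is simple arithmetic; a hedged sketch with
- 4 KiB pages and the roughly four-fold factor mentioned earlier:
- .Bd -literal -compact
- # Sketch: worst-case bytes per allocation attempt for the quoted example
- zfs_arc_shrinker_limit = 10000      # pages offered per shrinker call
- page_size = 4096                    # 4 KiB pages
- worst_case = 4 * zfs_arc_shrinker_limit * page_size   # kernel may ask ~4x
- print(worst_case)                   # 163840000 bytes, the ~160 MiB above
- .Ed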
- .
- .It Sy zfs_arc_shrinker_seeks Ns = Ns Sy 2 Pq int
- Relative cost of ARC eviction on Linux, AKA number of seeks needed to
- restore evicted page.
- Bigger values make the ARC more precious and evictions smaller, compared to
- other kernel subsystems.
- A value of 4 means parity with the page cache.
- .
- .It Sy zfs_arc_sys_free Ns = Ns Sy 0 Ns B Pq u64
- The target number of bytes the ARC should leave as free memory on the system.
- If zero, equivalent to the bigger of
- .Sy 512 KiB No and Sy all_system_memory/64 .
- .
- .It Sy zfs_autoimport_disable Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Disable pool import at module load by ignoring the cache file
- .Pq Sy spa_config_path .
- .
- .It Sy zfs_checksum_events_per_second Ns = Ns Sy 20 Ns /s Pq uint
- Rate limit checksum events to this many per second.
- Note that this should not be set below the ZED thresholds
- (currently 10 checksums over 10 seconds)
- or else the daemon may not trigger any action.
- .
- .It Sy zfs_commit_timeout_pct Ns = Ns Sy 10 Ns % Pq uint
- This controls the amount of time that a ZIL block (lwb) will remain "open"
- when it isn't "full", and it has a thread waiting for it to be committed to
- stable storage.
- The timeout is scaled based on a percentage of the last lwb
- latency to avoid significantly impacting the latency of each individual
- transaction record (itx).
- .
- .It Sy zfs_condense_indirect_commit_entry_delay_ms Ns = Ns Sy 0 Ns ms Pq int
- Vdev indirection layer (used for device removal) sleeps for this many
- milliseconds during mapping generation.
- Intended for use with the test suite to throttle vdev removal speed.
- .
- .It Sy zfs_condense_indirect_obsolete_pct Ns = Ns Sy 25 Ns % Pq uint
- Minimum percent of obsolete bytes in vdev mapping required to attempt to
- condense
- .Pq see Sy zfs_condense_indirect_vdevs_enable .
- Intended for use with the test suite
- to facilitate triggering condensing as needed.
- .
- .It Sy zfs_condense_indirect_vdevs_enable Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable condensing indirect vdev mappings.
- When set, attempt to condense indirect vdev mappings
- if the mapping uses more than
- .Sy zfs_condense_min_mapping_bytes
- bytes of memory and if the obsolete space map object uses more than
- .Sy zfs_condense_max_obsolete_bytes
- bytes on-disk.
- The condensing process is an attempt to save memory by removing obsolete
- mappings.
- .
- .It Sy zfs_condense_max_obsolete_bytes Ns = Ns Sy 1073741824 Ns B Po 1 GiB Pc Pq u64
- Only attempt to condense indirect vdev mappings if the on-disk size
- of the obsolete space map object is greater than this number of bytes
- .Pq see Sy zfs_condense_indirect_vdevs_enable .
- .
- .It Sy zfs_condense_min_mapping_bytes Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq u64
- Minimum size vdev mapping to attempt to condense
- .Pq see Sy zfs_condense_indirect_vdevs_enable .
- .
- .It Sy zfs_dbgmsg_enable Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Internally ZFS keeps a small log to facilitate debugging.
- The log is enabled by default, and can be disabled by unsetting this option.
- The contents of the log can be accessed by reading
- .Pa /proc/spl/kstat/zfs/dbgmsg .
- Writing
- .Sy 0
- to the file clears the log.
- .Pp
- This setting does not influence debug prints due to
- .Sy zfs_flags .
- .
- .It Sy zfs_dbgmsg_maxsize Ns = Ns Sy 4194304 Ns B Po 4 MiB Pc Pq uint
- Maximum size of the internal ZFS debug log.
- .
- .It Sy zfs_dbuf_state_index Ns = Ns Sy 0 Pq int
- Historically used for controlling what reporting was available under
- .Pa /proc/spl/kstat/zfs .
- No effect.
- .
- .It Sy zfs_deadman_checktime_ms Ns = Ns Sy 60000 Ns ms Po 1 min Pc Pq u64
- Check time in milliseconds.
- This defines the frequency at which we check for hung I/O requests
- and potentially invoke the
- .Sy zfs_deadman_failmode
- behavior.
- .
- .It Sy zfs_deadman_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- When a pool sync operation takes longer than
- .Sy zfs_deadman_synctime_ms ,
- or when an individual I/O operation takes longer than
- .Sy zfs_deadman_ziotime_ms ,
- then the operation is considered to be "hung".
- If
- .Sy zfs_deadman_enabled
- is set, then the deadman behavior is invoked as described by
- .Sy zfs_deadman_failmode .
- By default, the deadman is enabled and set to
- .Sy wait
- which results in "hung" I/O operations only being logged.
- The deadman is automatically disabled when a pool gets suspended.
- .
- .It Sy zfs_deadman_events_per_second Ns = Ns Sy 1 Ns /s Pq int
- Rate limit deadman zevents (which report hung I/O operations) to this many per
- second.
- .
- .It Sy zfs_deadman_failmode Ns = Ns Sy wait Pq charp
- Controls the failure behavior when the deadman detects a "hung" I/O operation.
- Valid values are:
- .Bl -tag -compact -offset 4n -width "continue"
- .It Sy wait
- Wait for a "hung" operation to complete.
- For each "hung" operation a "deadman" event will be posted
- describing that operation.
- .It Sy continue
- Attempt to recover from a "hung" operation by re-dispatching it
- to the I/O pipeline if possible.
- .It Sy panic
- Panic the system.
- This can be used to facilitate automatic fail-over
- to a properly configured fail-over partner.
- .El
- .
- .It Sy zfs_deadman_synctime_ms Ns = Ns Sy 600000 Ns ms Po 10 min Pc Pq u64
- Interval in milliseconds after which the deadman is triggered and also
- the interval after which a pool sync operation is considered to be "hung".
- Once this limit is exceeded the deadman will be invoked every
- .Sy zfs_deadman_checktime_ms
- milliseconds until the pool sync completes.
- .
- .It Sy zfs_deadman_ziotime_ms Ns = Ns Sy 300000 Ns ms Po 5 min Pc Pq u64
- Interval in milliseconds after which the deadman is triggered and an
- individual I/O operation is considered to be "hung".
- As long as the operation remains "hung",
- the deadman will be invoked every
- .Sy zfs_deadman_checktime_ms
- milliseconds until the operation completes.
- .
- .It Sy zfs_dedup_prefetch Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Enable prefetching dedup-ed blocks which are going to be freed.
- .
- .It Sy zfs_dedup_log_flush_passes_max Ns = Ns Sy 8 Ns Pq uint
- Maximum number of dedup log flush passes (iterations) each transaction.
- .Pp
- At the start of each transaction, OpenZFS will estimate how many entries it
- needs to flush out to keep up with the change rate, taking the amount and time
- taken to flush on previous txgs into account (see
- .Sy zfs_dedup_log_flush_flow_rate_txgs ) .
- It will spread this amount into a number of passes.
- At each pass, it will use the amount already flushed and the total time taken
- by flushing and by other IO to recompute how much it should do for the remainder
- of the txg.
- .Pp
- Reducing the max number of passes will make flushing more aggressive, flushing
- out more entries on each pass.
- This can be faster, but also more likely to compete with other IO.
- Increasing the max number of passes will put fewer entries onto each pass,
- keeping the overhead of dedup changes to a minimum but possibly causing a large
- number of changes to be dumped on the last pass, which can blow out the txg
- sync time beyond
- .Sy zfs_txg_timeout .
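- .Pp
- The pass-splitting behaviour can be pictured with the following simplified
- sketch; it is purely illustrative and not the in-kernel logic:
- .Bd -literal -compact
- # Sketch: spread an estimated flush quota over the remaining passes,
- # recomputing after each pass based on what has already been flushed.
- def entries_for_pass(estimated_total, flushed_so_far, passes_left):
-     remaining = max(0, estimated_total - flushed_so_far)
-     return remaining // max(1, passes_left)
- print(entries_for_pass(8000, 0, 8))     # 1000 entries on the first pass
- print(entries_for_pass(8000, 3000, 5))  # 1000 entries with 5 passes left
- .Ed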
- .
- .It Sy zfs_dedup_log_flush_min_time_ms Ns = Ns Sy 1000 Ns Pq uint
- Minimum time to spend on dedup log flush each transaction.
- .Pp
- At least this long will be spent flushing dedup log entries each transaction,
- up to
- .Sy zfs_txg_timeout .
- This occurs even if doing so would delay the transaction, that is, even when
- other I/O completes in less than this time.
- .
- .It Sy zfs_dedup_log_flush_entries_min Ns = Ns Sy 1000 Ns Pq uint
- Flush at least this many entries each transaction.
- .Pp
- OpenZFS will estimate how many entries it needs to flush each transaction to
- keep up with the ingest rate (see
- .Sy zfs_dedup_log_flush_flow_rate_txgs ) .
- This sets the minimum for that estimate.
- Raising it can force OpenZFS to flush more aggressively, keeping the log small
- and so reducing pool import times, but can make it less able to back off if
- log flushing would compete with other IO too much.
- .
- .It Sy zfs_dedup_log_flush_flow_rate_txgs Ns = Ns Sy 10 Ns Pq uint
- Number of transactions to use to compute the flow rate.
- .Pp
- OpenZFS will estimate how many entries it needs to flush each transaction by
- monitoring the number of entries changed (ingest rate), number of entries
- flushed (flush rate) and time spent flushing (flush time rate) and combining
- these into an overall "flow rate".
- It will use an exponential weighted moving average over some number of recent
- transactions to compute these rates.
- This sets the number of transactions to compute these averages over.
- Setting it higher can help to smooth out the flow rate in the face of spiky
- workloads, but will take longer for the flow rate to adjust to a sustained
- change in the ingress rate.
- .
- .It Sy zfs_dedup_log_txg_max Ns = Ns Sy 8 Ns Pq uint
- Maximum number of transactions to accumulate before starting to flush dedup logs.
- .Pp
- OpenZFS maintains two dedup logs, one receiving new changes, one flushing.
- If there is nothing to flush, it will accumulate changes for no more than this
- many transactions before switching the logs and starting to flush entries out.
- .
- .It Sy zfs_dedup_log_mem_max Ns = Ns Sy 0 Ns Pq u64
- Max memory to use for dedup logs.
- .Pp
- OpenZFS will spend no more than this much memory on maintaining the in-memory
- dedup log.
- Flushing will begin when around half this amount is being spent on logs.
- The default value of
- .Sy 0
- will cause it to be set by
- .Sy zfs_dedup_log_mem_max_percent
- instead.
- .
- .It Sy zfs_dedup_log_mem_max_percent Ns = Ns Sy 1 Ns % Pq uint
- Max memory to use for dedup logs, as a percentage of total memory.
- .Pp
- If
- .Sy zfs_dedup_log_mem_max
- is not set, it will be initialised as a percentage of the total memory in the
- system.
- .
- .It Sy zfs_delay_min_dirty_percent Ns = Ns Sy 60 Ns % Pq uint
- Start to delay each transaction once there is this amount of dirty data,
- expressed as a percentage of
- .Sy zfs_dirty_data_max .
- This value should be at least
- .Sy zfs_vdev_async_write_active_max_dirty_percent .
- .No See Sx ZFS TRANSACTION DELAY .
- .
- .It Sy zfs_delay_scale Ns = Ns Sy 500000 Pq int
- This controls how quickly the transaction delay approaches infinity.
- Larger values cause longer delays for a given amount of dirty data.
- .Pp
- For the smoothest delay, this value should be about 1 billion divided
- by the maximum number of operations per second.
- This will smoothly handle between ten times and a tenth of this number.
- .No See Sx ZFS TRANSACTION DELAY .
- .Pp
- .Sy zfs_delay_scale No \(mu Sy zfs_dirty_data_max Em must No be smaller than Sy 2^64 .
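- .Pp
- The "1 billion divided by the maximum operations per second" guidance above
- reduces to simple arithmetic; for example, an assumed backend capable of
- 2000 operations per second suggests the default value:
- .Bd -literal -compact
- # Sketch: suggested zfs_delay_scale for a given peak operation rate
- max_ops_per_second = 2000                     # assumed backend capability
- print(1_000_000_000 // max_ops_per_second)    # 500000, the default
- .Ed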
- .
- .It Sy zfs_dio_write_verify_events_per_second Ns = Ns Sy 20 Ns /s Pq uint
- Rate limit Direct I/O write verify events to this many per second.
- .
- .It Sy zfs_disable_ivset_guid_check Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disables requirement for IVset GUIDs to be present and match when doing a raw
- receive of encrypted datasets.
- Intended for users whose pools were created with
- OpenZFS pre-release versions and now have compatibility issues.
- .
- .It Sy zfs_key_max_salt_uses Ns = Ns Sy 400000000 Po 4*10^8 Pc Pq ulong
- Maximum number of uses of a single salt value before generating a new one for
- encrypted datasets.
- The default value is also the maximum.
- .
- .It Sy zfs_object_mutex_size Ns = Ns Sy 64 Pq uint
- Size of the znode hashtable used for holds.
- .Pp
- Due to the need to hold locks on objects that may not exist yet, kernel mutexes
- are not created per-object and instead a hashtable is used where collisions
- will result in objects waiting when there is not actually contention on the
- same object.
- .
- .It Sy zfs_slow_io_events_per_second Ns = Ns Sy 20 Ns /s Pq int
- Rate limit delay zevents (which report slow I/O operations) to this many per
- second.
- .
- .It Sy zfs_unflushed_max_mem_amt Ns = Ns Sy 1073741824 Ns B Po 1 GiB Pc Pq u64
- Upper-bound limit for unflushed metadata changes to be held by the
- log spacemap in memory, in bytes.
- .
- .It Sy zfs_unflushed_max_mem_ppm Ns = Ns Sy 1000 Ns ppm Po 0.1% Pc Pq u64
- Part of overall system memory that ZFS allows to be used
- for unflushed metadata changes by the log spacemap, in millionths.
- .
- .It Sy zfs_unflushed_log_block_max Ns = Ns Sy 131072 Po 128k Pc Pq u64
- Describes the maximum number of log spacemap blocks allowed for each pool.
- The default value means that the space in all the log spacemaps
- can add up to no more than
- .Sy 131072
- blocks (which means
- .Em 16 GiB
- of logical space before compression and ditto blocks,
- assuming that blocksize is
- .Em 128 KiB ) .
- .Pp
- This tunable is important because it involves a trade-off between import
- time after an unclean export and the frequency of flushing metaslabs.
- The higher this number is, the more log blocks we allow when the pool is
- active which means that we flush metaslabs less often and thus decrease
- the number of I/O operations for spacemap updates per TXG.
- At the same time though, that means that in the event of an unclean export,
- there will be more log spacemap blocks for us to read, inducing overhead
- in the import time of the pool.
- The lower the number, the more often flushing occurs and the sooner log
- blocks are destroyed as they become obsolete, which leaves fewer blocks
- to be read during import time after a crash.
- .Pp
- Each log spacemap block existing during pool import leads to approximately
- one extra logical I/O issued.
- This is the reason why this tunable is exposed in terms of blocks rather
- than space used.
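- .Pp
- The 16 GiB figure above is just the block count multiplied by the assumed
- 128 KiB block size:
- .Bd -literal -compact
- # Sketch: logical space covered by the default log spacemap block budget
- zfs_unflushed_log_block_max = 131072   # blocks
- blocksize = 128 * 2**10                # 128 KiB, as assumed in the text
- print(zfs_unflushed_log_block_max * blocksize)  # 17179869184 bytes = 16 GiB
- .Ed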
- .
- .It Sy zfs_unflushed_log_block_min Ns = Ns Sy 1000 Pq u64
- If the number of metaslabs is small and our incoming rate is high,
- we could get into a situation in which we are flushing all our metaslabs every TXG.
- Thus we always allow at least this many log blocks.
- .
- .It Sy zfs_unflushed_log_block_pct Ns = Ns Sy 400 Ns % Pq u64
- Tunable used to determine the number of blocks that can be used for
- the spacemap log, expressed as a percentage of the total number of
- unflushed metaslabs in the pool.
- .
- .It Sy zfs_unflushed_log_txg_max Ns = Ns Sy 1000 Pq u64
- Tunable limiting maximum time in TXGs any metaslab may remain unflushed.
- It effectively limits maximum number of unflushed per-TXG spacemap logs
- that need to be read after unclean pool export.
- .
- .It Sy zfs_unlink_suspend_progress Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- When enabled, files will not be asynchronously removed from the list of pending
- unlinks and the space they consume will be leaked.
- Once this option has been disabled and the dataset is remounted,
- the pending unlinks will be processed and the freed space returned to the pool.
- This option is used by the test suite.
- .
- .It Sy zfs_delete_blocks Ns = Ns Sy 20480 Pq ulong
- This is used to define a large file for the purposes of deletion.
- Files containing more than
- .Sy zfs_delete_blocks
- blocks will be deleted asynchronously,
- while smaller files are deleted synchronously.
- Decreasing this value will reduce the time spent in an
- .Xr unlink 2
- system call, at the expense of a longer delay before the freed space is
- available.
- This only applies on Linux.
- .
- .It Sy zfs_dirty_data_max Ns = Pq int
- Determines the dirty space limit in bytes.
- Once this limit is exceeded, new writes are halted until space frees up.
- This parameter takes precedence over
- .Sy zfs_dirty_data_max_percent .
- .No See Sx ZFS TRANSACTION DELAY .
- .Pp
- Defaults to
- .Sy physical_ram/10 ,
- capped at
- .Sy zfs_dirty_data_max_max .
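- .Pp
- A minimal sketch of how the default is derived, assuming a 64 GiB system and
- the default cap described under
- .Sy zfs_dirty_data_max_max :
- .Bd -literal -compact
- # Sketch: default dirty data limit = min(RAM/10, zfs_dirty_data_max_max)
- physical_ram = 64 * 2**30                            # assumed 64 GiB
- dirty_max_max = min(physical_ram // 4, 4 * 2**30)    # 4 GiB cap here
- print(min(physical_ram // 10, dirty_max_max))        # 4294967296 B = 4 GiB
- .Ed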
- .
- .It Sy zfs_dirty_data_max_max Ns = Pq int
- Maximum allowable value of
- .Sy zfs_dirty_data_max ,
- expressed in bytes.
- This limit is only enforced at module load time, and will be ignored if
- .Sy zfs_dirty_data_max
- is later changed.
- This parameter takes precedence over
- .Sy zfs_dirty_data_max_max_percent .
- .No See Sx ZFS TRANSACTION DELAY .
- .Pp
- Defaults to
- .Sy min(physical_ram/4, 4GiB) ,
- or
- .Sy min(physical_ram/4, 1GiB)
- for 32-bit systems.
- .
- .It Sy zfs_dirty_data_max_max_percent Ns = Ns Sy 25 Ns % Pq uint
- Maximum allowable value of
- .Sy zfs_dirty_data_max ,
- expressed as a percentage of physical RAM.
- This limit is only enforced at module load time, and will be ignored if
- .Sy zfs_dirty_data_max
- is later changed.
- The parameter
- .Sy zfs_dirty_data_max_max
- takes precedence over this one.
- .No See Sx ZFS TRANSACTION DELAY .
- .
- .It Sy zfs_dirty_data_max_percent Ns = Ns Sy 10 Ns % Pq uint
- Determines the dirty space limit, expressed as a percentage of all memory.
- Once this limit is exceeded, new writes are halted until space frees up.
- The parameter
- .Sy zfs_dirty_data_max
- takes precedence over this one.
- .No See Sx ZFS TRANSACTION DELAY .
- .Pp
- Subject to
- .Sy zfs_dirty_data_max_max .
- .
- .It Sy zfs_dirty_data_sync_percent Ns = Ns Sy 20 Ns % Pq uint
- Start syncing out a transaction group if there's at least this much dirty data
- .Pq as a percentage of Sy zfs_dirty_data_max .
- This should be less than
- .Sy zfs_vdev_async_write_active_min_dirty_percent .
- .
- .It Sy zfs_wrlog_data_max Ns = Pq int
- The upper limit of write-transaction zil log data size in bytes.
- Write operations are throttled when approaching the limit until log data is
- cleared out after transaction group sync.
- Because of some overhead, it should be set to at least 2 times the size of
- .Sy zfs_dirty_data_max
- .No to prevent harming normal write throughput .
- It also should be smaller than the size of the slog device if slog is present.
- .Pp
- Defaults to
- .Sy zfs_dirty_data_max*2
- .
- .It Sy zfs_fallocate_reserve_percent Ns = Ns Sy 110 Ns % Pq uint
- Since ZFS is a copy-on-write filesystem with snapshots, blocks cannot be
- preallocated for a file in order to guarantee that later writes will not
- run out of space.
- Instead,
- .Xr fallocate 2
- space preallocation only checks that sufficient space is currently available
- in the pool or the user's project quota allocation,
- and then creates a sparse file of the requested size.
- The requested space is multiplied by
- .Sy zfs_fallocate_reserve_percent
- to allow additional space for indirect blocks and other internal metadata.
- Setting this to
- .Sy 0
- disables support for
- .Xr fallocate 2
- and causes it to return
- .Sy EOPNOTSUPP .
- .
- .It Sy zfs_fletcher_4_impl Ns = Ns Sy fastest Pq string
- Select a fletcher 4 implementation.
- .Pp
- Supported selectors are:
- .Sy fastest , scalar , sse2 , ssse3 , avx2 , avx512f , avx512bw ,
- .No and Sy aarch64_neon .
- All except
- .Sy fastest No and Sy scalar
- require instruction set extensions to be available,
- and will only appear if ZFS detects that they are present at runtime.
- If multiple implementations of fletcher 4 are available, the
- .Sy fastest
- will be chosen using a micro benchmark.
- Selecting
- .Sy scalar
- results in the original CPU-based calculation being used.
- Selecting any option other than
- .Sy fastest No or Sy scalar
- results in vector instructions
- from the respective CPU instruction set being used.
- .
- .It Sy zfs_bclone_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enables access to the block cloning feature.
- If this setting is 0, then even if feature@block_cloning is enabled,
- using functions and system calls that attempt to clone blocks will act as
- though the feature is disabled.
- .
- .It Sy zfs_bclone_wait_dirty Ns = Ns Sy 0 Ns | Ns 1 Pq int
- When set to 1 the FICLONE and FICLONERANGE ioctls wait for dirty data to be
- written to disk.
- This allows the clone operation to reliably succeed when a file is
- modified and then immediately cloned.
- For small files this may be slower than making a copy of the file.
- Therefore, this setting defaults to 0 which causes a clone operation to
- immediately fail when encountering a dirty block.
- .
- .It Sy zfs_blake3_impl Ns = Ns Sy fastest Pq string
- Select a BLAKE3 implementation.
- .Pp
- Supported selectors are:
- .Sy cycle , fastest , generic , sse2 , sse41 , avx2 , avx512 .
- All except
- .Sy cycle , fastest No and Sy generic
- require instruction set extensions to be available,
- and will only appear if ZFS detects that they are present at runtime.
- If multiple implementations of BLAKE3 are available, the
- .Sy fastest
- will be chosen using a micro benchmark.
- You can see the benchmark results by reading this kstat file:
- .Pa /proc/spl/kstat/zfs/chksum_bench .
- .
- .It Sy zfs_free_bpobj_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable/disable the processing of the free_bpobj object.
- .
- .It Sy zfs_async_block_max_blocks Ns = Ns Sy UINT64_MAX Po unlimited Pc Pq u64
- Maximum number of blocks freed in a single TXG.
- .
- .It Sy zfs_max_async_dedup_frees Ns = Ns Sy 100000 Po 10^5 Pc Pq u64
- Maximum number of dedup blocks freed in a single TXG.
- .
- .It Sy zfs_vdev_async_read_max_active Ns = Ns Sy 3 Pq uint
- Maximum asynchronous read I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_async_read_min_active Ns = Ns Sy 1 Pq uint
- Minimum asynchronous read I/O operation active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_async_write_active_max_dirty_percent Ns = Ns Sy 60 Ns % Pq uint
- When the pool has more than this much dirty data, use
- .Sy zfs_vdev_async_write_max_active
- to limit active async writes.
- If the dirty data is between the minimum and maximum,
- the active I/O limit is linearly interpolated.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_async_write_active_min_dirty_percent Ns = Ns Sy 30 Ns % Pq uint
- When the pool has less than this much dirty data, use
- .Sy zfs_vdev_async_write_min_active
- to limit active async writes.
- If the dirty data is between the minimum and maximum,
- the active I/O limit is linearly
- interpolated.
- .No See Sx ZFS I/O SCHEDULER .
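- .Pp
- As a worked example with the defaults (a minimum of 2 and maximum of 10 active
- async writes, and thresholds of 30% and 60%), a pool that is 45% dirty, halfway
- between the two thresholds, is allowed roughly this many active async write
- operations per device:
- .Bd -literal -compact
-     2 + (45 - 30) / (60 - 30) * (10 - 2) = 6
- .Ed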
- .
- .It Sy zfs_vdev_async_write_max_active Ns = Ns Sy 10 Pq uint
- Maximum asynchronous write I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_async_write_min_active Ns = Ns Sy 2 Pq uint
- Minimum asynchronous write I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .Pp
- Lower values are associated with better latency on rotational media but poorer
- resilver performance.
- The default value of
- .Sy 2
- was chosen as a compromise.
- A value of
- .Sy 3
- has been shown to improve resilver performance further at a cost of
- further increasing latency.
- .
- .It Sy zfs_vdev_initializing_max_active Ns = Ns Sy 1 Pq uint
- Maximum initializing I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_initializing_min_active Ns = Ns Sy 1 Pq uint
- Minimum initializing I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_max_active Ns = Ns Sy 1000 Pq uint
- The maximum number of I/O operations active to each device.
- Ideally, this will be at least the sum of each queue's
- .Sy max_active .
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_open_timeout_ms Ns = Ns Sy 1000 Pq uint
- Timeout value to wait before determining a device is missing
- during import.
- This is helpful for transient missing paths due
- to links being briefly removed and recreated in response to
- udev events.
- .
- .It Sy zfs_vdev_rebuild_max_active Ns = Ns Sy 3 Pq uint
- Maximum sequential resilver I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_rebuild_min_active Ns = Ns Sy 1 Pq uint
- Minimum sequential resilver I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_removal_max_active Ns = Ns Sy 2 Pq uint
- Maximum removal I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_removal_min_active Ns = Ns Sy 1 Pq uint
- Minimum removal I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_scrub_max_active Ns = Ns Sy 2 Pq uint
- Maximum scrub I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_scrub_min_active Ns = Ns Sy 1 Pq uint
- Minimum scrub I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_sync_read_max_active Ns = Ns Sy 10 Pq uint
- Maximum synchronous read I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_sync_read_min_active Ns = Ns Sy 10 Pq uint
- Minimum synchronous read I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_sync_write_max_active Ns = Ns Sy 10 Pq uint
- Maximum synchronous write I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_sync_write_min_active Ns = Ns Sy 10 Pq uint
- Minimum synchronous write I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_trim_max_active Ns = Ns Sy 2 Pq uint
- Maximum trim/discard I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_trim_min_active Ns = Ns Sy 1 Pq uint
- Minimum trim/discard I/O operations active to each device.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_nia_delay Ns = Ns Sy 5 Pq uint
- For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
- the number of concurrently-active I/O operations is limited to
- .Sy zfs_*_min_active ,
- unless the vdev is "idle".
- When there are no interactive I/O operations active (synchronous or otherwise),
- and
- .Sy zfs_vdev_nia_delay
- operations have completed since the last interactive operation,
- then the vdev is considered to be "idle",
- and the number of concurrently-active non-interactive operations is increased to
- .Sy zfs_*_max_active .
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_nia_credit Ns = Ns Sy 5 Pq uint
- Some HDDs tend to prioritize sequential I/O so strongly that concurrent
- random I/O latency reaches several seconds.
- On some HDDs this happens even if sequential I/O operations
- are submitted one at a time, and so setting
- .Sy zfs_*_max_active Ns = Sy 1
- does not help.
- To prevent non-interactive I/O, like scrub,
- from monopolizing the device, no more than
- .Sy zfs_vdev_nia_credit
- operations can be sent
- while there are outstanding incomplete interactive operations.
- This enforced wait ensures the HDD services the interactive I/O
- within a reasonable amount of time.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_queue_depth_pct Ns = Ns Sy 1000 Ns % Pq uint
- Maximum number of queued allocations per top-level vdev expressed as
- a percentage of
- .Sy zfs_vdev_async_write_max_active ,
- which allows the system to detect devices that are more capable
- of handling allocations and to allocate more blocks to those devices.
- This allows for dynamic allocation distribution when devices are imbalanced,
- as fuller devices will tend to be slower than empty devices.
- .Pp
- Also see
- .Sy zio_dva_throttle_enabled .
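- .Pp
- With the defaults, this works out to the following number of queued
- allocations per top-level vdev:
- .Bd -literal -compact
-     zfs_vdev_async_write_max_active * zfs_vdev_queue_depth_pct / 100
-         = 10 * 1000 / 100 = 100
- .Ed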
- .
- .It Sy zfs_vdev_def_queue_depth Ns = Ns Sy 32 Pq uint
- Default queue depth for each vdev IO allocator.
- Higher values allow for better coalescing of sequential writes before sending
- them to the disk, but can increase transaction commit times.
- .
- .It Sy zfs_vdev_failfast_mask Ns = Ns Sy 1 Pq uint
- Defines whether the driver should fail fast (without retrying) on a given
- error type.
- The following options may be bitwise-ored together:
- .TS
- box;
- lbz r l l .
- Value Name Description
- _
- 1 Device No driver retries on device errors
- 2 Transport No driver retries on transport errors.
- 4 Driver No driver retries on driver errors.
- .TE
- .
- .It Sy zfs_vdev_disk_max_segs Ns = Ns Sy 0 Pq uint
- Maximum number of segments to add to a BIO (min 4).
- If this is higher than the maximum allowed by the device queue or the kernel
- itself, it will be clamped.
- Setting it to zero will cause the kernel's ideal size to be used.
- This parameter only applies on Linux.
- This parameter is ignored if
- .Sy zfs_vdev_disk_classic Ns = Ns Sy 1 .
- .
- .It Sy zfs_vdev_disk_classic Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- If set to 1, OpenZFS will submit IO to Linux using the method it used in 2.2
- and earlier.
- This "classic" method has known issues with highly fragmented IO requests and
- is slower on many workloads, but it has been in use for many years and is known
- to be very stable.
- If you set this parameter, please also open a bug report why you did so,
- including the workload involved and any error messages.
- .Pp
- This parameter and the classic submission method will be removed once we have
- total confidence in the new method.
- .Pp
- This parameter only applies on Linux, and can only be set at module load time.
- .
- .It Sy zfs_expire_snapshot Ns = Ns Sy 300 Ns s Pq int
- Time before expiring
- .Pa .zfs/snapshot .
- .
- .It Sy zfs_admin_snapshot Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Allow the creation, removal, or renaming of entries in the
- .Sy .zfs/snapshot
- directory to cause the creation, destruction, or renaming of snapshots.
- When enabled, this functionality works both locally and over NFS exports
- which have the
- .Em no_root_squash
- option set.
- .
- .It Sy zfs_snapshot_no_setuid Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Whether to disable
- .Em setuid/setgid
- support for snapshot mounts triggered by access to the
- .Sy .zfs/snapshot
- directory by setting the
- .Em nosuid
- mount option.
- .
- .It Sy zfs_flags Ns = Ns Sy 0 Pq int
- Set additional debugging flags.
- The following flags may be bitwise-ored together:
- .TS
- box;
- lbz r l l .
- Value Name Description
- _
- 1 ZFS_DEBUG_DPRINTF Enable dprintf entries in the debug log.
- * 2 ZFS_DEBUG_DBUF_VERIFY Enable extra dbuf verifications.
- * 4 ZFS_DEBUG_DNODE_VERIFY Enable extra dnode verifications.
- 8 ZFS_DEBUG_SNAPNAMES Enable snapshot name verification.
- * 16 ZFS_DEBUG_MODIFY Check for illegally modified ARC buffers.
- 64 ZFS_DEBUG_ZIO_FREE Enable verification of block frees.
- 128 ZFS_DEBUG_HISTOGRAM_VERIFY Enable extra spacemap histogram verifications.
- 256 ZFS_DEBUG_METASLAB_VERIFY Verify space accounting on disk matches in-memory \fBrange_trees\fP.
- 512 ZFS_DEBUG_SET_ERROR Enable \fBSET_ERROR\fP and dprintf entries in the debug log.
- 1024 ZFS_DEBUG_INDIRECT_REMAP Verify split blocks created by device removal.
- 2048 ZFS_DEBUG_TRIM Verify TRIM ranges are always within the allocatable range tree.
- 4096 ZFS_DEBUG_LOG_SPACEMAP Verify that the log summary is consistent with the spacemap log
- and enable \fBzfs_dbgmsgs\fP for metaslab loading and flushing.
- .TE
- .Sy \& * No Requires debug build .
- .
- .It Sy zfs_btree_verify_intensity Ns = Ns Sy 0 Pq uint
- Enables btree verification.
- The following settings are cumulative:
- .TS
- box;
- lbz r l l .
- Value Description
- 1 Verify height.
- 2 Verify pointers from children to parent.
- 3 Verify element counts.
- 4 Verify element order. (expensive)
- * 5 Verify unused memory is poisoned. (expensive)
- .TE
- .Sy \& * No Requires debug build .
- .
- .It Sy zfs_free_leak_on_eio Ns = Ns Sy 0 Ns | Ns 1 Pq int
- If destroy encounters an
- .Sy EIO
- while reading metadata (e.g. indirect blocks),
- space referenced by the missing metadata can not be freed.
- Normally this causes the background destroy to become "stalled",
- as it is unable to make forward progress.
- While in this stalled state, all remaining space to free
- from the error-encountering filesystem is "temporarily leaked".
- Set this flag to cause it to ignore the
- .Sy EIO ,
- permanently leak the space from indirect blocks that can not be read,
- and continue to free everything else that it can.
- .Pp
- The default "stalling" behavior is useful if the storage partially
- fails (i.e. some but not all I/O operations fail), and then later recovers.
- In this case, we will be able to continue pool operations while it is
- partially failed, and when it recovers, we can continue to free the
- space, with no leaks.
- Note, however, that this case is actually fairly rare.
- .Pp
- Typically pools either
- .Bl -enum -compact -offset 4n -width "1."
- .It
- fail completely (but perhaps temporarily,
- e.g. due to a top-level vdev going offline), or
- .It
- have localized, permanent errors (e.g. disk returns the wrong data
- due to bit flip or firmware bug).
- .El
- In the former case, this setting does not matter because the
- pool will be suspended and the sync thread will not be able to make
- forward progress regardless.
- In the latter, because the error is permanent, the best we can do
- is leak the minimum amount of space,
- which is what setting this flag will do.
- It is therefore reasonable for this flag to normally be set,
- but we chose the more conservative approach of not setting it,
- so that there is no possibility of
- leaking space in the "partial temporary" failure case.
- .
- .It Sy zfs_free_min_time_ms Ns = Ns Sy 1000 Ns ms Po 1s Pc Pq uint
- During a
- .Nm zfs Cm destroy
- operation using the
- .Sy async_destroy
- feature,
- a minimum of this much time will be spent working on freeing blocks per TXG.
- .
- .It Sy zfs_obsolete_min_time_ms Ns = Ns Sy 500 Ns ms Pq uint
- Similar to
- .Sy zfs_free_min_time_ms ,
- but for cleanup of old indirection records for removed vdevs.
- .
- .It Sy zfs_immediate_write_sz Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq s64
- Largest data block to write to the ZIL.
- Larger blocks will be treated as if the dataset being written to had the
- .Sy logbias Ns = Ns Sy throughput
- property set.
- .
- .It Sy zfs_initialize_value Ns = Ns Sy 16045690984833335022 Po 0xDEADBEEFDEADBEEE Pc Pq u64
- Pattern written to vdev free space by
- .Xr zpool-initialize 8 .
- .
- .It Sy zfs_initialize_chunk_size Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
- Size of writes used by
- .Xr zpool-initialize 8 .
- This option is used by the test suite.
- .
- .It Sy zfs_livelist_max_entries Ns = Ns Sy 500000 Po 5*10^5 Pc Pq u64
- The threshold size (in block pointers) at which we create a new sub-livelist.
- Larger sublists are more costly from a memory perspective but the fewer
- sublists there are, the lower the cost of insertion.
- .
- .It Sy zfs_livelist_min_percent_shared Ns = Ns Sy 75 Ns % Pq int
- If the amount of shared space between a snapshot and its clone drops below
- this threshold, the clone turns off the livelist and reverts to the old
- deletion method.
- This is in place because livelists no longer give us a benefit
- once a clone has been overwritten enough.
- .
- .It Sy zfs_livelist_condense_new_alloc Ns = Ns Sy 0 Pq int
- Incremented each time an extra ALLOC blkptr is added to a livelist entry while
- it is being condensed.
- This option is used by the test suite to track race conditions.
- .
- .It Sy zfs_livelist_condense_sync_cancel Ns = Ns Sy 0 Pq int
- Incremented each time livelist condensing is canceled while in
- .Fn spa_livelist_condense_sync .
- This option is used by the test suite to track race conditions.
- .
- .It Sy zfs_livelist_condense_sync_pause Ns = Ns Sy 0 Ns | Ns 1 Pq int
- When set, the livelist condense process pauses indefinitely before
- executing the synctask \(em
- .Fn spa_livelist_condense_sync .
- This option is used by the test suite to trigger race conditions.
- .
- .It Sy zfs_livelist_condense_zthr_cancel Ns = Ns Sy 0 Pq int
- Incremented each time livelist condensing is canceled while in
- .Fn spa_livelist_condense_cb .
- This option is used by the test suite to track race conditions.
- .
- .It Sy zfs_livelist_condense_zthr_pause Ns = Ns Sy 0 Ns | Ns 1 Pq int
- When set, the livelist condense process pauses indefinitely before
- executing the open context condensing work in
- .Fn spa_livelist_condense_cb .
- This option is used by the test suite to trigger race conditions.
- .
- .It Sy zfs_lua_max_instrlimit Ns = Ns Sy 100000000 Po 10^8 Pc Pq u64
- The maximum execution time limit that can be set for a ZFS channel program,
- specified as a number of Lua instructions.
- .
- .It Sy zfs_lua_max_memlimit Ns = Ns Sy 104857600 Po 100 MiB Pc Pq u64
- The maximum memory limit that can be set for a ZFS channel program, specified
- in bytes.
- .
- .It Sy zfs_max_dataset_nesting Ns = Ns Sy 50 Pq int
- The maximum depth of nested datasets.
- This value can be tuned temporarily to
- fix existing datasets that exceed the predefined limit.
- .
- .It Sy zfs_max_log_walking Ns = Ns Sy 5 Pq u64
- The number of past TXGs that the flushing algorithm of the log spacemap
- feature uses to estimate incoming log blocks.
- .
- .It Sy zfs_max_logsm_summary_length Ns = Ns Sy 10 Pq u64
- Maximum number of rows allowed in the summary of the spacemap log.
- .
- .It Sy zfs_max_recordsize Ns = Ns Sy 16777216 Po 16 MiB Pc Pq uint
- We currently support block sizes from
- .Em 512 Po 512 B Pc No to Em 16777216 Po 16 MiB Pc .
- The benefits of larger blocks, and thus larger I/O,
- need to be weighed against the cost of COWing a giant block to modify one byte.
- Additionally, very large blocks can have an impact on I/O latency,
- and also potentially on the memory allocator.
- Therefore, we formerly forbade creating blocks larger than 1M.
- Larger blocks could be created by changing it,
- and pools with larger blocks can always be imported and used,
- regardless of this setting.
- .Pp
- Note that it is still limited by default to
- .Ar 1 MiB
- on x86_32, because Linux's
- 3/1 memory split doesn't leave much room for 16M chunks.
- .
- .It Sy zfs_allow_redacted_dataset_mount Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Allow datasets received with redacted send/receive to be mounted.
- Normally disabled because these datasets may be missing key data.
- .
- .It Sy zfs_min_metaslabs_to_flush Ns = Ns Sy 1 Pq u64
- Minimum number of metaslabs to flush per dirty TXG.
- .
- .It Sy zfs_metaslab_fragmentation_threshold Ns = Ns Sy 70 Ns % Pq uint
- Allow metaslabs to keep their active state as long as their fragmentation
- percentage is no more than this value.
- An active metaslab that exceeds this threshold
- will no longer keep its active status allowing better metaslabs to be selected.
- .
- .It Sy zfs_mg_fragmentation_threshold Ns = Ns Sy 95 Ns % Pq uint
- Metaslab groups are considered eligible for allocations if their
- fragmentation metric (measured as a percentage) is less than or equal to
- this value.
- If a metaslab group exceeds this threshold then it will be
- skipped unless all metaslab groups within the metaslab class have also
- crossed this threshold.
- .
- .It Sy zfs_mg_noalloc_threshold Ns = Ns Sy 0 Ns % Pq uint
- Defines a threshold at which metaslab groups should be eligible for allocations.
- The value is expressed as a percentage of free space
- beyond which a metaslab group is always eligible for allocations.
- If a metaslab group's free space is less than or equal to the
- threshold, the allocator will avoid allocating to that group
- unless all groups in the pool have reached the threshold.
- Once all groups have reached the threshold, all groups are allowed to accept
- allocations.
- The default value of
- .Sy 0
- disables the feature and causes all metaslab groups to be eligible for
- allocations.
- .Pp
- This parameter allows one to deal with pools having heavily imbalanced
- vdevs such as would be the case when a new vdev has been added.
- Setting the threshold to a non-zero percentage will stop allocations
- from being made to vdevs that aren't filled to the specified percentage
- and allow lesser filled vdevs to acquire more allocations than they
- otherwise would under the old
- .Sy zfs_mg_alloc_failures
- facility.
- .
- .It Sy zfs_ddt_data_is_special Ns = Ns Sy 1 Ns | Ns 0 Pq int
- If enabled, ZFS will place DDT data into the special allocation class.
- .
- .It Sy zfs_user_indirect_is_special Ns = Ns Sy 1 Ns | Ns 0 Pq int
- If enabled, ZFS will place user data indirect blocks
- into the special allocation class.
- .
- .It Sy zfs_multihost_history Ns = Ns Sy 0 Pq uint
- Historical statistics for this many latest multihost updates will be available
- in
- .Pa /proc/spl/kstat/zfs/ Ns Ao Ar pool Ac Ns Pa /multihost .
- .
- .It Sy zfs_multihost_interval Ns = Ns Sy 1000 Ns ms Po 1 s Pc Pq u64
- Used to control the frequency of multihost writes which are performed when the
- .Sy multihost
- pool property is on.
- This is one of the factors used to determine the
- length of the activity check during import.
- .Pp
- The multihost write period is
- .Sy zfs_multihost_interval No / Sy leaf-vdevs .
- On average a multihost write will be issued for each leaf vdev
- every
- .Sy zfs_multihost_interval
- milliseconds.
- In practice, the observed period can vary with the I/O load
- and this observed value is the delay which is stored in the uberblock.
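- .Pp
- For example, with the default interval of 1000 ms and a pool of 10 leaf vdevs,
- the multihost write period is
- .D1 1000 ms / 10 leaf-vdevs = 100 ms
- so a write is issued somewhere in the pool every 100 ms,
- while each individual leaf vdev is written about once per second.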
- .
- .It Sy zfs_multihost_import_intervals Ns = Ns Sy 20 Pq uint
- Used to control the duration of the activity test on import.
- Smaller values of
- .Sy zfs_multihost_import_intervals
- will reduce the import time but increase
- the risk of failing to detect an active pool.
- The total activity check time is never allowed to drop below one second.
- .Pp
- On import the activity check waits a minimum amount of time determined by
- .Sy zfs_multihost_interval No \(mu Sy zfs_multihost_import_intervals ,
- or the same product computed on the host which last had the pool imported,
- whichever is greater.
- The activity check time may be further extended if the value of MMP
- delay found in the best uberblock indicates actual multihost updates happened
- at longer intervals than
- .Sy zfs_multihost_interval .
- A minimum of
- .Em 100 ms
- is enforced.
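- .Pp
- With the defaults, the activity check therefore waits at least
- .D1 1000 ms \(mu 20 = 20 s
- before the import is allowed to proceed.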
- .Pp
- .Sy 0 No is equivalent to Sy 1 .
- .
- .It Sy zfs_multihost_fail_intervals Ns = Ns Sy 10 Pq uint
- Controls the behavior of the pool when multihost write failures or delays are
- detected.
- .Pp
- When
- .Sy 0 ,
- multihost write failures or delays are ignored.
- The failures will still be reported to the ZED which, depending on its
- configuration, may take action such as suspending the pool or offlining a
- device.
- .Pp
- Otherwise, the pool will be suspended if
- .Sy zfs_multihost_fail_intervals No \(mu Sy zfs_multihost_interval
- milliseconds pass without a successful MMP write.
- This guarantees the activity test will see MMP writes if the pool is imported.
- .Sy 1 No is equivalent to Sy 2 ;
- this is necessary to prevent the pool from being suspended
- due to normal, small I/O latency variations.
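- .Pp
- With the defaults, the pool is therefore suspended after
- .D1 10 \(mu 1000 ms = 10 s
- without a successful MMP write.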
- .
- .It Sy zfs_no_scrub_io Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Set to disable scrub I/O.
- This results in scrubs not actually scrubbing data and
- simply doing a metadata crawl of the pool instead.
- .
- .It Sy zfs_no_scrub_prefetch Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Set to disable block prefetching for scrubs.
- .
- .It Sy zfs_nocacheflush Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable cache flush operations on disks when writing.
- Setting this will cause pool corruption on power loss
- if a volatile out-of-order write cache is enabled.
- .
- .It Sy zfs_nopwrite_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Allow no-operation writes.
- The occurrence of nopwrites will further depend on other pool properties
- .Pq i.a. the checksumming and compression algorithms .
- .
- .It Sy zfs_dmu_offset_next_sync Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Enable forcing TXG sync to find holes.
- When enabled, this forces ZFS to sync data when
- .Sy SEEK_HOLE No or Sy SEEK_DATA
- flags are used, allowing holes in a file to be accurately reported.
- When disabled, holes will not be reported in recently dirtied files.
- .
- .It Sy zfs_pd_bytes_max Ns = Ns Sy 52428800 Ns B Po 50 MiB Pc Pq int
- The number of bytes which should be prefetched during a pool traversal, like
- .Nm zfs Cm send
- or other data crawling operations.
- .
- .It Sy zfs_traverse_indirect_prefetch_limit Ns = Ns Sy 32 Pq uint
- The number of blocks pointed to by an indirect (non-L0) block which should be
- prefetched during a pool traversal, like
- .Nm zfs Cm send
- or other data crawling operations.
- .
- .It Sy zfs_per_txg_dirty_frees_percent Ns = Ns Sy 30 Ns % Pq u64
- Control percentage of dirtied indirect blocks from frees allowed into one TXG.
- After this threshold is crossed, additional frees will wait until the next TXG.
- .Sy 0 No disables this throttle .
- .
- .It Sy zfs_prefetch_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable predictive prefetch.
- Note that it leaves "prescient" prefetch
- .Pq for, e.g., Nm zfs Cm send
- intact.
- Unlike predictive prefetch, prescient prefetch never issues I/O
- that ends up not being needed, so it can't hurt performance.
- .
- .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable QAT hardware acceleration for SHA256 checksums.
- May be unset after the ZFS modules have been loaded to initialize the QAT
- hardware as long as support is compiled in and the QAT driver is present.
- .
- .It Sy zfs_qat_compress_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable QAT hardware acceleration for gzip compression.
- May be unset after the ZFS modules have been loaded to initialize the QAT
- hardware as long as support is compiled in and the QAT driver is present.
- .
- .It Sy zfs_qat_encrypt_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable QAT hardware acceleration for AES-GCM encryption.
- May be unset after the ZFS modules have been loaded to initialize the QAT
- hardware as long as support is compiled in and the QAT driver is present.
- .
- .It Sy zfs_vnops_read_chunk_size Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
- Bytes to read per chunk.
- .
- .It Sy zfs_read_history Ns = Ns Sy 0 Pq uint
- Historical statistics for this many latest reads will be available in
- .Pa /proc/spl/kstat/zfs/ Ns Ao Ar pool Ac Ns Pa /reads .
- .
- .It Sy zfs_read_history_hits Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Include cache hits in read history.
- .
- .It Sy zfs_rebuild_max_segment Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq u64
- Maximum read segment size to issue when sequentially resilvering a
- top-level vdev.
- .
- .It Sy zfs_rebuild_scrub_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Automatically start a pool scrub when the last active sequential resilver
- completes in order to verify the checksums of all blocks which have been
- resilvered.
- This is enabled by default and strongly recommended.
- .
- .It Sy zfs_rebuild_vdev_limit Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq u64
- Maximum amount of I/O that can be concurrently issued for a sequential
- resilver per leaf device, given in bytes.
- .
- .It Sy zfs_reconstruct_indirect_combinations_max Ns = Ns Sy 4096 Pq int
- If an indirect split block contains more than this many possible unique
- combinations when being reconstructed, consider it too computationally
- expensive to check them all.
- Instead, try at most this many randomly selected
- combinations each time the block is accessed.
- This allows all segment copies to participate fairly
- in the reconstruction when all combinations
- cannot be checked and prevents repeated use of one bad copy.
- .
- .It Sy zfs_recover Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Set to attempt to recover from fatal errors.
- This should only be used as a last resort,
- as it typically results in leaked space, or worse.
- .
- .It Sy zfs_removal_ignore_errors Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Ignore hard I/O errors during device removal.
- When set, if a device encounters a hard I/O error during the removal process
- the removal will not be cancelled.
- This can result in a normally recoverable block becoming permanently damaged
- and is hence not recommended.
- This should only be used as a last resort when the
- pool cannot be returned to a healthy state prior to removing the device.
- .
- .It Sy zfs_removal_suspend_progress Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- This is used by the test suite so that it can ensure that certain actions
- happen while in the middle of a removal.
- .
- .It Sy zfs_remove_max_segment Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
- The largest contiguous segment that we will attempt to allocate when removing
- a device.
- If there is a performance problem with attempting to allocate large blocks,
- consider decreasing this.
- The default value is also the maximum.
- .
- .It Sy zfs_resilver_disable_defer Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Ignore the
- .Sy resilver_defer
- feature, causing an operation that would start a resilver to
- immediately restart the one in progress.
- .
- .It Sy zfs_resilver_defer_percent Ns = Ns Sy 10 Ns % Pq uint
- If the ongoing resilver progress is below this threshold, a new resilver will
- restart from scratch instead of being deferred after the current one finishes,
- even if the
- .Sy resilver_defer
- feature is enabled.
- .
- .It Sy zfs_resilver_min_time_ms Ns = Ns Sy 3000 Ns ms Po 3 s Pc Pq uint
- Resilvers are processed by the sync thread.
- While resilvering, it will spend at least this much time
- working on a resilver between TXG flushes.
- .
- .It Sy zfs_scan_ignore_errors Ns = Ns Sy 0 Ns | Ns 1 Pq int
- If set, remove the DTL (dirty time list) upon completion of a pool scan (scrub),
- even if there were unrepairable errors.
- Intended to be used during pool repair or recovery to
- stop resilvering when the pool is next imported.
- .
- .It Sy zfs_scrub_after_expand Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Automatically start a pool scrub after a RAIDZ expansion completes
- in order to verify the checksums of all blocks which have been
- copied during the expansion.
- This is enabled by default and strongly recommended.
- .
- .It Sy zfs_scrub_min_time_ms Ns = Ns Sy 1000 Ns ms Po 1 s Pc Pq uint
- Scrubs are processed by the sync thread.
- While scrubbing, it will spend at least this much time
- working on a scrub between TXG flushes.
- .
- .It Sy zfs_scrub_error_blocks_per_txg Ns = Ns Sy 4096 Pq uint
- Error blocks to be scrubbed in one TXG.
- .
- .It Sy zfs_scan_checkpoint_intval Ns = Ns Sy 7200 Ns s Po 2 hour Pc Pq uint
- To preserve progress across reboots, the sequential scan algorithm periodically
- needs to stop metadata scanning and issue all the verification I/O to disk.
- The frequency of this flushing is determined by this tunable.
- .
- .It Sy zfs_scan_fill_weight Ns = Ns Sy 3 Pq uint
- This tunable affects how scrub and resilver I/O segments are ordered.
- A higher number indicates that we care more about how filled in a segment is,
- while a lower number indicates we care more about the size of the extent without
- considering the gaps within a segment.
- This value is only tunable upon module insertion.
- Changing the value afterwards will have no effect on scrub or resilver
- performance.
- .
- .It Sy zfs_scan_issue_strategy Ns = Ns Sy 0 Pq uint
- Determines the order that data will be verified while scrubbing or resilvering:
- .Bl -tag -compact -offset 4n -width "a"
- .It Sy 1
- Data will be verified as sequentially as possible, given the
- amount of memory reserved for scrubbing
- .Pq see Sy zfs_scan_mem_lim_fact .
- This may improve scrub performance if the pool's data is very fragmented.
- .It Sy 2
- The largest mostly-contiguous chunk of found data will be verified first.
- By deferring scrubbing of small segments, we may later find adjacent data
- to coalesce and increase the segment size.
- .It Sy 0
- .No Use strategy Sy 1 No during normal verification
- .No and strategy Sy 2 No while taking a checkpoint .
- .El
- .
- .It Sy zfs_scan_legacy Ns = Ns Sy 0 Ns | Ns 1 Pq int
- If unset, indicates that scrubs and resilvers will gather metadata in
- memory before issuing sequential I/O.
- Otherwise indicates that the legacy algorithm will be used,
- where I/O is initiated as soon as it is discovered.
- Unsetting will not affect scrubs or resilvers that are already in progress.
- .
- .It Sy zfs_scan_max_ext_gap Ns = Ns Sy 2097152 Ns B Po 2 MiB Pc Pq int
- Sets the largest gap in bytes between scrub/resilver I/O operations
- that will still be considered sequential for sorting purposes.
- Changing this value will not
- affect scrubs or resilvers that are already in progress.
- .
- .It Sy zfs_scan_mem_lim_fact Ns = Ns Sy 20 Ns ^-1 Pq uint
- Maximum fraction of RAM used for I/O sorting by sequential scan algorithm.
- This tunable determines the hard limit for I/O sorting memory usage.
- When the hard limit is reached we stop scanning metadata and start issuing
- data verification I/O.
- This is done until we get below the soft limit.
- .
- .It Sy zfs_scan_mem_lim_soft_fact Ns = Ns Sy 20 Ns ^-1 Pq uint
- The fraction of the hard limit used to determine the soft limit for I/O sorting
- by the sequential scan algorithm.
- When we cross this limit from below no action is taken.
- When we cross this limit from above it is because we are issuing verification
- I/O.
- In this case (unless the metadata scan is done) we stop issuing verification I/O
- and start scanning metadata again until we get to the hard limit.
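- .Pp
- As a worked example with the defaults, a system with 64 GiB of RAM gets a hard
- limit of
- .D1 64 GiB / 20 = 3.2 GiB
- of sorting memory and a soft limit of roughly
- .D1 3.2 GiB / 20 \(ap 164 MiB
- below which verification I/O stops being issued and metadata scanning resumes.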
- .
- .It Sy zfs_scan_report_txgs Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- When reporting resilver throughput and estimated completion time, use the
- performance observed over roughly the last
- .Sy zfs_scan_report_txgs
- TXGs.
- When set to zero, performance is calculated over the time between checkpoints.
- .
- .It Sy zfs_scan_strict_mem_lim Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Enforce tight memory limits on pool scans when a sequential scan is in progress.
- When disabled, the memory limit may be exceeded by fast disks.
- .
- .It Sy zfs_scan_suspend_progress Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Freezes a scrub/resilver in progress without actually pausing it.
- Intended for testing/debugging.
- .
- .It Sy zfs_scan_vdev_limit Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq int
- Maximum amount of data that can be concurrently issued at once for scrubs and
- resilvers per leaf device, given in bytes.
- .
- .It Sy zfs_send_corrupt_data Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Allow sending of corrupt data (ignore read/checksum errors when sending).
- .
- .It Sy zfs_send_unmodified_spill_blocks Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Include unmodified spill blocks in the send stream.
- Under certain circumstances, previous versions of ZFS could incorrectly
- remove the spill block from an existing object.
- Including unmodified copies of the spill blocks creates a backwards-compatible
- stream which will recreate a spill block if it was incorrectly removed.
- .
- .It Sy zfs_send_no_prefetch_queue_ff Ns = Ns Sy 20 Ns ^\-1 Pq uint
- The fill fraction of the
- .Nm zfs Cm send
- internal queues.
- The fill fraction controls the timing with which internal threads are woken up.
- .
- .It Sy zfs_send_no_prefetch_queue_length Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq uint
- The maximum number of bytes allowed in
- .Nm zfs Cm send Ns 's
- internal queues.
- .
- .It Sy zfs_send_queue_ff Ns = Ns Sy 20 Ns ^\-1 Pq uint
- The fill fraction of the
- .Nm zfs Cm send
- prefetch queue.
- The fill fraction controls the timing with which internal threads are woken up.
- .
- .It Sy zfs_send_queue_length Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
- The maximum number of bytes allowed that will be prefetched by
- .Nm zfs Cm send .
- This value must be at least twice the maximum block size in use.
- .
- .It Sy zfs_recv_queue_ff Ns = Ns Sy 20 Ns ^\-1 Pq uint
- The fill fraction of the
- .Nm zfs Cm receive
- queue.
- The fill fraction controls the timing with which internal threads are woken up.
- .
- .It Sy zfs_recv_queue_length Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq uint
- The maximum number of bytes allowed in the
- .Nm zfs Cm receive
- queue.
- This value must be at least twice the maximum block size in use.
- .
- .It Sy zfs_recv_write_batch_size Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq uint
- The maximum amount of data, in bytes, that
- .Nm zfs Cm receive
- will write in one DMU transaction.
- This is the uncompressed size, even when receiving a compressed send stream.
- This setting will not reduce the write size below a single block.
- Capped at a maximum of
- .Sy 32 MiB .
- .
- .It Sy zfs_recv_best_effort_corrective Ns = Ns Sy 0 Pq int
- When this variable is set to non-zero, a corrective receive:
- .Bl -enum -compact -offset 4n -width "1."
- .It
- Does not enforce the restriction of source & destination snapshot GUIDs
- matching.
- .It
- If there is an error during healing, the healing receive is not terminated;
- instead it moves on to the next record.
- .El
- .
- .It Sy zfs_override_estimate_recordsize Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- Setting this variable overrides the default logic for estimating block
- sizes when doing a
- .Nm zfs Cm send .
- The default heuristic is that the average block size
- will be the current recordsize.
- Override this value if most data in your dataset is not of that size
- and you require accurate zfs send size estimates.
- .
- .It Sy zfs_sync_pass_deferred_free Ns = Ns Sy 2 Pq uint
- Flushing of data to disk is done in passes.
- Defer frees starting in this pass.
- .
- .It Sy zfs_spa_discard_memory_limit Ns = Ns Sy 16777216 Ns B Po 16 MiB Pc Pq int
- Maximum memory used for prefetching a checkpoint's space map on each
- vdev while discarding the checkpoint.
- .
- .It Sy zfs_special_class_metadata_reserve_pct Ns = Ns Sy 25 Ns % Pq uint
- Only allow small data blocks to be allocated on the special and dedup vdev
- types when the available free space percentage on these vdevs exceeds this
- value.
- This ensures reserved space is available for pool metadata as the
- special vdevs approach capacity.
- .
- .It Sy zfs_sync_pass_dont_compress Ns = Ns Sy 8 Pq uint
- Starting in this sync pass, disable compression (including of metadata).
- With the default setting, in practice, we don't have this many sync passes,
- so this has no effect.
- .Pp
- The original intent was that disabling compression would help the sync passes
- to converge.
- However, in practice, disabling compression increases
- the average number of sync passes; because when we turn compression off,
- many blocks' size will change, and thus we have to re-allocate
- (not overwrite) them.
- It also increases the number of
- .Em 128 KiB
- allocations (e.g. for indirect blocks and spacemaps)
- because these will not be compressed.
- The
- .Em 128 KiB
- allocations are especially detrimental to performance
- on highly fragmented systems, which may have very few free segments of this
- size,
- and may need to load new metaslabs to satisfy these allocations.
- .
- .It Sy zfs_sync_pass_rewrite Ns = Ns Sy 2 Pq uint
- Rewrite new block pointers starting in this pass.
- .
- .It Sy zfs_trim_extent_bytes_max Ns = Ns Sy 134217728 Ns B Po 128 MiB Pc Pq uint
- Maximum size of TRIM command.
- Larger ranges will be split into chunks no larger than this value before
- issuing.
- .
- .It Sy zfs_trim_extent_bytes_min Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq uint
- Minimum size of TRIM commands.
- TRIM ranges smaller than this will be skipped,
- unless they're part of a larger range which was chunked.
- This is done because it's common for these small TRIMs
- to negatively impact overall performance.
- .
- .It Sy zfs_trim_metaslab_skip Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- Skip uninitialized metaslabs during the TRIM process.
- This option is useful for pools constructed from large thinly-provisioned
- devices
- where TRIM operations are slow.
- As a pool ages, an increasing fraction of the pool's metaslabs
- will be initialized, progressively degrading the usefulness of this option.
- This setting is stored when starting a manual TRIM and will
- persist for the duration of the requested TRIM.
- .
- .It Sy zfs_trim_queue_limit Ns = Ns Sy 10 Pq uint
- Maximum number of queued TRIMs outstanding per leaf vdev.
- The number of concurrent TRIM commands issued to the device is controlled by
- .Sy zfs_vdev_trim_min_active No and Sy zfs_vdev_trim_max_active .
- .
- .It Sy zfs_trim_txg_batch Ns = Ns Sy 32 Pq uint
- The number of transaction groups' worth of frees which should be aggregated
- before TRIM operations are issued to the device.
- This setting represents a trade-off between issuing larger,
- more efficient TRIM operations and the delay
- before the recently trimmed space is available for use by the device.
- .Pp
- Increasing this value will allow frees to be aggregated for a longer time.
- This will result in larger TRIM operations and potentially increased memory
- usage.
- Decreasing this value will have the opposite effect.
- The default of
- .Sy 32
- was determined to be a reasonable compromise.
- .
- .It Sy zfs_txg_history Ns = Ns Sy 100 Pq uint
- Historical statistics for this many latest TXGs will be available in
- .Pa /proc/spl/kstat/zfs/ Ns Ao Ar pool Ac Ns Pa /TXGs .
- .
- .It Sy zfs_txg_timeout Ns = Ns Sy 5 Ns s Pq uint
- Flush dirty data to disk at least every this many seconds (maximum TXG
- duration).
- .
- .It Sy zfs_vdev_aggregation_limit Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq uint
- Max vdev I/O aggregation size.
- .
- .It Sy zfs_vdev_aggregation_limit_non_rotating Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq uint
- Max vdev I/O aggregation size for non-rotating media.
- .
- .It Sy zfs_vdev_mirror_rotating_inc Ns = Ns Sy 0 Pq int
- A number by which the balancing algorithm increments the load calculation for
- the purpose of selecting the least busy mirror member when an I/O operation
- immediately follows its predecessor on rotational vdevs.
- .
- .It Sy zfs_vdev_mirror_rotating_seek_inc Ns = Ns Sy 5 Pq int
- A number by which the balancing algorithm increments the load calculation for
- the purpose of selecting the least busy mirror member when an I/O operation
- lacks locality as defined by
- .Sy zfs_vdev_mirror_rotating_seek_offset .
- Operations within this distance that do not immediately follow the previous
- operation are incremented by half of this value.
- .
- .It Sy zfs_vdev_mirror_rotating_seek_offset Ns = Ns Sy 1048576 Ns B Po 1 MiB Pc Pq int
- The maximum distance from the last queued I/O operation within which
- the balancing algorithm considers an operation to have locality.
- .No See Sx ZFS I/O SCHEDULER .
- .
- .It Sy zfs_vdev_mirror_non_rotating_inc Ns = Ns Sy 0 Pq int
- A number by which the balancing algorithm increments the load calculation for
- the purpose of selecting the least busy mirror member on non-rotational vdevs
- when I/O operations do not immediately follow one another.
- .
- .It Sy zfs_vdev_mirror_non_rotating_seek_inc Ns = Ns Sy 1 Pq int
- A number by which the balancing algorithm increments the load calculation for
- the purpose of selecting the least busy mirror member when an I/O operation
- lacks
- locality as defined by the
- .Sy zfs_vdev_mirror_rotating_seek_offset .
- Operations within this distance that do not immediately follow the previous
- operation are incremented by half of this value.
- .
- .It Sy zfs_vdev_read_gap_limit Ns = Ns Sy 32768 Ns B Po 32 KiB Pc Pq uint
- Aggregate read I/O operations if the on-disk gap between them is within this
- threshold.
- .
- .It Sy zfs_vdev_write_gap_limit Ns = Ns Sy 4096 Ns B Po 4 KiB Pc Pq uint
- Aggregate write I/O operations if the on-disk gap between them is within this
- threshold.
- .
- .It Sy zfs_vdev_raidz_impl Ns = Ns Sy fastest Pq string
- Select the raidz parity implementation to use.
- .Pp
- Variants that don't depend on CPU-specific features
- may be selected on module load, as they are supported on all systems.
- The remaining options may only be set after the module is loaded,
- as they are available only if the implementations are compiled in
- and supported on the running system.
- .Pp
- Once the module is loaded,
- .Pa /sys/module/zfs/parameters/zfs_vdev_raidz_impl
- will show the available options,
- with the currently selected one enclosed in square brackets.
- .Pp
- .TS
- lb l l .
- fastest selected by built-in benchmark
- original original implementation
- scalar scalar implementation
- sse2 SSE2 instruction set 64-bit x86
- ssse3 SSSE3 instruction set 64-bit x86
- avx2 AVX2 instruction set 64-bit x86
- avx512f AVX512F instruction set 64-bit x86
- avx512bw AVX512F & AVX512BW instruction sets 64-bit x86
- aarch64_neon NEON Aarch64/64-bit ARMv8
- aarch64_neonx2 NEON with more unrolling Aarch64/64-bit ARMv8
- powerpc_altivec Altivec PowerPC
- .TE
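- .Pp
- For example (the output shown is only an illustration; the options listed
- depend on the CPU and on which implementations were compiled in):
- .Bd -literal -compact
-     # cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-     [fastest] original scalar sse2 ssse3 avx2
-     # echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
- .Ed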
- .
- .It Sy zfs_vdev_scheduler Pq charp
- .Sy DEPRECATED .
- Prints warning to kernel log for compatibility.
- .
- .It Sy zfs_zevent_len_max Ns = Ns Sy 512 Pq uint
- Max event queue length.
- Events in the queue can be viewed with
- .Xr zpool-events 8 .
- .
- .It Sy zfs_zevent_retain_max Ns = Ns Sy 2000 Pq int
- Maximum recent zevent records to retain for duplicate checking.
- Setting this to
- .Sy 0
- disables duplicate detection.
- .
- .It Sy zfs_zevent_retain_expire_secs Ns = Ns Sy 900 Ns s Po 15 min Pc Pq int
- Lifespan for a recent ereport that was retained for duplicate checking.
- .
- .It Sy zfs_zil_clean_taskq_maxalloc Ns = Ns Sy 1048576 Pq int
- The maximum number of taskq entries that are allowed to be cached.
- When this limit is exceeded transaction records (itxs)
- will be cleaned synchronously.
- .
- .It Sy zfs_zil_clean_taskq_minalloc Ns = Ns Sy 1024 Pq int
- The number of taskq entries that are pre-populated when the taskq is first
- created and are immediately available for use.
- .
- .It Sy zfs_zil_clean_taskq_nthr_pct Ns = Ns Sy 100 Ns % Pq int
- This controls the number of threads used by
- .Sy dp_zil_clean_taskq .
- The default value of
- .Sy 100%
- will create a maximum of one thread per CPU.
- .
- .It Sy zil_maxblocksize Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq uint
- This sets the maximum block size used by the ZIL.
- On very fragmented pools, lowering this
- .Pq typically to Sy 36 KiB
- can improve performance.
- .
- .It Sy zil_maxcopied Ns = Ns Sy 7680 Ns B Po 7.5 KiB Pc Pq uint
- This sets the maximum number of write bytes logged via WR_COPIED.
- It tunes a trade-off between an additional memory copy and possibly worse log
- space efficiency versus additional range lock/unlock overhead.
- .
- .It Sy zil_nocacheflush Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable the cache flush commands that are normally sent to disk by
- the ZIL after an LWB write has completed.
- Setting this will cause ZIL corruption on power loss
- if a volatile out-of-order write cache is enabled.
- .
- .It Sy zil_replay_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Disable intent logging replay.
- Can be disabled for recovery from corrupted ZIL.
- .
- .It Sy zil_slog_bulk Ns = Ns Sy 67108864 Ns B Po 64 MiB Pc Pq u64
- Limit SLOG write size per commit executed with synchronous priority.
- Any writes above that will be executed with lower (asynchronous) priority
- to limit potential SLOG device abuse by a single active ZIL writer.
- .
- .It Sy zfs_zil_saxattr Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Setting this tunable to zero disables ZIL logging of new
- .Sy xattr Ns = Ns Sy sa
- records if the
- .Sy org.openzfs:zilsaxattr
- feature is enabled on the pool.
- This would only be necessary to work around bugs in the ZIL logging or replay
- code for this record type.
- The tunable has no effect if the feature is disabled.
- .
- .It Sy zfs_embedded_slog_min_ms Ns = Ns Sy 64 Pq uint
- Usually, one metaslab from each normal-class vdev is dedicated for use by
- the ZIL to log synchronous writes.
- However, if there are fewer than
- .Sy zfs_embedded_slog_min_ms
- metaslabs in the vdev, this functionality is disabled.
- This ensures that we don't set aside an unreasonable amount of space for the
- ZIL.
- .
- .It Sy zstd_earlyabort_pass Ns = Ns Sy 1 Pq uint
- Whether the heuristic for detecting incompressible data with zstd levels >= 3,
- using LZ4 and zstd-1 passes, is enabled.
- .
- .It Sy zstd_abort_size Ns = Ns Sy 131072 Pq uint
- Minimum uncompressed size (inclusive) of a record before the early abort
- heuristic will be attempted.
- .
- .It Sy zio_deadman_log_all Ns = Ns Sy 0 Ns | Ns 1 Pq int
- If non-zero, the zio deadman will produce debugging messages
- .Pq see Sy zfs_dbgmsg_enable
- for all zios, rather than only for leaf zios possessing a vdev.
- This is meant to be used by developers to gain
- diagnostic information for hang conditions which don't involve a mutex
- or other locking primitive: typically conditions in which a thread in
- the zio pipeline is looping indefinitely.
- .
- .It Sy zio_slow_io_ms Ns = Ns Sy 30000 Ns ms Po 30 s Pc Pq int
- When an I/O operation takes more than this much time to complete,
- it's marked as slow.
- Each slow operation causes a delay zevent.
- Slow I/O counters can be seen with
- .Nm zpool Cm status Fl s .
- .
- .It Sy zio_dva_throttle_enabled Ns = Ns Sy 1 Ns | Ns 0 Pq int
- Throttle block allocations in the I/O pipeline.
- This allows for dynamic allocation distribution when devices are imbalanced.
- When enabled, the maximum number of pending allocations per top-level vdev
- is limited by
- .Sy zfs_vdev_queue_depth_pct .
- .
- .It Sy zfs_xattr_compat Ns = Ns 0 Ns | Ns 1 Pq int
- Control the naming scheme used when setting new xattrs in the user namespace.
- If
- .Sy 0
- .Pq the default on Linux ,
- user namespace xattr names are prefixed with the namespace, to be backwards
- compatible with previous versions of ZFS on Linux.
- If
- .Sy 1
- .Pq the default on Fx ,
- user namespace xattr names are not prefixed, to be backwards compatible with
- previous versions of ZFS on illumos and
- .Fx .
- .Pp
- Either naming scheme can be read on this and future versions of ZFS, regardless
- of this tunable, but legacy ZFS on illumos or
- .Fx
- are unable to read user namespace xattrs written in the Linux format, and
- legacy versions of ZFS on Linux are unable to read user namespace xattrs written
- in the legacy ZFS format.
- .Pp
- An existing xattr with the alternate naming scheme is removed when overwriting
- the xattr so as to not accumulate duplicates.
- .
- .It Sy zio_requeue_io_start_cut_in_line Ns = Ns Sy 0 Ns | Ns 1 Pq int
- Prioritize requeued I/O.
- .
- .It Sy zio_taskq_batch_pct Ns = Ns Sy 80 Ns % Pq uint
- Percentage of online CPUs which will run a worker thread for I/O.
- These workers are responsible for I/O work such as compression, encryption,
- checksum and parity calculations.
- A fractional number of CPUs will be rounded down.
- .Pp
- The default value of
- .Sy 80%
- was chosen to avoid using all CPUs which can result in
- latency issues and inconsistent application performance,
- especially when slower compression and/or checksumming is enabled.
- The set value only applies to pools imported or created after the change.
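- .Pp
- For example, on a system with 6 online CPUs, the default of 80% yields
- .D1 floor(6 \(mu 0.80) = 4
- I/O worker threads.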
- .
- .It Sy zio_taskq_batch_tpq Ns = Ns Sy 0 Pq uint
- Number of worker threads per taskq.
- Higher values improve I/O ordering and CPU utilization,
- while lower values reduce lock contention.
- .Pp
- If
- .Sy 0 ,
- generate a system-dependent value close to 6 threads per taskq.
- The set value only applies to pools imported or created after the change.
- .
- .It Sy zio_taskq_write_tpq Ns = Ns Sy 16 Pq uint
- Determines the minimum number of threads per write issue taskq.
- Higher values improve CPU utilization at high throughput,
- while lower values reduce taskq lock contention at high IOPS.
- The set value only applies to pools imported or created after the change.
- .
- .It Sy zio_taskq_read Ns = Ns Sy fixed,1,8 null scale null Pq charp
- Set the queue and thread configuration for the IO read queues.
- This is an advanced debugging parameter.
- Don't change this unless you understand what it does.
- The set values only apply to pools imported or created after the change.
- .
- .It Sy zio_taskq_write Ns = Ns Sy sync null scale null Pq charp
- Set the queue and thread configuration for the IO write queues.
- This is an advanced debugging parameter.
- Don't change this unless you understand what it does.
- The set values only apply to pools imported or created after the change.
- .
- .It Sy zvol_inhibit_dev Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- Do not create zvol device nodes.
- This may slightly improve startup time on
- systems with a very large number of zvols.
- .
- .It Sy zvol_major Ns = Ns Sy 230 Pq uint
- Major number for zvol block devices.
- .
- .It Sy zvol_max_discard_blocks Ns = Ns Sy 16384 Pq long
- Discard (TRIM) operations done on zvols will be done in batches of this
- many blocks, where block size is determined by the
- .Sy volblocksize
- property of a zvol.
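- .Pp
- As an illustration, for a zvol with a
- .Sy volblocksize
- of 16 KiB (chosen here only as an example), each discard batch covers at most
- 16384 \(mu 16 KiB = 256 MiB.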
- .
- .It Sy zvol_prefetch_bytes Ns = Ns Sy 131072 Ns B Po 128 KiB Pc Pq uint
- When adding a zvol to the system, prefetch this many bytes
- from the start and end of the volume.
- Prefetching these regions of the volume is desirable,
- because they are likely to be accessed immediately by
- .Xr blkid 8
- or the kernel partitioner.
- .
- .It Sy zvol_request_sync Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- When processing I/O requests for a zvol, submit them synchronously.
- This effectively limits the queue depth to
- .Em 1
- for each I/O submitter.
- When unset, requests are handled asynchronously by a thread pool.
- The number of requests which can be handled concurrently is controlled by
- .Sy zvol_threads .
- .Sy zvol_request_sync
- is ignored when running on a kernel that supports block multiqueue
- .Pq Li blk-mq .
- .
- .It Sy zvol_num_taskqs Ns = Ns Sy 0 Pq uint
- Number of zvol taskqs.
- If
- .Sy 0
- (the default) then scaling is done internally to prefer 6 threads per taskq.
- This only applies on Linux.
- .
- .It Sy zvol_threads Ns = Ns Sy 0 Pq uint
- The number of system wide threads to use for processing zvol block IOs.
- If
- .Sy 0
- (the default) then internally set
- .Sy zvol_threads
- to the number of CPUs present or 32 (whichever is greater).
- .
- .It Sy zvol_blk_mq_threads Ns = Ns Sy 0 Pq uint
- The number of threads per zvol to use for queuing IO requests.
- This parameter will only appear if your kernel supports
- .Li blk-mq
- and is only read and assigned to a zvol at zvol load time.
- If
- .Sy 0
- (the default) then internally set
- .Sy zvol_blk_mq_threads
- to the number of CPUs present.
- .
- .It Sy zvol_use_blk_mq Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- Set to
- .Sy 1
- to use the
- .Li blk-mq
- API for zvols.
- Set to
- .Sy 0
- (the default) to use the legacy zvol APIs.
- This setting can give better or worse zvol performance depending on
- the workload.
- This parameter will only appear if your kernel supports
- .Li blk-mq
- and is only read and assigned to a zvol at zvol load time.
- .
- .It Sy zvol_blk_mq_blocks_per_thread Ns = Ns Sy 8 Pq uint
- If
- .Sy zvol_use_blk_mq
- is enabled, then process this number of
- .Sy volblocksize Ns -sized blocks per zvol thread.
- This tunable can be used to favor better performance for zvol reads (lower
- values) or writes (higher values).
- If set to
- .Sy 0 ,
- then the zvol layer will process the maximum number of blocks
- per thread that it can.
- This parameter will only appear if your kernel supports
- .Li blk-mq
- and is only applied at each zvol's load time.
- .
- .It Sy zvol_blk_mq_queue_depth Ns = Ns Sy 0 Pq uint
- The queue_depth value for the zvol
- .Li blk-mq
- interface.
- This parameter will only appear if your kernel supports
- .Li blk-mq
- and is only applied at each zvol's load time.
- If
- .Sy 0
- (the default) then use the kernel's default queue depth.
- Values are clamped to the kernel's
- .Dv BLKDEV_MIN_RQ
- and
- .Dv BLKDEV_MAX_RQ Ns / Ns Dv BLKDEV_DEFAULT_RQ
- limits.
- .
- .It Sy zvol_volmode Ns = Ns Sy 1 Pq uint
- Defines the behaviour of zvol block devices when
- .Sy volmode Ns = Ns Sy default :
- .Bl -tag -compact -offset 4n -width "a"
- .It Sy 1
- .No equivalent to Sy full
- .It Sy 2
- .No equivalent to Sy dev
- .It Sy 3
- .No equivalent to Sy none
- .El
- .
- .It Sy zvol_enforce_quotas Ns = Ns Sy 0 Ns | Ns 1 Pq uint
- Enable strict ZVOL quota enforcement.
- The strict quota enforcement may have a performance impact.
- .El
- .
- .Sh ZFS I/O SCHEDULER
- ZFS issues I/O operations to leaf vdevs to satisfy and complete I/O operations.
- The scheduler determines when and in what order those operations are issued.
- The scheduler divides operations into five I/O classes,
- prioritized in the following order: sync read, sync write, async read,
- async write, and scrub/resilver.
- Each queue defines the minimum and maximum number of concurrent operations
- that may be issued to the device.
- In addition, the device has an aggregate maximum,
- .Sy zfs_vdev_max_active .
- Note that the sum of the per-queue minima must not exceed the aggregate maximum.
- If the sum of the per-queue maxima exceeds the aggregate maximum,
- then the number of active operations may reach
- .Sy zfs_vdev_max_active ,
- in which case no further operations will be issued,
- regardless of whether all per-queue minima have been met.
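- .Pp
- As a concrete illustration, the default per-queue maxima of the five classes
- listed above sum to
- .D1 10 + 10 + 3 + 10 + 2 = 35
- concurrent operations, well below the default aggregate maximum of 1000.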
- .Pp
- For many physical devices, throughput increases with the number of
- concurrent operations, but latency typically suffers.
- Furthermore, physical devices typically have a limit
- at which more concurrent operations have no
- effect on throughput or can actually cause it to decrease.
- .Pp
- The scheduler selects the next operation to issue by first looking for an
- I/O class whose minimum has not been satisfied.
- Once all are satisfied and the aggregate maximum has not been hit,
- the scheduler looks for classes whose maximum has not been satisfied.
- Iteration through the I/O classes is done in the order specified above.
- No further operations are issued
- if the aggregate maximum number of concurrent operations has been hit,
- or if there are no operations queued for an I/O class that has not hit its
- maximum.
- Every time an I/O operation is queued or an operation completes,
- the scheduler looks for new operations to issue.
- .Pp
- In general, smaller
- .Sy max_active Ns s
- will lead to lower latency of synchronous operations.
- Larger
- .Sy max_active Ns s
- may lead to higher overall throughput, depending on underlying storage.
- .Pp
- The ratio of the queues'
- .Sy max_active Ns s
- determines the balance of performance between reads, writes, and scrubs.
- For example, increasing
- .Sy zfs_vdev_scrub_max_active
- will cause the scrub or resilver to complete more quickly,
- but reads and writes to have higher latency and lower throughput.
- .Pp
- All I/O classes have a fixed maximum number of outstanding operations,
- except for the async write class.
- Asynchronous writes represent the data that is committed to stable storage
- during the syncing stage for transaction groups.
- Transaction groups enter the syncing state periodically,
- so the number of queued async writes will quickly burst up
- and then bleed down to zero.
- Rather than servicing them as quickly as possible,
- the I/O scheduler changes the maximum number of active async write operations
- according to the amount of dirty data in the pool.
- Since both throughput and latency typically increase with the number of
- concurrent operations issued to physical devices, reducing the
- burstiness in the number of simultaneous operations also stabilizes the
- response time of operations from other queues, in particular synchronous ones.
- In broad strokes, the I/O scheduler will issue more concurrent operations
- from the async write queue as there is more dirty data in the pool.
- .
- .Ss Async Writes
- The number of concurrent operations issued for the async write I/O class
- follows a piece-wise linear function defined by a few adjustable points:
- .Bd -literal
-        |              o---------| <-- \fBzfs_vdev_async_write_max_active\fP
-   ^    |             /^         |
-   |    |            / |         |
- active |           /  |         |
-  I/O   |          /   |         |
- count  |         /    |         |
-        |        /     |         |
-        |-------o      |         | <-- \fBzfs_vdev_async_write_min_active\fP
-       0|_______^______|_________|
-        0%      |      |         100% of \fBzfs_dirty_data_max\fP
-                |      |
-                |      `-- \fBzfs_vdev_async_write_active_max_dirty_percent\fP
-                `--------- \fBzfs_vdev_async_write_active_min_dirty_percent\fP
- .Ed
- .Pp
- Until the amount of dirty data exceeds a minimum percentage of the dirty
- data allowed in the pool, the I/O scheduler will limit the number of
- concurrent operations to the minimum.
- As that threshold is crossed, the number of concurrent operations issued
- increases linearly to the maximum at the specified maximum percentage
- of the dirty data allowed in the pool.
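- .Pp
- For illustration, the piece-wise linear function shown above can be written
- as the following C sketch;
- the function name is hypothetical and this is not the actual OpenZFS code.
- The parameters correspond to
- .Sy zfs_vdev_async_write_min_active ,
- .Sy zfs_vdev_async_write_max_active ,
- .Sy zfs_vdev_async_write_active_min_dirty_percent ,
- .Sy zfs_vdev_async_write_active_max_dirty_percent ,
- and
- .Sy zfs_dirty_data_max :
- .Bd -literal
- #include <stdint.h>
- 
- /* dirty: current dirty data (bytes); dirty_max: zfs_dirty_data_max. */
- static int
- async_write_max_active(uint64_t dirty, uint64_t dirty_max,
-     int min_active, int max_active, int min_dirty_pct, int max_dirty_pct)
- {
-         uint64_t lo = dirty_max * min_dirty_pct / 100;
-         uint64_t hi = dirty_max * max_dirty_pct / 100;
- 
-         if (dirty <= lo)
-                 return (min_active);    /* flat section left of the ramp */
-         if (dirty >= hi)
-                 return (max_active);    /* flat section right of the ramp */
-         /* Linear ramp between the two dirty-data thresholds. */
-         return (min_active + (int)((dirty - lo) *
-             (uint64_t)(max_active - min_active) / (hi - lo)));
- }
- .Ed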
- .Pp
- Ideally, the amount of dirty data on a busy pool will stay in the sloped
- part of the function between
- .Sy zfs_vdev_async_write_active_min_dirty_percent
- and
- .Sy zfs_vdev_async_write_active_max_dirty_percent .
- If it exceeds the maximum percentage,
- this indicates that the rate of incoming data is
- greater than the rate that the backend storage can handle.
- In this case, we must further throttle incoming writes,
- as described in the next section.
- .
- .Sh ZFS TRANSACTION DELAY
- We delay transactions when we've determined that the backend storage
- isn't able to accommodate the rate of incoming writes.
- .Pp
- If there is already a transaction waiting, we delay relative to when
- that transaction will finish waiting.
- This way the calculated delay time
- is independent of the number of threads concurrently executing transactions.
- .Pp
- If we are the only waiter, wait relative to when the transaction started,
- rather than the current time.
- This credits the transaction for "time already served",
- e.g. reading indirect blocks.
- .Pp
- The minimum time for a transaction to take is calculated as
- .D1 min_time = min( Ns Sy zfs_delay_scale No \(mu Po Sy dirty No \- Sy min Pc / Po Sy max No \- Sy dirty Pc , 100ms)
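- .Pp
- For illustration, the formula can be written as the following C sketch,
- where
- .Sy dirty
- is the current amount of dirty data,
- .Sy min
- is the threshold implied by
- .Sy zfs_delay_min_dirty_percent
- .Pq described below ,
- and
- .Sy max
- is
- .Sy zfs_dirty_data_max .
- The function name is hypothetical, and
- .Sy zfs_delay_scale
- is assumed to be expressed in nanoseconds:
- .Bd -literal
- #include <stdint.h>
- 
- /* All sizes are byte counts; the returned delay is in nanoseconds. */
- static uint64_t
- tx_delay_ns(uint64_t dirty, uint64_t min, uint64_t max, uint64_t scale)
- {
-         const uint64_t cap = 100ULL * 1000 * 1000;   /* 100 ms */
-         uint64_t d;
- 
-         if (dirty <= min)
-                 return (0);     /* below the threshold: no delay */
-         if (dirty >= max)
-                 return (cap);   /* avoid division by zero at the limit */
-         d = scale * (dirty - min) / (max - dirty);
-         return (d < cap ? d : cap);
- }
- .Ed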
- .Pp
- The delay has two degrees of freedom that can be adjusted via tunables.
- The percentage of dirty data at which we start to delay is defined by
- .Sy zfs_delay_min_dirty_percent .
- This should typically be at or above
- .Sy zfs_vdev_async_write_active_max_dirty_percent ,
- so that we only start to delay after writing at full speed
- has failed to keep up with the incoming write rate.
- The scale of the curve is defined by
- .Sy zfs_delay_scale .
- Roughly speaking, this variable determines the amount of delay at the midpoint
- of the curve.
- .Bd -literal
- delay
-  10ms +-------------------------------------------------------------*+
-       |                                                             *|
-   9ms +                                                             *+
-       |                                                             *|
-   8ms +                                                             *+
-       |                                                            * |
-   7ms +                                                            * +
-       |                                                            * |
-   6ms +                                                            * +
-       |                                                            * |
-   5ms +                                                            * +
-       |                                                           *  |
-   4ms +                                                           *  +
-       |                                                           *  |
-   3ms +                                                           *  +
-       |                                                          *   |
-   2ms +                                             (midpoint)  *    +
-       |                                                 |     **     |
-   1ms +                                                 v  ***       +
-       |          \fBzfs_delay_scale\fP ----------> ********          |
-     0 +-------------------------------------*********----------------+
-       0%                <- \fBzfs_dirty_data_max\fP ->           100%
- .Ed
- .Pp
- Note that, since the delay is added to the outstanding time remaining on the
- most recent transaction, it is effectively the inverse of IOPS.
- Here, the midpoint of
- .Em 500 us
- translates to
- .Em 2000 IOPS .
- The shape of the curve
- was chosen such that small changes in the amount of accumulated dirty data
- in the first three quarters of the curve yield relatively small differences
- in the amount of delay.
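- .Pp
- As a worked example
- .Pq using the 500 us midpoint mentioned above :
- .Bd -literal
- midpoint:        dirty - min =     max - dirty    delay = 500 us  (~2000 IOPS)
- 3/4 of the way:  dirty - min = 3 x (max - dirty)  delay = 1.5 ms  (~667 IOPS)
- .Ed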
- .Pp
- The effects can be easier to understand when the amount of delay is
- represented on a logarithmic scale:
- .Bd -literal
- delay
- 100ms +-------------------------------------------------------------++
-       +                                                              +
-       |                                                              |
-       +                                                             *+
-  10ms +                                                             *+
-       +                                                           ** +
-       |                                             (midpoint) **    |
-       +                                                 |    **      +
-   1ms +                                                 v ****       +
-       +          \fBzfs_delay_scale\fP ---------->   *****           +
-       |                                           ****               |
-       +                                         ****                 +
- 100us +                                        **                    +
-       +                                       *                      +
-       |                                      *                       |
-       +                                      *                       +
-  10us +                                     *                        +
-       +                                                              +
-       |                                                              |
-       +                                                              +
-       +--------------------------------------------------------------+
-       0%                <- \fBzfs_dirty_data_max\fP ->           100%
- .Ed
- .Pp
- Note here that only as the amount of dirty data approaches its limit does
- the delay start to increase rapidly.
- The goal of a properly tuned system should be to keep the amount of dirty data
- out of that range by first ensuring that the appropriate limits are set
- for the I/O scheduler to reach optimal throughput on the back-end storage,
- and then by changing the value of
- .Sy zfs_delay_scale
- to increase the steepness of the curve.