- .\"
- .\" CDDL HEADER START
- .\"
- .\" The contents of this file are subject to the terms of the
- .\" Common Development and Distribution License (the "License").
- .\" You may not use this file except in compliance with the License.
- .\"
- .\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
- .\" or https://opensource.org/licenses/CDDL-1.0.
- .\" See the License for the specific language governing permissions
- .\" and limitations under the License.
- .\"
- .\" When distributing Covered Code, include this CDDL HEADER in each
- .\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
- .\" If applicable, add the following below this CDDL HEADER, with the
- .\" fields enclosed by brackets "[]" replaced with your own identifying
- .\" information: Portions Copyright [yyyy] [name of copyright owner]
- .\"
- .\" CDDL HEADER END
- .\"
- .\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
- .\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
- .\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
- .\" Copyright (c) 2017 Datto Inc.
- .\" Copyright (c) 2018 George Melikov. All Rights Reserved.
- .\" Copyright 2017 Nexenta Systems, Inc.
- .\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
- .\"
- .Dd April 7, 2023
- .Dt ZPOOLCONCEPTS 7
- .Os
- .
- .Sh NAME
- .Nm zpoolconcepts
- .Nd overview of ZFS storage pools
- .
- .Sh DESCRIPTION
- .Ss Virtual Devices (vdevs)
- A "virtual device" describes a single device or a collection of devices,
- organized according to certain performance and fault characteristics.
- The following virtual devices are supported:
- .Bl -tag -width "special"
- .It Sy disk
- A block device, typically located under
- .Pa /dev .
- ZFS can use individual slices or partitions, though the recommended mode of
- operation is to use whole disks.
- A disk can be specified by a full path, or it can be a shorthand name
- .Po the relative portion of the path under
- .Pa /dev
- .Pc .
- A whole disk can be specified by omitting the slice or partition designation.
- For example,
- .Pa sda
- is equivalent to
- .Pa /dev/sda .
- When given a whole disk, ZFS automatically labels the disk, if necessary.
- .It Sy file
- A regular file.
- The use of files as a backing store is strongly discouraged.
- It is designed primarily for experimental purposes, as the fault tolerance of a
- file is only as good as the file system on which it resides.
- A file must be specified by a full path.
- .It Sy mirror
- A mirror of two or more devices.
- Data is replicated in an identical fashion across all components of a mirror.
- A mirror with
- .Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
- devices failing, without losing data.
- .It Sy raidz , raidz1 , raidz2 , raidz3
- A distributed-parity layout, similar to RAID-5/6, with improved distribution of
- parity, and which does not suffer from the RAID-5/6
- .Qq write hole ,
- .Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group, though not
necessarily in a consistent stripe width.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
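.Pp
For example, the following sketch
.Pq with illustrative device names
creates a double-parity raidz group capable of withstanding two simultaneous
disk failures:
.Dl # Nm zpool Cm create Ar pool Sy raidz2 Ar sda sdb sdc sdd sde sdf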
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering, while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes (zvols) and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to I/O, performance is similar to raidz since, for any read, all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
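.Pp
For example, the following sketch
.Pq with illustrative device names
creates a dRAID vdev with double parity, 4 data devices per redundancy group,
and a single distributed hot spare:
.Dl # Nm zpool Cm create Ar pool Sy draid2:4d:1s Ar sda sdb sdc sdd sde sdf sdg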
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device solely dedicated for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested arbitrarily.
A mirror, raidz or draid virtual device can only be created with files or disks.
Mirrors of mirrors or other such combinations are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy, when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs are in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices are in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors or slow I/Os exceeds acceptable levels and the
device is degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs are in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices are in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block cannot be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
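.Pp
For example, per-device error counters, including checksum errors, can be
inspected with:
.Dl # Nm zpool Cm status Fl v Ar pool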
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
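.Pp
For example, an additional spare
.Pq here the illustrative device Ar sde
could be added to the pool above and later removed again with:
.Dl # Nm zpool Cm add Ar pool Sy spare Ar sde
.Dl # Nm zpool Cm remove Ar pool sde
.Pp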
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again, if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
potential data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
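.Pp
For example, assuming the hot spare
.Ar sdc
from the pool above is currently replacing a failed device,
the replacement could be cancelled with:
.Dl # Nm zpool Cm detach Ar pool sdc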
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section of
.Xr zpool 8
for an example of mirroring multiple log devices.
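.Pp
As a minimal sketch
.Pq with illustrative device names ,
a mirrored log vdev can be specified at pool creation:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log mirror Ar sdc sdd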
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
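.Pp
For example, assuming the mirrored log above appears in
.Nm zpool Cm status
output as the hypothetical top-level vdev
.Ar mirror-1 ,
it could be removed with:
.Dl # Nm zpool Cm remove Ar pool mirror-1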
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots, and is restored
asynchronously in L2ARC when the pool is imported (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
ZFS does not write the metadata structures
required for rebuilding the L2ARC, to conserve space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful in case of memory pressure,
where the contents of the cache device are not fully restored in L2ARC.
The user can offline and then online the cache device when there is less
memory pressure, to fully restore its contents to L2ARC.
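.Pp
For example, assuming the illustrative cache device
.Ar sdc ,
such a cycle could look like:
.Dl # Nm zpool Cm offline Ar pool sdc
.Dl # Nm zpool Cm online Ar pool sdc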
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, these include vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more information on this property.
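.Pp
As a minimal sketch
.Pq with illustrative pool, dataset, and device names ,
the following creates a pool with a mirrored special vdev and then opts a
dataset in to storing file blocks up to 32 KiB in the special class:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy special mirror Ar sdc sdd
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 32K Ar pool/dataset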