
oasis-root

Compiled tree of Oasis Linux, based on its own branch at <https://hacktivis.me/git/oasis/>.
git clone https://anongit.hacktivis.me/git/oasis-root.git

zpoolconcepts.7 (20045B)


.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or https://opensource.org/licenses/CDDL-1.0.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd April 7, 2023
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices,
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing, without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A distributed-parity layout, similar to RAID-5/6, with improved distribution of
parity, and which does not suffer from the RAID-5/6
.Qq write hole ,
.Pq in which data and parity become inconsistent after a power loss .
Data and parity is striped across all disks within a raidz group, though not
necessarily in a consistent stripe width.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
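.Pp
For example, a double-parity raidz group of six disks, able to withstand any
two disks failing, could be created as follows
.Pq pool and device names are purely illustrative :
.Dl # Nm zpool Cm create Ar tank Sy raidz2 Ar sda sdb sdc sdd sde sdf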
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering, while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes (zvols) and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to I/O, performance is similar to raidz since, for any read, all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
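.Pp
For example, an eleven-disk dRAID vdev with double parity, four data disks per
redundancy group, and one distributed spare could be created as follows
.Pq pool and device names are purely illustrative :
.Dl # Nm zpool Cm create Ar tank Sy draid2:4d:11c:1s Ar sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk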
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device solely dedicated for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested arbitrarily.
A mirror, raidz or draid virtual device can only be created with files or disks.
Mirrors of mirrors or other such combinations are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy, when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors or slow I/Os exceeds acceptable levels and the
device is degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
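.Pp
Per-device checksum and I/O error counts, including any files affected by
unrecoverable errors, can be inspected with:
.Dl # Nm zpool Cm status Fl v Ar pool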
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
potential data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read-workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously when importing the pool in L2ARC (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
ZFS does not write the metadata structures
required for rebuilding the L2ARC, to conserve space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful in case of memory pressure,
where the contents of the cache device are not fully restored in L2ARC.
The user can offline and online the cache device when there is less memory
pressure, to fully restore its contents to L2ARC.
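.Pp
As an example, on Linux the persistent L2ARC rebuild can be disabled at
runtime through the module parameter
.Pq the Pa /sys No path is Linux-specific :
.Dl # echo 0 > /sys/module/zfs/parameters/l2arc_rebuild_enabled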
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.
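.Pp
As an example, a pool with a mirrored special vdev, and a dataset that opts
file blocks of up to 32 KiB into the special class, could be configured as
follows
.Pq pool, dataset, and device names are purely illustrative :
.Dl # Nm zpool Cm create Ar pool Sy raidz Ar sda sdb sdc Sy special mirror Ar sdd sde
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 32K Ar pool/dataset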