---
date: 2020-04-22
title: How to store data forever
layout: post
---

As someone who has often been burned by the disappearance of my data for
various reasons — companies going under, hard drive failure, etc. —
and as someone who is responsible for the safekeeping of other people's data,
I've put a lot of thought into solutions for long-term data retention.

There are two kinds of long-term storage, with different concerns: cold storage
and hot storage. The former is like a hard drive in your safe — it stores
your data, but you're not actively using it or putting wear on the storage
medium. By contrast, hot storage is storage which is available immediately and
undergoing frequent reads and writes.

## What storage medium to use?

There are some bad ways to do it. The worst way I can think of is to store
it on a microSD card. These fail *a lot*. I couldn't find any hard data, but
anecdotally, 4 out of 5 microSD cards I've used have experienced failures
resulting in permanent data loss. Low volume writes, such as from a digital
camera, are unlikely to cause failure. However, microSD cards have a tendency to
get hot with prolonged writes, and they'll quickly leave their safe operating
temperature and start to accumulate damage. Nearly all microSD cards will let
you perform writes fast enough to drive up the temperature beyond the operating
limits — after all, writes per second is a marketable feature — so
if you want to safely move lots of data onto or off of a microSD card, you need
to monitor the temperature and throttle your read/write operations.
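
If you must do a bulk transfer onto flash media, one crude mitigation is to cap
the write rate so the card has time to shed heat. A sketch with rsync (the rate
limit and paths are arbitrary examples; tune them for your card, and check its
temperature if your hardware exposes it):

```
# Copy onto the card at a capped rate (~8 MiB/s) instead of full speed.
# --bwlimit takes KiB/s when given a bare number.
rsync --archive --progress --bwlimit=8192 /data/archive/ /mnt/microsd/archive/
```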

A more reliable solution is to store the data on a hard drive[^1]. However, hard
drives are rated for a limited number of read/write cycles, and can be expected
to fail eventually. Backblaze publishes some great articles on [hard drive
failure rates](https://www.backblaze.com/blog/hard-drive-stats-for-2019/) across
their fleet. According to them, the average annual failure rate of hard drives
is almost 2%. Of course, the exact rate will vary with the frequency of use and
storage conditions. Even in cold storage, the shelf life of a magnetic platter
is not indefinite.

[^1]: Or SSDs, which I will refer to interchangeably with HDDs in this article. They have their own considerations, but we'll get to that.

There are other solutions, like optical media, tape drives, or more novel
media, like the [Rosetta Disk](https://en.wikipedia.org/wiki/Rosetta_Project).
For most readers, a hard drive will be the best balance of practicality and
reliability. For serious long-term storage, if expense isn't a concern, I would
also recommend hot storage over cold storage because it introduces the
possibility of active monitoring.

## Redundancy with RAID

One solution to this is redundancy — storing the same data across multiple
hard drives. For cold storage, this is often as simple as copying the data onto
a second hard drive, like an external backup HDD. Other solutions exist for hot
storage. The most common standard is [RAID][RAID], which offers different
features with different numbers of hard drives. With two hard drives (RAID1), for
example, it utilizes mirroring, which writes the same data to both disks. RAID
gets more creative with three or more hard drives, utilizing *parity*, which
allows it to reconstruct the contents of failed hard drives from still-online
drives. The basic idea relies on the XOR operation. Let's say you write the
following byte to drive A: `0b11100111`, and to drive B: `0b10101100`. By XORing
these values together:

[RAID]: https://en.wikipedia.org/wiki/RAID

```
  11100111 A
^ 10101100 B
= 01001011 C
```

We obtain the value to write to drive C. If any of these three drives fail, we
can XOR the remaining two values again to obtain the third.

```
  11100111 A
^ 01001011 C
= 10101100 B
```

```
  10101100 B
^ 01001011 C
= 11100111 A
```
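
You can verify the arithmetic with bash arithmetic and bc if you like (a quick
sketch, unrelated to any actual RAID tooling; bc prints binary without the
leading zero):

```
# C = A ^ B, printed in binary (bc drops the leading 0 of 01001011)
echo "obase=2; $(( 2#11100111 ^ 2#10101100 ))" | bc

# Reconstruct A from the surviving drives B and C
echo "obase=2; $(( 2#10101100 ^ 2#01001011 ))" | bc
```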

This allows any single drive to fail while its contents can still be recovered,
and the recovery can be performed online. However, it's often not that simple.
Drive failure can dramatically reduce the performance of the array while it's
being rebuilt — the disks are going to be seeking constantly to find the
parity data to rebuild the failed disk, and any attempts to read from the disk
that's being rebuilt will require computing the recovered value on the fly. This
can be improved upon by using lots of drives and multiple levels of redundancy,
but it is still likely to have an impact on the availability of your data if not
carefully planned for.

You should also be monitoring your drives and preparing for their failure in
advance. Failing disks often show warning signs beforehand: degraded
performance, or errors in their S.M.A.R.T. reports. Learn the tools for
monitoring your storage medium, such as smartmontools, and set them up to
report failures to you (and *test* the mechanisms by which the failures are
reported to you).
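
As a sketch of what that looks like with smartmontools (the self-test schedule
is the stock example from the smartd documentation, and the email address is a
placeholder):

```
# /etc/smartd.conf
# Monitor all attributes on every detected disk, run a short self-test daily
# at 2am and a long one on Saturdays at 3am, email me on any failure, and
# send one test mail at startup to prove the reporting path works.
DEVICESCAN -a -s (S/../.././02|L/../../6/03) -m you@example.com -M test
```

You can also poke a disk by hand with `smartctl -H /dev/sda` or
`smartctl -a /dev/sda`.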

### Other RAID failure modes

There are other common ways a RAID can fail that result in permanent data loss.
One example is using hardware RAID — there was an argument to be made for
hardware RAID cards at one point, but these days they are *almost always* a
mistake. Most operating systems have software RAID implementations which can
achieve the same results without a dedicated RAID card. With hardware RAID, if
the RAID card itself fails (and they often do), you might have to find the exact
same card to be able to read from your disks again. You'll be paying for new
hardware, which might be expensive or out of production, and waiting for it to
arrive before you can start recovering data. With software RAID, the hard drives
are portable between machines and you can always interpret the data with
general-purpose software.
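
For instance, a Linux software RAID1 built with mdadm stays readable on any
machine with mdadm installed (the device names below are just examples):

```
# Mirror two partitions into a single array, /dev/md0 (RAID1)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Check health and rebuild progress
cat /proc/mdstat
mdadm --detail /dev/md0

# On a replacement machine, reassemble the array from the member disks
# alone -- no special controller card required.
mdadm --assemble --scan
```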

Another common failure is *cascading* drive failures. RAID can tolerate partial
drive failure thanks to parity and mirroring, but if the failures start to pile
up, you can suffer permanent data loss. Many a sad administrator has been in
panic mode, recovering a RAID from a disk failure, and at their lowest
moment... another disk fails. Then another. They've suddenly lost their data,
and the challenge of recovering what remains has become ten times harder.
Because read and write operations have been distributed consistently across all
of your drives over the lifetime of the hardware, the drives have received a
similar level of wear, and failing together is not uncommon.

Often, failures like this can be attributed to using many hard drives from the
same batch. One strategy I recommend to avoid this scenario is to use drives
from a mix of vendors, model numbers, and so on. A RAID improves performance by
distributing reads and writes across drives, using the time one drive is busy
to work with another. Accordingly, any differences in the performance
characteristics of different kinds of drives will be smoothed out in the wash.

## ZFS

RAID is complicated, and getting it right is difficult. You don't want to wait
until your drives are failing to learn about a gap in your understanding of
RAID. For this reason, I recommend ZFS to most people. It automatically makes
good decisions for you with respect to mirroring and parity, and gracefully
handles rebuilds, sudden power loss, and other failures. It also has features
which are helpful for other failure modes, like snapshots.

Set up Zed to email you reports from ZFS. Zed has a debug mode, which will send
you emails even for working disks — I recommend leaving this on, so that
their conspicuous absence might alert you to a problem with the monitoring
mechanism. Set up a cronjob to do monthly scrubs and review the Zed reports when
they arrive. ZFS snapshots are cheap: set up a cronjob to take one every 5
minutes, perhaps with [zfs-auto-snapshot][zfs-auto].

[zfs-auto]: https://github.com/zfsonlinux/zfs-auto-snapshot
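
As a rough sketch of that setup on a typical OpenZFS install (the pool name
`tank`, the dataset, the schedule, and the paths are all examples; a plain
`zfs snapshot` is shown here instead of zfs-auto-snapshot, which also handles
pruning for you):

```
# /etc/zfs/zed.d/zed.rc -- where Zed's reports are configured
ZED_EMAIL_ADDR="you@example.com"
ZED_NOTIFY_VERBOSE=1          # also notify for healthy pools, so silence means trouble
ZED_NOTIFY_INTERVAL_SECS=3600

# crontab: monthly scrub, plus a snapshot every 5 minutes
# (% must be escaped as \% inside crontab entries)
0 2 1 * * /sbin/zpool scrub tank
*/5 * * * * /sbin/zfs snapshot tank/data@auto-$(date +\%Y\%m\%d-\%H\%M)
```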

## Human failures and existential threats

Even if you've addressed hardware failure, you're not done yet. There are other
ways still in which your storage may fail. Maybe your server fans fail and burn
out all of your hard drives at once. Or, your datacenter could suffer a total
existence failure — what if a fire burns down the building?

There's also the problem of human failure. What if you accidentally `rm -rf / *`
the server? Your RAID array will faithfully remove the data from all of the hard
drives for you. What if you send the sysop out to the datacenter to decommission
a machine, and no one notices that they decommissioned the wrong one until it's
too late?

This is where off-site backups come into play. For this purpose, I recommend
[Borg backup][borg]. It has sophisticated features for compression and
encryption, and allows you to mount any version of your backups as a filesystem
to recover the data from. Set this up on a cronjob as well, running as
frequently as you feel the need to make backups, and send the backups off-site
to another location, which itself should have storage facilities following the
rest of the recommendations from this article. Set up another cronjob to run
`borg check` and send you the results on a schedule, so that their conspicuous
absence may indicate that something fishy is going on. I also use
[Prometheus][prom] with [Pushgateway][pushgateway] to make a note every time a
backup is run, and set up an alarm which goes off if the backup age exceeds 48
hours. I also have periodic test alarms, so that the alert manager's own
failures are noticed.

[borg]: https://www.borgbackup.org/
[prom]: https://prometheus.io/
[pushgateway]: https://github.com/prometheus/pushgateway
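
A minimal sketch of such a pipeline (the repository URL, the paths, and the
Pushgateway address are placeholders):

```
# One-time setup: an encrypted Borg repository on an off-site machine
borg init --encryption=repokey ssh://backup@offsite.example.com/./backups

# Nightly cronjob: create a compressed archive, verify the repository,
# then record the successful run in Prometheus via the Pushgateway.
borg create --compression zstd \
    ssh://backup@offsite.example.com/./backups::{hostname}-{now} /home /etc /srv
borg check ssh://backup@offsite.example.com/./backups
echo "backup_last_success_timestamp_seconds $(date +%s)" \
    | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/borg_backup
```

A Prometheus alert on `time() - backup_last_success_timestamp_seconds > 48 * 3600`
then fires when backups quietly stop arriving.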

## Are you prepared for the failure?

When your disks are failing and everything is on fire and the sky is falling,
that is the worst time for it to be your first rodeo. You should have
*practiced* these problems before they became problems. Do training with anyone
expected to deal with failures. Yank out a hard drive and tell them to fix it.
Have someone in sales come yell at them partway through because the website is
unbearably slow while the RAID is rebuilding and the company is losing $100 per
minute as a result of the outage.

Periodically produce a working system from your backups. This proves (1) the
backups are still working, (2) the backups have coverage over everything which
would need to be restored, and (3) you know how to restore them. Bonus: if
you're confident in your backups, you should be able to replace the production
system with the restored one and allow service to continue as normal.
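
With Borg, such a drill can be as simple as the following (the archive name and
mount point are illustrative):

```
# See which archives exist, then mount one read-only and spot-check it,
# or extract it onto a spare machine and try booting the service from it.
borg list ssh://backup@offsite.example.com/./backups
borg mount ssh://backup@offsite.example.com/./backups::myhost-2020-04-01 /mnt/restore
borg umount /mnt/restore
```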

## Actually storing data *forever*

Let's say you've managed to keep your data around. But will you still know how
to interpret that data in the future? Is it in a file format which requires
specialized software to use? Will that software still be relevant in the future?
Is that software open-source, so you can update it yourself? Will it still
compile and run correctly on newer operating systems and hardware? Will the
storage medium still be compatible with new computers?

Who is going to be around to watch the monitoring systems you've put in place?
Who's going to replace the failing hard drives after you're gone? How will they
be paid? Will the dataset still be comprehensible after 500 years of evolution
of written language? The dataset requires constant maintenance to remain intact,
but also to remain useful.

And ultimately, there is one factor to long-term data retention that you cannot
control: future generations will decide what data is worth keeping — not
us.

In summary: no matter what, definitely don't do this:

![Picture of a SATA card for RAIDing 10 microSD cards together](https://l.sr.ht/ig3R.jpg)