Backups-and-redundancy-at-sr.ht.md (6966B)
- ---
- date: 2019-01-13
- layout: post
- title: "Backups & redundancy at sr.ht"
- tags: ["sourcehut", "ops"]
- ---
- [sr.ht](https://sr.ht)[^1] is [100% open source][sr.ht-code] and I encourage
- people to install it on their own infrastructure, especially if they'll be
- sending patches upstream. However, I am equally thrilled to host sr.ht for you
- on the "official" instance, and most users find this useful because the
- maintenance burden is non-trivial. Today I'll give you an idea of what your
- subscription fee pays for. In this first post on ops at sr.ht, I'll talk about
- backups and redundancy. In future posts, I'll talk about security, high
- availability, automation, and more.
- [^1]: sr.ht is a software project hosting website, with git hosting, ticket tracking, continuous integration, mailing lists, and more. [Try it out!](https://sr.ht)
- [sr.ht-code]: https://git.sr.ht/~sircmpwn?search=sr.ht
- As sr.ht is still in the alpha phase, high availability has been on the
- backburner. However, data integrity has always been of paramount importance to
- me. The very earliest versions of sr.ht, from well before it was even trying to
- be a software forge, made a point to never lose a single byte of user data.
- Outages are okay - so long as when service is restored, everything is still
- there. Over time I'm working to make outages a thing of the past, too, but let's
- start with backups.
- There are several ways that sr.ht stores data:
- - Important data on the filesystem (e.g. bare git repositories)
- - Important persistent data in PostgreSQL
- - Unimportant ephemeral data in Redis (& caches)
- - Miscellaneous filesystem storage, like the operating system
- Some of this data is important and kept redundant (PostgreSQL, git repos), and
- others are unimportant and is not redundant. For example, I store a rendered
- Markdown cache for git.sr.ht in Redis. If the Redis cluster goes *poof*, the
- source Markdown is still available, so I don't bother backing up Redis. Most
- services run in a VM and I generally don't store important data on these - the
- hosts usually only have one hard drive with no backups and no redundancy. If the
- host dies, I have to reprovision all of those VMs.
- Other data is more important. Consider PostgreSQL, which contains some of the
- most important data for sr.ht. I have one master PostgreSQL server, a dedicated
- server in the space I colocate in my home town of Philadelphia. I run sr.ht on
- this server, but I also use it for a variety of other projects - I maintain many
- myself, and I volunteer as a sysadmin for more still. This box (named Remilia)
- has four hard drives configured in a ZRAID (ZFS). I buy these hard drives from a
- variety of vendors, mostly Western Digital and Seagate, and from different
- batches - reducing the likelihood that they'll fail around the same time. ZFS is
- well-known for it's excellent design, featureset and for simply keeping your
- data intact, and I don't trust any other filesystem with important data. I take
- ZFS snapshots every 15 minutes and retain them for 30 days. These snapshots are
- important for correcting the "oh shit, I rm'd something important" mistakes -
- you can mount them later and see what the filesystem looked like at the time
- they were taken.
- On top of this, the PostgreSQL server is set up with two additional important
- features: continuous archiving and streaming replication. Continuous archiving
- has PostgreSQL writing each transaction to log files on disk, which represents a
- re-playable history of the entire database, and allows you to restore the
- database to any point in time. This helps with "oh shit, I dropped an important
- table" mistakes. Streaming replication ships changes to an off-site standby
- server, in this case set up in my second colocation in San Francisco (the main
- backup box, which we'll talk about more shortly). This takes a near real-time
- backup of the database, and has the advantage of being able to quickly failover
- to it as the primary database during maintenance and outages (more on this
- during the upcoming high availability article). Soon I'll be setting up a second
- failover server as well, on-site.
- So there are multiple layers to this:
- - ZFS & zraid prevents disk failure from causing data loss
- - ZFS snapshots allows retrieving filesystem-level data from the past
- - Continuous archiving allows retrieving database-level data from the past
- - Streaming replication prevents datacenter existence failure from causing data
- loss
- Having multiple layers of data redundancy here protects sr.ht from a wide
- variety of failure modes, and also protects each redundant system from itself -
- if any of these systems fails, there's another place to get this data from.
- The off-site backup in San Francisco (this box is called Konpaku) has a whopping
- 52T of storage in two ZFS pools, named "small" (4T) and "large" (48T). The
- PostgreSQL standby server lives in the small pool, and [borg
- backups](https://www.borgbackup.org/) live in the large pool. This has the same
- ZFS snapshotting and retention policy as Remilia, and also has drives sourced
- from a variety of vendors and batches. Borg is how important filesystem-level
- data is backed up, for example git repositories on git.sr.ht. Borg is nice
- enough to compress, encrypt, and deduplicate its backups for us, which I take
- hourly with a cronjob on the machines which own that data. The retention policy
- is hourly backups stored for 48 hours, daily backups for 2 weeks, and weekly
- backups stored indefinitely.
- There are two other crucial steps in maintaining a working backup system:
- monitoring and testing. The old wisdom is "you don't have backups until you've
- tested them". The simplest monitoring comes from cron - when I provision a new
- box, I make sure to set `MAILTO`, make sure sendmail works, and set up a
- deliberately failing cron entry to ensure I hear about it when it breaks. I also
- set up zfs-zed to email me whenever ZFS encounters issues, which also has a test
- mode you should use. For testing, I periodically provision private replicas of
- sr.ht services from backups and make sure that they work as expected. PostgreSQL
- replication is fairly new to my setup, but my intention is to switch the primary
- and standby servers on every database upgrade for HA[^2] purposes, which
- conveniently also tests that each standby is up-to-date and still replicating.
- [^2]: High availability
- To many veteran sysadmins, a lot of this is basic stuff, but it took me a long
- time to learn how all of this worked and establish a set of best practices for
- myself. With the rise in popularity of managed ops like AWS and GCP, it seems
- like ops & sysadmin roles are becoming less common. Some of us still love the
- sound of a datacenter and the greater level of control you have over your
- services, and as a bonus my users aren't worrying about $bigcorp having access
- to their data.
- The next ops thing on my todo list is high availability, which is still
- in-progress on sr.ht. When it's done, expect another blog post!