A picture of a cargo container being lifted on a dock. It has the word backup written on the side of it.

About six weeks ago, the Grumpy Home Server crashed, and crashed hard. I noticed that some random scheduled jobs had stopped running, and a docker container or two had fallen over. I instinctively tried rebooting it to see if it helped (years spent running Windows will ingrain that behaviour in you, no matter how hard you try to get rid of it). When I rebooted the machine, the OS didn’t even boot up. Upon closer examination, the main hard-drive was deader than a moa. Disaster!

This has happened to me once before, a few years ago, when the disk array I was using fell apart for some reason. A horrible feeling hits your stomach when it all goes wrong. There are a lot of services that run on the server - photo syncing, media serving, private GitLab hosting, and more. To think that all of this was broken was pretty horrifying. Still, in an effort to make lemon vodka out of citrus fruit, I realised that I’d been planning on upgrading OpenMediaVault from v5 for some time, so this was something I could do while rebuilding things.

One thing that saved my bacon during the first crash was backups. I’ve always tried to keep a reasonable backup regime going, with files from my PC and laptop backed up to the server, then having the server backed up off-site (originally to S3, now to Backblaze). I’ve always felt quite smug in having a fallback that I could turn to if everything went completely wrong. Which it had, so I was fairly confident that I could make a reasonable recovery. I also hoped that since it was just the main drive that went on the server, the data disk array would still be usable. Time to investigate further!

Where Are Those Files Again?

I booted up the server using a USB recovery stick and took a bit of a look around the box. The disks were all still in a good state, and the SnapRAID array looked to be intact. A minor victory, but an important one. With a new SSD having arrived, I installed v7 of OMV, which looked to be quite a nice upgrade in terms of appearance, with all of the same functionality intact. Over the last few years, the folks behind OMV had started encouraging users to run software in docker containers, rather than providing native plugins for many different things. I’d gone through this exercise, and had around 15 different docker containers running on the server. I hoped that I could just fire up the docker-compose configurations, point them at the newly mounted disk arrays, and everything would just kick in…

…except that the docker-compose files were all on the failed drive. Along with my SSL certificates. And batch job scripts. And a bunch of other stuff. The sinking feeling in my stomach that had slowly dissipated over the previous few hours came back stronger than ever. It seemed that I faced more of a battle to get back to normal than I’d anticipated.

One part of my scheduled jobs included running a full disk image backup each week using the OMV backup plugin. For obvious reasons, this was being written to the disk array, so I had access to these. Copying the latest backup to the new SSD and unzipping it, I was left wondering how to get hold of what was inside it without actually restoring it to another disk (which I didn’t have). Kagi to the rescue! Some helpful soul had written a quick snippet on how to mount a full disk image onto a mount point on Linux using /dev/loop0. This ended up working flawlessly, and I immediately had access to the most recent weekly copy of all of the things I’d been looking for - scripts, certificates and docker data volumes. This was an absolutely huge relief!
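
For the curious, the general shape of it is to attach the image to a loop device with losetup and then mount the partition you’re after. Here’s a rough sketch from memory, wrapped in Python to match the rest of the snippets in this post (the snippet I found did roughly the equivalent straight from the shell; the image name, partition number and mount point below are made up):

import subprocess

# Attach the disk image to the first free loop device and scan its partition
# table (--partscan), printing the device that got allocated (e.g. /dev/loop0)
loop_device = subprocess.run(
    ['losetup', '--find', '--show', '--partscan', 'omv-disk-backup.img'],
    check=True, capture_output=True, text=True
).stdout.strip()

# Mount the root partition of the image read-only onto an existing mount point
subprocess.run(['mount', '-o', 'ro', f'{loop_device}p1', '/mnt/recovery'], check=True)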

Using the mounted image, I began to rebuild the core parts of the system. Nightly batch engine - check. Certificate generation - check. Dovecot service for old mail files - check. Things were going swimmingly. This, of course, would not last.

Image Issues

The next thing to set up was the docker containers on the new system. Having access to the old docker volumes from the old SSD’s disk image, as well as the docker-compose files, meant that this should be pretty straightforward. Just restart the containers, let docker pull down the new images, and off we go. First on the list was Cloudberry Backup (now known as MSP360). I tentatively fired this up, and it just worked. Magic! Next, the CalibreWeb container to serve up the zillions of eBooks I’ve bought on Humble Bundles that I’ll probably never read. Another easy win!

It wasn’t until I hit my GitLab instance that things went a bit wrong. Firing it up, I got a number of errors related to a corrupted database. Why was this not working when the other containers had started up just fine? It dawned on me that creating a full disk image each week is great, apart from when the disk sectors you’re backing up are in use by a running program (e.g. a database). For large files that are being written to frequently, there’s a good chance that a file will be imaged part-way through a write. The file is backed up in an inconsistent state, and it’s corrupted as far as the database is concerned.

A picture of a stylised database symbol, breaking apart.
Not as useful as I'd hoped.

Now, while this wasn’t a disaster (I can always set the GitLab instance up again, since I’m only using it to build documentation), it hinted at bigger problems. My backups weren’t completely usable. I couldn’t just slot all of my docker data volumes onto a new drive and pick up where I left off. I needed a way to stop the containers so that they weren’t writing to the data files, back up the data volumes, then restart them again. A quick rummage around the intarwebs didn’t find anything obvious that would do this, so in the best traditions of hacker nerds everywhere, I rolled up my sleeves and wrote something to do it myself.
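
The shape of what I wanted was simple enough, something along these lines (a bare-bones sketch using the docker compose CLI directly; the real thing, described below, uses a proper library and adds filtering, config and error handling):

import subprocess
import tarfile

def backup_project(compose_file, volumes, target_backup_file):
    # Stop the containers so nothing is writing to the data volumes mid-backup
    subprocess.run(['docker', 'compose', '-f', compose_file, 'stop'], check=True)
    try:
        # Archive the bind-mounted directories while everything is quiescent
        with tarfile.open(target_backup_file, 'w:gz') as tar:
            for volume in volumes:
                tar.add(volume)
    finally:
        # Always bring the containers back up, even if the backup failed
        subprocess.run(['docker', 'compose', '-f', compose_file, 'start'], check=True)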

I Can See Clearly Now

I spent a few days putting together the great Docker-Composer-Backer-Upper, a Python utility script to manage containers run by docker-compose. There were a couple of interesting things that I tried along the way that made this quite a fun little beast to put together.

The first was the choice to use uv to manage my project dependencies, and ruff as a linter and formatter. I’d been doing some reading on these over the last few weeks, and was impressed by how much faster they were than other similar tools (Rust FTW!). The way the tools worked also seemed quite natural, so this was a good opportunity to try them both out on something real. They both lived up to expectations, and the fact that uv integrates nicely with both ruff and PyCharm means that I didn’t have any issues messing about with virtual environments. I’ve gone as far as removing virtualenv from my standard machine setup and replacing it with uv, so it’s here to stay for me.

The next thing to tackle was the parsing of the docker-compose YAML files. While I could just use the yaml module to load in the file and write some dictionary inspection logic, I preferred the idea of having something that could load a compose file into an actual object with real properties and attributes. While I’d heard of Pydantic before and knew roughly what it did, someone at $COMPANY pointed me at datamodel-code-generator, which made it easy to generate Pydantic datamodel classes from JSON schemas. This, coupled with the easy availability of a schema for docker-compose, meant that it was super simple to get a class generated that did everything I wanted.
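
Loading a compose file into the generated model then only takes a few lines. Here’s a minimal sketch of the kind of helper this gives you (assuming Pydantic v2, and that the generated module and class are called compose_spec_model and ComposeSpecification; the actual names depend on how you run datamodel-codegen):

import yaml

# compose_spec_model is the module generated by datamodel-codegen from the
# docker-compose JSON schema; the name here is illustrative
from compose_spec_model import ComposeSpecification

def _get_compose_model(compose_filename):
    # Parse the raw YAML, then validate it into the generated Pydantic model
    with open(compose_filename) as compose_file:
        raw_compose = yaml.safe_load(compose_file)
    return ComposeSpecification.model_validate(raw_compose)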

The next thing on the list was how to control docker-compose from within a Python program. Luckily, a quick Kagi search led to docker-composer-v2. It took me a little while to get the hang of how it worked, but once I did, it gave me all the functionality I needed to inspect, stop, and start docker-compose containers. It also introduced me to loguru, a logging library that works very simply out of the box with no real setup needed (apart from some weirdness around modifying the name of the logger; I’d have thought that would be a bit easier, but hey ho).
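
To give a flavour of how little setup loguru needs (a purely illustrative call):

from loguru import logger

# No handlers, formatters or config required - import the logger and go
logger.info('Stopping containers defined in {}', 'docker-compose.yml')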

Putting It All Together

So, with a number of new libraries and tools in place, it was just a matter of putting it all together. My first attempt was pretty quick and crude, a simple CLI that took several params to control it. I realised fairly early on that I didn’t want to back up docker volumes, and would just deal with bind mounts instead. Given the relatively simple nature of my docker setup (small number of containers, single machine), volumes have never added much value for me, so this wasn’t a deal breaker.

docker-composer-v2 made it easy to list the names of the compose files that were running. I could then read these in and look through all the services for any volumes that started with a / character:

# Get Pydantic datamodel from docker compose YAML file
compose_model = _get_compose_model(compose_filename)

for service_name, service in compose_model.services.items():
    for volume in service.volumes or []:
        # Bind mounts are written as 'host_path:container_path[:options]',
        # so anything starting with '/' is a host directory worth backing up
        if volume.startswith('/'):
            # Keep just the host-side part of the mount specification
            volume = volume.split(':')[0]

            do_stuff_with_volume(volume)

It also became apparent that I needed some form of filtering, because a lot of the containers that I was running mounted things like photo or music directories with hundreds of gigs worth of data in them. Adding a simple substring filtering facility helped at first, but I quickly expanded this to allow regexes to be used to control whether or not a directory would be backed up. I also realised that there were some running containers that I didn’t want to bring down, as I was backing them up in different ways, so a container filter was added next. It also made sense to move the exclusions and other settings into a config.toml file, rather than passing them in on the command line.
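
The filtering itself doesn’t need to be anything clever. A rough sketch of the idea (the config key and patterns here are illustrative rather than the exact ones the script uses):

import re
import tomllib  # stdlib in Python 3.11+

# e.g. config.toml might contain: volume_exclusions = ['^/srv/media/', '/photos$']
with open('config.toml', 'rb') as config_file:
    config = tomllib.load(config_file)

volume_exclusions = [re.compile(pattern) for pattern in config.get('volume_exclusions', [])]

def should_backup(volume):
    # Skip any bind-mount path that matches one of the exclusion regexes
    return not any(pattern.search(volume) for pattern in volume_exclusions)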

I’d started out by using shutil to just copy entire folders from the source location to a backup folder. However, this got a bit chunky after a while, and I ran into a few weird issues with symlinks in some of the container volumes when backing up with Cloudberry, so I looked into what it would take to back up directories to a compressed tar file. As it turns out, it’s really easy:

import os
import tarfile

...

# Create a gzip-compressed tar archive and add each volume directory to it
with tarfile.open(target_backup_file, 'w:gz') as tar:
    for volume in volumes:
        # Skip any bind-mount paths that no longer exist on disk
        if os.path.exists(volume):
            tar.add(volume)

Finally, any program that does lots of file and container manipulation needs a dry-run flag. Adding this in helped to quickly spot exclusions that needed to be added, without having to wait ages for the full archiving run to complete.
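
In practice that just means gating the destructive steps behind the flag, something like this (an illustrative sketch rather than the script’s exact code):

import tarfile
from loguru import logger

def archive_volumes(volumes, target_backup_file, dry_run=False):
    if dry_run:
        # Only report what would be archived; don't create or modify anything
        for volume in volumes:
            logger.info('Would add {} to {}', volume, target_backup_file)
        return

    with tarfile.open(target_backup_file, 'w:gz') as tar:
        for volume in volumes:
            tar.add(volume)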

Final Thoughts

I’ve released this to the public on my GitHub repository here. I’m sure there are things that could be improved, but so far, it’s working well for me here in Grumpy Labs. With this in place, alongside some extra folders included in my nightly off-site backup, I feel a lot more confident that the next time something blows up (and it will happen, it’s the nature of things), I’ll be in a much stronger position to get the server up and running again quickly.