DigDag - Orchestrating The Madness
Quite some time ago, I set up my very first home server. Over the years, it’s become more and more central to our day-to-day technical needs, including (but not limited to!) photo backups from all our various devices, syncing files from desktops to the server for archival purposes, documentation management, photo album displays, and most importantly, backing up important files to an off-site location.
With all of this activity, I’ve been running more and more nightly jobs. In typical Grumpy Metal fashion, my approach to coordinating them has been a bit half-arsed. Yes, I managed all of this via cron, spacing my jobs estimated_job_time + random_interval apart. This has (unsurprisingly!) led to all sorts of problems, primarily when an earlier job takes a bit longer to run than expected, and the next job then fails for seemingly random reasons.
Eventually, tired of dealing with the fallout of the random timing weirdness, I decided I’d had enough. I needed… a better way.
DAGging Around⌗
What I needed was a dependency graph of the jobs on the server, only starting a job once the jobs it depends on were complete. In computer science, this is known as a DAG - a Directed Acyclic Graph. Begin by defining a starting state/time, and build up a graph of jobs and their dependencies. Bonus points come from running them in a framework that can pick up from where a job batch went wrong, re-running only the failed nodes in the graph rather than the entire set of jobs just because something failed near the end.
A bit of Googling later, a helpful StackOverflow answer provided me with the names of a few candidate frameworks that might be useful. I examined each of them in turn, and came up with the following.
| Framework Name | Description | Workable? | Reason |
|---|---|---|---|
| cron | The one, the only, the original. | Nope! | Already using it, and dependency management isn’t built in. |
| Apache Airflow | Powerful, cloudy, scalable, widely used. | Not really. | Too heavy for my small little server. |
| Dagobah | Fun name, Python-based. | Sadly not, despite the name. | Trying to avoid pip and virtual envs, keeping it simple. |
| Luigi | Spotify’s DAG scheduler. | Nope. | Even Luigi’s author thinks it’s all a bit of a mess. Plus, Python again. |
| Azkaban | LinkedIn’s job orchestrator. Not a social network. | No way. | Too big and chunky again, and mainly for use with Hadoop. Plus, Harry Potter references. |
| DigDag | Lightweight, Java-based. | Take a guess. Go on. | Well, it’s the subject of this article, so keep reading… |
DigDugDag⌗
Not to be confused with Dig Dug the video game (not my favourite game from that era, I might add), DigDag is a Java-based job orchestration engine with a surprisingly deep feature set. As well as managing job dependencies and running shell scripts, it can also work with AWS and GCP. It ships as a fairly small executable, and uses simple YAML text files to control job configurations. It can run as either a stand-alone utility, or as a job server with a simple built-in web interface for basic administration. The features ticked all the boxes for what I was looking for, so I dived in, casually assuming that the various warnings about the documentation being a bit lacking would be nothing I couldn’t work my way around.
I’m really very good at ignoring subtle hints like that. I’d be terrible in a Hollywood slasher film.
The Getting Started guide is quick and to the point. A couple of shell commands and a JRE download later, the executable was sitting there, waiting to be run.
Key Concepts⌗
DigDag organises itself into projects, each of which has one or more workflows. A workflow, in turn, has one or more tasks. Tasks can contain sequences of other tasks, and groups of tasks can be executed either sequentially or in parallel. So, plenty of ways to structure things.
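To make that concrete, here’s a minimal sketch of a workflow definition. The task names and messages are invented for illustration; the structural bits (the `+task:` nesting, the `_parallel: true` flag, and the `echo>` operator) are standard DigDag:

```yaml
# sketch.dig - illustrative only; task names are made up
+prep:
  _parallel: true          # children of this group run at the same time
  +first_thing:
    echo>: runs in parallel with second_thing
  +second_thing:
    echo>: runs in parallel with first_thing

+afterwards:
  # top-level tasks run in order, so this waits for +prep to finish
  echo>: prep is done
```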
The rough approach I’ve followed is below (a sketch of it as a workflow definition follows the list):
- Prep Task (all done in parallel)
  - Clean up dangling docker images
  - Backup Prep (copying config files)
  - SSL Cert maintenance (these are run sequentially)
    - Update certificates if need be
    - Distribute certificates to hosts
  - Photos
    - Update photo timeline using my world-famous (in New Zealand) photo renaming tool
    - Trigger update of PhotoPrism indices
- Raid Task (run sequentially)
  - Sync
  - Scrub
- Off-site Backup
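Here’s roughly how that shape translates into a .dig file. This is a cut-down sketch: the script paths are hypothetical stand-ins for my actual commands, but the nesting, `_parallel` flag, and `sh>` operator are how DigDag expresses the structure above:

```yaml
# nightly.dig - a sketch; all script paths below are made up
timezone: Pacific/Auckland          # assumption: set to your local timezone

+prep:
  _parallel: true                   # the three prep jobs run together
  +docker_cleanup:
    sh>: docker image prune -f
  +backup_prep:
    sh>: ./scripts/copy_configs.sh
  +ssl_certs:
    # children of a group run sequentially by default
    +update:
      sh>: ./scripts/update_certs.sh
    +distribute:
      sh>: ./scripts/distribute_certs.sh
  +photos:
    +rename:
      sh>: ./scripts/rename_photos.sh
    +index:
      sh>: ./scripts/photoprism_index.sh

+raid:
  +sync:
    sh>: ./scripts/raid_sync.sh
  +scrub:
    sh>: ./scripts/raid_scrub.sh

+offsite_backup:
  sh>: ./scripts/offsite_backup.sh
```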
The dependencies and the way the jobs are structured are pretty straightforward here, and work well. I’ve got all of this defined as a single workflow in a standalone project. If need be though, workflows can depend on other workflows, which allows things to be broken up a bit if a single definition gets too unwieldy.
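For reference, the cross-workflow dependency mechanism is the `require>` operator, which waits on (or kicks off) another workflow before continuing. A minimal sketch, with a made-up workflow name:

```yaml
# daily.dig - a sketch; "cert_maintenance" is a hypothetical workflow name
+certs_first:
  require>: cert_maintenance   # completes the other workflow before moving on

+the_rest:
  echo>: cert_maintenance has finished
```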
Weekly Woes⌗
The only other part that led to some slightly lateral thinking, and to searching through the project’s GitHub issues, was how to schedule a weekly task alongside the daily tasks. Looking at the approach outlined above, the weekly task is similar to the Backup Prep step, copying some chunkier and less frequently updated files into the off-site backup folder.
How could I ensure that the backup that runs daily waits for the weekly task to finish? Cross-workflow dependencies might have worked, but having spent a few days messing about with the various task operators and flow-control structures that DigDag provides, I felt reasonably certain that having a daily workflow depend on a weekly scheduled workflow would only allow the daily workflow to run once per week.
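For context, scheduling in DigDag is done with a per-workflow `schedule:` block, and each workflow lives in its own file. A sketch of the daily/weekly split (the times here are invented):

```yaml
# daily.dig
schedule:
  daily>: 02:00:00

# weekly.dig - a separate file, shown alongside for brevity
schedule:
  weekly>: Mon,03:00:00
```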
The key to solving this wasn’t actually on the web page describing scheduling. Instead, it was in the section on Calculating Variables in the Workflow Definition documentation. DigDag includes Moment.js for handling simple JavaScript date/time calculations. Checking the day of the week is then pretty straightforward in conjunction with the if> operator:
```yaml
+weekly_backup_task:
  if>: ${moment(session_time).isoWeekday() == 1}
  _do:
    +copy_stuff:
      echo>: Only runs on a Monday!
```
This works as you’d expect it to, so it’s possible to lump weekly tasks in with the daily tasks via an if-then-else type operator. GrumpyHappy days.
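There’s no else branch in the snippet above. The documentation for newer releases lists an `_else_do:` parameter on if>, but if you’d rather not rely on that, a second task with the inverted condition does the job. This workaround is my own sketch, not something from the docs:

```yaml
+weekly_or_not:
  +weekly:
    if>: ${moment(session_time).isoWeekday() == 1}
    _do:
      echo>: Monday - run the weekly extras
  +not_weekly:
    if>: ${moment(session_time).isoWeekday() != 1}
    _do:
      echo>: Not Monday - carry on with the daily jobs
```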
A Couple Of Small Grumps⌗
Overall, DigDag is exactly what I wanted. The workflows behave in an intuitive manner, and enough operators are provided that it’s capable of some powerful things. It’s also possible to write your own extensions, which adds to the potential.
There are a couple of minor concerns though. First, the documentation. People weren’t wrong when they warned it wasn’t quite up to scratch. There’s some good information in there, but sometimes it’s just not in the place you’d expect to find it. As such, you end up scrambling around a fair bit, reading through more documentation than you’d prefer to, and sometimes through GitHub issues too, to find that elusive little nugget that tells you what you need to know. In practice, that means jumping in and trying a few different things out to see how they behave. My Test project with its Test workflow got a lot of use as I developed my formal daily workflow.
Second, the project itself doesn’t seem to get updated very much. This could just be a sign of its maturity, although at the time of writing, there are 156 open issues logged, with only sparse discussion around many of them. The project is owned by Treasure Data, a company I hadn’t heard of previously, so I guess it’s got some funding behind it; but if changes are needed, or you’re having trouble working something out and need to raise an issue, it may take a while to get a response. I’m hopefully just being overly grumpy here though, as I have no first-hand experience either way.
Closing Words⌗
Overall, DigDag has been working well for me since I started running it. Once you get the hang of it, it’s simple to use, and despite some frustrations in trying to find out exactly how to do something, I’ve managed to get everything I needed in place. I can recommend it to others needing similar functionality, particularly in a home server environment.
OVERALL: 2 grumps out of 10.