Jon Moshier / Notes / Managing Toil budding
Note · From the Notebook

Managing Toil

A decision framework for business-critical manual work that the business won't fund automating: when to eliminate, automate, reduce, or just tolerate it, and how to change the funding decision.

Toil is the manual, repetitive operational work that keeps a service running and produces no lasting value. The hard case is not toil you can obviously kill. It is toil that is business-critical and yet never important enough to earn a slot on the roadmap, so it stays manual forever. This note is about that quadrant: critical, unfunded, permanent.

What counts as toil

The term comes from Google’s Site Reliability Engineering book (Beyer et al., 2016), which defines toil precisely. Work is toil to the degree that it is manual, repetitive, automatable, tactical (interrupt-driven, not strategic), and devoid of enduring value, and to the degree that it scales linearly with the service. Doing a task for the first or second time is not toil; toil is the work you do over and over. A one-off migration is not toil. Running the same export by hand every Monday is.

SRE attaches a number to it: a Toil Budget that keeps toil under 50% of each engineer’s time, reserving the other half for engineering that makes the toil go away. Google’s own quarterly surveys put the average around 33%, with a range from 0% (pure project teams) to 80% (a team drowning). The 50% cap exists because toil expands to fill the time available and quietly starves the engineering that would reduce it.

The trap: critical but not a priority

The reason this specific toil never gets funded is a measurement failure, not a matter of laziness. A prevented incident is invisible by definition, and most delivery rubrics measure what shipped, not what was kept running. The work that keeps the lights on leaves no artifact in the metrics a business watches, so it never competes well against visible feature work. It is the operational cousin of Glue Work, the essential connective effort that most rubrics fail to credit. The same pattern shows up starkly in open-source infrastructure, where the more essential and the more invisible a library is, the less funding it tends to attract.

This is a Tragedy of the Commons inside the org. The task benefits everyone and is owned by no one with budget authority, so each planning cycle rationally spends its scarce capacity on something attributable. The toil persists precisely because it works: it never fails loudly enough to force a decision, and someone always absorbs it. This is also why delivery-metric frameworks like DORA Metrics and SDLC Delivery Metrics for AI-Assisted Engineering can look healthy while a team slowly suffocates under interrupt work the dashboards never counted.

The decision: eliminate, automate, reduce, tolerate

Automation is the second option, not the first. The SRE workbook’s ordering is roughly eliminate (its word is avoid), automate, reduce, then tolerate, and the highest-leverage move is usually to delete the work rather than speed it up. Ask whether the task needs to exist at all: a report nobody reads, a manual gate that a policy change would remove, an export that exists because two systems were never connected. Eliminating the source beats automating the symptom, because automated toil is still a system you now own and maintain.

When you do reach for automation, the naive economics is xkcd 1205: a break-even table of how long you can spend automating before you lose more time than you save over five years. A five-minute task done daily buys you about six days of automation budget, counted around the clock: roughly 150 hours. The table is a useful floor and a misleading ceiling. The critique of it is that pure time-saved undercounts the real value: automation also removes error risk (a botched manual step can cost far more than the minutes it took), pays off faster than projected because tasks recur more often than people estimate, documents the process as code, and lets someone other than the one expert run it. A task that clears the bar on risk and bus-factor can be worth automating even when the time math says no.
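The table's arithmetic is easy to reproduce. A minimal sketch, assuming the comic's five-year horizon and a task that recurs every calendar day; the function name `break_even_hours` is illustrative only and does not come from any library:

```python
def break_even_hours(minutes_per_run: float, runs_per_year: float, years: float = 5.0) -> float:
    """Hours the task consumes over the horizon, i.e. the most you can
    spend automating before automation costs more time than it saves."""
    return minutes_per_run * runs_per_year * years / 60.0


# The five-minute daily task from the text.
hours = break_even_hours(minutes_per_run=5, runs_per_year=365)
print(f"{hours:.0f} hours, or {hours / 24:.1f} round-the-clock days")
print(f"{hours / 8:.0f} eight-hour working days")
```

The result is about 152 hours. That matches the table's "six days" only when days are counted around the clock; in eight-hour working days the same budget is closer to 19. The sketch covers time alone, so the risk and bus-factor arguments above still have to be weighed separately.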

The realistic middle, when the business will not fund a proper productized solution, is deliberate partial automation. You are not building a product; you are buying down risk and time inside your own toil budget.

Make it visible to change the decision

The durable fix is not a better script, it is changing the funding decision, and the lever for that is measurement. Toil that is untracked is unarguable. Teams that want toil work funded start by counting it: log every interrupt and manual task for a few weeks, attach hours, and convert “this is annoying” into “this consumes 15% of the team and rises with every new customer.” The linear-scaling property is the strongest argument you have, because it lets you project when the toil crosses 50% and the team stalls. This is the same move SRE makes with an error budget: convert invisible reliability work into a hard number that forces a tradeoff leadership would otherwise defer indefinitely.
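The projection can be sketched in a few lines. This is a rough illustration under stated assumptions, not a forecasting model: toil is taken to be strictly proportional to customer count, customers arrive at a constant rate per month, and team capacity stays fixed. Every name and number below is hypothetical.

```python
import math


def months_until_cap(
    toil_hours_per_customer: float,  # per month, measured from the toil log
    customers_now: int,
    new_customers_per_month: float,  # assumed constant growth
    team_hours_per_month: float,     # assumed fixed headcount
    cap: float = 0.5,                # the 50% toil budget
) -> float:
    """Months until toil reaches `cap` of team capacity.

    Returns 0.0 if the cap is already reached and math.inf if there is
    no growth to push toil up to it.
    """
    customers_at_cap = cap * team_hours_per_month / toil_hours_per_customer
    if customers_now >= customers_at_cap:
        return 0.0
    if new_customers_per_month <= 0:
        return math.inf
    return (customers_at_cap - customers_now) / new_customers_per_month


# Hypothetical team: 800 engineer-hours per month, 100 customers,
# 1.2 hours of toil per customer per month (15% of capacity today).
team_hours = 800.0
share_now = 1.2 * 100 / team_hours
months = months_until_cap(1.2, 100, 10, team_hours)
print(f"toil today: {share_now:.0%}, reaches 50% in about {months:.0f} months")
```

With these made-up inputs the team sits at 15% today and reaches the cap in a little under two years. The point of the exercise is the shape of the argument rather than the precision of the date: a measured per-customer cost plus a growth rate turns the complaint into a deadline.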

Counting toil is itself a metric, so it inherits Goodhart’s Law: the moment “hours of toil” becomes a target, people reclassify work to hit the number. Measure it to inform the decision, not to grade the team. The goal of the count is to move the task from invisible to fundable, not to create a new scoreboard.

Sometimes the count lands and leadership still says no. Visibility is necessary, not sufficient. At that point the honest options narrow to three. Enforce the toil budget as a hard cap and let the queue of unfunded critical work become visibly leadership’s backlog, not yours, so the next incident lands as a decision they declined rather than one you hid. Let the service degrade in a controlled, announced way, since toil quietly absorbed forever removes the very pressure that would fund a fix. Or accept it, staff it explicitly, and rotate it so the cost is shared rather than dumped on one person until they leave. The failure mode is the fourth option nobody chooses on purpose: absorb it silently and indefinitely, which converts a funding problem into an attrition problem.

Try it

Run a two-week toil audit (a few hours of logging, then 30 minutes of math). Have the team tag every interrupt-driven, repetitive task in a shared sheet with a duration and a frequency. At the end, for each recurring task compute the xkcd break-even: (time per run) x (runs per year) x (5 years), expressed in hours, versus your honest estimate of hours to automate. Sort by the ratio of hours saved to hours invested. What you are looking for is the gap between the tasks that feel worst and the tasks that score worst; they are often not the same, because the loudest toil is not always the most expensive. The tasks that are both high-frequency and risk-laden are the ones to script first, and the total gives you the funding argument.
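The math step of the audit might look like the following sketch. The task names, durations, and estimates are invented for illustration, and the two-week counts are extrapolated to a year by simple multiplication (26 two-week periods), which is an assumption that ignores seasonality.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    minutes_per_run: float
    runs_in_audit: int        # occurrences logged during the two-week audit
    hours_to_automate: float  # the team's honest estimate

    def hours_saved_5y(self) -> float:
        runs_per_year = self.runs_in_audit * 26  # 26 two-week periods per year
        return self.minutes_per_run * runs_per_year * 5 / 60

    def ratio(self) -> float:
        # Above 1.0, automation pays for itself on time alone.
        return self.hours_saved_5y() / self.hours_to_automate


# Hypothetical audit log.
tasks = [
    Task("Monday export", 30, 2, 16),
    Task("Cert rotation", 90, 1, 40),
    Task("Account unlock", 5, 40, 8),
]

# Highest payoff per hour of automation effort first.
ranked = sorted(tasks, key=Task.ratio, reverse=True)
for t in ranked:
    print(f"{t.name:15} saved={t.hours_saved_5y():6.0f} h  ratio={t.ratio():5.1f}")

total_hours_per_year = sum(t.hours_saved_5y() for t in tasks) / 5
print(f"total toil: {total_hours_per_year:.0f} h/year")
```

In this invented log the five-minute account unlock outranks the 90-minute certificate rotation, which is the kind of gap between felt cost and measured cost the audit is meant to expose. The ratio captures time only; risk and bus-factor would need their own column or a judgment call when picking what to script first.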
