Reading notes for chapter "Eliminating Toil" in Google's SRE book

  • These are my reading notes for Google SRE - What is Toil in SRE: Understanding Its Impact
  • The goal of SRE is to spend most of your time on engineering project work rather than on operations
    • at Google, the goal is to keep toil below 50% of each SRE's time, averaged over a few quarters to a year
  • Note: Admin work is overhead, not toil
  • "So what is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Not every task deemed toil has all these attributes, but the more closely work matches one or more of the following descriptions, the more likely it is to be toil: ..."
    • manual,
      • A script is automation, but if you have to run it by hand, the hands-on time spent running it is still toil
    • repetitive,
    • automatable,
      • if it requires "human judgement" each time, it doesn't qualify as automatable
    • tactical,
    • "devoid of enduring value",
      • "If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved."
      • not clear what "in the same state" means. Perhaps maintenance tasks? Migrations?
    • scales linearly as a service grows
      • more scale => more work => more toil
  • top source of toil: interrupts
  • Toil is not always bad and is often necessary, but you don't want too much of it
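
To make the "< 50% toil, averaged over a few quarters" goal concrete, here is a minimal Python sketch. The `Quarter` record, the hour figures, and the choice of an hour-weighted average are all my own assumptions for illustration; the book does not prescribe how to track or compute this.

```python
from __future__ import annotations

from dataclasses import dataclass

TOIL_BUDGET = 0.50  # the "< 50% toil" goal from the notes above


@dataclass
class Quarter:
    """Hypothetical per-quarter time log for one SRE (not from the book)."""
    label: str
    toil_hours: float   # interrupts, on-call response, manual runs, ...
    total_hours: float  # all working hours; overhead counts here, not as toil


def toil_fraction(q: Quarter) -> float:
    """Share of one quarter's working time that went to toil."""
    if q.total_hours == 0:
        return 0.0
    return q.toil_hours / q.total_hours


def average_toil(quarters: list[Quarter]) -> float:
    """Hour-weighted toil share across all quarters (an assumed definition)."""
    total = sum(q.total_hours for q in quarters)
    if total == 0:
        return 0.0
    return sum(q.toil_hours for q in quarters) / total


def within_budget(quarters: list[Quarter]) -> bool:
    """True when the averaged toil share stays below the 50% goal."""
    return average_toil(quarters) < TOIL_BUDGET


# Made-up numbers: one heavy quarter, then two lighter ones.
quarters = [
    Quarter("Q1", toil_hours=300, total_hours=480),
    Quarter("Q2", toil_hours=200, total_hours=480),
    Quarter("Q3", toil_hours=180, total_hours=480),
]

for q in quarters:
    print(f"{q.label}: {toil_fraction(q):.1%} toil")
print(f"average: {average_toil(quarters):.1%}, within budget: {within_budget(quarters)}")
```

With these made-up numbers, Q1 alone is over the goal at 62.5%, but the three-quarter average is under 50%, which is why the goal is stated as an average rather than a per-quarter limit.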