Updated: Aug 25, 2025

Engineering Toil: The Real DevOps Bottleneck

Aharon Twizer

CEO & Co-founder

Today, productivity is a key priority for software engineering teams. Every software development, DevOps and cloud team wants to work as productively, efficiently and cost-effectively as possible. However, teams frequently get bogged down in manual, repetitive tasks and firefighting to keep the lights on, which limits their ability to move the needle on technology innovation for the organization.

In the DevOps and R&D world, this work is frequently referred to as engineering toil – the bottleneck that DevOps teams are constantly fighting against. This article examines what engineering toil is, why it happens and what actions your DevOps team can take to eliminate excess toil.

Why Scale and Velocity Are Challenging for DevOps

Right now, the scale and velocity of software development present an enormous challenge for enterprises, with software being built faster than it can be secured. In parallel, organizations expect new infrastructure and cloud workloads to be spun up just as quickly, often with little or no cloud governance around them. In reality, as a cloud environment matures, more cloud accounts are added and configurations evolve, so the environment becomes steadily more complex.

This leads to bloated clouds in which risk accumulates. Such environments are difficult to manage, inefficient, and expose the organization to a growing number of security incidents. The advent of AI-powered development has raised the stakes further. AI is already accelerating software delivery, which means more code, more changes and more infrastructure to support it. If you’re still relying on manual processes to manage your environment, AI just adds fuel to the fire.

How Engineering Toil Impacts DevOps Productivity

This scenario often leads to excessive engineering toil. There is growing evidence that, in today’s high-stakes, high-velocity cloud environments, engineering toil isn’t just annoying, it is incredibly expensive. It eats up valuable engineering time, slows down delivery, blocks innovation and undermines the business’s ability to build a competitive advantage.

But toil is common throughout most programming and DevOps positions, whether we like it or not. And when it comes to platform engineering, the chances of encountering toil are high. You can think of toil as those tedious workarounds that should be automated but rarely are. This could be due to a lack of standard configurations for deployments, meaning engineers must copy and paste data from one module to another, or it could be an integration that has not yet been automated.

Less Than 50% of an Engineer’s Time Should Be Spent on Toil

Google’s SRE Book defines toil as manual, repetitive, automatable work that scales linearly, and advocates that organizations keep toil below 50% of each engineer’s time. It emphasizes automation and strategic engineering practices to reduce toil. Explore the chapter on eliminating toil.

Additionally, a LeadDev article highlights how unchecked toil can lead to burnout, errors, low morale and career stagnation, with employees voting with their feet. If the DevOps engineers who created your infrastructure leave, your corporate knowledge and experience walk out the door with them.

Engineers have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout, boredom, and discontent. The article advocates eliminating toil through automation, system redesign, or both.

Furthermore, a recent academic paper from Eindhoven University of Technology, titled “Toil in Organizations That Run Production Services”, found that toil is more nuanced than Google’s definition suggests, and that the challenges in reducing it include cultural inertia combined with a lack of time to automate. The paper nonetheless emphasizes that a concerted effort to reduce toil yields positive outcomes for both individuals and organizations.

In summary, the research shows that work machines should be doing is still being done manually. If you’re running cloud infrastructure at scale without a purpose-built automation platform, toil will just continue to escalate.

Importance of Prioritizing Long-Term Engineering Projects

The good news is that toil is measurable: surveys and ticket metrics can help to quantify it. Reducing toil requires engineering effort in the form of automation and system improvements, with teams prioritizing long-term engineering projects over reactive, repetitive tasks.
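As a rough illustration of quantifying toil from ticket metrics, the sketch below computes the share of logged hours spent on tickets tagged as toil and compares it with the 50% ceiling mentioned earlier. The `Ticket` record, its field names and the sample data are hypothetical; a real team would pull equivalent data from its own ticketing system or surveys.

```python
from dataclasses import dataclass


# Hypothetical ticket record: field names are illustrative,
# not taken from any real ticketing API.
@dataclass
class Ticket:
    title: str
    hours: float
    is_toil: bool  # tagged by the team during triage or via a survey


def toil_share(tickets: list[Ticket]) -> float:
    """Return the fraction of logged hours spent on tickets tagged as toil."""
    total = sum(t.hours for t in tickets)
    if total == 0:
        return 0.0
    return sum(t.hours for t in tickets if t.is_toil) / total


TOIL_BUDGET = 0.50  # the 50% ceiling cited from Google's SRE Book

# Made-up sample data for one engineer's week.
tickets = [
    Ticket("Manually provision VM for QA", 3.0, True),
    Ticket("Approve Terraform change in Slack", 1.0, True),
    Ticket("Design module for shared VPC", 6.0, False),
]

share = toil_share(tickets)
print(f"Toil share: {share:.0%}")  # Toil share: 40%
print("Over budget" if share > TOIL_BUDGET else "Within budget")
```

Tracking this number per team over time is usually more informative than a single snapshot, because it shows whether automation work is actually paying off.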

However, it is important to recognize that not all toil is bad. Small amounts can be tolerable and even satisfying for your engineers: predictable, repetitive tasks can produce a sense of accomplishment and quick wins, and they can be low-risk, low-stress activities. Some people gravitate toward tasks involving toil and may even enjoy that type of work. But be warned, excess toil is harmful – it impacts productivity and velocity.

A Google Cloud blog post offers practical steps for identifying, measuring, and reducing toil. In particular, it encourages using Infrastructure as Code (IaC) and automation as key strategies.

How Infrastructure as Code (IaC) Helps Reduce Engineering Toil

Infrastructure as Code (IaC) is a powerful tool in the fight against engineering toil. Defining, provisioning, and managing infrastructure through code gives teams better control over their cloud. Layered onto this, engineering teams need automation, and this is where tools like Terraform come in, enabling DevOps to define, provision, and manage cloud and on-prem resources using declarative configuration. In effect, Terraform transforms manual, repetitive tasks into automated, scalable processes: machine-readable configuration files define the infrastructure (servers, networks, databases), automate provisioning and configuration, and enable version control and repeatability.

Here’s how IaC directly tackles the characteristics of toil:

Toil Trait        | How IaC Helps
Manual            | Automates the setup and configuration of infrastructure
Repetitive        | Scripts can be reused across environments and deployments
Automatable       | IaC is inherently automatable, you run once, apply anywhere
Tactical          | Shifts focus from reactive fixes to proactive system design
No enduring value | IaC creates reusable templates that add long-term value
Scales linearly   | IaC enables scalable infrastructure without increasing manual effort
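To make the declarative idea concrete, here is a minimal sketch in Python of comparing a desired state (what the code says) with an actual state (what exists) and deriving the actions needed to reconcile them. This is only an illustration of the concept, not how Terraform is implemented; the `plan` function, the resource names and the attributes are all hypothetical.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired state (code) with actual state and list the actions needed."""
    return {
        # Declared in code but not yet present.
        "create": sorted(desired.keys() - actual.keys()),
        # Present in both, but with different attributes.
        "update": sorted(
            k for k in desired.keys() & actual.keys() if desired[k] != actual[k]
        ),
        # Present but no longer declared in code.
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical resources, keyed by an illustrative address.
desired = {
    "vm.web": {"size": "small"},
    "db.main": {"engine": "postgres"},
}
actual = {
    "vm.web": {"size": "large"},
    "bucket.logs": {"versioning": False},
}

print(plan(desired, actual))
# {'create': ['db.main'], 'update': ['vm.web'], 'delete': ['bucket.logs']}

# Once actual matches desired, the same comparison yields no work at all,
# which is what makes the approach repeatable across environments.
print(plan(desired, desired))
# {'create': [], 'update': [], 'delete': []}
```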

The Benefits of Using IaC to Eliminate Toil

Key benefits of using IaC to eliminate toil include:

  • Consistency: Eliminates “it works on my machine” issues by standardizing environments.
  • Speed: Rapid provisioning and updates reduce downtime and manual effort.
  • Reliability: Reduces human error and improves system stability.
  • Version Control: Infrastructure changes are tracked and auditable.
  • Self-healing Systems: Can be integrated with monitoring to auto-remediate issues.

Tackling Toil in Terraform and Cloud Workflows

So, if you are ready to tackle toil, here is a list of common engineering toil issues found in Terraform and cloud workflows:

  • Manually running terraform plan to preview changes before applying them
  • Approving and tracking changes in Slack or spreadsheets
  • Debugging cloud drift without full visibility
  • Writing custom scripts to enforce policies
  • Manually provisioning a VM
  • Reviewing code for basic issues, such as open S3 buckets and bad IAM roles
  • Fielding “can you deploy this?” tickets that swamp your SREs

While each task might not sound that onerous, if you multiply each of these by every developer, in every environment, every week, it is easy to see how arduous toil can become.
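One of the review chores listed above, spotting open S3 buckets, can be automated with a small check over Terraform's JSON plan output (produced with `terraform show -json <planfile>`). The sketch below is a simplified example under stated assumptions: it only looks at the `acl` attribute on `aws_s3_bucket` and `aws_s3_bucket_acl` resources, the function name `find_public_buckets` is hypothetical, and the sample plan is hand-written and trimmed to the fields the check reads. Real policies usually cover far more cases.

```python
import json

# Canned S3 ACLs that grant public access.
PUBLIC_ACLS = {"public-read", "public-read-write"}
# Resource types this simplified check inspects (an assumption of the sketch).
S3_TYPES = {"aws_s3_bucket", "aws_s3_bucket_acl"}


def find_public_buckets(plan_json: str) -> list[str]:
    """Return addresses of planned S3 resources whose ACL would be public."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        # "after" is null for resources that are being destroyed.
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") in S3_TYPES and after.get("acl") in PUBLIC_ACLS:
            flagged.append(rc["address"])
    return flagged


# Hand-written sample, trimmed to the fields used above.
sample_plan = json.dumps({
    "resource_changes": [
        {
            "address": "aws_s3_bucket_acl.assets",
            "type": "aws_s3_bucket_acl",
            "change": {"actions": ["create"], "after": {"acl": "public-read"}},
        },
        {
            "address": "aws_s3_bucket_acl.logs",
            "type": "aws_s3_bucket_acl",
            "change": {"actions": ["create"], "after": {"acl": "private"}},
        },
        {
            "address": "aws_s3_bucket_acl.old",
            "type": "aws_s3_bucket_acl",
            "change": {"actions": ["delete"], "after": None},
        },
    ]
})

print(find_public_buckets(sample_plan))  # ['aws_s3_bucket_acl.assets']
```

Run in a pipeline on every pull request, a check like this removes one recurring manual review step for every developer and environment.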

Why Toil Often Goes Unnoticed

So why does toil frequently go unnoticed, even if you are using Terraform? A patchwork of GitHub repositories, Jenkins jobs, in-house scripts, and Slack approvals isn’t an end-to-end platform. It’s a mishmash of tools, and it’s where toil lives and multiplies. As a result, most teams don’t even realize how much toil they’re carrying. Toil creeps in quietly. But it scales quickly.

How ControlMonkey Eliminates Engineering Toil

ControlMonkey was built to erase engineering toil from the Terraform workflow. It’s the only complete solution for end-to-end Terraform automation, allowing DevOps to manage cloud infrastructure with the same confidence that they manage software delivery.

Terraform Automation, Reimagined

ControlMonkey enables self-service deployments and PR-based workflows, with policy enforcement baked in. There are no custom scripts and no friction, which enables fast infrastructure provisioning without DevOps bottlenecks. ControlMonkey:

  • Auto-runs plans and applies with approval gates
  • Enables templatized environments via QualityGates
  • Imports legacy resources into Terraform in seconds
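For readers who want a feel for what an automated plan step with an approval gate involves, here is a generic sketch. It is not ControlMonkey's implementation. It relies on the documented behavior of `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error and 2 when changes are pending. The `PlanOutcome`, `run_plan` and `next_step` names are hypothetical, and where the `approved` flag comes from (for example, a pull request review) is left as an assumption.

```python
import subprocess
from enum import Enum


class PlanOutcome(Enum):
    """Exit codes of `terraform plan -detailed-exitcode`."""
    NO_CHANGES = 0
    ERROR = 1
    CHANGES_PENDING = 2


def run_plan(workdir: str) -> PlanOutcome:
    """Run a plan non-interactively and map the exit code to an outcome.

    Requires the terraform binary on PATH. Any other exit code
    (for example, a killed process) raises ValueError.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-out=tfplan"],
        cwd=workdir,
    )
    return PlanOutcome(result.returncode)


def next_step(outcome: PlanOutcome, approved: bool) -> str:
    """Decide what the pipeline should do after the plan has run."""
    if outcome is PlanOutcome.ERROR:
        return "fail"
    if outcome is PlanOutcome.NO_CHANGES:
        return "skip"
    # Changes are pending: apply only once someone has approved them.
    return "apply" if approved else "await-approval"


# Example wiring (commented out because it needs a Terraform working directory):
# outcome = run_plan("./infra")
# print(next_step(outcome, approved=False))

print(next_step(PlanOutcome.CHANGES_PENDING, approved=False))  # await-approval
print(next_step(PlanOutcome.CHANGES_PENDING, approved=True))   # apply
print(next_step(PlanOutcome.NO_CHANGES, approved=True))        # skip
```

Keeping the decision logic in a pure function, separate from the subprocess call, makes the gate easy to test without Terraform installed.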

Cloud Drift is Eliminated. Visibility? Total.

Our Cloud vs. Code guarantee detects drift before it becomes a problem. What’s running in your cloud is mirrored in your code, ensuring predictability through:

  • Real-time infra snapshots
  • Drift alerts with context
  • One-click remediation
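To illustrate what a drift alert "with context" might contain, the following generic sketch compares attributes recorded in code against attributes observed in the cloud and reports each differing field with both values. It is a conceptual example only, not a description of ControlMonkey's product; the `drift_report` function, the resource address and the simplified attributes are hypothetical.

```python
def drift_report(code_state: dict[str, dict], cloud_state: dict[str, dict]) -> list[str]:
    """List attribute-level differences between code and the live cloud."""
    alerts = []
    for address, expected in sorted(code_state.items()):
        live = cloud_state.get(address)
        if live is None:
            alerts.append(f"{address}: missing from cloud")
            continue
        # Check every attribute seen on either side.
        for key in sorted(expected.keys() | live.keys()):
            if expected.get(key) != live.get(key):
                alerts.append(
                    f"{address}.{key}: code={expected.get(key)!r} cloud={live.get(key)!r}"
                )
    return alerts


# Hypothetical, simplified attributes: someone widened a rule in the console.
code_state = {"aws_security_group.web": {"ingress_cidr": "10.0.0.0/8", "port": 443}}
cloud_state = {"aws_security_group.web": {"ingress_cidr": "0.0.0.0/0", "port": 443}}

for alert in drift_report(code_state, cloud_state):
    print(alert)
# aws_security_group.web.ingress_cidr: code='10.0.0.0/8' cloud='0.0.0.0/0'
```

Reporting the field and both values, rather than just "drift detected", is what lets an engineer decide quickly whether to revert the cloud change or update the code.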

Governance Without Grit

And unlike homegrown pipelines or partial tools, it all runs on a platform built for Total Cloud Control.

From Engineering Toil to Total Cloud Control

Toil doesn’t scale. And in today’s cloud, neither should your engineering team.

ControlMonkey eliminates Terraform toil by replacing manual workarounds with intelligent automation and proactive governance, giving engineers back their time and your organization back its development velocity.

Author

Aharon Twizer

Aharon Twizer is the Co-Founder and CEO of ControlMonkey. He has over 20 years of experience in software development. He was the CTO of Spot.io, which was bought by NetApp for more than $400 million. There, he led important tech innovations in cloud optimization and Kubernetes. He later joined AWS as a Principal Solutions Architect, helping global partners solve complex cloud challenges. In 2022, he started ControlMonkey to help DevOps teams discover, manage, and scale their cloud infrastructure with Infrastructure as Code. Aharon loves creating tools that help engineering teams manage the complexity of modern cloud environments.

    FAQ – Engineering Toil, DevOps Toil and SRE Toil

    Toil refers to repetitive, manual tasks that add little long-term value – like re-running scripts, debugging drift, or handling infra tickets. In software engineering, toil slows teams down and causes burnout.

    Engineering toil in DevOps includes low-leverage tasks like manual Terraform applies, Slack-based approvals, and firefighting drift. These tasks scale with infra, but not with business value – making them a bottleneck.

    Toil consumes SRE time with non-strategic tasks. Instead of improving system reliability, they’re stuck deploying code, debugging misconfigurations, or managing infrastructure manually.

    According to Google’s SRE handbook, toil is work that is manual, repetitive, automatable, and scales linearly. Eliminating toil is a core tenet of site reliability engineering.

    DevOps toil slows you down. Automation speeds you up. Replacing toil with automation means faster delivery, fewer production issues, and happier teams.

    Updated: Aug 20, 2025

    Cloud Chaos: What It Is and How to End It

    Aharon Twizer

    CEO & Co-founder

    For years, enterprises have raced to cloud infrastructure expecting speed, scale, and agility. But what many teams got instead was cloud chaos: sprawling infra, inconsistent governance, and endless firefighting.

    This “cloud chaos” isn’t just annoying. It’s high-risk. It introduces drift, increases failures, slows delivery, and wastes engineering talent on low-leverage work. And in the AI era, it’s becoming a critical blocker for innovation.

    What Is Cloud Chaos?

    Cloud chaos is the growing gap between the speed of cloud change and the ability to control it.
    It looks like:

    • Manual approvals and patchwork IaC tools that slow everything down
    • Infrastructure drift—what’s deployed no longer matches what’s in code
    • Shadow changes made through the console, outside policy and visibility
    • No single source of truth—or too many
    • Endless tickets to infrastructure teams, who become bottlenecks by default

    Sound familiar? You’re not alone. And you’re not doing it wrong. You’re just trying to scale without the right platform.

    Why Cloud Chaos Happens

    Most cloud chaos stems from three core issues:

    1. Scale without structure: The more teams, regions, and cloud services you add, the harder it is to keep them governed, compliant, and consistent.
    2. Partial automation: Running Terraform isn’t the same as governing it. Without real automation and guardrails, IaC just becomes more code to manage.
    3. Reactive operations: Many teams still operate in firefighting mode, reacting to issues instead of proactively managing change.

    And now, AI is multiplying the problem. Faster development means more infra. More infra means more changes. And more changes—without control—means more chaos.

    From Chaos to Control: How ControlMonkey Solves It

    ControlMonkey is built for one purpose: Total Cloud Control.

    We help enterprises eliminate chaos by turning infrastructure delivery into a proactive, automated, fully end-to-end process. Here’s how:

    Complete Terraform Automation

    ControlMonkey transforms your Terraform workflows into repeatable, governed pipelines. No more one-off scripts. No more manual changes.

    • Self-serve deployments with policy guardrails
    • PR-based workflows with automated drift detection
    • AI-powered code generation to onboard unmanaged resources

    Cloud vs. Code Integrity

    Know exactly what’s in your cloud—and how it maps to code.
    With our Cloud vs. Code Guarantee, you can:

    • Detect drift automatically
    • Prevent hidden changes from going unnoticed
    • Ensure 100% IaC coverage across environments
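IaC coverage can be thought of as the share of live cloud resources that are also tracked in Terraform state. The sketch below shows that calculation under simplifying assumptions: both inputs are plain sets of resource IDs, and how those sets are gathered (for example, from a cloud inventory and from `terraform show -json`) is left out. The `iac_coverage` function and the sample IDs are hypothetical, and this is not a description of how ControlMonkey measures coverage.

```python
def iac_coverage(cloud_ids: set[str], managed_ids: set[str]) -> tuple[float, list[str]]:
    """Return (coverage fraction, sorted IDs of resources not managed by IaC)."""
    if not cloud_ids:
        # Nothing deployed, so nothing is unmanaged.
        return 1.0, []
    unmanaged = sorted(cloud_ids - managed_ids)
    return 1 - len(unmanaged) / len(cloud_ids), unmanaged


# Made-up resource IDs for illustration.
cloud_ids = {"i-0a1", "i-0b2", "vol-9c3", "sg-4d4"}
managed_ids = {"i-0a1", "i-0b2", "sg-4d4"}

coverage, unmanaged = iac_coverage(cloud_ids, managed_ids)
print(f"IaC coverage: {coverage:.0%}")  # IaC coverage: 75%
print("Unmanaged:", unmanaged)          # Unmanaged: ['vol-9c3']
```

The list of unmanaged IDs is the actionable part: those are the resources to import into code or retire.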

    End-to-End Governance

    We give platform and DevOps teams centralized visibility, while letting application teams move fast, with:

    • Role-based access and SDLC controls
    • Built-in compliance and security checks
    • Real-time analytics across regions and clouds

    Real-World Results: Resilience at Scale with Block

    Block, one of the most advanced fintech cloud platforms, faced a wake-up call when a critical review revealed gaps in disaster recovery—not in data, but in infrastructure itself.

    “We needed something fast, reliable, and easy to run. ControlMonkey gave us all of that—and more.”
    - Ben Apprederisse, Platform Technical Lead at Block

    After adopting ControlMonkey, Block’s teams:

    • Recovered environments 90% faster
    • Gained full visibility into what’s covered by Terraform—and what isn’t
    • Created a clear, automated path for restoring critical infrastructure during outages

    The Path Forward: End Cloud Chaos, Start Building

    Cloud chaos isn’t inevitable. It’s just the result of trying to scale old ways of working into a new era.

    ControlMonkey gives you the structure, visibility, and automation to move fast—without breaking things.

      FAQ – Cloud Chaos

      Cloud chaos refers to the growing complexity and lack of control in cloud environments as organizations scale. It happens when infrastructure expands faster than governance, leading to:

      • Infrastructure drift (code vs. reality mismatches)
      • Manual changes outside of policy
      • Siloed teams and tools
      • Endless tickets and firefighting

      It’s not just messy—it’s risky. Cloud chaos can slow innovation, increase costs, and expose teams to compliance failures. As AI accelerates infrastructure changes, chaos compounds—unless teams adopt end-to-end automation and governance.

      ControlMonkey takes its name from Chaos Monkey, the open-source tool created at Netflix to randomly shut down cloud resources and test system resilience. That tool exposed a hard truth: most cloud infrastructure wasn’t built for failure.

      ControlMonkey flips the script.

      Where Chaos Monkey introduced failure, ControlMonkey restores control. We give DevOps and platform teams the automation and governance they need to keep up with scale—without losing visibility or stability.

      Updated: Aug 20, 2025

      “This is My Offer” – AWS ISV Program

      Zack Bentolila

      Marketing Director

      In this episode of “This is My Offer”, we discuss ControlMonkey with Bala KP, WW Sr. Solutions Architect

      Author

      Zack Bentolila

      Zack is the Marketing Director at ControlMonkey, with a strong focus on DevOps and DevSecOps. He was the Senior Director of Partner Marketing and Field Marketing Manager at Checkmarx. There, he helped with global security projects. With over 10 years in marketing, Zack specializes in content strategy, technical messaging, and go-to-market alignment. He loves turning complex cloud and security ideas into clear, useful insights for engineering, DevOps, and security leaders.
