

Updated: Aug 25, 2025

8 min read

Engineering Toil: The Real DevOps Bottleneck

Aharon Twizer

CEO & Co-founder


Today, productivity is a key priority for software engineering teams. Every software development, DevOps and cloud team wants to work as productively, efficiently and cost-effectively as possible. However, teams frequently get bogged down in manual, repetitive tasks and firefighting to keep the lights on, which limits their ability to move the needle on technology innovation for the organization.

In the DevOps and R&D world, this is frequently referred to as engineering toil: the bottleneck that DevOps teams are constantly fighting against. This article examines what engineering toil is, why it happens and what actions your DevOps team can take to help eliminate excess toil.

Why Are Scale and Velocity Challenging for DevOps?

Right now, the scale and velocity of software development present an enormous challenge for enterprises, with software being built faster than it can be secured. In parallel, organizations expect new infrastructure and cloud workloads to be spun up just as quickly, often with little or no cloud governance around them. In reality, as a cloud environment matures, more cloud accounts are added and configurations evolve, so the environment becomes steadily more complex.

This leads to bloated clouds in which risk accumulates. Such environments are difficult to manage, inefficient, and expose the organization to more security incidents. The advent of AI-powered development has raised the stakes further. AI is already accelerating software delivery, which means more code, more changes, and more infrastructure to support it. If you’re still relying on manual processes to manage your environment, AI just adds fuel to the fire.

How Engineering Toil Impacts DevOps Productivity

This scenario often leads to excessive engineering toil. There is growing evidence that in today’s high-stakes, high-velocity cloud environments, engineering toil isn’t just annoying, it is incredibly expensive. It eats up valuable engineering time, slows down delivery, blocks innovation and limits the business’s ability to create a competitive advantage.

But toil is common throughout most programming and DevOps positions, whether we like it or not. And when it comes to platform engineering, the chances of encountering toil are high. You can think of toil as those tedious workarounds that should be automated but rarely are. This could be due to a lack of standard configurations for deployments, meaning engineers must copy and paste data from one module to another, or it could be an integration that has not yet been automated.

Less Than 50% of an Engineer’s Time Should Be Spent on Toil

Google’s SRE Book defines toil as manual, repetitive, automatable work that scales linearly as a service grows, and it advocates keeping toil well below 50% of an engineer’s time. It emphasizes automation and strategic engineering practices to reduce toil. Explore the chapter on eliminating toil.

Additionally, a LeadDev article highlights how unchecked toil can lead to burnout, errors, low morale and career stagnation, with employees voting with their feet. If the DevOps engineers who created your infrastructure leave, your corporate knowledge and experience walk out the door with them.

Engineers have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout, boredom, and discontent. This article advocates that the way to eliminate toil is through automation and/or system redesign.

Furthermore, a recent academic paper from Eindhoven University of Technology, Toil in Organizations That Run Production Services, found that toil is more nuanced than Google’s definition suggests, and that the challenges in reducing it include cultural inertia combined with a lack of time to automate. But the paper emphasizes that a concerted effort to reduce toil will yield positive outcomes for both individuals and organizations.

In summary, the research suggests that work machines should be doing is still being done manually, and if you’re running cloud infrastructure at scale without a purpose-built automation platform, toil will continue to escalate.

Importance of Prioritizing Long-Term Engineering Projects

The good news is that toil is measurable, and this is where surveys and ticket metrics can help to quantify it. Reducing toil requires engineering effort with automation and system improvements whereby teams prioritize long-term engineering projects over reactive, repetitive tasks.
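As a rough illustration of how ticket metrics could quantify toil, the sketch below computes each engineer's share of logged hours spent on toil and flags anyone above the 50% ceiling mentioned earlier. The ticket records, the engineer names, and the `toil_share` and `over_budget` helpers are all hypothetical; a real team would pull this data from its own ticketing system.

```python
from collections import defaultdict

# Hypothetical ticket log: (engineer, hours spent, whether the work was toil).
TICKETS = [
    ("dana", 6.0, True),
    ("dana", 10.0, False),
    ("eli", 12.0, True),
    ("eli", 8.0, False),
    ("eli", 4.0, True),
]

TOIL_BUDGET = 0.5  # the 50% ceiling discussed earlier in the article


def toil_share(tickets):
    """Return {engineer: fraction of logged hours spent on toil}."""
    toil = defaultdict(float)
    total = defaultdict(float)
    for engineer, hours, is_toil in tickets:
        total[engineer] += hours
        if is_toil:
            toil[engineer] += hours
    return {name: toil[name] / total[name] for name in total if total[name] > 0}


def over_budget(shares, budget=TOIL_BUDGET):
    """List engineers whose toil share exceeds the budget, sorted by name."""
    return sorted(name for name, share in shares.items() if share > budget)


shares = toil_share(TICKETS)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%} toil")
print("over budget:", over_budget(shares))
```

Tracking this number over time, rather than as a one-off, is what shows whether automation work is actually paying down toil.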

However, it is important to recognize that not all toil is bad. Small amounts can be tolerable and even satisfying for your engineers: predictable, repetitive tasks can produce a sense of accomplishment and quick wins, and they can be low-risk, low-stress activities. Some people gravitate toward tasks involving toil and may even enjoy that type of work. But be warned, excess toil is harmful: it impacts productivity and velocity.

A Google Cloud blog post offers some practical steps for identifying, measuring, and reducing toil. In particular, it encourages using Infrastructure as Code (IaC) and automation as key strategies.

How Infrastructure as Code (IaC) Helps Reduce Engineering Toil

Infrastructure as Code (IaC) is a powerful tool in the fight against engineering toil. Because infrastructure is defined, provisioned, and managed through code, teams gain better control over their cloud. Layered onto this, engineering teams need automation, and this is where tools like Terraform enable DevOps to define, provision, and manage cloud and on-prem resources using declarative configuration. In effect, Terraform transforms manual, repetitive tasks into automated, scalable processes: machine-readable configuration files define the infrastructure (servers, networks, databases), automate provisioning and configuration, and enable version control and repeatability.
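To make the declarative idea concrete, here is a minimal sketch of what "describe the desired state and let the tool work out the changes" means. It is a toy model, not Terraform's actual planning logic: the `plan` function and the resource names are hypothetical, and resources are reduced to plain dictionaries.

```python
def plan(desired, current):
    """Compare desired state (from code) with current state (from the cloud).

    Both arguments map a resource name to a dict of attributes.
    Returns a sorted list of (action, resource_name) tuples.
    """
    actions = []
    for name, attrs in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != attrs:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return sorted(actions)


# Hypothetical resources, for illustration only.
desired = {
    "web_server": {"size": "small", "count": 2},
    "database": {"engine": "postgres"},
}
current = {
    "web_server": {"size": "small", "count": 1},
    "old_queue": {"type": "fifo"},
}

for action, name in plan(desired, current):
    print(action, name)
```

Because the engineer only edits the desired state, running the same comparison a second time after the changes are applied yields no actions, which is what makes the process repeatable.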

Here’s how IaC directly tackles the characteristics of toil:

Toil Trait	How IaC Helps
Manual	Automates the setup and configuration of infrastructure
Repetitive	Scripts can be reused across environments and deployments
Automatable	IaC is inherently automatable: define once, apply anywhere
Tactical	Shifts focus from reactive fixes to proactive system design
No enduring value	IaC creates reusable templates that add long-term value
Scales linearly	IaC enables scalable infrastructure without increasing manual effort

The Benefits of Using IaC to Eliminate Toil

Using IaC to eliminate toil brings several key benefits:

Consistency: Eliminates “it works on my machine” issues by standardizing environments.
Speed: Rapid provisioning and updates reduce downtime and manual effort.
Reliability: Reduces human error and improves system stability.
Version Control: Infrastructure changes are tracked and auditable.
Self-healing Systems: Can be integrated with monitoring to auto-remediate issues.

Tackling Toil in Terraform and Cloud Workflows

So, if you are ready to tackle toil, here are some common sources of engineering toil in Terraform and cloud workflows:

Manually running terraform plan to preview changes before applying them
Approving and tracking changes in Slack or spreadsheets
Debugging cloud drift without full visibility
Writing custom scripts to enforce policies
Manually provisioning a VM
Reviewing code for basic issues, such as open S3 buckets and bad IAM roles
Fielding “can you deploy this?” tickets that swamp your SREs

While each task might not sound that onerous, if you multiply each of these by every developer, in every environment, every week, it is easy to see how arduous toil can become.
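Several of these chores, such as reviewing for open S3 buckets and over-permissive IAM policies, are mechanical enough to script. The sketch below is one possible approach, assuming the input dictionary follows the general shape of Terraform's JSON plan output (a `resource_changes` list whose entries carry `address`, `type`, and `change.after`). The `find_violations` helper, the sample plan, and the resource addresses are hypothetical, and the two checks are deliberately simplistic.

```python
import json

# Canned S3 ACLs that expose a bucket publicly.
PUBLIC_ACLS = {"public-read", "public-read-write"}


def _as_list(value):
    """Wrap a single item in a list so callers can iterate uniformly."""
    return value if isinstance(value, list) else [value]


def find_violations(plan):
    """Return sorted (address, reason) pairs for risky planned resources."""
    violations = []
    for rc in plan.get("resource_changes", []):
        # "after" is null for deletions, so fall back to an empty dict.
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") in PUBLIC_ACLS:
            violations.append((rc["address"], "public S3 bucket ACL"))
        if rc.get("type") == "aws_iam_policy" and after.get("policy"):
            doc = json.loads(after["policy"])
            statements = _as_list(doc.get("Statement", []))
            if any(
                s.get("Effect") == "Allow" and "*" in _as_list(s.get("Action", []))
                for s in statements
            ):
                violations.append((rc["address"], "IAM policy allows all actions"))
    return sorted(violations)


# Hypothetical plan data, for illustration only.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "private"}}},
        {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        {"address": "aws_iam_policy.admin", "type": "aws_iam_policy",
         "change": {"actions": ["create"], "after": {"policy": json.dumps({
             "Version": "2012-10-17",
             "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
         })}}},
    ]
}

for address, reason in find_violations(SAMPLE_PLAN):
    print(f"{address}: {reason}")
```

A script like this is itself the kind of custom policy code listed above, which is why it tends to become toil once every team maintains its own version.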

Why Toil Often Goes Unnoticed

So why does toil frequently go unnoticed, even if you are using Terraform? A patchwork of GitHub repositories, Jenkins jobs, in-house scripts, and Slack approvals isn’t an end-to-end platform. It’s a mishmash of tools, and it’s where toil lives and multiplies. As a result, most teams don’t even realize how much toil they’re carrying. Toil creeps in quietly. But it scales quickly.

How ControlMonkey Eliminates Engineering Toil

ControlMonkey was built to erase engineering toil from the Terraform workflow. It’s the only complete solution for end-to-end Terraform automation, allowing DevOps to manage cloud infrastructure with the same confidence that they manage software delivery.

Terraform Automation, Reimagined

ControlMonkey enables self-service deployments and PR-based workflows with policy enforcement baked in. There are no custom scripts and no friction, which enables fast infrastructure provisioning without DevOps bottlenecks. ControlMonkey:

Auto-runs plans and applies with approval gates
Enables templatized environments via QualityGates
Imports legacy resources into Terraform in seconds

Cloud Drift is Eliminated. Visibility? Total.

Our Cloud vs. Code guarantee detects drift before it becomes a problem. What’s running in your cloud is mirrored in your code, ensuring predictability through:

Real-time infra snapshots
Drift alerts with context
One-click remediation
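For readers unfamiliar with the mechanics, drift detection boils down to comparing what the code declares with what is actually running, and reporting each difference with enough context to act on. The sketch below is a generic illustration under that assumption and does not describe ControlMonkey's implementation; `detect_drift`, the resource IDs, and the attribute names are hypothetical.

```python
def detect_drift(code_state, cloud_state):
    """Return a list of drift findings, one dict per difference.

    code_state / cloud_state: {resource_id: {attribute: value}}.
    """
    findings = []
    for rid, live in sorted(cloud_state.items()):
        declared = code_state.get(rid)
        if declared is None:
            # Running in the cloud but not described in code.
            findings.append({"resource": rid, "kind": "unmanaged"})
            continue
        for attr in sorted(set(declared) | set(live)):
            if declared.get(attr) != live.get(attr):
                findings.append({
                    "resource": rid,
                    "kind": "modified",
                    "attribute": attr,
                    "in_code": declared.get(attr),
                    "in_cloud": live.get(attr),
                })
    # Described in code but no longer present in the cloud.
    for rid in sorted(set(code_state) - set(cloud_state)):
        findings.append({"resource": rid, "kind": "missing"})
    return findings


# Hypothetical states, for illustration only.
code_state = {
    "sg-web": {"ingress_port": 443},
    "vm-api": {"size": "medium"},
}
cloud_state = {
    "sg-web": {"ingress_port": 22},
    "bucket-tmp": {"versioning": False},
}

for finding in detect_drift(code_state, cloud_state):
    print(finding)
```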

Governance Without Grit

Compliance shouldn’t be manual. ControlMonkey enforces organization policies before anything breaks, without slowing anyone down, through:

Role-based controls
Guardrails to prevent misconfigurations
Audit trails for every change

And unlike homegrown pipelines or partial tools, it all runs on a platform built for Total Cloud Control.

From Engineering Toil to Total Cloud Control

Toil doesn’t scale. And in today’s cloud, neither should your engineering team.

ControlMonkey eliminates Terraform toil by replacing manual workarounds with intelligent automation and proactive governance, giving engineers back their time and your organization back its development velocity.


Author

Aharon Twizer

CEO & Co-founder

Co-Founder and CEO of ControlMonkey. He has over 20 years of experience in software development. He was the CTO of Spot.io, which was bought by NetApp for more than $400 million. There, he led important tech innovations in cloud optimization and Kubernetes. He later joined AWS as a Principal Solutions Architect, helping global partners solve complex cloud challenges. In 2022, he started ControlMonkey to help DevOps teams discover, manage, and scale their cloud infrastructure with Infrastructure as Code. Aharon loves creating tools that help engineering teams. These tools make it easier to manage the complexity of modern cloud environments.


FAQ – Engineering Toil, DevOps Toil and SRE Toil

What is toil in software engineering?

Toil refers to repetitive, manual tasks that add little long-term value – like re-running scripts, debugging drift, or handling infra tickets. In software engineering, toil slows teams down and causes burnout.

What is engineering toil in DevOps?

Engineering toil in DevOps includes low-leverage tasks like manual Terraform applies, Slack-based approvals, and firefighting drift. These tasks scale with infra, but not with business value – making them a bottleneck.

How does toil affect site reliability engineers (SREs)?

Toil consumes SRE time with non-strategic tasks. Instead of improving system reliability, they’re stuck deploying code, debugging misconfigurations, or managing infrastructure manually.

What does Google say about engineering toil?

According to Google’s SRE handbook, toil is work that is manual, repetitive, automatable, and scales linearly. Eliminating toil is a core tenet of site reliability engineering.

What’s the difference between DevOps toil and automation?

DevOps toil slows you down. Automation speeds you up. Replacing toil with automation means faster delivery, fewer production issues, and happier teams.

Related Resources


Updated: Aug 20, 2025

3 min read

Cloud Chaos: What It Is and How to End It

Aharon Twizer

CEO & Co-founder


For years, enterprises have raced to cloud infrastructure expecting speed, scale, and agility. But what many teams got instead was cloud chaos: sprawling infra, inconsistent governance, and endless firefighting.

This “cloud chaos” isn’t just annoying. It’s high-risk. It introduces drift, increases failures, slows delivery, and wastes engineering talent on low-leverage work. And in the AI era, it’s becoming a critical blocker for innovation.

What Is Cloud Chaos?

Cloud chaos is the growing gap between the speed of cloud change and the ability to control it.
It looks like:

Manual approvals and patchwork IaC tools that slow everything down
Infrastructure drift—what’s deployed no longer matches what’s in code
Shadow changes made through the console, outside policy and visibility
No single source of truth—or too many
Endless tickets to infrastructure teams, who become bottlenecks by default

Sound familiar? You’re not alone. And you’re not doing it wrong. You’re just trying to scale without the right platform.

Why Cloud Chaos Happens

Most cloud chaos stems from three core issues:

Scale without structure: The more teams, regions, and cloud services you add, the harder it is to keep them governed, compliant, and consistent.
Partial automation: Running Terraform isn’t the same as governing it. Without real automation and guardrails, IaC just becomes more code to manage.
Reactive operations: Many teams still operate in firefighting mode, reacting to issues instead of proactively managing change.

And now, AI is multiplying the problem. Faster development means more infra. More infra means more changes. And more changes—without control—means more chaos.

From Chaos to Control: How ControlMonkey Solves It

ControlMonkey is built for one purpose: Total Cloud Control.

We help enterprises eliminate chaos by turning infrastructure delivery into a proactive, automated, fully end-to-end process. Here’s how:

Complete Terraform Automation

ControlMonkey transforms your Terraform workflows into repeatable, governed pipelines. No more one-off scripts. No more manual changes.

Self-serve deployments with policy guardrails
PR-based workflows with automated drift detection
AI-powered code generation to onboard unmanaged resources

Cloud vs. Code Integrity

Know exactly what’s in your cloud—and how it maps to code.
With our Cloud vs. Code Guarantee, you can:

Detect drift automatically
Prevent hidden changes from going unnoticed
Ensure 100% IaC coverage across environments

End-to-End Governance

We give platform and DevOps teams centralized visibility, while letting application teams move fast, with:

Role-based access and SDLC controls
Built-in compliance and security checks
Real-time analytics across regions and clouds

Real-World Results: Resilience at Scale with Block

Block, one of the most advanced fintech cloud platforms, faced a wake-up call when a critical review revealed gaps in disaster recovery—not in data, but in infrastructure itself.

“We needed something fast, reliable, and easy to run. ControlMonkey gave us all of that—and more.”
-Ben Apprederisse, Platform Technical Lead at Block

After adopting ControlMonkey, Block’s teams:

Recovered environments 90% faster
Gained full visibility into what’s covered by Terraform—and what isn’t
Created a clear, automated path for restoring critical infrastructure during outages

The Path Forward: End Cloud Chaos, Start Building

Cloud chaos isn’t inevitable. It’s just the result of trying to scale old ways of working into a new era.

ControlMonkey gives you the structure, visibility, and automation to move fast—without breaking things.


FAQ – Cloud Chaos

What Is Cloud Chaos?

Cloud chaos refers to the growing complexity and lack of control in cloud environments as organizations scale. It happens when infrastructure expands faster than governance, leading to:

Infrastructure drift (code vs. reality mismatches)
Manual changes outside of policy
Siloed teams and tools
Endless tickets and firefighting

It’s not just messy—it’s risky. Cloud chaos can slow innovation, increase costs, and expose teams to compliance failures. As AI accelerates infrastructure changes, chaos compounds—unless teams adopt end-to-end automation and governance.

ControlMonkey and Cloud Chaos: What’s the Connection?

ControlMonkey takes its name from Chaos Monkey, the open-source tool created at Netflix to randomly shut down cloud resources and test system resilience. That tool exposed a hard truth: most cloud infrastructure wasn’t built for failure.

ControlMonkey flips the script.

Where Chaos Monkey introduced failure, ControlMonkey restores control. We give DevOps and platform teams the automation and governance they need to keep up with scale—without losing visibility or stability.

Related Resources


Updated: Aug 20, 2025

1 min read

“This is My Offer” – AWS ISV Program

Zack Bentolila

Marketing Director

In this episode of “This is My Offer” we discuss ControlMonkey’s with Bala KP, WW Sr. Solutions Architec


Author

Zack Bentolila

Marketing Director

Zack is the Marketing Director at ControlMonkey, with a strong focus on DevOps and DevSecOps. He was the Senior Director of Partner Marketing and Field Marketing Manager at Checkmarx. There, he helped with global security projects. With over 10 years in marketing, Zack specializes in content strategy, technical messaging, and go-to-market alignment. He loves turning complex cloud and security ideas into clear, useful insights for engineering, DevOps, and security leaders.

