Terraform Operations at Scale

The 3 key elements to being a smooth operator when using Terraform at scale.
Ryan Fee, May 11, 2023
Key takeaways
  • Operating Terraform at scale comes down to visibility in three areas: operational dashboards, reporting, and monitoring.
  • Run dashboards let platform teams view runs across all environments and workspaces from one place, saving time during incidents and run prioritization.
  • Reporting on which Terraform versions, modules, and providers are used helps avoid technical debt and focus effort where it matters most.
  • Streaming Terraform run events to a tool like Datadog surfaces system-wide pipeline issues and lets teams alert on run errors and queue backups.

Operating Terraform at scale is hard. As a platform team, you have a lot of moving parts to watch if you want developers to work on their own without slowing each other down. A few questions to ask yourself as your Terraform usage grows:

  • How do you know teams are operating in a compliant way?
  • How do you know the pipeline is working as expected?
  • What happens when an emergency change comes in and you need to cancel the runs ahead of it in the queue so it gets pushed through?
  • Which modules should you invest time in, and which ones should be deprecated?

All of these questions revolve around reporting and visibility. If you can answer all of these in a matter of seconds, then you're likely going to have smooth operations.

Scalr operational dashboard displaying Terraform runs across environments

I generally break this down into three areas. Operational dashboards show you the pipeline. Reporting covers current and historical information. Monitoring then catches the things you couldn't see in the dashboards or reports.

Run Dashboards

For operations, you want to see how Terraform runs are progressing and, most importantly, to see it from a single place. Rather than wasting time jumping from workspace to workspace, you should be able to see the current runs in the context in which you are working.

If you're the owner of a specific application and operating within that environment, then you'll want to see only the runs for that environment.

If you're managing a platform that developers use for Terraform operations, then you'll want to see all of the runs across all environments and workspaces. That is where the Scalr run dashboards help.

Terraform Run Dashboards

The time this saves when searching for runs, assisting in an incident, or prioritizing runs is not trivial.
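To make the two views and the prioritization case concrete, here is a minimal sketch in Python. It assumes a hypothetical `Run` record and a single shared queue ordered by `queued_at`. None of these names come from the Scalr API, and a real dashboard would pull this data from the platform rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical run record; field names are illustrative, not a Scalr schema."""
    run_id: str
    environment: str
    workspace: str
    status: str      # e.g. "pending", "planning", "applied", "errored"
    queued_at: int   # seconds since epoch; lower means earlier in the queue


def runs_in_scope(runs: List[Run], environment: Optional[str] = None) -> List[Run]:
    """Platform view (environment=None) returns everything; owner view filters to one environment."""
    if environment is None:
        return list(runs)
    return [r for r in runs if r.environment == environment]


def runs_ahead_of(runs: List[Run], emergency_run_id: str) -> List[Run]:
    """Pending runs queued before the emergency run, oldest first.

    These are the candidates to cancel or reorder so the emergency change goes through.
    """
    emergency = next(r for r in runs if r.run_id == emergency_run_id)
    ahead = [
        r for r in runs
        if r.status == "pending" and r.queued_at < emergency.queued_at
    ]
    return sorted(ahead, key=lambda r: r.queued_at)
```

An application owner would call `runs_in_scope(runs, "payments")`, while a platform team would call `runs_in_scope(runs)` and then `runs_ahead_of(...)` during an incident. The cancellation itself is left out on purpose, since it depends on the platform you run on.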

Reporting

Now that you have a view into current operations, you also need to know how modules, providers, and Terraform versions are being used across the org. That picture helps you avoid technical debt, so you spend your energy building instead of maintaining.

Can you easily determine which Terraform versions, modules, or providers are used across your Terraform ecosystem? Do you know which sources developers are pulling modules and providers from? If the answer is no, then your organization is likely not operating at maximum efficiency.

Terraform Reports: Modules and Providers

With this information on hand, you can make sure your time goes to the areas that matter most to the business.
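As a rough illustration of what such a report boils down to, the sketch below counts Terraform versions, module sources, and providers across a small workspace inventory. The inventory shape, the workspace names, and the module sources are all made up for the example. In practice this data would come from your platform's reports rather than a hard-coded list.

```python
from collections import Counter

# Hypothetical inventory; names, versions, and sources are illustrative only.
WORKSPACES = [
    {"name": "network-prod", "terraform_version": "1.4.6",
     "modules": ["registry.example.com/acme/vpc/aws"],
     "providers": ["hashicorp/aws"]},
    {"name": "payments-prod", "terraform_version": "1.4.6",
     "modules": ["registry.example.com/acme/vpc/aws",
                 "registry.example.com/acme/rds/aws"],
     "providers": ["hashicorp/aws"]},
    {"name": "legacy-dns", "terraform_version": "1.1.9",
     "modules": ["github.com/acme/terraform-legacy-dns"],
     "providers": ["hashicorp/aws", "hashicorp/null"]},
]


def usage_report(workspaces):
    """Count how many workspaces use each Terraform version, module source, and provider."""
    return {
        "terraform_versions": Counter(ws["terraform_version"] for ws in workspaces),
        "modules": Counter(m for ws in workspaces for m in ws["modules"]),
        "providers": Counter(p for ws in workspaces for p in ws["providers"]),
    }


def rarely_used(counter, threshold=1):
    """Names used by at most `threshold` workspaces: candidates to review or deprecate."""
    return sorted(name for name, count in counter.items() if count <= threshold)


report = usage_report(WORKSPACES)
print(report["terraform_versions"].most_common())
print(rarely_used(report["modules"]))
```

Widely used modules are where investment pays off, and the `rarely_used` list is a starting point for the deprecation conversation, not a verdict.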

Monitoring

With operational dashboards in place and reports grading how well your Terraform platform is maintained, the next job is catching pipeline issues that aren't obvious. By streaming Terraform run events to a tool like Datadog, you can quickly spot system-wide issues or build alerts that watch for run errors in your more critical workspaces.
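The alerting logic behind this is simple enough to sketch. The example below assumes run events arrive as dictionaries with `workspace` and `status` keys and that you keep a list of critical workspaces. Both are assumptions for illustration and do not reflect the actual Scalr or Datadog event schema. In a real setup you would express the same rule as a monitor in your monitoring tool.

```python
# Hypothetical set of workspaces where a failed run should page someone.
CRITICAL_WORKSPACES = frozenset({"payments-prod", "network-prod"})

# Statuses that mean a run has finished, successfully or not (assumed names).
FINISHED_STATUSES = frozenset({"applied", "errored"})


def errored_critical_runs(events, critical=CRITICAL_WORKSPACES):
    """Return the events for errored runs in critical workspaces."""
    return [
        e for e in events
        if e.get("status") == "errored" and e.get("workspace") in critical
    ]


def error_rate(events):
    """Share of finished runs that errored, as a system-wide health signal."""
    finished = [e for e in events if e.get("status") in FINISHED_STATUSES]
    if not finished:
        return 0.0
    errored = sum(1 for e in finished if e["status"] == "errored")
    return errored / len(finished)
```

The first function covers the workspace-level alert. The second is the kind of system-wide number that tells you whether a spike in failures is one team's problem or a platform problem.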

As you grow, you'll also want to understand how the pipeline is keeping up with demand. You can see this in the run dashboards, but since you are not watching them all day, you will want to be alerted when the queue backs up due to a lack of resources. Using Scalr to stream events into Datadog lets you alert not only on workspace-level issues but also on your overall pipeline.
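One common way to phrase a "queue is backing up" alert is to require the queue depth to stay above a threshold for several consecutive samples, so a short burst doesn't wake anyone up. Here is a minimal sketch of that rule. The threshold, the sample count, and the idea of sampling queue depth at a fixed interval are assumptions you would tune to your own pipeline.

```python
def queue_backing_up(depth_samples, threshold=10, consecutive=3):
    """Return True if queue depth exceeded `threshold` for `consecutive` samples in a row.

    `depth_samples` is a chronological list of pending-run counts,
    for example one sample per minute.
    """
    streak = 0
    for depth in depth_samples:
        # Extend the streak while the queue stays deep; reset as soon as it drains.
        streak = streak + 1 if depth > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

A sustained breach usually points to a capacity problem, such as too few agents or workers for the demand, while an isolated spike tends to clear on its own.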

New Feature: Datadog Integration

The three components for scaling

Adopting Terraform is not hard, but scaling it is. As you keep rolling it out, hold on to the three components that make scaling manageable:

  • Operational run dashboards
  • Reporting on usage
  • Monitoring Terraform events

Set this up from the start and you'll catch issues or incidents that would have slipped by before, keeping your team running smoothly.

About the author
Ryan Fee, director of platform engineering at Scalr
Ryan Fee is the director of platform engineering at Scalr, with over 15 years of experience improving infrastructure experiences at companies large and small.