Hyperscaler Governance¶

Enabling teams to adopt and embrace the hyperscalers while also applying security and governance can easily look like two conflicting goals, and depending on the approach, for many teams it is. That need not be the case. By incorporating current DevOps and GitOps approaches, teams can reach a balance that provides good governance while still letting them operate independently and leverage the best of the hyperscalers.

This blog covers governance of AWS Accounts, Google Projects, and Azure Subscriptions.

Goals¶

Governance Goals¶

The following are some common governance goals for teams:

Finance - Control costs and ensure teams stay within their budget
Security - Ensure proper security measures are in place for all user access via UI and API to the Cloud Platform
Security - Ensure proper security measures are in place for all services that are placed within the cloud
Security - Ensure proper security monitoring tools are used to detect security breaches
Security - Ensure teams use only cloud services approved by the security team
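
As an illustration of the last goal, an AWS service control policy (SCP) can deny every action outside an approved list. The following is a minimal sketch, not a vetted policy: the service list and the function name `build_approved_services_scp` are assumptions made for this example, and a real allow-list would come from the security team.

```python
import json

# Hypothetical allow-list of AWS service prefixes (assumption for this sketch).
APPROVED_SERVICES = ["ec2", "s3", "rds", "iam", "cloudwatch"]


def build_approved_services_scp(approved_services):
    """Build an SCP-style document that denies any action outside the allow-list."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnapprovedServices",
                "Effect": "Deny",
                # NotAction means the Deny applies to everything EXCEPT these actions.
                "NotAction": [f"{service}:*" for service in sorted(approved_services)],
                "Resource": "*",
            }
        ],
    }


scp = build_approved_services_scp(APPROVED_SERVICES)
print(json.dumps(scp, indent=2))
```

Keeping the policy as generated data rather than a hand-edited file makes it easy to review changes to the allow-list in version control before they are attached to any organizational unit.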

Engineering Goals¶

The following are some common operational goals for engineering and operations teams:

Outage Avoidance - Never do anything in production that was not done previously in a lower environment
Fast - Automate everything: no manual steps, no ServiceNow, no old-world IT/ITIL service management
Flat - Push decision-making down to the lowest appropriate level
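
The outage-avoidance goal can be expressed as an automated gate. The sketch below assumes three tier names (`dev`, `test`, `prod`) and two hypothetical helper functions; it only shows the shape of the check, not any particular pipeline's implementation.

```python
# Assumed tier names, ordered from lowest to highest environment.
ENVIRONMENT_ORDER = ["dev", "test", "prod"]


def missing_lower_environments(target, applied_in, available=ENVIRONMENT_ORDER):
    """Return the lower tiers in which the change has not yet been applied."""
    position = available.index(target)
    return [env for env in available[:position] if env not in applied_in]


def may_apply(target, applied_in, available=ENVIRONMENT_ORDER):
    """A change may go to `target` only once every lower tier has received it."""
    return not missing_lower_environments(target, applied_in, available)


print(may_apply("prod", {"dev", "test"}))            # True
print(may_apply("prod", {"dev"}))                    # False
print(missing_lower_environments("prod", {"dev"}))   # ['test']
```

Passing a shorter `available` list models a team that has fewer tiers, for example `may_apply("prod", set(), available=["prod"])` for a team with production only.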

Summary¶

The above are just a few of the many goals teams have for their cloud adoption. This blog is not intended to cover all of them. Instead, it focuses on how to manage a governance lifecycle in which policies evolve over time, and on enabling independent development and operations groups to work together to roll out these changes successfully.

Overview¶

Figure: AWS Overview

The above diagram includes some basic elements of governance from the perspective of an Amazon cloud environment. As of late 2023, Amazon recommends AWS Organizations for governance across an enterprise. Amazon's model allows multiple levels of organizational units (OUs) underneath each organization, which makes it easy to construct hierarchies that meet different customers' needs.

For development teams, it is best never to make changes in a production environment that have not first been applied in a lower-tier environment such as development or test. This industry best practice allows teams to streamline their operations and to protect their most important asset: their production operations.

For teams who only host other companies' software, it is still possible, and valuable, to set up test and production (or non-prod and prod) environments so that changes can first be attempted in non-prod. Although this is the recommended pattern, many teams prefer to go directly against production. For many systems this is simply done on the weekend, because the affected systems are often not business critical and these teams operate around scheduled maintenance windows.

For systems such as anything directly visible to the company's customers, however, causing publicly visible outages is ill-advised. Not being able to reach a company's website because of a scheduled maintenance window, or because a team is making changes directly in the production environment, will only cause unnecessary damage to the company's brand.

Structure¶

As can be seen above, the design pattern is to create organizational units under the parent organization in such a way that policy changes can be rolled out across these units, using the units on the left to work through any issues or problems the changes cause. Each project area, organization, or engineering group has its own organizational units, and its individual accounts link into those OUs. In this fashion, teams who apply engineering best practices work through the necessary changes to their operations, automation, and software first in their dev accounts/environments, then in their test accounts/environments, and finally in their production accounts/environments. Teams who do no development would have only test and production where possible, and teams who prefer to make changes first in production would have only a production OU.
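
To make the structure concrete, here is a small in-memory model of such a hierarchy. It is a sketch under assumptions: the team names, the `OrgUnit` class, and the `<team>-<env>` naming scheme are all hypothetical, and the layout (one OU per team with one child OU per environment) is only one possible reading of the pattern described above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class OrgUnit:
    """A node in the hierarchy: an organization or an organizational unit."""
    name: str
    accounts: List[str] = field(default_factory=list)
    children: List["OrgUnit"] = field(default_factory=list)

    def walk_accounts(self) -> Iterator[str]:
        """Yield every account linked to this unit or any unit beneath it."""
        yield from self.accounts
        for child in self.children:
            yield from child.walk_accounts()


def build_team_ou(team: str, environments: List[str]) -> OrgUnit:
    """Create a team OU with one child OU (and one account) per environment."""
    return OrgUnit(
        name=team,
        children=[
            OrgUnit(name=f"{team}-{env}", accounts=[f"{team}-{env}-account"])
            for env in environments
        ],
    )


# Hypothetical teams covering the three cases described in the text.
root = OrgUnit(
    name="organization",
    children=[
        build_team_ou("payments", ["dev", "test", "prod"]),  # full best practice
        build_team_ou("vendor-apps", ["test", "prod"]),      # no development
        build_team_ou("legacy", ["prod"]),                   # production only
    ],
)

for account in root.walk_accounts():
    print(account)
```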

NOTE: It is highly recommended to guide teams toward a test-first approach, in which changes are handled outside of production first, as a best practice. Granted, this is more work than not testing. However, people now expect 24x7 availability from many systems, and outages can lead to lost revenue and opportunity in proportion to the scale of the outage.

Links¶

Rollout¶

The following are the recommended steps for handling changes to policy across an enterprise. In this example, sprints are the recommended cadence (a sprint normally lasts one or two weeks).

Step One¶

Figure: AWS Step One

Step Two¶

Figure: AWS Step Two

Step Three¶

Figure: AWS Step Three
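
One way to picture the three steps is as a sprint-by-sprint schedule. The sketch below assumes that each step targets one environment tier (dev, then test, then prod) and takes one sprint; that mapping is an assumption inferred from the structure described earlier, and the team names and `plan_rollout` function are hypothetical.

```python
# Assumed order in which a policy change reaches each environment tier.
ROLLOUT_STEPS = ["dev", "test", "prod"]

# Hypothetical teams and the tiers each one maintains.
TEAM_ENVIRONMENTS = {
    "payments": ["dev", "test", "prod"],
    "vendor-apps": ["test", "prod"],
    "legacy": ["prod"],
}


def plan_rollout(team_environments, steps=ROLLOUT_STEPS, first_sprint=1):
    """Map each sprint number to the OUs that receive the policy change in it."""
    plan = {}
    for offset, env in enumerate(steps):
        plan[first_sprint + offset] = sorted(
            f"{team}-{env}"
            for team, envs in team_environments.items()
            if env in envs
        )
    return plan


for sprint, targets in plan_rollout(TEAM_ENVIRONMENTS).items():
    print(f"Sprint {sprint}: {', '.join(targets)}")
# Sprint 1: payments-dev
# Sprint 2: payments-test, vendor-apps-test
# Sprint 3: legacy-prod, payments-prod, vendor-apps-prod
```

A team with fewer tiers simply has nothing scheduled in the earlier sprints, which means it first meets the change in whichever tier is its lowest.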

Google¶

Figure: Google

Google Cloud's Organization structure is very similar to Amazon's. In Google Cloud, Folders play a similar role to Organizational Units in Amazon, and Projects are similar to accounts. Thus the same overall structure and model used with Amazon is easily created within Google.
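
The correspondence can be written down as a simple lookup. This is only an illustration of the terminology mapping described above; the `to_google_terms` helper and the example path are hypothetical.

```python
# Terminology mapping taken from the comparison above.
AWS_TO_GOOGLE = {
    "Organization": "Organization",
    "Organizational Unit": "Folder",
    "Account": "Project",
}


def to_google_terms(aws_path):
    """Translate a list of (aws_kind, name) pairs into Google Cloud terms."""
    return [(AWS_TO_GOOGLE[kind], name) for kind, name in aws_path]


# Hypothetical path from the organization down to a single account.
aws_path = [
    ("Organization", "example-corp"),
    ("Organizational Unit", "payments"),
    ("Organizational Unit", "payments-dev"),
    ("Account", "payments-dev-account"),
]

for kind, name in to_google_terms(aws_path):
    print(f"{kind}: {name}")
```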

Links¶

Last update: March 8, 2024