Gartner Highlights Nine Principles To Improve Cloud Resilience
The cloud is not magically resilient, and software bugs, not physical failures, cause most cloud outages.
Infrastructure and operations (I&O) leaders should apply nine principles to maximise the resilience of cloud environments and limit the impact of cloud provider failures, research and consulting firm Gartner Inc. recommends.
The I&O team must understand the characteristics and causes of cloud outages. Most failures are partial. They tend to be intermittent or to involve performance degradation, which makes them less immediately noticeable, and resilience differs between the services that cloud providers offer, according to Gartner.
"In the cloud, outages almost never involve the entire cloud provider, nor are service outages likely to be total. Instead, partial failures, degradations of service, individual service problems or local problems are typical," Chris Saunderson, senior director analyst at Gartner, said.
"Resilience is not a binary state. No one can claim absolute resilience. Clouds should be as or even more resilient than on-premises infrastructure, but only if the I&O team uses them in a resilient manner," Saunderson said.
Gartner recommended that I&O leaders focus on nine key principles to improve cloud resilience.
Business Alignment: Align resilience requirements to business needs. Without this alignment, teams can fall short of resilience expectations or may overspend.
Risk-Based Approach: Take a risk-based approach to resiliency planning that extends beyond catastrophic events. Put more emphasis on the more common failures, which organisations have greater control over and can mitigate.
Dependency Mapping: Build dependency graphs that map all middleware components, databases, cloud services and integration points so they can be architected and configured for resilience and included in both reliability and disaster recovery planning.
Continuous Availability: The continuous availability approach focuses on keeping applications, services and data available at all times and at expected service levels, with no downtime and limited impact during a failure event.
Resilient By Design: The application itself should be resilient by design. Infrastructure resilience alone is insufficient to deliver the zero-downtime services that end users expect.
Disaster Recovery Automation: Implementing fully or near-fully automated disaster recovery—either through the organisation's own tools or through third-party cloud-native tools—provides the foundation needed to meet aggressive recovery time objectives and allows disaster recovery to be routinely tested.
Resilience Standards: Adopt resilience standards beyond architecture and disaster recovery. Resilient systems require teams to focus on quality, automation and continuous improvement, and infuse quality throughout the life cycle of an application.
Favour Cloud-Native Solutions: Cloud providers offer a significant range of solutions that can be used to improve resilience. Where viable, I&O leaders should use them rather than inventing alternatives that increase complexity.
Business Functions Focus: Explore lightweight IT alternatives or lightweight application substitutions that provide the bare-minimum, business-critical functionality required.
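As an illustration of the dependency mapping principle above, the following is a minimal sketch of a dependency graph and a query for what a single failure would affect. It is not taken from Gartner's guidance: the component names, the `DEPENDS_ON` map and the `impacted_by` function are all hypothetical, and a real inventory would likely come from a configuration database or service catalogue.

```python
from collections import defaultdict, deque

# Hypothetical dependency map (an assumption for illustration only):
# each key depends on the components listed in its value.
DEPENDS_ON = {
    "checkout-app": ["orders-db", "payment-gateway", "message-queue"],
    "reporting-app": ["orders-db"],
    "payment-gateway": ["managed-dns"],
    "message-queue": [],
    "orders-db": [],
    "managed-dns": [],
}


def impacted_by(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a failed component to its dependents.
    dependents = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(component)

    # Breadth-first search over the inverted graph.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    # Which components are affected if the (hypothetical) DNS service degrades?
    print(sorted(impacted_by(DEPENDS_ON, "managed-dns")))
    # prints ['checkout-app', 'payment-gateway']
```

A map like this makes it easier to see which shared components, such as the database in this example, affect several applications at once and so deserve attention in both reliability and disaster recovery planning.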