Response for Partners & Customers: Ensuring High Availability Amid Regional Cloud Outages

The recent AWS outage underscores the critical importance of a robust, multifaceted strategy for business continuity. At CoreStack, we take this responsibility very seriously, and our infrastructure on Microsoft Azure is architected specifically to maintain service availability and data durability even during a regional disruption.

Here’s a breakdown of our approach:

1. Proactive High Availability Design on Azure

  • Azure-Centric Infrastructure with 99.9% SLA: All CoreStack production infrastructure is hosted on Microsoft Azure, and our service commitment is backed by a 99.9% uptime SLA.

  • Multi-Tier Database Resilience: Our databases are configured for high availability. They operate in a multi-tiered setup so that if the primary instance fails, traffic automatically fails over to a healthy secondary replica within the same region, minimizing downtime and preserving data continuity.
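
To put the 99.9% figure in concrete terms, the short sketch below converts an uptime percentage into a downtime budget. The 30-day and 365-day measurement windows are assumptions chosen for illustration; the actual measurement period is whatever the SLA itself defines.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Maximum downtime, in minutes, that still satisfies the SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Assumed measurement windows (not taken from the SLA text).
monthly = allowed_downtime_minutes(99.9, 30)
yearly = allowed_downtime_minutes(99.9, 365)

print(f"30-day budget: {monthly:.1f} minutes")      # 43.2 minutes
print(f"365-day budget: {yearly / 60:.2f} hours")   # 8.76 hours
```

Under these assumptions, a 99.9% commitment leaves roughly 43 minutes of downtime per 30 days, or just under 9 hours per year.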

2. Reactive Disaster Recovery for Regional Outages

While high availability protects against localized failures, our Disaster Recovery (DR) plan is activated for a full regional outage.

  • Automated Regional Failover with Azure Site Recovery (ASR): This is the cornerstone of our DR strategy. We do not rely on manual intervention for core service restoration.

    • How it works: Our application and web servers are replicated in near real-time to a secondary, paired Azure region.

    • In an Outage: If a primary region becomes unavailable, Azure Site Recovery is configured to automatically bring the virtual machines (VMs) online in the alternate region. This process is designed to meet our aggressive Recovery Time Objective (RTO).

3. Comprehensive Data Protection & Recovery Objectives

To minimize data loss and enable a swift recovery, we have a rigorous backup strategy.

  • Frequent Backups: We perform both daily and hourly backups of our critical systems and data.

  • Meeting RPO & RTO: This multi-layered backup approach allows us to meet our defined Recovery Point Objective (RPO: the maximum amount of recent data that could be lost) and Recovery Time Objective (RTO: how quickly we can restore service). In a DR scenario, we can restore services from these backups in the secondary region.
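
The relationship between backup frequency and RPO can be illustrated with a small sketch. When service is restored from a backup, anything written after the last successful backup is the data at risk, so the backup interval bounds the worst case. The one-hour RPO value, the timestamps, and the function names below are hypothetical and used only for illustration; they are not CoreStack's published figures.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Time span of writes that a restore from the last backup would not contain."""
    if failure_time < last_backup:
        raise ValueError("failure_time must not precede last_backup")
    return failure_time - last_backup


def within_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if restoring from the last backup stays inside the RPO."""
    return data_loss_window(last_backup, failure_time) <= rpo


# Hypothetical values for illustration only.
HYPOTHETICAL_RPO = timedelta(hours=1)
last_backup = datetime(2025, 1, 1, 10, 0)
failure = datetime(2025, 1, 1, 10, 45)

window = data_loss_window(last_backup, failure)
print(window, within_rpo(last_backup, failure, HYPOTHETICAL_RPO))
```

With hourly backups, a failure 45 minutes after the last backup falls inside a one-hour objective, while a missed backup cycle (for example, 90 minutes since the last successful backup) would not.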

4. Rigorous Validation Through DR Drills

Our confidence in this process isn't theoretical. We validate it regularly.

  • Annual Disaster Recovery Drills: We conduct full-scale DR drills at least once a year as a part of our internal and external audit compliance.

  • Verified SLAs: These drills consistently demonstrate that our actual RTO and RPO are well within the thresholds defined by our customer-facing SLAs. This proves our capability to execute a seamless failover with minimal impact.

Summary: How This Protects Our Customers

In the event of a significant Azure regional outage:

  1. Our automated systems detect the failure.

  2. Azure Site Recovery initiates the failover process, spinning up our portal and application servers in the healthy, secondary region.

  3. Database failover and data restoration processes are triggered to ensure data integrity.

  4. The CoreStack portal becomes available from the new region, and customers are able to resume operations.
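
The ordering of the steps above can be sketched as a simple sequential runbook in which each step must succeed before the next one starts. This is an illustrative model only: the step names mirror the list above, the actions are placeholders that always succeed, and nothing here calls Azure Site Recovery or any other real Azure API.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


# Placeholder actions (hypothetical); a real runbook would invoke monitoring
# and recovery tooling here.
steps: List[Step] = [
    ("detect regional failure", lambda: True),
    ("fail over VMs to secondary region", lambda: True),
    ("fail over database and restore data", lambda: True),
    ("serve portal from secondary region", lambda: True),
]

completed = run_failover(steps)
print(completed)
```

Stopping at the first failed step reflects the dependency between stages: the portal cannot be served from the secondary region until the servers and data are available there.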

The net effect is that customers experience minimal disruption. While there will be a brief period of downtime during the failover process (aligned with our RTO), the automated and tested nature of our DR plan ensures a swift and reliable recovery, safeguarding your operations and data.

We are committed to providing a resilient and reliable platform, and our investment in this comprehensive DR strategy is a core part of that promise.