Cloud Migration and Architecture: Designing a Resilient Multi-Region Cloud Platform
Arxium
November 28, 2024


An energy sector client needed to migrate their on-prem analytics and billing system to the cloud with minimal disruption

Context and Drivers

When the client — a national critical-infrastructure provider — first invited us in, their estate spanned three data-centres, 200+ Windows and RHEL VMs, an Oracle RAC cluster, two monolithic .NET 4.6 applications, a handful of Java services, and a sprawling collection of SSIS jobs that moved data around on cron-like schedules. Change windows were measured in weekends. Disaster-recovery tests required four weeks of preparation. Every new feature request began with the same grim question: “How much downtime can we negotiate?”

Cost was not the core problem; agility and reliability were. The CIO's mandate was therefore unambiguous: consolidate the estate, raise availability to “four nines”, and shorten release cycles from quarterly to on-demand — without breaking regulatory alignment (IRAP-aligned controls, Essential Eight maturity 3, PCI DSS for a niche payments module).

Architectural Vision

We settled on Azure because the organisation already consumed O365 and AD FS, which gave us identity primitives and an existing EA. Our architecture had three guiding principles:

  • Regions as blast-radii, not just fail-over sites. Everything — stateful or stateless — must run active-active across Australia East and Australia Southeast.
  • Cattle over pets from day 1. No hand-crafted servers, even for “quick POCs”. Infrastructure-as-Code or it doesn't exist.
  • Transactional integrity > lift-and-shift speed. Whenever a workload could not meet the RPO/RTO envelope under a re-platform scenario, we rewrote or replaced it.

Execution

Landing Zones and Guard-Rails

We began with a paired landing-zone model using Terraform and the CAF enterprise-scale module. Each zone deployed the following (a trimmed sketch follows the list):

  • Hub-and-spoke VNet topology with Azure Firewall Policy and DNS forwarding
  • Azure Policy assignments for tag governance, diagnostic settings, and CIS benchmarks
  • Sentinel + Log Analytics workspace per region, with cross-region workspace replication for audit immutability
  • Key Vault backed by HSM-protected keys; RBAC enforced through PIM roles
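
To make the shape of a zone concrete, here is an illustrative Terraform sketch of the hub resources in one region, assuming the azurerm provider. Names, address spaces, and retention values are placeholders rather than the client's real configuration, and the actual zones were composed through the CAF enterprise-scale module rather than raw resources like these:

    # One regional hub, heavily trimmed for illustration.
    variable "region" {
      type    = string
      default = "australiaeast" # the paired zone mirrors this in australiasoutheast
    }

    data "azurerm_client_config" "current" {}

    resource "azurerm_resource_group" "hub" {
      name     = "rg-hub-${var.region}"
      location = var.region
    }

    # Hub of the hub-and-spoke topology; spokes peer into this VNet.
    resource "azurerm_virtual_network" "hub" {
      name                = "vnet-hub-${var.region}"
      location            = azurerm_resource_group.hub.location
      resource_group_name = azurerm_resource_group.hub.name
      address_space       = ["10.0.0.0/16"] # placeholder range
    }

    # Per-region workspace feeding Sentinel; maximum retention for audit trails.
    resource "azurerm_log_analytics_workspace" "siem" {
      name                = "log-siem-${var.region}"
      location            = azurerm_resource_group.hub.location
      resource_group_name = azurerm_resource_group.hub.name
      sku                 = "PerGB2018"
      retention_in_days   = 730
    }

    # Premium SKU so keys are HSM-protected, per the guard-rails above.
    resource "azurerm_key_vault" "platform" {
      name                = "kv-platform-ae" # Key Vault names are globally unique, max 24 chars
      location            = azurerm_resource_group.hub.location
      resource_group_name = azurerm_resource_group.hub.name
      tenant_id           = data.azurerm_client_config.current.tenant_id
      sku_name            = "premium"
    }

Running the same module with the region variable flipped is what kept the two zones symmetric.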

Data Layer

The Oracle RAC workload was the largest anchor keeping them on-prem. We evaluated three options: running RAC on Azure VMs, Oracle Database@Azure, and a logical migration to PostgreSQL. Regulatory deadlines ruled out a wholesale RDBMS shift, so we re-hosted the existing RAC nodes onto Azure VMs clustered with FlashGrid, with synchronous Data Guard replicas spanning the two regions. For telemetry and analytics we ingested CDC streams into a region-paired Cosmos DB account and exposed query surfaces through Synapse serverless SQL pools.
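
A minimal sketch of the region-paired Cosmos DB account, again illustrative: the resource names are placeholders and the attribute names assume a recent (4.x) azurerm provider.

    # Geo-replicated Cosmos DB account receiving the CDC/telemetry streams.
    resource "azurerm_cosmosdb_account" "telemetry" {
      name                = "cosmos-telemetry-demo" # placeholder; must be globally unique
      location            = "australiaeast"
      resource_group_name = azurerm_resource_group.data.name # assumed to exist
      offer_type          = "Standard"
      kind                = "GlobalDocumentDB"

      # Named enable_automatic_failover on azurerm 3.x.
      automatic_failover_enabled = true

      consistency_policy {
        consistency_level = "Session"
      }

      # Primary write region.
      geo_location {
        location          = "australiaeast"
        failover_priority = 0
      }

      # Paired region; serves reads and takes over on failover.
      geo_location {
        location          = "australiasoutheast"
        failover_priority = 1
      }
    }

Session consistency is typically enough for telemetry; the two geo_location blocks are what make the account region-paired.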

Application Refactor

The .NET monoliths were decomposed along obvious bounded contexts (billing, customer-profile, and reporting) into six services running on AKS, with Dapr sidecars providing service discovery and pub/sub. We translated WebForms UIs to Razor Pages, introduced Serilog structured logging, and standardised health endpoints on Kubernetes liveness- and readiness-probe conventions. Legacy Java WAR files were packaged into OCI-compliant images and deployed unchanged; build pipelines simply swapped out the target context.

Pipelines and Environment Promotion

Every repo gained a GitHub Actions workflow: a PR-triggered build, Snyk and Trivy scans, unit tests run in parallel, a Docker build, and a call to Atlantis, which planned the corresponding Terraform change set. Once a release candidate was tagged, Argo CD (running in a management cluster) pulled the updated Helm charts and promoted them through dev → test → prod with automated smoke tests in k6 and Postman. Rollback is near-instantaneous: we pin the Helm release version and Argo CD reconciles.

Cut-over Strategy

Because Data Guard replication lag never exceeded 30 s and the Kubernetes deployments used blue-green semantics, we executed the final cut-over during a Tuesday 02:00 maintenance slot. We lowered DNS TTLs to 60 s the week prior, then shifted traffic gradually by re-weighting the Front Door backend pools from data-centre VIPs to Azure public IP prefixes. No dropped transactions were observed; synthetic monitoring reported <150 ms P95 end-to-end latency throughout.
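
The gradual shift is easiest to see in the backend pool definition. Below is a simplified sketch using the classic Front Door resource, with hypothetical hostnames, addresses, and weights; cut-over amounted to winding the data-centre weight down and the Azure weight up between rehearsals.

    resource "azurerm_frontdoor" "platform" {
      name                = "fd-platform-demo" # placeholder
      resource_group_name = azurerm_resource_group.edge.name # assumed to exist

      enforce_backend_pools_certificate_name_check = false

      backend_pool {
        name = "billing-api"

        # Legacy data-centre VIP: weight wound down during cut-over.
        backend {
          host_header = "billing.example.com"
          address     = "dc-vip.example.com"
          http_port   = 80
          https_port  = 443
          weight      = 10
        }

        # Azure entry point: weight wound up as synthetic checks stayed green.
        backend {
          host_header = "billing.example.com"
          address     = "aks-ingress.example.com"
          http_port   = 80
          https_port  = 443
          weight      = 90
        }

        load_balancing_name = "default-lb"
        health_probe_name   = "default-probe"
      }

      backend_pool_load_balancing {
        name = "default-lb"
      }

      backend_pool_health_probe {
        name = "default-probe"
        path = "/healthz" # the standardised health endpoint from the refactor
      }

      frontend_endpoint {
        name      = "default-frontend"
        host_name = "fd-platform-demo.azurefd.net"
      }

      routing_rule {
        name               = "billing-route"
        accepted_protocols = ["Https"]
        patterns_to_match  = ["/*"]
        frontend_endpoints = ["default-frontend"]

        forwarding_configuration {
          forwarding_protocol = "HttpsOnly"
          backend_pool_name   = "billing-api"
        }
      }
    }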

Results

  • Availability is now formally measured at 99.992% (Service Credits SLA report, first full quarter).
  • Release cadence improved from one per quarter to 12 deployments per week; median MTTR sits at 11 minutes thanks to Argo rollbacks.
  • Operational cost dropped 38% year-on-year despite higher utilisation; savings came from right-sized AKS node pools and elimination of over-provisioned SAN hardware.
  • Compliance posture: we passed the IRAP PROTECTED assessment with zero major findings, and Essential Eight maturity rose from level 1 to level 3.
  • Cultural shift: infrastructure tickets in Jira fell by 72%, as product squads now self-provision through Terraform Cloud workspaces governed by Policy Sets (sketched below).
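
The self-service pattern looks roughly like the following, sketched with the tfe provider; the organisation, workspace, and policy names are hypothetical, and the real policy set carried far more than one rule.

    # A product squad's workspace, governed by a centrally attached Sentinel policy set.
    resource "tfe_workspace" "billing_prod" {
      name         = "billing-prod"
      organization = "example-org"
      tag_names    = ["squad:billing", "env:prod"]
    }

    # Trivial placeholder policy; real guard-rails enforce tagging, SKUs, regions, etc.
    resource "tfe_sentinel_policy" "require_tags" {
      name         = "require-tags"
      organization = "example-org"
      policy       = "main = rule { true }"
      enforce_mode = "hard-mandatory"
    }

    resource "tfe_policy_set" "guard_rails" {
      name          = "platform-guard-rails"
      organization  = "example-org"
      policy_ids    = [tfe_sentinel_policy.require_tags.id]
      workspace_ids = [tfe_workspace.billing_prod.id]
    }

Squads own the workspace; the policy set travels with every plan, which is what let the ticket queue shrink without loosening the guard-rails.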

Key Takeaways

  1. Architect for failure first, migration second. Had we lifted and shifted VMs into a single region then added redundancy, the blast radius during refactor would have been unacceptable.
  2. Use IaC as social contract. Terraform reviews became the focal point where security, ops, and dev all collaborated—eliminating months of “hardening after go-live”.
  3. Observability is the migration safety-net. Without end-to-end tracing (OpenTelemetry + Azure Monitor), we would not have detected the subtle Oracle redo-log lag that almost derailed cut-over rehearsal #2.
  4. People > platforms. Cloud skills uplift—formal workshops plus pair-programming—proved more valuable than any reference architecture.

We continue to refine the platform: introducing GKE on Google Cloud as a secondary cloud, exploring serverless containers with Azure Container Apps, and piloting Chaos Mesh for failure-injection drills. If you're facing similar constraints (legacy sprawl, high-availability mandates, or audit pressure), we'd be happy to share the Terraform modules, Helm charts, and run-books we developed along the way.

About Us

Arxium is a software consultancy focused on helping government agencies, banks, and enterprises build systems that matter. We specialise in modernising legacy platforms, designing digital services, and delivering scalable, cloud-native architectures.

Leaders across Australia trust us to solve complex technical problems with clarity, pragmatism, and care. Whether it's migrating infrastructure, integrating systems, or launching a public-facing portal—we make software work in the real world.

Contact us to start a conversation about your next project.
