A full Azure region does not fail often, but when it does, the failure is total and it does not negotiate. Compute stops answering, managed databases stop accepting writes, the storage accounts pinned to that geography return errors, and every service you assumed was independent turns out to have shared a single fate. Teams that planned for a single virtual machine crashing, or a single availability zone losing power, discover that their resilience story had a hidden ceiling, and the ceiling was the region boundary. The natural response is to run in two regions at once, both live, both serving production traffic, so that the loss of one is absorbed rather than felt. That design is multi-region active-active on Azure, and it is the most demanding resilience pattern the platform asks you to reason about, because the moment two regions both accept writes, you have signed up for a distributed-systems problem that no amount of routing cleverness can hide.

Multi-Region Active-Active on Azure - Insight Crunch

This guide treats multi-region active-active as an architecture decision first and a configuration exercise second. The configuration is the easy half. Pointing a global router at two regional deployments, turning on cross-geography replication, and watching traffic split across continents takes an afternoon. The hard half is the one that the demos skip: when both regions accept a write to the same record at nearly the same instant, which write survives, who decides, and what happens to the loser. Get that answer wrong and active-active does not improve your resilience, it manufactures a new class of data-correctness bug that single-region systems never see. So the through-line of everything below is a single rule, stated plainly and defended throughout. The hard part of active-active is multi-region writes and the conflicts they create, and unless your data model has a deliberate answer for those conflicts, active-passive is the safer multi-region design. Naming that rule up front is not a hedge against the pattern. It is the filter that tells you whether you should reach for it at all.

By the end you will be able to design an active-active topology on Azure with intent: choose a global traffic layer and know why, pick a data-replication strategy and know the consistency price it charges, decide how write conflicts are resolved before they ever happen, state the recovery point objective and recovery time objective the design actually targets, and compare the whole thing honestly against the simpler active-passive alternative that, for a large fraction of real workloads, is the correct choice. We will walk a concrete reference design, then break it on purpose to see every failure mode it must survive, and we will hand you a decision table you can paste into a design review and defend line by line.

What multi-region active-active actually means

Start with vocabulary, because the words get used loosely and the looseness causes bad designs. A region on Azure is a set of datacenters within a defined geography, connected by a low-latency network and treated as one deployment target. Most regions also expose availability zones, which are physically separate datacenters inside the same region with independent power, cooling, and networking. Spreading a deployment across zones protects you from a datacenter-level fault such as a power event or a localized network outage, and for many applications that is the resilience tier that matters most day to day. What zones do not protect against is the loss of the entire region: a region-wide control-plane incident, a regional capacity event, or a natural disaster that takes the whole metropolitan area offline. Surviving that requires a second region, and the question is how the second one participates.

In an active-passive topology, one geography serves all production traffic while the second sits ready but idle, or nearly idle. Data flows continuously from the primary to the secondary through replication, so the standby holds a recent copy. When the primary is lost, you fail over: promote the secondary, repoint traffic, and resume. The defining property is that only one geography accepts writes at any moment, so there is exactly one authoritative copy of the data at all times. That single-writer invariant is the reason active-passive is simpler, and it is worth holding onto as the baseline against which active-active must justify its extra cost.

In an active-active topology, both geographies serve production traffic simultaneously. Users in one part of the world are routed to the nearer deployment, both deployments are warm and exercised continuously, and the loss of one geography removes capacity rather than removing the service. Read traffic spreads naturally. The complication, and it is the entire complication, is writes. If both deployments accept writes to the same logical dataset, then two users can modify the same record in two geographies before the replication between them has caught up, and now there are two divergent versions of one record with no inherent answer for which is correct. That divergence is a write conflict, and how a system detects and resolves it is the single most important property of an active-active design.

Is active-active just active-passive with traffic sent to both regions?

No, and treating it that way is the most common and most expensive mistake. Active-passive keeps one writer, so there is always a single source of truth. Sending live write traffic to a second geography removes that guarantee and introduces concurrent writes to the same data, which is a distributed-systems problem requiring an explicit conflict-resolution strategy, not merely a routing change.

The distinction matters because it determines where the engineering effort goes. If you believe active-active is a networking feature, you will spend your time on the traffic layer, get it working, demonstrate failover, and ship. The data tier will look fine in testing because conflicts are rare under light, well-behaved load. Then production traffic with real concurrency arrives, two regions update the same shopping cart or the same inventory count within the replication window, and the system silently keeps the wrong value. The bug does not announce itself. It shows up as a customer complaint about a missing item or a double charge, weeks later, with no stack trace. This is why the framing here is relentless about the data model: the routing is solved technology, and the writes are where the design lives or dies.

There is also a recovery-objective dimension that separates the two patterns cleanly. Two numbers govern any resilience design. The recovery point objective, or RPO, is the amount of data you can afford to lose measured in time, the gap between the last replicated state and the moment of failure. The recovery time objective, or RTO, is how long the service may be unavailable before it must be back. Active-passive RTO is bounded by how fast you can detect failure and promote the standby, which can be seconds with automated failover or minutes with a human in the loop, and its RPO is bounded by replication lag at the instant of failure. Active-active aims to drive RTO toward zero, because the surviving geography is already serving, so there is nothing to promote and nothing to start. Its RPO is governed by how the data tier handles the in-flight writes that had not yet replicated when a geography was lost. The pattern you choose should follow from the RPO and RTO you actually need, and a great many systems need numbers that active-passive already meets, which is the first place over-engineering creeps in.
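To make the two numbers concrete, here is a minimal sketch of how they compose under each pattern. The function names and the inputs are illustrative assumptions, not measurements of any Azure service, and the active-active model assumes that detection and rerouting is the only unavailability left once a region is lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryEstimate:
    rto_seconds: float  # how long the service may be unavailable
    rpo_seconds: float  # how much recent data, measured in time, may be lost


def active_passive_estimate(detect_s: float, promote_s: float,
                            replication_lag_s: float) -> RecoveryEstimate:
    # RTO: the standby cannot serve until the failure is detected and it is promoted.
    # RPO: writes newer than the replication lag never reached the standby.
    return RecoveryEstimate(rto_seconds=detect_s + promote_s,
                            rpo_seconds=replication_lag_s)


def active_active_estimate(detect_s: float,
                           replication_lag_s: float) -> RecoveryEstimate:
    # RTO: the survivor is already serving, so only detection and rerouting remain
    # (an assumption of this sketch).
    # RPO: still bounded by writes that had not replicated when the region was lost.
    return RecoveryEstimate(rto_seconds=detect_s, rpo_seconds=replication_lag_s)
```

With hypothetical inputs of 30 seconds to detect, 60 seconds to promote, and 5 seconds of lag, the active-passive estimate is a 90-second RTO and a 5-second RPO. The active-active estimate drops the promotion term and keeps the same RPO.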

The Azure services that make active-active real

An active-active design has two layers that must both be solved: the traffic layer that decides which geography a request reaches, and the data layer that keeps state coherent across geographies. They are independent problems with independent Azure services, and conflating them is another path to a design that demos well and fails in production.

The global traffic layer

The traffic layer answers one question for every incoming request: which regional deployment should serve this. Azure gives you two primary global services for this, and they work at different layers of the stack, which changes everything about how they behave.

Azure Front Door operates at the application layer, layer 7, and it terminates the client connection at a point of presence near the user before forwarding the request over Microsoft’s private global network to a healthy backend. Because it speaks HTTP and HTTPS, it can route on path, host, headers, and cookies, terminate TLS at the edge, cache static responses, and enforce a web application firewall before traffic ever reaches your origin. Its routing uses anycast, so every user resolves the same address and the internet’s own routing delivers them to the nearest edge, where a split-TCP arrangement keeps the slow public-internet leg short and runs the long leg over Microsoft’s backbone. The property that matters most for resilience is failover speed: because Front Door makes the routing decision per request at the edge using live health probes, it can shift traffic away from an unhealthy origin in seconds, without waiting on anything in the client.

Azure Traffic Manager operates at the DNS layer. It does not sit in the request path at all. Instead it answers DNS queries, handing back the address of an endpoint chosen by a routing method you configure: priority for active-passive failover ordering, weighted for percentage splits, performance for lowest-latency endpoint, geographic for routing by the user’s location, and others. Because it works through DNS, Traffic Manager load balances at the domain level only and then steps out of the way, so traffic goes straight from the client to the chosen endpoint with no proxy hop and no edge termination. The cost of that simplicity is failover speed. DNS responses are cached by resolvers and clients according to their time-to-live, and not every system honors a short TTL, so when an endpoint goes unhealthy some clients keep resolving to it until their cached answer expires. Traffic Manager fails over, but it cannot fail over as quickly as a per-request edge proxy, and you must design the TTL and health-probe cadence with that lag in mind.
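The lag described here can be estimated with simple arithmetic. The sketch below is a rough model, not Traffic Manager’s documented formula: it assumes a fixed probe interval, a count of consecutive failed probes before an endpoint is marked unhealthy, and resolvers that honor the TTL exactly. All parameter names are hypothetical.

```python
def worst_case_dns_failover_s(probe_interval_s: float,
                              failures_to_unhealthy: int,
                              dns_ttl_s: float) -> float:
    """Rough upper bound on how long a client may keep using a failed endpoint."""
    # Time for the prober to notice: consecutive failed probes at a fixed interval.
    detection_s = probe_interval_s * failures_to_unhealthy
    # A client that resolved just before the endpoint was marked unhealthy
    # keeps its cached answer for one full TTL.
    return detection_s + dns_ttl_s
```

For example, a 30-second interval, three failures, and a 60-second TTL give an estimate of 150 seconds. Resolvers that ignore a short TTL push the real figure higher, which is why the estimate is a planning aid rather than a guarantee.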

The choice between them is not subtle once you frame it correctly, and the longer treatment in the comparison at Front Door vs CDN vs Application Gateway walks the full matrix. For a web or API workload that benefits from edge termination, TLS offload, caching, and a firewall, and that needs the fastest possible failover, Front Door is the default. For non-HTTP workloads, for cases where you need the client to connect directly to the endpoint with no proxy in the path, or where DNS-level distribution is genuinely sufficient, Traffic Manager fits, and the two can even be layered so that DNS-level geographic steering sits in front of regional edge routing. What both share is the property active-active depends on: they continuously probe the health of each regional deployment and steer away from one that stops answering, which is what turns the loss of a geography into a capacity reduction rather than an outage.

The data layer and its replication strategies

The traffic layer is the part everyone gets right. The data layer is the part that decides whether the design is sound. State has to exist in both geographies, it has to stay coherent, and the strategy you choose determines both the consistency you can promise and the conflicts you will have to resolve.

Azure Cosmos DB is the service most directly built for this pattern, because it offers multi-region writes, sometimes described as multi-master, where every configured geography accepts writes locally and the service replicates them to the others asynchronously. The engineering detail of how Cosmos models and partitions data is covered in depth in the Cosmos DB engineering guide, and the multi-region write capability is what makes it the natural fit for active-active. Local writes mean low write latency in every geography, which is the headline benefit. The price is that asynchronous replication allows two geographies to accept conflicting writes to the same item before they have reconciled, so Cosmos requires you to declare a conflict-resolution policy on the container, and it can only be set when the container is created, not changed afterward. There are two policy modes. Last Write Wins is the default: the service compares a numeric property, by default the system timestamp _ts, and the highest value wins, which on the NoSQL API you may redirect to a custom numeric path of your own. Custom resolution hands the decision to a merge stored procedure you write in JavaScript, which the service runs exactly once per conflict, and if you select custom but register no procedure, or the procedure fails, the conflicting writes land in a conflicts feed for your application to resolve by hand. Choosing among these is not a tuning knob, it is a statement about what your data means, and we return to it as the central design decision.

Azure SQL Database takes a different posture. Its geo-replication and the failover-group abstraction built on top maintain one or more readable secondaries in other geographies, kept current by asynchronous replication, but the secondaries are read-only. Only the primary accepts writes. That makes Azure SQL a natural fit for active-passive at the database tier, or for an active-active topology in which both geographies serve reads locally but all writes route to a single primary geography. You get local read scale and a fast-promotable standby for failover, and you keep the single-writer invariant that makes correctness simple, at the cost of write latency for users far from the primary. This read-local, write-primary arrangement is one of the most useful and most underused patterns on Azure, because it captures much of the availability benefit of active-active without the write-conflict liability, and we treat it as a first-class option rather than a consolation prize.

Azure Storage offers geo-redundant replication, and in its read-access variant the secondary geography is readable, but the same single-writer principle applies: writes go to the primary, and the secondary is for reads and for failover, with replication asynchronous and therefore subject to lag. The mechanics of configuring storage and database replication for failover, including the operational runbook, are covered in setting up Azure Site Recovery for DR, and the broader question of how all these pieces compose into a coherent recovery strategy is the subject of the disaster recovery architecture guide. The pattern across every data service is the same shape: synchronous replication would let you avoid conflicts entirely by refusing to acknowledge a write until both geographies have it, but the cross-geography network round trip makes that latency unacceptable for most interactive workloads, so real systems replicate asynchronously, accept lag, and must therefore have an answer for what happens to writes caught inside the lag window.

The traffic layer in depth: probes, affinity, and graceful failover

The traffic layer is the solved half of active-active, but solved does not mean trivial, and a handful of its mechanics decide whether failover is graceful or jarring. Understanding them is what lets you tune the layer rather than accept its defaults and hope.

Health probing is the heartbeat of the whole arrangement. Front Door and Traffic Manager both send periodic probes to each registered backend and use the results to decide where traffic may go, and three parameters govern the behavior: the probe interval, how often a probe is sent; the probe path, what endpoint it hits; and the threshold, how many consecutive failures mark a backend unhealthy. A short interval and a low threshold detect a fault quickly but risk evacuating a geography over a transient blip, while a long interval and a high threshold are stable but slow to react, leaving traffic flowing to a faulting geography for longer. The tuning is a direct trade between failover speed and false-positive resistance, and the right setting depends on how costly a brief over-eager evacuation is for your workload versus how costly a few extra seconds of routing to a bad geography is. There is no universal answer, only a deliberate one.
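The threshold behavior can be sketched as a small state machine. This is a simplified model of the idea described above, not the exact algorithm either service uses; the class name is hypothetical, and recovering on a single successful probe is an assumption of the sketch.

```python
class BackendHealth:
    """Marks a backend unhealthy after N consecutive failed probes."""

    def __init__(self, failure_threshold: int) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.healthy = True

    def record_probe(self, succeeded: bool) -> bool:
        if succeeded:
            # Assumption: one success is enough to restore the backend.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

A threshold of 1 evacuates on the first failed probe, which is fast and jumpy. A threshold of 3 rides out a transient blip at the cost of two more probe intervals before traffic moves.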

The probe path is where the idea of deep health becomes concrete. A shallow probe hits a static endpoint that returns success whenever the web process is up, which tells you almost nothing about whether the geography can serve real traffic. A deep probe hits an endpoint that exercises the critical dependencies, confirming the data tier is reachable and responsive, so that a geography whose database has failed reports unhealthy and is evacuated, rather than continuing to accept writes it cannot persist. The deep probe costs a little more to build and a little more to run, and it is the single highest-impact piece of the traffic layer, because it is what makes failover trigger on the failures that actually matter instead of only on total process death.
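A deep probe handler can be sketched in a few lines. The version below is framework-agnostic and assumes each dependency check is a hypothetical callable that returns True when the dependency is usable; wiring it to a real HTTP route and to real database or cache clients is left out.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Return an HTTP-style status and a per-dependency report."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises counts as a failed dependency.
            ok = False
        results[name] = "ok" if ok else "failed"
    # 200 only when every critical dependency passed; otherwise 503 so the
    # router takes this deployment out of rotation.
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

Keeping the checks cheap and bounded by short timeouts matters, because the probe runs on every interval from every edge location that probes the origin.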

Session affinity is the next mechanic, and it interacts directly with consistency. By default a stateless workload can have each request routed independently, so two requests from the same user might hit two geographies, which is fine for stateless reads but interacts badly with read-your-writes expectations under asynchronous replication. Pinning a user’s session to one geography, which Front Door supports through affinity, means a user’s reads follow their writes to the same place and they never see their own update apparently missing, which is the routing-layer complement to choosing session consistency at the data layer. The two work together: session affinity keeps a user on one geography, and session consistency guarantees coherence if they do move, and a sound design usually wants both for any workload where users notice staleness in their own actions.

Graceful failover, finally, is about what happens to requests in flight when a geography is evacuated. A well-behaved client and a well-configured router retry a failed request against a healthy geography, so a single failure during the transition becomes a slightly slower request rather than an error the user sees. This depends on requests being safe to retry, which is another reason idempotent write design matters: if retrying a write might apply it twice, you cannot retry freely, and the failover becomes visibly lossy. Designing writes to be idempotent, so that a retried write is harmless, lets the traffic layer paper over the transition entirely, which is the difference between a failover users never notice and one that generates a spike of errors. The traffic layer and the data layer are not as separable as they first appear, and the seams between them, affinity and idempotency, are where a polished active-active design distinguishes itself from a merely functional one.
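One common way to make a write safe to retry is an idempotency key supplied by the client. The sketch below is an in-memory illustration with hypothetical names; in a real deployment the key record would have to live in the replicated data tier so that a retry landing in the other geography can still find it.

```python
from typing import Dict, List


class IdempotentOrderStore:
    """Applies each order at most once per idempotency key."""

    def __init__(self) -> None:
        self._order_id_by_key: Dict[str, int] = {}
        self.orders: List[dict] = []

    def place_order(self, idempotency_key: str, order: dict) -> int:
        # A retried request carries the same key, so return the first result
        # instead of creating a second order.
        if idempotency_key in self._order_id_by_key:
            return self._order_id_by_key[idempotency_key]
        order_id = len(self.orders) + 1
        self.orders.append({"id": order_id, **order})
        self._order_id_by_key[idempotency_key] = order_id
        return order_id
```

With this shape, a client that times out during an evacuation can resend the same request with the same key, and the worst outcome is a slower response rather than a duplicate order.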

A reference active-active design, walked through end to end

Abstract trade-offs get concrete fast when you trace a single request and a single write through a real topology, so picture a global commerce API deployed in two geographies, one in North Europe and one in East US, both warm, both serving. The goal is that a customer in Frankfurt is served from North Europe, a customer in Boston is served from East US, each sees low latency, and the loss of either geography costs capacity and nothing else.

The routing path

At the front sits Azure Front Door. Both regional deployments register as origins in a single origin group, and Front Door probes each one on a health endpoint that returns success only when that deployment is genuinely able to serve, meaning its dependencies are reachable and not merely that the process is running. A request from Frankfurt resolves to the nearest Front Door edge, terminates TLS there, and Front Door selects the North Europe origin because latency-based routing favors it and its health probe is green. The same logic sends Boston to East US. When North Europe’s health probe starts failing, Front Door stops selecting it within the span of a few probe intervals and routes Frankfurt traffic to East US instead. Frankfurt users now pay the higher latency of crossing to East US, but they are served, and that latency penalty is the visible, acceptable cost of a geography loss. Nothing in the client changed, no DNS cache had to expire, and no human had to act.
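The selection logic in this walk-through can be modeled as "lowest latency among healthy origins". The sketch below is only a mental model, not Front Door’s actual routing algorithm, and the latency figures used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Origin:
    name: str
    latency_ms: float  # latency as seen from the edge serving this user
    healthy: bool      # result of the most recent health evaluation


def pick_origin(origins: Iterable[Origin]) -> Optional[Origin]:
    # Unhealthy origins are never candidates, however close they are.
    candidates = [o for o in origins if o.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.latency_ms)
```

From a Frankfurt edge with North Europe at 20 ms and East US at 90 ms, the function picks North Europe while it is healthy and East US once it is not, which is the capacity-loss-not-outage behavior the design is after.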

The health endpoint design deserves a sentence of its own because it is where this goes wrong in practice. If the probe checks only that the web process answers, it will report a geography healthy while that geography’s database is unreachable, and Front Door will keep sending writes into a deployment that cannot persist them. A useful probe verifies the dependencies the deployment needs to do real work, so that an unhealthy data tier pulls the whole geography out of rotation rather than leaving it accepting traffic it cannot serve.

The data path and where the design forks

Now the writes, which is where the topology forks into the two designs you must choose between deliberately.

In the first design, the data tier is Azure Cosmos DB configured for multi-region writes across North Europe and East US. The Frankfurt customer’s write to their cart lands locally in North Europe and is acknowledged immediately, giving low write latency, and Cosmos replicates it asynchronously to East US. The Boston customer’s writes behave symmetrically. This is true active-active at the data tier: both geographies accept writes, both have low local write latency, and the loss of one geography leaves the other already writing with no promotion step, so RTO approaches zero. The bill for this is the conflict problem. If the Frankfurt customer and a second device logged into the same account in Boston both modify the same cart item inside the replication window, Cosmos will detect a conflict and resolve it by the policy you declared on the container, and you had better know what that policy does to the customer’s data.

In the second design, the data tier is Azure SQL Database with the write primary in North Europe and a readable secondary in East US through a failover group. Both geographies serve reads locally, so a Boston customer browsing the catalog reads from the East US secondary with low latency. But every write, adding to a cart, placing an order, goes to the North Europe primary, so the Boston customer’s writes cross the Atlantic and pay that latency. The compensation is decisive: there is exactly one writer, so there is never a write conflict, and correctness is the same as a single-region system. On loss of North Europe, the failover group promotes the East US secondary to primary, writes resume there, and the RTO is the promotion time, typically seconds to a couple of minutes depending on configuration, with the RPO bounded by the replication lag at the instant of failure. This is the read-local, write-primary pattern, and notice that it delivers local-latency reads and automatic regional failover, which is most of what people actually want from active-active, without ever exposing the application to a conflict.
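The read-local, write-primary rule is small enough to sketch directly. The class below is a hypothetical application-side router, not an Azure SQL API: it assumes region names as plain strings and models failover as promoting the first surviving region.

```python
from typing import List


class ReadLocalWritePrimary:
    def __init__(self, primary: str, regions: List[str]) -> None:
        if primary not in regions:
            raise ValueError("primary must be one of the regions")
        self.primary = primary
        self.regions = list(regions)

    def route(self, operation: str, client_region: str) -> str:
        if operation == "write":
            # Single-writer invariant: every write goes to the primary.
            return self.primary
        if operation == "read":
            # Reads stay local when a replica exists in the client region.
            return client_region if client_region in self.regions else self.primary
        raise ValueError(f"unknown operation: {operation}")

    def fail_over(self, lost_region: str) -> None:
        self.regions.remove(lost_region)
        if not self.regions:
            raise RuntimeError("no surviving region")
        if lost_region == self.primary:
            # Assumption: the first surviving region is promoted.
            self.primary = self.regions[0]
```

Before failover, a Boston write routes to North Europe and a Boston read stays in East US. After North Europe is lost, both route to East US, and there is still exactly one writer.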

Reading the fork honestly

The two designs are not better and worse in the abstract. They answer different questions. The Cosmos design buys local-latency writes in every geography and a near-zero RTO, and it pays with the obligation to define conflict semantics that match what the data means. The SQL design keeps correctness trivial and accepts cross-geography write latency for the non-primary geography and a short promotion-based RTO. Which is right depends on whether your write latency from the far geography is tolerable and whether your data has a sound conflict-resolution story, and that pair of questions is exactly the decision the next sections make formal. The reference topology above is the same for both at the traffic layer; the entire difference, and the entire risk, lives in how state is written.

The write-conflict rule: why multi-region writes are the hard part

Here is the rule this whole guide is built around, stated as plainly as it can be. The hard part of active-active is multi-region writes and the conflicts they create, so unless your data model has a deliberate, correct answer for those conflicts, active-passive is the safer multi-region design. Everything else, the routing, the health probes, the read replicas, is solved engineering you can buy off the shelf. The conflict question is the one only you can answer, because only you know what your data means.

Why a conflict is not a bug you can fix later

A write conflict in an active-active system is not a defect in Azure and not something a patch resolves. It is an unavoidable consequence of two facts that you chose when you adopted the pattern: both geographies accept writes, and replication between them is asynchronous because synchronous cross-geography replication is too slow for interactive use. Given those two facts, there must exist a window of time during which a write has been accepted in one geography and not yet seen in the other. Any write to the same item in the other geography during that window produces two versions of the truth, and physics guarantees the window exists because the geographies are far apart and light is not infinitely fast. The system cannot prevent the conflict. It can only resolve it, and resolution always means choosing a winner and discarding or merging the loser. The question is never whether you will have conflicts. It is whether the way they are resolved preserves the meaning of your data.

The three shapes a conflict takes

Conflicts in a multi-write Cosmos configuration come in three shapes, and naming them makes the design conversation precise. An insert conflict happens when two geographies create a new item with the same identifying value at the same time, for example two records that both claim the same unique key. A replace conflict happens when two geographies update the same existing item concurrently, which is the classic case of two users editing one record. A delete conflict happens when one geography deletes an item while another updates it, leaving the system to decide whether the thing still exists. Each shape needs an answer, and the answer is your conflict-resolution policy.
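The three shapes can be written down as a tiny classifier, which is sometimes useful in tests and in monitoring labels. The operation names and the function are hypothetical and do not correspond to any Cosmos DB SDK type.

```python
from enum import Enum


class ConflictKind(Enum):
    INSERT = "insert"
    REPLACE = "replace"
    DELETE = "delete"


def classify_conflict(op_a: str, op_b: str) -> ConflictKind:
    """Classify two concurrent operations on the same item or unique key."""
    ops = {op_a, op_b}
    if ops == {"insert"}:
        return ConflictKind.INSERT   # both regions created the same key
    if ops == {"replace"}:
        return ConflictKind.REPLACE  # both regions updated the same item
    if ops == {"delete", "replace"}:
        return ConflictKind.DELETE   # one region deleted what the other updated
    raise ValueError(f"not a conflict shape covered here: {op_a!r}, {op_b!r}")
```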

Last Write Wins and what it quietly throws away

The default policy, Last Write Wins, resolves every conflict by keeping the write with the highest value on a chosen numeric property, which defaults to the system timestamp. It is simple, it is deterministic, and it requires no code, which is why it is the default and why it is correct for a large class of data. If a record represents the latest known state of something and an older value carries no information you need, such as a user’s most recent profile photo or a device’s most recent reported temperature, then keeping the newest write and dropping the older one loses nothing that mattered. Last Write Wins is the right answer when later genuinely means more correct.

The danger of Last Write Wins is that it silently discards the losing write, and for some data the losing write carried information you needed. Consider an inventory counter or an account balance. If North Europe decrements stock by one and East US decrements it by one within the conflict window, both writes set the value to the same decremented number, and Last Write Wins keeps one and throws the other away, so two sales record as one decrement and the count is now wrong by one. The data model that survives this is not a stored current value at all but an append-only log of decrement events, where two events are two distinct items that never conflict and the count is derived by summing them. The lesson generalizes: Last Write Wins is safe for last-state-wins data and unsafe for data where every write is a fact that must be preserved, and recognizing which kind you have is the design judgment that matters most.
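The lost decrement is easy to reproduce in a few lines. The sketch below simulates the two data models in plain Python; the timestamps and field names are illustrative, and nothing here calls Cosmos DB.

```python
from typing import Dict, List


def last_write_wins(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    # Keep whichever version carries the higher timestamp.
    return a if a["ts"] >= b["ts"] else b


def stock_after_lww(start: int) -> int:
    # Both regions read the same starting value and each sells one unit
    # inside the replication window.
    north_europe = {"value": start - 1, "ts": 1001}
    east_us = {"value": start - 1, "ts": 1002}
    return last_write_wins(north_europe, east_us)["value"]


def stock_from_events(start: int, events: List[Dict[str, int]]) -> int:
    # Append-only model: each sale is its own item, so nothing is overwritten
    # and the count is derived by summing the deltas.
    return start + sum(e["delta"] for e in events)


print(stock_after_lww(10))                                  # 9, one sale lost
print(stock_from_events(10, [{"delta": -1}, {"delta": -1}]))  # 8, both sales kept
```

The stored-value model ends one unit too high because one write was discarded. The event model has no shared value to fight over, so both sales survive.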

Custom resolution, merge logic, and the conflicts feed

When later does not mean more correct, you move to custom resolution, where you register a merge stored procedure that the service invokes exactly once per conflict to compute the reconciled result. For a collaborative document you might merge non-overlapping edits. For a counter you might add the deltas rather than choosing one. The merge logic encodes your data’s semantics, which is precisely why no platform can write it for you. The safety net is the conflicts feed: if you choose custom resolution and the procedure is absent or fails, the conflicting versions are not silently merged or dropped, they are written to a feed your application reads and resolves on its own schedule, which lets you handle rare or complex conflicts asynchronously and even alert a human when the backlog grows. A growing unresolved-conflict backlog is itself a signal worth monitoring, because it means your resolution process is falling behind the rate at which conflicts are being created.
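The merge-or-feed flow can be simulated to show its shape. In Cosmos DB the merge logic is a JavaScript stored procedure; the Python below is only a model of the behavior described above, and the function signatures, including the idea of passing a base value alongside the conflicting versions, are assumptions of this sketch.

```python
from typing import Callable, Dict, List, Optional

Merge = Callable[[int, List[int]], int]


def merge_counter(base: int, versions: List[int]) -> int:
    # Add up each region's delta against the shared base instead of picking one.
    return base + sum(v - base for v in versions)


def resolve(base: int, versions: List[int], merge: Optional[Merge],
            conflicts_feed: List[Dict[str, object]]) -> Optional[int]:
    if merge is not None:
        try:
            return merge(base, versions)
        except Exception:
            pass  # fall through to the feed when the merge logic fails
    # No procedure, or it failed: park the conflict for the application.
    conflicts_feed.append({"base": base, "versions": list(versions)})
    return None
```

With a base of 10 and two regional versions of 9, `merge_counter` yields 8. With no merge function registered, the same conflict lands in the feed, and the length of that list is the backlog worth alerting on.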

The design move that dissolves the problem

The most reliable way to handle write conflicts is to build a data model in which they cannot arise, and there are two standard moves. The first is append-only modeling, already described: represent changes as immutable events rather than mutating shared state, so concurrent writes become distinct records that never collide and the current value is computed by folding the events. The second is region-affinity, sometimes called partition-by-geography: arrange the data so that each item is only ever written in one geography even though all geographies can read it. If every customer’s account is anchored to a home geography and only that geography writes it, you have active-active at the system level, with full read locality and capacity in both places, but single-writer semantics per item, which means no conflicts. Region-affinity is how many large active-active systems get the availability of the pattern without paying its correctness tax, and it deserves to be the first thing you reach for before you reach for a conflict-resolution policy.
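Region-affinity reduces to a lookup at write time. The sketch below assumes a hypothetical mapping from customer to home geography and shows the decision a regional deployment makes when a write arrives; how the forward actually happens is out of scope.

```python
from typing import Dict, Tuple


def handle_write(local_region: str, customer_id: str,
                 home_regions: Dict[str, str]) -> Tuple[str, str]:
    """Decide whether this deployment may apply a write or must forward it."""
    home = home_regions[customer_id]
    if home == local_region:
        # This geography owns the item, so it is the only writer.
        return ("apply", local_region)
    # Another geography owns the item; forwarding keeps single-writer semantics.
    return ("forward", home)
```

For a customer anchored to North Europe, a write arriving in North Europe is applied locally, and the same write arriving in East US is forwarded. No item ever has two writers, so there is nothing to resolve.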

The InsightCrunch multi-region decision table

The decision between active-active and active-passive, and between the data-tier variants inside each, reduces to a small number of dimensions that you can rate honestly for any given workload. The InsightCrunch multi-region decision table rates the patterns on the five axes that actually decide the question, then names the deciding signal that should tip you one way or the other. Rate your workload on each row, and the pattern whose profile matches your real requirements is your answer, not the pattern that sounds the most resilient in a meeting.

| Dimension | Active-passive (single writer) | Read-local, write-primary | Active-active (multi-region writes) | Deciding signal |
| --- | --- | --- | --- | --- |
| RTO | Seconds to minutes: detect failure, promote standby | Seconds to minutes: promote secondary to primary | Near zero: surviving geography already writing | Choose active-active only if a seconds-to-minutes RTO is genuinely insufficient |
| RPO | Bounded by replication lag at failure | Bounded by replication lag at failure | Bounded by in-flight unreplicated writes at failure | Comparable across patterns; active-active does not magically improve RPO |
| Write conflicts | None: one writer, one truth | None: one writer, reads are local | Real: requires a declared resolution policy | If the data model cannot resolve conflicts soundly, do not choose multi-region writes |
| Complexity | Lowest: standby plus failover runbook | Moderate: read routing plus failover | Highest: conflict semantics, monitoring, testing | Match complexity to team capacity to operate and test it |
| Cost | Lower: standby can run reduced until needed | Moderate: full read tier in both geographies | Highest: both geographies fully provisioned and hot | If running both geographies hot is not justified by the RTO need, prefer the cheaper pattern |

Read the table as a filter, not a scoreboard. Active-active wins exactly one row outright, RTO, and ties or loses every other row. That is the honest shape of the trade-off. The pattern is worth its cost when, and only when, a near-zero RTO is a real requirement that a seconds-to-minutes promotion cannot meet, and when your data model has a sound answer for the conflict row. If either of those is false, one of the simpler patterns is the correct engineering choice, and choosing active-active anyway is how teams end up paying double the infrastructure bill to host a data-correctness bug.

What the table does not capture

Two things sit outside the grid because they are preconditions rather than dimensions. The first is write-latency tolerance for the read-local, write-primary pattern: if your far-geography users cannot tolerate the cross-geography write latency, that pattern is off the table regardless of how well it scores elsewhere, and you are pushed toward either true multi-region writes or accepting that the far geography is read-only by design. The second is regulatory data residency: if law requires that a given customer’s data be written and stored in a specific geography, region-affinity is not an optimization, it is a constraint, and it happens to also solve your conflict problem for free. Let the constraints prune the options before you score the survivors.
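Read as a filter, the table and its preconditions can be sketched as a small function. This is a minimal illustration under stated assumptions, not a substitute for rating the workload: the `Workload` fields, the `choose_pattern` name, and the ordering of the checks are all hypothetical, chosen to mirror the rows and deciding signals above.

```python
from dataclasses import dataclass

ACTIVE_PASSIVE = "active-passive (single writer)"
READ_LOCAL = "read-local, write-primary"
ACTIVE_ACTIVE = "active-active (multi-region writes)"


@dataclass(frozen=True)
class Workload:
    """Hypothetical inputs; answer each one honestly for your own workload."""
    needs_near_zero_rto: bool      # a seconds-to-minutes promotion is not enough
    needs_local_writes: bool       # far users cannot tolerate cross-geography writes
    needs_local_reads: bool        # far users need local read latency
    conflict_model_is_sound: bool  # append-only, region-affinity, or a validated policy


def choose_pattern(w: Workload) -> str:
    """Multi-region writes need both a justification and a sound conflict answer."""
    multi_write_justified = w.needs_near_zero_rto or w.needs_local_writes
    if multi_write_justified and w.conflict_model_is_sound:
        return ACTIVE_ACTIVE
    # Otherwise keep a single writer; add local reads only if they are needed.
    if w.needs_local_reads:
        return READ_LOCAL
    return ACTIVE_PASSIVE
```

Note that a workload that wants near-zero RTO but has no sound conflict story falls through to a single-writer pattern, which is the rule this guide defends throughout.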

Failure modes active-active must survive

A resilience pattern earns trust only when you have broken it on purpose and watched it recover. An active-active design that has never been tested under failure is a hypothesis, not a safeguard. These are the failure modes the design must handle, each with the symptom you will see and the property that saves you.

Region loss while writes are in flight

The headline scenario is the one you built for: a whole geography becomes unavailable. The traffic layer handles its half cleanly, because Front Door or Traffic Manager probes detect the unhealthy geography and steer all traffic to the survivor, so users keep being served at the cost of higher latency for those who were local to the lost geography. The data layer is where the subtlety lives. Any writes that had been accepted in the lost geography but not yet replicated to the survivor are, for the moment, unreachable, and that gap is your real RPO. With Cosmos multi-region writes the survivor keeps accepting writes immediately, so there is no write outage, but the unreplicated writes from the lost geography are pending until that geography returns and reconciles. With a write-primary design, if the lost geography held the primary, the failover group promotes the secondary and writes resume there, with the same unreplicated-tail caveat on RPO. The property that saves you is that one geography was always able to keep writing, which is the entire point of the pattern, and the cost you accept is the small tail of in-flight writes that defines your RPO.

Concurrent writes producing conflicts

The everyday failure mode, far more common than a region loss, is two geographies writing the same item inside the replication window. The symptom depends entirely on your policy. Under Last Write Wins on inappropriate data the symptom is silent corruption, a count or balance that is quietly wrong, with no error and no log entry, which is the worst kind of failure because nothing tells you it happened. Under a sound model, append-only events or region-affinity, the symptom is nothing at all, because the conflict never forms. Under custom resolution the symptom is either a correctly merged result or, when the procedure cannot decide, an entry in the conflicts feed that your application drains. The lesson is that this failure mode is not detected and fixed at runtime, it is designed out beforehand, and the only defense is the data model you chose before the first conflict ever occurred.

Split-brain and the asymmetric-partition trap

A network partition that isolates the geographies from each other while both remain reachable by their local users is the split-brain scenario, and it is the most dangerous because both geographies believe they are healthy and keep accepting writes that cannot replicate. In a multi-write design this is simply a longer conflict window: both sides write, and when the partition heals, the accumulated conflicts resolve by policy, which is tolerable if and only if your policy is sound for that data. In a write-primary design, split-brain is the case you must guard against hardest, because if a partition causes the secondary to promote itself while the original primary is still alive and writing, you now have two primaries accepting writes, which is exactly the two-writers situation the pattern was supposed to forbid. The defense is that promotion must be coordinated so that two primaries cannot exist at once, and you must understand precisely how your failover automation makes that guarantee before you rely on it. Test the partition, do not assume it.

Replication lag masquerading as a bug

A subtler failure mode is not a failure at all but the normal behavior of asynchronous replication surprising the application. A user writes in one geography and a moment later reads from the other before the write has replicated, and sees stale data, their own update apparently missing. This is read-your-writes inconsistency, and it is inherent to asynchronous cross-geography replication. The fixes are well understood: pin a user’s session to one geography so their reads follow their writes, or use a consistency level that guarantees a session sees its own writes, which Cosmos offers through its session consistency level. The point is to recognize the symptom as expected behavior of the chosen replication model rather than chasing it as a defect, and to have decided up front which consistency level the application requires, because consistency and latency trade against each other directly and the choice belongs in the design, not in a late bug fix.
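The read-your-writes fix can be illustrated with a toy model. The sketch below is not how Cosmos implements session consistency; it assumes a single monotonically increasing version per item store and invents the `Replica`, `Session`, `write`, and `read` names purely for illustration. The idea it shows is general: the session remembers the version of its last write, and a read is served locally only when the local replica has caught up to that version.

```python
class Replica:
    """A toy regional replica: a key-value store plus the highest version applied."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        self.data = {}

    def apply(self, key, value, version):
        self.data[key] = value
        self.version = max(self.version, version)


class Session:
    """Tracks the version of the session's most recent write (a toy session token)."""

    def __init__(self):
        self.token = 0


def write(replica, session, key, value):
    version = replica.version + 1
    replica.apply(key, value, version)
    session.token = version
    return version


def read(local, write_region, session, key):
    # Serve locally only if the local replica has seen the session's last write.
    source = local if local.version >= session.token else write_region
    return source.data.get(key), source.name


east, west = Replica("east"), Replica("west")
session = Session()
v = write(east, session, "profile", "new address")
before = read(west, east, session, "profile")   # west is behind, read falls back to east
west.apply("profile", "new address", v)         # replication catches up
after = read(west, east, session, "profile")    # now served locally from west
```

The fallback read pays cross-geography latency, which is the consistency-for-latency trade described above made concrete.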

Health-probe blindness

The last failure mode is operational rather than architectural: a health probe that reports a geography healthy when it cannot actually serve. If the probe checks only that the front-end process answers, a geography whose data tier is down will keep receiving and accepting traffic it cannot persist, turning a clean failover into a partial outage where some users hit a geography that silently fails their writes. The defense is a deep health probe that verifies the dependencies a real request needs, so that an unhealthy data tier removes the whole geography from rotation. This is cheap to build and routinely skipped, and it is the difference between a failover that works and one that half works.
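A deep probe of the kind described here can be sketched in a few lines. This is a minimal illustration, assuming the traffic layer treats a 200 response as healthy and anything else as unhealthy; the `deep_health` function and the dependency names in the usage example are hypothetical, and each check is whatever cheap call proves that dependency can serve a real request.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def deep_health(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; report 503 if any one fails or raises."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that throws is a failed dependency, not a probe crash.
            ok = False
        results[name] = "ok" if ok else "fail"
    status = 200 if all(r == "ok" for r in results.values()) else 503
    return status, results


def _database_unreachable() -> bool:
    raise ConnectionError("simulated data-tier outage")


# Hypothetical wiring: the front end answers, but the data tier does not.
status, detail = deep_health({
    "frontend": lambda: True,
    "database": _database_unreachable,
})
```

In this example the probe reports 503 even though the front-end process is alive, so the whole geography leaves rotation instead of accepting writes it cannot persist. Keep the checks cheap, since the probe runs frequently from every edge location.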

Consistency levels and the latency you trade for them

Underneath the conflict question sits a more fundamental one that active-active forces you to answer explicitly: how consistent do reads need to be across geographies, and what latency are you willing to pay for that consistency. In a single-geography system you rarely think about this because the answer is effectively strong by default. The moment data lives in two places connected by an asynchronous link, consistency becomes a dial with latency on the other end, and Cosmos DB exposes that dial directly as five consistency levels, each a named point on the trade-off curve.

At the strong end, strong consistency guarantees that a read sees the most recent committed write, which is the behavior most developers assume by reflex. The catch in a multi-geography deployment is that strong consistency cannot be offered across asynchronously replicated write regions, because guaranteeing a globally most-recent read would require the cross-geography coordination whose latency the pattern exists to avoid. In practice, Cosmos DB does not offer strong consistency on an account configured with multiple write regions, so choosing strong consistency means giving up multi-region writes. That constraint is itself a design input you must surface early rather than trip over late.

At the relaxed end, eventual consistency promises only that, given enough time without new writes, all geographies converge to the same state, with no ordering guarantee in the meantime. It offers the lowest latency and the highest availability, and it is correct for data where a brief window of staleness costs nothing, such as a view counter or a cached aggregate. Between the extremes sit bounded staleness, which caps how far behind a read may lag in versions or time, consistent prefix, which guarantees reads never see writes out of order even if they see them late, and session consistency, which is the level that solves the read-your-writes problem described earlier by guaranteeing a single client session always sees its own writes and reads in order.

Which consistency level should an active-active system use?

Session consistency is the pragmatic default for most active-active applications. It gives each user a coherent view of their own actions, which is what users actually notice, while allowing the low latency and high availability that the pattern exists to provide. Reserve stronger levels for the specific data that genuinely requires global ordering, and accept eventual consistency only where staleness is harmless.

The reason to settle the consistency level in the design rather than in a later bug fix is that it interacts with everything else on this page. A stronger level reduces the apparent staleness that surprises applications, but it raises latency and can constrain which topologies are even available. A weaker level maximizes the availability and latency benefits of active-active but widens the window in which an application can read stale data and must therefore be more defensively written. The conflict-resolution policy, the consistency level, and the topology are three facets of one decision about how much coordination your data demands, and a coherent design picks them together. Picking a low-latency weak level and then being surprised by stale reads, or picking a strong level and then being surprised that it limits multi-region writes, are both symptoms of treating consistency as a runtime knob rather than an architectural choice.

It is worth stating that the consistency level you choose also shapes your real RPO in subtle ways. A weaker level acknowledges writes sooner and replicates them in the background, which means a larger tail of unreplicated writes can be in flight at the instant a geography is lost, slightly widening the RPO. A stronger level narrows that tail at the cost of write latency. The numbers are usually small, but for a design whose entire justification is recovery objectives, the interaction belongs in the analysis rather than being waved away, and it is one more reason the consistency dial deserves a deliberate setting.

Designing the data model to avoid conflicts: a worked example

The abstract advice to model append-only or to use region-affinity becomes actionable only when you see it applied to a real schema, so work through a concrete case: an order-and-inventory system for a retailer running active-active across two geographies. This is precisely the kind of mutable shared state that punishes a naive Last Write Wins policy, and watching it get reshaped shows the design move that makes active-active safe.

The naive model and how it corrupts

The obvious schema stores an inventory record per product with a current quantity field, and an order record per purchase. When a customer buys an item, the application reads the quantity, subtracts one, and writes it back. In a single geography this is fine under proper concurrency control. In active-active it is a trap. Suppose a popular item has a quantity of one remaining, and two customers in two geographies buy it within the replication window. Each geography reads quantity one, subtracts one, and writes quantity zero. Both writes set the same value, Cosmos detects a conflict on the inventory item, and Last Write Wins keeps one write and discards the other. The stored quantity is zero, which looks correct, but two orders were placed against one unit of stock, and the second order can never be fulfilled. Nothing errored. The corruption is silent and is discovered only when a warehouse cannot ship. This is the canonical failure of mutable-counter data under multi-region writes, and no routing configuration prevents it because the problem is in the shape of the data, not the path of the request.
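The oversell can be reproduced in a few lines. The sketch below is a toy simulation, not Cosmos code: the `Write` type, the `sell_one` and `last_write_wins` helpers, and the integer timestamps are hypothetical stand-ins for whatever value the real policy compares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    timestamp: int  # stand-in for the value a Last Write Wins policy compares
    quantity: int   # the full new value of the mutable quantity field


def sell_one(region: str, observed_quantity: int, timestamp: int) -> Write:
    # Read-modify-write: decrement the value this geography saw locally.
    return Write(region, timestamp, observed_quantity - 1)


def last_write_wins(a: Write, b: Write) -> Write:
    # Keep the write with the higher timestamp; the other is discarded.
    return a if a.timestamp >= b.timestamp else b


starting_quantity = 1
# Both geographies read quantity 1 inside the replication window.
write_east = sell_one("east", starting_quantity, timestamp=100)
write_west = sell_one("west", starting_quantity, timestamp=101)
orders_accepted = 2

stored = last_write_wins(write_east, write_west)
oversold = orders_accepted - starting_quantity
```

After the merge, `stored.quantity` is 0 and nothing in the record hints that `oversold` is 1. The discarded write carried the same value as the winner, so even inspecting the conflict would not reveal the problem.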

The append-only reshape

Now reshape the inventory as an append-only ledger. Instead of a mutable quantity field, you store immutable stock-movement events: a receipt event adds units, a sale event removes one. Each event is a distinct item with its own identifier, so when two geographies record two sales of the same product, they write two separate sale events that never collide, because they are different items and a conflict only arises when two writes target the same item. The current quantity is no longer stored, it is computed by summing the movement events for that product, and that computation is correct regardless of the order in which events replicated, because addition is commutative and associative. The two-customers-one-unit case now produces two sale events and a computed quantity of negative one, which is not silent corruption but a visible signal that you oversold by one, which your business logic can detect and handle deliberately, for example by flagging the second order for backorder. The reshape did not eliminate the race, it converted a silent data-corruption bug into an explicit, detectable business event, which is exactly the transformation you want.
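The same scenario under the ledger model looks like this. It is a minimal sketch with hypothetical names (`StockMovement`, `new_event`, `current_quantity`, and the `sku-1` product), using an in-memory list where a real system would query the replicated event items.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StockMovement:
    event_id: str    # unique per event, so two sales never target the same item
    product_id: str
    delta: int       # positive for a receipt, negative for a sale
    region: str


def new_event(product_id: str, delta: int, region: str) -> StockMovement:
    return StockMovement(str(uuid.uuid4()), product_id, delta, region)


def current_quantity(events: Iterable[StockMovement], product_id: str) -> int:
    # Fold the ledger; the sum does not depend on replication order.
    return sum(e.delta for e in events if e.product_id == product_id)


events = [
    new_event("sku-1", +1, "east"),  # one unit received
    new_event("sku-1", -1, "east"),  # sale recorded in one geography
    new_event("sku-1", -1, "west"),  # concurrent sale in the other
]

quantity = current_quantity(events, "sku-1")
oversold_by = max(0, -quantity)  # a visible signal the business logic can act on
```

Here `quantity` is -1 and `oversold_by` is 1, whichever order the three events arrive in. For long ledgers you would likely fold from a periodic snapshot rather than from the first event, which is an optimization outside the scope of this sketch.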

The region-affinity alternative

Suppose the append-only reshape is awkward for some part of the system, for instance a customer-profile record that is genuinely mutable and that you would rather not turn into an event log. Region-affinity handles it. Anchor each customer to a home geography, perhaps by the geography nearest their registered address, and ensure that writes to that customer’s profile only ever happen in their home geography, while reads may happen anywhere. Now the profile record has a single writer even though the overall system is active-active, so it can never conflict. The routing layer cooperates by sending a customer’s write requests to their home geography, and if that geography is down, the customer’s writes either fail over with the rest of the workload or are queued, depending on how strict your single-writer requirement is for that data. Region-affinity trades a little routing complexity for the complete elimination of conflicts on the affined data, and it composes cleanly with append-only modeling, so a real system often uses both: append-only for the high-contention transactional data, region-affinity for the mutable per-entity records.
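The routing rule for affined data can be sketched as follows. The `route_write` function, its parameters, and the geography names are hypothetical; the sketch assumes the caller already knows the customer's home geography and the set of currently healthy geographies.

```python
from typing import Optional, Set


def route_write(home: str, healthy: Set[str], allow_failover: bool) -> Optional[str]:
    """Return the geography that should accept this write, or None to queue it."""
    if home in healthy:
        # Normal case: the home geography is the only writer for this record.
        return home
    if allow_failover and healthy:
        # Relaxed single-writer: pick one survivor deterministically so that
        # every router sends this customer's writes to the same place.
        return sorted(healthy)[0]
    # Strict single-writer: hold the write until the home geography returns.
    return None


normal = route_write("eu", {"eu", "us"}, allow_failover=False)
queued = route_write("eu", {"us"}, allow_failover=False)
failed_over = route_write("eu", {"us"}, allow_failover=True)
```

The `allow_failover` flag is the strictness choice from the paragraph above expressed as a parameter. Setting it per data type lets a system queue writes for records where a second writer is unacceptable while failing over the rest.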

The principle the example teaches

The worked example generalizes to a single principle that is worth stating as its own claim. In active-active, you do not resolve conflicts, you prevent them, and you prevent them by choosing a data model in which concurrent writes target different items rather than the same one. Last Write Wins and custom merge procedures are the safety nets for the residual conflicts that slip past the model, not the primary strategy. A design that leans on Last Write Wins as its main conflict story for mutable shared state has not solved the problem, it has deferred it to the first unlucky concurrency window. A design that has reshaped its hot data to append-only and affined its mutable records to home geographies has made conflicts rare and harmless, and only then is a resolution policy a reasonable backstop for the edge cases that remain.

When active-active fits and when it is overkill

The decision table gives you the dimensions; this section gives you the verdicts, because a guide that only lays out trade-offs without committing to recommendations leaves the reader exactly where they started. The platform genuinely supports more than one right answer here, so the honest approach is to name the condition that tips the decision and then commit to the stronger default for each case.

When active-active is the right call

Active-active fits when a near-zero RTO is a hard requirement that a seconds-to-minutes promotion cannot satisfy, and when your data model already resolves conflicts soundly. The clearest case is a global, high-traffic service where any measurable downtime during a failover is unacceptable, where users are spread across continents and need local-latency writes rather than just local reads, and where the data is naturally append-only or partitions cleanly by geography. A globally distributed telemetry-ingestion system fits this profile almost perfectly: writes are append-only events that never conflict, every geography needs to accept them at low latency, and the loss of a geography must not interrupt ingestion for a moment. For that workload, active-active on Cosmos multi-region writes is not over-engineering, it is the correct design, and the conflict question is answered for free by the append-only shape of the data.

The second strong case is region-affinity at scale. When you serve a large user base that partitions naturally, each customer anchored to a home geography, you can run true active-active across geographies while every individual record has a single writer. You get the availability, the capacity, and the local-latency reads everywhere, and because no item is written in two places, the conflict tax is zero. This is how a great many large consumer services run, and it is worth designing toward deliberately rather than discovering by accident.

When active-active is overkill

Active-active is overkill, and a net negative, when a seconds-to-minutes RTO would have been perfectly acceptable and you adopted the pattern anyway for the feeling of resilience. The tell is a design review where nobody can state the RTO requirement in numbers but everyone agrees active-active sounds safer. If a two-minute failover does not breach any commitment you have made, then active-passive or read-local-write-primary delivers the resilience you need at a fraction of the complexity and cost, and the extra spend on a second hot geography plus the ongoing burden of conflict semantics and conflict testing buys you nothing but risk. The pattern is also overkill when your data is fundamentally mutable shared state with no sound conflict story, because then active-active does not give you resilience, it gives you data corruption with extra steps, and the right move is to keep a single writer and choose active-passive.

There is a third overkill case worth naming: adopting multi-region writes when read-local-write-primary would have met the requirement. Teams often reach for full multi-master because it is the impressive option, when in fact their far-geography users would tolerate cross-geography write latency just fine and only needed local reads. Read-local-write-primary gives them that, keeps correctness trivial, and avoids the conflict problem entirely, and it should be the default you talk yourself out of rather than the fallback you settle for.

Modeling the choice before you commit

You do not have to settle this on a whiteboard from first principles, and you should not, because the latencies and failover behaviors that decide it are measurable. This is where hands-on modeling earns its place: VaultBook provides Azure labs where you can stand up a two-geography topology, wire Front Door across regional deployments, configure Cosmos multi-region writes with a chosen conflict-resolution policy, and then measure the things that actually drive the decision, the cross-geography write latency, the replication lag, the failover time when you take a geography offline, and the behavior of your conflict policy under concurrent writes. Modeling the routing and the replication against your own data shapes turns the decision from an argument into a measurement, and it lets you watch a conflict resolve before it ever reaches production. The hour spent reproducing your write pattern in a lab is the cheapest insurance you will buy on this design.

How to evolve toward active-active without betting the company

Almost no system should be born active-active. The pattern is the destination of a progression, and skipping the early stages is how teams end up operating complexity they do not understand. The sound path adds resilience in stages, each of which is independently valuable and independently reversible.

Begin single-region but multi-zone. Spreading across availability zones inside one geography protects against the datacenter-level faults that are far more common than whole-geography losses, and it costs little and changes nothing about your data model. For a large share of workloads this is the correct resting state, and the urge to go multi-region should be checked against whether a single-geography zonal deployment already meets the durability and availability you have actually promised.

Add a second geography as a passive standby next. Turn on geo-replication for your databases and geo-redundant storage, keep the standby warm enough to promote quickly, and write the failover runbook, then test it. The detailed mechanics of standing up that replication and the runbook live in setting up Azure Site Recovery for DR, and the wider strategy of tiering recovery objectives by workload is the subject of the disaster recovery architecture guide. At this stage you have a single writer, trivial correctness, and a tested recovery story, and you have not signed up for a single conflict. Most teams discover that this stage meets their real requirements and that the project can stop here, which is a feature of the progression, not a failure of nerve.

Promote reads to local before you promote writes. The next increment is read-local, write-primary: serve reads from the secondary geography so far-away users get local read latency, while all writes still route to the single primary. This captures most of the user-visible benefit of active-active, local-feeling performance and continuous availability of the read path, while preserving the single-writer invariant that keeps you out of conflict territory. It is the highest-value, lowest-risk step in the whole progression, and it is the one most often skipped in the rush to multi-master.

Only then, and only if a near-zero RTO requirement or a genuine need for local-latency writes everywhere justifies it, move to true multi-region writes, and do it on a data model that has already been shaped to resolve conflicts: append-only events, region-affinity, or a deliberate conflict-resolution policy validated under load. Migrate one bounded dataset first, the one whose conflict story you are most sure of, observe its conflicts feed and replication behavior in production, and expand only once you trust it. Evolving in this order means that at every step you hold a system you fully understand, and you take on the conflict problem only after you have proven you can handle it, rather than discovering it in an incident.

The cost and complexity ledger

Active-active is the most expensive resilience pattern on two ledgers at once, money and operational burden, and pretending otherwise sets up the over-engineering trap. On the money side, both geographies run fully provisioned and hot, because the whole point is that either can carry production load at any moment, so you are paying roughly double the compute and database footprint of a single-geography deployment, plus the cross-geography replication traffic, plus the global traffic layer. Active-passive lets the standby run reduced until it is needed, so its steady-state cost is materially lower, and read-local-write-primary sits between, paying for a full read tier in both places but not a second write tier. If the budget conversation has not happened, the design conversation is not finished, because the cost of running both geographies hot is one of the dimensions that should decide the pattern, not an afterthought discovered on the first invoice.

On the operational ledger, active-active demands capabilities the simpler patterns do not. You must monitor replication lag and the conflicts feed, you must test failover and partition scenarios regularly rather than assuming them, you must understand and own the conflict-resolution policy as living code when it is custom, and you must keep the deep health probes honest as the application evolves. None of this is exotic, but all of it is ongoing work that a single-region or active-passive system simply does not carry, and a team that cannot staff that work should not run the pattern, because an active-active design that is not operated and tested is more fragile than the active-passive design it replaced. Match the pattern to the team’s capacity to operate it, not just to the architecture diagram’s elegance.

What RPO and RTO active-active actually targets

The entire justification for active-active rests on recovery objectives, so the numbers deserve a treatment of their own rather than the scattered mentions they have had so far. Two figures govern the conversation, and being precise about what each one means in an active-active context is what keeps the design honest.

Recovery time objective, the RTO, is the maximum tolerable duration of unavailability. This is the axis where active-active genuinely shines, and it is the only one. Because both geographies are already serving, the loss of one does not require promoting a standby, starting cold compute, or reconfiguring anything; the survivor was already handling its share and simply handles more. The RTO therefore approaches the time it takes the traffic layer to detect the fault and stop routing to the lost geography, which with a well-tuned edge proxy is seconds. Contrast that with active-passive, where the RTO includes detecting the failure, promoting the standby database, possibly starting or scaling compute, and repointing traffic, which lands in the seconds-to-minutes range even when automated and longer when a human is in the loop. If your requirement is that the service must never be unavailable for more than a handful of seconds, active-active is one of the few ways to meet it, and that single fact is the legitimate reason to pay its costs.

Recovery point objective, the RPO, is the maximum tolerable data loss measured in time, and here the honest message is that active-active does not improve it the way people assume. The RPO of any asynchronously replicated multi-geography design is bounded by the replication lag at the instant of failure, because the writes that had been accepted in the lost geography but not yet replicated are the writes at risk. Active-active and active-passive face the same physics on this axis: both replicate asynchronously, both have a lag, and both lose the unreplicated tail if a geography vanishes. The detail that active-active keeps the survivor writing does not recover the lost geography’s in-flight writes, it only means new writes continue. So a design sold internally on the promise that active-active gives a better RPO is sold on a misunderstanding, and the correction matters because it changes which pattern you should choose: if RPO is your binding constraint, the lever is the replication mode and the consistency level, not the active-active topology.

Does active-active give zero data loss?

No. Active-active drives the RTO toward zero because the surviving geography keeps serving with nothing to promote, but the RPO is still bounded by the writes that had not replicated when a geography was lost. Achieving zero data loss requires synchronous replication, whose cross-geography latency most interactive workloads cannot tolerate, so active-active typically accepts a small RPO.
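The size of the unreplicated tail is simple arithmetic: the write rate multiplied by the replication lag at the moment of failure. The sketch below makes that explicit. The function names and the figures in the example are hypothetical, and the estimate assumes a steady write rate, which a real workload will not have.

```python
def writes_at_risk(write_rate_per_s: float, replication_lag_ms: float) -> float:
    """Approximate number of accepted writes not yet replicated at failure."""
    return write_rate_per_s * replication_lag_ms / 1000.0


def within_rpo(replication_lag_ms: float, rpo_budget_ms: float) -> bool:
    """The RPO is met only while the lag stays inside the promised budget."""
    return replication_lag_ms <= rpo_budget_ms


# Hypothetical figures: 500 writes per second, 200 ms of lag, a 1 s RPO promise.
tail = writes_at_risk(write_rate_per_s=500, replication_lag_ms=200)
ok = within_rpo(replication_lag_ms=200, rpo_budget_ms=1000)
```

With these illustrative numbers, roughly 100 writes are in the tail when the geography disappears. The topology does not appear anywhere in the formula, which is the point: active-active and active-passive designs with the same lag face the same tail.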

The interaction between the two objectives and the consistency level closes the loop. A stronger consistency level narrows the unreplicated tail and thus tightens the RPO, at the cost of write latency. A weaker level loosens the RPO slightly while minimizing latency and maximizing availability. The conflict-resolution policy then governs what happens to the writes that do replicate but collide. Reading the three together, RTO from the topology, RPO from the replication and consistency choices, and correctness from the conflict policy, is the complete recovery story of an active-active design, and a design document that states all three with numbers is one you can defend in a review and hold yourself to in an incident. One that states only the aspiration that the system is highly available has not done the work.

Testing an active-active design: breaking it on purpose

An active-active topology that has never been failed on purpose is a set of assumptions, not a guarantee, and the assumptions are exactly the ones most likely to be wrong, because they concern behavior that only appears under failure. Treating failure drills as a routine part of operating the pattern, rather than as a one-time validation, is what separates a design you can trust from one you merely hope works.

The first drill is the planned geography evacuation. Take one geography out of rotation deliberately, by failing its health probe or stopping its deployment, and watch the traffic layer reroute. Measure how long it takes for traffic to fully shift, confirm that no requests are lost during the transition, and verify that the surviving geography absorbs the full load without degrading past your latency budget. This last point catches a common oversight: if each geography normally runs near its capacity ceiling, the survivor cannot absorb the other’s traffic, and your active-active design fails precisely when it is needed because the survivor falls over under the doubled load. The drill reveals whether you have provisioned each geography to carry the whole load alone, which active-active resilience actually requires.
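The capacity question this drill answers can be estimated beforehand. The sketch below is a rough check under a simplifying assumption, namely that the lost geography's load splits evenly across the survivors, which real latency-based routing will not do exactly. The function name, units, and figures are hypothetical.

```python
from typing import Dict


def utilization_after_loss(
    load: Dict[str, float], capacity: Dict[str, float], lost: str
) -> Dict[str, float]:
    """Utilization of each survivor if `lost` disappears and its load is shared."""
    survivors = [g for g in load if g != lost]
    if not survivors:
        raise ValueError("no surviving geography to absorb the load")
    shifted = load[lost] / len(survivors)  # assumption: even split
    return {g: (load[g] + shifted) / capacity[g] for g in survivors}


# Hypothetical requests per second: each geography looks comfortable on its own.
load = {"east": 600.0, "west": 500.0}
capacity = {"east": 1000.0, "west": 1000.0}
after_east_loss = utilization_after_loss(load, capacity, lost="east")
```

In this example west would need to run at 110 percent of its capacity, so the design fails exactly when it is needed, even though both geographies sit at 60 percent or less in steady state. Treat the estimate as a reason to run the drill, not a replacement for it.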

The second drill is the concurrent-write conflict test. Deliberately write to the same item in both geographies within the replication window, using whatever your fastest path to simulate that is, and observe what your conflict policy does. If you use Last Write Wins, confirm that the data you lose is data you can afford to lose. If you use custom resolution, confirm the merge procedure runs and produces the result you expect, and confirm that when you make it fail, the conflict lands in the conflicts feed as designed rather than vanishing. If you rely on append-only or region-affinity, confirm that the concurrent writes genuinely target different items and no conflict forms. This drill is the only way to know your conflict story is real rather than theoretical, and it should be run against your actual schema, not a simplified stand-in.

The third drill is the partition test, which is the hardest to stage and the most important to understand. Simulate a network partition between the geographies while keeping both reachable by their local users, hold it for a realistic duration, and then heal it. Watch how writes accumulate on both sides during the partition and how they reconcile when it heals. In a write-primary design, this drill is where you confirm that a second primary cannot be promoted while the first is alive, because split-brain promotion is the failure that turns a resilience pattern into a data-loss event. Do not assume your failover automation prevents two primaries; force the partition and verify it.

How often should failure drills run?

Run the geography-evacuation drill on a regular cadence, monthly or quarterly depending on how often your deployment changes, because the behavior degrades silently as capacity assumptions and dependencies drift. Run the conflict and partition tests whenever the data model or the failover configuration changes, since those are the changes most likely to break the guarantees you depend on.

The discipline that makes drills valuable is treating their results as design feedback rather than pass-fail theater. A drill that surfaces a slower-than-expected failover, a survivor that cannot carry full load, or a conflict policy that loses data you needed is not a failed test; it is the test working, because it found the flaw in a controlled window instead of during a real incident. Build the drills into the operational rhythm of the system, automate the ones you can, and keep the runbook for each scenario current, because the runbook you wrote a year ago describes a system that no longer exists.

What to monitor once it is live

A live active-active system generates a small set of signals that tell you whether the pattern is healthy, and watching the right ones is the difference between catching a degradation early and discovering it through a customer complaint. The signals are not the generic ones you already watch; they are specific to the cross-geography nature of the design.

Replication lag is the first and most important. It is the time between a write being accepted in one geography and that write being visible in the others, and it is the direct determinant of two things you care about: how stale a cross-geography read can be, and how large the tail of unreplicated writes will be if a geography is lost, which is your RPO. A replication lag that is creeping upward is an early warning that your write volume is outpacing the replication capacity, and it should alert before it grows large enough to widen your RPO past what you have promised. Watch it per geography pair, because lag is not symmetric and a problem on one link does not show on another.
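
A per-pair lag check along these lines might look like the following. It is a sketch under stated assumptions: the lag samples come from whatever metric source you already have, the `lag_alerts` function and its thresholds are hypothetical, and "creeping upward" is approximated by comparing the mean of the newer half of the window against the older half.

```python
from statistics import mean


def lag_alerts(samples, rpo_budget_s, warn_fraction=0.5):
    """Classify replication lag per directed region pair.

    samples maps (source, target) to a list of lag readings in seconds, oldest first.
    Returns "breach" when the latest reading meets or exceeds the RPO budget, and
    "warning" when it has used warn_fraction of the budget and is trending upward.
    """
    alerts = {}
    for pair, lags in samples.items():
        latest = lags[-1]
        half = len(lags) // 2
        rising = half > 0 and mean(lags[half:]) > mean(lags[:half])
        if latest >= rpo_budget_s:
            alerts[pair] = "breach"
        elif latest >= warn_fraction * rpo_budget_s and rising:
            alerts[pair] = "warning"
    return alerts


# Hypothetical readings: one direction is degrading, the reverse direction is flat.
samples = {
    ("eastus", "westeurope"): [1.0, 1.0, 2.0, 3.0, 4.0],
    ("westeurope", "eastus"): [1.0, 1.0, 1.0, 1.0, 1.0],
}
print(lag_alerts(samples, rpo_budget_s=5.0))
```

Keying the samples by directed pair is the point of the example: the degrading link raises a warning while the reverse link stays quiet, which a single account-wide average would hide.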

The conflicts feed depth is the second signal, and it matters specifically for custom-resolution designs. A healthy system drains conflicts as fast as they arrive, so the feed stays near empty. A growing backlog means your resolution process is falling behind the rate at which conflicts form, which is both a correctness risk, because unresolved conflicts are unreconciled data, and a capacity signal, because it usually means write contention is higher than the design anticipated. Even in designs that rely on append-only or region-affinity, a nonzero conflict rate is worth watching, because it reveals places where the model is leaking and concurrent writes are reaching the same item against your intent.

Traffic distribution across geographies is the third signal, and it confirms that the routing layer is doing what you think. In steady state you expect a distribution that matches where your users are. A sudden shift, with all traffic collapsing onto one geography, tells you the other has gone unhealthy and been evacuated, which may be correct behavior responding to a real fault or may be a false positive from an overzealous health probe. Either way you want to know immediately, because a silent evacuation means you are running on one geography with no redundancy, one more fault away from an outage.

Failover time is the fourth signal, captured from your drills and from any real failovers, and it is the empirical version of your RTO. The RTO you wrote in the design document is an aspiration; the failover time you measure is the truth, and if the measured number exceeds the promised one, you have a gap to close before the next real incident finds it for you. Track it over time, because it tends to drift upward as the system grows and dependencies multiply, and a number that was fine at launch can quietly slide past your commitment as the deployment evolves.
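
One way to turn a drill into a number is to record a request log during the evacuation and derive the failover time from it. The sketch below assumes a hypothetical log of `(timestamp_seconds, region, ok)` tuples and treats failover as complete after the last request that either failed or still landed on the evacuated geography; it measures what the log observed, so sparse traffic will understate the true figure.

```python
def measured_failover_seconds(fault_at, requests, evacuated):
    """Seconds from fault injection to the last disrupted request seen in the log.

    requests is a list of (timestamp_s, region, ok) tuples.
    A request counts as disrupted if it failed or was still routed to the evacuated region.
    """
    disrupted = [
        t for t, region, ok in requests
        if t >= fault_at and (not ok or region == evacuated)
    ]
    return max(disrupted) - fault_at if disrupted else 0.0


# Hypothetical drill log: the fault is injected at t=100 against "eastus".
drill_log = [
    (99.0, "eastus", True),
    (101.0, "eastus", False),
    (104.0, "eastus", False),
    (105.0, "westeurope", True),
    (110.0, "westeurope", True),
]
promised_rto_s = 30.0

measured = measured_failover_seconds(100.0, drill_log, evacuated="eastus")
print(f"measured {measured}s against promised {promised_rto_s}s:",
      "within RTO" if measured <= promised_rto_s else "gap to close")
```

Storing the measured value from every drill alongside its date gives you the trend line the paragraph above asks for.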

The verdict

Multi-region active-active on Azure is a precise tool for a specific requirement, not a general upgrade to resilience. The traffic layer is solved: Front Door gives you fast, edge-terminated, health-probed routing across geographies, Traffic Manager gives you DNS-level distribution where that suffices, and either turns a geography loss into a capacity reduction. The data layer is where the design is won or lost, and the rule that governs it is the one to carry out of this guide. The hard part of active-active is multi-region writes and the conflicts they create, so unless your data model has a deliberate and correct answer for those conflicts, through append-only modeling, region-affinity, or a validated conflict-resolution policy, active-passive is the safer multi-region design.

So the verdict is conditional and stated with the deciding factor named. Choose active-active when a near-zero RTO is a hard requirement that a seconds-to-minutes promotion cannot meet and your data model resolves conflicts soundly. Choose read-local, write-primary when you want local reads and automatic failover but your data is mutable shared state with no clean conflict story, which describes a large fraction of business applications. Choose active-passive when a seconds-to-minutes failover meets your commitments and you value the lowest cost and the simplest correctness story. Evolve toward the more demanding patterns in stages, stopping as soon as a stage meets your real recovery objectives, and let the RPO and RTO you actually need, rather than the resilience you would enjoy bragging about, decide how far you go. The best active-active design is frequently the one you correctly decided not to build.

Frequently asked questions

How do I build a multi-region active-active architecture on Azure?

You build it in two independent layers. The traffic layer uses Azure Front Door or Traffic Manager to route each request to the nearer healthy geography and to evacuate a geography that fails its health probe. The data layer keeps state coherent across geographies, and this is the layer that determines whether the design is sound. For true active-active writes you use a data service that accepts local writes everywhere, such as Cosmos DB with multi-region writes, and you declare how concurrent writes to the same item are resolved. Before anything else, decide whether your data model can handle conflicts, because the routing is solved technology and the conflicts are the real design problem. If your data is mutable shared state with no conflict story, build read-local, write-primary instead, which gives local reads and automatic failover while keeping a single writer and trivial correctness.

How do Front Door and Traffic Manager route across regions?

They operate at different layers, which is the whole distinction. Azure Front Door works at the application layer, terminating each connection at an edge near the user, inspecting the HTTP request, and forwarding it over Microsoft’s private network to a healthy backend chosen per request from live health probes. That per-request decision lets it fail over in seconds and add edge caching, TLS offload, and a web application firewall. Traffic Manager works at the DNS layer, answering name queries with the address of an endpoint selected by a routing method such as priority, weighted, performance, or geographic, then stepping out of the path so the client connects directly. Because DNS answers are cached by time-to-live and not every client honors a short one, Traffic Manager fails over more slowly than an edge proxy. Use Front Door for HTTP workloads needing fast failover and edge features, and Traffic Manager for DNS-level distribution or non-HTTP endpoints.

How do I replicate data across regions for active-active?

The replication strategy depends on the data service and on whether you need local writes everywhere. Cosmos DB offers multi-region writes, where every configured geography accepts writes locally and replicates asynchronously to the others, which gives low write latency everywhere at the cost of possible write conflicts. Azure SQL Database offers geo-replication and failover groups that maintain read-only secondaries kept current asynchronously, so all writes go to one primary and there are no conflicts. Azure Storage offers geo-redundant replication with an optional readable secondary, again with a single write target. Across all of them the replication is asynchronous, because synchronous cross-geography replication would charge a latency too high for interactive use, so every strategy accepts some replication lag and must have an answer for the writes caught inside that lag window when a geography is lost.

How do I handle write conflicts in multi-region writes?

You handle them first by preventing them and second by resolving the residue. Prevention is the primary strategy: model hot data as an append-only log of immutable events so concurrent writes become distinct items that never collide, and anchor mutable per-entity records to a home geography so each has a single writer through region-affinity. For the conflicts that still slip past the model, Cosmos lets you declare a resolution policy on the container at creation time. Last Write Wins keeps the write with the highest value on a numeric property, defaulting to the timestamp, which is correct for last-state-wins data and dangerous for data where every write is a fact. Custom resolution runs a merge stored procedure you write, and if it is absent or fails, the conflicting versions land in a conflicts feed for your application to reconcile. Match the policy to what your data means.

What RPO and RTO does active-active actually give?

Active-active drives the recovery time objective toward zero, because both geographies are already serving and the loss of one removes capacity rather than requiring any promotion or cold start, so the recovery time is essentially how fast the traffic layer detects the fault and reroutes, which is seconds. The recovery point objective is the part people misread. It is bounded by the replication lag at the instant of failure, the unreplicated tail of writes from the lost geography, and active-active does not improve it relative to active-passive, because both replicate asynchronously and lose the same in-flight writes. Active-active keeps new writes flowing on the survivor, but it does not recover the lost geography’s pending writes. If recovery point is your binding constraint, the levers are the replication mode and the consistency level, not the active-active topology, so do not choose the pattern expecting a better recovery point than active-passive would give.
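
The arithmetic behind that answer is short enough to write down. This is a rough estimate with assumed inputs rather than a platform formula: the function name, the parameters, and the sample values are all hypothetical, and real detection time depends on how your traffic layer's health probes are configured.

```python
def estimate_exposure(write_rate_per_s, replication_lag_s,
                      probe_interval_s, failed_probes_to_evict, reroute_s):
    """Rough RPO and RTO estimate for the loss of one write region.

    RPO is bounded by replication lag; the topology does not appear in it.
    RTO is approximated as probe-based detection time plus rerouting time.
    """
    detection_s = probe_interval_s * failed_probes_to_evict
    return {
        "rpo_seconds": replication_lag_s,
        "writes_at_risk": write_rate_per_s * replication_lag_s,
        "rto_seconds": detection_s + reroute_s,
    }


# Hypothetical workload: 200 writes/s, half a second of lag, 5 s probes, evict after 2 misses.
print(estimate_exposure(200, 0.5, probe_interval_s=5, failed_probes_to_evict=2, reroute_s=2))
```

Nothing in the `rpo_seconds` or `writes_at_risk` terms refers to whether the second geography was serving traffic, which is the sense in which active-active leaves the recovery point where active-passive left it.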

Active-active versus active-passive: which should I choose?

Choose by recovery time requirement and by whether your data resolves conflicts. Active-passive keeps a single writer, so correctness is trivial and conflicts never arise, and its recovery time is the seconds to minutes needed to detect failure and promote the standby. Active-active keeps both geographies writing for a near-zero recovery time, but it requires a data model that handles concurrent writes to the same item. So choose active-active only when a seconds-to-minutes failover genuinely fails your commitment and your data has a sound conflict story through append-only modeling or region-affinity. Choose active-passive when a seconds-to-minutes failover meets your needs, because it is cheaper, simpler, and safer. A large fraction of workloads need recovery times that active-passive already meets, which makes it the correct default and makes active-active the pattern you justify with a specific requirement rather than reach for by reflex.

What is the difference between availability zones and multiple regions?

Availability zones are physically separate datacenters within one geography, each with independent power, cooling, and networking, connected by a low-latency link. Spreading across zones protects against a datacenter-level fault such as a power event or a localized network failure, and for many workloads this single-geography, multi-zone deployment is the resilience tier that matters most day to day, because it is cheap and changes nothing about the data model. What zones cannot survive is the loss of the whole geography, whether from a regional control-plane incident, a capacity event, or a disaster affecting the entire metropolitan area. Surviving that requires a second geography, which is where multi-region patterns come in. The sensible progression is multi-zone first, then a second geography as a passive standby, then local reads, and only then active-active, stopping as soon as a stage meets the availability you have actually promised.

Does Cosmos DB Last Write Wins lose data?

Yes, by design, and whether that matters depends entirely on your data. Last Write Wins resolves a conflict by keeping the write with the highest value on a numeric property, defaulting to the system timestamp, and silently discarding the losing write. For data where the latest value is the only one that matters, such as a most-recent profile setting or a latest sensor reading, the discarded write carried nothing you needed and the policy is correct. For data where every write is a fact that must be preserved, such as an inventory decrement or an account transaction, the discarded write was information you needed, and Last Write Wins produces silent corruption with no error and no log entry. The defense is to reshape such data as an append-only event log, so concurrent writes become distinct items that never conflict and the current value is computed by summing events, which turns a silent corruption into a correct result.
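
A toy simulation makes the loss concrete. This is plain Python that imitates the policy's outcome, not the Cosmos DB engine or SDK; the item shape and values are made up, `_ts` stands in for the timestamp property, and the tie-break is an assumption of the sketch.

```python
def last_write_wins(a, b, path="_ts"):
    """Keep the version with the higher value on the resolution property; drop the other.

    Ties go to `a` here, which is an assumption of this sketch.
    """
    return a if a[path] >= b[path] else b


starting_quantity = 10

# Two regions each read quantity 10, sell some stock, and write the new total back.
write_a = {"id": "sku-42", "quantity": starting_quantity - 3, "_ts": 1000, "region": "eastus"}
write_b = {"id": "sku-42", "quantity": starting_quantity - 2, "_ts": 1001, "region": "westeurope"}

winner = last_write_wins(write_a, write_b)
true_quantity = starting_quantity - 3 - 2

print("stored quantity:", winner["quantity"])  # the later write survives
print("actual quantity:", true_quantity)       # what the warehouse really holds
```

The stored value is 8 while the real stock is 5: the three units sold in the first geography disappear from the record without any error being raised.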

When is read-local, write-primary better than full active-active?

It is better whenever your far-geography users can tolerate cross-geography write latency and your data is mutable shared state without a clean conflict story, which describes a large share of business applications. In this pattern every geography serves reads locally for low read latency, but all writes route to a single primary geography, so there is exactly one writer and never a conflict, giving the same correctness as a single-region system. You still get automatic regional failover, because the secondary can be promoted, and you get most of the user-visible performance benefit, because reads feel local. What you give up is local write latency for users far from the primary, which is often an acceptable trade. It is the highest-value, lowest-risk multi-region step and the one most often skipped in the rush to multi-master, so make it the default you talk yourself out of rather than the fallback.

How does region-affinity prevent write conflicts?

Region-affinity, sometimes called partition-by-geography, arranges the data so that each item is only ever written in one geography even though every geography can read it. If you anchor each customer to a home geography and ensure that customer’s records are only written there, then every individual record has a single writer, which means concurrent writes to the same item cannot happen and no conflict can form, while the system as a whole remains active-active with full read locality and capacity in both places. It gives you the availability and performance of active-active without paying the correctness tax, because the single-writer invariant holds per item. The routing layer cooperates by sending each customer’s writes to their home geography. Region-affinity composes cleanly with append-only modeling, so real systems often use append-only for high-contention transactional data and region-affinity for the mutable per-entity records, eliminating conflicts across both.
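
In code, the two halves of that cooperation might be sketched as follows. The customer-to-home mapping, the function names, and the exception are hypothetical; a real system would hold the mapping in a directory service or derive it from the partition key, and would need a failover rule for when a home geography is down, which this sketch leaves out.

```python
# Hypothetical directory of each customer's home region.
HOME_REGION = {"cust-001": "westeurope", "cust-002": "eastus"}


class WrongRegionError(Exception):
    """Raised when a write reaches a region that is not the record's home."""


def route(customer_id, operation, nearest_region):
    """Routing layer: reads stay local, writes go to the customer's home region."""
    if operation == "read":
        return nearest_region
    return HOME_REGION[customer_id]


def accept_write(customer_id, local_region):
    """Data layer guard: refuse a write that arrived outside the home region."""
    home = HOME_REGION[customer_id]
    if local_region != home:
        raise WrongRegionError(f"{customer_id} must be written in {home}, not {local_region}")
    return True


# A European customer travelling in the US reads locally but writes at home.
print(route("cust-001", "read", nearest_region="eastus"))
print(route("cust-001", "write", nearest_region="eastus"))
```

The guard in `accept_write` is what makes the single-writer invariant something the data layer enforces and not merely something the router is trusted to honor.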

Why use append-only modeling for active-active?

Append-only modeling eliminates write conflicts by changing the shape of the data rather than resolving collisions after they happen. Instead of storing a mutable current value that two geographies might overwrite, you store immutable events, and each event is a distinct item with its own identifier, so concurrent writes in two geographies create two separate events that never target the same item and therefore never conflict. The current value is computed by folding the events, and because operations like addition are order-independent, the computed result is correct regardless of the order in which events replicated. The classic example is inventory: rather than a mutable quantity that Last Write Wins would corrupt under concurrent sales, you store sale and receipt events and sum them, which turns a silent oversell into a visible, handleable signal. Append-only is the primary conflict strategy for hot transactional data, with resolution policies serving only as a backstop for whatever residue slips through.
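
The inventory case can be sketched in a few lines. This is an illustration of the modeling idea in plain Python, with hypothetical event fields and helper names; replication is imitated by taking the union of two regional logs keyed by event id.

```python
import uuid


def new_event(region, delta):
    """An immutable inventory event; the unique id means two events never collide."""
    return {"id": str(uuid.uuid4()), "region": region, "delta": delta}


def merge_logs(*logs):
    """Imitate replication: union the logs by event id, so duplicates collapse."""
    by_id = {}
    for log in logs:
        for event in log:
            by_id[event["id"]] = event
    return list(by_id.values())


def current_quantity(events):
    """Fold the events; addition is order-independent, so arrival order is irrelevant."""
    return sum(event["delta"] for event in events)


receipt = new_event("eastus", +10)       # already replicated to both regions
east_log = [receipt, new_event("eastus", -3)]
west_log = [receipt, new_event("westeurope", -2)]

merged = merge_logs(east_log, west_log)
quantity = current_quantity(merged)
print("events:", len(merged), "quantity:", quantity)
if quantity < 0:
    print("oversold: handle the shortfall explicitly")
```

Both sales survive as separate items and the fold gives the same answer whichever log is merged first. A negative total would surface an oversell as a value the application can see and act on.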

Which Cosmos consistency level should an active-active app use?

Session consistency is the pragmatic default for most active-active applications. It guarantees that a single client session always sees its own writes and reads them in order, which solves the most noticeable staleness problem, where a user writes and then immediately reads, only to find their own update apparently missing, while still allowing the low latency and high availability the pattern exists to provide. Reserve stronger levels, such as bounded staleness, for the specific data that genuinely requires tighter ordering, and accept eventual consistency only where a brief window of staleness is harmless, such as a view counter. Strong consistency cannot be offered across asynchronously replicated write regions, so choosing it constrains your topology and is a design input to surface early. The consistency level interacts with both the recovery point objective and the conflict policy, so settle all three together in the design rather than treating consistency as a runtime knob to adjust later.

How much does active-active cost compared to active-passive?

Active-active is the most expensive resilience pattern because both geographies run fully provisioned and hot, since either must be able to carry the whole production load at any moment, so you pay roughly double the compute and database footprint of a single geography, plus cross-geography replication traffic and the global traffic layer. Active-passive is materially cheaper at steady state, because the standby can run reduced until a failover needs it. Read-local, write-primary sits between the two, paying for a full read tier in both geographies but only one write tier. Beyond the bill there is an operational cost: active-active requires monitoring replication lag and the conflicts feed, running failover and partition drills, and owning conflict-resolution logic as living code, which the simpler patterns do not carry. If running both geographies hot is not justified by a recovery time requirement that the cheaper patterns cannot meet, the cheaper pattern is the correct engineering choice.

How do I avoid split-brain in a write-primary failover?

Split-brain happens when a network partition causes a secondary to promote itself to primary while the original primary is still alive and accepting writes, leaving two writers and exactly the divergence the single-writer design was meant to forbid. The defense is that promotion must be coordinated so two primaries cannot exist at once, and you must understand precisely how your failover automation enforces that before you depend on it. Do not assume the platform prevents it; force the scenario in a controlled test. Stage a partition that isolates the geographies while keeping both reachable by local users, hold it for a realistic duration, and confirm that a second primary is not promoted and that writes reconcile correctly when the partition heals. The partition drill is the hardest to stage and the most important to run, because split-brain promotion is the failure that turns a resilience pattern into a data-loss event, and only testing reveals whether your automation reliably prevents it.

Why do users see stale data right after writing in active-active?

That is read-your-writes inconsistency, and it is the expected behavior of asynchronous cross-geography replication rather than a bug. A user writes in one geography, then a request routes them to another geography and reads before the write has replicated, so their own update appears missing for a moment. The window exists because replication takes time and the geographies are far apart, which is the same physics that makes write conflicts possible. The fixes are well understood. Pin the user’s session to one geography with session affinity at the traffic layer, so their reads follow their writes to the same place, and choose session consistency at the data layer, which guarantees a session always sees its own writes even if it does move. The two work together, affinity keeping the user on one geography and session consistency covering the case where they move, and a sound design usually wants both for any workload where users notice staleness in their own actions.
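
The idea behind a session guarantee can be shown with a toy model. This is not how Cosmos DB implements session consistency; it is a simplified sketch in which each replica tracks the highest write version it has applied, the client carries a hypothetical session token equal to the version of its last write, and a read is only served by a replica that has caught up to that token.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.version = 0  # highest write version applied on this replica
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.version = max(self.version, version)


def session_read(preferred, others, key, session_token):
    """Serve the read from the first replica that has caught up to the session token."""
    for replica in [preferred, *others]:
        if replica.version >= session_token:
            return replica.name, replica.data.get(key)
    raise RuntimeError("no replica has caught up to this session yet")


east, west = Replica("eastus"), Replica("westeurope")

# The user writes in eastus; the write has not yet replicated to westeurope.
east.apply(1, "profile:7", "dark-mode")
session_token = 1

print("naive read in westeurope:", west.data.get("profile:7"))  # stale, update missing
print("session read:", session_read(west, [east], "profile:7", session_token))

# Once replication catches up, the local replica can serve the session.
west.apply(1, "profile:7", "dark-mode")
print("after replication:", session_read(west, [east], "profile:7", session_token))
```

Session affinity at the traffic layer reduces how often the fallback path is taken, and the token covers the requests that move anyway, which is why the two fixes are usually deployed together.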

How do I test an active-active design?

Test it by breaking it on purpose, on a regular cadence, with three drills. The geography-evacuation drill takes one geography out of rotation and measures how fast traffic reroutes, whether any requests are lost, and crucially whether the survivor can absorb full load without degrading, which catches the trap of geographies provisioned too close to their ceiling to carry each other. The conflict drill deliberately writes the same item in both geographies within the replication window and confirms your policy does what you expect: that Last Write Wins discards only data you can lose, that a custom merge runs and its failure routes to the conflicts feed, or that append-only and region-affinity genuinely keep the writes on different items. The partition drill isolates the geographies, holds the partition, then heals it, confirming writes reconcile and that no second primary is promoted. Treat surfaced flaws as the test working, and keep each runbook current.

What should I monitor in a live active-active system?

Watch four cross-geography signals beyond your usual metrics. Replication lag, the time between a write being accepted in one geography and visible in the others, directly determines cross-geography read staleness and the size of the unreplicated tail that sets your recovery point, so alert when it creeps upward and watch it per geography pair because it is not symmetric. The conflicts feed depth tells you whether custom resolution is keeping pace, since a growing backlog means resolution is falling behind conflict creation and represents unreconciled data. Traffic distribution across geographies confirms the routing layer is behaving, and a sudden collapse onto one geography signals that another was evacuated, which you want to know immediately because it means you are running without redundancy. Failover time, captured from drills and real events, is the empirical version of your recovery time objective and tends to drift upward as the system grows, so track it against the number you promised and close any gap before an incident finds it.

Can active-active satisfy data residency requirements?

Yes, and region-affinity is the mechanism that makes it work. When law or contract requires that a given customer’s data be written and stored in a specific geography, you anchor that customer’s records to the required geography and ensure they are only ever written there, while the rest of the system remains active-active across geographies. This is the same region-affinity move that eliminates write conflicts, so a residency constraint and a conflict-prevention strategy turn out to be solved by the same design, which is a happy alignment rather than a coincidence, because both are satisfied by giving each item a single home geography. The routing layer cooperates by directing each customer’s writes to their home geography, and your data model enforces that no write to that record occurs elsewhere. Treat residency as a hard constraint that prunes your options before you score the survivors, not as an optimization, because it can rule out true multi-region writes for the affected data entirely.