Azure VPN Gateway Deep Dive

The first time an engineer wires an Azure VPN gateway to an on-premises firewall, the tunnel often comes up, passes a ping, and then quietly refuses to carry a second subnet, drops a route the team expected, or fails to survive a maintenance event on the Azure side. The confusion is rarely about the encryption. IPsec is mature, and the cryptographic handshake either completes or it does not. The confusion is about what kind of gateway was built, what that gateway type is structurally capable of, and how Azure decides which traffic enters the tunnel at all. An Azure VPN gateway is not a single product with a few knobs. It is a family of gateway types, sizes, and connection modes whose every visible behavior follows from two early decisions that are easy to make by accident and expensive to reverse.

This deep dive builds the working model an engineer needs to design encrypted connectivity into Azure on purpose and to reason about its failure modes before they happen rather than after a tunnel drops in production. We will hold the whole picture at once: route-based versus policy-based gateways and why the first is the modern default, the gateway sizes and the throughput and tunnel counts they bound, site-to-site connectivity to physical devices versus point-to-site connectivity to individual clients, active-active gateways that survive the loss of one instance, and the Border Gateway Protocol that turns a static tunnel into a dynamically routed link. By the end, the difference between a tunnel that works in a demo and a connection that holds up under real traffic will be a set of decisions you can make deliberately rather than discoveries you make at three in the morning.

The InsightCrunch VPN gateway model: the two decisions everything follows from

Before any command runs, two choices shape everything the gateway can ever do. The first is the gateway type, route-based or policy-based, which fixes how the gateway decides what traffic belongs in the tunnel. The second is the gateway SKU, the size, which fixes the throughput, the number of tunnels, and whether features such as active-active and zone redundancy are even available. The InsightCrunch VPN gateway model arranges these so a reader can predict a capability before testing for it: read the gateway type to know how traffic is selected, read the SKU to know how much traffic and how many tunnels the gateway will carry, read the connection type to know whether you are joining a network or a roaming client, and read the redundancy mode to know what a single instance failure costs you.

Decision	Options	What it fixes	The trap
Gateway type	Route-based, policy-based	How traffic is selected into the tunnel	Policy-based caps you at a single tunnel and single connection, blocks BGP, active-active, and point-to-site
Gateway SKU	Basic, VpnGw1 through VpnGw5, and the AZ zone-redundant variants	Aggregate throughput, tunnel and connection counts, feature availability	An undersized SKU silently caps throughput far below what the link needs
Connection type	Site-to-site, point-to-site, VNet-to-VNet	Whether you connect a network, a client, or two Azure networks	Mixing the wrong type for the need, such as site-to-site where roaming clients were meant
Redundancy mode	Active-standby, active-active, zone-redundant	What the loss of one gateway instance costs in downtime	A single-instance gateway means a maintenance event drops the tunnel

The model’s payoff is that almost every “why does my VPN do this” question resolves to one of these four rows. A connection that refuses a second subnet is usually a policy-based gateway. A tunnel that throttles under load is usually an undersized SKU. A link that drops during a platform maintenance window is usually a non-redundant gateway. A static route that never updates when the on-premises side changes is usually a gateway without BGP. Hold the four rows, and the failure modes stop being mysteries and become predictions.

What is an Azure VPN gateway and what does it actually do?

An Azure VPN gateway is a managed pair of virtual machines, invisible to you, deployed into a dedicated subnet called the GatewaySubnet, that terminates IPsec and IKE tunnels and forwards traffic between an Azure virtual network and a remote network or client. You never see, patch, or size those instances directly. You choose a SKU and a type, and Azure provisions and maintains the underlying compute, applying the encryption, the routing, and the high-availability behavior the SKU promises.

The gateway is the encrypted door between a private Azure address space and the world outside it, whether that world is a corporate data center, a branch office router, or an employee’s laptop in a hotel. Everything in this article is a refinement of that one job: how the door decides what passes through it, how wide the door is, how many doors there are, and what happens when one door fails.

Route-based versus policy-based: the decision that defines the gateway

The single most consequential choice you make is the gateway type, and the names undersell how different the two designs are. A policy-based gateway selects traffic for the tunnel using a static set of address prefixes called traffic selectors. You declare, in effect, “traffic from this local prefix to that remote prefix goes in this tunnel,” and the gateway encrypts exactly what matches that declaration and nothing else. A route-based gateway selects traffic differently. It presents the tunnel as a virtual interface in the routing table, and any packet whose route points at that interface enters the tunnel. The selection is a routing decision, not a static prefix match.

That structural difference cascades into nearly every capability that matters. Because a policy-based gateway binds the tunnel to a fixed pair of prefixes, it supports one tunnel and one connection. There is no room in the design for a second tunnel, for dynamically learned routes, for an active-active pair, or for individual roaming clients, because all of those require the gateway to make per-packet routing decisions rather than honoring a single frozen selector. A route-based gateway, by treating the tunnel as a routable interface, supports many tunnels at once, dynamic routing through BGP, active-active redundancy, point-to-site clients, and coexistence of site-to-site and point-to-site on the same gateway. The modern feature set lives almost entirely on the route-based side.

This is the namable claim of the article, the route-based-by-default rule: a route-based gateway supports BGP, multiple tunnels, active-active, and point-to-site, so policy-based is a legacy fallback reserved for a narrow case, and choosing route-based by default is the decision that avoids the majority of VPN limitations engineers run into. State it plainly to a team and you preempt a whole class of redesigns, because nearly every “the gateway will not let me” complaint traces back to a policy-based choice made without realizing what it foreclosed.

When does a policy-based gateway ever make sense?

A policy-based gateway makes sense only when the device on the other end cannot do route-based IKEv2 and is locked to IKEv1 with policy-based selectors, and the connection is a single tunnel between one local prefix and one remote prefix with no need for dynamic routing or redundancy. That narrow profile is the entire remaining justification.

In practice that case is rare and shrinking. It appears with older firewall appliances, certain legacy on-premises devices, or a partner network whose security team will not move off a policy-based configuration. Even then, many of those devices can be coaxed into a route-based IKEv2 connection, and in some configurations a route-based Azure gateway can interoperate with a policy-based peer by enabling the policy-based traffic selectors option on its connection. The result is that the deliberate, modern choice is route-based, and policy-based survives as a compatibility shim rather than a design you reach for on purpose. If you find yourself about to provision a policy-based gateway, the right first question is whether the remote device truly cannot do route-based, because the answer is usually that it can.

How does the gateway decide which traffic enters the tunnel?

On a route-based gateway, the tunnel is a next hop in the effective routes for the GatewaySubnet and the connected subnets, so a packet enters the tunnel when its destination matches a route whose next hop is the virtual network gateway. On a policy-based gateway, a packet enters the tunnel only when its source and destination match the configured traffic selectors exactly, and anything outside that match is dropped or routed normally.

The consequence for debugging is direct. On a route-based gateway, a packet that should be crossing the tunnel but is not is almost always a routing problem: the route advertising the remote prefix is missing, a user-defined route is overriding it, or the connection is not propagating gateway routes to the subnet. You debug it by reading the effective routes on a network interface in the subnet and confirming the remote prefix points at the gateway. On a policy-based gateway, the same symptom is almost always a traffic selector mismatch: the local or remote prefix in the Azure connection does not match what the peer device expects, so the security association never forms for that prefix. You debug it by comparing the selectors on both ends until they agree. Knowing which kind of gateway you built tells you which of these two investigations to run, and that alone saves hours.
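That first check can be made routine. The sketch below is a hypothetical helper, not part of any Azure tooling: it reads the gateway’s vpnType through the Azure CLI and prints which of the two investigations applies. The function name is an assumption, and the resource names in the usage line are the ones used in the build sequence later in this article.

```shell
# Hypothetical helper: print which investigation fits the gateway type.
# Usage: which_investigation <resource-group> <gateway-name>
which_investigation() {
  # vpnType is reported as RouteBased or PolicyBased
  vpn_type="$(az network vnet-gateway show \
    --resource-group "$1" \
    --name "$2" \
    --query vpnType \
    --output tsv)"

  case "$vpn_type" in
    RouteBased)  echo "route-based: read the effective routes and look for the remote prefix" ;;
    PolicyBased) echo "policy-based: compare traffic selectors on both ends" ;;
    *)           echo "unknown vpnType: $vpn_type" ;;
  esac
}
```

Calling `which_investigation rg-network vpngw-hub` before touching anything else keeps the debugging effort pointed at the right layer.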

The gateway SKU: what the size actually buys you

If the gateway type fixes what the tunnel can do, the SKU fixes how much it can do and which advanced behaviors are even on the menu. Azure offers a Basic SKU, a generational ladder of VpnGw SKUs from VpnGw1 through VpnGw5, and a parallel set of zone-redundant AZ variants of those generational SKUs. Each step up the ladder raises the aggregate throughput, raises the number of site-to-site tunnels the gateway will terminate, raises the number of point-to-site connections it accepts, and in the higher tiers unlocks behaviors the lower tiers withhold.

The throughput figure deserves a careful reading, because it is the number engineers most often misunderstand. The published throughput for a SKU is an aggregate ceiling across all tunnels on that gateway, measured under favorable conditions, not a guaranteed per-tunnel rate and not a figure you should expect to hit with a single TCP flow over a long-distance, high-latency path. A gateway rated for a given aggregate will divide that capacity across however many tunnels are active, and real throughput on any one tunnel is further bounded by latency, packet loss, the cipher suite negotiated, and the remote device’s own capacity. Treat the SKU throughput as a planning ceiling to size against with headroom, and verify the achieved rate with a real transfer rather than assuming the brochure number.
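The latency effect on a single flow can be estimated before any test runs. The sketch below applies the general rule of thumb that one TCP flow cannot exceed its window size divided by the round-trip time; that rule comes from TCP itself rather than from anything specific to Azure, and the function name and sample values are illustrative assumptions.

```shell
# Rough upper bound for ONE TCP flow: window size / round-trip time.
# Usage: tcp_flow_ceiling_kbps <window-bytes> <rtt-ms>
tcp_flow_ceiling_kbps() {
  # bits per millisecond is the same number as kilobits per second
  echo $(( $1 * 8 / $2 ))
}
```

With a 64 KiB window and an 80 ms round trip, `tcp_flow_ceiling_kbps 65536 80` prints 6553, roughly 6.5 Mbps for that one flow regardless of how large the gateway is. The estimate explains why a single long-distance transfer can sit far below the SKU’s aggregate figure while parallel flows get much closer to it.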

The tunnel and connection counts are the other half of the SKU decision and the one that quietly breaks expansion plans. A gateway that comfortably terminated three branch tunnels will refuse the tenth if the SKU’s tunnel limit sits below ten, and the failure surfaces as a connection that simply will not establish rather than as an obvious capacity error. Before a hub gateway is expected to fan out to many branches, the tunnel count of the chosen SKU has to be checked against the planned branch count with room to grow, because raising it later means resizing the gateway, and resizing across certain SKU families is not always an in-place operation.

Which SKU should I choose for a production gateway?

Choose the lowest VpnGw SKU whose aggregate throughput clears your measured peak with comfortable headroom and whose tunnel and connection limits exceed your planned count, and prefer the zone-redundant AZ variant in any region that offers availability zones. Avoid the Basic SKU for production because it lacks the modern feature set and the resilience options.

The reasoning behind that rule is worth making explicit. The Basic SKU exists mainly for development and the simplest connectivity and omits capabilities such as active-active, certain BGP options, and the zone-redundant deployment model, so building production connectivity on it tends to require a disruptive rebuild later. Among the VpnGw SKUs, the cost difference between adjacent tiers is usually small relative to the cost of an outage caused by an undersized gateway throttling a critical replication or backup stream, so sizing with headroom is the economical choice over the life of the connection. The AZ variants spread the gateway’s underlying instances across availability zones in regions that support them, so a single zone incident does not take the gateway down, and choosing the AZ variant at creation time avoids a later migration because moving from a non-zonal to a zone-redundant gateway is not a setting you flip in place.
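The sizing rule reduces to two comparisons, which the sketch below expresses as a small shell function. It is a hypothetical helper: the rated throughput and tunnel limit are passed in by the caller from the current published SKU table rather than hard-coded, because those figures change over time, and the sample numbers in the usage line are placeholders rather than the specification of any particular SKU.

```shell
# Hypothetical sizing check. Exit status 0 means the SKU fits.
# Usage: sku_fits <peak-mbps> <headroom-pct> <rated-mbps> <planned-tunnels> <tunnel-limit>
sku_fits() {
  # Measured peak plus the chosen headroom, in whole Mbps
  needed_mbps=$(( $1 * (100 + $2) / 100 ))

  # Both the throughput ceiling and the tunnel limit must clear the plan
  [ "$needed_mbps" -le "$3" ] && [ "$4" -le "$5" ]
}
```

For example, `sku_fits 600 30 1000 12 30 && echo fits` treats a 600 Mbps peak with 30 percent headroom as needing 780 Mbps, and reports a fit against a SKU the caller has looked up as rated for 1000 Mbps and 30 tunnels. Walking the candidate SKUs from smallest to largest and stopping at the first fit gives the lowest SKU that clears both limits.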

Can I resize a VPN gateway without rebuilding it?

You can resize within the same generation of VpnGw SKUs, for example from VpnGw1 to VpnGw2, as an in-place operation that keeps the gateway’s public IP and connections, but moving between certain families, such as from a Basic SKU to a VpnGw SKU or from a non-zonal SKU to a zone-redundant AZ SKU, requires deleting and recreating the gateway. The distinction matters because a recreate changes the gateway’s public IP unless you have explicitly allocated a static one, and a changed IP means every peer device’s configuration has to be updated.

The practical lesson is to pick the SKU family correctly at creation rather than counting on a painless upgrade path. Allocate a static public IP for the gateway from the start so that even a forced recreate can reuse the same address, which spares you from coordinating a firewall change with every branch at the moment of a rebuild. And when capacity planning, size for the connection’s expected growth over its life rather than its first-week footprint, because the cost of a slightly larger SKU is almost always less than the operational cost of a resize that touches every remote peer.
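A guarded resize might look like the sketch below. It is a hypothetical wrapper around `az network vnet-gateway update`, and its two refusal rules simply encode the family boundaries described above; the function name is an assumption, and the current resize rules should be checked in the Azure documentation before relying on it.

```shell
# Hypothetical helper: resize only when the move stays in place.
# Usage: resize_gateway <resource-group> <gateway-name> <target-sku>
resize_gateway() {
  current_sku="$(az network vnet-gateway show \
    --resource-group "$1" --name "$2" \
    --query sku.name --output tsv)"

  # Basic to VpnGw is a delete-and-recreate, not a resize
  if [ "$current_sku" = "Basic" ]; then
    echo "Basic cannot be resized in place; plan a recreate" >&2
    return 1
  fi

  # Crossing between non-zonal and AZ SKUs is also treated as a recreate
  case "$current_sku" in *AZ) current_zonal=yes ;; *) current_zonal=no ;; esac
  case "$3" in *AZ) target_zonal=yes ;; *) target_zonal=no ;; esac
  if [ "$current_zonal" != "$target_zonal" ]; then
    echo "$current_sku to $3 crosses the zonal boundary; plan a recreate" >&2
    return 1
  fi

  az network vnet-gateway update \
    --resource-group "$1" --name "$2" \
    --sku "$3" --no-wait
}
```

A call such as `resize_gateway rg-network vpngw-hub VpnGw3` goes ahead only when the current and target SKUs sit on the same side of both boundaries.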

Site-to-site, point-to-site, and VNet-to-VNet: choosing the connection type

A gateway type and a SKU describe the gateway itself. The connection type describes what the gateway connects to, and the three forms answer three different needs. A site-to-site connection joins an entire on-premises or remote network to the Azure virtual network over an IPsec tunnel terminated by a physical or virtual VPN device on the far end. A point-to-site connection lets an individual client, a laptop or workstation running a VPN client, dial into the virtual network without any on-premises device at all. A VNet-to-VNet connection joins two Azure virtual networks to each other through their gateways, which is a managed variant of site-to-site where both ends happen to be Azure.

Each form has a natural use. Site-to-site is the workhorse for connecting a data center, a branch office, or a colocation facility to Azure, carrying server-to-server traffic across a tunnel that the network teams on both ends manage. Point-to-site is for remote and roaming users, contractors, or administrators who need to reach private resources from arbitrary locations without provisioning a tunnel per location, and it authenticates each client individually rather than trusting a network. VNet-to-VNet connects Azure networks across regions or across subscriptions when virtual network peering does not fit the boundary, though for most same-tenant cross-region links peering is now the simpler and faster choice and the gateway-based VNet-to-VNet is reserved for cases where peering is not available or not desired.

A route-based gateway carries all three forms at once. The same gateway can terminate several site-to-site tunnels to different branches, accept point-to-site clients for remote users, and hold a VNet-to-VNet link to another region, all simultaneously, because each is just another set of routes pointed at the gateway interface. A policy-based gateway carries none of this flexibility, which is the route-based-by-default rule showing up again from a different angle.

How do site-to-site and point-to-site differ in what they trust?

Site-to-site trusts a network: the tunnel authenticates between two gateway devices using a shared key or certificates, and once it is up, every host behind the remote device can reach the permitted Azure subnets through it. Point-to-site trusts a client: each connecting device authenticates individually with a certificate, with Microsoft Entra ID (formerly Azure AD), or with RADIUS, and only that authenticated client gets an address and a route into the network.

That difference in trust boundary drives the security design. A site-to-site tunnel is appropriate when you control the remote network and want its hosts treated as part of the connected estate, and its security depends on the integrity of the remote device and the segmentation behind it. A point-to-site connection is appropriate when the endpoints are not part of a controlled network, such as personal or roaming devices, and it lets you authenticate and authorize each session, revoke an individual client without touching others, and avoid extending trust to a whole remote subnet. Mixing them deliberately on one route-based gateway is common and supported: branches connect over site-to-site, and the small set of administrators who work from anywhere connect over point-to-site, each with the trust model that fits.
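Adding point-to-site to an existing route-based gateway takes a client address pool and, for certificate authentication, a trusted root certificate. The sketch below assumes certificate authentication with the OpenVPN protocol; the function name, the certificate name p2s-root, and the sample address pool are hypothetical, and the pool must not overlap the virtual network or any on-premises range.

```shell
# Hypothetical helper: enable certificate-based point-to-site on a gateway.
# Usage: enable_p2s <resource-group> <gateway-name> <client-pool-cidr> <root-cert-file>
enable_p2s() {
  # Address pool that connecting clients draw their addresses from
  az network vnet-gateway update \
    --resource-group "$1" --name "$2" \
    --address-prefixes "$3" \
    --client-protocol OpenVPN

  # Public root certificate that client certificates must chain to
  az network vnet-gateway root-cert create \
    --resource-group "$1" --gateway-name "$2" \
    --name p2s-root \
    --public-cert-data "$4"
}
```

A call such as `enable_p2s rg-network vpngw-hub 172.16.201.0/24 ./p2s-root.cer` leaves the existing site-to-site connections on the gateway untouched, which is the coexistence described above.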

BGP: turning a static tunnel into a dynamically routed link

A site-to-site tunnel without BGP carries only the routes you statically configure. You tell the Azure connection which on-premises prefixes live behind the remote device, and you tell the remote device which Azure prefixes live behind the gateway, and the two sides forward accordingly. That works until the address plan on either side changes, at which point every static entry that referenced the old plan has to be edited by hand on both ends, and a missed entry becomes a silently black-holed prefix.

The Border Gateway Protocol replaces that manual exchange with a dynamic one. With BGP enabled on both the Azure side and the remote device, each side advertises its reachable prefixes to the other over the tunnel and withdraws them when they go away, so the routing tables update themselves as networks are added or removed. The Azure gateway is assigned a BGP autonomous system number and uses an APIPA or a private address as its BGP peer endpoint, and the remote device peers with it across the tunnel. Once the session is established, on-premises prefixes appear in Azure’s effective routes automatically and Azure prefixes appear on premises automatically.

BGP earns its keep in three situations especially. The first is any environment where prefixes change with enough frequency that hand-editing static routes is error-prone, which describes most growing estates. The second is active-active and multi-tunnel designs, where BGP lets the network converge around a failed path by withdrawing the routes that traversed it and preferring the routes that remain. The third is transit and hub designs, where BGP lets a hub gateway learn branch prefixes and, with the right configuration, lets branches learn one another’s prefixes through the hub. A route-based gateway is required for any of this, because BGP is fundamentally about per-prefix routing decisions that a policy-based gateway’s static selector model cannot express.

How does BGP routing work over a VPN gateway?

The Azure gateway and the remote device establish a BGP session across the IPsec tunnel, each identified by an autonomous system number, and they exchange route advertisements: Azure announces the virtual network’s prefixes and any prefixes it learned from other connections, while the peer announces the on-premises prefixes. Each side installs the learned routes with the tunnel as the next hop, and withdrawals propagate automatically when a prefix becomes unreachable.

The detail that trips people up is the BGP peer addressing. The gateway uses a specific BGP peer IP, and on certain configurations that address is an APIPA address in a reserved range rather than an address from your virtual network, so the remote device has to be configured to peer with that exact address. A second detail is autonomous system number selection: Azure reserves certain numbers for its own use, so the on-premises side must choose an ASN that does not collide with the reserved set, and the two sides must not accidentally share an ASN unless an explicit same-ASN configuration is intended. When a BGP session refuses to come up over a healthy tunnel, the peer address and the ASN values are the first two things to verify, because a tunnel can be perfectly encrypted and forwarding while the routing layer above it never agrees.
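Both values can be checked before the tunnel is ever built. The sketch below has two hypothetical helpers: one reads the ASN and BGP peering address Azure assigned to the gateway, and one tests a proposed on-premises ASN against the set Azure reserves. The reserved list reflects my reading of the Azure VPN Gateway BGP documentation at the time of writing and is an assumption to verify against the current documentation, not an authoritative source.

```shell
# Hypothetical helper: show the ASN and BGP peer address on the Azure side.
# Usage: show_gateway_bgp <resource-group> <gateway-name>
show_gateway_bgp() {
  az network vnet-gateway show \
    --resource-group "$1" --name "$2" \
    --query "bgpSettings.{asn:asn, peer:bgpPeeringAddress}" \
    --output json
}

# Hypothetical helper: exit status 0 when the ASN is reserved by Azure.
# The list below is an assumption; confirm it against the current docs.
# Usage: asn_is_azure_reserved <asn>
asn_is_azure_reserved() {
  case "$1" in
    8074|8075|12076|65515|65517|65518|65519|65520) return 0 ;;
    *) return 1 ;;
  esac
}
```

Running `asn_is_azure_reserved 65010 || echo "usable on premises"` before configuring the remote device catches the collision early, and the peer address printed by `show_gateway_bgp` is the exact address the remote device has to peer with.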

Active-active gateways and resilience: surviving the loss of one instance

Behind every Azure VPN gateway are two underlying instances, but how they are used depends on the redundancy mode. In the default active-standby mode, one instance carries the tunnels and the other waits idle, taking over only if the active one fails or during a planned maintenance event, which produces a brief interruption while the standby assumes the connections and the routing reconverges. In active-active mode, both instances carry traffic at the same time, each with its own public IP, and the gateway presents two tunnels to the remote side so that the loss of one instance leaves the other already up and forwarding rather than waiting to be promoted.

Active-active changes the resilience math. With active-standby, a maintenance event on the active instance means a failover gap, short but real, during which the tunnel is down. With active-active, the remote device maintains tunnels to both gateway instances simultaneously, and BGP steers traffic across whichever tunnels are healthy, so a single instance dropping out is absorbed without a full reconnection. For connections carrying replication, financial transactions, or anything where even a brief drop is costly, active-active is the design that holds. It does require a route-based gateway and a SKU that supports it, and it works best with BGP so the routing layer can converge cleanly around the surviving path.

Resilience extends further on the remote side. A truly redundant design pairs an active-active Azure gateway with two on-premises VPN devices, so that the failure of a single device, a single Azure instance, or a single tunnel never severs connectivity. This dual-redundancy topology is the gold standard for connections that must not drop, and it is only expressible on a route-based gateway with multiple tunnels and BGP coordinating the paths. The zone-redundant AZ SKUs add another dimension by placing the underlying instances in separate availability zones, so the resilience covers a zone-level incident and not only an instance-level one.

How does active-active give redundancy, and is it always worth it?

Active-active runs both gateway instances simultaneously, each terminating its own tunnel to the remote device, so when one instance fails the other is already carrying traffic and no failover promotion is needed, which removes the reconnection gap that active-standby incurs during instance loss or maintenance. BGP rides on top to converge routing around the surviving tunnel automatically.

Whether it is worth it depends on the cost of a brief outage. For a development link or a branch where a short reconnection is harmless, active-standby is adequate and simpler. For production connectivity carrying database replication, payment flows, or interactive workloads where seconds of loss are visible to users or violate a service commitment, the active-active design pays for itself the first time a maintenance event would otherwise have dropped the tunnel. The cost is a modestly more complex configuration, a second public IP, a SKU that supports the mode, and ideally a second remote device, all of which are small next to an avoided outage on a critical path. The decision rule is simple: if a momentary tunnel drop is acceptable, active-standby; if it is not, active-active with BGP, and a second remote device when the budget allows.
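In the Azure CLI, an active-active gateway is requested by supplying two public IPs at creation time. The sketch below assumes both static Standard public IPs already exist and reuses the virtual network name and SKU from this article’s examples; the function name and the public IP names in the usage line are hypothetical.

```shell
# Hypothetical helper: create a route-based gateway in active-active mode.
# Two public IPs, one per gateway instance, select the active-active mode.
# Usage: create_active_active_gateway <resource-group> <gateway-name> <vnet> <pip-1> <pip-2>
create_active_active_gateway() {
  az network vnet-gateway create \
    --resource-group "$1" \
    --name "$2" \
    --vnet "$3" \
    --public-ip-addresses "$4" "$5" \
    --gateway-type Vpn \
    --vpn-type RouteBased \
    --sku VpnGw2 \
    --no-wait
}
```

A call such as `create_active_active_gateway rg-network vpngw-hub vnet-hub pip-vpngw-1 pip-vpngw-2` gives the remote device two Azure endpoints to build tunnels to, one per instance.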

The configuration that realizes the model

Reading the model is one thing; building it is another, and the build order matters because several pieces depend on others existing first. The gateway lives in a dedicated subnet named exactly GatewaySubnet, which Azure recognizes by name, so the first prerequisite is a virtual network with that subnet carved out and sized generously, because a gateway subnet that is too small constrains future features and certain configurations need more addresses than a minimal subnet provides. The gateway then needs a public IP, the gateway resource itself with its type and SKU, a local network gateway object that represents the remote device and its address space, and finally a connection that ties the gateway, the local network gateway, and a shared key together.

The following sequence creates a route-based, active-active-capable gateway and connects it to an on-premises device. The commands are illustrative of the order and the objects involved, and the exact parameter values should be checked against your environment and the current command reference before running them.

# 1. Create the GatewaySubnet (name is significant; Azure matches it by name)
az network vnet subnet create \
  --resource-group rg-network \
  --vnet-name vnet-hub \
  --name GatewaySubnet \
  --address-prefixes 10.0.255.0/27

# 2. Allocate a static public IP for the gateway so a recreate can reuse it
az network public-ip create \
  --resource-group rg-network \
  --name pip-vpngw \
  --allocation-method Static \
  --sku Standard

# 3. Create the route-based VPN gateway on a production-grade SKU
az network vnet-gateway create \
  --resource-group rg-network \
  --name vpngw-hub \
  --vnet vnet-hub \
  --public-ip-addresses pip-vpngw \
  --gateway-type Vpn \
  --vpn-type RouteBased \
  --sku VpnGw2 \
  --no-wait

Two details in that sequence carry more weight than they appear to. The first is the gateway subnet name and size: name it GatewaySubnet exactly, and give it room, because Azure places its underlying instances there and certain features reserve additional addresses. The second is the static public IP allocated up front: by pinning the address before the gateway exists, you ensure that even a forced recreate of the gateway can reuse the same public IP, which means the remote peer device never has to be reconfigured for a new address during a rebuild.

With the gateway provisioned, the remote device is represented in Azure as a local network gateway, and the connection binds the two with a shared key. When BGP is in play, the gateway and the local network gateway also carry their BGP settings, the autonomous system numbers and peer addresses, so that the routing session can form once the tunnel is up.

# 4. Represent the on-premises device and its address space
az network local-gateway create \
  --resource-group rg-network \
  --name lng-onprem \
  --gateway-ip-address 203.0.113.10 \
  --local-address-prefixes 192.168.0.0/16 \
  --asn 65010 \
  --bgp-peering-address 192.168.255.1

# 5. Create the site-to-site connection with BGP enabled
az network vpn-connection create \
  --resource-group rg-network \
  --name cn-hub-to-onprem \
  --vnet-gateway1 vpngw-hub \
  --local-gateway2 lng-onprem \
  --shared-key 'use-a-strong-preshared-key-here' \
  --enable-bgp true

The shared key is the one piece of this configuration that is a secret rather than a setting, and it should come from a secured store rather than being typed inline as the example shows for clarity. Beyond the build, the configuration that the defaults get wrong is the IPsec and IKE policy: Azure negotiates a default set of cipher suites, and when the remote device’s security team has pinned a specific suite, the tunnel will not establish until the Azure connection’s custom IPsec policy is set to match exactly. A mismatched proposal is one of the most common reasons a freshly built tunnel never comes up, and it is invisible until you compare the two sides’ policies parameter by parameter.
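Both points can be sketched with the Azure CLI. The first helper below reads the shared key from an Azure Key Vault secret instead of typing it inline, and the second pins a custom IPsec and IKE policy on the connection. The function names, the vault name, and the secret name are hypothetical, and the cipher values are only an example proposal: the real values must be whatever the remote device’s team has pinned.

```shell
# Hypothetical helper: read the pre-shared key from a Key Vault secret.
# Usage: shared_key_from_vault <vault-name> <secret-name>
shared_key_from_vault() {
  az keyvault secret show \
    --vault-name "$1" --name "$2" \
    --query value --output tsv
}

# Hypothetical helper: pin one IPsec/IKE proposal on a connection.
# Every value must match the remote device's proposal exactly.
# Usage: pin_ipsec_policy <resource-group> <connection-name>
pin_ipsec_policy() {
  az network vpn-connection ipsec-policy add \
    --resource-group "$1" \
    --connection-name "$2" \
    --ike-encryption AES256 \
    --ike-integrity SHA256 \
    --dh-group DHGroup14 \
    --ipsec-encryption AES256 \
    --ipsec-integrity SHA256 \
    --pfs-group PFS2048 \
    --sa-lifetime 27000 \
    --sa-max-size 102400000
}
```

The key can then be passed as `--shared-key "$(shared_key_from_vault kv-network vpn-psk)"` when the connection is created, and `pin_ipsec_policy rg-network cn-hub-to-onprem` replaces the default negotiation with the single agreed proposal.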

How do I verify the gateway and connection are actually working?

Confirm the connection’s status reports connected, confirm the tunnel’s ingress and egress byte counters are climbing under a test transfer, and on a BGP connection confirm the learned routes show the remote prefixes with the gateway as next hop. A connected status with zero data and no learned routes means the tunnel formed but traffic is not flowing, which points at routing or selectors rather than at the IPsec handshake.

The verification commands report the connection status and the BGP peer state, and reading both together separates the layers cleanly. If the connection status is not connected, the problem is at the IPsec or IKE layer: a wrong shared key, a mismatched policy, an unreachable peer address, or a firewall blocking the IKE and IPsec ports on the path. If the connection is connected but no traffic crosses, the problem is above the tunnel: missing routes, a traffic selector mismatch on a policy-based gateway, a user-defined route diverting the traffic, or a network security group dropping it at the subnet. If BGP is enabled but no routes are learned, the problem is the BGP session: a wrong peer address, a colliding or wrong autonomous system number, or BGP not enabled on one side. Each symptom maps to a layer, and checking them in that order keeps you from reconfiguring the encryption when the real fault is a route.

# Connection status and data counters
az network vpn-connection show \
  --resource-group rg-network \
  --name cn-hub-to-onprem \
  --query "{status:connectionStatus, ingress:ingressBytesTransferred, egress:egressBytesTransferred}"

# BGP peer state on the gateway
az network vnet-gateway list-bgp-peer-status \
  --resource-group rg-network \
  --name vpngw-hub

# Routes the gateway has learned, including those received over BGP
az network vnet-gateway list-learned-routes \
  --resource-group rg-network \
  --name vpngw-hub
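The layer-by-layer reading of those outputs can be written down as a small decision function. The sketch below is a hypothetical helper that takes values you have already read from the commands above (the connection status, a transferred byte count, whether BGP is enabled, and the number of learned routes) and names the layer to investigate. It checks the BGP case before the no-traffic case because a dead BGP session usually explains the missing traffic as well.

```shell
# Hypothetical triage helper: map observed symptoms to the layer at fault.
# Usage: triage_layer <connection-status> <bytes-transferred> <bgp-enabled:yes|no> <learned-route-count>
triage_layer() {
  if [ "$1" != "Connected" ]; then
    echo "ipsec-ike: check shared key, policy, peer address, and blocked ports"
  elif [ "$3" = "yes" ] && [ "$4" -eq 0 ]; then
    echo "bgp: check peer address, ASNs, and that BGP is enabled on both sides"
  elif [ "$2" -eq 0 ]; then
    echo "above-tunnel: check routes, selectors, user-defined routes, and NSGs"
  else
    echo "healthy: tunnel is up and carrying traffic"
  fi
}
```

For example, `triage_layer Connected 0 no 0` points at the layer above the tunnel, which is the signal to leave the encryption settings alone.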

The failure modes and the diagnostic tools that expose them

Every capability in the model has a corresponding failure mode, and the value of holding the model is that each symptom points back at the decision that produced it. The six patterns below are the ones engineers report most, and for each there is a confirming signal and a fix that follows from understanding rather than from guessing.

The first pattern is a policy-based gateway limiting the connection to a single tunnel. A team builds a working tunnel to one branch, then tries to add a second branch and finds the new connection refuses to establish, or the existing one breaks when a second is configured. The confirming signal is the gateway’s VPN type reporting policy-based, and the underlying cause is structural: a policy-based gateway supports one connection by design. The fix is not a setting but a rebuild to a route-based gateway, after which multiple tunnels coexist freely. This is the most expensive failure to discover late, which is exactly why the route-based-by-default rule exists, and why the troubleshooting guide on how to fix Azure VPN gateway tunnel disconnects begins by confirming the gateway type before anything else.

The second pattern is throughput capped by an undersized SKU. A replication job or a backup stream that should saturate the link plateaus well below the expected rate, and engineers chase the application or the disk before suspecting the gateway. The confirming signal, visible in the gateway’s metrics, is aggregate throughput across all of its tunnels sitting near the SKU’s rated ceiling while individual flows stall. The fix is to resize to a higher VpnGw SKU within the same generation, an in-place operation, and to verify the new ceiling with a measured transfer rather than trusting the published figure on a high-latency path.

The third pattern is the wrong connection type for the need. A team builds site-to-site connectivity and then discovers that the people who actually need access are roaming administrators with no fixed network behind them, so a per-location tunnel makes no sense. The confirming signal is the mismatch itself: a site-to-site design serving individual humans rather than networks. The fix is to add point-to-site to the same route-based gateway so each roaming client authenticates individually, while the site-to-site tunnels continue to serve the branches that genuinely are networks.

Why does my tunnel show connected but no traffic passes?

A connected tunnel with no traffic almost always means the IPsec layer succeeded but the routing or filtering layer above it is dropping the packets: a missing route to the remote prefix, a user-defined route overriding the gateway route, a network security group denying the traffic, or, on a policy-based gateway, a traffic selector mismatch. The encryption is fine; the path is not.

Working this systematically beats poking at the shared key. Start by reading the effective routes on a network interface in a connected subnet and confirm the remote prefix is present with the virtual network gateway as its next hop; if it is missing, the connection is not propagating routes, which on a BGP connection points at a dead BGP session and on a static connection points at an unconfigured prefix. If the route is present and correct, check for a user-defined route with a more specific prefix sending the traffic somewhere else, because longest-prefix match means a stray specific route silently wins. If routing is clean, check the network security groups on the subnet and the network interface for a rule denying the flow. And on a policy-based gateway, compare the traffic selectors on both ends, because a single-byte difference in a prefix means the security association never covers the traffic you are testing. The detailed sequence for chasing a dropped or flapping tunnel is laid out in the companion piece that walks through how to diagnose VPN gateway tunnel disconnects, which pairs each symptom with its confirming command.
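The longest-prefix rule described above can be sketched in a few lines. This is a minimal illustration, not Azure's implementation: the route list, the field names, and the addresses are hypothetical, and the sketch models prefix length only, ignoring how ties between equal-length prefixes are broken.

```python
import ipaddress

def select_route(routes, destination):
    """Return the route with the longest prefix that contains destination, or None."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    return max(candidates, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)

# Hypothetical effective routes on an interface in a connected subnet.
routes = [
    {"prefix": "10.50.0.0/16", "next_hop": "VirtualNetworkGateway", "source": "gateway"},
    {"prefix": "10.50.4.0/24", "next_hop": "10.0.1.4", "source": "user-defined"},
]

# The stray /24 outranks the gateway's /16 for any address inside it.
print(select_route(routes, "10.50.4.20")["source"])
# Addresses outside the /24 still follow the gateway route.
print(select_route(routes, "10.50.9.20")["source"])
```

Reading a real effective-routes table with this rule in mind makes a stray specific route stand out: it is the entry with the longer prefix and a next hop that is not the gateway.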

The fourth pattern is the single-gateway resilience gap. A connection that has run flawlessly for months drops for a minute or two during what turns out to be a platform maintenance event, and the team is surprised that a managed service had any interruption at all. The confirming signal is the timing: a brief, total loss correlated with a maintenance notification, on a gateway running in active-standby mode. The fix is to move to an active-active gateway, ideally with BGP and a second remote device, so that the loss of one instance is absorbed by the other without a reconnection gap. The resilience was never broken; it was simply never designed in.

The fifth pattern is BGP that is enabled but never delivers dynamic routes. BGP is configured, the tunnel is healthy, and yet on-premises prefixes never show up in Azure’s learned routes. The confirming signal is a BGP peer status that is not in the established state while the IPsec connection is connected. The cause is almost always the BGP peer address or the autonomous system number: the remote device is peering with the wrong address, or the two sides have chosen colliding or reserved ASNs. The fix is to align the BGP peer addresses and ASNs on both ends until the session reaches established, after which the routes populate automatically.

The sixth pattern is a route-based-versus-policy-based mismatch with the on-premises device. The Azure gateway is route-based, the remote device is configured for policy-based traffic selectors, and the security association either never forms or forms and then drops because the two sides disagree about how to select traffic. The confirming signal is an IKE negotiation that fails at the traffic-selector or proposal stage, visible in the gateway diagnostics and in the remote device’s logs. The fix is to align the two: either move the remote device to route-based IKEv2, or configure policy-based traffic selectors as an option on the route-based Azure connection so it presents selectors the policy-based peer accepts.

Which diagnostic tools expose these failures?

The gateway’s own connection status and data counters tell you whether the tunnel is up and whether bytes are moving; the BGP peer and learned-route views tell you whether the routing layer agreed; effective routes on a subnet interface tell you whether traffic is even being pointed at the gateway; and packet capture on the gateway, plus the remote device’s IKE logs, expose the negotiation itself when a tunnel will not form. Each tool answers one layer’s question.

The discipline that turns these tools into fast diagnosis is to question the layers in order rather than all at once. Is the connection connected? If not, the fault is IPsec or IKE, so examine the shared key, the policy match, the peer reachability, and the IKE logs. Is it connected but silent? Then the fault is routing or filtering, so read the effective routes, the user-defined routes, and the network security groups. Is BGP failing on a healthy tunnel? Then the fault is the BGP session, so check the peer address and the ASN. Reproducing each of these failures in a controlled lab is the fastest way to build the instinct for which signal to read first, and you can run the hands-on Azure labs and command library on VaultBook to model route-based gateways, exercise the SKU throughput behavior, and stand up site-to-site and point-to-site connections against a reproduced remote device so the diagnostic steps become muscle memory before you need them on a live link.
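The layer-by-layer questioning above can be written down as a small decision function. This is a sketch under stated assumptions: the function name, its boolean inputs, and the layer labels are hypothetical, and the inputs stand in for whatever the connection status, data counters, and BGP peer view report. It checks the BGP session before generic routing because a dead session is itself one cause of a connected but silent tunnel.

```python
def triage(connected, traffic_flowing, bgp_enabled=False, bgp_established=False):
    """Return the layer to investigate first, questioning the layers in order."""
    if not connected:
        return "ipsec-ike"
    if bgp_enabled and not bgp_established:
        return "bgp"
    if not traffic_flowing:
        return "routing-filtering"
    return "healthy"

# What to examine for each layer, taken from the paragraph above.
NEXT_CHECKS = {
    "ipsec-ike": ["shared key", "policy match", "peer reachability", "IKE logs"],
    "bgp": ["BGP peer address", "ASN"],
    "routing-filtering": ["effective routes", "user-defined routes", "network security groups"],
    "healthy": [],
}

layer = triage(connected=True, traffic_flowing=False)
print(layer, NEXT_CHECKS[layer])
```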

How the gateway interacts with the rest of the network

A VPN gateway does not live alone. It sits inside a virtual network, alongside route tables, network security groups, peerings, possibly an ExpressRoute circuit, and often a hub-and-spoke topology, and most of the surprising behavior engineers hit comes from how the gateway interacts with those neighbors rather than from the gateway itself.

The most important interaction is gateway transit across virtual network peering. A gateway lives in one virtual network, but the resources that need to reach on-premises usually live in many. Rather than deploying a gateway per network, you place one gateway in a hub virtual network and enable gateway transit on the peering between the hub and each spoke, which lets the spokes use the hub’s gateway to reach on-premises and lets on-premises reach the spokes through the hub. Gateway transit is what makes the hub-and-spoke topology economical, because a single gateway serves the whole estate. The configuration has two halves that must both be set: the hub side allows gateway transit on the peering, and the spoke side uses the remote gateway. Miss either half and the spoke’s traffic never finds the tunnel, which surfaces as a spoke that cannot reach on-premises while the hub can, a symptom that points straight at a half-configured peering. The full design of this layout is covered in the walkthrough on the Azure hub-and-spoke network topology, which shows where the gateway terminates and how the spokes inherit its reach.

The second interaction is with route tables and user-defined routes. Azure injects routes for its connected prefixes automatically, but a user-defined route with a more specific prefix overrides them by longest-prefix match, so a UDR meant to steer traffic through a firewall appliance can accidentally divert traffic that should have crossed the tunnel, black-holing it if the appliance does not forward it onward. When on-premises connectivity breaks after someone adds a route table or a network virtual appliance, the effective routes on the affected subnet are the first place to look, because a stray specific route silently outranks the gateway route. Forced tunneling is the deliberate version of this interaction: a UDR or a BGP advertisement of a default route sends internet-bound traffic from Azure back through the tunnel to on-premises for inspection, which is sometimes a security requirement and sometimes an accident that breaks outbound connectivity when the on-premises side is not prepared to route it.

The third interaction is coexistence with ExpressRoute. A virtual network can hold both a VPN gateway and an ExpressRoute gateway, with the VPN tunnel serving as a backup path for the private circuit or as connectivity to sites the circuit does not reach. When both are present and both advertise the same prefixes over BGP, route preference decides which path traffic takes, and the design has to be explicit about which is primary and which is backup or the failover will not behave as intended. The comparison of when to reach for a VPN tunnel, virtual network peering, or an ExpressRoute circuit is worked through in the companion piece on VNet peering versus VPN versus ExpressRoute, and the deep treatment of the private circuit itself lives in the Azure ExpressRoute deep dive, which pairs naturally with this article when a design needs both the public-internet tunnel and the private circuit.

Why can my hub reach on-premises but my spoke cannot?

A spoke that cannot reach on-premises while the hub can almost always has a half-configured gateway transit: either the hub-to-spoke peering does not allow gateway transit, the spoke-to-hub peering does not use the remote gateway, or the on-premises device is not advertising or being told about the spoke’s address range. The tunnel is healthy; the spoke simply never inherits the path to it.

Resolving it means checking three things in sequence. First, confirm both halves of the peering: gateway transit allowed on the hub side and use-remote-gateway set on the spoke side, because both are required and either alone does nothing. Second, confirm that the on-premises side knows about the spoke’s address range, because a tunnel and a local network gateway scoped only to the hub’s range will not carry traffic destined for a spoke prefix the remote device has never heard of; with BGP this propagates automatically once the spoke prefix is advertised, and with static routing the spoke prefix has to be added to the local network gateway’s address space. Third, check the effective routes on a spoke interface to confirm the on-premises prefix points at the gateway and is not overridden by a user-defined route. These three checks resolve the overwhelming majority of spoke-reachability problems, and they all follow from understanding that the spoke borrows the hub’s gateway rather than owning one.
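The three checks can be expressed as one function that returns every problem it finds. This is an illustrative sketch: the function and its argument shapes are hypothetical, the dictionary keys mirror the peering properties `allowGatewayTransit` and `useRemoteGateways`, and the inputs are assumed to have been collected by hand or by tooling of your choice.

```python
import ipaddress

def spoke_reachability_issues(hub_peering, spoke_peering, spoke_prefix,
                              prefixes_known_on_premises, spoke_next_hop_type):
    """Run the three checks in sequence and collect every failing one."""
    issues = []
    # Check 1: both halves of the peering must be set.
    if not hub_peering.get("allowGatewayTransit"):
        issues.append("hub peering does not allow gateway transit")
    if not spoke_peering.get("useRemoteGateways"):
        issues.append("spoke peering does not use the remote gateway")
    # Check 2: the on-premises side must know a prefix that covers the spoke.
    spoke_net = ipaddress.ip_network(spoke_prefix)
    known = [ipaddress.ip_network(p) for p in prefixes_known_on_premises]
    if not any(spoke_net.subnet_of(k) for k in known):
        issues.append("on-premises side has no route covering the spoke prefix")
    # Check 3: the effective route for on-premises must point at the gateway.
    if spoke_next_hop_type != "VirtualNetworkGateway":
        issues.append("on-premises prefix is overridden; next hop is not the gateway")
    return issues

print(spoke_reachability_issues(
    {"allowGatewayTransit": True}, {"useRemoteGateways": False},
    "10.2.0.0/16", ["10.0.0.0/16"], "VirtualNetworkGateway"))
```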

Designing a VPN gateway for production

Pulling the model together into a production design means making each of the four decisions deliberately and then accounting for the neighbors. The gateway type is route-based unless a remote device genuinely cannot do route-based IKEv2, because route-based is the only type that carries the modern feature set. The SKU is a VpnGw tier sized for the measured peak with headroom and for the planned tunnel count with room to grow, in its zone-redundant AZ variant wherever the region offers availability zones, never the Basic SKU for anything that matters. The redundancy mode is active-active with BGP for any connection where a brief drop is unacceptable, ideally paired with two on-premises devices so no single failure on either side severs the link. And the connection types are chosen per need: site-to-site for networks, point-to-site for roaming clients, coexisting on the same gateway.

The neighbors are designed in alongside the gateway. A static public IP is allocated up front so a forced recreate reuses the address and the remote peers never have to be reconfigured. The gateway subnet is named exactly GatewaySubnet, the name Azure requires, and sized generously. Gateway transit is enabled across the hub-and-spoke peerings so a single gateway serves the estate. Route tables are audited for user-defined routes that might override the gateway routes, and forced tunneling is configured deliberately if egress inspection is required rather than discovered accidentally. The IPsec and IKE policy is matched to the remote device’s required cipher suite from the start, since a default-policy mismatch is the most common reason a new tunnel never forms. And the shared key is drawn from a secured secret store rather than embedded in a template or a script.

Monitoring closes the loop. The gateway’s connection status, its tunnel ingress and egress byte counters, and its BGP peer state are the signals that tell you the connection is not merely provisioned but actually carrying traffic and routing correctly. Alerting on a connection status that leaves the connected state, on a BGP peer that leaves the established state, and on throughput approaching the SKU ceiling turns the three most common failure modes, a dropped tunnel, a dead routing session, and a saturated gateway, into notifications you receive before users do rather than incidents you reconstruct afterward. A connection designed this way is not merely working; it is observable, resilient, and built so that the predictable failures are caught early and the expensive ones are designed out.
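The three alert conditions can be sketched as a single evaluation over the signals named above. This is a hedged illustration rather than a monitoring integration: the function, its parameters, the 80 percent saturation threshold, and the ceiling figure in the example are all assumptions chosen for the sketch.

```python
def evaluate_alerts(connection_connected, bgp_established,
                    throughput_mbps, sku_ceiling_mbps, saturation_ratio=0.8):
    """Map the three monitored signals to the three failure modes they reveal.

    bgp_established may be None when the connection does not use BGP.
    """
    alerts = []
    if not connection_connected:
        alerts.append("dropped tunnel")
    if bgp_established is False:
        alerts.append("dead routing session")
    if throughput_mbps >= saturation_ratio * sku_ceiling_mbps:
        alerts.append("saturated gateway")
    return alerts

# Hypothetical readings: tunnel up, BGP down, traffic close to an assumed ceiling.
print(evaluate_alerts(True, False, throughput_mbps=600, sku_ceiling_mbps=650))
```

The threshold is the design choice worth arguing about: set it low enough that a resize can be planned before the ceiling is reached, and high enough that normal peaks do not page anyone.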

How do I make a VPN gateway design repeatable and auditable?

Express the entire gateway, its public IP, its connections, its local network gateways, and its BGP settings as infrastructure as code in a Bicep or Terraform module, keep the shared keys in a secret store referenced rather than embedded, and version the module so every change is reviewed and reproducible. A gateway built by hand in the portal is a gateway nobody can rebuild identically under pressure.

The reason to invest in this is that VPN gateways are long-lived, infrequently changed, and catastrophic when they fail, which is precisely the profile where hand-built configuration drifts and nobody remembers the exact settings until an outage forces a rebuild. A versioned module captures the gateway type, the SKU, the IPsec policy, the BGP ASNs and peer addresses, and the connection topology in a form that can be code-reviewed, diffed, and redeployed, so a recreate after a SKU-family change or a regional move reproduces the working configuration exactly. It also makes the security posture auditable, because the cipher suite, the use of a static IP, and the reference to a secret store are all visible in the source rather than buried in portal blades nobody opens until something breaks.

The IPsec and IKE layer, read closely enough to debug it

Underneath the gateway abstraction is a two-phase negotiation that either succeeds or leaves a precise trail when it fails, and reading that trail is the difference between fixing a tunnel in minutes and cycling settings for an afternoon. The Internet Key Exchange protocol, in its second version for route-based gateways, runs in two phases. The first phase authenticates the two endpoints to each other and establishes a secure channel for the negotiation itself, using the shared key or certificates and an agreed set of cryptographic parameters. The second phase, riding inside that secure channel, negotiates the actual IPsec security associations that protect the data traffic, including the cipher, the integrity algorithm, and the lifetime after which the keys are renegotiated.

Most tunnel-establishment failures land in one of those two phases, and which phase tells you what to fix. A first-phase failure usually means the endpoints cannot agree on who they are or how to talk: a wrong shared key, a mismatched IKE proposal because one side requires a cipher the other did not offer, or an unreachable peer because a firewall on the path is blocking the negotiation. A second-phase failure usually means the endpoints authenticated but cannot agree on how to protect the data: a mismatched IPsec proposal, or, on the boundary between route-based and policy-based, a disagreement about traffic selectors. Azure negotiates a default set of proposals that interoperate with many devices, but the moment a remote security team pins a specific suite, the Azure connection’s custom IPsec policy has to be set to match that suite exactly, parameter for parameter, or the negotiation fails in the phase where the disagreement lives.
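Matching a pinned suite parameter for parameter is easy to get wrong by eye, so a side-by-side comparison helps. The sketch below is illustrative only: the helper is hypothetical, the key names follow the fields of an Azure custom IPsec policy, and the values are sample proposals rather than a recommendation.

```python
# Parameters negotiated in each phase, named after the custom policy fields.
PHASE1_KEYS = ("ikeEncryption", "ikeIntegrity", "dhGroup")
PHASE2_KEYS = ("ipsecEncryption", "ipsecIntegrity", "pfsGroup")

def proposal_mismatches(azure_policy, remote_policy):
    """Group differing parameters by the phase in which negotiation would fail."""
    result = {"phase1": [], "phase2": []}
    for phase, keys in (("phase1", PHASE1_KEYS), ("phase2", PHASE2_KEYS)):
        for key in keys:
            ours, theirs = azure_policy.get(key), remote_policy.get(key)
            if ours != theirs:
                result[phase].append((key, ours, theirs))
    return result

# Sample proposals: the two sides agree on phase one but not on PFS.
azure_policy = {"ikeEncryption": "AES256", "ikeIntegrity": "SHA256", "dhGroup": "DHGroup14",
                "ipsecEncryption": "AES256", "ipsecIntegrity": "SHA256", "pfsGroup": "None"}
remote_policy = dict(azure_policy, pfsGroup="PFS2048")

print(proposal_mismatches(azure_policy, remote_policy))
```

An empty phase-one list with a populated phase-two list matches the symptom described above: the endpoints authenticate, then fail to agree on how to protect the data.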

The lifetimes matter too, in a subtle way. The security associations are renegotiated periodically, and if the two sides disagree wildly about the lifetimes, a tunnel can come up cleanly and then drop at the rekey interval, producing a connection that works for a while and then flaps on a regular cadence. A tunnel that fails immediately is a proposal or key problem; a tunnel that fails on a rhythm is often a rekey or lifetime mismatch. Hearing the difference between never-up and up-then-flapping points the investigation at the right parameter before any logs are opened.

Why does my tunnel come up and then drop on a regular interval?

A tunnel that establishes cleanly and then drops on a predictable cadence is usually rekeying badly: the two endpoints disagree about the security association lifetimes, so when the keys are due to be renegotiated the two sides fall out of step and the tunnel resets. The encryption and authentication are correct, which is why it came up at all; the renegotiation is where they diverge.

The fix is to align the IPsec and IKE lifetimes on both ends, or to set a custom IPsec policy on the Azure connection whose lifetimes match what the remote device uses. It is worth distinguishing this from an unstable physical path, which produces irregular drops tied to packet loss rather than a clean periodic reset, and from an idle-timeout behavior on some devices that tears down a tunnel with no interesting traffic, which produces drops correlated with quiet periods rather than with the rekey clock. The cadence of the drops is the tell: periodic and regular points at rekey, irregular points at the path, and quiet-correlated points at an idle timeout, and each has a different remedy.
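The cadence test can be made concrete by looking at the spacing between drops. This is a rough heuristic sketch, not a diagnostic tool: the function, the 10 percent regularity tolerance, and the per-drop idle flags are assumptions, and real timestamps would come from your own connection logs.

```python
from statistics import mean, pstdev

def classify_drop_cadence(drop_times, idle_before_drop=None, tolerance=0.1):
    """Classify drops as periodic, quiet-correlated, or irregular.

    drop_times: sorted drop timestamps in seconds.
    idle_before_drop: optional list of booleans, one per drop, True when the
    tunnel carried no interesting traffic just before that drop.
    """
    if len(drop_times) < 3:
        return "insufficient data"
    intervals = [later - earlier for earlier, later in zip(drop_times, drop_times[1:])]
    average = mean(intervals)
    # Nearly constant spacing suggests a clock is driving the drops.
    if average > 0 and pstdev(intervals) / average <= tolerance:
        return "periodic: suspect rekey or lifetime mismatch"
    if idle_before_drop and all(idle_before_drop):
        return "quiet-correlated: suspect idle timeout"
    return "irregular: suspect the path"

print(classify_drop_cadence([0, 3600, 7200, 10800]))
print(classify_drop_cadence([0, 500, 4000, 4300]))
```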

Point-to-site in depth: connecting clients without a network

Point-to-site deserves its own treatment because it answers a need site-to-site cannot, and its mechanics differ in ways that matter. Where site-to-site terminates a tunnel between two gateway devices, point-to-site terminates a tunnel between the Azure side and an individual client running VPN software, and it authenticates that client rather than trusting a network behind a device. The gateway hands each connected client an address from a configured client address pool and a set of routes for the Azure prefixes it should reach, so the client behaves as though it had a private foothold inside the virtual network.

The protocol and the authentication are the two decisions inside point-to-site. The tunneling protocol can be an SSL-based protocol that traverses firewalls easily because it rides on a standard secure web port, or an IKEv2-based protocol that some platforms prefer, and the gateway can offer more than one so different client operating systems connect with whatever they support best. The authentication can be certificate-based, where each client presents a certificate that chains to a root the gateway trusts, or it can integrate with Microsoft Entra ID (formerly Azure AD) or a RADIUS server so that identity and conditional access policies govern who connects. Certificate-based authentication is simple to stand up and revoke per client, while identity-based authentication ties VPN access to the same directory and policies that govern the rest of the estate, which is usually the stronger posture for an organization that already centralizes identity.

Point-to-site scales differently from site-to-site, and the SKU governs how many simultaneous clients the gateway accepts, so a gateway expected to serve a large remote workforce has to be sized for the client count as well as the throughput. The client address pool must be large enough for the peak number of simultaneous connections and must not overlap with the virtual network or the on-premises ranges, because an overlapping pool produces routing conflicts that are maddening to diagnose. And because point-to-site clients are, by definition, outside any controlled network, the authentication choice is also a security choice: identity-based authentication with conditional access gives you per-session control that a network-trusting site-to-site tunnel never offers.
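Both pool constraints, no overlap and enough addresses, can be checked before the pool is ever configured. The sketch below is a minimal illustration with hypothetical names and address ranges; it counts every address in the pool as usable, which is a simplification.

```python
import ipaddress

def check_client_pool(pool, connected_prefixes, peak_clients):
    """Report which connected prefixes overlap the pool and whether it is big enough."""
    pool_net = ipaddress.ip_network(pool)
    overlapping = [p for p in connected_prefixes
                   if pool_net.overlaps(ipaddress.ip_network(p))]
    return {
        "overlaps": overlapping,
        "large_enough": pool_net.num_addresses >= peak_clients,
    }

# Hypothetical virtual network and on-premises ranges.
connected = ["10.0.0.0/16", "192.168.0.0/24"]

print(check_client_pool("172.16.201.0/24", connected, peak_clients=200))
print(check_client_pool("10.0.8.0/24", connected, peak_clients=200))
```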

Can site-to-site and point-to-site run on the same gateway?

Yes, a single route-based gateway carries both simultaneously: it can terminate several site-to-site tunnels to branch networks and accept point-to-site clients for roaming users at the same time, because each is just another set of routes pointed at the gateway interface. A policy-based gateway cannot host this combination, because it does not support point-to-site at all, which is one more reason route-based is the default.

The combined deployment is in fact the common production shape. Branches and data centers connect over site-to-site tunnels because they are networks with devices, while administrators, contractors, and remote employees connect over point-to-site because they are individuals without a fixed network, and both populations reach the same Azure resources through the same gateway. The design considerations are additive rather than conflicting: size the SKU for the sum of the site-to-site throughput and the point-to-site client count, keep the point-to-site client pool from overlapping any connected network, and apply the trust model appropriate to each population, network-level for the branches and per-client identity for the roaming users.

Migrating from a policy-based to a route-based connection

Because the policy-based choice forecloses so much, teams that made it early often reach a point where they need what only route-based offers, a second tunnel, dynamic routing, redundancy, or roaming clients, and the migration is worth describing because it is a rebuild rather than a setting change. There is no in-place conversion that turns a policy-based gateway into a route-based one; the type is fixed at creation, so the path forward is to stand up a new route-based gateway and cut the connections over to it.

The cutover sequence minimizes the outage. First, provision the new route-based resource alongside the existing one, ideally with a static public IP allocated up front, on a SKU sized for the full set of tunnels it will eventually carry. Second, prepare the remote devices for the new endpoint, configuring the new public IP, the matching IPsec policy, and, where dynamic routing is the goal, the BGP autonomous system numbers and peer addresses. Third, move connections one branch at a time where the topology allows, so a single misconfiguration affects one site rather than the whole estate, and verify each cutover with the connection status and the data counters before moving to the next. Finally, once every connection has moved and proven healthy, decommission the old policy-based resource.

Two complications deserve attention during the move. The first is the address-plan opportunity: a migration is the natural moment to introduce BGP so that future address changes propagate automatically, rather than carrying forward the static-route brittleness that made the policy-based connection painful to maintain. Investing the extra configuration in BGP during the rebuild pays back every time the network changes afterward. The second is the IPsec policy alignment: the new connection must present a proposal the remote device accepts, and since the remote team may have pinned a specific cipher suite, the custom policy on the new connection has to match it exactly, the same parameter-for-parameter discipline that governs any new tunnel. A migration that fails at cutover almost always fails on a policy mismatch or a forgotten BGP peer address, the same two faults that bedevil any fresh connection, so the verification discipline that catches them on a new build catches them here too.

The reward for the rebuild is everything the policy-based design withheld. After the cutover, the estate can grow past a single tunnel, survive an instance loss through active-active, learn routes dynamically through BGP, and admit roaming clients through point-to-site, all on one route-based resource. The migration is a one-time cost that removes a standing limitation, which is why the route-based-by-default rule is worth applying at the very first design rather than discovering its value through a forced rebuild later.

The strategic verdict

An Azure VPN gateway rewards the engineer who treats it as a set of deliberate decisions rather than a wizard to click through. The route-based-by-default rule is the spine of the whole subject: choose route-based and the modern feature set, multiple tunnels, BGP, active-active, point-to-site, and the coexistence of connection types, is available to you, while choosing policy-based forecloses nearly all of it for the sake of compatibility with a shrinking set of legacy devices. The SKU is the second deliberate decision, sized for measured throughput and planned tunnel count with headroom, in its zone-redundant variant where zones exist, never Basic for production. Redundancy is the third, active-active with BGP and ideally a second remote device wherever a brief drop is unacceptable. And the connection type is matched to whether you are joining a network or admitting a client.

Read the gateway through the four-row model and its behavior stops being mysterious. A connection that refuses a second tunnel is a policy-based gateway. A link that throttles is an undersized SKU. A tunnel that drops in maintenance is a non-redundant gateway. A route that never updates is a gateway without BGP. A spoke that cannot reach on-premises is a half-configured gateway transit. Each symptom names its decision, and each decision, made deliberately at design time, removes the symptom before it ever appears. The gateway that survives production is not the one that came up in a demo; it is the one whose every capability and every failure mode was chosen on purpose, expressed as code, and watched by monitoring that catches the predictable faults early. Build it that way, and the encrypted door between your Azure estate and the world holds when it matters. The discipline is the same one the whole series returns to: reason from the architecture first, confirm with the right signal second, and let every deliberate decision at design time remove a failure mode that would otherwise have surfaced as an incident at the worst possible hour.

Throughput, latency, and the ceiling a single flow actually hits

The gap between a SKU’s published aggregate throughput and the rate a real workload achieves is wide enough that it deserves its own reasoning, because engineers who size only against the brochure number are routinely disappointed. The published figure is an aggregate across all tunnels under favorable lab conditions, and several independent factors pull a real-world rate below it. Latency is the first: a tunnel between continents adds tens or hundreds of milliseconds of round-trip time, and a single TCP flow’s throughput is bounded by the window size divided by the round-trip time, so a long-distance link starves one flow long before its aggregate ceiling is in sight. Packet loss compounds this, because loss triggers TCP’s congestion response and collapses the achievable window further.

The implication for design is twofold. First, a workload that needs high throughput over a high-latency tunnel should be parallelized into many flows rather than relying on one, because the aggregate of many flows can approach the gateway ceiling even when no single flow can. Replication and backup tools that support parallel streams will saturate a gateway that a single-threaded copy leaves mostly idle. Second, the SKU should be sized against the realistic aggregate, with the understanding that the encryption overhead, the cipher suite, and the remote device’s own capacity all sit between the workload and the published number. The honest way to size a gateway is to provision a candidate SKU, drive it with a representative parallel transfer, measure the achieved aggregate, and resize up if the measured ceiling does not clear the requirement with headroom, rather than trusting any number on a page.
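The window-over-round-trip bound and the case for parallel flows reduce to a short calculation. The window size, round-trip time, and target rate below are illustrative assumptions, and the estimate ignores packet loss and encryption overhead, both of which lower the real figure.

```python
import math

def single_flow_mbps(window_bytes, rtt_ms):
    """Upper bound on one TCP flow: window size divided by round-trip time."""
    bits_per_second = (window_bytes * 8) / (rtt_ms / 1000)
    return bits_per_second / 1_000_000

def flows_needed(target_mbps, window_bytes, rtt_ms):
    """Smallest number of parallel flows whose combined bound reaches the target."""
    return math.ceil(target_mbps / single_flow_mbps(window_bytes, rtt_ms))

# Assumed values: a 64 KB window on a 100 ms path, aiming for 500 Mbps in aggregate.
per_flow = single_flow_mbps(window_bytes=65535, rtt_ms=100)
print(f"one flow tops out near {per_flow:.2f} Mbps")
print("parallel flows needed:", flows_needed(500, 65535, 100))
```

With these assumed numbers a single flow is limited to roughly 5 Mbps no matter how large the SKU is, which is why a single-threaded copy leaves the gateway mostly idle.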

There is also a maximum transmission unit consideration that quietly degrades throughput when it is wrong. The IPsec encapsulation adds overhead to every packet, and if the path does not accommodate the resulting packet size, fragmentation or path-MTU discovery failures force retransmission and slow everything down. Tunnels that perform well on small packets and badly on large transfers often have an MTU or packet-size problem on the path rather than a gateway capacity problem, and clamping the maximum segment size on the traffic entering the tunnel frequently restores the expected rate. This is the kind of failure that looks like a throughput ceiling but is really a packet-size pathology, and distinguishing the two saves a needless SKU upgrade.
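The arithmetic behind a segment-size clamp is simple subtraction. In this sketch the function is hypothetical and the 80-byte tunnel overhead is a placeholder assumption, since the real overhead depends on the cipher suite, the encapsulation mode, and whether NAT traversal is in use.

```python
def clamped_mss(path_mtu, tunnel_overhead, ip_header=20, tcp_header=20):
    """Largest TCP payload that fits in one packet after encapsulation.

    tunnel_overhead is the number of bytes the IPsec encapsulation adds;
    the value passed in below is an assumption, not a measured figure.
    """
    return path_mtu - tunnel_overhead - ip_header - tcp_header

# Assumed 1500-byte path MTU and 80 bytes of encapsulation overhead.
print(clamped_mss(path_mtu=1500, tunnel_overhead=80))
```

Any segment larger than the result would have to be fragmented or dropped once the encapsulation is added, which is the condition that produces good small-packet performance and poor bulk transfers.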

Securing the gateway and the tunnels it terminates

A VPN gateway is a security control as much as a connectivity one, and a few choices separate a hardened gateway from a soft one. The first is the cipher suite: rather than accepting whatever default both ends happen to agree on, set a custom IPsec and IKE policy that pins strong, current algorithms for encryption, integrity, and key exchange, so the negotiation cannot fall back to a weaker proposal. The second is the shared key, which for site-to-site should be long and random and stored in a secret store rather than embedded in a template, and which should be rotated on a schedule, since a static preshared key that lives forever in a script is a standing exposure. Certificate-based authentication, where the devices support it, removes the shared-secret problem entirely and is the stronger posture for connections that justify the operational overhead.

For point-to-site, the security center of gravity is the client authentication. Identity-based authentication that ties VPN access to the organization’s directory lets conditional access policies govern who connects, from where, and under what device posture, which is far stronger than a certificate that, once issued, grants access until it is explicitly revoked. Whatever the method, the ability to revoke a single client without disrupting others is essential, because roaming clients are lost, stolen, and offboarded routinely, and a design that forces a mass reissue to remove one client will not be maintained.

The gateway’s exposure to the public internet is itself a consideration. The gateway presents a public IP to terminate tunnels from remote devices and clients, and while the IPsec and authentication layers protect the tunnels, the surrounding design should ensure that only the intended ports and protocols are reachable and that the resources behind the gateway are themselves segmented, so that a compromise of one connected network does not become free movement across the entire Azure estate. The gateway opens a door; the segmentation behind it decides how far an intruder who comes through that door can walk.

Forced tunneling and default-route advertisement, handled deliberately

One of the more consequential interactions between a VPN gateway and the rest of the network is forced tunneling, the practice of sending internet-bound traffic from Azure back through the tunnel to on-premises so it passes through the corporate egress and inspection stack rather than leaving directly through Azure. It is a legitimate and common security requirement, but it is also a frequent source of self-inflicted outages, because the mechanism that enables it can be triggered by accident. Forced tunneling is realized either by a user-defined route for the default prefix with the gateway as its next hop, or by the on-premises device advertising a default route over BGP, which the gateway then installs and propagates to the subnets that use it.

The accidental version is the dangerous one. An on-premises device that advertises a default route over BGP without the Azure side expecting it will quietly capture all internet-bound traffic from the affected subnets and funnel it through the tunnel, and if the on-premises side is not prepared to route that traffic onward to the internet, the result is that Azure resources lose outbound connectivity entirely. The symptom is dramatic and confusing: resources that could reach the internet a moment ago suddenly cannot, with no change on the Azure side, because a route learned over BGP rewrote their default path. The fix is to control default-route advertisement explicitly on both ends, advertising it only when forced tunneling is genuinely intended and the on-premises egress is ready to carry the load, and filtering it otherwise so a peer cannot impose it unexpectedly.
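The filtering rule described above is small enough to sketch. The following is a minimal illustration of the decision, not the behavior of Azure or of any particular router: the function name `filter_learned_prefixes`, the `allow_default` flag, and the example prefixes are all hypothetical, and the sketch covers only the IPv4 default route.

```python
import ipaddress

# The IPv4 default route; a peer advertising this captures all internet-bound traffic.
DEFAULT_ROUTE = ipaddress.ip_network("0.0.0.0/0")

def filter_learned_prefixes(advertised, allow_default=False):
    """Split advertised prefixes into (accepted, rejected).

    The default route is rejected unless forced tunneling is intended,
    which the caller signals with allow_default=True (hypothetical flag).
    """
    accepted, rejected = [], []
    for prefix in advertised:
        network = ipaddress.ip_network(prefix)
        if network == DEFAULT_ROUTE and not allow_default:
            rejected.append(prefix)
        else:
            accepted.append(prefix)
    return accepted, rejected

# Example with made-up prefixes: the default route is filtered, the site prefix is kept.
accepted, rejected = filter_learned_prefixes(["10.10.0.0/16", "0.0.0.0/0"])
print("accepted:", accepted)
print("rejected:", rejected)
```

Where such a filter actually runs, for instance as an outbound policy on the on-premises device, is a design decision the sketch does not settle; the point is only that the default route is accepted by explicit choice rather than by default.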

The deliberate version requires planning the on-premises side for the additional egress volume, because all of Azure’s internet-bound traffic now traverses the tunnel and the corporate egress, which can be a significant load the on-premises network was never sized for. It also interacts with services that expect direct outbound connectivity, some of which break or degrade when their egress is forced through a distant inspection point, so the design has to account for the traffic that must remain direct. Forced tunneling is a powerful control when chosen on purpose with the egress capacity to match, and a baffling outage when it arrives by an unfiltered BGP advertisement, and the difference is entirely in whether the default route is governed deliberately.

Monitoring, alerting, and the cost of the connection

A gateway that is merely provisioned is not the same as a gateway that is known to be healthy, and the difference is monitoring. Three signals carry most of the diagnostic value. The connection status tells you whether each tunnel is in the connected state, and an alert on a tunnel leaving connected catches a dropped link before users report it. The tunnel ingress and egress byte counters tell you whether traffic is actually flowing and, watched over time, whether throughput is approaching the SKU ceiling, which is the early warning that a resize is coming. The BGP peer status tells you whether the routing session is established, and an alert on a peer leaving established catches the silent failure where a tunnel stays up but stops learning routes, which is otherwise invisible until traffic for a learned prefix starts black-holing.

Watching these together turns the three most common failure modes into notifications rather than incidents. A tunnel that leaves connected is the dropped-link failure. A throughput trend pressing the ceiling is the undersized-SKU failure arriving in slow motion. A BGP peer that leaves established is the dynamic-routing failure that produces a connected-but-no-traffic symptom. Each has an alert that fires before the user impact, and a connection instrumented this way fails loudly and early rather than quietly and late.
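As a rough sketch of how the three signals map to the three failure modes, the evaluation can be written as a single function. Everything here is an assumption for illustration: the `GatewaySample` fields, the alert names, and the 80 percent threshold are hypothetical and do not correspond to any Azure Monitor metric names or recommended values.

```python
from dataclasses import dataclass

@dataclass
class GatewaySample:
    tunnel_connected: bool    # connection status signal
    throughput_mbps: float    # derived from ingress/egress byte counters
    sku_ceiling_mbps: float   # aggregate ceiling of the chosen SKU
    bgp_established: bool     # BGP peer status signal

def evaluate(sample, ceiling_fraction=0.8):
    """Return the list of alerts a single sample should raise."""
    alerts = []
    if not sample.tunnel_connected:
        alerts.append("tunnel-not-connected")
    if sample.throughput_mbps >= ceiling_fraction * sample.sku_ceiling_mbps:
        alerts.append("throughput-near-sku-ceiling")
    # The silent failure: the tunnel is up but the routing session is not.
    if sample.tunnel_connected and not sample.bgp_established:
        alerts.append("connected-but-bgp-down")
    return alerts

print(evaluate(GatewaySample(True, 600.0, 650.0, False)))
```

The BGP alert is raised only while the tunnel is connected, because a down tunnel already explains a down session; separating the two keeps the connected-but-no-traffic case from being buried under the dropped-link alert.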

The cost of a VPN gateway has two parts worth understanding. The first is the gateway itself, billed by the hour for the chosen SKU, where a higher tier costs more but the difference between adjacent VpnGw tiers is modest relative to the cost of an outage from undersizing. The second is the data that egresses Azure through the tunnel, which is billed as outbound data transfer, so a high-volume connection carries an ongoing data cost on top of the hourly gateway charge. The practical consequence is that sizing decisions trade a small, predictable hourly cost against the risk of throttling, which favors headroom, while high-egress designs should account for the data-transfer component when comparing a VPN tunnel against alternatives like an ExpressRoute circuit, whose cost structure differs and which can be more economical at sustained high volume. The honest cost comparison weighs the hourly gateway charge, the egress data charge, and the operational cost of the failure modes each option carries, rather than the sticker price of the gateway alone.
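The two-part cost is simple arithmetic, sketched below. The function deliberately has no built-in prices: the rates in the usage line are placeholders invented for the example, not Azure pricing, and real figures have to come from the current price list for the region and SKU in question.

```python
def monthly_cost(hourly_rate, hours, egress_gb, egress_rate_per_gb):
    """Gateway-hours charge plus outbound data-transfer charge for one month."""
    gateway_charge = hourly_rate * hours
    egress_charge = egress_gb * egress_rate_per_gb
    return gateway_charge + egress_charge

# Placeholder inputs only: 0.50 per hour, 730 hours, 2000 GB out at 0.08 per GB.
print(round(monthly_cost(0.50, 730, 2000, 0.08), 2))
```

Running the same function with a higher tier's hourly rate shows how little the gateway term moves compared with the egress term on a high-volume link, which is the comparison the paragraph above asks for.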

Frequently Asked Questions

Q: What is an Azure VPN gateway and how does it work?

An Azure VPN gateway is a managed pair of virtual machine instances, deployed into a dedicated subnet named GatewaySubnet, that terminate IPsec and IKE tunnels and forward traffic between an Azure virtual network and a remote network or client. You do not see or patch the underlying instances; you choose a gateway type and a SKU, and Azure provisions and maintains the compute that applies the encryption, the routing, and the high-availability behavior. The gateway negotiates an encrypted tunnel with a remote device for site-to-site connectivity or with an individual client for point-to-site connectivity, hands the far end the routes it needs, and carries traffic across the tunnel. Everything else about the gateway, its throughput, its tunnel count, its redundancy, and its support for dynamic routing, follows from the type and the SKU you select at creation, which is why those two choices deserve more deliberation than the rest of the configuration combined.

Q: What is the difference between a route-based and a policy-based VPN gateway?

A policy-based gateway selects traffic for the tunnel using static traffic selectors, fixed pairs of local and remote prefixes, and as a consequence supports exactly one tunnel and one connection with no dynamic routing, no active-active, and no point-to-site. A route-based gateway presents the tunnel as a routable virtual interface, so any packet whose route points at the gateway enters the tunnel, which lets it carry many tunnels at once, BGP for dynamic routing, active-active redundancy, point-to-site clients, and the coexistence of all of these on one gateway. The route-based design holds essentially the entire modern feature set, while policy-based survives only as a compatibility option for legacy devices that cannot do route-based IKEv2 and need a single static tunnel. The practical guidance is to choose route-based by default and treat policy-based as the rare exception you reach for only when a remote device genuinely leaves you no other choice.

Q: When should I still use a policy-based VPN gateway?

Use a policy-based gateway only when the remote device cannot do route-based IKEv2, is locked to IKEv1 with policy-based traffic selectors, and the connection is a single tunnel between one local prefix and one remote prefix with no need for dynamic routing, redundancy, or roaming clients. That narrow profile is the entire remaining case. It appears with older firewalls and some partner networks whose security teams will not move off a policy-based configuration, and even then many of those devices can be coaxed into route-based IKEv2, or a route-based Azure gateway can be configured with policy-based traffic selectors as an option to interoperate. Because policy-based forecloses multiple tunnels, BGP, active-active, and point-to-site, choosing it commits you to a structurally limited connection, so the right reflex when policy-based seems necessary is to confirm that the remote device truly cannot do route-based before accepting the limitation.

Q: What do the VPN gateway SKUs determine?

The SKU determines three things: the aggregate throughput the gateway will carry across all its tunnels, the number of site-to-site tunnels and point-to-site connections it will terminate, and which advanced features, such as active-active and zone-redundant deployment, are available at all. The ladder runs from the Basic SKU, suitable only for development and the simplest connectivity, up through the VpnGw generational tiers and their zone-redundant AZ variants, with each step raising throughput and connection counts. The published throughput is an aggregate ceiling under favorable conditions, not a guaranteed per-tunnel rate, so it should be treated as a planning ceiling to size against with headroom and verified with a real transfer. The tunnel count is the limit that most often breaks expansion plans, because a hub gateway expected to fan out to many branches will refuse connections beyond its SKU’s tunnel limit, so the count has to be checked against the planned branch total with room to grow.

Q: Which VPN gateway SKU should I choose for production?

Choose the lowest VpnGw tier whose aggregate throughput clears your measured peak with comfortable headroom and whose tunnel and connection limits exceed your planned count, and prefer the zone-redundant AZ variant in any region that offers availability zones. Avoid the Basic SKU for production, because it omits active-active, BGP, and the zone-redundant model, so building on it usually forces a disruptive rebuild later. The cost difference between adjacent VpnGw tiers is small relative to the cost of an outage from an undersized gateway throttling a critical stream, so sizing with headroom is the economical choice over the connection’s life. Allocate a static public IP at creation so that even a forced recreate, which is required when moving between certain SKU families, can reuse the same address and spare you from reconfiguring every remote peer. Picking the right family at creation matters more than the upgrade path, because some moves require delete and recreate rather than an in-place resize.
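The selection rule can be expressed as a short function. This is a sketch under stated assumptions: the tier names and limits in the example table are hypothetical placeholders rather than published Azure figures, and the 30 percent headroom default is an arbitrary illustration of "comfortable headroom".

```python
def pick_tier(tiers, peak_mbps, tunnels_needed, headroom=0.3):
    """Return the name of the lowest tier that fits, or None if none does.

    tiers: list of (name, throughput_mbps, max_tunnels), ordered smallest first.
    """
    required_mbps = peak_mbps * (1 + headroom)
    for name, throughput_mbps, max_tunnels in tiers:
        # Both limits must hold: throughput with headroom, and the tunnel count.
        if throughput_mbps >= required_mbps and max_tunnels >= tunnels_needed:
            return name
    return None

# Hypothetical placeholder table; substitute the published limits for real planning.
EXAMPLE_TIERS = [
    ("tier-small", 500, 10),
    ("tier-medium", 1000, 30),
    ("tier-large", 5000, 100),
]

print(pick_tier(EXAMPLE_TIERS, peak_mbps=400, tunnels_needed=8))
print(pick_tier(EXAMPLE_TIERS, peak_mbps=300, tunnels_needed=40))
```

In the second call the throughput requirement is easily met by the smallest tier, but the tunnel count pushes the choice two steps up, which is the expansion trap described in the SKU answer above.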

Q: Can I resize a VPN gateway without downtime or a rebuild?

You can resize in place within the same VpnGw generation, for example from VpnGw1 to VpnGw2, keeping the public IP and the connections, but moving between certain families, such as from Basic to VpnGw or from a non-zonal SKU to a zone-redundant AZ SKU, requires deleting and recreating the gateway. A recreate changes the public IP unless you allocated a static one in advance, and a changed IP means updating every peer device, so the resilient practice is to allocate a static public IP from the start and to pick the correct SKU family at creation rather than relying on a painless upgrade. Even an in-place resize within a generation can involve a brief interruption while the gateway adjusts, so resizes should be scheduled into a maintenance window rather than performed casually on a live production link. Sizing for the connection’s expected growth at creation almost always costs less than the operational disruption of a later resize that touches remote peers.

Q: What is the difference between site-to-site and point-to-site VPN?

Site-to-site connects an entire remote network to Azure over an IPsec tunnel terminated by a physical or virtual VPN device on the far end, and it trusts that network: once the tunnel is up, the hosts behind the remote device reach the permitted Azure subnets through it. Point-to-site connects an individual client, a laptop running VPN software, to the virtual network with no on-premises device at all, and it trusts each client individually, authenticating every device with a certificate, with Microsoft Entra ID (formerly Azure AD), or with RADIUS before granting it an address and routes. Site-to-site suits data centers and branch offices, which are networks with devices, while point-to-site suits roaming users, contractors, and administrators, who are individuals without a fixed network. A route-based gateway carries both at once, so the common production shape is branches on site-to-site and remote people on point-to-site through the same gateway, each with the trust model that fits.

Q: How does active-active mode improve VPN gateway resilience?

In the default active-standby mode, one of the gateway’s two instances carries the tunnels while the other waits idle, so a failure or a maintenance event on the active instance causes a brief interruption while the standby takes over and routing reconverges. Active-active mode runs both instances at once, each with its own public IP and its own tunnel to the remote device, so the loss of one instance leaves the other already forwarding rather than waiting to be promoted, which removes the reconnection gap. It requires a route-based gateway and a supporting SKU, and it works best with BGP so the routing layer converges around the surviving tunnel automatically. For maximum resilience, an active-active Azure gateway is paired with two on-premises devices so that no single failure on either side severs connectivity. Active-active is worth its modest extra complexity wherever a brief tunnel drop is unacceptable, and unnecessary where a short reconnection is harmless.

Q: How does BGP routing work over an Azure VPN gateway?

With BGP enabled on both the Azure side and the remote device, the two establish a BGP session across the IPsec tunnel, each identified by an autonomous system number, and they exchange route advertisements: Azure announces the virtual network prefixes and any it learned from other connections, while the peer announces the on-premises prefixes. Each side installs the learned routes with the tunnel as the next hop, and withdrawals propagate automatically when a prefix becomes unreachable, so routing tables update themselves as networks change rather than requiring hand-edited static routes. The gateway uses a specific BGP peer address, which on some configurations is an APIPA address in a reserved range, and the remote device must peer with that exact address, while the autonomous system numbers on the two sides must avoid Azure’s reserved values and must not collide unless a same-ASN setup is intended. When a BGP session will not establish over a healthy tunnel, the peer address and the ASN are the first two values to verify.
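A pre-flight check on the ASN pairing might look like the sketch below. The function name and messages are hypothetical, and the reserved set is passed in rather than hardcoded: the example values are the Azure-reserved ASNs as I recall them from the Azure documentation and should be verified against the current list before being relied on. The sketch checks only the remote device's ASN against that set, on the assumption that the gateway's own ASN is chosen on the Azure side.

```python
def check_peer_asn(gateway_asn, remote_asn, reserved, same_asn_intended=False):
    """Return a list of problems with the remote device's ASN choice."""
    problems = []
    if remote_asn in reserved:
        problems.append(f"remote ASN {remote_asn} is in the reserved set")
    if remote_asn == gateway_asn and not same_asn_intended:
        problems.append("remote ASN collides with the gateway ASN")
    return problems

# Assumed example values; confirm against current Azure documentation.
EXAMPLE_RESERVED = {8074, 8075, 12076, 65515, 65517, 65518, 65519, 65520}

print(check_peer_asn(gateway_asn=65010, remote_asn=65020, reserved=EXAMPLE_RESERVED))
print(check_peer_asn(gateway_asn=65010, remote_asn=65010, reserved=EXAMPLE_RESERVED))
```

The same shape extends naturally to the peer address: compare the address the remote device is configured to peer with against the address the gateway reports, and treat any difference as the first suspect.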

Q: Why does my VPN tunnel show connected but no traffic passes through it?

A connected tunnel that carries no traffic means the IPsec layer succeeded but the routing or filtering layer above it is dropping the packets. The usual causes are a missing route to the remote prefix, a user-defined route with a more specific prefix overriding the gateway route by longest-prefix match, a network security group denying the flow, or, on a policy-based gateway, a traffic selector mismatch so the security association never covers the traffic you are testing. Debug it by layers: read the effective routes on a subnet interface and confirm the remote prefix points at the gateway, then check for an overriding user-defined route, then check the network security groups, and on a policy-based gateway compare the traffic selectors on both ends. The encryption is fine, which is why the tunnel shows connected, so reconfiguring the shared key or the cipher wastes time; the fault lives in routing or filtering, and reading the effective routes first usually finds it.
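The longest-prefix-match override is easy to see in a small model. The sketch below uses Python's standard `ipaddress` module; the route table, the next-hop labels, and the function name are made up for illustration, and the model covers only prefix length, not any tie-breaking between route sources.

```python
import ipaddress

def effective_next_hop(routes, destination):
    """Return the next hop of the most specific route covering destination.

    routes: list of (prefix, next_hop) pairs. Returns None if nothing matches.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return None if best is None else best[1]

# Made-up table: a /16 learned via the gateway and a /24 user-defined route inside it.
routes = [
    ("10.50.0.0/16", "gateway"),
    ("10.50.4.0/24", "appliance"),
]
print(effective_next_hop(routes, "10.50.4.10"))
print(effective_next_hop(routes, "10.50.9.10"))
```

A destination inside the /24 follows the user-defined route even though the gateway route also covers it, which is why a test against one remote host can fail while a test against its neighbor in the same /16 succeeds.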

Q: Why does my VPN tunnel come up and then drop on a regular interval?

A tunnel that establishes cleanly and then drops on a predictable cadence is usually rekeying badly: the two endpoints disagree about the IPsec and IKE security association lifetimes, so when the keys are due for renegotiation the sides fall out of step and the tunnel resets. The encryption and authentication are correct, which is why it came up at all, and the renegotiation is where they diverge, so the fix is to align the lifetimes on both ends or to set a custom IPsec policy on the Azure connection whose lifetimes match the remote device. The cadence is the diagnostic tell. Regular, clock-like drops point at rekey; irregular drops tied to packet loss point at an unstable physical path; and drops correlated with quiet periods point at an idle timeout on a device that tears down a tunnel with no interesting traffic. Each cause has a different remedy, so the pattern of the drops should be characterized before any parameter is changed.
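Characterizing the pattern can start with nothing more than the timestamps of the drops. The sketch below separates clock-like drops from irregular ones by comparing the spread of the intervals to their mean; the function name, the 5 percent tolerance, and the sample timestamps are assumptions for illustration, and the sketch does not attempt the third case, correlating drops with idle periods, which needs traffic data as well.

```python
from statistics import mean, pstdev

def drop_cadence(drop_times_s, tolerance=0.05):
    """Classify drop timestamps (in seconds) by how regular their spacing is."""
    times = sorted(drop_times_s)
    if len(times) < 3:
        return "insufficient-data"
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    average = mean(intervals)
    if average <= 0:
        return "insufficient-data"
    # Relative spread: near zero means the drops arrive on a clock.
    spread = pstdev(intervals) / average
    return "regular" if spread <= tolerance else "irregular"

print(drop_cadence([0, 3600, 7200, 10800]))
print(drop_cadence([0, 500, 4000, 4100, 9000]))
```

When the result is regular, the mean interval is worth comparing against the security association lifetimes configured on each end, since a cadence that matches one side's lifetime points directly at the rekey.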

Q: Why can my hub virtual network reach on-premises but my spoke cannot?

A spoke that cannot reach on-premises while the hub can almost always has a half-configured gateway transit. Gateway transit lets spokes use the hub’s gateway, but it has two halves that must both be set: the hub-to-spoke peering must allow gateway transit, and the spoke-to-hub peering must use the remote gateway. If either half is missing, the spoke never inherits the path to the tunnel. Beyond the peering, the on-premises side must know about the spoke’s address range, because a tunnel scoped only to the hub’s range will not carry traffic for a spoke prefix the remote device has never heard of; with BGP this propagates once the spoke prefix is advertised, and with static routing the spoke prefix has to be added to the local network gateway. Finally, check the effective routes on a spoke interface to confirm the on-premises prefix points at the gateway and is not overridden by a user-defined route. Those three checks resolve almost every spoke-reachability problem.
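The two-halves check on the peering lends itself to a small validation. In the sketch below each peering is a plain dictionary; the keys `allowGatewayTransit` and `useRemoteGateways` follow the peering property names as I recall them from the Azure resource model, and should be treated as an assumption to confirm against however the peering settings are actually exported.

```python
def transit_problems(hub_to_spoke, spoke_to_hub):
    """Return which half of gateway transit is missing, if any."""
    problems = []
    if not hub_to_spoke.get("allowGatewayTransit", False):
        problems.append("hub-to-spoke peering does not allow gateway transit")
    if not spoke_to_hub.get("useRemoteGateways", False):
        problems.append("spoke-to-hub peering does not use the remote gateway")
    return problems

# Example: the hub half is set, the spoke half was forgotten.
print(transit_problems(
    {"allowGatewayTransit": True},
    {"useRemoteGateways": False},
))
```

An empty result clears only the first of the three checks; the remote side's knowledge of the spoke prefix and the spoke's effective routes still have to be verified separately.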

Q: Can a VPN gateway and an ExpressRoute circuit coexist in the same network?

Yes, a virtual network can hold both a VPN gateway and an ExpressRoute gateway, and the common reasons are to use the VPN tunnel as a backup for the private circuit or to reach sites the circuit does not serve. When both are present and both advertise the same prefixes over BGP, route preference decides which path traffic takes, so the design must be explicit about which connection is primary and which is backup, or the failover will not behave as intended. The two gateways occupy the same gateway subnet and are sized and managed independently, and the coexistence is a deliberate resilience pattern rather than an accident. Teams weighing whether they need the tunnel, virtual network peering, or the private circuit should work through the comparison of those options, and teams that decide they need both the public-internet tunnel and the private circuit will find the ExpressRoute deep dive a natural companion to this article, since the private circuit has its own SKUs, peering types, and failure modes to reason about.

Q: What authentication options does point-to-site VPN support?

Point-to-site supports certificate-based authentication, where each client presents a certificate that chains to a root the gateway trusts, and identity-based authentication that integrates with Microsoft Entra ID (formerly Azure AD) or a RADIUS server so that directory identity and conditional access policies govern who connects. Certificate-based authentication is simple to stand up and lets you revoke a single client by revoking its certificate, while identity-based authentication ties VPN access to the same directory and policies that govern the rest of the estate, which is usually the stronger posture for an organization that already centralizes identity, because it allows conditional access to evaluate device posture, location, and risk at connection time. The tunneling protocol is a separate choice from the authentication: the gateway can offer an SSL-based protocol that traverses firewalls easily on a standard secure web port and an IKEv2-based protocol that some platforms prefer, and offering more than one lets different client operating systems connect with whatever they support best.

Q: How many simultaneous point-to-site clients can a VPN gateway support?

The number of simultaneous point-to-site clients a gateway accepts is governed by its SKU, so a gateway expected to serve a large remote workforce must be sized for the client count as well as for the site-to-site throughput it carries. The two demands are additive: the SKU has to clear the sum of the tunnel throughput and the concurrent client load. Beyond the SKU limit, the client address pool configured on the gateway must be large enough for the peak number of simultaneous connections, because the pool hands each connected client an address and a pool that is too small will refuse new clients once it is exhausted. The pool must also not overlap with the virtual network ranges or the on-premises ranges, since an overlapping pool produces routing conflicts that are difficult to diagnose. Size the SKU for the realistic peak rather than the average, because remote-work demand spikes, and a gateway that refuses connections at peak defeats the purpose.
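Both pool conditions, size and non-overlap, can be checked before deployment with the standard `ipaddress` module. The function name and the example ranges below are hypothetical, and the sketch treats the raw address count of the pool as an upper bound, on the assumption that the number of addresses actually handed to clients may be somewhat lower.

```python
import ipaddress

def check_client_pool(pool, peak_clients, other_ranges):
    """Return problems with a point-to-site client address pool."""
    pool_net = ipaddress.ip_network(pool)
    problems = []
    # Upper bound only: usable addresses may be fewer than num_addresses.
    if pool_net.num_addresses < peak_clients:
        problems.append(
            f"pool {pool} holds {pool_net.num_addresses} addresses, "
            f"fewer than {peak_clients} peak clients"
        )
    for other in other_ranges:
        if pool_net.overlaps(ipaddress.ip_network(other)):
            problems.append(f"pool {pool} overlaps {other}")
    return problems

# Made-up ranges: a virtual network /16 and an on-premises /16.
print(check_client_pool("172.16.201.0/24", 200, ["10.0.0.0/16", "192.168.0.0/16"]))
print(check_client_pool("10.0.5.0/28", 50, ["10.0.0.0/16"]))
```

The second call fails on both counts, a pool too small for the peak and carved out of the virtual network range, which is the combination that produces refused clients and confusing routing at the same time.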

Q: Why does my VPN throughput fall far below the SKU’s published rate?

The published SKU throughput is an aggregate ceiling across all tunnels under favorable conditions, not a guaranteed per-tunnel rate, and several factors pull a real rate below it. Latency is the largest: a single TCP flow’s throughput is bounded by the window size divided by the round-trip time, so a long-distance tunnel starves one flow long before the aggregate ceiling is reached, and packet loss compounds this by triggering congestion control. The remedy for a high-latency link is to parallelize the workload into many flows, since the aggregate of many can approach the gateway ceiling even when one cannot, which is why parallel-capable replication and backup tools saturate a gateway that a single-threaded copy leaves idle. A separate cause is a packet-size or MTU problem: the IPsec encapsulation adds overhead, and if the path does not accommodate the resulting size, fragmentation slows large transfers while small packets look fine. A tunnel that is fast on small packets and slow on large transfers usually has an MTU issue rather than a capacity one, and clamping the maximum segment size often restores the rate.
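The window-over-round-trip bound is worth working through once with numbers. The sketch below computes the ceiling for a single flow and the number of parallel flows needed to reach a target rate; the function names are hypothetical, the window and latency are example inputs, and the model ignores packet loss and protocol overhead, so real results will be lower.

```python
import math

def max_flow_mbps(window_bytes, rtt_ms):
    """Upper bound on one TCP flow: window size divided by round-trip time."""
    bits_per_second = (window_bytes * 8) / (rtt_ms / 1000)
    return bits_per_second / 1_000_000

def flows_needed(target_mbps, window_bytes, rtt_ms):
    """Parallel flows required for the aggregate to reach target_mbps."""
    return math.ceil(target_mbps / max_flow_mbps(window_bytes, rtt_ms))

# Example inputs: a 64 KiB window over an 80 ms round trip.
per_flow = max_flow_mbps(65536, 80)
print(round(per_flow, 2), "Mbps per flow")
print(flows_needed(500, 65536, 80), "flows for 500 Mbps")
```

With these inputs a single flow tops out at about 6.55 Mbps regardless of the gateway size, and reaching 500 Mbps takes 77 such flows, which is why a single-threaded copy over a long-distance tunnel looks like a capacity problem when it is a latency problem.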

Q: How do I secure an Azure VPN gateway and its tunnels?

Harden the gateway by setting a custom IPsec and IKE policy that pins strong, current algorithms so the negotiation cannot fall back to a weaker proposal, by using a long random shared key stored in a secret store and rotated on a schedule, or certificates where the devices support them, and by choosing identity-based authentication with conditional access for point-to-site so directory policy governs who connects. The ability to revoke a single client without disrupting others is essential for roaming endpoints that are routinely lost or offboarded. Because the gateway presents a public IP to terminate tunnels, ensure only the intended ports and protocols are reachable and that the resources behind the gateway are segmented, so a compromise of one connected network does not become free movement across the whole estate. The gateway opens an encrypted door, and the segmentation behind it decides how far an intruder who comes through can travel, so the gateway and the internal network design are two halves of one security posture.

Q: Why does a freshly built VPN tunnel never establish at all?

A new tunnel that never comes up has failed in the IKE negotiation, and which phase it failed in points at the cause. A first-phase failure means the endpoints cannot agree on who they are or how to talk: a wrong shared key, a mismatched IKE proposal because one side requires a cipher the other did not offer, or an unreachable peer because a firewall on the path is blocking the negotiation ports. A second-phase failure means they authenticated but cannot agree on how to protect the data: a mismatched IPsec proposal, or a traffic-selector disagreement on the route-based versus policy-based boundary. The single most common cause in practice is a policy mismatch: Azure negotiates a default suite that interoperates with many devices, but the moment a remote security team pins a specific cipher suite, the Azure connection’s custom IPsec policy has to match it exactly, parameter for parameter. Compare the two ends’ proposals and shared key before touching anything else, because that is where most never-up tunnels are decided.
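Comparing the two ends parameter for parameter is mechanical once both proposals are written down in the same form. The sketch below diffs two dictionaries; the parameter names and the example values are illustrative assumptions, not the exact field names of the Azure connection policy or of any vendor's configuration.

```python
def proposal_mismatches(local, remote):
    """Return {parameter: (local_value, remote_value)} for every difference.

    A parameter present on only one side is reported with None for the other.
    """
    mismatches = {}
    for key in sorted(set(local) | set(remote)):
        if local.get(key) != remote.get(key):
            mismatches[key] = (local.get(key), remote.get(key))
    return mismatches

# Illustrative proposals; the two sides differ in DH group and SA lifetime.
azure_side = {"ike_encryption": "AES256", "ike_integrity": "SHA256",
              "dh_group": "DHGroup14", "sa_lifetime_s": 27000}
remote_side = {"ike_encryption": "AES256", "ike_integrity": "SHA256",
               "dh_group": "DHGroup2", "sa_lifetime_s": 3600}

for name, (ours, theirs) in proposal_mismatches(azure_side, remote_side).items():
    print(f"{name}: local={ours} remote={theirs}")
```

An empty result does not prove the tunnel will establish, since the shared key and path reachability sit outside the proposal, but a non-empty one identifies the exact parameters to align before anything else is changed.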

Q: How do I make a VPN gateway deployment repeatable and auditable?

Express the entire deployment, the gateway, its static public IP, its connections, its local network gateways, its IPsec policy, and its BGP settings, as an infrastructure-as-code module in Bicep or Terraform, reference the shared keys from a secret store rather than embedding them, and version the module so every change is reviewed and reproducible. VPN gateways are long-lived, infrequently changed, and catastrophic when they fail, which is exactly the profile where hand-built portal configuration drifts and nobody remembers the exact settings until an outage forces a rebuild. A versioned module captures the gateway type, the SKU, the cipher policy, the ASNs and peer addresses, and the connection topology in a form that can be diffed and redeployed, so a recreate after a SKU-family change reproduces the working configuration exactly, and it makes the security posture auditable because the cipher suite, the static IP, and the secret reference are all visible in source rather than buried in blades nobody opens until something breaks.