Outbound Connectivity and SNAT in Azure

A web service runs cleanly for weeks. Then a marketing push triples the traffic for an afternoon, and the application starts failing in a way that makes no sense to the people staring at the dashboards. Inbound requests still arrive. The virtual machines report healthy. The database answers. Yet calls the application makes to an external payment API begin timing out, then a downstream weather feed stops responding, then a third-party logging endpoint goes quiet. The network team checks the route table, the security rules, and the public address, and finds nothing wrong. By the time anyone thinks to correlate the failures, the traffic has dropped back to normal and the symptoms have vanished, leaving a mystery that recurs at every load spike.


That mystery has a single name, and it is one of the most misunderstood limits in Azure: outbound source network address translation, or SNAT, and the finite pool of translation ports that backs it. The behavior confuses engineers because it inverts the usual mental model. People reason carefully about inbound capacity, about how many requests per second a load balancer can accept and how many instances stand behind it, and they almost never reason about how many simultaneous connections those instances can open going the other direction. Azure quietly translates every outbound flow from a private address to a shared public one, and that translation draws from a pool of ports that is large but not infinite. When the pool runs dry, new outbound connections fail while everything else keeps working, which is exactly why the failure looks like a phantom.

This article builds the model that makes the phantom legible. By the end you will understand how a packet from a private virtual machine reaches the public internet, why each shared public address can support only a bounded number of concurrent flows, what drives that pool toward zero, and how a NAT gateway changes the arithmetic so completely that the exhaustion problem largely disappears. You will also understand a change in defaults that matters for every new deployment: the implicit outbound access Azure has historically handed to virtual machines is being retired, which means outbound is moving from something you got for free to something you must choose on purpose.

The InsightCrunch outbound model

Before the mechanics, here is the map the rest of the article fills in. We call it the InsightCrunch outbound model, and it has three layers: how ports are allocated, what drives a given path toward exhaustion, and which fixes scale. The point of naming it is that once you can place a symptom on this map, the diagnosis and the remedy follow without guesswork.

Layer	What it describes	The decisive fact
Allocation	How a public address’s translation ports are handed to instances	A single public IP offers roughly 64,000 usable ports, and how they are divided depends entirely on the outbound method
Exhaustion drivers	What pushes the usable pool toward zero	Many short-lived flows to the same destination, a single shared address serving many instances, and the absence of connection reuse
Fixes ranked by scale	The remedies, ordered by how far they move the ceiling	Connection pooling reduces demand, load balancer outbound rules let you size supply manually, and a NAT gateway raises supply by orders of magnitude through on-demand allocation

The namable claim this model produces is the NAT-gateway-for-scale rule: because translation ports are finite per public address, any workload that opens many outbound connections should route egress through a NAT gateway, which provides far more usable ports than an implicit address or a load balancer outbound path and hands them out on demand rather than carving them up in advance. The rest of this piece is the reasoning behind that rule and the configuration that realizes it.

How a packet actually leaves Azure

Start with the situation that makes translation necessary at all. A virtual machine sits in a subnet with a private address, something in the RFC 1918 ranges like 10.0.1.4. That address is meaningful only inside the virtual network. It cannot appear as the source of a packet on the public internet, because the public internet has no route back to it and because the address space is not globally unique. So when the application on that machine opens a connection to an external endpoint, something has to stand in for the private address with a public one that return traffic can find.

In Azure, that something is SNAT. The platform rewrites the source of the outbound packet, replacing the private address and the port the application chose with a public address and a port drawn from a managed pool. It records the mapping in a translation table. When the reply comes back addressed to that public address and port, the platform looks up the table, reverses the rewrite, and delivers the packet to the original private address and port. The application never sees any of this. From inside the virtual machine, it looks like the machine talked directly to the internet, when in fact every byte passed through a translation layer that swapped addresses on the way out and swapped them back on the way in.

The unit that the translation table keys on is the flow, identified by a five-part tuple: the protocol, the source address, the source port, the destination address, and the destination port. This tuple matters more than any other single idea in the article, because it is what determines how many flows a public address can support. After translation every flow carries the same public source address, so what remains to tell flows apart is the protocol, the translated source port, and the destination address and port. As long as two flows differ in protocol or destination, they can share the same public address and the same translation port without colliding, because the platform can match each reply by the parts that differ. Two flows from the same instance to two different destination addresses can reuse one SNAT port, because the destination differs. Two flows to the same destination address and port must use different SNAT ports, because everything else about them is identical and the port is the only thing left to distinguish them.
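The reuse rule can be sketched as a toy translation table for one public address. This is an illustration of the tuple logic only, not Azure's allocator: the `SnatTable` class, the four-port pool, and the documentation-range addresses are all assumptions made for the example.

```python
class SnatTable:
    """Toy SNAT table for a single public address (illustrative only)."""

    def __init__(self, public_ip, ports):
        self.public_ip = public_ip
        self.ports = list(ports)
        # Reverse map used for replies:
        # (protocol, snat_port, dst_ip, dst_port) -> (private_ip, private_port)
        self.reverse = {}

    def allocate(self, protocol, src_ip, src_port, dst_ip, dst_port):
        """Pick the first SNAT port not already in use toward this destination."""
        for snat_port in self.ports:
            key = (protocol, snat_port, dst_ip, dst_port)
            if key not in self.reverse:
                self.reverse[key] = (src_ip, src_port)
                return snat_port
        raise RuntimeError(f"no SNAT port left for {protocol} {dst_ip}:{dst_port}")

    def translate_reply(self, protocol, snat_port, remote_ip, remote_port):
        """Map a reply back to the private address and port that opened the flow."""
        return self.reverse[(protocol, snat_port, remote_ip, remote_port)]

    def ports_in_use(self):
        """Count distinct SNAT ports consumed across all flows."""
        return len({key[1] for key in self.reverse})


# Four flows to four different destinations: one SNAT port serves them all.
varied = SnatTable("203.0.113.10", range(1024, 1028))
for i in range(4):
    varied.allocate("tcp", "10.0.1.4", 50000 + i, f"198.51.100.{i + 1}", 443)

# Four flows to one destination: each needs its own SNAT port.
single = SnatTable("203.0.113.10", range(1024, 1028))
for i in range(4):
    single.allocate("tcp", "10.0.1.4", 50000 + i, "198.51.100.1", 443)
```

With a four-port pool, `varied` ends up using one port and `single` uses all four, so a fifth flow to the same destination raises an error. That is the exhaustion failure in miniature.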

That last sentence is the whole game. The reason SNAT port pressure exists is that flows headed to the same destination compete for distinct ports, and a workload that hammers one external endpoint with many simultaneous connections forces the platform to spend a fresh port on each one. The model that says “a public IP has 64,000 ports, so I can have 64,000 connections” is wrong in both directions. It undercounts when destinations vary, because port reuse across destinations stretches the pool far past 64,000 distinct flows. It overcounts when one destination dominates, because every flow to that single endpoint burns its own port, recently closed connections keep holding ports they no longer use, and an instance that shares the address with others can draw only on its own share of the raw port count.

For the wider context of how addressing, subnets, and routing fit together before translation enters the picture, the Azure networking fundamentals walkthrough lays the groundwork that this article builds on.

Why the port pool is finite

A TCP or UDP port number is sixteen bits, which gives 65,536 possible values, numbered zero through 65,535. The lowest-numbered ports are set aside rather than used for translation, so a single public address offers on the order of 64,000 ports available for translation. That figure is the hard ceiling behind every SNAT conversation, and no configuration can raise it for a single address, because it is a property of the protocol, not of Azure. The only ways to get more translation capacity are to add more public addresses, each of which brings its own pool, or to use those ports more efficiently through reuse and pooling.

The finiteness is easy to dismiss as a large-numbers problem. Sixty-four thousand sounds like plenty until you do the division. Suppose forty instances share one public address through a load balancer, and the platform divides the pool evenly so each instance gets a fixed slice. Forty into sixty-four thousand is sixteen hundred ports per instance. Now suppose each instance runs an application that opens connections to a single external API and tears them down after each call without reuse. Under a burst, an instance can easily have more than sixteen hundred connections to that one endpoint in flight or in the TCP teardown state that holds a port briefly after the connection closes. The moment demand crosses the slice, the next connection attempt has no port to claim, and it fails. The instance next to it may have ports to spare, but the static division does not let one instance borrow from another’s slice. That asymmetry, where the pool as a whole is far from empty but one instance’s slice is exhausted, is one of the most common and most confusing forms of the problem.

The teardown state deserves a moment because it surprises people. When a TCP connection closes, the side that actively closed it holds the local port in a waiting state for a period, traditionally to absorb any stray packets from the old connection before the port is reused. During that window the SNAT port stays allocated even though no data flows. A workload that opens and closes connections rapidly therefore consumes ports faster than the count of currently active connections suggests, because each recently closed connection still pins a port for the duration of that wait. This is why a service that makes many brief calls is far harder on the pool than a service that holds a few long-lived connections, even when the two move the same total bytes.
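A rough way to see the gap between active connections and pinned ports is to multiply the connection rate by the time each port stays held. The sketch below does that for a single destination. The function names, the 50 connections per second, the half-second call duration, and the 30-second hold are illustrative assumptions, not Azure's actual timers.

```python
def active_connections(connections_per_second, active_seconds):
    """Connections carrying data at any instant (rate x time in flight)."""
    return connections_per_second * active_seconds


def ports_pinned(connections_per_second, active_seconds, hold_seconds):
    """Steady-state SNAT ports held toward one destination.

    Each connection pins its port while active and again for the
    teardown hold afterwards, so the port stays busy for both periods.
    """
    return connections_per_second * (active_seconds + hold_seconds)


# Assumed workload: 50 brief calls per second, 0.5 s each, 30 s hold after close.
in_flight = active_connections(50, 0.5)
pinned = ports_pinned(50, 0.5, 30)
print(f"in flight: {in_flight:.0f}, ports pinned: {pinned:.0f}")
```

Under these assumed numbers about 25 connections are active at any moment while roughly 1,500 ports are pinned, which is why watching only the active connection count hides the pressure.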

The three ways out: implicit access, outbound rules, and a NAT gateway

Azure offers more than one path for outbound traffic, and the path a workload uses determines how its SNAT ports are allocated and therefore how soon it exhausts them. There are three to understand, and they sit on a clear ladder from least to most scalable.

The first is the implicit path, historically called default outbound access. For years, a virtual machine deployed without any explicit outbound configuration could still reach the internet, because Azure silently assigned it a public address for outbound use. No one configured it; it simply worked, which made it both convenient and dangerous. Convenient because a quick test machine could reach the internet with zero setup. Dangerous because the address was platform-managed and not under your control, the SNAT behavior was opaque, and a production workload could come to depend on something that was never meant to carry production load. This implicit access is the path most prone to silent exhaustion, because nothing about it advertises its limits and nothing lets you tune them.

The crucial change is that default outbound access is being retired. Microsoft has set a retirement date of September 30, 2025, after which new virtual machines created without an explicit outbound method will not receive implicit internet access. The replacement is to choose a method deliberately: associate a public address, place the workload behind a load balancer with outbound rules, or, the recommended option, attach a NAT gateway to the subnet. This is not a minor deprecation. It reframes outbound connectivity from a default you inherit into a design decision you own, and it means that a template or script that relied on the implicit path will produce machines with no internet egress once the retirement takes effect. Any team with infrastructure as code that omits an outbound method needs to add one before that date.

The second path is a load balancer with outbound rules. When instances sit behind a public load balancer, you can define outbound rules that bind a set of public addresses to the backend pool and let you set the number of SNAT ports preallocated per instance. This gives you control the implicit path never offered: you choose how many addresses back the pool, and you choose how the ports divide. The trade-off is that the division is still static. You decide in advance how many ports each instance gets, that allocation is reserved whether or not the instance uses it, and the manual nature of the sizing means you must calculate your peak concurrent flow needs and provision for them. Get the math wrong and you either waste address space or hit the same per-instance ceiling under load. The relationship between the load balancer and the application gateway, and which layer each operates at, is worth understanding when you choose where outbound lives; the load balancer versus application gateway comparison draws that line.

The third path, and the one the NAT-gateway-for-scale rule points to, is a NAT gateway attached to the subnet. A NAT gateway is a managed resource that takes over outbound translation for every instance in the subnets it serves. It changes two things at once, and both matter. First, it allocates ports on demand rather than dividing them up front, so any instance can draw from the full pool of all attached addresses as it needs, instead of being boxed into a fixed slice. Second, it lets you attach multiple public addresses, each contributing roughly 64,000 ports, so the supply scales linearly with the addresses you attach. A single address behind a NAT gateway already behaves better than the same address divided statically, because of on-demand allocation; attach several addresses and the ceiling rises to a level most workloads will never approach.

The allocation arithmetic, path by path

The difference between these paths is not philosophical. It is arithmetic, and seeing the numbers side by side is what makes the recommendation concrete.

Under a load balancer with default preallocation, the platform assigns ports per instance based on the size of the backend pool, on a sliding scale: small pools get many ports each, large pools get fewer, because the same address pool must stretch across more instances. A pool of a few instances might see a thousand or more ports apiece, while a pool of many dozens drops to a few hundred each from a single address. The exact figures follow a published table, but the shape is what matters: the more instances share a fixed set of addresses, the thinner each instance’s slice, and the sooner a busy instance exhausts it. To carry a large, chatty pool you must add public addresses and raise the per-instance allocation manually, which is precisely the manual sizing burden that makes this path tedious at scale.

Under a NAT gateway, the arithmetic inverts. A single attached address contributes its full pool of roughly 64,000 ports, and the gateway hands those ports to whichever instance needs one at the moment it needs it. There is no per-instance reservation to exhaust, so a single instance under a burst can draw far more ports than any static slice would have given it, as long as the subnet’s aggregate demand stays under the total. Attach a second address and the pool roughly doubles; attach the maximum of 16 and a subnet commands roughly a million ports. More important than the raw count is the on-demand behavior, because it means the pathological case of one instance starving while its neighbors sit idle simply cannot happen. The gateway also reuses ports across destinations more aggressively, stretching each address further when the workload talks to many endpoints.
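The contrast between static slices and an on-demand pool can be put into a few lines. This is a simplified model under stated assumptions: 64,000 ports per address (the article's round figure), an even static split rather than the platform's tiered defaults, every flow going to one destination, and hypothetical per-instance demand numbers.

```python
PORTS_PER_ADDRESS = 64_000  # round figure used in this article


def failed_flows_static(demands, addresses=1):
    """Flows that find no port when the pool is split evenly per instance."""
    slice_size = PORTS_PER_ADDRESS * addresses // len(demands)
    return sum(max(0, demand - slice_size) for demand in demands)


def failed_flows_on_demand(demands, addresses=1):
    """Flows that find no port when all instances draw from one shared pool."""
    total = PORTS_PER_ADDRESS * addresses
    return max(0, sum(demands) - total)


# Hypothetical fleet: 39 quiet instances and one hot instance under a burst.
demands = [100] * 39 + [5_000]
print("static split failures:", failed_flows_static(demands))
print("on-demand pool failures:", failed_flows_on_demand(demands))
```

In this sketch the hot instance overruns its 1,600-port slice by 3,400 flows even though the fleet as a whole wants fewer than 9,000 ports, while the shared pool absorbs the same demand with nothing failing.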

This is the heart of why the NAT gateway is the recommended outbound path and why the rule names it specifically. The load balancer’s static division forces you to predict peak per-instance demand and provision for it, and it punishes uneven load by letting one hot instance fail while capacity sits unused elsewhere. The NAT gateway’s on-demand allocation pools all the capacity and lets it flow to wherever the pressure is, which both raises the effective ceiling and removes the failure mode that static division creates.

The six patterns that drain the pool

Real exhaustion events fall into a small set of recurring shapes. Each is a different way the model bends toward zero, and naming them turns a vague “we ran out of ports” into a specific cause you can confirm and remove.

The first pattern is a chatty service with no connection reuse. An application opens a fresh connection for every outbound call, makes the call, and closes the connection, repeating this thousands of times a minute against the same endpoint. Because the calls all target one destination, every connection demands its own translation port, and because the connections close rapidly, each one pins a port through the teardown wait after it finishes. The result is a port consumption rate far higher than the count of genuinely active connections, and under load it crosses the ceiling quickly. This pattern is the single most common root cause, and it is also the most fixable, because it lives in the application code rather than the network.

The second pattern is a single shared public address serving a workload at scale. The implicit path and a thinly provisioned load balancer both funnel many instances through one or a few addresses, and the more instances share an address, the smaller each one’s effective allotment. The workload may be perfectly well behaved per instance, but the aggregate demand across all instances against a small address pool drives the shared pool down. The symptom looks like a platform limit, but the cause is undersupply: too few addresses backing too much concurrency.

The third pattern is many instances behind a load balancer with static preallocation, where the division leaves each instance too few ports for its peak. This is the asymmetry described earlier, where the pool as a whole is far from empty but an individual instance’s slice is exhausted because load landed unevenly. Autoscale makes it worse, because adding instances to a fixed address pool shrinks every instance’s slice at exactly the moment you scaled out to handle more load. A NAT gateway resolves this pattern directly, because its on-demand allocation does not slice the pool per instance at all.

The fourth pattern is load balancer outbound rules sized for average rather than peak. A team that does adopt outbound rules, intending to take control, can still under-provision them by calculating ports for typical traffic and forgetting the bursts. Outbound rules give you the knob, but the knob is only as good as the peak estimate behind it. When the estimate is low, the rules leave the ceiling below peak demand, and the workload exhausts its ports under load exactly as it did before.

The fifth pattern is reliance on the retiring default outbound access. A workload that depends on implicit egress is exposed twice: once to the silent exhaustion that the opaque implicit path is prone to, and again to the retirement itself, after which new instances on that path will have no internet access at all. Any workload still on the implicit path is on borrowed time and on the most fragile of the three paths while it lasts.

The sixth pattern is the absence of pooling at the protocol level even when the application intends to reuse connections. A connection pool that is configured too small forces the application to open and close connections under load despite the intent to reuse them, recreating the chatty-service pattern through misconfiguration rather than design. The fix here is not more ports but a correctly sized pool that keeps a working set of connections alive and hands them back to the application instead of tearing them down.

These patterns are why the same symptom shows up across very different workloads. A 503 from an App Service under load and a timeout from a virtual machine reaching a payment API can both trace to the same drained pool; the App Service 503 troubleshooting guide walks the symptom from the application side, while this article explains the mechanism underneath it.

Diagnosing exhaustion before it becomes a mystery

The reason SNAT exhaustion reads as a phantom is that the failure is specific and the dashboards people watch are general. Inbound metrics look fine because inbound is fine. Instance health looks fine because the instances are healthy. The only signal that points at the real cause is the outbound translation state, and it lives in metrics most teams never open until they go looking for it.

The metric to watch is the count of allocated and used SNAT ports, exposed on the load balancer and on the NAT gateway. A healthy workload shows used ports well below allocated ports with comfortable headroom; a workload approaching exhaustion shows used ports climbing toward the allocated ceiling, and the climb usually tracks the load curve. The companion metric is the count of failed connections or dropped SNAT attempts, which is the smoking gun: a nonzero and rising failure count under load, paired with used ports near the ceiling, is the signature of exhaustion and nothing else. When you see those two together, you have confirmed the cause and you can stop chasing route tables and security rules.
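The two-metric signature can be expressed as a small check over consecutive samples. This is a sketch only: the `PortSample` structure, the 95 percent threshold, and the sample values are assumptions for illustration, and real readings would come from whatever monitoring export you already use.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    used: int                 # SNAT ports in use at sample time
    allocated: int            # SNAT ports allocated to the instance or pool
    failed_connections: int   # cumulative failed outbound connection count


def looks_like_exhaustion(samples, ceiling_ratio=0.95):
    """True when used ports sit near the ceiling while failures are nonzero and rising."""
    if len(samples) < 2:
        return False
    previous, latest = samples[-2], samples[-1]
    near_ceiling = latest.used >= ceiling_ratio * latest.allocated
    failures_rising = 0 < latest.failed_connections > previous.failed_connections
    return near_ceiling and failures_rising


# Hypothetical readings, one minute apart.
healthy = [PortSample(40, 512, 0), PortSample(55, 512, 0)]
draining = [PortSample(498, 512, 12), PortSample(512, 512, 240)]
```

Requiring both conditions together keeps the check from firing on a busy but healthy instance, or on failures that have some other cause.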

The behavioral symptom that precedes the metric, and the one users feel, is a specific sequence. Outbound calls first slow down, because the application is spending time waiting for a port to free up, then they begin to time out as the wait exceeds the connection timeout, and the failures cluster around the busiest external destinations because those are the flows competing hardest for ports. The fact that internal traffic and inbound traffic stay healthy while only outbound to the public internet degrades is itself diagnostic, because it rules out the whole class of inbound and intra-network causes that teams reach for first.

When you need to confirm whether a given flow can leave the subnet at all, the diagnostic tooling that tests an outbound path end to end belongs to the broader networking toolkit rather than to SNAT specifically, but the distinction it draws is useful: a flow that is blocked by a security rule or a missing route fails differently from a flow that has nowhere to put its translation, and separating those two is the first step in any outbound investigation. VaultBook is the place to make this concrete: you can stand up a subnet, drive deliberate outbound load against a single endpoint until the metrics show the pool draining, then attach a NAT gateway and watch the same load run clean, which turns the abstract arithmetic into something you have seen happen. The hands-on Azure labs and command library on VaultBook carry the SNAT and NAT gateway exercises that reproduce this end to end.

A worked diagnosis: tracing one exhaustion event end to end

Numbers make the model concrete, so walk through one event as it actually unfolds. A reporting service runs on one hundred instances behind a single public load balancer, configured with one frontend public address and the platform’s default port preallocation. The service enriches each incoming report by calling a third-party scoring API, and the code that makes the call creates a fresh client, sends the request, reads the response, and disposes the client, once per report. On a normal afternoon the service handles a few hundred reports a second across the fleet, the per-instance call rate sits comfortably under the allotment, and nothing looks remarkable.

Begin with the supply side. With one hundred instances behind one address, the default preallocation lands each instance in the tier that grants 512 ports apiece, because the same roughly 64,000 ports must stretch across all hundred. So each instance can hold at most 512 concurrent translated flows to that scoring API at any instant, and that number is fixed regardless of how busy the rest of the fleet is.

Now the demand side under a spike. A scheduled batch arrives and the per-instance report rate jumps to roughly 300 a second on the busiest instances. Each report opens one connection to the scoring API, and because the code disposes the client after the call, each connection closes immediately and then sits in its teardown wait, holding its port for that window before the platform can reuse it. With the teardown window measured in tens of seconds and 300 new connections a second arriving, the count of ports pinned by recently closed connections climbs fast: ten seconds of teardown backlog alone pins about 3,000 ports’ worth of demand against a slice that holds only 512. The instance crosses its allotment in under two seconds, and the 513th simultaneous flow has no port to claim.
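The timing of the spike can be checked with a short calculation. The sketch assumes a 30-second hold per closed connection, which is one illustrative value within the "tens of seconds" described above and not a documented timer, and it ignores the brief time each call spends in flight.

```python
def steady_state_demand(new_connections_per_second, hold_seconds):
    """Ports that would be pinned if supply were unlimited."""
    return new_connections_per_second * hold_seconds


def seconds_until_exhausted(slice_ports, new_connections_per_second, hold_seconds):
    """Time for a fresh slice to drain, or None if demand fits within it.

    When the hold outlasts the drain time, no port is released before the
    slice runs out, so the slice empties at the raw arrival rate.
    """
    if steady_state_demand(new_connections_per_second, hold_seconds) <= slice_ports:
        return None
    return slice_ports / new_connections_per_second


demand = steady_state_demand(300, 30)
drain_time = seconds_until_exhausted(512, 300, 30)
print(f"demand: {demand} ports, slice of 512 drained in {drain_time:.2f} s")
```

Under these assumptions the busiest instance wants thousands of ports against a slice of 512, and the slice is gone in well under two seconds.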

What the team sees is the phantom. Inbound report ingestion looks healthy because inbound has nothing to do with the drained pool. The instances report healthy because they are. The internal database calls succeed because they never leave the private space and consume no translation ports. Only the calls to the scoring API degrade, first slowing as the application waits for a port to free, then timing out as the wait outlasts the connection timeout, and the failures concentrate on the busiest instances because those are the ones whose slices drained first. A team watching request latency and instance health sees green everywhere it looks, which is exactly why the cause hides.

The confirmation comes from one place: the outbound port metrics on the load balancer. Pull up used SNAT ports per instance against allocated, and the picture resolves at once. The healthy instances sit at a few dozen used ports against their 512 allotted; the failing instances are pinned at 512 used with the failed-connection counter climbing in lockstep with the report backlog. That pairing, used ports at the ceiling and failures rising together under load, is the signature, and seeing it ends the investigation. No route table, no security rule, and no instance health metric can produce that exact pattern, so its presence names the cause unambiguously.

The remedy follows the model’s ranking. The fastest win is on the demand side: change the code to reuse a single long-lived client so that the connections to the scoring API persist and carry many calls each instead of one. With reuse in place, the 300 reports a second on a busy instance can ride a handful of held connections rather than opening 300 fresh flows every second, and the per-instance port draw collapses from hundreds to a few. The structural win is on the supply side: attach a NAT gateway to the subnet so the instances no longer each carry a 512-port slice but instead draw from a shared, on-demand pool that no single instance can monopolize and that scales with each address attached. Applied together, the used-port metric that had been pinned at the ceiling drops to a thin baseline and stays there through the next spike, and the failures do not return. That before-and-after on a single metric is the cleanest evidence that the diagnosis was right.

The load balancer port-allocation table

The 512-port slice in the worked example is not arbitrary; it comes from a published sliding scale that the platform applies when a load balancer uses default outbound preallocation. The scale exists because a fixed supply of ports per frontend address must be divided across a backend pool whose size you choose, so the larger the pool, the thinner each instance’s share. Knowing the tiers turns capacity planning from guesswork into arithmetic, which is why the table is worth keeping as a reference.

Backend pool size (instances)	Preallocated SNAT ports per instance, per frontend IP
1 to 50	1,024
51 to 100	512
101 to 200	256
201 to 400	128
401 to 800	64
801 to 1,000	32

Read the table as a warning about scale. A modest pool of a few dozen instances enjoys a comfortable 1,024 ports each, and many workloads at that size never feel pressure. Cross into the hundreds of instances and the per-instance share falls to 256, then 128, then lower, at exactly the point where the total workload is largest and the outbound demand is heaviest. The scale punishes growth: scaling out to handle more inbound load simultaneously shrinks every instance’s outbound headroom, because the same address pool now divides across more instances. This is the structural reason a workload can pass every load test at small scale and then fail outbound the first time autoscale pushes it into a lower tier.
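The tiers in the table translate directly into a lookup, which makes the autoscale cliff easy to see. The function name and the error handling below are illustrative; the tier values are the ones listed in the table.

```python
# (largest pool size in the tier, default SNAT ports per instance per frontend IP)
DEFAULT_TIERS = [
    (50, 1024),
    (100, 512),
    (200, 256),
    (400, 128),
    (800, 64),
    (1000, 32),
]


def default_ports_per_instance(pool_size):
    """Return the default preallocation for a backend pool of the given size."""
    if pool_size < 1:
        raise ValueError("pool size must be at least 1")
    for largest_pool, ports in DEFAULT_TIERS:
        if pool_size <= largest_pool:
            return ports
    raise ValueError("pool size is beyond the range the table covers")


# Scaling out by one instance across a tier boundary halves every slice.
before = default_ports_per_instance(100)
after = default_ports_per_instance(101)
print(f"100 instances: {before} ports each; 101 instances: {after} ports each")
```

A capacity plan can run this against the autoscale maximum rather than the current instance count, since the maximum is the pool size that will be in effect when outbound demand peaks.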

The lever the table implies is the frontend address count. Because the allotment is per frontend address, attaching a second public address to the load balancer’s outbound configuration roughly doubles the ports available to divide, and outbound rules let you set the per-instance allocation explicitly rather than accepting the default tier. A pool of two hundred instances that would default to 256 ports each can be lifted by adding addresses and raising the manual allocation, at the cost of consuming more address space and doing the sizing math yourself. That manual burden, and the static reservation that wastes ports an instance never uses, is precisely what the NAT gateway’s on-demand model removes, which is why the table is best read as the case for moving past it rather than as a target to optimize within.
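The sizing math for manual allocation is a single division in each direction. The sketch below assumes 64,000 ports per frontend address, the round figure used throughout this article, and the function names are illustrative.

```python
import math

PORTS_PER_FRONTEND_IP = 64_000  # round figure used in this article


def frontend_ips_needed(instances, peak_ports_per_instance):
    """Smallest number of frontend addresses that covers the requested allocation."""
    return math.ceil(instances * peak_ports_per_instance / PORTS_PER_FRONTEND_IP)


def max_ports_per_instance(instances, frontend_ips):
    """Largest even split the given addresses can support."""
    return PORTS_PER_FRONTEND_IP * frontend_ips // instances


# Lifting a 200-instance pool from its default of 256 ports to 1,024 each.
ips = frontend_ips_needed(200, 1024)
ceiling = max_ports_per_instance(200, ips)
print(f"{ips} frontend addresses, up to {ceiling} ports per instance")
```

Raising the allocation fourfold for this pool takes four addresses in this model, and every port in that allocation stays reserved whether or not an instance uses it.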

The fixes, in the order you should reach for them

The fixes ranked by scalability in the model map to three concrete actions, and the order matters because the cheapest fix is often the most effective.

Reach first for connection reuse, because it reduces demand rather than adding supply and because the chatty-service pattern is the most common cause. In practice this means using a single long-lived client object instead of creating and disposing one per call, so that the underlying connections stay open and are reused across calls. In .NET that means a single shared HttpClient or an HttpClient created through a factory that pools the underlying handlers, never a new HttpClient per request inside a using block, which is the classic mistake that recreates the chatty pattern. For database access it means leaving connection pooling enabled and sized correctly so the pool holds a working set of connections rather than opening one per query. The same principle applies to any protocol: keep connections alive and hand them back to a pool. A workload that reuses connections can run an order of magnitude more requests through the same number of ports, because each port now carries a stream of calls instead of a single call.
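The effect of reuse shows up in how many distinct client source ports a server sees. The sketch below uses only Python's standard library against a throwaway local server, so it demonstrates the principle on loopback rather than through Azure SNAT. The handler and function names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PortRecordingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allow keep-alive so a connection can be reused
    seen_client_ports = set()

    def do_GET(self):
        # Record the source port of the connection that carried this request.
        PortRecordingHandler.seen_client_ports.add(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_source_ports(calls, reuse):
    """Make `calls` requests and return how many distinct source ports were used."""
    PortRecordingHandler.seen_client_ports = set()
    server = ThreadingHTTPServer(("127.0.0.1", 0), PortRecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(calls):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()
            if not reuse:
                conn.close()  # the chatty pattern: one connection per call
        if shared is not None:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return len(PortRecordingHandler.seen_client_ports)


if __name__ == "__main__":
    print("per-call connections:", count_source_ports(10, reuse=False), "source ports")
    print("one reused connection:", count_source_ports(10, reuse=True), "source port")
```

Each distinct source port here stands in for a flow that would need its own SNAT port toward that destination, so the reused connection carries all ten calls on a single port.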

Reach second for explicit supply through load balancer outbound rules when the workload sits behind a public load balancer and connection reuse alone is not enough. Outbound rules let you attach more public addresses and raise the per-instance port allocation, and they are the right tool when you want manual control and your instance count and peak demand are predictable. Size them for peak, not average, and remember that adding instances behind a fixed set of addresses shrinks each instance’s slice unless you add addresses to match.

Reach third, and for any workload at real scale reach first, for a NAT gateway. Attaching a NAT gateway to the subnet is the single most effective structural fix, because it replaces static per-instance division with on-demand allocation from a pool that grows with every address you attach. It removes the uneven-load failure mode entirely, it scales by simply adding addresses, and it makes the per-instance math irrelevant because there is no per-instance reservation. For a subnet whose workload opens many outbound connections, the NAT gateway is not one option among several; it is the default the platform now steers you toward, and the retirement of implicit access makes choosing it explicitly the path of least resistance.

These three fixes compose. The strongest production posture is a NAT gateway for supply, connection pooling in the application for efficient demand, and outbound metrics on a dashboard so the pool’s state is visible before it drains rather than after.

How outbound fits the rest of the network

SNAT does not operate in isolation, and several other parts of the network interact with it in ways that change the picture.

Routing comes first. A subnet’s effective routes decide where outbound traffic goes before any translation happens, and a user-defined route can send egress to a firewall or a network virtual appliance instead of straight to the internet. When that happens, the translation may occur at the appliance rather than at a NAT gateway, and the appliance’s own SNAT capacity becomes the limit. Forced tunneling, where a route sends internet-bound traffic back to on-premises through a gateway, removes Azure SNAT from the path entirely and hands the translation problem to the on-premises edge. The point is that the outbound method only governs translation for traffic that actually reaches it, and routing decides what reaches it. The full egress model, including how routes and the gateway interact, is part of the virtual network deep dive, which frames where SNAT sits in the subnet’s overall traffic flow.

Security rules interact next. A network security group can block an outbound flow before translation is ever attempted, which produces a different failure than exhaustion and is why the two must be told apart during diagnosis. A flow denied by a rule fails immediately and deterministically; a flow that cannot find a translation port fails under load and intermittently. Confusing the two sends investigations down the wrong path, which is part of why exhaustion stays a mystery for so long.

Private connectivity changes the demand side. When a workload reaches an Azure service through a private endpoint rather than over the public internet, that traffic does not consume SNAT ports at all, because it never leaves the private address space and never needs translation. Moving heavy internal traffic onto private endpoints is therefore a way to reduce SNAT pressure indirectly, by removing flows from the pool entirely rather than enlarging the pool. A workload that talks mostly to other Azure services can cut its outbound port demand sharply this way, leaving the public outbound path to carry only genuinely external traffic.

Finally, the precedence among outbound methods matters when more than one is present. A NAT gateway, when attached to a subnet, takes precedence over the load balancer outbound path and over any implicit access for that subnet’s traffic, so attaching one cleanly supersedes the weaker methods rather than layering on top of them. This is convenient, because it means migrating a subnet to a NAT gateway is mostly a matter of attaching the gateway and letting it take over, without first unwinding the previous method.

Designing outbound on purpose for production

Pulling the model together into a design stance gives a short set of decisions that hold across most workloads.

Treat an explicit outbound method as mandatory, not optional. With implicit access retiring, every subnet that needs internet egress must name its method, and the safe default to name is a NAT gateway. Building it into your infrastructure as code now, rather than discovering its absence when new instances come up with no egress after the retirement date, is the difference between a planned change and an outage.

Default to a NAT gateway for any workload that opens many outbound connections, which is most workloads that integrate with external APIs, send to third-party endpoints, or pull from public package and update sources under load. The on-demand allocation and the address-by-address scaling make it the path that does not require you to predict per-instance peaks, and the cost of a NAT gateway is modest against the cost of the intermittent, hard-to-diagnose failures it prevents.

Reserve load balancer outbound rules for cases where the workload already sits behind a public load balancer, the instance count and peak demand are stable and predictable, and you want explicit manual control over the address-to-port mapping. In those cases outbound rules are a legitimate and well-understood choice, provided you size them for peak and revisit the sizing whenever the instance count changes.

Make connection reuse a code-review item rather than a network afterthought. The most common cause of exhaustion lives in the application, in the per-call client creation that recreates the chatty pattern, and no amount of port supply fully compensates for an application that burns a port on every call. Pooling and long-lived clients are the demand-side half of the solution, and they are cheaper than addresses.

Put outbound port metrics on the same dashboard as the metrics the team already watches. Used and allocated ports and the failed-connection count are the early warning that turns a future incident into a capacity ticket. A workload whose used ports are creeping toward the ceiling over weeks of growth is telling you, well in advance, that it needs another address or a pooling fix, and the only cost of hearing it is having opened the metric.

Inside the NAT gateway: why on-demand allocation changes the math

The NAT gateway earns its place in the rule through specific mechanics, not just a bigger number, and the mechanics are worth understanding because they explain why it removes failure modes rather than merely postponing them.

The first mechanic is on-demand port allocation. A load balancer reserves a fixed slice of ports for each instance up front, and that reservation stands whether the instance uses one port or all of them, so the supply is partitioned before any traffic arrives. A NAT gateway holds its ports in a single shared pool and hands one to whatever instance asks at the moment it asks, returning it when the flow ends. The consequence is that capacity follows demand instead of being frozen against a prediction. An instance under a sudden burst can pull thousands of ports while its idle neighbors hold none, and a moment later the distribution can invert, all without any reconfiguration. The uneven-load failure that defines the static model, where one instance starves beside idle capacity, has no mechanism to occur, because nothing is reserved per instance to run out.
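
The difference between the two models can be sketched in a few lines. This is a toy comparison under stated assumptions, not a model of the platform's real allocator: the function names, the four-instance fleet, and the 4,096-port total are all hypothetical, chosen only to show one bursting instance beside idle neighbors.

```python
def failed_flows_static(demands, total_ports):
    """Each instance gets an equal fixed slice; demand beyond the slice fails."""
    slice_size = total_ports // len(demands)
    return sum(max(0, d - slice_size) for d in demands)


def failed_flows_shared(demands, total_ports):
    """All instances draw from one pool; only aggregate excess fails."""
    return max(0, sum(demands) - total_ports)


# Hypothetical fleet: one bursting instance beside three nearly idle ones.
demands = [3000, 50, 50, 50]
total_ports = 4096

print(failed_flows_static(demands, total_ports))  # 1976
print(failed_flows_shared(demands, total_ports))  # 0
```

With the same total supply, the static split refuses 1,976 flows on the busy instance while the shared pool refuses none, because aggregate demand (3,150) is still under the total.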

The second mechanic is aggressive port reuse across destinations. Because a flow is identified by its full tuple, a NAT gateway can use the same source port for many simultaneous flows as long as each goes to a distinct destination, since the differing destination keeps the tuples unique. This lets a single address carry far more than its raw port count in distinct flows when the workload talks to many endpoints, and it is the reuse that makes the pool stretch so much further in practice than the headline number suggests. The reuse is most effective for workloads with diverse destinations and least effective for the pathological single-destination case, which is exactly the case where adding addresses or pooling connections is the answer.
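
A minimal sketch of tuple-based reuse, assuming a deliberately simplified allocator: the `PortAllocator` class is hypothetical, it scans ports in order rather than choosing them the way a real gateway would, and the addresses come from documentation ranges. It only illustrates why distinct destinations keep the tuples unique.

```python
class PortAllocator:
    """Toy allocator: a source port may be shared by flows to different destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.in_use = set()  # (source_port, dest_ip, dest_port) tuples

    def allocate(self, dest_ip, dest_port):
        for port in self.ports:
            flow = (port, dest_ip, dest_port)
            if flow not in self.in_use:
                self.in_use.add(flow)
                return port
        return None  # every port already carries a flow to this destination


alloc = PortAllocator(ports=range(1024, 1026))  # only two source ports
# Four flows to four distinct destinations all fit on the first port.
diverse = [alloc.allocate(f"198.51.100.{i}", 443) for i in range(4)]
# A single destination can hold at most one flow per source port.
single = [alloc.allocate("203.0.113.10", 443) for _ in range(3)]
print(diverse)  # [1024, 1024, 1024, 1024]
print(single)   # [1024, 1025, None]
```

The third flow to the single destination has nowhere to go, which is the pathological case the paragraph describes.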

The third mechanic is scaling by address attachment. A NAT gateway accepts multiple public addresses, and it accepts a public IP prefix, which is a contiguous block of addresses allocated as a unit. Each address contributes its own roughly 64,000 ports to the shared pool, so supply grows linearly and predictably as you attach more. Attaching a prefix rather than individual addresses has a second benefit beyond capacity: the egress addresses are a known, contiguous range that a partner can place on an allowlist once, rather than a scattered set that changes as you add addresses one at a time. For workloads whose external partners restrict inbound by source address, predictable egress through a prefix is as important as the capacity it brings.
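
The linear growth is simple enough to compute directly. The sketch below uses the article's round figure of 64,000 ports per address rather than an exact platform value, and the prefixes are documentation ranges standing in for real ones; `pool_size` is a hypothetical helper.

```python
import ipaddress

PORTS_PER_ADDRESS = 64_000  # the article's round figure, not an exact platform value


def pool_size(prefix):
    """Approximate SNAT ports contributed by every address in a public IP prefix."""
    network = ipaddress.ip_network(prefix)
    return network.num_addresses * PORTS_PER_ADDRESS


print(pool_size("203.0.113.0/30"))  # 256000
print(pool_size("203.0.113.0/28"))  # 1024000
```

The same prefix string is also what a partner would place on an allowlist, which is the second benefit the paragraph names.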

The fourth mechanic is the idle timeout, which governs how long an idle flow holds its port before the gateway reclaims it. A longer idle timeout keeps connections ready for reuse, which suits workloads that benefit from warm connections, while a shorter timeout reclaims ports faster, which suits workloads with many brief flows that would otherwise pin ports needlessly. Tuning the idle timeout is a finer lever than adding addresses, but for a workload near its ceiling it can recover meaningful capacity by returning ports to the pool sooner.

Together these mechanics explain why the gateway is the recommended path and not merely a larger version of the load balancer’s outbound function. It does not just hand out more ports; it hands them out in a way that matches supply to demand instant by instant, reuses them across destinations, scales by simple address attachment, and gives you a timeout lever to reclaim them. That combination is what turns exhaustion from a recurring incident into a non-event for the great majority of workloads.

Connection pooling across the common stacks

The demand-side fix deserves stack-specific detail, because the most common cause of exhaustion is application code that defeats connection reuse, and the right pattern differs by platform even though the principle is constant: hold connections open and reuse them rather than opening one per call.

In .NET, the canonical mistake is creating an HttpClient per request inside a using block, which disposes the client and its underlying connection after a single call and recreates the chatty pattern at full force. The connection a disposed client held drops into its teardown wait and pins a port, so a service doing this under load consumes ports as fast as it makes calls. The correct pattern is a single long-lived HttpClient shared across requests, or, better, an HttpClient produced by a factory (IHttpClientFactory) that pools the underlying socket handlers and manages their lifetime. Pooling the handler introduces one subtlety worth naming: a connection held open indefinitely will not pick up a change in the destination’s address resolution, so the pooled-connection lifetime (the PooledConnectionLifetime setting on SocketsHttpHandler) should be set to a finite value that balances reuse against periodic refresh of name resolution. The goal is a working set of warm connections that are reused heavily yet recycled often enough to track changes, not a single connection held forever and not a fresh one per call.
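
The same contrast can be demonstrated in any stack. The sketch below is a Python analogue, not .NET code: it starts a throwaway local HTTP server that counts accepted TCP connections, then makes the same number of calls with a fresh connection per call and with one reused connection. The handler and the `count_connections` helper are hypothetical names, and the loopback server stands in for an external API.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive on the server side
    connections = 0                # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):               # runs once per TCP connection
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


def count_connections(calls, reuse):
    """Make `calls` requests and return how many TCP connections they cost."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(calls):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()
            if not reuse:
                conn.close()       # per-call pattern: one connection per request
        if reuse:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


print(count_connections(5, reuse=False))  # 5
print(count_connections(5, reuse=True))   # 1
```

Behind a translated outbound path, each of those connections would cost a SNAT port, so the per-call pattern spends five ports where the reused connection spends one.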

For database access, connection pooling is on by default in the common data access libraries, and the failure mode is usually a pool sized too small or disabled rather than absent. A pool that is too small forces the application to open and close physical connections under load even though the intent was to reuse them, which recreates the churn the pool was meant to prevent. The fix is to size the maximum pool to the workload’s real concurrency and leave pooling enabled, so that a working set of connections persists and is handed back to callers rather than torn down between queries. The same caution about indefinitely held connections applies: a sensible connection lifetime lets the pool recycle stale connections without churning under steady load.
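
The balance between reuse and recycling can be sketched with a toy pool. This is an illustration under stated assumptions, not how any particular data access library implements pooling: `LifetimePool` is hypothetical, it is not thread-safe, it does not close retired connections, and the injected clock exists only so the example runs without waiting.

```python
import time
from collections import deque


class LifetimePool:
    """Toy pool: reuse idle connections, retire any older than max_lifetime."""

    def __init__(self, open_connection, max_lifetime, clock=time.monotonic):
        self._open = open_connection
        self._max_lifetime = max_lifetime
        self._clock = clock
        self._idle = deque()  # (connection, created_at) pairs
        self.opened = 0       # physical connections created so far

    def acquire(self):
        while self._idle:
            conn, created = self._idle.pop()
            if self._clock() - created < self._max_lifetime:
                return conn, created
            # Too old: drop it so a fresh connection re-resolves the name.
        self.opened += 1
        return self._open(), self._clock()

    def release(self, conn, created):
        self._idle.append((conn, created))


now = [0.0]  # fake clock, in seconds
pool = LifetimePool(open_connection=object, max_lifetime=300, clock=lambda: now[0])
for _ in range(100):           # 100 calls inside one lifetime window
    pool.release(*pool.acquire())
print(pool.opened)             # 1
now[0] = 301.0                 # lifetime elapsed
pool.release(*pool.acquire())
print(pool.opened)             # 2
```

One hundred calls cost one physical connection, and the lifetime forces exactly one refresh once it expires, which is the working-set behavior the paragraph describes.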

For message and streaming protocols, the principle holds in the same shape: establish the connection once, keep it alive with the protocol’s keep-alive mechanism, and multiplex calls over it rather than reconnecting per message. Protocols that multiplex many logical streams over one transport connection are especially friendly to the pool, because a large number of concurrent logical calls can ride a single translated flow and therefore a single port. Choosing a multiplexing protocol where the workload allows it is a quiet but real reduction in port demand.

UDP deserves a brief note because it behaves differently from TCP. UDP has no teardown handshake and no equivalent waiting state, so a UDP flow’s port is held according to the idle timeout rather than a fixed teardown window. High-rate UDP to many destinations still consumes ports, and the idle timeout is the main lever for reclaiming them, so the same diversify-destinations and tune-the-timeout advice applies even though the teardown dynamics differ.

The unifying message is that pooling is a code-side capacity decision, not a network setting, and that no amount of port supply fully compensates for an application that burns a port per call. A workload that reuses connections correctly can run an order of magnitude more requests through the same ports, which makes pooling the highest-impact single change in most exhaustion cases and the first one to reach for.

Sizing load balancer outbound rules when you keep that path

For the workloads that legitimately stay on the load balancer path, sizing outbound rules well is the difference between a path that holds under peak and one that simply relocates the ceiling. The sizing is arithmetic, and the inputs are the number of frontend addresses, the number of backend instances, and the per-instance port allocation you choose.

The governing relationship is that the total ports available are the frontend address count multiplied by roughly 64,000, and that total divided by the per-instance allocation sets the ceiling on how many instances the configuration can support. Put the other way, the per-instance allocation you can afford is the total ports divided by the instance count. If you want a generous allocation for a large pool, you must attach proportionally more addresses, because the per-instance figure and the instance count multiply to a total that cannot exceed the addresses’ combined pool. Outbound rules expose this allocation as an explicit setting rather than leaving it to the default tier, which is the whole point of using them: you take the sliding-scale default off the table and choose the number deliberately.
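
The relationship reduces to two integer divisions. The sketch below uses the article's round figure of 64,000 ports per frontend address; the function names are hypothetical, and any rounding constraints the platform places on the allocation value are not modeled.

```python
PORTS_PER_ADDRESS = 64_000  # the article's round figure


def per_instance_allocation(frontend_addresses, instances):
    """Largest equal port share each backend instance can be given."""
    return frontend_addresses * PORTS_PER_ADDRESS // instances


def max_instances(frontend_addresses, ports_per_instance):
    """How many instances a chosen per-instance allocation can support."""
    return frontend_addresses * PORTS_PER_ADDRESS // ports_per_instance


print(per_instance_allocation(1, 200))  # 320
print(per_instance_allocation(4, 200))  # 1280
print(max_instances(2, 1024))           # 125
```

Quadrupling the addresses quadruples the share each of the 200 instances can be given, which is the "proportionally more addresses" trade the paragraph describes.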

Three sizing disciplines keep the path healthy. The first is to size for peak concurrent flows, not average, because exhaustion is driven by concurrency at the spike and an allocation that suffices on average will fail at the peak that average conceals. The second is to revisit the allocation whenever the instance count changes, because autoscale that adds instances without adding addresses silently shrinks the per-instance share, recreating the shortfall at the worst moment. The third is to account for the teardown window in the peak estimate, since ports pinned by recently closed connections count against the allocation just as active ones do, and a chatty workload’s effective peak demand is well above its count of active flows.
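
The first and third disciplines can be combined into a rough estimate. This is a back-of-envelope sketch, assuming steady-state behavior in which ports held in teardown equal the close rate multiplied by the hold time; the 120-second teardown hold and the workload figures are hypothetical inputs, not platform values.

```python
import math

PORTS_PER_ADDRESS = 64_000  # the article's round figure


def effective_peak_ports(active_flows, closes_per_second, teardown_seconds):
    """Active flows plus ports still pinned by recently closed connections."""
    return active_flows + closes_per_second * teardown_seconds


def addresses_needed(instances, peak_ports_per_instance):
    """Frontend addresses required to give every instance its peak allocation."""
    return math.ceil(instances * peak_ports_per_instance / PORTS_PER_ADDRESS)


# Hypothetical chatty instance: 200 active flows, 10 closes/s, 120 s teardown hold.
peak = effective_peak_ports(200, 10, 120)
print(peak)                         # 1400
print(addresses_needed(100, peak))  # 3
```

An instance that looks like it needs 200 ports actually needs 1,400 once the pinned ports are counted, and rerunning `addresses_needed` whenever the instance count changes is the second discipline in code form.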

Outbound rules also expose an idle timeout and a setting that controls whether the platform sends a TCP reset when an idle flow is reclaimed. The reset behavior matters for client behavior on the far side: a reset tells the peer the connection is gone rather than leaving it to discover the silence on its own, which can produce cleaner failure handling. These are finer levers than the port allocation itself, but for a tuned path they round out the configuration. The honest summary is that outbound rules are a capable and well-understood path for stable, predictable pools where you want manual control, and that for pools whose size or load varies the NAT gateway’s on-demand model spares you the recurring sizing work the rules demand.

A second scenario: autoscale, the shrinking slice, and a clean migration

The first scenario showed a chatty workload draining a fixed slice; the second shows how autoscale turns a healthy workload into a failing one without anyone changing the application, and how migrating to a NAT gateway resolves it cleanly.

A media service runs forty instances behind a load balancer with one frontend address, comfortably in the tier that grants 1,024 ports each, and it has run that way through every load test. The application is reasonably well behaved, reusing connections for most of its traffic, and at forty instances each one has ample headroom. Then a viral event drives inbound traffic up by several times, autoscale responds correctly by adding instances, and the pool grows past one hundred and on to two hundred instances. At two hundred instances the per-instance allotment has fallen from 1,024 to 256, a fourfold reduction, at precisely the moment each instance is also handling more outbound calls to the content origin and the recommendation API. The supply per instance collapsed while the demand per instance rose, and the crossing of the two produces outbound failures that arrive only at high scale and therefore only during the event the business most cares about.
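
The shrinking slice can be traced with a small lookup. Only the 1,024-port figure at forty instances and the 256-port figure at two hundred come from the scenario; the remaining tier boundaries are filled in from the platform's published default allocation table as an assumption, and they should be checked against current documentation before being relied on.

```python
# (largest pool size in tier, default ports per instance); partly assumed, see above.
DEFAULT_TIERS = [(50, 1024), (100, 512), (200, 256), (400, 128), (800, 64), (1000, 32)]


def default_ports(instances):
    """Default per-instance SNAT ports for a backend pool of the given size."""
    for upper_bound, ports in DEFAULT_TIERS:
        if instances <= upper_bound:
            return ports
    raise ValueError("pool size outside the table")


for n in (40, 100, 200):
    print(n, default_ports(n))
# 40 1024
# 100 512
# 200 256
```

Nothing in the application changes between the first line of output and the last; only the instance count does.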

The diagnosis follows the now-familiar signature. Inbound scaled fine, the new instances are healthy, internal calls succeed, and only outbound to the external origins fails, intermittently and in proportion to load. The port metrics confirm it: used ports pinned near the shrunken 256-port allotment on the busiest instances, failed connections rising with the load curve. The cause is not the application, which did nothing different, and not a bug introduced by the scale event; it is the structural interaction between autoscale and a fixed address pool, where adding instances divides the same ports more thinly.

The migration is straightforward because the NAT gateway supersedes the load balancer’s outbound path for the subnet. Attach a NAT gateway to the subnet, give it one or more public addresses or a prefix, and it takes over translation for every instance without unwinding the existing inbound load balancing, which keeps doing its job for inbound traffic. The per-instance slice disappears, replaced by on-demand draw from the shared pool, so the two hundred instances now share capacity that flows to wherever the pressure is rather than being frozen into 256-port boxes. The used-port metric, which had been pinned at the per-instance ceiling on the hot instances, drops to a healthy aggregate well under the pool total, and the failures stop. The lesson the scenario teaches is that any workload expected to autoscale should treat the NAT gateway as the default outbound method from the start, because the static path’s per-instance division is fundamentally at odds with a fleet whose size changes, and discovering that during a traffic event is the worst possible time to learn it.

Cost, predictable egress, and when a NAT gateway is overkill

A complete picture has to address the trade-offs honestly, because the rule recommends the NAT gateway by default and a default is only sound if its costs are understood.

A NAT gateway carries an hourly charge for the resource plus a charge for the data it processes, alongside the cost of the public addresses attached to it. For a workload at real scale, these costs are modest against the alternative, which is the engineering time spent diagnosing intermittent exhaustion, the manual sizing work the load balancer path demands, and the business cost of outbound failures during traffic peaks. The data-processing charge is the component to model for a high-throughput egress workload, because it scales with traffic, and it is worth comparing against the data costs of routing egress through a firewall or appliance instead, since those paths carry their own processing charges. The comparison usually favors the NAT gateway for pure outbound scaling, with the firewall justified separately by the inspection and filtering it adds rather than by translation alone.

Predictable egress is a benefit that does not show up on a capacity chart but matters operationally. Because a NAT gateway uses the addresses or prefix you attach, the source addresses your traffic presents to the outside world are known and stable, which lets external partners allowlist them. A workload on implicit access or on a changing set of load balancer addresses presents a less predictable source, which complicates partner allowlisting and can break integrations when the addresses shift. Attaching a prefix gives a contiguous, stable egress range that a partner configures once, turning egress address management from a recurring coordination problem into a one-time setup.

There are cases where a NAT gateway is more than a workload needs. A subnet whose instances make little or no outbound internet traffic, or one whose external dependencies have all moved to private endpoints so that traffic never needs translation, gains little from a gateway and can reasonably do without one once an explicit method satisfies the retirement requirement. A small, stable workload behind a load balancer that comfortably fits the default port tier and shows no pressure in its metrics is also a fair candidate to leave on outbound rules, provided someone watches the port metric as the workload grows. The rule’s force is proportional to outbound concurrency: the more connections a workload opens, and the more its scale varies, the more decisively the NAT gateway is the right default, and the lighter and more static the workload, the more room there is for a simpler choice. Naming that boundary keeps the recommendation honest rather than absolute.

A migration playbook for the default-outbound retirement

The retirement of implicit outbound access on September 30, 2025 is the change most likely to surprise a team that has not been watching for it, because it does not break anything immediately and then breaks new deployments quietly after the date. A playbook turns that surprise into a routine change, and the work divides into finding the exposure, choosing the method, and validating the result.

Finding the exposure starts with an inventory of every subnet whose instances reach the internet without a named outbound method. These are the subnets relying on the implicit path, and they are easy to overlook precisely because they appear to work with no configuration. The tell is the absence of a NAT gateway association, the absence of an outbound rule binding the instances to a public address, and the absence of a public address on the instances themselves, combined with the presence of outbound internet traffic in the flow records. Any subnet matching that description is on the implicit path and on the clock. Infrastructure as code makes this inventory tractable, because the templates and modules either declare an outbound method or they do not, and the ones that do not are the ones to fix.
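
The tell described above is a simple predicate once the inventory exists. The sketch below assumes the inventory has already been exported into plain dictionaries; the field names and the three sample subnets are hypothetical and do not correspond to any real API or template schema.

```python
def on_implicit_path(subnet):
    """True when a subnet shows internet egress but names no outbound method."""
    has_explicit_method = (
        subnet["nat_gateway"]
        or subnet["lb_outbound_rule"]
        or subnet["instance_public_ips"]
    )
    return subnet["internet_egress_seen"] and not has_explicit_method


subnets = [
    {"name": "web", "nat_gateway": True, "lb_outbound_rule": False,
     "instance_public_ips": False, "internet_egress_seen": True},
    {"name": "batch", "nat_gateway": False, "lb_outbound_rule": False,
     "instance_public_ips": False, "internet_egress_seen": True},
    {"name": "data", "nat_gateway": False, "lb_outbound_rule": False,
     "instance_public_ips": False, "internet_egress_seen": False},
]
print([s["name"] for s in subnets if on_implicit_path(s)])  # ['batch']
```

The "data" subnet also lacks a method but shows no internet egress, so it is not on the clock in the same way; whether it still needs an explicit method is a separate decision.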

Choosing the method is the same decision the rest of this article has framed, applied at migration time. For any subnet with meaningful outbound concurrency, attach a NAT gateway, because it satisfies the retirement requirement and removes the exhaustion failure mode in one step. For a subnet whose instances already sit behind a public load balancer and whose load is small and stable, outbound rules are an acceptable choice that also satisfies the requirement. For a subnet whose external dependencies have moved or can move to private endpoints, reducing the outbound need to little or nothing, an explicit method may be a formality that satisfies the requirement without carrying real traffic. The point is that every internet-reaching subnet must name a method, and the method named should match the subnet’s actual outbound profile rather than being chosen by reflex.

Validating the result is the step teams skip and regret. After attaching the method, confirm that outbound traffic still flows by exercising the workload’s real external dependencies, not just by pinging a public address, because a workload can have egress to one destination and not another depending on routing and security rules. Confirm that the port metrics now report against the new method, since a NAT gateway exposes its own port and connection metrics that should replace the load balancer’s in the team’s dashboards. And confirm in a non-production environment first that the infrastructure as code produces a working method, because the failure this migration prevents is precisely a template that silently stops granting egress, and catching that in a test subscription is far cheaper than catching it when a production deployment comes up with no internet access after the retirement date.

The framing that keeps this from feeling like a chore is that the migration is an upgrade, not a tax. A subnet moved from implicit access to a NAT gateway does not merely keep the egress it had; it gains the on-demand allocation, the scalable pool, the predictable egress addresses, and the metrics that the implicit path never offered. The retirement is best read as the platform retiring a fragile default in favor of a deliberate, better one, and the migration as the moment a workload stops depending on something opaque and starts owning something it can reason about.

Common misdiagnoses and the verification that ends them

Because exhaustion presents as a phantom, it attracts a predictable set of wrong turns, and naming them shortens the next investigation by telling the team where not to spend its time.

The first misdiagnosis is treating the failure as a downstream outage. When calls to an external API begin timing out, the natural first guess is that the API is down or throttling, and a team can spend an afternoon contacting the provider before noticing that calls from a quiet test machine to the same API succeed perfectly. The verification that ends this is the contrast: if the dependency answers fine from a low-traffic source but fails from the busy production fleet, the dependency is not the problem and the local outbound path is. That contrast points straight at the pool.

The second misdiagnosis is blaming the application’s timeout or retry configuration. Seeing timeouts, a team may lengthen timeouts or add retries, which makes the symptom worse, because longer waits and more retries mean more connections held open and more ports consumed, accelerating the very exhaustion that caused the timeouts. The verification is to look at the port metric before touching timeouts: if used ports sit near the ceiling, the timeouts are a symptom and adding retries pours fuel on the fire rather than putting it out.

The third misdiagnosis is scaling out to fix outbound failures. When a workload struggles under load, adding instances is the reflex, and for inbound-bound problems it is correct. For SNAT exhaustion on a static path it backfires, because adding instances to a fixed address pool shrinks every instance’s port slice, so the workload can fail harder after scaling out than before. The verification is the allocation table: if scaling out moved the pool into a lower per-instance tier, the scale-out reduced outbound headroom, and the fix is more addresses or a NAT gateway, not more instances.

The fourth misdiagnosis is chasing security rules and routes. Because outbound is failing, a team may audit the security rules and the route table exhaustively, finding nothing because nothing is wrong there. The verification that separates this from a real rule or route problem is the failure pattern: a rule or route problem fails deterministically and from the first attempt, identically every time, while exhaustion fails intermittently and under load, tracking the traffic curve. Deterministic and always means rule or route; intermittent and load-correlated means the pool.
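
The failure-pattern test can be written down as a rough heuristic. This is a sketch only: the function name, the sample format, and the 95 percent threshold are assumptions chosen for illustration, and a real investigation would still confirm against the port metrics.

```python
def classify_outbound_failures(samples, high_load):
    """samples: (requests_per_second, attempts, failures) per interval."""

    def failure_rate(rows):
        attempts = sum(a for _, a, _ in rows)
        return sum(f for _, _, f in rows) / attempts if attempts else 0.0

    low = failure_rate([s for s in samples if s[0] < high_load])
    high = failure_rate([s for s in samples if s[0] >= high_load])
    if low > 0.95 and high > 0.95:
        return "rule or route"    # fails every time, regardless of load
    if high > 0.0 and low == 0.0:
        return "pool exhaustion"  # fails only under load
    return "inconclusive"


load_correlated = [(100, 1000, 0), (120, 1200, 0), (900, 9000, 450)]
always_failing = [(100, 1000, 1000), (900, 9000, 9000)]
print(classify_outbound_failures(load_correlated, high_load=500))  # pool exhaustion
print(classify_outbound_failures(always_failing, high_load=500))   # rule or route
```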

The fifth misdiagnosis is assuming that a few quiet headline metrics absolve the outbound path. A team that checks inbound latency, instance CPU, and memory, finds them healthy, and concludes the platform is fine has simply not looked at the one metric that would show the problem. The verification is to make the outbound port metric a standard part of the health check rather than an exotic one consulted only after every other avenue is exhausted, so that the signal is present from the start of the investigation rather than discovered at its end.

The thread through all five is that the outbound port and failed-connection metrics are the ground truth, and that most wasted investigation time comes from not consulting them early. A team that learns to open those metrics first, the moment outbound to the public internet degrades while everything else stays healthy, converts a recurring multi-hour mystery into a five-minute confirmation.

Building monitoring that catches the pool before it drains

The strongest posture treats outbound capacity as a metric to watch over weeks rather than an incident to diagnose in the moment, and that requires putting the right signals in front of the team before anything fails.

The leading indicator is used SNAT ports as a fraction of allocated, tracked over time and per instance where the path allocates per instance. A workload whose used-port fraction creeps upward over weeks of organic growth is announcing, well in advance, that it will exhaust its allocation at some future load, and that creep is a capacity ticket rather than an incident if someone is watching it. An alert set at a fraction comfortably below the ceiling, well before failures begin, converts the eventual exhaustion into a planned addition of an address or a pooling fix made on a calm afternoon instead of during an outage.

The confirming indicator is the failed or dropped connection count, which should sit at zero in healthy operation and whose first nonzero readings are the earliest hard evidence that the pool has begun to refuse connections. An alert on any sustained nonzero value catches the problem at its onset, when only the busiest instances or the briefest peaks are affected, rather than after it has spread across the fleet. Pairing the two alerts, a warning on the rising used-port fraction and a sharper alert on the failed-connection count, gives both the long-range forecast and the immediate confirmation.
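
The paired alerts can be sketched as one evaluation function. The 70 percent warning fraction and the three-interval definition of "sustained" are hypothetical defaults, and the inputs are assumed to be metric values already fetched from wherever the team collects them; no real monitoring API is shown.

```python
def evaluate_alerts(used_ports, allocated_ports, failed_counts,
                    warn_fraction=0.7, sustained_intervals=3):
    """Return the alerts raised for one instance or one gateway."""
    alerts = []
    if allocated_ports and used_ports / allocated_ports >= warn_fraction:
        alerts.append("warning: used-port fraction high")
    recent = failed_counts[-sustained_intervals:]
    if len(recent) == sustained_intervals and all(c > 0 for c in recent):
        alerts.append("critical: sustained failed connections")
    return alerts


print(evaluate_alerts(800, 1024, [0, 0, 0]))
# ['warning: used-port fraction high']
print(evaluate_alerts(1010, 1024, [0, 4, 9, 17]))
# ['warning: used-port fraction high', 'critical: sustained failed connections']
print(evaluate_alerts(200, 1024, [0, 3, 0]))
# []
```

The first case is the long-range forecast firing alone, and the second is the forecast and the confirmation firing together.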

The context indicator is the correlation between outbound port pressure and the load curve. An outbound failure that lines up with traffic peaks is exhaustion; one that does not is something else. Putting the port metric on the same dashboard as the request-rate metric makes that correlation visible at a glance, so that the next time outbound degrades, the team can confirm or rule out exhaustion in the time it takes to look at one chart rather than in the hours it takes to audit everything else.

The discipline that makes this durable is to treat the outbound metric as first-class, not specialist. Inbound request rate, latency, and error rate are on every team’s primary dashboard by habit; outbound port usage usually is not, and that omission is why exhaustion stays a mystery so often. Adding the used-port fraction and the failed-connection count to the standard dashboard, and reviewing them as part of routine capacity planning, is a small one-time effort that removes an entire category of recurring, hard-to-diagnose incidents from the team’s future. For teams that want a place to rehearse this monitoring against a workload they can push to the edge on purpose, the VaultBook labs include exercises that drive a subnet to exhaustion and back while watching exactly these metrics, which builds the instinct to read them correctly when it matters.

Forced tunneling, firewall egress, and the limits that move with them

Outbound translation is not always performed by a NAT gateway or a load balancer, because routing can send egress somewhere else first, and when it does the translation and its limits move with the traffic.

Forced tunneling is the configuration where a route sends internet-bound traffic from a subnet back to an on-premises network through a gateway, so that egress leaves through the corporate edge rather than directly from Azure. The reason is usually policy: an organization wants all internet traffic to pass through its established on-premises inspection and logging. The consequence for SNAT is that Azure no longer performs the outbound translation for that traffic at all, because the traffic never reaches an Azure outbound path; instead the on-premises edge translates it, and the on-premises edge’s own address and port capacity becomes the ceiling. A team that forced-tunnels a high-concurrency workload can exhaust the on-premises edge’s translation capacity exactly as it would have exhausted an Azure address pool, with the added difficulty that the limit now lives in a device the cloud team may not control or even see. The lesson is that forced tunneling does not remove the exhaustion problem; it relocates it to the edge and hands it to whoever runs that edge.

Routing egress through an Azure firewall or a network virtual appliance is the middle case. Here the traffic stays in Azure but passes through an inspection device before leaving, and unless something else is configured, that device performs the outbound translation itself. The firewall has its own translation capacity, governed by how many public addresses it carries, and a high-concurrency workload can pressure that capacity just as it would pressure a bare NAT gateway. The standard pattern for scaling a firewall’s outbound capacity is to attach a NAT gateway to the firewall’s subnet so that the gateway handles the egress translation: the firewall does the inspection and the NAT gateway provides the scalable port pool, combining the inspection the firewall adds with the on-demand allocation the gateway provides. This composition is the cleanest way to get both filtering and outbound scale, and it shows that the NAT gateway’s role as the scalable translation layer holds even when a firewall sits in front of it.

The unifying principle is that the outbound method only governs translation for traffic that actually reaches it, and routing decides what reaches it. A route that sends egress to an appliance or back to on-premises supersedes the subnet’s own outbound method for that traffic, so reasoning about exhaustion requires first asking where the traffic actually translates. Get that wrong and you can attach a NAT gateway to a subnet whose traffic never touches it because a user-defined route diverts egress to a firewall, and wonder why the gateway’s port metrics stay flat while the failures continue at the device doing the real work.
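One way to apply this principle is to read the effective default route on an instance's network interface and map its next hop type to the place where translation happens. The sketch below assumes the next hop type strings Azure reports for effective routes (Internet, VirtualAppliance, VirtualNetworkGateway). The helper name translation_point and the <nic-name> placeholder are illustrative and not part of any Azure tooling.

```shell
# List the effective routes for a network interface, then find the 0.0.0.0/0 row:
#   az network nic show-effective-route-table \
#     --resource-group rg-egress --name <nic-name> --output table

# Map the next hop type of that default route to the device that translates egress.
# (translation_point is a hypothetical helper for illustration.)
translation_point() {
  case "$1" in
    Internet)              echo "subnet outbound method" ;;
    VirtualAppliance)      echo "firewall or appliance" ;;
    VirtualNetworkGateway) echo "on-premises edge" ;;
    *)                     echo "unknown" ;;
  esac
}

translation_point VirtualAppliance   # prints: firewall or appliance
```

If the answer is anything other than the subnet outbound method, a NAT gateway attached to that subnet will sit idle for this traffic, and the port pressure should be investigated at the device the route points to.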

Zone resilience and dual-stack considerations

Two further design points round out a production outbound posture: how the outbound resource behaves across availability zones, and how it handles dual-stack addressing.

Availability-zone behavior matters because outbound is a dependency like any other, and a workload spread across zones for resilience wants its outbound path to be resilient too. A NAT gateway is deployed into a single zone or configured for zonal resilience depending on how it is provisioned, and the choice affects what happens to outbound if a zone has trouble. A workload that places one zonal gateway per zone, with each subnet’s traffic egressing through its own zone’s gateway, keeps outbound localized so that a single zone’s failure takes only that zone’s egress with it rather than the whole workload’s. Aligning the outbound design with the zone design, rather than funneling all zones through one zonal resource, is the pattern that keeps a zone-resilient workload resilient on the way out as well as on the way in. The trade-off is more resources to manage, weighed against the blast radius reduction they provide, and for a workload whose availability target justified spreading across zones in the first place, extending that to the outbound path is usually consistent with the goal.
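The one-gateway-per-zone layout can be sketched as a loop over zones. This is a hedged illustration, not a prescribed deployment: the resource names (pip-natgw-zN, natgw-zN, snet-workload-zN) are hypothetical, the region is assumed to support three availability zones, and each subnet is assumed to hold only that zone's instances. The script prints the commands by default and runs them only when APPLY=1 is set.

```shell
# Dry run by default: print each az command. Set APPLY=1 to execute them.
run() {
  if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Create one zonal address and one zonal gateway per zone, then attach each
# gateway to the subnet that holds that zone's instances (names are hypothetical).
zonal_gateways() {
  for zone in 1 2 3; do
    run az network public-ip create --resource-group rg-egress \
      --name "pip-natgw-z$zone" --sku Standard --allocation-method Static --zone "$zone"
    run az network nat gateway create --resource-group rg-egress \
      --name "natgw-z$zone" --public-ip-addresses "pip-natgw-z$zone" --zone "$zone"
    run az network vnet subnet update --resource-group rg-egress \
      --vnet-name vnet-prod --name "snet-workload-z$zone" --nat-gateway "natgw-z$zone"
  done
}

zonal_gateways
```

Printing first makes it easy to review the nine commands before anything is created, which suits a change that touches the egress path of every zone.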

Dual-stack addressing, where a workload carries both IPv4 and IPv6, changes the SNAT picture because the pressure that drives exhaustion is specific to the shared-address translation that IPv4 requires. IPv6’s vast address space means a workload can be given globally routable addresses directly rather than sharing a few public ones through translation, which sidesteps the port-pool ceiling for the IPv6 portion of its traffic entirely. The IPv4 portion still translates and still faces the finite pool, so a dual-stack workload’s exhaustion risk lives in its IPv4 traffic, and the outbound design has to account for the translated leg even as the IPv6 leg escapes the constraint. For workloads whose external dependencies support IPv6, shifting traffic onto the IPv6 path is another way to relieve IPv4 port pressure, in the same family of remedies as moving internal traffic to private endpoints: both reduce demand on the shared IPv4 pool by removing flows from it rather than enlarging it.

These considerations do not change the core rule, but they refine its application. The NAT-gateway-for-scale rule answers how to size and allocate the IPv4 translation that a high-concurrency workload depends on; zone alignment answers how to keep that translation resilient; and dual-stack answers how to move traffic off the constrained path where the dependencies allow. Together they turn the rule from a single decision into a coherent outbound design that holds up under load, survives a zone failure, and uses the address families it has to best effect.

Configuring the outbound method and confirming it works

Translating the design into a running configuration is straightforward, and walking the steps clarifies what each one accomplishes so that the result is something you understand rather than something you copied. The shape of the work is the same whether you express it through the portal, the command line, or a declarative template: create the public addressing that supplies the ports, create the gateway that allocates them, associate the gateway with the subnet whose traffic it should translate, and then verify that egress flows and the metrics report against the new path.

The addressing comes first because it is the supply. You create one or more standard public addresses, or a public IP prefix when you want a contiguous, allowlist-friendly range, and these are what contribute their port pools to the gateway. Choosing a prefix here rather than individual addresses is the decision that pays off later when a partner needs to allowlist your egress, because the prefix is a single stable range rather than a set that grows as you add addresses one by one. The number of addresses you provision is the lever that sets the ceiling, so a workload you expect to push hard gets more addresses from the start rather than waiting to add them under pressure.
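The sizing arithmetic behind that lever is simple enough to sketch. The figure of 64,512 ports per address is an assumption taken from the usual accounting for a NAT gateway (65,536 port numbers minus the first 1,024), which this article rounds to 64,000. The function names are illustrative. The result is a conservative estimate, because it treats every flow as if it targeted the same destination and so needed its own port.

```shell
# Ports one public address contributes to a NAT gateway (assumed: 65,536 - 1,024).
PORTS_PER_ADDRESS=64512

# Ceiling on concurrent flows to a single destination for a given address count.
port_ceiling() {
  echo $(( $1 * PORTS_PER_ADDRESS ))
}

# Smallest address count whose ceiling covers a peak flow count (rounds up).
addresses_needed() {
  echo $(( ($1 + PORTS_PER_ADDRESS - 1) / PORTS_PER_ADDRESS ))
}

port_ceiling 2            # prints 129024
addresses_needed 200000   # prints 4
```

Running the estimate against the expected peak, rather than the average, is what turns the address count into a decision made in advance instead of a fix applied during an incident.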

The gateway comes next, created as its own resource and given the public addressing you just provisioned along with an idle timeout, expressed in minutes, suited to the workload. A workload with many brief flows benefits from a short idle timeout that returns ports to the pool quickly, while a workload that keeps warm connections benefits from a longer one; the default of four minutes is a reasonable starting point, and the timeout is a knob you revisit later if the metrics suggest ports are being held longer, or idle connections dropped sooner, than the traffic warrants.

The association is the step that actually changes behavior, because attaching the gateway to a subnet is what makes it take over translation for that subnet’s instances. Until the association exists the gateway is an idle resource translating nothing; once it exists, every instance in that subnet egresses through it, superseding the load balancer outbound path and any implicit access for that traffic. A single gateway can serve multiple subnets, so a common pattern is one gateway per zone serving the subnets in that zone, which aligns the outbound path with the zone design discussed earlier.

A minimal command-line sketch of the sequence makes the shape concrete:

# Create the public addressing that supplies the ports
az network public-ip create \
  --resource-group rg-egress \
  --name pip-natgw-1 \
  --sku Standard \
  --allocation-method Static

# Create the NAT gateway and give it the public address and an idle timeout (in minutes)
az network nat gateway create \
  --resource-group rg-egress \
  --name natgw-prod \
  --public-ip-addresses pip-natgw-1 \
  --idle-timeout 4

# Associate the gateway with the subnet whose egress it should translate
az network vnet subnet update \
  --resource-group rg-egress \
  --vnet-name vnet-prod \
  --name snet-workload \
  --nat-gateway natgw-prod

Verification is the step that closes the loop, and it has three parts. Confirm that the workload’s real external dependencies are reachable, by exercising the actual calls the application makes rather than a generic connectivity check, because routing and security rules can permit one destination and not another. Confirm that the egress source address is what you expect, which is the address or prefix you attached, since a wrong source here is what would break a partner’s allowlist. And confirm that the gateway’s own port and connection metrics are now populating, because those metrics are the early-warning system going forward and they should replace the load balancer’s outbound metrics in the team’s dashboards. A configuration that reaches its dependencies, presents the expected source address, and reports clean port metrics is a configuration you can trust under load, and the few minutes spent confirming all three is the difference between a change you tested and a change you merely hoped would work.
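The source-address check is the easiest of the three to automate. The sketch below is one hedged way to do it for IPv4: it tests whether an observed egress address falls inside an expected prefix. The helper names are illustrative, 203.0.113.0/30 is a documentation range standing in for your real prefix, and the commands shown in comments for fetching the observed and expected values are assumptions about your environment.

```shell
# Observed egress address, fetched from inside the workload, for example:
#   observed=$(curl -s https://ifconfig.me)   # any public IP echo service (assumption)
# Expected range, for example:
#   expected=$(az network public-ip prefix show --resource-group rg-egress \
#     --name <prefix-name> --query ipPrefix --output tsv)

# Pack a dotted-quad IPv4 address into a single integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed when address $1 falls inside $2 (a CIDR range or a single address).
ip_in_cidr() {
  case "$2" in
    */*) net=${2%/*}; len=${2#*/} ;;
    *)   net=$2; len=32 ;;
  esac
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

if ip_in_cidr 203.0.113.2 203.0.113.0/30; then
  echo "egress source matches the attached prefix"
else
  echo "egress source is NOT in the attached prefix"
fi
```

A mismatch here usually means the traffic is leaving by a different path than the one you configured, which is worth resolving before a partner's allowlist finds it for you.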

Expressing this same configuration in a declarative template is the durable form, because it makes the outbound method a property of the environment that is recreated identically every time rather than a manual step someone might forget. A template that declares the addressing, the gateway, and the subnet association ensures that every environment built from it has a working, explicit outbound method, which is exactly the property the default-outbound retirement requires and exactly the omission that retirement will punish in templates that lack it.

Why outbound design is a reliability decision

It is tempting to file SNAT under networking trivia, a detail that matters only when something breaks, but the framing that serves a team better is that outbound capacity is a reliability dimension of the workload, on the same footing as the inbound capacity and the dependency health that teams already track.

The case for that framing is the failure mode itself. SNAT exhaustion produces an outage that is partial, intermittent, and load-correlated, which is the hardest kind of outage to diagnose and the kind that erodes trust fastest, because it strikes during the traffic peaks that coincide with the moments the business cares about most and then vanishes before anyone can catch it. A workload that has not designed its outbound path has an unmeasured ceiling sitting between it and its external dependencies, and that ceiling will be discovered, eventually, at the worst possible time, by the workload hitting it under load. Designing the outbound path on purpose, sizing it for peak, and watching its metrics is the same discipline teams already apply to inbound capacity, and it pays off the same way: by turning a future incident into a capacity decision made calmly in advance.

The dependency angle reinforces this. Modern workloads lean heavily on external services such as payment processors, identity providers, content origins, and telemetry sinks, and the reliability of a workload is increasingly a matter of whether it can reach those services. An outbound path that exhausts under load is a single point of failure for every external dependency at once, because they all share the same translation pool, which is why an exhaustion event takes out the payment call and the logging call and the weather feed together. Treating outbound capacity as a first-class reliability concern recognizes that the path to every dependency is a shared resource that can be exhausted, and that protecting it protects all of them.

The economic angle closes the case. The cost of a NAT gateway and the addresses it carries is modest and predictable, while the cost of an unmanaged outbound path is the engineering time spent diagnosing phantom failures, the business impact of outages during peaks, and the operational drag of sizing a static path by hand and revisiting it on every scale change. Spending a small, known amount to make outbound capacity abundant and self-balancing buys out a category of incident that is expensive precisely because it is hard to diagnose. Framed that way, the NAT-gateway-for-scale rule is not a networking preference; it is a reliability investment with a favorable return, and the kind of deliberate design choice that separates a workload that holds under load from one that surprises its owners.

The verdict

SNAT port exhaustion looks like a phantom because the failure is narrow and the instinct is to look everywhere else. The model dissolves the mystery: outbound traffic from a private address is translated to a shared public one, that translation draws from a pool of roughly 64,000 ports per address, flows to the same destination each consume a port, and rapidly opened and closed connections pin ports through their teardown wait, so a chatty workload against a busy address can drain the pool while every other signal stays green. The path the workload uses decides how the pool is allocated: the retiring implicit access offers the least control and the most fragility, load balancer outbound rules offer manual static division, and a NAT gateway offers on-demand allocation from a pool that scales with the addresses you attach.

The NAT-gateway-for-scale rule is the durable takeaway. Because ports are finite per address, a workload that opens many outbound connections should route egress through a NAT gateway, which provides far more usable ports than the alternatives and removes the failure mode where one instance starves while capacity sits idle elsewhere. Pair that supply-side fix with connection pooling on the demand side and outbound port metrics for early warning, and the phantom stops recurring. With implicit outbound access retiring on September 30, 2025, choosing the method explicitly is no longer a refinement; it is a requirement, and the method worth choosing by default is the one this rule names.

If you take one action from this article, make every internet-reaching subnet name an explicit outbound method, and make a NAT gateway the method you name for anything that opens connections in volume. That single decision raises the translation ceiling, removes the uneven-load failure where one busy instance starves beside idle capacity, gives you predictable egress addresses a partner can allowlist, and brings the port metrics that turn a future incident into an early capacity ticket. Pair it with connection reuse in the application and a dashboard that watches used ports against the allocation, and outbound stops being the quiet limit that surfaces only at the worst moment and becomes a designed, measured property of the workload like any other. The phantom does not return once you have named what was always there.

Frequently asked questions

What is SNAT and how does outbound connectivity work in Azure?

SNAT, source network address translation, is how a private virtual machine reaches the public internet. The machine has only a private address that the internet cannot route to, so when it opens an outbound connection, Azure rewrites the source to a shared public address and a port from a managed pool, records the mapping, and reverses it on the reply. The application sees a direct connection while every flow actually passes through this translation layer.

What causes SNAT port exhaustion?

Exhaustion happens when a workload consumes all the translation ports available to it. The usual drivers are flows that all target one destination, since each demands its own port; connections opened and closed rapidly, since each pins a port through its teardown wait; and too few public addresses backing too much concurrency. A chatty service with no connection reuse against a single busy endpoint is the most common cause.

How does a NAT gateway fix SNAT exhaustion?

A NAT gateway takes over outbound translation for a subnet and allocates ports on demand rather than dividing them statically per instance. Any instance can draw from the full pool when it needs to, so uneven load no longer starves one instance while others sit idle. You can also attach multiple public addresses, each adding roughly 64,000 ports, so supply scales linearly with the number of addresses. The combination of on-demand allocation and a large, growable pool removes the exhaustion problem for most workloads.

How do outbound rules on a load balancer affect SNAT?

Outbound rules let you bind public addresses to a backend pool and set how many ports are preallocated per instance, giving you manual control the implicit path never offered. The allocation is static, though: you choose the per-instance count in advance and it is reserved whether used or not. Sized for peak demand with enough addresses, outbound rules work well; sized for average, they simply relocate the same ceiling.

How is default outbound access changing?

Default outbound access, the implicit public address Azure historically gave virtual machines deployed without an explicit outbound method, is being retired on September 30, 2025. After that date, new machines created without a named method will not receive implicit internet egress. The replacement is to choose a method deliberately, with a NAT gateway the recommended option, which moves outbound from an inherited default to an owned design decision.

How does connection pooling reduce SNAT usage?

Connection pooling keeps a working set of connections open and reuses them across many calls instead of opening and closing one per call. Because flows to the same destination can ride a reused connection rather than each claiming a fresh port, a pooled workload runs far more requests through the same number of ports. It is the demand-side fix and usually the most effective single change, since the most common cause of exhaustion is per-call client creation that defeats reuse.

Why do my outbound calls fail only under heavy load?

Because exhaustion is a function of concurrent flows, not of total traffic. At low load the pool has headroom and every connection finds a port immediately. As concurrency rises, used ports climb toward the ceiling, and once they reach it new connections wait and then time out. The failures cluster at peak because that is when concurrent demand crosses the allocation, which is why the symptom appears and disappears with the load curve.

How many SNAT ports does a single public IP provide?

A port number is sixteen bits, giving 65,536 values, of which roughly 64,000 are usable for translation after reserved ranges. That figure is a property of the protocol, so no configuration raises it for a single address. More translation capacity comes only from attaching additional addresses, each contributing its own pool, or from using the existing ports more efficiently through reuse across destinations and connection pooling.

Why does one instance run out of ports while others have plenty?

Under a load balancer’s static preallocation, the port pool is divided into fixed per-instance slices, and one instance cannot borrow from another’s slice. When load lands unevenly, a hot instance can exhaust its slice while neighbors sit idle, so the pool as a whole is far from empty yet that instance’s connections fail. A NAT gateway eliminates this because it allocates on demand from a shared pool rather than reserving a slice per instance.
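A small worked example makes the difference between the two allocation policies concrete. The numbers are hypothetical: one address, ten instances, one hot instance that needs 9,000 ports while the other nine use 200 each. The sketch assumes 64,000 ports per frontend address and a per-instance slice rounded down to a multiple of 8, and it uses the same pool size for both policies so that only the allocation policy differs.

```shell
# Static slice per instance under a load balancer outbound rule, rounded down
# to a multiple of 8 (assumed: 64,000 ports per frontend address).
static_slice() {
  echo $(( $1 * 64000 / $2 / 8 * 8 ))
}

hot_demand=9000                              # ports the hot instance needs
slice=$(static_slice 1 10)                   # 6400 ports reserved per instance
total_demand=$(( hot_demand + 9 * 200 ))     # 10800 ports across the fleet

# Static division: the hot instance is limited to its own slice.
if [ "$hot_demand" -le "$slice" ]; then
  echo "static: hot instance fits"
else
  echo "static: hot instance exhausted at $slice ports"
fi

# On-demand allocation: only total demand against the whole pool matters.
if [ "$total_demand" -le 64000 ]; then
  echo "shared: fleet demand of $total_demand fits"
else
  echo "shared: pool exhausted"
fi
```

The fleet as a whole uses about a sixth of the pool in this example, yet under static division the hot instance fails, which is exactly the uneven-load symptom described above.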

Does the TCP teardown state really consume ports?

Yes. When a connection closes, the side that initiated the close holds its local port in a waiting state for a period before the port can be reused, to absorb any late packets from the old connection. During that window the SNAT port stays allocated even though no data moves. A workload that opens and closes connections rapidly therefore consumes ports faster than its count of active connections suggests, which is why brief, frequent calls are harder on the pool than long-lived ones.
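The effect can be estimated with a steady-state calculation: ports held equal the rate of new connections multiplied by how long each port stays allocated, which is the time the connection is open plus the wait after it closes. The 60-second wait below is a hypothetical value chosen only to show the shape of the arithmetic, since actual hold times depend on the platform and on how the connection was closed. The function name is illustrative.

```shell
# Ports held at steady state = new connections per second
#   * (seconds each connection is open + seconds its port waits after close).
ports_held() {
  echo $(( $1 * ($2 + $3) ))
}

ports_held 500 1 60   # one connection per call, hypothetical 60 s wait: prints 30500
ports_held 500 1 0    # same call rate if ports were freed instantly: prints 500
```

At the same call rate, the wait after close accounts for nearly all of the ports held, which is why reusing connections relieves the pool far more than shaving time off each call.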

Do private endpoints affect SNAT port usage?

They reduce it. Traffic to an Azure service over a private endpoint stays within the private address space and never needs translation, so it consumes no SNAT ports. Moving heavy internal traffic onto private endpoints removes those flows from the pool entirely, which lowers outbound port demand without enlarging supply. A workload that talks mostly to other Azure services can cut its public outbound pressure substantially this way.

Which outbound method takes precedence when more than one is configured?

A NAT gateway attached to a subnet takes precedence over the load balancer outbound path and over implicit access for that subnet’s traffic. This makes migration clean: attaching a NAT gateway lets it take over translation without first unwinding the previous method. The weaker methods simply yield to the gateway for the subnets it serves.

How do I tell SNAT exhaustion apart from a blocked outbound flow?

By how the failure behaves. A flow blocked by a security rule or a missing route fails immediately and consistently, the same way every time. A flow that cannot find a translation port fails under load and intermittently, tracking the traffic curve, with used ports near the allocated ceiling and a rising failed-connection count in the metrics. The intermittent, load-correlated pattern paired with high port usage is the signature of exhaustion specifically.

What metrics confirm a SNAT problem?

The allocated and used SNAT port counts on the load balancer or NAT gateway, and the failed or dropped connection count. Healthy traffic shows used ports well under allocated with headroom; approaching exhaustion shows used ports climbing toward the ceiling along the load curve, and a rising failure count alongside that is the confirmation. Seeing those two together rules out the inbound and routing causes teams usually check first.
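Wherever your outbound resource exposes used and allocated port counts, the headroom check reduces to a ratio. The sketch below classifies a single sample. The 80 percent warning threshold is an arbitrary assumption to tune against your own traffic, and the function names are illustrative rather than part of any Azure tooling.

```shell
# Percentage of allocated SNAT ports in use (integer, rounded down).
port_utilization() {
  echo $(( $1 * 100 / $2 ))
}

# Classify a sample; the 80 percent warning threshold is an arbitrary assumption.
port_status() {
  pct=$(port_utilization "$1" "$2")
  if [ "$pct" -ge 100 ]; then
    echo "exhausted"
  elif [ "$pct" -ge 80 ]; then
    echo "warning"
  else
    echo "healthy"
  fi
}

port_status 900 1024   # prints warning (87 percent)
port_status 120 1024   # prints healthy (11 percent)
```

A single sample proves little on its own. The signal to act on is the ratio climbing along the load curve together with a rising failed-connection count.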

Should every subnet use a NAT gateway?

Any subnet whose workload opens many outbound connections should, because the on-demand allocation and address-by-address scaling remove the per-instance ceiling and the uneven-load failure mode. A subnet with little or no internet egress, or one whose external traffic moves to private endpoints, may not need one. With implicit access retiring, though, every subnet that needs egress must name some explicit method, and a NAT gateway is the recommended default to name.

Will my existing virtual machines lose internet access after the retirement?

The retirement targets new deployments: machines created without an explicit outbound method after September 30, 2025 will not receive implicit egress. The safe action regardless is to give every subnet an explicit method now, so that infrastructure as code that previously relied on the implicit path produces machines with working egress rather than silent failures once the change takes effect. Adding a NAT gateway to those subnets ahead of the date converts a future surprise into a planned, tested change.