Azure Network Watcher and Diagnostics

A connectivity problem in Azure rarely announces its cause. A request times out, a health probe fails, a database connection hangs, and the symptom looks identical whether the fault sits in a network security group, a route table, a firewall appliance, a private endpoint, or the application itself. The instinct under pressure is to start changing things: widen an NSG rule, restart the virtual machine, add a route, open a port. That instinct is the reason so many outages last longer than they should. Each blind change introduces a new variable, and after three or four of them nobody can say what the original problem was. Azure Network Watcher exists to break that loop. It is the regional diagnostic service that lets you ask one precise question at a time and get one precise answer, so you localize a fault to its exact cause before you touch a single rule.

The reason engineers reach for the wrong tool, or for no tool at all, is that the Azure Network Watcher feature set looks like a grab bag. IP flow verify, next hop, effective security rules, connection troubleshoot, connection monitor, packet capture, NSG flow logs, traffic analytics, topology, VPN diagnostics: it reads like a menu with no organizing principle. There is an organizing principle, and once you hold it, the menu becomes a decision. Each tool answers exactly one kind of question. The skill is not memorizing every tool. The skill is matching the symptom in front of you to the single question that will confirm or eliminate a cause, then running the tool that answers that question. This article gives you that mapping, the internals behind each tool so you trust its answer, the commands to run each one, and a worked diagnosis that walks a real connectivity failure from symptom to root cause without a single speculative change.

What Azure Network Watcher actually is

Network Watcher is a regional Azure service, not a resource you deploy onto a virtual machine and not an agent you install everywhere. When you enable it, Azure creates a Network Watcher instance in each region where you intend to diagnose traffic, and that instance becomes the control point through which the platform inspects the network state of resources in that region. The important word is regional. A Network Watcher in East US can diagnose a virtual machine, an NSG, or a route table in East US, but it cannot reach across into West Europe. If you operate in several regions, you enable the service in each one, and the diagnostics you run always target the region that holds the resource you are investigating.

The second thing to understand is what Network Watcher reads versus what it changes. Almost everything it does is read-only inspection of state the platform already holds. When you ask whether an NSG would allow a flow, Network Watcher does not send a packet and watch it arrive. It evaluates the rule set the platform has stored for that network interface against the hypothetical flow you describe, and it returns the verdict the platform would reach. When you ask where a packet goes, it reads the effective route table the platform has computed for that interface and returns the next hop. This matters because it means most diagnostics are safe to run against production at any time. You are querying the platform’s own model of the network, not perturbing live traffic. Packet capture is the one tool that does observe real traffic, and it carries the only meaningful runtime cost, which is why it sits at the bottom of the decision rather than the top.

The third idea is that Network Watcher operates on the same primitives the rest of Azure networking is built from, so its answers are only as good as your understanding of those primitives. A clean mental model of how the virtual network routes and then filters traffic is the foundation every one of these tools rests on, and the diagnosis goes faster when you already hold that model. If the route-then-filter sequence is not solid in your head, the connectivity model walked through in our Azure networking fundamentals for engineers guide is worth an hour before you go deep here, because Network Watcher will hand you answers in the vocabulary of routes, next hops, and security rules, and those answers only help if you know what each one means for the path a packet takes.

Why does Network Watcher need to be enabled per region?

Because the service inspects regional network state and the platform scopes that inspection to the region that holds the resource. A Network Watcher instance in one region has no view into another region’s interfaces, NSGs, or route tables. Enable it everywhere you intend to diagnose, or a diagnostic call will simply fail to find a watcher.

The historical wrinkle worth knowing is that Azure used to auto-enable Network Watcher in every region the first time you created a virtual network there, and many subscriptions still carry watchers created that way. Newer guidance leans toward explicit enablement so the service exists on purpose rather than by side effect, and so you can govern it with policy. Whichever path your subscription took, the practical check before any diagnostic is the same: confirm a watcher exists in the target region. If it does not, the tools below return an error that names the missing watcher rather than a network verdict, and engineers waste real minutes reading that error as a connectivity failure when it is only a missing prerequisite.

To see what you have, list the watchers across the subscription and read off the regions:

az network watcher list --query "[].{name:name, region:location, state:provisioningState}" -o table

If the region you care about is absent, create one before going further:

az network watcher configure --resource-group NetworkWatcherRG --locations eastus --enabled true

The convention is that watchers live in a dedicated resource group named NetworkWatcherRG, which Azure creates automatically. You do not interact with that resource group day to day, but knowing it exists prevents confusion when you see a resource group you did not create.
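The prerequisite check above can be scripted. What follows is a minimal sketch, assuming you have saved the JSON output of the list command and know which regions you need to diagnose; the function name and the sample data are hypothetical, and only the `location` and `provisioningState` fields are read.

```python
import json

def regions_missing_watcher(watcher_list_json, needed_regions):
    """Return the needed regions that have no successfully provisioned watcher.

    watcher_list_json is assumed to be the JSON text printed by
    `az network watcher list -o json`.
    """
    watchers = json.loads(watcher_list_json)
    covered = {
        w["location"].lower()
        for w in watchers
        if w.get("provisioningState") == "Succeeded"
    }
    return [r for r in needed_regions if r.lower() not in covered]

# Illustrative sample data, not real subscription output.
sample = json.dumps([
    {"name": "NetworkWatcher_eastus", "location": "eastus", "provisioningState": "Succeeded"},
    {"name": "NetworkWatcher_westeurope", "location": "westeurope", "provisioningState": "Succeeded"},
])
missing = regions_missing_watcher(sample, ["eastus", "westus2"])
print(missing)
```

A non-empty result means a diagnostic in that region will fail on the missing prerequisite, not on the network.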

The tool-answers-the-question rule

Here is the claim this article is built on, and the one worth remembering after everything else fades: each Network Watcher tool answers exactly one precise question, so diagnosing connectivity is the act of choosing the tool whose question matches your symptom, not the act of running everything and reading tea leaves. Call it the tool-answers-the-question rule. A 503 from a load balancer probe, a hung database connection, a request that reaches a firewall but never returns, a peering that came up but carries no traffic: each of these maps to a question, and each question maps to a tool. When you internalize the mapping, the panic of a connectivity incident turns into a short, ordered set of yes-or-no checks.

The findable artifact for this article is the InsightCrunch Network Watcher tool-to-question map. It is the table you pin to your runbook. Read it as a lookup: identify the question your symptom raises, and the tool in the second column is the one that answers it without changing anything.

Diagnostic question	Tool that answers it	What the answer tells you
Would the NSG layer allow or deny this specific flow?	IP flow verify	Allow or Deny, plus the exact rule name that decided
Where does routing actually send a packet to this destination?	Next hop	The next-hop type and IP, and the route that produced it
Given all NSGs in play, what is the combined effective rule set on this NIC?	Effective security rules	The flattened, ordered rules the platform actually enforces
Does an end-to-end path from this source to that destination work right now?	Connection troubleshoot	Reachable or unreachable, with the hop and the reason at the failing point
Is that path still working continuously, and how is its latency trending?	Connection monitor	Ongoing reachability, latency, and the topology between endpoints
What is actually inside the traffic on the wire for this conversation?	Packet capture	A capture file you open in a packet analyzer to read the bytes
What traffic happened across this NSG in the past, allowed and denied?	NSG flow logs and traffic analytics	A historical record of flows, decisions, and aggregated patterns
What does the network around this resource look like?	Topology	A generated diagram of resources and their relationships
Why will my VPN or gateway connection not come up?	VPN troubleshoot	A health verdict on the gateway or connection with a diagnostic log

The map encodes a priority that is easy to miss. The questions near the top are the cheap, instant, read-only ones, and they answer the overwhelming majority of real incidents. IP flow verify and next hop together resolve most connectivity tickets in under a minute because most connectivity tickets are an NSG denying a flow or a route sending it somewhere unexpected. Packet capture sits low not because it is weak but because it is heavy: it observes live traffic, produces a file you then have to analyze, and answers a question the cheaper tools have usually already answered. The discipline the rule enforces is to start at the top of the map and descend only when the question genuinely demands it.

Which tool should I reach for first?

Start with the question, not the tool. If the symptom is traffic blocked or denied, IP flow verify answers in seconds. If traffic seems to vanish or go to the wrong place, next hop reveals the routing decision. Reach for packet capture only when a cheaper tool has not already named the cause, because it is the slowest and most invasive option.

This ordering is the single most common mistake the rule corrects. Under incident pressure, packet capture feels decisive: you will see the actual traffic, so surely that settles it. In practice you spend several minutes setting up the capture, waiting for traffic to flow, downloading the file, and loading it into an analyzer, only to confirm what IP flow verify would have told you instantly, which is that an NSG rule was dropping the SYN before it ever reached the destination. The cheaper tools are cheaper precisely because they read the platform’s decision rather than waiting to observe its consequence. Honor the order in the map and you turn a fifteen-minute investigation into a sixty-second one.

IP flow verify: would an NSG allow this flow?

IP flow verify is the tool you will reach for most, and understanding precisely what it does and does not do is what makes it trustworthy. You hand it a direction (inbound or outbound) plus a five-tuple: a protocol (TCP or UDP), a local IP and port, and a remote IP and port, all relative to a target network interface. It returns one of two words, Allow or Deny, together with the name of the security rule that decided. It does this by evaluating the network security group rules that apply to that interface against the hypothetical flow you described. It does not send a packet. It does not test routing. It does not check whether the destination is listening. It answers one question and one question only: given the NSG rules in force on this NIC, would a packet matching this description be permitted or blocked by the security layer?

That narrowness is the strength. When a connection fails and you suspect a security rule, you do not have to reason through a stack of rules across a subnet NSG and a NIC NSG, work out priorities, and account for default rules in your head. You describe the flow, and the platform does the evaluation it would do for a real packet and reports the verdict plus the deciding rule. If the answer is Allow, you have eliminated the NSG layer as the cause and can move on with certainty rather than suspicion. If the answer is Deny, you have not only confirmed the NSG is the cause, you have the exact rule name to fix.

Consider a web tier that cannot accept traffic on port 443. Rather than open the NSG blade and squint at rule priorities, you ask the question directly:

az network watcher test-ip-flow \
  --resource-group AppRG \
  --vm web-vm-01 \
  --direction Inbound \
  --protocol TCP \
  --local 10.1.1.4:443 \
  --remote 203.0.113.10:51000 \
  --query "{access:access, rule:ruleName}" -o table

If the result reads Deny and names a rule, you know the security layer is rejecting the flow and exactly which rule to amend. If it reads Allow, the NSG is not your problem and you have saved yourself from widening a rule that was never the cause. That second outcome is the underrated value of the tool. Half of its worth is in the cases where it tells you to stop looking at the NSG, because the most expensive mistakes in connectivity troubleshooting come from fixing the layer that was working and leaving the real fault untouched.
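To make the evaluation concrete, here is a simplified model of the question IP flow verify answers for an inbound flow. It is a sketch under stated assumptions, not the platform's implementation: the `Rule` shape, the rule `AllowHttpsFromCorp`, and the `NoMatchingRule` fallback are hypothetical, it ignores service tags and port ranges, and it covers a single rule list only.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    access: str         # "Allow" or "Deny"
    protocol: str       # "TCP", "UDP", or "*"
    remote_prefix: str  # CIDR the remote (source) address must fall in
    local_port: str     # a single destination port such as "443", or "*"

def verify_inbound_flow(rules, protocol, local_port, remote_ip):
    """Return (access, rule_name) for the first rule that matches, by priority."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.protocol not in ("*", protocol):
            continue
        if rule.local_port not in ("*", str(local_port)):
            continue
        if ip_address(remote_ip) not in ip_network(rule.remote_prefix):
            continue
        return rule.access, rule.name
    # Hypothetical fallback; a real NSG always ends with default rules.
    return "Deny", "NoMatchingRule"

rules = [
    Rule("AllowHttpsFromCorp", 300, "Allow", "TCP", "203.0.113.0/24", "443"),
    Rule("DenyAllInBound", 65500, "Deny", "*", "0.0.0.0/0", "*"),
]
print(verify_inbound_flow(rules, "TCP", 443, "203.0.113.10"))
print(verify_inbound_flow(rules, "TCP", 443, "198.51.100.7"))
```

No packet is involved anywhere in the sketch: the verdict comes from the stored rules and the described flow, which is why the real check is safe to run against production.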

What does IP flow verify check that effective security rules does not?

IP flow verify evaluates a single concrete flow and returns the decision plus the deciding rule. Effective security rules returns the entire flattened rule set on the interface without testing any particular flow. Use IP flow verify when you have a specific connection to test, and effective security rules when you want to audit the whole combined policy.

The distinction is worth holding because the two tools feel similar and engineers reach for the wrong one. If your question is “why is this exact connection failing,” IP flow verify gives you a verdict in one call. If your question is “what is the full set of rules actually being enforced here, across the subnet NSG and the NIC NSG combined,” effective security rules gives you the merged, ordered list. The first is a probe; the second is an audit. A useful workflow is to run IP flow verify first to get the verdict on the failing flow, and if the deny rule it names is unexpected, follow with effective security rules to see how that rule landed in the combined policy and what surrounds it.

One subtlety to keep exact: IP flow verify evaluates the NSG decision for the flow you describe, but it works from the IPs and ports you provide. If you describe the wrong flow, you get a correct answer to the wrong question. A common error is testing the public IP of a load balancer as the remote address when the actual flow the NSG sees has already been translated, or testing the wrong direction relative to the NIC. Describe the flow as the network interface sees it: the local side is the NIC you are targeting, and the direction is from that NIC’s perspective. Get the five-tuple right and the verdict is authoritative.

Next hop: where does routing send this packet?

Where IP flow verify answers the filtering question, next hop answers the routing question, and the two together cover the large majority of connectivity faults. You give next hop a source IP on a target interface and a destination IP, and it returns the next-hop type the platform’s routing would choose, the next-hop IP address where relevant, and the route that produced the decision. The next-hop types are the vocabulary of Azure routing: Internet, VirtualNetworkGateway, VirtualAppliance, VnetLocal, VnetPeering, and None. Each one tells a story about where your traffic is actually headed, and one of them, None, tells you it is headed nowhere.

The reason next hop is so powerful is that routing problems are invisible from the symptom. A packet that gets silently sent to a virtual appliance that is down, or dropped by a route whose next hop is None, produces the same timeout as an NSG block or a dead listener. You cannot tell from the application’s perspective whether the packet was filtered, misrouted, or black-holed. Next hop tells you the routing decision directly, reading the effective route table the platform computed for that interface after combining system routes, any user-defined routes, and routes learned over BGP.

Suppose outbound traffic to the internet from a subnet is failing, and you suspect a user-defined route is steering it somewhere unexpected:

az network watcher show-next-hop \
  --resource-group AppRG \
  --vm app-vm-02 \
  --source-ip 10.1.2.5 \
  --dest-ip 8.8.8.8 \
  --query "{type:nextHopType, ip:nextHopIpAddress, route:routeTableId}" -o table

If the next-hop type comes back as VirtualAppliance with the IP of your firewall, you have learned that a user-defined route is sending internet-bound traffic through the appliance, which is correct if you intended forced tunneling and a problem if the appliance is the thing that is down. If it comes back as None, you have found a route that is black-holing the traffic, and the fix is in the route table, not the NSG and not the application. The reasoning here connects directly to how user-defined routes override system routes, a subject worth the full treatment in our Azure route tables and UDRs explained deep dive, because next hop tells you which route won and that companion piece tells you why it won and how to change the outcome deliberately.

How does next hop reveal a route that black-holes traffic?

A route with a next-hop type of None drops every packet that matches it, silently. Next hop returns None as the type when such a route is the most specific match for your destination, which immediately localizes the fault to the route table. No NSG check or packet capture is needed; the routing decision itself is the answer, and you go straight to the offending route.

This is the case engineers most often misdiagnose, because a black-holed packet behaves exactly like a filtered one from the outside. You see a timeout, you assume a firewall, you spend twenty minutes auditing NSG rules that were never involved, and the whole time a single user-defined route with next hop None was quietly discarding everything bound for that prefix. Next hop collapses that entire investigation into one call. When the type comes back None, the route table is the cause, full stop, and you can move directly to reading the route that matched and deciding whether to remove it, repoint it, or narrow its address prefix so it stops swallowing the traffic you care about.
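The routing decision can be modeled in a few lines. This is a sketch, assuming the usual Azure selection order: the longest matching prefix wins, and on a tie a user-defined route beats a BGP route, which beats a system route. The route dictionaries and the `source` labels are illustrative, not the exact shape of any API response.

```python
from ipaddress import ip_address, ip_network

# Tie-break order when two routes share the same prefix (lower wins).
SOURCE_RANK = {"User": 0, "BGP": 1, "Default": 2}

def pick_next_hop(routes, dest_ip):
    """Return the route that would carry dest_ip, or None if nothing matches."""
    dest = ip_address(dest_ip)
    matches = [r for r in routes if dest in ip_network(r["prefix"])]
    if not matches:
        return None
    return min(
        matches,
        key=lambda r: (-ip_network(r["prefix"]).prefixlen, SOURCE_RANK[r["source"]]),
    )

routes = [
    {"prefix": "10.1.0.0/16", "source": "Default", "type": "VnetLocal", "ip": None},
    {"prefix": "0.0.0.0/0", "source": "Default", "type": "Internet", "ip": None},
    {"prefix": "0.0.0.0/0", "source": "User", "type": "VirtualAppliance", "ip": "10.1.0.4"},
    {"prefix": "10.50.0.0/16", "source": "User", "type": "None", "ip": None},
]

for dest in ("8.8.8.8", "10.50.3.9", "10.1.2.5"):
    hop = pick_next_hop(routes, dest)
    print(dest, hop["type"], hop["ip"])
```

In this sample the 10.50.0.0/16 route is the black hole: it is more specific than either default route, so everything bound for that prefix resolves to a next-hop type of None.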

The interplay between next hop and IP flow verify is the core of fast diagnosis. Run next hop to confirm the packet is routed where you expect; if it is misrouted or black-holed, the route table is the cause and you stop there. If routing is correct, run IP flow verify to confirm the security layer permits the flow; if it denies, the NSG is the cause. Between these two read-only, instant checks you have separated the two layers that produce the overwhelming majority of connectivity failures, and you have done it without sending a packet or changing a rule.

Effective security rules: which rule actually decides?

A single network interface can be governed by two network security groups at once, one attached to the subnet and one attached to the NIC, and a flow has to be allowed by both. Within each NSG the platform evaluates rules in priority order and the first match decides; inbound traffic meets the subnet NSG first and then the NIC NSG, and outbound traffic meets them in the reverse order. This is where engineers lose the thread, because the rule that is actually deciding a flow may live in the NSG they are not looking at, or may be a default rule they forgot exists, or may be shadowed by a higher-priority rule that matches first. Effective security rules cuts through all of it by returning the rules the platform actually enforces on that interface, drawn from both NSGs and including the default rules.

Reading the effective rules is how you answer the question “what is the real policy here,” as opposed to “what does this one NSG say.” The view shows you precedence concretely: within each NSG, the lower priority numbers are evaluated first, and the first match wins. When IP flow verify names a deny rule you did not expect, effective security rules shows you which NSG that rule belongs to, where it sits in the order, and what else is around it, which is how you discover that a broad deny at priority 200 is shadowing the allow you carefully wrote at priority 300.

az network nic list-effective-nsg \
  --resource-group AppRG \
  --name web-vm-01-nic \
  --query "value[].effectiveSecurityRules[].{name:name, priority:priority, access:access, direction:direction, proto:protocol, dst:destinationPortRange}" -o table

The output is the truth of what is enforced, and it is the artifact to attach to an incident ticket when you need to show a teammate the actual policy rather than describe it. The deeper mechanics of how priorities resolve, how default rules behave, and how the subnet and NIC NSGs interact are the subject of our Network Security Groups deep dive, and effective security rules is the tool that makes that theory concrete for a specific interface on a specific day.
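Shadowing within one priority-ordered rule list can be shown with a small audit helper. This is an illustrative sketch, not a parser for the command output: the rule dictionaries, the names `DenyWeb` and `AllowHttps`, and the `matches_flow` predicate are all hypothetical.

```python
def explain_decision(rules, matches_flow):
    """Return (deciding_rule, shadowed_rules) for one flow.

    rules: list of dicts with at least "name" and "priority".
    matches_flow: callable that says whether a rule matches the flow.
    The first match in priority order decides; every later match is shadowed.
    """
    ordered = sorted(rules, key=lambda r: r["priority"])
    hits = [r for r in ordered if matches_flow(r)]
    if not hits:
        return None, []
    return hits[0], hits[1:]

rules = [
    {"name": "AllowHttps", "priority": 300, "access": "Allow", "port": "443"},
    {"name": "DenyWeb", "priority": 200, "access": "Deny", "port": "*"},
    {"name": "AllowSsh", "priority": 310, "access": "Allow", "port": "22"},
    {"name": "DenyAllInBound", "priority": 65500, "access": "Deny", "port": "*"},
]

# The flow under audit: inbound traffic to port 443.
decider, shadowed = explain_decision(rules, lambda r: r["port"] in ("*", "443"))
print(decider["name"], decider["access"])
print([r["name"] for r in shadowed])
```

Listing the shadowed matches alongside the deciding rule is what turns "my allow is right there" into "my allow is never reached."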

Why does the effective rule set differ from what I configured?

Because the effective set covers both the subnet NSG and the NIC NSG and includes the platform default rules, and a flow has to be allowed by both NSGs. A rule you wrote can be shadowed by a higher-priority rule in the same NSG, blocked by a deny in the other NSG, or decided by a default rule you did not account for. The effective view shows everything enforced on the interface rather than either NSG in isolation.

This gap between the configured and the effective is the source of a whole class of confusing incidents. You add an allow rule, the connection still fails, and the NSG you edited clearly shows the allow. What the single NSG view hides is that a deny rule in the other NSG also matches the flow, and because traffic must pass both NSGs, your allow cannot rescue it. Effective security rules reveals this immediately by showing the rules from both NSGs for the same interface, and the fix becomes obvious: in the NSG that holds the deny, add or renumber an allow so it precedes the deny, or narrow the deny so it stops matching the flow you want to permit. Without the combined view you are debugging two policies as if they were one, which is how an afternoon disappears.

Connection troubleshoot: does the whole path work end to end?

IP flow verify and next hop each test one layer in isolation. Connection troubleshoot tests the whole path. You give it a source, which can be a virtual machine or an application gateway, and a destination, which can be another Azure resource, an IP, or a fully qualified domain name with a port, and it attempts an actual end-to-end reachability test along the path. The result tells you whether the destination is reachable, and when it is not, it identifies the hop where the path breaks and the probable reason, whether that is a security rule, a routing issue, or a destination that is simply not responding.

Connection troubleshoot is the tool for the question “is the entire path healthy right now,” and it shines when the layer-by-layer checks have each come back clean but the connection still fails. There are paths where routing is correct and the NSG permits the flow, yet the connection still does not complete, because the destination’s own NSG denies it, or a downstream appliance drops it, or the service on the far end is not listening. Connection troubleshoot exercises the full path and reports the hop-level result, so it catches the failure that lives somewhere the source-side checks cannot see.

az network watcher test-connectivity \
  --resource-group AppRG \
  --source-resource app-vm-02 \
  --dest-address api.internal.contoso.com \
  --dest-port 443 \
  --protocol Tcp \
  --query "{status:connectionStatus, hops:hops[].{address:address, issues:issues}}" -o json

The connectionStatus tells you Reachable or Unreachable at a glance, and the per-hop detail shows where along the path the trouble sits. If the status is Unreachable and the issue is reported at the destination hop with a security-rule problem, you have learned the destination side is blocking the flow, which is something a source-side IP flow verify against your own NIC would never have revealed. This is the value of an end-to-end test: it sees the whole path, including the layers on the other end that you do not directly control.

When should I use connection troubleshoot instead of IP flow verify?

Use IP flow verify when you want to test the security decision on one interface, instantly and without sending traffic. Use connection troubleshoot when the per-layer checks pass but the connection still fails, because it exercises the actual end-to-end path and finds the breaking hop, including problems on the destination side or in a downstream appliance that source-side tools cannot see.

The two are complementary, not redundant. IP flow verify is faster and read-only, so it belongs at the top of your decision for the security question. Connection troubleshoot is heavier, because it generates real probe traffic and walks the path, so it belongs a step later, after the cheap source-side checks have cleared and you need to confirm the destination and the intermediate hops. A reliable sequence is next hop to confirm routing, IP flow verify to confirm the local security decision, and then connection troubleshoot to confirm the path actually completes end to end. Each step either finds the cause or hands a clean path to the next, and you reach packet capture only if all three come back healthy and the application still cannot talk.
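The sequence can be written down as a short-circuiting routine. This is a sketch of the ordering only: each `check_*` argument is a hypothetical zero-argument wrapper around the corresponding diagnostic, and the return strings are illustrative.

```python
def triage(check_next_hop, expected_hop_type, check_ip_flow, check_connectivity):
    """Run the checks cheapest first and stop at the first one that names a cause.

    The checks are passed as callables so a later, heavier check only runs
    when every earlier one has come back clean.
    """
    hop_type = check_next_hop()
    if hop_type == "None":
        return "route table: a route is black-holing this destination"
    if hop_type != expected_hop_type:
        return f"route table: next hop is {hop_type}, expected {expected_hop_type}"
    if check_ip_flow() == "Deny":
        return "NSG: the security rules on this NIC deny the flow"
    if check_connectivity() == "Unreachable":
        return "path: inspect the failing hop reported by connection troubleshoot"
    return "packet capture: routing, filtering, and path are clean"

calls = []

def record(name, value):
    """Build a fake check that logs its own invocation (for demonstration)."""
    def check():
        calls.append(name)
        return value
    return check

verdict = triage(
    record("next_hop", "None"), "Internet",
    record("ip_flow", "Allow"), record("connectivity", "Reachable"),
)
print(verdict)
print(calls)
```

In the demonstration only the next hop check runs, because a black-holing route already settles the question and the later checks would add nothing.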

Connection monitor: is the path still working over time?

Connection troubleshoot answers whether a path works right now, as a single point-in-time test. Connection monitor answers whether the path keeps working, by running continuous reachability and latency checks between endpoints and recording the results over time. It is the tool for the question that point-in-time diagnosis cannot answer: not “is it broken at this instant,” but “when did it start failing, and was the latency degrading before it broke.” For anything you care about staying up, connection monitor is the difference between learning about a problem from a user ticket and learning about it from your own telemetry.

You define a connection monitor with a source endpoint, a destination endpoint, and a test configuration that sets the protocol, the port, and the test frequency. From then on the platform probes the path on that interval and stores the results, which you can alert on and graph. The destination can be another Azure resource or an external endpoint, so connection monitor covers both intra-Azure paths and connectivity to an on-premises service or a public API your application depends on. The continuous record is what turns an intermittent problem from a mystery into a timeline. When a connection fails once a day for thirty seconds, no point-in-time test will ever catch it; connection monitor catches it because it is always testing.

The latency dimension matters as much as the up-or-down dimension. A path that is technically reachable but whose round-trip time has climbed from two milliseconds to two hundred is a problem in the making, and connection monitor surfaces that trend before it becomes an outage. It also renders the topology between the endpoints, so when latency degrades you can see which hop introduced it, which is invaluable when the path crosses a gateway, a peering, or a virtual appliance that might be the source of the slowdown.
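To illustrate what a continuous record buys you, the following sketch summarizes a series of probe results. The sample tuples, the baseline, and the `degrade_factor` threshold are assumptions made for the example; they are not the schema or the alerting logic of connection monitor itself.

```python
from statistics import median

def summarize_probes(samples, baseline_ms, degrade_factor=5.0):
    """Summarize probe results given in time order.

    samples: list of (timestamp, reachable, rtt_ms) where rtt_ms is None
    for a failed probe. Latency counts as degraded when the median
    round-trip time exceeds baseline_ms * degrade_factor.
    """
    failures = [ts for ts, ok, _ in samples if not ok]
    rtts = [rtt for _, ok, rtt in samples if ok and rtt is not None]
    med = median(rtts) if rtts else None
    return {
        "first_failure": failures[0] if failures else None,
        "failure_count": len(failures),
        "median_rtt_ms": med,
        "latency_degraded": med is not None and med > baseline_ms * degrade_factor,
    }

# An intermittent failure that a single point-in-time test would likely miss.
intermittent = [("03:00", True, 2.0), ("03:01", True, 2.4),
                ("03:02", False, None), ("03:03", True, 2.2)]
# A path that is still reachable but far slower than its baseline.
slow = [("09:00", True, 180.0), ("09:01", True, 200.0), ("09:02", True, 220.0)]

print(summarize_probes(intermittent, baseline_ms=2.0))
print(summarize_probes(slow, baseline_ms=2.0))
```

The first series yields a timestamp for when the failure began; the second flags a path that every up-or-down check would still call healthy.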

How is connection monitor different from a one-time connectivity test?

Connection troubleshoot runs once and tells you the path state at that moment. Connection monitor runs continuously on a schedule, recording reachability and latency over time, so it catches intermittent failures and degrading latency that a single test would miss entirely. Use the one-time test for active diagnosis and connection monitor for ongoing assurance of a path you depend on.

The practical division is that you use connection troubleshoot while you are actively debugging, with an incident open and a hypothesis to confirm, and you use connection monitor for the paths whose health you need to know about before a human notices. A database connection from your application subnet, a path to an on-premises identity provider over a VPN, an external payment API your checkout depends on: these are the kinds of paths worth a standing connection monitor, because the cost of finding out late is high and the cost of continuous probing is low. The two tools answer the same shape of question across different time horizons, and a mature environment uses both.

Packet capture: what is happening inside the conversation?

Packet capture is the only Network Watcher tool that observes real traffic rather than reading the platform’s model of it. It captures packets on a target virtual machine to a file, which you then open in a packet analyzer to inspect the actual bytes on the wire: the handshake, the retransmits, the resets, the payloads, the timing. It answers the question none of the other tools can, which is what is actually happening inside a conversation that the layer checks all say should be working. When IP flow verify says Allow, next hop says the route is correct, connection troubleshoot says the path is reachable, and the application still misbehaves, packet capture is where you go to see the truth on the wire.

The reason packet capture sits at the bottom of the decision is not that it is unreliable. It is the most authoritative tool there is, because it shows you the literal traffic. It sits low because it is the heaviest and the slowest. It requires the Network Watcher agent extension on the target virtual machine, it observes live traffic so it has a runtime footprint, it produces a file you must store somewhere and then download, and it hands you that file rather than an answer, so you still have to do the analysis. Every one of those costs is justified when the question genuinely requires seeing the bytes, and every one of them is wasted when a cheaper tool would have answered first. That is the discipline the tool-answers-the-question rule enforces here: never open a capture for a question IP flow verify could have settled.

az network watcher packet-capture create \
  --resource-group AppRG \
  --vm app-vm-02 \
  --name app-vm-02-capture-01 \
  --storage-account appdiagstorage \
  --time-limit 120

You set a time limit or a size limit so the capture stops on its own, you let the problematic traffic occur during the window, and then you download the resulting capture from the storage account and open it in your analyzer of choice. What packet capture earns you is the application-layer and protocol-layer truth: a TLS handshake that starts but never completes, a server that sends a reset immediately after the SYN, a stream of retransmits that points to packet loss rather than a block, an MTU problem that fragments and stalls. These are the failures that live above the routing and filtering layers, where the network is delivering packets correctly and the trouble is in what the endpoints are saying to each other. For those, and only for those, packet capture is the right and necessary tool.
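As a rough illustration of the reading you do in an analyzer, the sketch below classifies the opening of a TCP conversation from a simplified packet list. It is a toy under stated assumptions: packets are already reduced to hypothetical `(sender, flags)` tuples, whereas a real capture file needs a packet analyzer to decode.

```python
def classify_handshake(packets):
    """Classify how a TCP connection attempt began.

    packets: list of (sender, flags) in capture order, where sender is
    "client" or "server" and flags is a set such as {"SYN"} or {"SYN", "ACK"}.
    """
    syn_count = sum(1 for sender, flags in packets
                    if sender == "client" and flags == {"SYN"})
    if syn_count == 0:
        return "no SYN captured"
    for sender, flags in packets:
        if sender == "server" and "RST" in flags:
            return "server reset the connection"
        if sender == "server" and {"SYN", "ACK"} <= flags:
            return "handshake answered"
    # No server packet at all: the SYN was dropped or lost somewhere.
    return "SYN unanswered (retransmitted)" if syn_count > 1 else "SYN unanswered"

refused = [("client", {"SYN"}), ("server", {"RST", "ACK"})]
dropped = [("client", {"SYN"}), ("client", {"SYN"}), ("client", {"SYN"})]
healthy = [("client", {"SYN"}), ("server", {"SYN", "ACK"}), ("client", {"ACK"})]

for name, trace in (("refused", refused), ("dropped", dropped), ("healthy", healthy)):
    print(name, "->", classify_handshake(trace))
```

A reset right after the SYN points at the endpoint, repeated unanswered SYNs point at a drop or loss on the path, and an answered handshake tells you to look higher, for example at TLS.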

When is packet capture the right tool rather than a faster check?

Reach for packet capture only after next hop, IP flow verify, and connection troubleshoot have each come back clean and the application still fails. At that point the fault is above the routing and filtering layers, in the protocol or application conversation itself, and you need to see the actual packets. For any fault those cheaper checks can localize, packet capture is wasted effort.

The trap is that packet capture feels like the most decisive move precisely when you are most stressed, and the stress is what pushes engineers to it too early. The honest accounting is that for the common connectivity ticket, an NSG denying a flow or a route black-holing it, you will spend ten minutes capturing and analyzing to confirm what a five-second IP flow verify would have told you. Reserve the capture for the genuinely mysterious case, the one where the network plainly works but the conversation does not, and it becomes the precision instrument it is meant to be rather than a slow detour.

NSG flow logs and traffic analytics: what happened before?

Every tool so far answers a question about the present or the future. NSG flow logs answer a question about the past. They record the flows that traversed a network security group, allowed and denied, with the five-tuple and the decision, to a storage account. They are the historical ledger of what the security layer actually did, and they answer the question no live tool can: not “would this flow be allowed,” but “what flows actually happened, and what did the NSG do with them, over the last hour or day or week.”

The fact to keep exact is that NSG flow logs require an NSG. The logging attaches to a network security group, and the data it produces is the record of that NSG’s decisions about the traffic it saw. Without an NSG in the path there is nothing to log, because the flow log is fundamentally a record of NSG evaluations. This is why enabling flow logs is part of designing an NSG for production, not an afterthought: the historical record only exists from the moment you turn it on, so the time to enable it is before the incident, not during.
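To show what one entry in that ledger carries, here is a hedged Python sketch that parses a single flow tuple. It assumes the comma-separated layout used by version 2 flow logs, with the first eight fields in the order timestamp, source IP, destination IP, source port, destination port, protocol, direction, decision. Treat that field order as an assumption to check against the current schema documentation. FlowRecord and parse_flow_tuple are names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlowRecord:
    time: datetime
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str    # "TCP" or "UDP"
    direction: str   # "inbound" or "outbound"
    decision: str    # "allow" or "deny"


def parse_flow_tuple(raw):
    """Parse the first eight fields of one flow tuple (assumed layout)."""
    fields = raw.split(",")
    ts, src_ip, dst_ip, src_port, dst_port, proto, direction, decision = fields[:8]
    return FlowRecord(
        time=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        src_ip=src_ip,
        dst_ip=dst_ip,
        src_port=int(src_port),
        dst_port=int(dst_port),
        protocol={"T": "TCP", "U": "UDP"}[proto],
        direction={"I": "inbound", "O": "outbound"}[direction],
        decision={"A": "allow", "D": "deny"}[decision],
    )


# Illustrative tuple with made-up addresses; any trailing fields are ignored here.
record = parse_flow_tuple("1542110377,10.0.0.4,10.1.0.5,44931,443,T,O,A,B,,,,")
```

Each record is the five-tuple plus the decision, which is exactly what lets you reconstruct what the NSG did with a flow after the fact.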

Traffic analytics sits on top of flow logs and turns the raw flow records into aggregated insight. Where a flow log is a firehose of individual flows, traffic analytics processes them into patterns: which hosts talk to which, how much traffic flows where, what is being denied and from where, which ports are most active, and where unexpected traffic is appearing. It is the tool for understanding the shape of your traffic rather than a single flow, and it answers questions like “is this denied traffic a misconfiguration or a scan,” “what is actually talking to this subnet,” and “did the traffic pattern change after that deployment.” For a denied-traffic mystery, traffic analytics often reveals the source and frequency at a glance, where reading raw logs would take an hour.
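The kind of aggregation involved can be sketched in a few lines of Python. This is not how traffic analytics is implemented; it is a minimal illustration of the "misconfiguration or scan" question, assuming the flow records have already been reduced to (source IP, destination port, decision) triples. The function names and the reading of the result are this example's own.

```python
from collections import Counter


def top_denied_sources(flows, limit=3):
    """Count denied flows per source IP and return the busiest sources.

    flows: list of (src_ip, dst_port, decision) with decision "allow" or "deny".
    """
    denied = Counter(src for src, _port, decision in flows if decision == "deny")
    return denied.most_common(limit)


def distinct_ports_denied(flows, src_ip):
    """Number of different destination ports on which src_ip was denied."""
    return len({port for src, port, decision in flows
                if src == src_ip and decision == "deny"})
```

As a rough heuristic, a source denied again and again on one port looks like a client that a rule forgot, while a source denied across many ports looks like a scan. The aggregated view is what makes that difference visible without reading the individual flows.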

How do flow logs and traffic analytics help after the fact?

Flow logs record every allowed and denied flow through an NSG to storage, giving you a historical record of exactly what the security layer did. Traffic analytics aggregates those records into patterns, so you can see who talked to whom, what was denied and from where, and how traffic changed over time. Together they answer questions about past traffic that no live diagnostic can.

The reason this pairing matters for diagnosis is that many problems are intermittent or already over by the time you are looking. A connection that failed at three in the morning, a burst of denied traffic during a deployment, a slow leak of traffic to an unexpected destination: none of these can be reproduced on demand, and a live tool can only test the present. Flow logs preserve the past so you can investigate it, and traffic analytics makes that past legible without you having to parse raw flow records by hand. The investment to make ahead of time is enabling flow logs on the NSGs that matter and pointing traffic analytics at them, so that when an incident is already over, you still have the evidence to reconstruct it. An NSG denying a flow it should permit is exactly the failure mode walked through in our guide to fixing an NSG that blocks traffic unexpectedly, and flow logs are how you prove after the fact whether the NSG was the layer that dropped the traffic.

Topology and VPN diagnostics

Two further tools round out the set, each answering a question the core diagnostics do not. Topology generates a diagram of the resources in a virtual network and the relationships between them, which answers the orientation question: what does the network around this resource actually look like. When you inherit an environment you did not build, or when an incident spans resources whose relationships you are unsure of, topology gives you the map before you start the diagnosis. It is not a troubleshooting tool in the sense of returning a verdict, but it is the tool that prevents you from diagnosing a network whose shape you have misunderstood, and a wrong mental model of the topology is the quiet cause behind many misdirected investigations.

VPN troubleshoot answers the gateway question. When a site-to-site VPN connection or a virtual network gateway will not come up, or comes up and then drops, VPN troubleshoot runs a diagnostic against the gateway or the specific connection and returns a health verdict along with a diagnostic log written to storage. It is the dedicated tool for the connectivity that crosses out of Azure to on-premises or to another network over a tunnel, where the failure modes are specific to the gateway, the connection configuration, and the negotiation with the remote device. For a tunnel that will not establish, VPN troubleshoot is far faster than trying to infer the gateway’s state from the symptoms on either end.

Why is topology a diagnostic tool if it does not return a verdict?

Because most misdiagnoses start from a wrong mental model of the network. Topology shows you the actual resources and relationships, so you diagnose the network that exists rather than the one you imagine. Orienting correctly before you run a single verdict-returning tool prevents the whole class of investigations that chase a cause in a part of the network that was never in the path.

The value is clearest in inherited or sprawling environments. You are paged for a connectivity problem in a subscription you did not design, and the first cost is not the diagnosis itself but the time spent building a mental picture of what connects to what. Topology collapses that cost. With the map in front of you, the questions sharpen: you can see which peering carries the traffic, which gateway sits in the path, which subnet holds the appliance, and you point the verdict-returning tools at the right interfaces on the first try. It is the cheapest insurance against the most expensive mistake, which is diagnosing the wrong part of the network thoroughly and correctly while the actual fault sits somewhere you never looked.

How Network Watcher fits the rest of the network

Network Watcher does not replace your understanding of Azure networking; it instruments it. Every answer it returns is phrased in the primitives of the platform, so the tools are most useful to an engineer who already reasons in routes, next hops, security rules, and peerings. The route-then-filter model is the spine of all of it: a packet leaving an interface is first routed, by the effective route table that combines system routes, user-defined routes, and BGP-learned routes, and then filtered, by the effective security rules that combine the subnet and NIC NSGs. Next hop instruments the routing half of that model and IP flow verify instruments the filtering half, which is why those two tools together cover so much ground. They are not arbitrary features; they are direct windows into the two sequential decisions every packet undergoes.
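The route-then-filter model is small enough to sketch in Python. This is a simplified model for reasoning, not the platform's implementation. It assumes longest-prefix match with user-defined routes preferred over BGP and BGP over system routes on a tie, and it collapses the NSG default rules into a single fall-through deny. Route, Rule, and the three functions are hypothetical names, and the next-hop type strings are illustrative.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop_type: str       # e.g. "VnetLocal", "VirtualAppliance", "None"
    source: str = "system"   # "user", "bgp", or "system"


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int            # lower number is evaluated first
    dst_prefix: str
    dst_port: Optional[int]  # None matches any port
    action: str              # "Allow" or "Deny"


SOURCE_RANK = {"user": 0, "bgp": 1, "system": 2}


def next_hop(routes, dst_ip):
    """Routing half: longest prefix wins, then route source breaks ties."""
    ip = ipaddress.ip_address(dst_ip)
    matches = [r for r in routes if ip in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return min(matches, key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen,
                                       SOURCE_RANK[r.source]))


def flow_verdict(rules, dst_ip, dst_port):
    """Filtering half: the first matching rule in priority order decides."""
    ip = ipaddress.ip_address(dst_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if ip in ipaddress.ip_network(rule.dst_prefix) and rule.dst_port in (None, dst_port):
            return rule.action, rule.name
    return "Deny", "default-deny"  # simplification of the built-in default rules


def trace(routes, rules, dst_ip, dst_port):
    """Route first, then filter, mirroring the two sequential decisions."""
    route = next_hop(routes, dst_ip)
    if route is None or route.next_hop_type == "None":
        return "dropped by routing"
    action, name = flow_verdict(rules, dst_ip, dst_port)
    if action == "Deny":
        return f"denied by rule {name}"
    return f"forwarded via {route.next_hop_type}"
```

The two halves are independent functions on purpose: next hop is a question you can ask of the first, IP flow verify a question you can ask of the second, and a timeout can come from either.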

The interaction with peering is where many engineers first appreciate the regional and model-reading nature of the tools. When two virtual networks are peered and traffic between them fails, the symptom is a timeout, and the cause could be a missing peering, a peering that exists but lacks the right traffic-forwarding settings, an NSG on either side, or a route table that does not account for the peered range. Next hop will return VnetPeering as the type when the route correctly sends traffic across the peering, and something else when it does not, which immediately tells you whether the routing half is healthy. IP flow verify on each side tells you whether either security layer is the cause. The tools decompose a peering failure into the same routing and filtering questions as any other connectivity failure, which is the point: there is no special peering troubleshooter because peering failures are routing or filtering failures, and the existing tools answer them.

The interaction with private endpoints and DNS is the case where the tools have a known boundary worth naming honestly. Network Watcher diagnoses the network path: routing, filtering, reachability. It does not resolve names, so a private endpoint problem that is actually a DNS resolution problem will look healthy to the path tools while the application still fails, because the application is resolving the name to the wrong address before any packet is sent. The discipline is to confirm name resolution separately when a private endpoint is involved, and then use Network Watcher to confirm the path to the resolved address. The tools are honest about what they test; the burden is on you to test the layer they do not, which for private endpoints is resolution.
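Confirming resolution separately can be as simple as the following Python sketch, run from the machine that is failing, since resolution depends on that machine's DNS configuration. It uses only the standard library. The function names are this example's own, and the check assumes that a correctly wired private endpoint name should resolve to a private address, which you should compare against the endpoint's actual IP.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the distinct addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname):
    """True if every resolved address is in a private range."""
    return all(ipaddress.ip_address(addr).is_private
               for addr in resolved_addresses(hostname))
```

If the name comes back with a public address, the path tools were never testing the private endpoint at all, and the fault is in resolution. Only once the name resolves to the expected private address is it worth pointing Network Watcher at that address.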

The interaction with virtual appliances is where next hop earns its place repeatedly. In a hub-and-spoke design where spoke traffic is steered through a central firewall, a user-defined route sets the next hop to the appliance, and when the appliance is unhealthy the symptom is that all egress fails with no obvious cause. Next hop returns VirtualAppliance with the firewall’s IP, which confirms the route is doing what it was designed to do and points you at the appliance rather than the network, and that is often the entire diagnosis: the route is correct, the appliance is down, fix the appliance.

Designing Network Watcher into a production environment

The tools so far are mostly reactive: you reach for them when something is already failing. The mature posture is to make the diagnostic surface exist before the incident, so that when you need it, it is already there and already recording. There are a handful of design decisions that turn Network Watcher from a tool you scramble to set up under pressure into infrastructure that is ready when you are.

The first decision is enablement and governance. Enable Network Watcher in every region you operate in, deliberately, and govern that with policy so a new region does not silently lack the service. The cost of the service itself is negligible; the cost of discovering during an incident that the region has no watcher is real, measured in the minutes you spend reading a missing-watcher error as a network failure. Make the watcher a precondition of operating in a region, not a thing you remember to add.

The second decision is flow logs. Because the historical record only exists from the moment logging is on, the time to enable NSG flow logs is when you create the NSG, not when you wish you had the data. Decide which NSGs are worth logging, which in practice is the ones guarding anything you would be paged about, enable flow logs to a storage account with a sensible retention, and point traffic analytics at them. The payoff is that every after-the-fact investigation has evidence, and the intermittent failures that no live tool can reproduce become tractable because the past was recorded. This is a design choice about NSG hygiene as much as about diagnostics, and it belongs in the same conversation as the rest of your security-group strategy.

The third decision is standing connection monitors on the paths you cannot afford to learn about late. For each critical path, the database connection, the path to an identity provider, the external dependency your revenue depends on, define a connection monitor so its health is telemetry rather than a surprise. Alert on reachability and on latency trend, not just on hard failure, so a degrading path warns you before it breaks. The continuous probe is cheap; the late discovery is not.

The fourth decision is packet capture readiness. Because packet capture needs the agent extension on the target virtual machine, deciding whether that extension is present is a choice you make ahead of time for the machines where you might need to capture. You do not capture continuously, but you ensure the capability is one command away on the machines that matter, so that when you reach the rare case that genuinely needs the bytes, you are not installing an extension during an incident.

How do I make Network Watcher part of infrastructure as code?

Enable the regional watcher, NSG flow logs, traffic analytics, and standing connection monitors through your templates rather than the portal, so the diagnostic surface is reproducible and governed. Treat flow-log enablement as a property of every production NSG and a connection monitor as a property of every critical path, defined alongside the resources they observe rather than added by hand afterward.

The principle is that diagnostics you set up by hand during an incident are diagnostics you did not have when the incident started. Encoding the watcher, the flow logs, and the monitors as code means a new environment is observable the moment it exists, a new region cannot quietly lack a watcher, and a new NSG cannot quietly lack logging, because the template will not let it. This converts the entire reactive scramble into a property of the platform, and it is the difference between an environment where diagnosis is fast because the surface was always there and one where the first ten minutes of every incident are spent building the surface you should have had.
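One way to enforce that property is a check that runs against your environment inventory and fails when the surface has a gap. The Python sketch below is a minimal illustration under stated assumptions: the inventory is passed in as plain data that you would export from your templates or tooling, and the function name, field names, and message wording are all hypothetical.

```python
def audit_surface(regions_in_use, watcher_regions, nsgs):
    """Report gaps in the diagnostic surface.

    regions_in_use:  regions that hold resources
    watcher_regions: regions where a Network Watcher exists
    nsgs:            mapping of NSG name to {"production": bool, "flow_logs": bool}
    """
    findings = []
    for region in sorted(set(regions_in_use) - set(watcher_regions)):
        findings.append(f"region {region} has no Network Watcher")
    for name, props in sorted(nsgs.items()):
        if props["production"] and not props["flow_logs"]:
            findings.append(f"NSG {name} has no flow logs")
    return findings
```

Run as a gate in the deployment pipeline, an empty result means the environment is observable, and a non-empty one names exactly what is missing before an incident has to discover it.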

A worked diagnosis from symptom to cause

Theory becomes useful when you watch it run. Here is a single connectivity failure walked from the symptom to the root cause using nothing but the read-only tools, in the order the tool-answers-the-question rule prescribes, with no speculative changes along the way.

The symptom: an application virtual machine in a spoke virtual network can no longer reach an internal API in a peered spoke. The application logs show connection timeouts to the API’s address on port 443. Nothing changed in the application. Under pressure, the temptation is to widen an NSG rule or restart the API, but the rule says start with the question, and the first question is about routing, because a timeout is equally consistent with a misroute, a black-hole, a filter, or a dead listener.

First, the routing question. Run next hop from the application VM’s IP to the API’s IP. The next-hop type comes back as VnetPeering, which is exactly what it should be for traffic crossing into the peered spoke, and the route that produced it is the peering route. Routing is healthy; the packet is being sent toward the right place across the peering. That eliminates the entire routing layer in one read-only call, and it eliminates the most common silent killer, a user-defined route with next hop None, because the type was not None. Had it come back VirtualAppliance pointing at a firewall, the next move would be to check that appliance, and had it come back None, the diagnosis would already be over with the route table as the cause.

Second, the local filtering question. Run IP flow verify for an outbound TCP flow on the application VM’s NIC, from its local IP to the API’s IP on port 443. The verdict is Allow. The source-side security layer is not blocking the flow, so the NSG on the application side is not the cause. Two read-only calls in, and both the routing and the local filtering layers are confirmed healthy, which already rules out the two most frequent causes and tells you the fault is either on the destination side or above the network layers.

Third, the destination-side and end-to-end question. Run connection troubleshoot from the application VM to the API’s address on port 443. The status comes back Unreachable, and the per-hop detail places the failure at the destination with a security-rule issue. This is the revelation the source-side checks could not provide: the destination’s own NSG is denying the inbound flow. The fault was never on the side you control. To confirm and identify the exact rule, run IP flow verify inbound on the API’s NIC for the same flow, and it returns Deny naming a specific rule. There is the root cause, named precisely: an inbound deny rule on the API’s network security group is rejecting traffic from the application spoke, and the fix is to correct that rule, not to touch anything on the application side.

Notice what did not happen in that diagnosis. No rule was widened on a hunch. No virtual machine was restarted. No packet capture was opened, because the question never descended to the protocol layer; the cheaper tools localized the fault to a specific NSG rule on a specific interface. The whole investigation was a short ordered sequence of read-only questions, each one either eliminating a layer or pointing at the next, and it ended with a named cause and a precise fix. That is the tool-answers-the-question rule in operation, and it is reproducible for almost any connectivity failure you will meet. When you want to run this exact sequence against a reproduced problem in a safe environment, you can run the hands-on Azure labs and command library on VaultBook, where each of these diagnostics is set up against a deliberately broken network so the commands and the readings become muscle memory rather than something you improvise under pressure.

What is the fastest reliable order to diagnose a connectivity failure?

Next hop first to confirm routing, then IP flow verify to confirm the local security decision, then connection troubleshoot to confirm the end-to-end path including the destination side, and only then packet capture if the network plainly works but the conversation does not. Each step is cheaper than the next and either names the cause or hands a clean layer to the following check.

This order is not arbitrary; it tracks the route-then-filter model and then extends past it. Routing happens before filtering, so you confirm routing first. Local filtering is the next decision and the cheapest to test, so it comes second. The end-to-end test covers the destination side and the intermediate hops that the source-side checks cannot see, so it comes third. Packet capture covers the protocol and application layers above the network, so it comes last and only when needed. Following the order means you spend the least effort that the fault allows, and you never reach for the heavy tool to answer a question the light one would have settled.
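The order can be written down as a short Python sketch. Each diagnostic is passed in as a callable so that a later, more expensive check runs only if the earlier one came back clean. The function and parameter names are hypothetical, and the callables stand in for however you actually invoke the tools; the result strings are this example's own wording.

```python
def diagnose(run_next_hop, run_flow_verify, run_connection_troubleshoot, expected_hop):
    """Run the checks cheapest first and stop at the first one that names a layer."""
    hop = run_next_hop()
    if hop == "None":
        return "routing: a route with next hop None is discarding the traffic"
    if hop != expected_hop:
        return f"routing: unexpected next hop {hop}"

    if run_flow_verify() == "Deny":
        return "filtering: the source-side NSG denies the flow"

    if run_connection_troubleshoot() == "Unreachable":
        return "destination side: run inbound IP flow verify on the destination NIC"

    return "network is clean: packet capture is now justified"
```

Fed the readings from the worked diagnosis (VnetPeering, Allow, Unreachable), this returns the destination-side result, and fed a next hop of None it stops after the first call without running anything else.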

The verdict

Azure Network Watcher is not a single troubleshooter you point at a problem; it is a set of instruments, each tuned to one question, and its value is unlocked the moment you stop treating it as a menu and start treating it as a decision. The tool-answers-the-question rule is the whole discipline in one sentence: identify the question your symptom raises, run the one tool that answers it, and descend from the cheap read-only checks to the heavy live ones only as the fault demands. IP flow verify answers whether the security layer permits a flow. Next hop answers where routing sends a packet. Effective security rules answers what the combined policy actually is. Connection troubleshoot answers whether the whole path works end to end, and connection monitor answers whether it keeps working. Packet capture answers what is happening on the wire when everything else says it should be fine. Flow logs and traffic analytics answer what already happened. Topology orients you, and VPN troubleshoot covers the tunnel.

Hold the map, follow the order, and the panic of a connectivity incident becomes a short sequence of yes-or-no questions that end in a named cause and a precise fix. The engineers who resolve these fastest are not the ones who know the most obscure feature; they are the ones who match symptom to question to tool every time, who never widen a rule on a hunch, and who reach for packet capture last rather than first. That habit, more than any single tool, is what Network Watcher is really for: it lets you diagnose by knowing, not by guessing, and in a connectivity incident the difference between knowing and guessing is the difference between a sixty-second fix and an afternoon lost to blind changes.

The recurring connectivity patterns and the tool that matches each

Most connectivity failures an engineer meets are not novel. They fall into a small set of recurring patterns, and each pattern has a signature symptom and a matching tool. Learning the patterns is how you skip the search and go straight to the question, because once you recognize the shape of a failure you already know which tool answers it. Here are the patterns worth committing to memory, each described as the symptom, the question it raises, and the tool that settles it.

The first pattern is the suspected NSG block. The symptom is a connection that times out or is refused, and the suspicion is that a security rule is dropping it. The question is whether the NSG layer would permit the flow, and the tool is IP flow verify. You describe the exact five-tuple, you get Allow or Deny, and if it is Deny you get the rule name. This pattern is the single most common one, which is why IP flow verify is the tool you run most, and the discipline is to confirm the block with the tool before widening any rule, because the rule you would have widened is frequently not the one doing the blocking.

The second pattern is the misrouted egress. The symptom is that outbound traffic to a destination fails, often everything to the internet or everything to a particular range, with no NSG obviously involved. The question is where routing is actually sending the traffic, and the tool is next hop. The revealing answer is usually VirtualAppliance, meaning a user-defined route is steering egress through a firewall or network virtual appliance, which is correct by design but a problem when that appliance is unhealthy. Next hop confirms the route is doing what it was built to do and redirects your attention from the network to the appliance, which is where the fault actually sits.

The third pattern is the black-holed prefix. The symptom is identical to the misrouted egress, a timeout with no obvious filter, but the cause is more severe: a route whose next hop is None, silently discarding every packet bound for the matched prefix. The question is again where routing sends the traffic, and the tool is again next hop, but the answer that defines this pattern is None. When you see None, the route table is the cause without any further checking, and you go directly to the route that matched to remove it, repoint it, or narrow its prefix. This is the pattern engineers most often misread as a firewall, and next hop is what prevents the twenty-minute detour through NSG rules that were never in the path.

The fourth pattern is the shadowed rule. The symptom is that you added an allow rule, the connection still fails, and the NSG you edited plainly shows the allow. The question is what the combined effective policy actually enforces, and the tool is effective security rules. The combined view reveals the deny rule, often in the other NSG or at a lower priority number, that matches first and wins, shadowing your allow. When both rules sit in the same NSG, the fix is to renumber the priorities so the allow precedes the deny, or to narrow the deny; when the deny sits in the other NSG, the fix belongs there, because traffic has to be permitted by both the subnet NSG and the NIC NSG. This pattern is invisible from any single NSG and obvious from the effective set, which is why the effective view is the tool that resolves it.
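A small Python model shows why the allow you added can lose. It assumes inbound traffic is evaluated by the subnet NSG first and the NIC NSG second, that the first matching rule by priority decides within each NSG, and that both must allow. It matches on port only and treats "no matching rule" as a deny, which is a simplification of the built-in default rules. The names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    nsg: str
    name: str
    priority: int         # lower number is evaluated first
    port: Optional[int]   # None matches any port
    action: str           # "Allow" or "Deny"


def first_match(rules, port):
    """Return the rule that decides within one NSG, or None if nothing matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port in (None, port):
            return rule
    return None


def inbound_verdict(subnet_rules, nic_rules, port):
    """Subnet NSG first, then NIC NSG; either one denying ends the evaluation."""
    hit = None
    for rules in (subnet_rules, nic_rules):
        hit = first_match(rules, port)
        if hit is None or hit.action == "Deny":
            return "Deny", hit
    return "Allow", hit
```

With an allow on port 443 in the NIC NSG and a deny on 443 in the subnet NSG, the verdict is Deny and the deciding rule is the one in the NSG you were not looking at, which is the shadowing the effective view exposes.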

The fifth pattern is the destination-side block. The symptom is a connection that fails even though every source-side check is clean: routing is correct, the source NSG permits the flow, and yet the connection does not complete. The question is whether the whole path works end to end, and the tool is connection troubleshoot, which exercises the actual path and reports the failing hop. When it places the failure at the destination, you follow with an inbound IP flow verify on the destination NIC to name the deny rule. This pattern is the reason connection troubleshoot exists: it sees the layers on the other end that source-side tools cannot, and it is how you discover the fault was never on the side you control.

The sixth pattern is the application-layer mystery. The symptom is that the network plainly works, every layer check is clean, the path is reachable, and yet the application still misbehaves: a handshake that stalls, intermittent resets, throughput that collapses. The question is what is actually happening on the wire, and the tool is packet capture, the only one that shows the literal traffic. This is the pattern where the heavy tool is the right tool, because the fault lives above routing and filtering, in the protocol or application conversation, where only the bytes themselves can tell you what is wrong. Recognizing this pattern is what tells you the cheaper tools have nothing left to offer and the capture is justified.

The seventh pattern, and the one easiest to forget, is the already-over incident. The symptom is a failure that happened in the past and cannot be reproduced: a connection that dropped overnight, a burst of denied traffic during a deployment, a pattern change after a release. No live tool can test the past, so the question is what actually happened, and the tools are NSG flow logs and traffic analytics, provided you enabled them before the incident. This pattern is the argument for designing the diagnostic surface ahead of time, because the only way to investigate an already-over failure is to have been recording when it occurred.

How do I tell a routing fault from a filtering fault when the symptom is identical?

Run next hop and IP flow verify in sequence. If next hop returns None or an unexpected type, the fault is routing and lives in the route table. If next hop is correct but IP flow verify returns Deny, the fault is filtering and lives in an NSG. The two read-only checks separate the two layers in under a minute, which is why they are the first two questions for any timeout.

The deeper point is that the symptom genuinely cannot distinguish these causes, so guessing is hopeless and testing is fast. A timeout from a black-holed route, a timeout from a misrouted packet to a dead appliance, and a timeout from an NSG deny are indistinguishable to the application, which sees only that nothing came back. The platform, however, knows exactly which decision it made, and next hop and IP flow verify read that knowledge directly. This is the whole philosophy in miniature: rather than infer the cause from the consequence, which is ambiguous, ask the platform what decision it took, which is definite. Two questions, two definite answers, and the ambiguous symptom resolves into a specific layer.

Common misdiagnoses and how to avoid them

The tools are reliable; the errors are in how engineers use them, and the same few mistakes recur often enough to be worth naming so you can catch yourself making them.

The first misdiagnosis is reaching for packet capture first. It feels decisive under pressure, but it is the slowest path to an answer the cheap tools usually already hold, and it adds the overhead of setup, capture time, download, and analysis to a question that IP flow verify settles in seconds. The correction is to honor the order: capture only when the network plainly works and the conversation does not. If you find yourself opening a capture before you have run next hop and IP flow verify, stop and run them first, because the capture will almost always confirm what they would have told you for free.

The second misdiagnosis is describing the wrong flow to IP flow verify. The tool answers the question you ask, so if you give it the wrong five-tuple, the wrong direction, or a translated address instead of the address the NIC actually sees, you get a correct answer to a question that is not yours. The correction is to describe the flow from the targeted interface’s perspective: the local side is that NIC, the direction is relative to it, and the remote address is the one the NIC genuinely communicates with rather than a public or load-balanced front that has been translated before the NIC sees it.
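The perspective rule is easy to get wrong when you move from the source NIC to the destination NIC, so here is a minimal Python sketch of it. FlowQuery and mirror are hypothetical names, the addresses are made up, and the ephemeral source port is an arbitrary illustrative value. The sketch only shows how local, remote, and direction swap when the same flow is described from the other interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowQuery:
    nic: str          # the interface the question is asked of
    direction: str    # "Outbound" or "Inbound", relative to nic
    protocol: str
    local_ip: str     # address of nic
    local_port: int
    remote_ip: str    # address nic actually exchanges packets with
    remote_port: int


def mirror(query, peer_nic):
    """Describe the same flow from the peer interface's point of view."""
    flipped = "Inbound" if query.direction == "Outbound" else "Outbound"
    return FlowQuery(
        nic=peer_nic,
        direction=flipped,
        protocol=query.protocol,
        local_ip=query.remote_ip,
        local_port=query.remote_port,
        remote_ip=query.local_ip,
        remote_port=query.local_port,
    )


# Outbound from the application NIC to the API on 443 (illustrative values).
source_side = FlowQuery("app-nic", "Outbound", "TCP", "10.1.0.4", 50000, "10.2.0.5", 443)
destination_side = mirror(source_side, "api-nic")
```

On the destination side the API's address and port 443 become the local side and the direction becomes Inbound. Asking the destination NIC the outbound question instead would return a correct answer to a flow that does not exist.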

The third misdiagnosis is forgetting that flow logs require an NSG and only record from the moment they are enabled. Engineers reach for historical evidence during an incident and discover there is none, because logging was never turned on, or because there is no NSG in the path to log against. The correction is structural: enable flow logs on the NSGs that matter as part of building them, so the evidence exists before you need it, and remember that a path with no NSG produces no flow log because the log is a record of NSG decisions.

The fourth misdiagnosis is treating a private endpoint connectivity failure as a path problem when it is a name resolution problem. The path tools will report a healthy path to whatever address the name resolved to, while the application fails because the name resolved to the wrong address entirely. The correction is to confirm resolution separately whenever a private endpoint is involved, because Network Watcher tests the path, not the name, and a clean path to the wrong address looks exactly like success to the tools and like failure to the application.

The fifth misdiagnosis is diagnosing the wrong region. Because Network Watcher is regional, a diagnostic against a resource in a region with no watcher fails with a missing-watcher error that reads, to a stressed engineer, like a connectivity failure. The correction is to confirm a watcher exists in the target region before reading any error as a network verdict, and to enable watchers everywhere as a matter of design so this class of false alarm never arises.

Why does my diagnostic return an error instead of a verdict?

The most common cause is a missing Network Watcher in the target region, since the service is regional and a diagnostic needs a watcher where the resource lives. The next most common is insufficient permission: the tools only read network state, but each is invoked as a Network Watcher operation, so your role has to grant that operation and not just visibility of the resource. Confirm the watcher exists and that your role grants the needed access before treating the error as a network problem.

The reason this matters is that an infrastructure or permission error and a genuine connectivity failure can look superficially similar when you are moving fast, and conflating them sends you diagnosing a network that was never the problem. The habit that prevents it is to read the error text rather than pattern-matching it to a timeout: a missing-watcher error names the missing watcher, an authorization error names the missing permission, and neither is a statement about the network. A few seconds of actually reading the error saves the minutes you would otherwise spend diagnosing a phantom.

Permissions, cost, and operational notes

The diagnostic tools read network state, so they require read access to the network resources involved, and the live tools that act, such as packet capture, require the corresponding action permissions plus a storage account to write to. The practical guidance is to grant operators the network reader access they need to run the read-only diagnostics freely, because those tools are safe against production, and to scope the heavier capabilities like packet capture more deliberately, since they touch live traffic and write data. Exact role names and the permissions each grants are the kind of detail that shifts over time, so verify the current role definitions against the official documentation when you set up access rather than relying on a remembered role name.

On cost, the read-only diagnostics are effectively free to run, which is part of why they belong at the top of the decision: there is no reason not to run next hop or IP flow verify, so run them first and often. The tools that store data, flow logs and traffic analytics in particular, carry the cost of the storage and processing they consume, which scales with traffic volume and retention. The sensible posture is to enable logging where the diagnostic value justifies the storage, set a retention that matches how far back you would realistically investigate, and revisit it rather than logging everything forever by default. Packet capture’s cost is the storage for the capture file and the brief runtime footprint of observing traffic, both of which are bounded because you set a time or size limit on every capture.

Operationally, the single highest-leverage habit is to make the diagnostic surface exist before the incident. A watcher in every region, flow logs on the NSGs that matter, connection monitors on the critical paths, and packet-capture readiness on the machines where you might need it: each of these is cheap to set up ahead of time and expensive to wish for during an outage. The engineers who diagnose fastest are not improvising the surface under pressure; they built it in advance, encoded it as code, and so begin every incident with the instruments already in place and already recording. That preparation, more than familiarity with any single tool, is what separates a fast diagnosis from a slow one.

Turning diagnostics into alerts and automation

The tools are most valuable when their answers reach you before a human reports the problem. Connection monitor is the bridge from manual diagnosis to standing telemetry, because its continuous results can drive alerts on reachability and on latency trend, so a degrading path warns you while it is still merely slow rather than fully broken. The pattern that pays off is to alert not only on a hard failure but on the early signal: a round-trip time that has crept upward, a probe success rate that has dipped below its baseline, a path that has begun to flap. Each of these precedes an outage, and catching the precursor is what turns a would-be incident into a quiet ticket nobody outside the team ever notices.

Traffic analytics extends the same idea to the security layer. Because it aggregates the historical record into patterns, it is the natural place to watch for the shapes that signal trouble: a rise in denied attempts from an unexpected source, a sudden appearance of traffic to a destination that was never part of the design, a shift in the talkers around a sensitive subnet after a release. These are not point-in-time questions, so no live probe answers them; they are pattern questions, and the aggregated view is what makes a developing problem legible while it is still small. Pairing traffic analytics with alerting on those patterns means the security posture reports its own drift rather than waiting for an audit to discover it.

The deeper principle is that diagnosis and monitoring are the same questions on different time horizons. The question IP flow verify answers about a single hypothetical, flow logs answer about every real connection recorded within the retention window, and traffic analytics answers about the aggregate trend. The question connection troubleshoot answers once, connection monitor answers continuously. An environment that treats these as one continuum, rather than as separate reactive and proactive toolkits, gets the compounding benefit: the same mental model that localizes a fault under pressure also tells you which paths to monitor, which security groups to log, and which patterns to alert on, so the next fault is caught earlier and diagnosed faster than the last. The tools reward the engineer who thinks past the immediate incident to the standing instrumentation that prevents the next one.

How do I get warned before a path fails completely?

Put a connection monitor on the path and alert on its latency trend and probe success rate, not only on hard failure. A path whose round-trip time is climbing or whose success rate is slipping is failing slowly, and that early signal gives you time to act before users notice. The continuous probe is cheap and the late discovery is expensive, which is the entire argument for standing monitors on anything you cannot afford to learn about late.
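The early-signal idea can be sketched in a few lines. This is a minimal illustration, not connection monitor's own alerting logic: `ProbeResult`, `early_warnings`, and the two thresholds (`rtt_factor`, `min_success`) are hypothetical names and values chosen for the example, and the probe data is made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class ProbeResult:
    reachable: bool
    rtt_ms: Optional[float]  # None when the probe got no reply


def early_warnings(baseline: List[ProbeResult], recent: List[ProbeResult],
                   rtt_factor: float = 2.0, min_success: float = 0.95) -> List[str]:
    """Compare a recent probe window with a healthy baseline window."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one probe")
    success = sum(p.reachable for p in recent) / len(recent)
    if success == 0:
        return ["hard failure: no probe succeeded"]
    warnings = []
    # Early signal 1: the path still works, but less reliably than required.
    if success < min_success:
        warnings.append(f"success rate {success:.0%} is below {min_success:.0%}")
    # Early signal 2: the path still works, but has become much slower.
    base_rtts = [p.rtt_ms for p in baseline if p.rtt_ms is not None]
    recent_rtts = [p.rtt_ms for p in recent if p.rtt_ms is not None]
    if base_rtts and recent_rtts and mean(recent_rtts) > rtt_factor * mean(base_rtts):
        warnings.append(f"round-trip time {mean(recent_rtts):.0f} ms is over "
                        f"{rtt_factor:g}x the baseline {mean(base_rtts):.0f} ms")
    return warnings


healthy = [ProbeResult(True, 10.0)] * 10
degrading = [ProbeResult(True, 30.0)] * 9 + [ProbeResult(False, None)]
for warning in early_warnings(healthy, degrading):
    print(warning)
```

The degrading window is still mostly reachable, so an alert on hard failure alone would stay silent, while both early signals fire.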

Frequently Asked Questions

Q: What does Azure Network Watcher do, in one sentence?

Network Watcher is the regional Azure diagnostic service that lets you inspect and test the network state of resources in a region, so you can localize a connectivity fault to its exact cause rather than guessing. It is a set of tools, each tuned to one question: whether an NSG would allow a flow, where routing sends a packet, what the combined security policy actually is, whether an end-to-end path works, what traffic is on the wire, and what traffic happened in the past. Most of these tools are read-only inspections of the platform’s own model of the network, which makes them safe to run against production at any time. The skill is not knowing every tool but matching the symptom in front of you to the single question that confirms or eliminates a cause, then running the tool that answers it.

Q: Do I have to enable Network Watcher separately in every Azure region?

Yes. Network Watcher is a regional service, and an instance in one region cannot inspect resources in another. A diagnostic call targets the region that holds the resource, so if you operate across several regions you enable the service in each one, or the call returns an error naming the missing watcher rather than a network verdict. Older subscriptions often carry watchers that Azure auto-created the first time a virtual network was made in a region, but the current best practice is explicit enablement so the service exists deliberately and can be governed by policy. The practical check before any diagnostic is simply to confirm a watcher exists in the target region. Enabling watchers everywhere you operate, as a matter of design rather than reaction, prevents the false alarm where a missing-watcher error gets read as a connectivity failure during an incident.

Q: What exactly does IP flow verify tell me, and what does it not test?

IP flow verify takes a direction and a five-tuple (a protocol, a local IP and port, and a remote IP and port), relative to a target network interface, and returns Allow or Deny, naming the deciding rule when it denies. It does this by evaluating the network security group rules in force on that interface against the hypothetical flow you describe. What it does not do is send a packet, test routing, or check whether the destination is listening. It answers one question only: would the security layer permit a packet matching this description. That narrowness is its strength, because an Allow eliminates the NSG as a cause with certainty and a Deny names the exact rule to fix. The most common error in using it is describing the wrong flow, such as a translated public address instead of the address the NIC actually sees, which yields a correct answer to the wrong question.
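To make concrete why the answer is a pure rule evaluation and not a network test, here is a toy model of the idea. It is a simplified sketch, not the platform's implementation: `Rule` and `verify_flow` are hypothetical names, and the model ignores service tags, port ranges, and application security groups. Note that nothing in it sends a packet.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int       # lower number is evaluated first
    direction: str      # "Inbound" or "Outbound"
    access: str         # "Allow" or "Deny"
    protocol: str       # "TCP", "UDP", or "*"
    remote_prefix: str  # CIDR block the remote address must fall in
    dest_port: str      # a single port, or "*" for any


def verify_flow(rules: List[Rule], direction: str, protocol: str,
                remote_ip: str, dest_port: int) -> Tuple[str, str]:
    """Return (access, rule name) for the first rule that matches the flow."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.direction != direction:
            continue
        if rule.protocol not in ("*", protocol):
            continue
        if ip_address(remote_ip) not in ip_network(rule.remote_prefix):
            continue
        if rule.dest_port not in ("*", str(dest_port)):
            continue
        return rule.access, rule.name
    return "Deny", "no-matching-rule"


rules = [
    Rule("allow-https", 200, "Inbound", "Allow", "TCP", "0.0.0.0/0", "443"),
    Rule("DenyAllInBound", 65500, "Inbound", "Deny", "*", "0.0.0.0/0", "*"),
]
print(verify_flow(rules, "Inbound", "TCP", "203.0.113.7", 443))
print(verify_flow(rules, "Inbound", "TCP", "203.0.113.7", 22))
```

The second call is denied by the catch-all rule, and the result names it, which is the same shape of answer the real tool gives: a verdict plus the rule to look at.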

Q: How does next hop reveal a routing problem I cannot see from the symptom?

Next hop takes a source IP on an interface and a destination IP and returns the next-hop type the platform’s routing would choose, the next-hop IP where relevant, and the route that produced the decision. The next-hop types you will meet most often are Internet, VirtualNetworkGateway, VirtualAppliance, VnetLocal, VnetPeering, and None. This matters because routing faults are invisible from the application’s perspective: a packet sent to a downed appliance or dropped by a route whose next hop is None produces the same timeout as an NSG block or a dead listener. Next hop reads the effective route table directly and tells you the decision, so a type of VirtualAppliance points you at a firewall that may be down, and a type of None tells you a route is silently discarding the traffic. In both cases the cause sits in the routing path, which a symptom alone could never have told you, and the fix lives in the route table or the appliance it points to rather than in the NSG or the application.

Q: What does it mean when next hop returns a type of None?

A next-hop type of None means a route matched your destination and that route discards every packet bound for the matched prefix, a behavior often called black-holing. It is the most severe routing fault and the one most often misread as a firewall block, because a black-holed packet behaves identically to a filtered one from the outside: the application sees only a timeout. When next hop returns None, the route table is the cause with no further checking required, and you go straight to the route that matched. The fix is to remove that route, repoint its next hop to where the traffic should go, or narrow its address prefix so it stops swallowing the traffic you care about. Recognizing None as a definitive routing verdict is what saves you from the common detour of auditing NSG rules that were never in the path while a single black-hole route quietly drops everything.
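A small model shows how a narrow route can swallow traffic that a broader route would have delivered. This sketch assumes only longest-prefix matching, which is how Azure chooses among routes that match a destination; it leaves out the tie-break between user-defined, BGP, and system routes. `Route` and `pick_route` are hypothetical names, and the addresses are illustrative.

```python
from ipaddress import ip_address, ip_network
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    prefix: str
    next_hop_type: str
    next_hop_ip: Optional[str] = None


def pick_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Return the matching route with the longest prefix, or None if nothing matches."""
    dest = ip_address(destination)
    matches = [r for r in routes if dest in ip_network(r.prefix)]
    if not matches:
        return None
    return max(matches, key=lambda r: ip_network(r.prefix).prefixlen)


routes = [
    Route("10.0.0.0/16", "VnetLocal"),
    Route("0.0.0.0/0", "VirtualAppliance", "10.0.100.4"),
    Route("10.0.5.0/24", "None"),  # black-hole route for one subnet
]
for dest in ("10.0.1.4", "10.0.5.9", "8.8.8.8"):
    route = pick_route(routes, dest)
    print(dest, "->", route.next_hop_type, "via", route.prefix)
```

The destination 10.0.5.9 sits inside the virtual network range, yet the /24 route is more specific than the /16, so it wins and the traffic is discarded. Narrowing or removing that one route is the whole fix.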

Q: Why does the effective security rule set differ from the NSG I configured?

A single interface can be governed by two NSGs at once, one on the subnet and one on the NIC, and each carries the built-in default rules alongside the rules you wrote. The platform evaluates each NSG in priority order, and a flow must be allowed by both: inbound traffic is checked against the subnet NSG first and then the NIC NSG, and outbound traffic in the reverse order. So the rule actually deciding a flow may live in the NSG you are not looking at, or be a default rule you forgot, or be a higher-priority rule that matches first and shadows the one you wrote. Effective security rules returns the rules the platform actually enforces on the interface across both NSGs, which reveals these interactions immediately. The classic case is adding an allow rule that never takes effect because a deny rule matches first, either at a lower priority number in the same NSG or in the other NSG, where no allow elsewhere can override it. The single-NSG view hides this; the effective view shows both sets of rules together, and the fix becomes obvious: renumber the priorities within the NSG or narrow the deny in the other.

Q: When should I use connection troubleshoot instead of IP flow verify?

Use IP flow verify when you want to test the security decision on one interface, instantly and without generating traffic. Use connection troubleshoot when the per-layer checks pass but the connection still fails, because it runs an actual end-to-end reachability test along the path and identifies the breaking hop, including problems on the destination side or in a downstream appliance that source-side tools cannot see. IP flow verify is the cheaper, read-only probe and belongs first; connection troubleshoot is heavier because it generates real probe traffic and walks the path, so it belongs a step later. A reliable sequence is next hop to confirm routing, IP flow verify to confirm the local security decision, then connection troubleshoot to confirm the path completes. When connection troubleshoot places the failure at the destination, you follow with an inbound IP flow verify on the destination NIC to name the deny rule.

Q: What is the difference between connection troubleshoot and connection monitor?

Connection troubleshoot runs once and reports the path state at that moment, which makes it the tool for active debugging when you have an incident open and a hypothesis to confirm. Connection monitor runs continuously on a schedule, probing reachability and latency between endpoints and recording the results over time, which makes it the tool for ongoing assurance of a path you depend on. The continuous record is what catches intermittent failures that no point-in-time test will ever see, such as a connection that drops for thirty seconds once a day, and it surfaces latency trends before they become outages. A path that is still reachable but whose round-trip time has climbed tenfold is a warning connection monitor delivers and a one-time test cannot. Use the one-time test while debugging and a standing connection monitor for any path whose late discovery would be costly, such as a database connection or an external dependency.

Q: When is packet capture the right tool rather than a faster check?

Reach for packet capture only after next hop, IP flow verify, and connection troubleshoot have each come back clean and the application still fails. At that point the fault sits above the routing and filtering layers, in the protocol or application conversation itself, and you need to see the actual packets: a TLS handshake that stalls, a server that resets immediately after the SYN, retransmits that signal packet loss, or an MTU problem that fragments and hangs. Packet capture is the only Network Watcher tool that observes real traffic, which makes it the most authoritative and the heaviest. It needs the agent extension on the target virtual machine, it observes live traffic so it carries a runtime footprint, and it produces a file you must download and analyze rather than an answer. For any fault the cheaper tools can localize, a capture is wasted effort; reserve it for the genuinely mysterious case where the network plainly works but the conversation does not.

Q: Why do NSG flow logs require a network security group?

Flow logs are fundamentally a record of NSG evaluations: they capture the flows that traversed a network security group, allowed and denied, with the five-tuple and the decision. Without an NSG in the path there is nothing to record, because the log exists to document what an NSG decided about the traffic it saw. This has two practical consequences. First, a path with no NSG produces no flow log, so if you want historical evidence you must have an NSG in the path. Second, the record only exists from the moment logging is enabled, so the time to turn it on is when you build the NSG, not when you wish you had the data during an incident. Treating flow-log enablement as a standard property of every production NSG, rather than something you add reactively, is what ensures every after-the-fact investigation has the evidence it needs.

Q: How does traffic analytics build on flow logs?

Flow logs are a firehose of individual flow records; traffic analytics processes those records into aggregated patterns. Where a raw flow log gives you one flow at a time, traffic analytics shows you which hosts talk to which, how much traffic flows where, what is being denied and from which sources, which ports are most active, and how the overall pattern changed over time. It is the tool for understanding the shape of your traffic rather than a single flow, which makes it far faster for questions like whether a stream of denied traffic is a misconfiguration or a scan, what is actually communicating with a subnet, or whether a deployment shifted the traffic pattern. Reading those answers from raw logs would take an hour of parsing; traffic analytics surfaces them at a glance. It depends on flow logs being enabled, so the two are configured together, and they answer the past-traffic questions no live diagnostic can.
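The step from individual records to a pattern can be illustrated with a small aggregation. This sketch assumes the comma-separated flow tuple layout used by NSG flow logs, where the second field is the source IP and the eighth is the decision (`A` for allowed, `D` for denied); verify the layout against the current documentation before relying on it. The function name `denied_by_source` and the sample tuples are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def denied_by_source(flow_tuples: Iterable[str]) -> List[Tuple[str, int]]:
    """Count denied flows per source IP, most frequent first."""
    counts: Counter = Counter()
    for line in flow_tuples:
        fields = line.split(",")
        src_ip, decision = fields[1], fields[7]
        if decision == "D":
            counts[src_ip] += 1
    return counts.most_common()


# Fields: time, source IP, destination IP, source port, destination port,
# protocol, direction, decision, then flow state and counters.
sample = [
    "1542110377,203.0.113.9,10.0.0.4,44931,22,T,I,D,B,,,,",
    "1542110402,203.0.113.9,10.0.0.4,44932,3389,T,I,D,B,,,,",
    "1542110424,198.51.100.4,10.0.0.4,51000,22,T,I,D,B,,,,",
    "1542110437,10.0.0.5,10.0.0.4,40000,443,T,I,A,B,,,,",
]
for source, count in denied_by_source(sample):
    print(source, count)
```

One source probing several management ports reads as a scan; many internal sources denied on one application port reads as a misconfiguration. That distinction is visible only in the aggregate.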

Q: Can Network Watcher diagnose a private endpoint or DNS resolution problem?

Network Watcher diagnoses the network path, meaning routing, filtering, and reachability. It does not resolve names. So a private endpoint problem that is actually a name resolution problem will look healthy to the path tools while the application still fails, because the application resolved the name to the wrong address before any packet was sent, and the tools are correctly reporting a healthy path to whatever address was resolved. The discipline is to confirm name resolution separately whenever a private endpoint is involved, and then use Network Watcher to confirm the path to the resolved address. The tools are honest about what they test; the burden is on you to test the resolution layer they do not. A clean path to the wrong address looks exactly like success to Network Watcher and like failure to the application, which is why DNS must be checked on its own when a private endpoint is in play.
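The separate resolution check can be as simple as resolving the name from the affected machine and looking at what comes back. A minimal sketch using only the Python standard library follows; `resolve` and `classify` are hypothetical helper names, and the host name in the usage note is a placeholder, not a real endpoint.

```python
import socket
from ipaddress import ip_address
from typing import Dict, List


def resolve(name: str) -> List[str]:
    """Resolve a host name with the machine's own resolver and return its addresses."""
    infos = socket.getaddrinfo(name, None)
    return sorted({info[4][0] for info in infos})


def classify(addresses: List[str]) -> Dict[str, str]:
    """Label each address as private or public.

    is_private covers the RFC 1918 ranges plus other non-global ranges
    such as loopback and link-local.
    """
    return {addr: ("private" if ip_address(addr).is_private else "public")
            for addr in addresses}


print(classify(["10.1.2.3", "52.1.2.3"]))
```

Run on the virtual machine that is failing, `classify(resolve("myaccount.blob.core.windows.net"))` with your own host name should show a private address from your virtual network when the private endpoint's DNS is wired correctly. A public address means the application is being sent around the private endpoint before any packet reaches the path the Network Watcher tools inspect.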

Q: How do I diagnose a virtual network peering that came up but carries no traffic?

Treat it as the routing and filtering questions it actually is, because there is no separate peering troubleshooter. Run next hop from a source in one virtual network to a destination in the peered one. A next-hop type of VnetPeering means routing correctly sends the traffic across the peering, so the routing half is healthy; anything else means the route is the problem, often a missing peering link or a route table that does not account for the peered range. Then run IP flow verify on each side to confirm neither security layer is denying the flow. Between next hop and IP flow verify you decompose the peering failure into the same routing and filtering checks as any other connectivity fault. If both come back clean and traffic still does not flow, connection troubleshoot exercises the end-to-end path and points at the destination-side layer, which is frequently an NSG on the far end of the peering.

Q: Does packet capture require anything to be installed on the virtual machine?

Yes. Packet capture relies on the Network Watcher agent extension on the target virtual machine, because the capture observes traffic at the machine rather than reading a platform model. This is part of why packet capture sits at the bottom of the diagnostic order: it has a setup prerequisite the read-only tools do not. The operational implication is to decide ahead of time which machines might need a capture and ensure the extension is present or one command away on those machines, so that when you reach the rare case that genuinely requires the bytes you are not installing an extension in the middle of an incident. You also need a storage account for the capture to write to, and you set a time or size limit so the capture stops on its own. Preparing the capability in advance on the machines that matter turns a slow, multi-step setup into a single command when the moment arrives.

Q: What permissions and costs come with running these diagnostics?

The read-only diagnostics read network state, so they need read access to the network resources involved, and they are effectively free to run, which is exactly why they belong first in the order: there is no reason not to run next hop or IP flow verify, so run them often. The live tools that act, such as packet capture, require the corresponding action permissions plus a storage account to write to. On cost, the tools that store data, primarily flow logs and traffic analytics, carry the storage and processing cost that scales with traffic volume and retention, so enable logging where the diagnostic value justifies the spend and set a retention that matches how far back you would realistically investigate. Exact role names and pricing shift over time, so verify the current role definitions and rates against the official Azure documentation when you set up access, rather than relying on a remembered figure.

Q: What is the single habit that makes connectivity diagnosis fastest?

Match the symptom to one question and run the one tool that answers it, descending from the cheap read-only checks to the heavy live ones only as the fault demands. Concretely, that means next hop first to confirm routing, then IP flow verify to confirm the local security decision, then connection troubleshoot to confirm the end-to-end path, and packet capture last and only when the network plainly works but the conversation does not. The engineers who resolve these fastest are not the ones who know the most obscure feature; they are the ones who never widen a rule on a hunch, never reach for packet capture first, and never diagnose the wrong region or the wrong layer. The second habit that compounds the first is preparation: a watcher in every region, flow logs on the NSGs that matter, and connection monitors on the critical paths, all in place before the incident, so diagnosis starts with the instruments already recording.
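The descending order reads naturally as a small decision function. This is only an encoding of the sequence described above, with a hypothetical name `next_step`; each argument is `True` (check passed), `False` (check failed), or `None` (not run yet).

```python
from typing import Optional


def next_step(routing_ok: Optional[bool] = None,
              nsg_allows: Optional[bool] = None,
              path_ok: Optional[bool] = None) -> str:
    """Return the next action in the cheap-to-heavy diagnostic order."""
    if routing_ok is None:
        return "run next hop"
    if routing_ok is False:
        return "fix the route that next hop returned"
    if nsg_allows is None:
        return "run IP flow verify"
    if nsg_allows is False:
        return "fix the rule that IP flow verify named"
    if path_ok is None:
        return "run connection troubleshoot"
    if path_ok is False:
        return "inspect the hop that connection troubleshoot flagged"
    # Routing, filtering, and the end-to-end path are all clean.
    return "run packet capture"


print(next_step())
print(next_step(routing_ok=True))
print(next_step(routing_ok=True, nsg_allows=True, path_ok=True))
```

Packet capture is reachable only through the final return, after every cheaper check has come back clean, which is the point of the ordering.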

Q: How do I make the Network Watcher diagnostic surface reproducible?

Define it as code rather than clicking it together in the portal. Enable the regional watcher, NSG flow logs, traffic analytics, and standing connection monitors through your templates, treating flow-log enablement as a property of every production NSG and a connection monitor as a property of every critical path, declared alongside the resources they observe. The principle is that diagnostics you set up by hand during an incident are diagnostics you did not have when the incident started. Encoding the surface as code means a new environment is observable the moment it exists, a new region cannot quietly lack a watcher, and a new NSG cannot quietly lack logging, because the template will not allow it. This converts the reactive scramble into a property of the platform, which is the difference between an environment where the first ten minutes of every incident are spent building the diagnostic surface and one where that surface was always there, always governed, and always recording.
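One way to enforce that property is a check that runs in the deployment pipeline and fails when the surface has a gap. The sketch below assumes you have already exported an inventory of regions in use, regions with a watcher, and NSGs with their flow-log status; how you gather that inventory is up to your tooling. The function name `audit_surface` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def audit_surface(regions_in_use: Iterable[str],
                  watcher_regions: Iterable[str],
                  nsg_flow_logs: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Return (regions lacking a watcher, NSGs lacking flow logs)."""
    missing_watchers = sorted(set(regions_in_use) - set(watcher_regions))
    unlogged_nsgs = sorted(name for name, enabled in nsg_flow_logs.items()
                           if not enabled)
    return missing_watchers, unlogged_nsgs


missing, unlogged = audit_surface(
    regions_in_use=["eastus", "westeurope"],
    watcher_regions=["eastus"],
    nsg_flow_logs={"nsg-web": True, "nsg-db": False},
)
print("regions without a watcher:", missing)
print("NSGs without flow logs:", unlogged)
```

Failing the pipeline whenever either list is non-empty is what makes the gap impossible to ship quietly.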

