Fix Azure VM SSH Connection Refused

When an Azure VM SSH connection refused message lands in your terminal at the worst possible moment, the instinct is to assume the machine is gone and start planning a rebuild. That instinct is almost always wrong, and acting on it destroys the one thing that would have told you what actually broke. The phrase your client printed is not noise. It is the single most precise diagnostic you will get for free, and the difference between “connection refused,” “connection timed out,” and “permission denied (publickey)” points at three completely different layers of the stack. Refused means the host answered and told you nothing is listening on port 22. Timed out means nothing answered at all. Permission denied means the network and the daemon are both fine and the problem is purely authentication. Read the message first, and you eliminate two thirds of the possible causes before you touch a single setting.

This article treats the SSH error string as the entry point to a directed diagnosis rather than a generic “VM is unreachable” panic. You will learn to decode each message into the layer it implicates, confirm which specific cause is yours with a command that proves it, apply the matching fix, and recover access through the serial console or the VMAccess extension when you cannot get a shell at all. The recovery paths matter because the most common mistake after misreading the error is reaching for a redeploy, which often wipes the evidence and, on a VM with an ephemeral or improperly backed disk, can lose data outright. The goal is to get you back into the machine, understand why you were locked out, and leave you with the prevention that stops it from happening again.


Read the SSH Error String Before You Change Anything

Every productive SSH diagnosis on Azure starts the same way: you read exactly what the client printed and you resist the urge to act until you know which layer it names. The three message families are not interchangeable, and conflating them is the root of most wasted hours. A connection that is refused has reached the host. Something on the other end actively returned a TCP reset because no process was bound to the port the client knocked on. A connection that times out reached nobody. The packets left your machine, crossed the internet, and vanished into a dropped state somewhere along the path, which on Azure almost always means a Network Security Group, a firewall, or a route problem swallowed them silently. A connection that produces “Permission denied (publickey)” completed the TCP handshake, negotiated the SSH protocol, reached a running daemon, and failed only when the daemon evaluated your credential. These are three different worlds, and the message tells you which one you are in.

What does “port 22: Connection refused” actually mean?

It means a host at that address answered your TCP request and explicitly rejected it with a reset packet, because nothing accepted the connection on port 22. The network path is open. The cause is local to the machine: sshd is stopped, crashed, or bound to a different port, or a host firewall such as ufw or firewalld is rejecting the connection.

That distinction is the whole game. If the client says refused, you can stop thinking about Network Security Groups, Azure Firewall, route tables, and peering, because those failures present as a timeout, not a reset. The host is up enough to answer at the TCP layer, which means the operating system booted, the network interface attached, and the kernel is processing packets. What is missing is a listener on port 22. The work moves inside the VM: check whether the sshd process is alive, whether the disk filled and starved it, whether a configuration change moved it to a non-default port, and whether a local firewall on the guest closed the door. A refused message is a gift, because it has already told you the network is not your problem.

Why does the difference between refused and timed out matter so much?

Because they live on opposite sides of the network boundary and share almost no causes. Refused is a host-and-service problem you fix inside the VM. Timed out is a path problem you fix in Azure networking, usually a missing or misordered NSG rule. Reading which one the client printed eliminates an entire half of the cause list instantly, before you run a single command.

This is the refused-versus-timed-out rule, and it is the most useful single idea in this article: refused means sshd or the local port, timed out means the network path, and the message your client already printed has done half the diagnosis for you. Engineers who skip this step regenerate SSH keys to fix a deleted NSG rule, restart sshd to fix a network route, and open support cases for problems the error string had already localized. The discipline is small and the payoff is large. Before you open the portal, before you reset anything, read the message and name the layer.
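The same split can be checked without an SSH client at all. What follows is a minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command are available on the machine you connect from; probe_port is a hypothetical helper name, not an Azure or OpenSSH tool.

```shell
# probe_port HOST [PORT] [SECONDS]
# Attempts one TCP connection and prints: open, refused, timeout, or error.
# Hypothetical helper; assumes bash (/dev/tcp) and coreutils `timeout`.
probe_port() {
  host="$1"; port="${2:-22}"; wait_s="${3:-5}"
  # bash opens a TCP socket when a redirection targets /dev/tcp/HOST/PORT.
  if err=$(timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>&1); then
    rc=0
  else
    rc=$?
  fi
  if [ "$rc" -eq 0 ]; then
    echo "open"        # handshake completed: something is listening
  elif [ "$rc" -eq 124 ]; then
    echo "timeout"     # no reply before the deadline: look at the network path
  else
    case "$err" in
      *"Connection refused"*) echo "refused" ;;  # the host answered with a reset
      *)                      echo "error" ;;    # DNS failure, no route, and so on
    esac
  fi
}
```

Calling probe_port <public-ip> 22 and getting refused sends you inside the VM; timeout sends you to the NSG and routes; open means the listener is reachable and any remaining failure is authentication or client trust.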

How do I get a verbose SSH trace to confirm the layer?

Run the client with increased verbosity so you can watch exactly how far the connection gets before it fails. The command is ssh -vvv azureuser@<public-ip>, and the last few lines before the failure tell you whether the client got a TCP reset (refused), waited and gave up (timed out), or completed the handshake and was rejected at authentication (permission denied).

The verbose trace turns a one-line error into a frame-by-frame account of the handshake. With -vvv you see the client resolve the address, open the socket, and either connect or stall. If you see Connection established followed by key exchange lines and then Permission denied, the network and the daemon are healthy and you are looking at an authentication failure. If you see the socket attempt followed immediately by Connection refused, the host reset you and sshd is the suspect. If the trace hangs after connect to address for the full timeout and then prints Operation timed out or Connection timed out, nothing answered and you are in NSG and routing territory. Keep the last twenty lines of that trace handy; they are the evidence that justifies every step that follows. For the broader picture of how the VM, its network interface, its public IP, and its security rules fit together, the complete engineering guide to Azure Virtual Machines lays out the resource model that every one of these failures sits inside.

There is one more reading skill worth building before the causes. The message can lie about its own simplicity. “Connection timed out” sometimes masks a half-open path where the SYN reaches the VM but the SYN-ACK never returns, which can be an asymmetric route or a misapplied user-defined route rather than a blunt NSG drop. “Connection refused” can come not from the VM at all but from a load balancer or a NAT rule pointing at a backend with no healthy listener. And “Permission denied (publickey)” occasionally hides a server that crashed mid-handshake because its disk filled, so the daemon was alive enough to start the protocol but could not read the authorized_keys file. These edge cases are the reason the next section gives you a confirming command for every cause rather than asking you to trust the message alone. The message narrows the search; the command proves the cause.

The InsightCrunch SSH Error Decoder

Before working the causes one by one, anchor the diagnosis in a single reference you can return to mid-incident. The SSH error decoder maps each exact message to the layer it implicates, the command that confirms whether that cause is yours, and the recovery action that fixes it. This table is the reference for the rest of the article and the structure the diagnosis follows. The layer column is the one to internalize: it is the refused-versus-timed-out rule made concrete across every message you are likely to see.

Exact client message	Layer implicated	Most likely cause	Confirming check	Recovery action
`port 22: Connection refused`	Host service (local)	sshd stopped, crashed, moved port, or local firewall closed 22	Serial console: `systemctl status sshd`; `ss -tlnp | grep :22`	Restart sshd via run-command or serial console; open 22 in ufw/firewalld
`Connection timed out` / `Operation timed out`	Network path	NSG missing allow-22, route table blackhole, public IP detached	`az network nic list-effective-nsg`; check effective routes	Add or reorder the allow-22 NSG rule; restore the route or public IP
`Permission denied (publickey)`	Authentication	Wrong key, bad authorized_keys permissions, wrong user	`ssh -vvv` shows offered keys; serial console reads `~/.ssh/authorized_keys`	Reset the key with VMAccess (`az vm user update`); fix file modes
`Permission denied (password)`	Authentication policy	Password auth disabled in sshd_config	Serial console: `grep PasswordAuthentication /etc/ssh/sshd_config`	Use the key path, or enable password auth deliberately and reload sshd
`Host key verification failed`	Client trust store	Host key changed after redeploy, reimage, or IP reuse	Compare client `known_hosts` entry to the VM fingerprint	Remove the stale `known_hosts` line with `ssh-keygen -R <ip>`
`No route to host`	Network path (lower)	NIC detached, IP misconfigured, instance deallocated	`az vm get-instance-view` power state; NIC association	Start the VM; reattach the NIC or public IP
`kex_exchange_identification: read: Connection reset`	Host service (early)	sshd starting then dying, or disk full mid-handshake	Boot diagnostics; serial console disk usage `df -h`	Free disk space, restart sshd, extend the OS disk

The decoder turns the refused-versus-timed-out rule into a working procedure. The first two rows are the refused-versus-timed-out split, and they account for the overwhelming majority of real cases. The middle rows are the authentication and client-trust family, which is where engineers waste the most time because the failure looks dramatic but the network and the daemon are both healthy. The last rows are the rarer presentations that still resolve to one of the three layers once you read them carefully. Work down the column that matches your message, run the confirming check, and apply the recovery only after the check proves the cause. Every section below expands one band of this table with the full diagnosis and the tested commands.
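The message-to-layer mapping in the table can be sketched as a small shell function. This is an illustration only: classify_ssh_error and the layer labels it prints are names invented for this sketch, and the patterns match only the message fragments listed in the decoder.

```shell
# classify_ssh_error "MESSAGE"
# Maps the client's final error line to the layer named in the decoder table.
# Hypothetical helper; the labels are this sketch's own shorthand.
classify_ssh_error() {
  case "$1" in
    *"kex_exchange_identification"*)   echo "host-service-early" ;;
    *"Connection refused"*)            echo "host-service" ;;
    *"Connection timed out"*|*"Operation timed out"*)
                                       echo "network-path" ;;
    *"No route to host"*)              echo "network-path-lower" ;;
    *"Permission denied (password"*)   echo "authentication-policy" ;;
    *"Permission denied"*)             echo "authentication" ;;
    *"Host key verification failed"*)  echo "client-trust-store" ;;
    *)                                 echo "unknown" ;;
  esac
}
```

Fed the last line of a failed attempt, for example classify_ssh_error "ssh: connect to host 203.0.113.10 port 22: Connection refused", it prints host-service. The label narrows the search; the confirming check in the table still has to prove the cause.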

Cause One: The Network Path Dropped Port 22 (Connection Timed Out)

When the client times out, the packets never reached a listener, and on Azure the first suspect is always the Network Security Group. An NSG is a stateful packet filter attached to either the subnet, the network interface, or both, and it evaluates inbound rules in priority order until one matches. SSH works only if a rule with a lower priority number than any blocking rule allows TCP traffic to destination port 22 from your source address. The default inbound rule set denies everything that is not explicitly allowed, so a VM created without an SSH rule, or one whose rule was deleted during a cleanup, presents exactly as a timeout. The packets arrive at the Azure fabric, hit the deny, and are dropped without a reply, which is precisely the silent failure a timeout describes.

How do I confirm an NSG is blocking my SSH connection?

Ask Azure to compute the effective rules that apply to the network interface, because a VM can have an NSG on the subnet and another on the NIC, and inbound traffic has to be allowed by both. Run az network nic list-effective-nsg --name <nic-name> --resource-group <rg> and look for an inbound allow rule covering port 22 from your source. If none exists or a deny precedes it, that is your cause.

The effective rules view is authoritative in a way that reading a single NSG is not. A subnet-level NSG and a NIC-level NSG both apply, each evaluated against its own rule list, and inbound traffic must be allowed by both, so a permissive NIC rule does not save you if the subnet NSG denies the traffic. Pull the effective set and read it as Azure will evaluate it. If you see the default DenyAllInBound at priority 65500 with no allow for 22 above it, add a rule. The command to add an allow is straightforward, and you should scope the source to your address rather than the internet wherever possible:

az network nsg rule create \
  --resource-group <rg> \
  --nsg-name <nsg-name> \
  --name Allow-SSH \
  --priority 300 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes <your-public-ip>/32 \
  --destination-port-ranges 22

Priority is the trap most people miss. NSG rules evaluate from the lowest number to the highest, and the first match wins. If a broad deny sits at priority 200 and your new allow lands at priority 300, the deny matches first and your allow never fires. When the timeout persists after you add a rule, list the rules sorted by priority and confirm nothing denies 22 ahead of your allow. The deeper mechanics of how these rules interact, including the priority ordering and the subnet-versus-NIC layering that produces the most confusing cases, are covered in the dedicated walkthrough of why an NSG blocks traffic unexpectedly, and it is worth reading in full if your timeouts recur after you think you have allowed the port.
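The first-match behaviour is easy to see in a toy model. The sketch below is not how Azure evaluates rules internally; it is a simplified illustration that looks only at priority and destination port and ignores source, protocol, and direction. nsg_first_match is a hypothetical name, and the input format (one "PRIORITY ACCESS PORT" rule per line, with * meaning any port) is an assumption of this sketch.

```shell
# nsg_first_match PORT   (rules on stdin as "PRIORITY ACCESS PORT", "*" = any port)
# Illustrative model only: sorts by priority number and reports the first rule
# whose port matches, which is the rule that decides the flow.
nsg_first_match() {
  sort -n -k1,1 | awk -v port="$1" '
    $3 == port || $3 == "*" { print $2 " (priority " $1 ")"; found = 1; exit }
    END { if (!found) print "no match" }
  '
}
```

Piping the three rules 300 Allow 22, 200 Deny *, and 65500 Deny * into nsg_first_match 22 prints Deny (priority 200): the allow at 300 is never reached. Remove the rule at 200 and the same call prints Allow (priority 300).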

A timeout is not always the NSG. Two other path failures present identically. The first is a route table problem: a user-defined route that sends 0.0.0.0/0 to a virtual appliance which then drops or fails to return SSH traffic creates an asymmetric path where your SYN arrives but the SYN-ACK never comes back. Check the effective routes on the NIC with az network nic show-effective-route-table --name <nic> --resource-group <rg> and confirm the route for your return path points where you expect. The second is a detached or changed public IP: if the VM was deallocated and its dynamic public IP was released and reassigned to another resource, you are timing out against an address that no longer belongs to your machine. Confirm the current public IP in the VM instance view rather than trusting a value you cached days ago. When the path is genuinely clear and the timeout continues, the cause is usually a host firewall on the guest, which is the bridge to the refused family and the subject of the next section, because a guest firewall configured to drop rather than reject will produce a timeout even when the NSG is open.

Cause Two: sshd Is Not Running on the VM (Connection Refused)

A refused message moves the investigation inside the machine. Your SYN reached the host and the host answered with a reset, which only happens when the kernel received the packet on port 22 and found no process bound to it. The daemon is the first thing to check. On most modern distributions the service is managed by systemd and named either sshd or ssh, and its state tells you immediately whether the listener exists. The challenge is that you cannot SSH in to check why SSH is down, so the diagnosis runs through the serial console or the run-command channel, both of which reach the VM through the Azure control plane rather than the network data path.

How do I check sshd status when I cannot log in?

Use the run-command channel, which executes a script on the VM through the Azure agent without needing the network or SSH. Run az vm run-command invoke --resource-group <rg> --name <vm> --command-id RunShellScript --scripts "systemctl status sshd; ss -tlnp | grep :22". The output shows whether the daemon is active and whether anything is bound to port 22.

The run-command path is the workhorse of refused diagnosis because it sidesteps the broken layer entirely. The Azure Linux agent (waagent) runs inside the guest and takes instructions from the platform, so as long as the agent is healthy and the VM is running, you can execute arbitrary shell without a shell session. If systemctl status sshd reports the service as inactive or failed, you have your cause and the fix is to start it. If the status is active but ss -tlnp shows nothing on 22, the daemon is running but bound elsewhere, which points at a changed Port directive in sshd_config. Restart the service and confirm the bind:

az vm run-command invoke \
  --resource-group <rg> \
  --name <vm> \
  --command-id RunShellScript \
  --scripts "systemctl restart sshd && systemctl is-active sshd && ss -tlnp | grep :22"

When the daemon refuses to start, read why before you restart it again. A common pattern is a syntax error introduced into sshd_config by a hand edit or a configuration management run, which makes the daemon exit on startup. Validate the configuration with sshd -t, which parses the file and prints the offending line without starting the service. Another pattern is a missing or wrongly permissioned host key: sshd will not start if its host key files are absent or have permissions that are too open, and the journal will say so. Pull the service journal through run-command with journalctl -u sshd -n 50 --no-pager and let the daemon tell you why it died. The journal is rarely silent about its own failures, and reading it saves you from blindly restarting a service that will exit again the moment it tries to parse a broken file.

A local guest firewall deserves a mention here because it straddles refused and timed out. If ufw on Ubuntu or firewalld on RHEL-family systems is configured to reject connections on 22, the client sees a refused message even though sshd is running, because the firewall sends the reset. If it is configured to drop instead, the client sees a timeout. Either way, the fix is to allow the port at the guest level. Through run-command you can inspect and open it: ufw status then ufw allow 22/tcp on Ubuntu, or firewall-cmd --list-all then firewall-cmd --add-service=ssh --permanent && firewall-cmd --reload on RHEL-family hosts. The guest firewall is easy to forget because the Azure NSG feels like the only firewall in the picture, but the operating system has its own packet filter, and a hardening script or a CIS baseline applied after provisioning frequently closes 22 to all but a management subnet.

Cause Three: A Full OS Disk Killed sshd (Connection Refused)

One of the most under-diagnosed reasons for a refused connection is a full operating system disk. When the root filesystem reaches 100 percent, the daemon cannot write to its log, cannot create the temporary files it needs during a session setup, and in many cases cannot read its own configuration cleanly. The service either fails to start after a restart or dies mid-handshake, which produces either a clean refused or the more cryptic kex_exchange_identification: read: Connection reset that the decoder maps to an early host-service failure. Engineers chase keys and NSG rules for an hour before someone runs df and finds the disk at zero bytes free. The signature is a VM that worked yesterday, accepted no configuration change, and suddenly refuses or resets connections, often after a log file or an application wrote until the partition filled.

Can a full disk really cause SSH connection refused?

Yes. A root filesystem at 100 percent prevents sshd from writing logs and session files, so the daemon fails to fork a session and the connection is reset or refused. The tell is a previously working VM that broke without a config change. Confirm with df -h through the serial console, free some space, and restart sshd.

The confirmation is a single command, and you run it through run-command or the serial console since you cannot get a shell:

az vm run-command invoke \
  --resource-group <rg> \
  --name <vm> \
  --command-id RunShellScript \
  --scripts "df -h /; du -xh / 2>/dev/null | sort -rh | head -20"

The first command shows the fill level of the root filesystem and the second finds the largest directories on it so you can identify what consumed the space. The usual culprits are unrotated application logs, a journald journal with no size cap, an apt or dnf cache that ballooned, core dumps from a crashing process, or a database write-ahead log that grew unbounded. Free enough space to let the daemon breathe and restart it: clear the obvious offender, vacuum the journal with journalctl --vacuum-size=200M, clean the package cache, and then restart sshd. If the disk is genuinely undersized for the workload, freeing space is a stopgap and the real fix is to extend the OS disk, which on Azure you do by resizing the managed disk and then growing the partition and filesystem inside the guest.
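Because a full root filesystem gives no warning before sshd stops answering, a small usage check inside the guest can catch it early. This is a minimal sketch, assuming a POSIX df and awk on the VM; the function names and the 90 percent default threshold are choices made for this example, not Azure defaults.

```shell
# fs_usage_pct [MOUNTPOINT]
# Prints the used percentage of a filesystem as a bare integer (default: /).
fs_usage_pct() {
  # -P forces the portable one-line-per-filesystem layout; column 5 is "NN%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# fs_is_nearly_full [MOUNTPOINT] [THRESHOLD]
# Succeeds (exit 0) when usage is at or above the threshold (default: 90).
fs_is_nearly_full() {
  [ "$(fs_usage_pct "${1:-/}")" -ge "${2:-90}" ]
}
```

A scheduled job could call fs_is_nearly_full / 90 and raise an alert or vacuum the journal when it succeeds, which turns the lockout into a warning you see while SSH still works.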

Extending the disk is worth walking through because the order matters and people miss the in-guest step. You resize the managed disk from the control plane with az disk update --resource-group <rg> --name <disk> --size-gb <new-size>, which on many configurations requires the VM to be deallocated first, because not every SKU supports growing an attached OS disk while the VM is running. After the resize, the larger disk is visible to the guest but the partition and filesystem still occupy the old size, so you grow the partition with growpart /dev/sda 1 and then extend the filesystem with resize2fs /dev/sda1 for ext4 or xfs_growfs / for xfs. Those device names assume the root filesystem sits on the first partition of /dev/sda; confirm with lsblk before running them, because the OS disk is not always sda. Skipping the in-guest growth leaves you with a larger disk that the operating system still reports as full, which is a frequent source of “I extended the disk and it is still broken” confusion. The platform gave you the space; the guest has to claim it.
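The order of operations can be written down as a script with a dry-run switch, so you can read the exact commands before anything is deallocated. This is a sketch under stated assumptions: run and extend_os_disk are hypothetical helper names, the root filesystem is assumed to be ext4 on /dev/sda1, and the deallocate-first path is assumed to be required for your configuration.

```shell
# run CMD...  : prints the command when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# extend_os_disk RG VM DISK NEW_SIZE_GB
# Hypothetical wrapper around the sequence described above. Assumes an ext4
# root filesystem on /dev/sda1; verify with lsblk and df -T on the VM first.
extend_os_disk() {
  rg="$1"; vm="$2"; disk="$3"; size="$4"
  run az vm deallocate --resource-group "$rg" --name "$vm" &&
  run az disk update --resource-group "$rg" --name "$disk" --size-gb "$size" &&
  run az vm start --resource-group "$rg" --name "$vm" &&
  # In-guest step: grow the partition, then the filesystem. The ";" lets
  # resize2fs run even if growpart reports the partition was already grown.
  run az vm run-command invoke --resource-group "$rg" --name "$vm" \
    --command-id RunShellScript \
    --scripts "growpart /dev/sda 1; resize2fs /dev/sda1"
}
```

Running DRY_RUN=1 extend_os_disk <rg> <vm> <disk> 64 prints the four commands in order without touching the VM. For an xfs root, the in-guest script would use xfs_growfs / in place of resize2fs.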

There is a related disk failure that presents differently and deserves a note. If the OS disk is intact but the data disk that holds /home or /var filled or failed to mount after a reboot, sshd may start fine but the user’s home directory and authorized_keys file are unreachable, which surfaces as a permission denied rather than a refused. The boundary between these cases is exactly why reading the message first pays off: a refused points at the daemon and the root disk, while a permission denied with a healthy daemon points at the home directory and the key file. The same df -h check covers both, but the recovery diverges, and knowing which message you saw tells you which directory to inspect.

Cause Four: Permission Denied (publickey) Is Authentication, Not Connectivity

When the client prints Permission denied (publickey), the network is open, the daemon is running, and the SSH protocol negotiated successfully. Everything up to authentication worked. The failure is that the key your client offered did not match an entry the server would accept, or the server could not read the entry it needed. This is the message that triggers the most counterproductive instinct in the entire diagnosis, because it sounds like a credential catastrophe and tempts you to regenerate keys, when in most cases the key is fine and a file permission or a username is the real problem. Reaching for a key reset here is the classic misdiagnosis, and it often makes things worse by replacing a working key with a new one while the actual cause, a wrong home directory permission, remains untouched.

Why do I get permission denied publickey when my key is correct?

Because SSH refuses to use a key when the home directory, the .ssh directory, or the authorized_keys file have permissions that are too open, since a world-writable path is a security risk the daemon will not trust. The home directory must not be group or world writable, .ssh must be 700, and authorized_keys must be 600 owned by the user.

The permission rule is strict and silent. sshd checks the ownership and mode of the path leading to authorized_keys, and if any segment is writable by group or other, it ignores the keys there and rejects the login as if no key matched. The verbose client trace will show your key being offered and the server declining it, which looks like a key mismatch but is actually a server-side trust refusal. Confirm and fix the permissions through run-command, targeting the user you log in as:

az vm run-command invoke \
  --resource-group <rg> \
  --name <vm> \
  --command-id RunShellScript \
  --scripts "chmod 700 /home/azureuser/.ssh; chmod 600 /home/azureuser/.ssh/authorized_keys; chown -R azureuser:azureuser /home/azureuser/.ssh; chmod 755 /home/azureuser; ls -la /home/azureuser/.ssh"
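Before changing modes, it can help to see which segment of the path is actually too open. The following is a minimal sketch meant to run on the VM (through run-command or the serial console), assuming GNU find and stat; check_ssh_perms is a hypothetical helper, and it inspects mode bits only, not ownership, which sshd also checks.

```shell
# check_ssh_perms HOME_DIR
# Flags any path on the way to authorized_keys that is group- or world-writable.
# Hypothetical helper; assumes GNU find and stat. Checks mode bits only.
check_ssh_perms() {
  home="$1"; bad=0
  for p in "$home" "$home/.ssh" "$home/.ssh/authorized_keys"; do
    if [ ! -e "$p" ]; then
      echo "MISSING $p"; bad=1; continue
    fi
    mode=$(stat -c '%a' "$p")
    # -perm /022 matches when the group-write or other-write bit is set.
    if [ -n "$(find "$p" -maxdepth 0 -perm /022)" ]; then
      echo "TOO OPEN $mode $p"; bad=1
    else
      echo "ok $mode $p"
    fi
  done
  return "$bad"
}
```

Calling check_ssh_perms /home/azureuser prints one line per path and returns non-zero if any of them is missing or writable by group or other, which tells you whether the chmod commands above are the fix or whether the cause lies elsewhere.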

Beyond permissions, three other authentication causes hide behind the same message. The first is the wrong username: Azure Linux images use a provisioning user you chose at creation, commonly azureuser, and SSHing as root or as ubuntu when the key lives in azureuser’s authorized_keys fails with publickey even though the key is valid for the right account. Confirm the user by reading the home directories on the VM and try the correct one. The second is the wrong key offered by the client: if you have many keys in your agent, the server may reject several before reaching the right one, and depending on configuration it can fail before trying yours. Force the specific key with ssh -i ~/.ssh/the_right_key -o IdentitiesOnly=yes azureuser@<ip> so the client offers only that identity. The third is a corrupted or empty authorized_keys file, which can happen if a cloud-init run failed partway or a configuration tool truncated it. Read the file through run-command and confirm your public key is present and intact on a single line, because a key split across lines by a bad copy paste will never match.

When the key genuinely needs replacing, because it was lost, leaked, or never made it onto the VM, you do not rebuild the machine. You reset it with the VMAccess extension, which is covered in the recovery section below and is the correct tool for the job. The point of this section is to exhaust the cheap causes first: a chmod fixes more publickey failures than a key reset does, and reading the verbose trace tells you which one you are looking at before you change anything.

Cause Five: Password Authentication Is Disabled

A subset of permission-denied failures are not about keys at all. If you are trying to log in with a password and the client reports Permission denied (password) or simply rejects every password you enter, the likely cause is that password authentication is disabled in the daemon configuration. Azure Linux images and most hardening baselines set PasswordAuthentication no in sshd_config by design, because key-based authentication is stronger and password logins invite brute-force attacks against a public-facing port. The login fails not because your password is wrong but because the server will not consider passwords at all. This is a policy outcome, not a fault, and the fix depends on whether you actually want password logins or whether you should be using a key.

The confirmation is a one-line read of the configuration through the serial console or run-command:

az vm run-command invoke \
  --resource-group <rg> \
  --name <vm> \
  --command-id RunShellScript \
  --scripts "grep -Ei '^[[:space:]]*PasswordAuthentication|^[[:space:]]*ChallengeResponseAuthentication' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/* 2>/dev/null"

Note that the directive can live in the main sshd_config or in a drop-in file under sshd_config.d, and a drop-in often overrides the main file, so check both. If the effective setting is no and you genuinely need a password login, change it deliberately, reload the daemon, and understand the security trade-off you are accepting by exposing a password-authenticated service. The far better path in nearly every case is to add or reset an SSH key rather than enable passwords, because a password-authenticated SSH port on a public IP is a magnet for credential-stuffing bots and will show thousands of failed login attempts in the auth log within hours of exposure. If you must enable it for a short, controlled window, restrict the source to your address in the NSG so the world cannot reach the password prompt at all.

There is a quieter variant of this failure where password auth is enabled but the account is locked or has no password set. A VM provisioned with only a key has no password for the user, so a password login fails even when the daemon would accept one. The auth log shows the attempt reaching PAM and failing there rather than at the SSH policy layer, which is a subtle but useful distinction when you read journalctl -u sshd or /var/log/auth.log. The cleaner resolution remains the same: set a key with VMAccess and log in with it, leaving password authentication off where it belongs.

Cause Six: The Host Key Changed After a Redeploy

The message Host key verification failed, often wrapped in a loud warning about a possible man-in-the-middle attack, is the one failure in this set that is entirely on the client side. SSH remembers the host key of every server it connects to and stores it in your local known_hosts file. When the key the server presents no longer matches the one you recorded, the client refuses to proceed and prints the alarming warning, because a changed host key is exactly what an interception attack would look like. On Azure, the benign and common explanation is that the VM was redeployed, reimaged, or recreated at the same public IP, which regenerates the host keys, or that a dynamic public IP was released and handed to a different machine you now reach at the cached address. The server is new; your client is comparing it to a fingerprint that belongs to a machine that no longer exists.

Why does SSH warn that the host key changed after a redeploy?

Because redeploying or reimaging an Azure VM regenerates its SSH host keys, so the fingerprint your client cached in known_hosts no longer matches what the server now presents. SSH treats a changed host key as a possible interception and refuses to connect until you remove the stale entry from your local known_hosts file.

The fix is local and quick, but you should pause before applying it to confirm the change is expected. If you did redeploy, reimage, or know the IP was reassigned, the new key is legitimate and you simply remove the stale entry:

ssh-keygen -R <public-ip-or-hostname>

That command deletes the old fingerprint, and your next connection will record the new one after asking you to confirm it. If you did not expect the host key to change, do not blindly clear the entry. A genuinely unexpected host key change on a machine you did not touch warrants a moment of suspicion: confirm through the Azure portal or the instance view that the VM was not redeployed by automation, a deployment pipeline, or a teammate, and that the IP still maps to your resource. The warning exists for a reason, and the discipline of confirming the change is expected before clearing it is the difference between routine maintenance and ignoring a real signal. In practice on Azure the cause is almost always a redeploy or an IP reuse, and once you confirm that, the ssh-keygen -R clears it and you reconnect normally.

A subtle related case appears when you use a hostname rather than an IP and the DNS record was repointed to a new VM. The known_hosts entry is keyed by the name you connect with, so a CNAME or A record that now resolves to a different machine triggers the same warning even though nothing about your VM changed. Reading the warning carefully helps here: it prints the offending line number in known_hosts and the host it refers to, so you can confirm exactly which cached entry is stale before removing it. This is also a reason to prefer reserving a static public IP for any VM you connect to regularly, because a static address keeps the host-key relationship stable across stop and start cycles, which removes one recurring source of this warning entirely.

Recover SSH Access Without Rebuilding the VM

The thread running through every cause above is that you can fix it without destroying the machine, and the two tools that make that possible are the VMAccess extension and the serial console. Both reach into the VM through the Azure control plane rather than the broken network path, which is exactly why they work when SSH does not. Understanding them turns a lockout from a crisis into a routine repair, and it is the single most valuable capability to build before you ever need it, because the worst time to learn the recovery path is during the outage.

How do I reset the SSH key on an Azure VM I cannot log into?

Use the VMAccess extension through the az vm user update command, which writes a new public key into the user’s authorized_keys file via the Azure agent without needing a shell. Run az vm user update --resource-group <rg> --name <vm> --username azureuser --ssh-key-value "$(cat ~/.ssh/new_key.pub)" and reconnect with the matching private key.

VMAccess is the canonical recovery for the authentication family of failures. It does not touch the network, the daemon configuration, or the disk; it asks the Azure agent inside the guest to update the account’s key or reset its password. Because it runs through waagent, it works whenever the agent is healthy and the VM is running, which covers the great majority of permission-denied lockouts. The command writes your new public key, and the matching private key then logs you in. The same extension can reset a password or even repair the SSH configuration on some images, but the key reset is the one you reach for most. A critical caution: VMAccess on the default invocation can replace the existing key for that user rather than appending, so if multiple people share the account, coordinate before you reset, and prefer per-user accounts over a shared login precisely so a reset never locks out a colleague.
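The two less common VMAccess operations mentioned above, the SSH configuration reset and the password reset, can be sketched as thin wrappers around the Azure CLI. The function names are hypothetical; the underlying subcommands are az vm user reset-ssh and az vm user update with --password. This is a minimal sketch that assumes a logged-in Azure CLI, a running VM, and a healthy agent, and it is worth remembering that a password passed as an argument lands in your shell history.

```shell
# Hypothetical wrappers around VMAccess; assumes the Azure CLI is installed and logged in.

# Ask VMAccess to reset the guest's SSH configuration to default values.
# Accounts, passwords, and keys are left as they are.
reset_ssh_config() (
  rg=$1; vm=$2
  if [ -z "$rg" ] || [ -z "$vm" ]; then
    echo "usage: reset_ssh_config <resource-group> <vm-name>" >&2
    exit 2
  fi
  az vm user reset-ssh --resource-group "$rg" --name "$vm"
)

# Set a password on a local account, for example so it can log in on the serial console.
reset_user_password() (
  rg=$1; vm=$2; user=$3; password=$4
  if [ -z "$rg" ] || [ -z "$vm" ] || [ -z "$user" ] || [ -z "$password" ]; then
    echo "usage: reset_user_password <resource-group> <vm-name> <user> <password>" >&2
    exit 2
  fi
  az vm user update --resource-group "$rg" --name "$vm" \
    --username "$user" --password "$password"
)
```

A call such as reset_ssh_config prod-rg web-01 is the one to reach for when you suspect a hardening change mangled sshd_config and you have no console access to inspect it by hand.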

When the failure is not authentication but a stopped daemon, a full disk, or a broken sshd_config, the serial console is the tool. The serial console gives you a real text console to the VM, the equivalent of plugging a keyboard and monitor into a physical server, delivered through the Azure portal over the boot diagnostics channel rather than the network. It requires that boot diagnostics be enabled on the VM, which is why enabling boot diagnostics at creation time is one of the highest-value defaults you can set. Through the serial console you log in with a local account that has a password set, because the console prompts for a password and cannot use your SSH key, and then you inspect systemctl status sshd, read the journal, run df -h, validate the config with sshd -t, free disk space, fix permissions, and restart the daemon, all without the network. It is the universal recovery surface, and it is the reason a VM is rarely truly unreachable: the network can be entirely broken and you can still get a console.

The run-command channel, used throughout the cause sections above, is the third recovery surface and the most automatable. Where the serial console is interactive, run-command executes a script and returns the output, which makes it ideal for scripted diagnosis and for environments where opening a portal console is awkward. It runs through the same agent path as VMAccess, so it shares the dependency on a healthy agent and a running VM, and it cannot help if the VM is deallocated or the agent itself is broken. For those deeper failures, where the VM will not boot far enough to start the agent, the diagnosis moves to boot diagnostics and the recovery moves toward attaching the OS disk to a rescue VM, which crosses into the territory of a full no-boot recovery rather than an SSH-specific one. If your VM is not merely refusing SSH but failing to boot at all, the dedicated walkthrough of Azure VM boot failures and no-boot recovery covers the rescue-disk procedure that picks up where the SSH recovery tools stop.

For day-to-day access that sidesteps this entire class of problem, Azure Bastion deserves a mention as the structural fix. Bastion provides browser-based SSH and RDP to your VMs over private IP through a managed jump host, which means you never expose port 22 to the internet and you never depend on a public IP or an internet-facing NSG rule for access. It changes the failure surface: many of the timed-out and exposure-driven causes simply cannot happen when access flows through Bastion over the private network. Setting it up is a one-time investment that removes a recurring risk, and the full configuration is laid out in the guide to setting up Azure Bastion for secure access, which is worth reading if SSH lockouts are a repeated theme in your environment rather than a one-time incident.
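If you prefer a terminal to the browser, a Bastion connection can also be opened from the Azure CLI. The wrapper below is a hedged sketch: the function name is hypothetical, and it assumes the CLI's bastion extension is installed and that your Bastion deployment is on a SKU with native client support enabled. Check both before relying on it.

```shell
# Hypothetical wrapper; assumes the Azure CLI bastion extension and a Bastion
# deployment that permits native client connections.
bastion_ssh() (
  rg=$1; bastion=$2; vm_id=$3; user=$4; key=$5
  if [ -z "$rg" ] || [ -z "$bastion" ] || [ -z "$vm_id" ] || [ -z "$user" ] || [ -z "$key" ]; then
    echo "usage: bastion_ssh <rg> <bastion-name> <vm-resource-id> <user> <private-key-path>" >&2
    exit 2
  fi
  # The target is the VM's full resource ID, not its IP: Bastion resolves the private address itself.
  az network bastion ssh --resource-group "$rg" --name "$bastion" \
    --target-resource-id "$vm_id" \
    --auth-type ssh-key --username "$user" --ssh-key "$key"
)
```

The VM resource ID can be read with az vm show --resource-group <rg> --name <vm> --query id --output tsv.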

A Worked Diagnosis: From Error String to Fix in Five Minutes

It helps to see the whole method run end to end on a realistic incident, because the value is in the sequence, not any single command. Picture a VM that accepted connections fine yesterday and now refuses them. You read the client message and it says port 22: Connection refused. The refused-versus-timed-out rule fires immediately: this is a host-and-service problem, not a network one, so you do not open the NSG. You skip straight to run-command and ask the daemon how it is doing.

az vm run-command invoke --resource-group prod-rg --name web-01 \
  --command-id RunShellScript \
  --scripts "systemctl is-active sshd; df -h /; journalctl -u sshd -n 20 --no-pager"

The output shows sshd as failed, the root filesystem at 100 percent, and a journal line about being unable to write a temporary file. The diagnosis is now complete and it took one command: a full disk killed the daemon. You free space and restart in a second invocation.

az vm run-command invoke --resource-group prod-rg --name web-01 \
  --command-id RunShellScript \
  --scripts "journalctl --vacuum-size=200M; apt-get clean 2>/dev/null; rm -f /var/log/*.gz; df -h /; systemctl restart sshd; systemctl is-active sshd"

The disk now shows free space, sshd reports active, and your next SSH attempt succeeds. The entire incident resolved without a redeploy, without a key reset, and without touching the network, because the error string told you which layer to investigate and the confirming command proved the cause before you changed anything. Contrast this with the panic path: a reimage would have fixed the symptom by replacing the disk, but it would also have destroyed the evidence, hidden the unbounded log that filled the disk, and guaranteed a recurrence the next time that log grew. The method is faster and it teaches you something the rebuild never would. This is the discipline the rest of this article is built to instill, and it is the same discipline that the hands-on labs reinforce when you can break a VM safely and practice the recovery; you can run exactly these scenarios in a sandbox through the hands-on Azure labs and command library on VaultBook and drill the diagnostic sequence until reading the error and reaching for the right command is reflex.
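The first step of that sequence, reading the error string and naming the layer, is mechanical enough to write down. The function below is an illustrative sketch, not a tool that ships with SSH or Azure: the function name and the layer labels are made up for this example, and the patterns cover only the common OpenSSH client messages.

```shell
# Map an SSH client error string to the layer it implicates.
# The labels are illustrative names, not anything ssh itself prints.
classify_ssh_error() {
  case "$1" in
    *"Connection refused"*)
      echo "service" ;;   # host answered, nothing listening on the port
    *"timed out"*|*"No route to host"*)
      echo "network" ;;   # packets dropped or unroutable along the path
    *"Permission denied"*)
      echo "auth" ;;      # network and daemon fine, credential rejected
    *"REMOTE HOST IDENTIFICATION HAS CHANGED"*|*"Host key verification failed"*)
      echo "hostkey" ;;   # cached known_hosts entry no longer matches
    *)
      echo "unknown" ;;
  esac
}
```

Feeding it the incident above, classify_ssh_error "ssh: connect to host web-01 port 22: Connection refused" prints service, which is the cue to skip the NSG and go straight to the daemon and the disk.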

Prevent SSH Connection Refused From Recurring

Fixing the immediate lockout is half the job. The other half is making sure the same machine, or the next one you build, does not lock you out the same way. Prevention here is not a vague best practice; it is a small set of concrete defaults that each close a specific cause from the decoder. Set them at build time and you remove most of this article’s failure modes before they can occur.

Enable boot diagnostics on every VM at creation, because it is the prerequisite for the serial console, and the serial console is your recovery surface when everything else is broken. A VM without boot diagnostics is a VM you cannot rescue through the console when the network and the daemon both fail, which forces the slower disk-swap recovery. The cost is trivial and the capability is the difference between a five-minute fix and an afternoon. Pair it with the Azure agent in a healthy state, since run-command and VMAccess both depend on it, and confirm the agent is reporting in the instance view so you know your control-plane recovery paths are live before you need them.
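For a VM that already exists, the same default can be applied after the fact. The sketch below uses hypothetical function names around two Azure CLI calls, az vm show and az vm boot-diagnostics enable, and assumes a CLI version recent enough to enable boot diagnostics with managed storage when no storage account is named.

```shell
# Hypothetical helpers; assumes a logged-in Azure CLI.

# Succeeds only when boot diagnostics is already on for the VM.
boot_diagnostics_enabled() (
  state=$(az vm show --resource-group "$1" --name "$2" \
    --query "diagnosticsProfile.bootDiagnostics.enabled" --output tsv)
  case "$state" in
    true|True) exit 0 ;;
    *)         exit 1 ;;
  esac
)

# Turn boot diagnostics on; older CLI versions may also require --storage.
enable_boot_diagnostics() (
  az vm boot-diagnostics enable --resource-group "$1" --name "$2"
)

# Enable only when needed, so the call is safe to repeat across a fleet.
ensure_boot_diagnostics() {
  boot_diagnostics_enabled "$1" "$2" || enable_boot_diagnostics "$1" "$2"
}
```

Running ensure_boot_diagnostics prod-rg web-01 against every VM in a resource group is a cheap way to confirm the serial console will be there when you need it.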

Reserve a static public IP for any VM you connect to regularly, or better, route access through Azure Bastion and stop exposing port 22 to the internet at all. A static IP removes the host-key-changed warning that dynamic IP reassignment causes and keeps your known_hosts entry stable across stop and start cycles. Bastion goes further by eliminating the public SSH surface entirely, which closes the timed-out causes driven by NSG misconfiguration and the brute-force exposure that password authentication invites. For Linux fleets that you manage at scale, consider disabling password authentication uniformly and standing up a key management process, so a lost key is a VMAccess reset rather than a lockout, and so no machine ever presents a password prompt to the open internet.
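Switching an existing dynamic address to static is a single Azure CLI call. The wrapper name below is hypothetical, and the sketch assumes you know the name of the public IP resource attached to the VM's NIC.

```shell
# Hypothetical wrapper; assumes a logged-in Azure CLI and an existing public IP resource.
make_public_ip_static() (
  rg=$1; ip_name=$2
  if [ -z "$rg" ] || [ -z "$ip_name" ]; then
    echo "usage: make_public_ip_static <resource-group> <public-ip-name>" >&2
    exit 2
  fi
  # A static allocation keeps the address across stop and start cycles,
  # so the known_hosts entry for that IP stays valid.
  az network public-ip update --resource-group "$rg" --name "$ip_name" \
    --allocation-method Static
)
```

If you do not know the resource name, az network public-ip list --resource-group <rg> --output table lists the candidates.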

Cap the things that fill disks before they fill them. Configure journald with a SystemMaxUse limit so the journal cannot grow unbounded, rotate application logs with logrotate and confirm the rotation actually runs, and put data that grows, such as databases and large logs, on a separate data disk so a runaway writer fills that disk rather than starving the root filesystem and the daemon that lives on it. The full-disk cause is one of the most common refused failures and one of the most preventable, because it is almost always an uncapped log or an unrotated file rather than a genuinely undersized disk. Monitoring closes the loop: an alert on root filesystem usage crossing 85 percent gives you days of warning before the disk fills and sshd dies, which turns a midnight outage into a routine ticket. The diagnostic muscle for all of this is something you build by practicing the failure and the fix rather than reading about it, and the scenario-based troubleshooting drills on ReportMedic walk through the refused, timed-out, and permission-denied cases as guided exercises so the diagnosis becomes automatic.
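The journal cap and the usage threshold can both be expressed in a few lines of shell to run inside the guest. This is a sketch under stated assumptions: the drop-in file name 10-size-cap.conf and the function names are made up for illustration, the 200M and 85 percent defaults are simply the figures used in this article, and a real alert belongs in your monitoring system rather than a script.

```shell
# Write a journald drop-in that caps journal size.
# Defaults target the standard drop-in directory; pass another directory to stage it first.
write_journald_cap() (
  dir=${1:-/etc/systemd/journald.conf.d}
  size=${2:-200M}
  mkdir -p "$dir" || exit 1
  printf '[Journal]\nSystemMaxUse=%s\n' "$size" > "$dir/10-size-cap.conf"
  # journald picks the new limit up after: systemctl restart systemd-journald
)

# Read `df -P` output on stdin and print the usage percentage of the first filesystem.
parse_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when the root filesystem is at or above the threshold (default 85).
root_over_threshold() (
  threshold=${1:-85}
  pct=$(df -P / | parse_usage_pct)
  [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]
)
```

A cron entry that runs root_over_threshold 85 and raises a ticket on success is a crude stand-in for a proper metric alert, but it illustrates the check the alert performs.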

Finally, guard the configuration that locks people out. Validate every sshd_config change with sshd -t before you reload the daemon, so a syntax error never bricks the listener, and reload rather than restart where possible so an in-flight bad config does not drop your only session before you can revert it. When you harden a host with a CIS baseline or a configuration management run, test SSH access from a second session before you close the first, because the most reliable way to lock yourself out is to apply a hardening change that closes the port you are connected through and then disconnect. The defensive habit of keeping one known-good session open while you change SSH configuration has saved more engineers than any single command in this article.
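The validate-then-reload habit is easy to wrap so that a bad edit never reaches the running daemon. The function name is hypothetical, and the sketch assumes it runs as root inside the guest on a systemd-based image; the unit is called sshd on RHEL-family systems and ssh on Debian and Ubuntu, so it is passed as an argument.

```shell
# Reload the SSH daemon only if the configuration on disk parses cleanly.
safe_reload_sshd() (
  unit=${1:-sshd}
  if sshd -t; then
    # Config is valid: reload keeps existing sessions while picking up the change.
    systemctl reload "$unit"
  else
    echo "sshd_config failed validation; daemon left untouched" >&2
    exit 1
  fi
)
```

Run it as safe_reload_sshd ssh on Ubuntu, and keep your current session open until a second connection proves the new configuration works.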

Failures Often Confused With SSH Connection Refused

Several adjacent problems wear the SSH lockout costume without being SSH problems at all, and recognizing them stops you from applying an SSH fix to a non-SSH cause. The first is a deallocated or stopped VM. A VM in the stopped (deallocated) state has released its compute and, if it had a dynamic public IP, its address, so connections either time out or report no route to host. This is not a daemon or a key problem; the machine is simply not running. Confirm the power state with az vm get-instance-view --resource-group <rg> --name <vm> and read the statuses, because a deallocated VM cannot refuse a connection in the sshd sense at all, and the right action is to start it, not to debug SSH.
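Reading the power state out of the instance view is quicker with a query than by eye. The sketch below uses hypothetical function names; it assumes a logged-in Azure CLI and relies on the instance view reporting a status whose code begins with PowerState/ and whose display text reads like "VM running" or "VM deallocated".

```shell
# Print the VM's power state display text, e.g. "VM running" or "VM deallocated".
vm_power_state() (
  az vm get-instance-view --resource-group "$1" --name "$2" \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus | [0]" \
    --output tsv
)

# Start the VM when it is simply not running; otherwise leave it alone.
start_if_not_running() (
  rg=$1; vm=$2
  state=$(vm_power_state "$rg" "$vm")
  echo "power state: ${state:-unknown}"
  case "$state" in
    "VM running")
      ;;  # already running: SSH-level debugging is meaningful
    "VM deallocated"|"VM stopped")
      az vm start --resource-group "$rg" --name "$vm" ;;
    *)
      echo "unexpected state; inspect the instance view by hand" >&2 ;;
  esac
)
```

If the VM had a dynamic public IP, expect a new address after the start, and with it a host-key warning for the old known_hosts entry.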

The second is a load balancer or NAT rule pointing at an unhealthy backend. If you connect through a public load balancer with an inbound NAT rule mapping a port to a backend VM, and that backend has no healthy listener, the load balancer can refuse or reset the connection on the daemon’s behalf, which looks like a VM-level refused but is actually a backend health problem. Check the backend pool health and the NAT rule target before you blame the VM’s daemon, because restarting sshd on a healthy backend will not help if the load balancer is sending you to a different, broken instance.

The third is the Windows analog, where the same diagnostic discipline applies to RDP rather than SSH. The refused-versus-timed-out logic translates directly: an RDP connection that times out points at the NSG and the network path, while one that refuses points at the Remote Desktop service inside the guest, and the recovery tools (run-command, serial console, the access extension) are the same. If your lockout is on a Windows VM rather than a Linux one, the parallel diagnosis is laid out in the guide to fixing Azure VM RDP connection errors, which mirrors this method on the Windows side and is the natural companion when your fleet runs both.

The fourth is Just-in-Time VM access or a time-bound NSG rule that has expired. Environments using JIT access open port 22 only for a requested window and then close it automatically, so a connection that worked an hour ago times out now not because anything broke but because the access window lapsed. If your organization uses JIT, request access again before you diagnose a timeout as a misconfiguration, because the expiry is working as designed and the fix is a new access request, not an NSG edit. The same applies to automation that opens and closes management ports on a schedule: a timeout at the wrong time of day can simply be the schedule doing its job.

The Verdict

An Azure VM SSH connection refused is one of the most over-escalated failures in cloud operations, and almost all of that escalation is avoidable. The error string your client prints has already done the hardest part of the diagnosis: refused names the host and the daemon, timed out names the network path, and permission denied names authentication. Internalize that single split, the refused-versus-timed-out rule, and you walk into every lockout knowing which half of the cause list to ignore. From there the method is mechanical: match the message to the decoder, run the confirming command through the serial console or run-command, and apply the recovery that the proven cause calls for. The full disk gets space and a restart, the missing NSG rule gets an allow, the bad permission gets a chmod, the lost key gets a VMAccess reset, and the changed host key gets an ssh-keygen -R. None of these requires destroying the machine, and reaching for a rebuild before you have read the error is how engineers turn a five-minute fix into a data-loss incident.

The deeper lesson is that an Azure VM is rarely truly unreachable. The network can be entirely broken and the serial console still reaches the guest through boot diagnostics; the daemon can be dead and run-command still executes a script through the agent; the key can be lost and VMAccess still writes a new one. These control-plane paths exist precisely so a network or service failure never becomes a lockout, and the engineers who never panic over a refused connection are simply the ones who enabled boot diagnostics, kept the agent healthy, and practiced the recovery before they needed it. Build those defaults into every VM, read the error before you act, and SSH connection refused stops being a crisis and becomes a routine, well-understood repair.

Gather the Diagnostic Signal Through the Control Plane

Reading the error tells you the layer; gathering the signal proves the cause, and on Azure the richest signal comes from the control plane rather than the broken network. Three surfaces matter, and each one answers a different question. Boot diagnostics answers whether the operating system actually booted and what it printed on the way up. Network Watcher answers whether a packet can reach port 22 and where it dies if it cannot. The Azure agent, through run-command and the instance view, answers what is happening inside a running guest. Together they cover almost every refused, timed-out, and permission-denied case without ever needing a working SSH session.

How do I see the boot output of a VM I cannot SSH into?

Enable boot diagnostics and read the serial log, which captures the kernel and init output the VM produces during boot. Run az vm boot-diagnostics get-boot-log --resource-group <rg> --name <vm> to retrieve the console log, where a filesystem mount failure, a kernel panic, or an init service that hung will appear as plain text.

The boot log is the closest thing to standing in front of the machine while it starts. A VM that refuses SSH because a data disk failed to mount will show the mount error and the resulting emergency shell in the boot log. A VM that hangs partway through boot, never reaching the point where sshd starts, shows exactly which service stalled. This matters because a timeout or a refused connection on a VM that never finished booting is not an SSH problem at all; it is a boot problem, and the boot log redirects you to the right diagnosis. Capturing a screenshot through boot diagnostics complements the text log on images that render a graphical console, and the two together tell you whether the guest reached a state where SSH could even be expected to work.
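A boot log can run to thousands of lines, so it helps to filter it for the messages that usually explain a VM that never reached sshd. The pattern list below is an assumption chosen for illustration, covering a handful of common kernel and systemd messages; it is not exhaustive, and the function name is hypothetical.

```shell
# Read a boot log on stdin and print numbered lines that match common failure signatures.
scan_boot_log() {
  grep -n -i -E \
    'kernel panic|emergency mode|failed to mount|dependency failed|timed out waiting for device' \
    || echo "no known failure signatures found"
}
```

Pipe the retrieved log straight into it, as in az vm boot-diagnostics get-boot-log --resource-group <rg> --name <vm> | scan_boot_log, and read the full log around any line number it reports.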

Network Watcher provides the path proof. Its IP flow verify feature takes a source, a destination, a port, and a direction, and tells you whether the effective NSG rules would allow or deny that exact flow, naming the rule responsible. This removes the guesswork from a timeout: instead of reading rules and reasoning about priority yourself, you ask Azure to simulate the packet and report the verdict. The connection troubleshoot feature goes further by actually attempting a connection from one resource to another and reporting where it succeeds or fails along the hops. When a timeout persists after you believe the NSG is correct, IP flow verify either confirms the allow, which sends you to look at route tables and guest firewalls, or names the deny rule you missed. It is the fastest way to settle the question of whether the network path is genuinely open.
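IP flow verify is available from the CLI as az network watcher test-ip-flow. The wrapper below is a minimal sketch with a hypothetical function name; it assumes Network Watcher is enabled in the VM's region, takes the VM's private IP as the local address, and uses 50000 as an arbitrary stand-in for the client's ephemeral source port.

```shell
# Ask Network Watcher whether inbound TCP to port 22 would be allowed, and by which rule.
verify_ssh_flow() (
  rg=$1; vm=$2; vm_private_ip=$3; client_ip=$4
  if [ -z "$rg" ] || [ -z "$vm" ] || [ -z "$vm_private_ip" ] || [ -z "$client_ip" ]; then
    echo "usage: verify_ssh_flow <rg> <vm> <vm-private-ip> <client-ip>" >&2
    exit 2
  fi
  az network watcher test-ip-flow \
    --resource-group "$rg" --vm "$vm" \
    --direction Inbound --protocol TCP \
    --local "${vm_private_ip}:22" --remote "${client_ip}:50000"
)
```

The result reports Allow or Deny together with the name of the rule that decided it, which is the evidence you want before editing any NSG.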

The agent and the instance view round out the picture for running guests. The instance view, pulled with az vm get-instance-view, reports the power state, the provisioning state, the agent status, and the state of any extensions, which tells you at a glance whether the VM is running, whether the agent is healthy enough to accept run-command, and whether a recent extension deployment failed in a way that might have broken access. An agent reporting as not ready is itself a finding: run-command and VMAccess will not work, and your recovery shifts to the serial console, which does not depend on the agent. Reading the instance view first, before you attempt a run-command, saves you from waiting on a command that will never return because the agent that should execute it is down.
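That agent check, and the decision it drives, can be sketched in two small functions. The names are hypothetical, and the sketch assumes the instance view reports the agent's display status as Ready when it is healthy.

```shell
# Print the guest agent's display status from the instance view.
agent_status() (
  az vm get-instance-view --resource-group "$1" --name "$2" \
    --query "instanceView.vmAgent.statuses[0].displayStatus" --output tsv
)

# Choose the recovery surface that can actually work given the agent status.
recovery_surface() {
  case "$1" in
    Ready) echo "run-command or VMAccess" ;;
    *)     echo "serial console" ;;   # agent down or unknown: avoid agent-dependent tools
  esac
}
```

Chained together as recovery_surface "$(agent_status prod-rg web-01)", it tells you which tool to open before you spend minutes waiting on a run-command that cannot execute.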

When the Key Never Made It: Provisioning and cloud-init Failures

A distinct flavor of permission-denied failure happens not because a key broke but because it never arrived. When you create a Linux VM with an SSH public key, the platform hands that key to the guest provisioning process, which on most modern images is cloud-init, and cloud-init writes it into the user’s authorized_keys file during first boot. If cloud-init fails partway, hangs, or runs against a custom image that altered the provisioning user, the key may never be written, and your first connection attempt fails with publickey even though you supplied a perfectly valid key at creation. The signature is a brand-new VM that has never accepted a connection, as opposed to an existing VM that suddenly stopped, and that distinction tells you to look at provisioning rather than at a regression.

Confirming this requires reading the cloud-init logs and the authorized_keys file through the serial console or run-command, since you cannot get a shell. The cloud-init output log records each module it ran and any errors, and the authorized_keys file either contains your key or does not:

az vm run-command invoke --resource-group <rg> --name <vm> \
  --command-id RunShellScript \
  --scripts "cloud-init status --long; tail -n 40 /var/log/cloud-init-output.log; cat /home/azureuser/.ssh/authorized_keys 2>/dev/null || echo 'no authorized_keys'"

If cloud-init reports an error state or the authorized_keys file is missing or empty, the provisioning step is your cause. The recovery is the same VMAccess reset used for any key problem: az vm user update writes the key directly through the agent, bypassing cloud-init entirely, which gets you in even when first-boot provisioning failed. Custom images are a common trigger here, because an image captured without properly generalizing it, or one that hardcoded a different provisioning user, can leave the platform writing the key to an account you are not trying to log in as. When you build VMs from a custom image and they consistently reject the creation-time key, suspect the image’s provisioning configuration before you suspect the key, and confirm which user the image expects by reading its home directories.

There is a related failure where cloud-init succeeded but a later configuration management run overwrote authorized_keys. Tools that manage SSH keys declaratively will enforce their own list, and if your key is not in that managed list, a config run can remove it, locking you out of a VM that worked yesterday. The boundary between this and the provisioning failure is timing: provisioning failures lock you out from the first connection, while a config-management overwrite locks you out after a previously working period, often right after a deployment. Reading when access last worked, which the auth log can tell you through the serial console, points you at the right culprit.

SSH Key Formats, Agents, and Client-Side Pitfalls

Not every permission-denied failure lives on the server. A meaningful share are client-side, where the key, the agent, or the client configuration is the problem, and these are worth isolating because no amount of server-side fixing will resolve them. The first pitfall is key format. Azure expects an OpenSSH-format public key, and a key generated or exported in a different format, such as the PuTTY .ppk format or an SSH2 public key with its distinctive header lines, will not work directly. If you generated your key in PuTTYgen, you need to export the OpenSSH public key, not the .ppk, and on the private side you either convert the .ppk to OpenSSH format or use a client that reads .ppk natively. A key pasted with a wrong header, or split across lines by a terminal that wrapped it, produces a publickey failure that looks like a mismatch but is really a malformed key.
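A quick look at the first line of a key file is usually enough to tell these formats apart. The function below is an illustrative sketch with a hypothetical name and made-up labels; it checks only the header and the line count, so treat ssh-keygen -l -f <file> as the authoritative validity check.

```shell
# Guess the format of a key file from its first line and line count.
pubkey_format() (
  file=$1
  [ -r "$file" ] || { echo "unreadable"; exit 1; }
  first=$(head -n 1 "$file")
  case "$first" in
    "---- BEGIN SSH2 PUBLIC KEY ----"*)
      echo "ssh2" ;;             # convert with: ssh-keygen -i -f <file>
    "PuTTY-User-Key-File-"*)
      echo "ppk" ;;              # a PuTTY private key, not a public key at all
    "ssh-rsa "*|"ssh-ed25519 "*|"ecdsa-sha2-"*)
      # An OpenSSH public key is a single line; more than one means it was wrapped.
      if [ "$(grep -c . "$file")" -eq 1 ]; then
        echo "openssh"
      else
        echo "openssh-wrapped"
      fi ;;
    *)
      echo "unknown" ;;
  esac
)
```

Anything other than openssh is a reason to regenerate or convert the key before you look at the server at all.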

The second pitfall is the agent offering the wrong key. When your SSH agent holds several identities, the client offers them in turn, and the server counts every rejected offer against its MaxAuthTries limit, which defaults to 6. On a workstation with many keys loaded, the server can disconnect you with a “Too many authentication failures” message before the client ever reaches the key that works. The verbose trace shows each key being offered, so you can see whether your intended key is even reaching the server. The fix is to force the specific identity and stop the agent from offering others:

ssh -i ~/.ssh/the_correct_key -o IdentitiesOnly=yes azureuser@<ip>

IdentitiesOnly=yes tells the client to offer only the key named with -i rather than every identity the agent holds, which both speeds up the handshake and removes the ambiguity about which key actually authenticated. This single flag resolves a surprising number of intermittent publickey failures on multi-key workstations.

The third pitfall is key algorithm compatibility. Older clients or servers may not negotiate a modern key type, and conversely a very new key algorithm against an older sshd can fail to agree on an acceptable method. The trend is toward ed25519 keys for their shorter length and strong security, but an older image’s sshd may expect RSA. The reverse also bites: OpenSSH 8.8 and later disable the SHA-1 based ssh-rsa signature algorithm by default, so a newer client can decline to use an RSA key against an old sshd that supports only that signature type, even though the key itself is valid. The verbose trace names the algorithms offered and the point of failure, so when authentication fails before any key is even offered, suspect an algorithm negotiation problem rather than a credential one. Where your image and the platform accept it, generating a modern ed25519 key and adding it through VMAccess sidesteps most of these compatibility edges, and standardizing on one key type across your fleet removes the negotiation surprises entirely.

The fourth and most mundane pitfall is local file permissions on the client. SSH refuses to use a private key that is readable by other users on your own machine, printing a warning about unprotected key files and then falling back to other authentication methods, which can end in a publickey failure. The fix is the same strictness the server enforces: chmod 600 ~/.ssh/the_key on the private key. It is the client-side mirror of the server-side permission rule, and it catches people who copied a key between machines and lost its restrictive mode in the transfer.
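The client-side check is simple enough to script for every key you copy between machines. The sketch below uses hypothetical function names and reads the mode from ls -ld, which avoids the differences between the GNU and BSD versions of stat.

```shell
# Succeeds when the key file grants no permissions at all to group or other.
key_mode_ok() {
  # In the ls mode string, characters 5-10 are the group and other permission bits.
  [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Tighten the mode only when the check fails.
fix_key_mode() {
  key_mode_ok "$1" || chmod 600 "$1"
}
```

Running fix_key_mode ~/.ssh/the_key after any copy or restore keeps the client from silently skipping the key.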

Read the Auth Log to Pinpoint the Authentication Layer

When the failure is in the authentication family, the server’s own auth log is the most direct evidence you can gather, because it records exactly what the daemon thought of each attempt. On Debian and Ubuntu the relevant log is /var/log/auth.log, while on RHEL-family systems it is /var/log/secure, and the systemd journal carries the same lines under the SSH service unit, which is named sshd on RHEL-family systems and ssh on Debian and Ubuntu. Reading the tail of this log through run-command or the serial console converts a vague client-side rejection into a precise server-side reason, which is the difference between guessing and knowing.

az vm run-command invoke --resource-group <rg> --name <vm> \
  --command-id RunShellScript \
  --scripts "journalctl -u sshd -u ssh -n 40 --no-pager; tail -n 40 /var/log/auth.log 2>/dev/null; tail -n 40 /var/log/secure 2>/dev/null"

The log distinguishes causes that look identical from the client. A line about authentication failure for an invalid user tells you the username is wrong, not the key. A line saying authentication was refused because of bad ownership or modes on the home directory or authorized_keys tells you the permission rule fired, and it names the exact path it objected to. A line showing the key being accepted followed by a session that immediately closes points at a shell or a disk problem rather than authentication, because the credential passed and something after it failed. And a flood of failed password attempts from addresses you do not recognize tells you the port is exposed to brute-force traffic, which is both a security signal and a reason to move to key-only authentication or Bastion. The auth log is the single richest source of truth for the permission-denied family, and reading it should precede any key reset, because it frequently shows that the key was never the problem.

The log also dates the last successful login, which resolves the timing question that separates a provisioning failure from a regression. If the log shows successful logins until a specific moment and only failures afterward, something changed at that moment, and correlating that timestamp with deployments, configuration runs, or disk-fill events points straight at the trigger. If the log shows no successful login ever, the VM never accepted the key and you are looking at a provisioning or image problem. This temporal reading is a small habit with a large payoff, because it turns “it stopped working” into “it stopped working at 02:14, right after the nightly config run,” which is a diagnosis rather than a complaint.
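Both readings of the log, the per-line cause and the last successful login, reduce to simple pattern matches. The functions below are a sketch with hypothetical names and labels; the patterns match the wording OpenSSH commonly logs, and the order matters, since a failed password for a nonexistent account is classified as a username problem first.

```shell
# Classify one auth log line into the cause it points at.
classify_auth_line() {
  case "$1" in
    *"Invalid user"*|*"invalid user"*) echo "wrong-username" ;;
    *"bad ownership or modes"*)        echo "permissions" ;;
    *"Accepted publickey"*|*"Accepted password"*) echo "auth-ok" ;;
    *"Failed password"*)               echo "password-failure" ;;
    *)                                 echo "other" ;;
  esac
}

# Read auth log lines on stdin and print the most recent successful login.
last_accepted_login() {
  grep 'Accepted ' | tail -n 1
}
```

Inside the guest, journalctl -u sshd --no-pager | last_accepted_login gives you the timestamp to correlate against deployments and configuration runs; on Debian and Ubuntu, substitute the ssh unit or read /var/log/auth.log instead.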

When the NSG Is Innocent: Firewalls, Route Tables, and Hybrid Paths

The timed-out family is not always a missing NSG rule, and assuming it is can send you in circles when the real block lives elsewhere in the network. Three other path elements can swallow SSH traffic, and each one presents as the same silent timeout. The first is Azure Firewall or a network virtual appliance sitting in the path because a route table forces traffic through it. If a user-defined route sends your subnet’s outbound or return traffic to a firewall that has no rule permitting the SSH flow, the firewall drops it, and from the client it is indistinguishable from an NSG deny. The fix lives in the firewall’s rule collection, not the NSG, and you find it by reading the effective route table on the NIC and following where 0.0.0.0/0 or your return prefix points.

The second is an asymmetric route. Azure routing is generally symmetric, but a misconfigured UDR can send inbound SSH to the VM directly while forcing the return traffic through an appliance that does not preserve the connection, so the SYN arrives and the SYN-ACK never returns by the path the client expects. Stateful firewalls hate asymmetric routes and will drop the orphaned half of the flow, producing a timeout that no NSG change will fix. Reading the effective routes for both directions and confirming they are consistent is the only way to catch this, and it is a frequent culprit in hub-and-spoke topologies where a security appliance inspects all traffic.

The third is the hybrid path, where you connect not over the public internet but across a VPN or ExpressRoute from an on-premises network. Here the public IP and its NSG may be irrelevant because you are reaching the VM by its private IP, and the block can live in an on-premises firewall, a gateway route, or an NSG rule scoped to the wrong source prefix. A timeout over a private path sends you to confirm the gateway is connected, the routes propagate, and the NSG allows 22 from your on-premises address space rather than from a public source. The diagnostic discipline is the same as the public case, but the surfaces to check are the connection’s gateway and routing rather than an internet-facing rule. In all three of these cases, Network Watcher connection troubleshoot is the tool that localizes the drop, because it tests the actual path and reports the hop where the connection dies, which is far faster than reasoning about a multi-element network by hand.
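The three checks named in this section map onto three Azure CLI calls. The wrappers below are a hedged sketch with hypothetical function names; they assume Network Watcher is enabled in the region, and the connectivity test assumes its source is another Azure VM (a jump host, for example) that has the Network Watcher agent extension installed.

```shell
# Show the routes actually in effect on a NIC, including user-defined routes.
effective_routes() (
  az network nic show-effective-route-table \
    --resource-group "$1" --name "$2" --output table
)

# Ask where traffic from the VM toward a destination IP is sent next.
next_hop() (
  az network watcher show-next-hop \
    --resource-group "$1" --vm "$2" --source-ip "$3" --dest-ip "$4"
)

# Attempt a real TCP connection to port 22 from a source VM and report per-hop results.
probe_ssh_path() (
  az network watcher test-connectivity \
    --resource-group "$1" --source-resource "$2" \
    --dest-address "$3" --dest-port 22
)
```

To catch an asymmetric route, run next_hop from the target VM toward your client address and compare the answer with the path the inbound traffic takes; a next hop of a virtual appliance in only one direction is the signature.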

Frequently Asked Questions

Q: What is the difference between SSH connection refused and connection timed out on Azure?

Connection refused and connection timed out point at opposite sides of the network boundary. Refused means a host received your TCP request on port 22 and actively rejected it with a reset packet, because no service is listening there; the network path is open and the cause is inside the VM, such as a stopped sshd, a full disk, or a guest firewall. Timed out means nothing answered at all, so your packets were dropped silently somewhere along the path, which on Azure almost always means a Network Security Group missing an allow-22 rule, a route table sending traffic into a blackhole, or a detached public IP. Reading which message your client printed eliminates half the possible causes immediately: refused sends you inside the machine to the daemon and the disk, while timed out sends you to the Azure network configuration. This single distinction, the refused-versus-timed-out rule, is the most useful diagnostic shortcut for SSH lockouts on Azure.

Q: How do I fix permission denied publickey on an Azure VM?

Start by confirming it is genuinely a key problem and not a permission or username issue, because most publickey failures are not about a bad key. Run the client with ssh -vvv to see which keys are offered and how the server responds. The most common real cause is file permissions: SSH refuses to use authorized_keys if the home directory, the .ssh directory, or the file itself is too open, so the home directory must not be group or world writable, .ssh must be mode 700, and authorized_keys must be mode 600 and owned by the user. Fix these through run-command since you cannot log in. Other frequent causes are logging in as the wrong username rather than the provisioning user, the client offering the wrong key from a crowded agent, or a corrupted authorized_keys file. Force a specific key with ssh -i <key> -o IdentitiesOnly=yes. Only when the key is genuinely lost or wrong should you reset it with the VMAccess extension via az vm user update.
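The permission repair itself is three chmod calls and, if ownership drifted, one chown. The function below is a sketch meant to run as root inside the guest, for example as the script body of a run-command invocation; the function name is hypothetical, and the optional second argument is the account that should own the files.

```shell
# Restore the modes sshd requires on a user's home, .ssh directory, and authorized_keys.
fix_ssh_permissions() (
  home=$1; user=$2
  if [ ! -d "$home/.ssh" ]; then
    echo "no .ssh directory under $home" >&2
    exit 1
  fi
  chmod go-w "$home"                  # home must not be group or world writable
  chmod 700 "$home/.ssh"
  if [ -f "$home/.ssh/authorized_keys" ]; then
    chmod 600 "$home/.ssh/authorized_keys"
  fi
  if [ -n "$user" ]; then
    chown -R "$user" "$home/.ssh"     # only when an owner is named
  fi
)
```

For the default account that would be fix_ssh_permissions /home/azureuser azureuser, after which a fresh connection attempt should get past the ownership and modes check.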

Q: Why does my Azure VM SSH connection time out?

A timeout means your packets reached no listener and were dropped silently, which on Azure is a network path problem rather than a daemon problem. The leading cause is a Network Security Group without an inbound allow rule for TCP port 22 from your source address, or one where a broader deny rule sits at a lower priority number and matches first. Confirm by pulling the effective NSG rules on the network interface and reading them in priority order. If the NSG allows 22, look next at the route table for a user-defined route sending traffic to a firewall or appliance that drops it, at a detached or reassigned public IP, and at a guest firewall like ufw or firewalld configured to drop rather than reject. Azure Network Watcher’s IP flow verify simulates the exact flow and names the rule responsible, which settles the question quickly. A timeout never points at the SSH daemon itself, so do not restart sshd to fix it.

Q: How do I reset the SSH key on an Azure VM I cannot log into?

Use the VMAccess extension through the Azure CLI, which writes a new public key into the user’s authorized_keys file via the Azure agent without needing any network access or shell session. The command is az vm user update --resource-group <rg> --name <vm> --username azureuser --ssh-key-value "$(cat ~/.ssh/new_key.pub)", after which you connect with the matching private key. This works whenever the agent is healthy and the VM is running, which covers nearly every authentication lockout. Be aware that the default behavior can replace the existing key for that user rather than appending to it, so coordinate before resetting a shared account, and prefer per-user accounts so a reset never locks out a colleague. If the agent is unhealthy and VMAccess cannot run, fall back to the serial console, log in with a local account, and edit authorized_keys directly. Resetting the key is the correct recovery for a lost or leaked credential; it is not the right tool for a permission or network problem.

Q: Can a full disk cause SSH connection refused?

Yes, and it is one of the most under-diagnosed causes. When the root filesystem reaches 100 percent, the SSH daemon cannot write its logs or create the temporary files it needs to set up a session, so it either fails to start after a restart or dies during the handshake. The client sees a clean refused, or sometimes the cryptic message about the connection being reset during key exchange identification. The tell is a VM that worked yesterday and broke without any configuration change. Confirm by running df -h / through the serial console or run-command, and if the disk is full, find the offender with a directory size scan. The usual culprits are unrotated application logs, an uncapped systemd journal, a bloated package cache, or core dumps. Free enough space, restart sshd, and the daemon recovers. If the disk is genuinely undersized, extend the managed disk and grow the partition and filesystem inside the guest, remembering that the in-guest growth step is required after the platform resize.
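The check and the directory size scan described above might look like the following when run through the serial console. It is a rough sketch: the function names disk_use_pct and biggest_dirs are hypothetical, and the du and sort options assume GNU coreutils, which is what the common Azure Linux images ship.

```shell
# Print the use percentage of the filesystem that holds a path, as a bare number.
disk_use_pct() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# List the largest first-level directories under a path, staying on one filesystem.
biggest_dirs() {
    du -x -h --max-depth=1 "$1" 2>/dev/null | sort -rh | head -n 10
}
```

A typical use is to call disk_use_pct / first and, if the number is at or near 100, follow with biggest_dirs / and then repeat the scan inside whichever directory dominates.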

Q: Why does SSH warn that the host key changed after a redeploy?

Reimaging or recreating an Azure VM regenerates its SSH host keys, so the fingerprint your client cached in known_hosts no longer matches what the server now presents. A redeploy on its own moves the VM to a new host and keeps the OS disk, host keys included, but it can change a dynamic public IP, and the new address may already be cached against a different machine. SSH treats a changed host key as a possible interception attack and refuses to connect until you resolve it, which is why you see the loud warning. The same thing happens when a dynamic public IP is released and reassigned to a different machine, or when a DNS record is repointed to a new VM, because your client keys the cached fingerprint to the address or name you connect with. The fix is to remove the stale entry locally with ssh-keygen -R <ip-or-hostname>, after which your next connection records the new key. Before clearing it, confirm the change was expected, because an unexpected host key change on a machine you did not touch is exactly the signal the warning exists to raise. On Azure the cause is almost always a rebuild or IP reuse, and a static public IP keeps the address, and with it the cached entry, stable across stop and start cycles.

Q: How do I check the SSH daemon status when I cannot log in?

Use the run-command channel, which executes a script inside the guest through the Azure agent without needing the network or an SSH session. Run az vm run-command invoke --resource-group <rg> --name <vm> --command-id RunShellScript --scripts "systemctl status sshd; ss -tlnp | grep :22", and the output shows whether the daemon is active and whether anything is bound to port 22. If the service is failed, read why with journalctl -u sshd -n 50 --no-pager before restarting, because a syntax error in sshd_config or a missing host key will make it exit again immediately. If the daemon is active but nothing is bound to 22, suspect a changed Port directive. When the agent itself is unhealthy and run-command does not return, switch to the serial console, which reaches the guest through boot diagnostics rather than the agent and works even when the agent is down. The serial console gives you an interactive login where you can run the same checks directly.
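For repeated use, the run-command check above can be wrapped in a function. This is a sketch, not the only way to do it: check_sshd is a hypothetical name, and the script assumes the unit is called sshd (on Debian and Ubuntu images the unit is ssh, usually with sshd as an alias).

```shell
# Hypothetical wrapper around the run-command check described in the text.
# Usage: check_sshd <resource-group> <vm-name>
check_sshd() {
    az vm run-command invoke \
        --resource-group "$1" \
        --name "$2" \
        --command-id RunShellScript \
        --scripts "systemctl status sshd --no-pager; ss -tlnp | grep :22; journalctl -u sshd -n 50 --no-pager"
}
```

Bundling the status, the listening socket, and the journal into one call saves round trips, since each run-command invocation can take a while to return.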

Q: What does kex_exchange_identification connection reset mean on Azure?

That message means the SSH daemon began the handshake, started exchanging identification strings, and then the connection was reset before key exchange completed. The daemon was alive enough to answer but could not carry the session through, which on Azure most often means the daemon is starting and immediately dying, or the disk filled and sshd cannot complete a session setup. It can also appear when a load balancer or proxy in front of the VM resets the connection, or when fail2ban or a similar tool banned your address mid-connection after earlier failures. Diagnose it like a refused-adjacent failure: check the daemon status and the disk through the serial console or run-command, read the sshd journal for the reason it dropped the session, and confirm no intrusion-prevention tool is resetting connections from your address. It is rarely a key problem, so resist the urge to reset credentials; the daemon told you it died during setup, which points at the service and the host, not authentication.

Q: Does an Azure NSG block SSH by default?

A Network Security Group denies all inbound traffic that is not explicitly allowed, so a VM without an inbound allow rule for port 22 has its SSH traffic dropped silently, which presents as a timeout rather than a refusal. The default rule set includes a low-priority DenyAllInBound rule that catches anything no higher-priority rule permitted. When you create a Linux VM through the portal and choose to allow SSH, the platform adds the allow rule for you, but a VM created without that option, or one whose rule was later deleted during a cleanup, has no path for SSH and times out. NSGs can attach to both the subnet and the network interface, and both sets apply, so a permissive NIC rule does not help if the subnet denies the traffic. Always read the effective rules on the NIC rather than a single NSG, scope the allow rule’s source to your address rather than the whole internet, and confirm no broader deny sits at a lower priority number ahead of your allow.

Q: How do I use the Azure serial console to fix SSH?

The serial console gives you a text console to the VM through the Azure portal, delivered over the boot diagnostics channel rather than the network, which is why it works when SSH is entirely broken. It requires boot diagnostics to be enabled on the VM, so enable that at creation as a default. Open the serial console from the VM’s blade in the portal, log in with a local account that has a password set, and you have a shell that bypasses the network and the SSH daemon completely. From there you inspect the daemon with systemctl status sshd, read the journal, check disk usage with df -h, validate the configuration with sshd -t, fix authorized_keys permissions, free disk space, and restart the daemon. It is the universal recovery surface for the refused and full-disk families and for any problem that left you unable to reach the agent. Because it does not depend on the agent or the network, it works in situations where run-command and VMAccess cannot, making it the tool of last resort that almost always succeeds.

Q: Why does my SSH connection get refused only after the VM runs for a while?

A refused connection that appears after a period of healthy operation, with no configuration change, is the classic signature of a resource exhaustion problem, most commonly a full disk. Something on the VM wrote until the root filesystem filled, after which the SSH daemon could no longer create session files and began refusing or resetting connections. The slow-burn timing distinguishes it from a configuration error, which would break access immediately on change. The usual culprits are an application log that grows without rotation, a systemd journal with no size cap, a database write-ahead log, or accumulating core dumps. Confirm with df -h through the serial console, free the space, and restart the daemon. The durable fix is to cap the growing files with logrotate and a journal size limit, move volatile data to a separate disk, and add a monitoring alert on filesystem usage so you get days of warning before the disk fills. Memory pressure leading to the daemon being killed by the out-of-memory killer can produce the same delayed refusal, so check the kernel log if the disk is clear.
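As one illustration of the journal size limit mentioned above, the following sketch writes a journald drop-in that sets SystemMaxUse. The helper name cap_journal, the file name size-cap.conf, and the 500M figure in the usage line are all assumptions chosen for the example; pick a cap that suits your disk.

```shell
# Hypothetical helper: write a journald drop-in that caps persistent journal size.
# Usage (as root): cap_journal /etc/systemd/journald.conf.d 500M
cap_journal() {
    dropin_dir="$1"
    max_use="$2"
    mkdir -p "$dropin_dir" || return 1
    printf '[Journal]\nSystemMaxUse=%s\n' "$max_use" > "$dropin_dir/size-cap.conf"
}
```

After writing the drop-in, systemctl restart systemd-journald applies it, and journalctl --vacuum-size=500M reclaims space already consumed by older journal files.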

Q: Can I SSH to an Azure VM without exposing port 22 to the internet?

Yes, and it is the recommended approach for production. Azure Bastion provides browser-based SSH and RDP to your VMs over their private IP through a managed jump host, so you never open port 22 to the public internet and you never depend on a public IP or an internet-facing NSG rule for access. This eliminates an entire class of timed-out failures driven by NSG misconfiguration and removes the brute-force exposure that a public SSH port invites. Alternatives include connecting over a VPN or ExpressRoute to the VM’s private address, or using Just-in-Time access to open the port only for a requested window and close it automatically afterward. Each of these shrinks or removes the public attack surface. Bastion is the cleanest because access flows entirely over the private network and through Azure authentication, and once it is in place, the recurring lockout patterns tied to public-port exposure largely disappear. The trade-off is a managed-service cost and a one-time setup, which is modest against the security and reliability benefit.

Q: How do I confirm whether an NSG or a guest firewall is blocking SSH?

Separate the two layers because they fail differently and are fixed in different places. For the NSG, use Azure Network Watcher’s IP flow verify, which takes your source address, the VM, port 22, and the inbound direction, then reports whether the effective rules allow or deny the flow and names the responsible rule. If IP flow verify says allow, the Azure network is open and the block is elsewhere. For the guest firewall, reach the VM through the serial console or run-command and check the local packet filter: ufw status on Ubuntu or firewall-cmd --list-all on RHEL-family systems. A guest firewall that rejects connections produces a refused message while one that drops them produces a timeout, so the client message also hints at which layer is involved. Checking both in order, NSG first with IP flow verify and then the guest firewall through the console, isolates the responsible layer without guesswork and stops you from editing the NSG when the block is actually inside the guest.
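The two checks can be sketched as a pair of functions, one for each layer. Treat this as an illustration under assumptions: the function names are hypothetical, the VM's private IP and your own public IP are passed in as arguments, and 60000 is an arbitrary stand-in for the client's ephemeral source port. The first function runs from your workstation; the second runs inside the guest through the serial console or run-command.

```shell
# Ask Network Watcher whether inbound TCP 22 from your address is allowed.
# Usage: verify_ssh_flow <resource-group> <vm-name> <vm-private-ip> <your-public-ip>
verify_ssh_flow() {
    az network watcher test-ip-flow \
        --resource-group "$1" \
        --vm "$2" \
        --direction Inbound \
        --protocol TCP \
        --local "$3:22" \
        --remote "$4:60000"
}

# Inside the guest: show whichever local firewall front end is installed.
guest_firewall_status() {
    if command -v ufw >/dev/null 2>&1; then
        ufw status verbose
    elif command -v firewall-cmd >/dev/null 2>&1; then
        firewall-cmd --list-all
    else
        echo "neither ufw nor firewalld found"
    fi
}
```

If the first call reports that access is allowed, the NSG is cleared and the second function is the next place to look.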

Q: My key worked yesterday and now fails with publickey, what changed?

When access worked and then stopped without your touching the key, something on the server side changed the authorized_keys file or the path permissions, and the auth log will tell you what and when. Read /var/log/auth.log or /var/log/secure through the serial console and find the timestamp of the last successful login; the change happened around then. The most common causes are a configuration management run that enforces its own key list and removed yours, a permission change on the home or .ssh directory that made sshd distrust the keys, or a disk that filled so the daemon can no longer read the file. Correlate the failure timestamp with your deployments and scheduled jobs to find the trigger. The recovery is to restore your key, either by adding it back to the managed list so the next config run keeps it, or by resetting it with VMAccess for immediate access. Fixing only the symptom without finding the trigger guarantees the next config run locks you out again.

Q: My password login over SSH fails, does that mean the password is wrong?

Not necessarily. Many Azure Linux images and most hardening baselines disable password authentication entirely by setting PasswordAuthentication to no in the daemon configuration, so a password login fails as a matter of policy regardless of whether the password is correct. The directive can live in the main sshd_config or in a drop-in file under sshd_config.d, and the drop-in often wins, so check both through the serial console. A second cause is that a VM provisioned with only an SSH key has no password set for the user, so there is nothing to authenticate against. The right fix in almost every case is to use a key rather than a password, because a password-authenticated SSH port on a public IP attracts brute-force traffic within hours of exposure. If you genuinely need a password login for a controlled scenario, enable it deliberately, set a password for the account, reload the daemon, and restrict the NSG source to your address so the world cannot reach the prompt.
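To check both locations in one pass, a small grep wrapper is enough. This is only a sketch: show_password_auth is a hypothetical name, and the paths in the usage line are the conventional OpenSSH locations, which your image may or may not follow.

```shell
# Hypothetical helper: list every PasswordAuthentication line with the file it came from.
# Usage: show_password_auth /etc/ssh/sshd_config /etc/ssh/sshd_config.d
show_password_auth() {
    main_conf="$1"
    dropin_dir="$2"
    grep -iH '^[[:space:]]*PasswordAuthentication' "$main_conf" "$dropin_dir"/*.conf 2>/dev/null
}
```

The output shows where each setting lives but not which one wins. Running sshd -T as root and filtering for passwordauthentication prints the value the daemon actually resolves.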

Q: How do I get a verbose SSH log to diagnose the failure?

Run the SSH client with maximum verbosity using ssh -vvv azureuser@<public-ip>, which prints a detailed trace of every step from address resolution through the TCP connection, the protocol negotiation, the key exchange, and the authentication attempts. The last several lines before the failure tell you which layer broke. If you see the connection established and key exchange completing before a permission-denied message, the network and daemon are healthy and the problem is authentication, and the trace also shows which keys the client offered. If you see an immediate connection refused after the socket attempt, the host reset you and the daemon is the suspect. If the trace hangs after attempting to connect and then reports a timeout, nothing answered and you are in network-path territory. The verbose trace turns a one-line error into evidence you can act on, and keeping the last twenty lines handy justifies each step of the diagnosis that follows. It is the first command to run on any SSH failure you cannot immediately explain.

Q: Why does run-command not work on my locked-out VM?

The run-command channel and the VMAccess extension both execute through the Azure Linux agent inside the guest, so they require the agent to be healthy and the VM to be in the running state. If the VM is deallocated, the agent is not running and neither tool can reach it; start the VM first. If the VM is running but the agent is unhealthy, which the instance view reports as a not-ready agent status, run-command will hang or fail because nothing inside the guest is listening for the instruction. An agent can become unhealthy if the disk filled and starved it, if a configuration change broke its service, or if the guest is hung partway through boot. In all these cases the recovery shifts to the serial console, which reaches the guest through boot diagnostics rather than the agent and works independently of it. Always read the instance view before attempting a run-command, because confirming the power state and agent health up front saves you from waiting on a command that cannot complete.
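Reading the instance view first can be sketched like this. The function name vm_health is hypothetical, and the two JMESPath queries are one reasonable way to pull the power state and the agent status out of the instance view; the exact display strings depend on the platform.

```shell
# Hypothetical helper: print the power state and agent status before trying run-command.
# Usage: vm_health <resource-group> <vm-name>
vm_health() {
    printf 'power state: '
    az vm get-instance-view \
        --resource-group "$1" \
        --name "$2" \
        --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus" \
        --output tsv
    printf 'agent status: '
    az vm get-instance-view \
        --resource-group "$1" \
        --name "$2" \
        --query "instanceView.vmAgent.statuses[].displayStatus" \
        --output tsv
}
```

If the power state is not running, start the VM; if the agent status is anything other than ready, go straight to the serial console instead of waiting on run-command.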

Q: How do I extend a full OS disk that is blocking SSH on Azure?

Extending the disk is a two-stage operation, and missing the second stage is why people say the extension did not help. First, resize the managed disk from the control plane with az disk update --resource-group <rg> --name <disk> --size-gb <new-size>, which on many SKUs requires deallocating the VM because you cannot grow an attached OS disk while it runs. Second, after the VM is back up, grow the partition and filesystem inside the guest, because the larger disk is visible but the partition still occupies the old size. Confirm the root device and partition number with lsblk first, since /dev/sda and partition 1 are only the common case. Then use growpart /dev/sda 1 to expand the partition, followed by resize2fs /dev/sda1 for ext4 or xfs_growfs / for xfs to extend the filesystem onto the new space. Only after the in-guest growth does the operating system report the additional capacity. If the disk fill was caused by an uncapped log rather than a genuinely small disk, freeing space and capping the log is the better fix, and extending the disk only delays a recurrence. Confirm the new free space with df -h before declaring the problem solved.

Q: Is connection refused ever a network problem rather than a VM problem?

Almost never at the SSH daemon layer, but a refused message can come from something other than the target VM. A public load balancer with an inbound NAT rule can refuse or reset a connection on behalf of an unhealthy backend, so the refused you see originates at the load balancer pointing at a VM with no healthy listener rather than at the VM you think you are reaching. A guest firewall configured to reject rather than drop also produces a refused even though it sits in the network layer of the guest. And a proxy or bastion host in the path can refuse if its own backend connection fails. So while a true refused from the VM itself means the daemon or local port, you should confirm the path: check the load balancer backend health and NAT target if you connect through one, and check the guest firewall mode. Once you confirm you are reaching the VM directly and no intermediary is resetting on its behalf, a refused reliably means the daemon, the local port, or a full disk inside that machine.

Q: What is the fastest way to triage an Azure SSH lockout?

Read the client message first, because it does most of the work for free. Refused sends you inside the VM to the daemon and the disk; timed out sends you to the Azure network and the NSG; permission denied sends you to authentication, where you check permissions and username before ever resetting a key. With the layer named, run one confirming command through the serial console or run-command: for refused, systemctl status sshd and df -h; for timed out, the effective NSG rules and Network Watcher IP flow verify; for permission denied, the verbose client trace and the authorized_keys permissions. Apply the recovery only after the check proves the cause: free the disk and restart the daemon, add the NSG allow, fix the permission, or reset the key with VMAccess. The entire sequence takes minutes and never requires destroying the machine. The discipline that makes it fast is refusing to act before you have read the message and proven the cause, which is exactly what separates a five-minute repair from an afternoon of guessing.
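The first step of that triage, mapping the client message to a layer, can be written down as a lookup. This is a toy sketch rather than a complete parser: classify_ssh_error is a hypothetical name, and the patterns cover only the message families discussed in this article.

```shell
# Hypothetical helper: map an ssh client error line to the layer it implicates.
# Usage: classify_ssh_error "ssh: connect to host 203.0.113.10 port 22: Connection refused"
classify_ssh_error() {
    case "$1" in
        *"Connection refused"*)
            echo "guest: daemon, port, or disk" ;;
        *"timed out"*)
            echo "network: NSG, route, or public IP" ;;
        *"Permission denied"*)
            echo "authentication: permissions, username, or key" ;;
        *"kex_exchange_identification"*)
            echo "guest: daemon dying during handshake, or an intermediary reset" ;;
        *)
            echo "unknown: rerun with ssh -vvv" ;;
    esac
}
```

The value of writing it down is less the automation than the discipline: each branch names the one layer to check before anything is changed.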