The reason most teams cannot answer a simple incident question, “what was this service doing at 02:14 last night,” is not that Azure Monitor and Log Analytics failed to capture the signal. It is that nobody routed the signal anywhere it could be queried. The resource ran fine, emitted its telemetry into the void, and left the on-call engineer staring at an empty result set at the worst possible moment. Azure Monitor and Log Analytics together form the platform that prevents that outcome, and the gap between switching them on and actually understanding them is where almost every observability failure on Azure begins.
That gap is the subject of this guide. Azure Monitor is the umbrella platform that collects, stores, and acts on telemetry from every layer of your estate, from the bare metal under a virtual machine to a single line of application code. Log Analytics is the query engine and storage tier underneath it, the place where log records land and where you interrogate them with a query language called KQL. The two names get used interchangeably in hallway conversation, and that imprecision is the source of more confusion than any other single thing about Azure observability. They are not the same component, they do not bill the same way, and they do not fail for the same reasons.

By the end of this guide you will be able to draw the boundary between the two with confidence, decide for any given signal whether it belongs in metrics or in logs, design a workspace topology that does not collapse under access-control or cost pressure, write KQL that answers real questions instead of returning timeouts, wire a resource so its telemetry actually arrives, and read an ingestion bill well enough to predict it before it lands. Those are the skills that separate an engineer who has Azure Monitor enabled from one who can use it under fire.
What Azure Monitor and Log Analytics Actually Are
Hold this mental model and the rest of the platform stops being a maze: Azure Monitor is a pipeline with two distinct destinations, and Log Analytics is one of those destinations. Telemetry is produced at a source, it flows through a collection mechanism, and it lands in either the metrics store or a Log Analytics workspace. Alerting, visualization, and automation sit on top of whichever store holds the data you care about. Every feature you will ever touch is a node on that pipeline, and most of your troubleshooting comes down to finding which node is silent.
What is the difference between Azure Monitor and Log Analytics?
Azure Monitor is the whole observability platform: collection, two storage tiers, alerting, dashboards, and automation. Log Analytics is the logs storage and query tier inside it, backed by a workspace you query with KQL. Metrics live in a separate time-series store, not in Log Analytics. Monitor is the platform; Log Analytics is one room in it.
The reason the distinction matters operationally is that the two stores have completely different physics. The metrics store is a lightweight, pre-aggregated, time-series database optimized for one job: answering “what is the value of this counter, sampled at one-minute granularity, over the last few hours or days.” It is fast, it is cheap, and platform metrics for most resources are collected automatically with no configuration on your part. You did not turn on CPU percentage for your virtual machines; it was already there. The trade-off is that the metrics store is dimensionally shallow. It holds numbers across a small fixed set of dimensions and a bounded retention window, and it cannot hold the rich, high-cardinality, free-text records that a real investigation needs.
Log Analytics is the opposite kind of store. A workspace holds typed tables of structured records, each row a discrete event with as many columns as the source cares to emit: a request with its URL, status code, duration, client IP, and trace identifier; a security sign-in with its conditional-access verdict; a container’s stdout line with its pod name and namespace. This is where you reconstruct a sequence of events, correlate across services, and ask questions nobody anticipated when the resource was provisioned. The cost of that flexibility is that nothing lands in a workspace by accident. A resource emits queryable logs only when something explicitly routes its telemetry there, and that single fact is responsible for the majority of “Azure Monitor isn’t working” tickets.
That routing mechanism is the diagnostic setting, and it is the most important configuration object in the entire platform. A diagnostic setting is a small rule attached to a resource that says “send these log categories and these metric categories to this destination.” Without it, a resource keeps running, keeps doing its job, and keeps producing zero rows in any workspace. Engineers who do not internalize this spend hours debugging queries that were never going to return anything, because the pipe upstream of the query was never connected. The principle is worth stating as a rule you carry everywhere on Azure: a resource produces no queryable logs until a diagnostic setting routes them, so “no data” is almost always a missing pipe, not a broken query. Call it the diagnostic-settings-first rule, and check the pipe before you blame the query.
Application telemetry adds one more piece to the model. Application Insights is the application-performance-monitoring face of Azure Monitor, the component that captures requests, dependencies, exceptions, and custom traces from your own code. In its modern form an Application Insights resource is workspace-based, meaning its data physically lives in a Log Analytics workspace and is queryable with the same KQL. Queried from the Application Insights resource, the tables are named requests, dependencies, and exceptions; queried directly from the workspace, the same records appear as AppRequests, AppDependencies, and AppExceptions, with correspondingly renamed columns (timestamp becomes TimeGenerated, for example). So Application Insights is not a separate silo to learn from scratch; it is a specialized producer that writes into the same query store as everything else, which is exactly why a unified workspace design pays off.
How the Telemetry Pipeline Works Internally
Understanding the pipeline at the level of who produces what, how it is collected, and where it settles is what lets you reason about a gap instead of guessing at it. There are four producer classes, two collection paths, and two stores, and almost every behavior you will see is explained by which combination is in play.
The four producers are the Azure platform itself, the guest operating system inside a virtual machine, your application code, and custom or third-party sources. The platform emits two things for free and one thing on request. The activity log, a subscription-level audit trail of control-plane operations like who started a virtual machine or changed a network security group rule, is always recorded and retained for a fixed window at no charge. Platform metrics, the numeric counters a resource exposes, are also collected automatically into the metrics store. The third platform signal is resource logs, also called diagnostic logs: the rich per-resource records, and the ones that emit nothing until a diagnostic setting turns them on.
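Once the activity log is routed to a workspace (it is not by default), that control-plane audit trail becomes an ordinary table you can query. A minimal sketch, assuming the standard AzureActivity table; the seven-day window and the network-security-group filter are illustrative choices, and the exact operation and status strings are worth checking against your own data:

```kusto
// Who successfully changed network security group resources in the last 7 days?
// Assumes the subscription's activity log is exported to this workspace.
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue has "networkSecurityGroups"
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup, _ResourceId
| order by TimeGenerated desc
```

An empty result here can mean either that nothing changed or that the activity log was never exported, which is the same missing-pipe question in a different place.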
The guest operating system is a separate world from the platform. Azure can tell you a virtual machine’s host-level CPU and disk metrics without ever touching the inside of the machine, because those come from the hypervisor. But anything from inside the guest, memory pressure, a specific process, a Windows event log, a syslog entry, a custom performance counter, requires an agent running in the operating system. The current agent is the Azure Monitor Agent, which replaced the older Log Analytics agent and is configured through data collection rules that specify exactly which counters and event sources to gather and which workspace to send them to. The older agent is on a retirement path, so any new design should standardize on the Azure Monitor Agent and data collection rules from the start rather than carrying legacy collection forward.
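Because guest telemetry depends on an agent that can stop, a useful first query is one that finds machines that have gone quiet. The sketch below assumes the agents report into the standard Heartbeat table; the one-day lookback and the fifteen-minute silence threshold are arbitrary values to adjust:

```kusto
// Machines whose agent has gone quiet: no heartbeat in the last 15 minutes.
// The 1-day lookback bounds the scan; the 15-minute threshold is illustrative.
Heartbeat
| where TimeGenerated > ago(1d)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)
| order by LastHeartbeat asc
```

A machine that never reported at all will not appear in this result, so compare the output against your inventory rather than treating an empty result as proof of full coverage.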
How does telemetry actually get into a Log Analytics workspace?
Three paths feed a workspace. Diagnostic settings route Azure resource logs and metrics. The Azure Monitor Agent, driven by data collection rules, sends guest operating-system events and performance counters. Application Insights SDKs and the Logs Ingestion API (the successor to the older HTTP Data Collector API) push application traces and custom records. Each path is configured independently, so a gap in one does not affect the others.
This independence is why partial telemetry is so common and so confusing. A team will see beautiful application traces in Application Insights, conclude monitoring is healthy, and then discover during an incident that the underlying App Service emitted no platform logs at all, because the diagnostic setting was never created. The application SDK path and the diagnostic-settings path are entirely separate pipes. One working tells you nothing about the other. When you audit an environment, you check each path on its own terms.
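One way to audit the paths on their own terms is to pick a representative table for each and ask when it last received data. This is a sketch, not a complete audit: the three tables are assumptions about what a typical workspace holds, and resources that use resource-specific tables will need those names substituted in.

```kusto
// One representative table per path; swap in the tables your estate actually uses.
//   AzureDiagnostics -> diagnostic settings path
//   Heartbeat        -> Azure Monitor Agent path
//   AppRequests      -> Application Insights path (workspace-based resource)
// isfuzzy=true lets the query run even if one of the tables does not exist yet.
union withsource=SourceTable isfuzzy=true AzureDiagnostics, Heartbeat, AppRequests
| where TimeGenerated > ago(1d)
| summarize Rows = count(), LastRecord = max(TimeGenerated) by SourceTable
```

A path that delivered nothing in the window simply does not appear in the output, so the check is which rows are missing, not only which are present.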
Once telemetry lands, the two stores diverge in how they treat it. The metrics store keeps each metric as a stream of time-stamped values across its defined dimensions, pre-aggregated at fixed grains, and it answers chart-shaped questions almost instantly because the aggregation work was done at ingestion time. A Log Analytics workspace, by contrast, stores each record in a strongly typed table and indexes it for query at read time. When you run a KQL query, the engine scans the relevant table over your time range, applies your filters and aggregations, and returns rows. That read-time model is enormously powerful and also explains a class of performance problems: a query over a wide time range against a large, busy table has to scan a great deal of data, and if you have not filtered early it will be slow or time out. The store rewards queries that narrow the time window and the table before they do anything expensive.
There is also an unavoidable latency between an event happening and that event being queryable. Telemetry is batched at the source, transmitted, processed through the ingestion pipeline, and only then indexed and made available to KQL. This delay is normal, it is bounded under healthy conditions, and it varies by data type, but it is not zero. Engineers who expect a log line to appear the instant the event fires will misread normal ingestion latency as a broken pipeline. When data is genuinely late beyond the usual envelope, that is a distinct condition with its own causes, and tracing it is its own discipline; if you are watching records arrive minutes behind where you expect them, diagnosing Log Analytics ingestion delay is a separate investigation from diagnosing a complete absence of data. Knowing which of the two you are looking at, late versus never, saves the most time.
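The late-versus-never question can be put to the data directly, because KQL exposes the moment a record was ingested through the ingestion_time() function. A minimal sketch, using AzureDiagnostics purely as an example table and a one-hour window as an arbitrary choice:

```kusto
// End-to-end ingestion latency for one table over the last hour.
// AzureDiagnostics is an example; run this against whichever table looks late.
AzureDiagnostics
| where TimeGenerated > ago(1h)
| extend IngestionDelay = ingestion_time() - TimeGenerated
| summarize Records = count(),
            P50Delay = percentile(IngestionDelay, 50),
            P95Delay = percentile(IngestionDelay, 95),
            NewestRecord = max(TimeGenerated)
```

If the query returns rows with a large delay, you are looking at late data; if it returns a count of zero, you are back to checking the pipe.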
Metrics Versus Logs: The Decision That Shapes Everything
The single most consequential choice in any monitoring design is which signal type carries which need, because it determines speed, cost, retention, and the kinds of questions you can later ask. Get it right and your alerts fire fast and cheap while your investigations stay rich and deep. Get it wrong and you either pay to ingest numbers you only ever chart, or you try to alert on something the metrics store cannot express and discover the limitation mid-incident.
When should I use metrics and when should I use logs?
Use metrics when you need a fast, cheap, real-time numeric signal you will threshold or chart, like CPU percentage or request latency over recent time. Use logs when you need the full record, high-cardinality detail, correlation across sources, audit history, or long retention. Metrics answer “how much, right now”; logs answer “what exactly happened, and why.”
The deciding factors are granularity of detail, cardinality, retention horizon, latency of the alert, and cost. Metrics win on alert latency and cost for simple numeric thresholds because they are pre-aggregated and evaluated against a near-real-time stream. If you want to be paged within a couple of minutes when CPU crosses a line, a metric alert is the right and cheap instrument. Logs win on everything that requires the actual content of an event: the exact error message, the specific user, the precise sequence, the join across two different services, and any question you want to ask weeks or months later. The cost shape is different too. Platform metrics are effectively free to collect, while logs are billed by the volume you ingest and retain, which makes the choice of what to send to a workspace a direct cost decision, not just a technical one.
This is the place to put the central artifact of this guide. The table below maps a signal need to the right store, names the workspace-design consequence, and states the cost consequence of each row, so you can make the call deliberately instead of defaulting to “log everything.” Treat it as the decision rule you apply before you create any diagnostic setting or data collection rule. Limits, retention defaults, and any pricing implication shift over time, so verify the current numbers against the official Azure documentation at the moment you design; the shape of the decision is durable even when the figures move.
| Signal need | Right store | What it implies for design | Cost consequence |
|---|---|---|---|
| Real-time numeric threshold (CPU, latency, queue depth) you will page on | Metrics | Metric alert rule on the platform metric; no workspace required | Near-zero; platform metrics collected automatically |
| Chart a numeric trend over recent days for a dashboard | Metrics | Metrics Explorer or a workbook tile sourced from metrics | Near-zero; no ingestion charge for platform metrics |
| Numeric trend you must keep for many months or correlate with events | Logs (or metrics exported to a workspace) | Diagnostic setting routing the metric category to the workspace | Billed by ingested and retained volume |
| Exact error text, stack trace, or per-request detail | Logs | Resource log categories via diagnostic settings, or Application Insights | Billed by volume; the richer the record, the higher the cost |
| Correlation across two or more services in one query | Logs | All sources routed to a shared workspace so KQL can join them | Billed by volume; design for join, not duplication |
| Control-plane audit (who changed what) | Logs (Activity log) | Route the activity log to a workspace for long retention and query | Activity log retained free short-term; workspace retention billed |
| Security sign-in and audit history kept for compliance | Logs | Entra ID diagnostic setting to a workspace with long retention | Billed by volume and by extended retention period |
| High-cardinality custom dimension you will slice many ways | Logs | Custom table or Application Insights custom event | Billed by volume; high cardinality is fine in logs, costly in metrics |
The namable rule the table encodes is simple: metrics are for the question “how much,” logs are for the question “what and why,” and the moment a need crosses from a number you threshold into a record you investigate, it belongs in logs and you must pay to keep it. Designing with that line in mind is the difference between a bill you can predict and one that surprises you.
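For the third row of the table, a numeric trend kept long-term, platform metrics exported through a diagnostic setting land in the AzureMetrics table. The sketch below assumes that export is in place and that workspace retention covers the window; the metric name is a placeholder, since names differ by resource type:

```kusto
// Daily average of an exported platform metric over 90 days.
// "CpuPercentage" is a placeholder metric name; substitute one your resource emits.
AzureMetrics
| where TimeGenerated > ago(90d)
| where MetricName == "CpuPercentage"
| summarize DailyAvg = avg(Average) by Resource, bin(TimeGenerated, 1d)
| order by Resource asc, TimeGenerated asc
```

This is the case where a metric is worth paying to ingest: the question spans months, which is beyond what you would rely on the metrics store alone to answer.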
Designing Log Analytics Workspaces
A workspace is not just a bucket of logs. It is simultaneously the boundary for access control, the boundary for data retention, the boundary for region and data residency, and the boundary for cost attribution. Because all four boundaries collapse into one object, the workspace topology you choose early has consequences that are expensive to unwind later, which is why workspace design deserves a deliberate decision rather than the default of “one per team that asks.”
How many Log Analytics workspaces should I have?
Default to as few workspaces as your access, residency, and retention boundaries allow, often a single central workspace per environment or region. Add a workspace only when a hard requirement forces separation: data must stay in a specific region, a team must be denied access to another team’s logs, or a dataset needs a different retention period. Fewer workspaces make correlation and cost control far easier.
The strongest argument for consolidation is correlation. KQL can join across tables trivially when they live in the same workspace, so if your application traces, your platform logs, and your security sign-ins all land in one place, an investigation that spans them is a single query. Split those across three workspaces and the same investigation becomes a cross-workspace query with more friction, or worse, three separate queries you reconcile by hand at 2am. Consolidation also simplifies cost management, because a commitment-based pricing tier applies per workspace, and concentrating volume into fewer workspaces makes it easier to reach the threshold where a capacity commitment beats pay-as-you-go ingestion.
The legitimate reasons to split are specific and you should be able to name which one applies. Data residency is the hardest boundary: if regulation requires certain data to remain in a particular geography, that data needs a workspace in that region, full stop. Access control is the second: a workspace grants or denies access as a unit at its broadest level, so if one group must never see another group’s data and table-level access controls do not cleanly cover the case, separation is the clean answer. Retention divergence is the third: if one dataset must be kept for years for compliance while another only matters for thirty days, and the per-table retention options do not fit, separate workspaces let you set retention independently without overpaying to keep cheap data for the expensive period. Outside those three forces, every extra workspace you create is correlation you have made harder and cost you have made fuzzier.
Region choice for a workspace deserves a moment of thought. A workspace lives in a region, ingestion and query happen there, and while resources from other regions can send data to it, you should weigh the egress and latency implications of routing telemetry across regions against the residency benefit of keeping it local. For most estates the right pattern is a small number of regional workspaces aligned to where your workloads actually run, with a clear rule for which resources report to which workspace, documented so that the next engineer routing a new resource does not have to guess.
Retention itself is a lever with a default and an extension. Every workspace has an interactive retention period during which data is fully queryable, and beyond that you can configure archived retention that keeps data at lower cost for occasional retrieval. The defaults and the maximum retention windows change over time and differ by table and tier, so confirm the current values against the official documentation when you set policy. The design principle that does not change is to match retention to the actual question horizon: keep operational logs only as long as you would ever query them interactively, archive compliance data for as long as the obligation runs, and never pay interactive-tier prices to hold data you will touch once a year.
Configuring Diagnostic Settings: The Pipe That Makes It Work
Because the diagnostic setting is the pipe that turns a silent resource into a queryable one, configuring it correctly is the highest-leverage thing you can do for observability on Azure. A diagnostic setting attaches to a single resource, selects which log categories and which metric categories to export, and names one or more destinations: a Log Analytics workspace for query and alerting, a storage account for cheap long-term archive, and an event hub for streaming to an external system. Most engineering use cases want the workspace destination, because that is what makes the data queryable with KQL and usable by log-based alerts.
Why is my Azure resource showing no logs at all?
Almost always because no diagnostic setting routes that resource’s logs to a workspace. Azure resources do not log to Log Analytics by default; each resource needs its own diagnostic setting selecting categories and a workspace destination. Check for the setting before you debug the query. A missing or misconfigured pipe, not a broken query, is the usual cause of total silence.
The configuration itself is straightforward once you know it is required. Through the portal you open the resource’s diagnostic settings blade, add a setting, tick the log and metric categories you want, choose the workspace, and save. At scale you do not click through every resource by hand; you express the setting as code. Here is the shape of it with the Azure CLI for a single resource, which is also the unit you would wrap in a loop or a policy for an estate:
```bash
# Route an App Service's logs and metrics to a Log Analytics workspace
az monitor diagnostic-settings create \
  --name send-to-law \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Web/sites/<app-name>" \
  --workspace "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>" \
  --logs '[{"category":"AppServiceHTTPLogs","enabled":true},{"category":"AppServiceConsoleLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'
```
The category names differ by resource type, which is the one detail that trips people up. An App Service exposes categories like HTTP logs and console logs; a key vault exposes audit-event categories; a storage account splits read, write, and delete operations across its sub-services. There is no universal category list, so when you script this you query the available categories per resource type first and select deliberately rather than enabling everything, because every enabled category that lands in a workspace is volume you pay to ingest and retain.
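After categories are enabled, the workspace itself can tell you what each one costs in volume. The following is a sketch that assumes the resource logs land in the shared AzureDiagnostics table; resources that write to resource-specific tables would need the same query run against those tables instead. It relies on the standard _BilledSize column, which is reported in bytes.

```kusto
// Row count and billed volume per log category over the last day.
// Assumes logs arrive in AzureDiagnostics; _BilledSize is in bytes.
AzureDiagnostics
| where TimeGenerated > ago(1d)
| summarize Rows = count(), BilledMB = sum(_BilledSize) / 1000000.0 by ResourceProvider, Category
| order by BilledMB desc
```

A category near the top of this list that nobody queries is a candidate for removal from the diagnostic setting.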
Doing this resource by resource does not scale, and the correct pattern for an estate is to enforce diagnostic settings through Azure Policy so that any new resource of a given type is automatically wired to the right workspace at creation. That converts observability from something each engineer remembers to do into something the platform guarantees, which is the only way to keep coverage complete as the estate grows. The end-to-end mechanics of doing this consistently, including the policy definitions and the per-service category choices, are involved enough to warrant their own treatment; the focused walkthrough on how to configure diagnostic settings across Azure covers the policy-driven approach in depth. For this guide the essential point stands on its own: the diagnostic setting is the pipe, nothing flows without it, and at scale you enforce it rather than remember it.
When the pipe exists and data still does not appear, you have crossed from a configuration gap into a genuine fault, and the failure family there is distinct. Missing categories, a setting pointed at the wrong workspace, a region mismatch, an agent that has stopped reporting, or an ingestion problem each present slightly differently. Working through missing Azure Monitor logs and metrics systematically, rather than re-toggling settings at random, is how you separate the missing-pipe case from the genuine-fault case quickly. The discipline is always the same: confirm the pipe first, then confirm the data is recent rather than merely delayed, then look at the query.
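The first two steps of that discipline can be folded into one query: ask whether any table in the workspace has received anything from the resource, and how recently. This is a sketch under the assumption that the resource's records carry the standard _ResourceId column; the resource ID is a placeholder you fill in, and a scan across every table is expensive, so keep the window short.

```kusto
// Placeholder: paste the full Azure resource ID of the resource under investigation.
let targetResource = "<resource-id>";
union withsource=SourceTable *
| where TimeGenerated > ago(24h)
| where _ResourceId =~ targetResource
| summarize Rows = count(), LastRecord = max(TimeGenerated) by SourceTable
| extend MinutesSinceLast = datetime_diff("minute", now(), LastRecord)
| order by LastRecord desc
```

No rows at all points back at the pipe or at the wrong workspace; rows with a large MinutesSinceLast point at a source that stopped or at delayed ingestion.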
Writing KQL That Answers Real Questions
KQL, the Kusto Query Language, is the language you use to interrogate everything in a Log Analytics workspace, and a small, well-understood subset of it covers the overwhelming majority of real operational work. The language reads top to bottom as a pipeline: you start with a table, then pass the rows through a sequence of operators separated by the pipe character, each one transforming the set of rows the next operator receives. Once that pipeline mental model clicks, KQL stops feeling like a new language and starts feeling like describing a funnel.
How do I write a KQL query to investigate an issue?
Start from the table that holds the signal, filter to your time range and the records you care about with where, reduce to the columns you need with project, aggregate with summarize to count or chart, and order or limit the result. Filter early and narrow the time range first; that is what keeps the query fast and prevents timeouts on large tables.
The handful of operators that do almost everything are worth knowing cold. The where operator filters rows by a condition, and it is the operator you reach for first and most, because filtering early is what keeps a query fast. The project operator selects and renames columns, trimming a wide record down to what you actually want to read. The summarize operator aggregates, collapsing many rows into grouped counts, sums, averages, or percentiles, and it is the heart of any “how many, grouped by” question. The join operator combines two tables on a shared key, which is how you correlate a request in one table with the exception it triggered in another. And operators like order by, top, and take shape and bound the result. A representative investigative query stitches these together:
```kusto
// Failed requests in the last hour, grouped by request name, most failures first
requests
| where timestamp > ago(1h)
| where success == false
| summarize failures = count(), avgDuration = avg(duration) by name
| order by failures desc
| take 20
```
Read that as a funnel: start with the requests table, keep only the last hour, keep only the failures, collapse to a count and average duration per request name, sort by failure count, and show the top twenty. Every clause narrows or shapes what flows to the next. The two habits that matter most for both correctness and performance are filtering on time first and filtering hard before you aggregate or join. A query that scans a busy table over a wide window before narrowing has to read a large amount of data, and that is the usual cause of a query that runs slowly or times out. The fix is almost never more compute; it is moving the time filter and the selective where clauses to the front so the engine reads less.
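The join operator described earlier follows the same filter-first habit: narrow both sides before combining them. A minimal sketch, assuming the query runs against an Application Insights resource, where requests and exceptions share the operation_Id column; the one-hour window is illustrative:

```kusto
// Which exception types accompany failed requests in the last hour?
// Both sides are filtered on time before the join so the engine reads less.
requests
| where timestamp > ago(1h)
| where success == false
| join kind=inner (
    exceptions
    | where timestamp > ago(1h)
    | project operation_Id, exceptionType = type, outerMessage
  ) on operation_Id
| summarize occurrences = count() by name, exceptionType
| order by occurrences desc
```

Projecting the right-hand side down to the few columns the join needs is a design choice: it keeps the joined rows narrow and the result readable.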
The same KQL works against platform logs, guest logs, security logs, and application telemetry, because they all live as tables in the workspace. A query against the requests and dependencies tables from Application Insights, a query against AzureDiagnostics from a resource log, and a query against SigninLogs from Entra ID are the same language with different table names. That uniformity is the practical payoff of consolidating into a shared workspace: one language, one place, and the ability to join across what used to be separate silos. If you want to drill these patterns until they are muscle memory, you can run the hands-on Azure labs and command library on VaultBook, which lets you route diagnostics into a workspace, write progressively harder KQL against real tables, and build an alert end to end, so the operators above become reflexes rather than reference lookups.
Data Collection Rules and Shaping Telemetry Before It Lands
The data collection rule is the configuration object that governs the Azure Monitor Agent path, and it has grown into one of the most useful cost and quality levers in the whole platform. A data collection rule specifies what to gather, from which machines, and where to send it, and it does this declaratively so that associating the rule with a fleet of machines applies the same collection policy uniformly. That declarative model is what lets you treat guest-telemetry collection as code rather than as per-machine clicking, and it is the right foundation for any virtual-machine estate of more than a handful of hosts.
What is a data collection rule and why does it matter?
A data collection rule defines which guest counters, event logs, or syslog facilities the Azure Monitor Agent collects, from which machines, and which workspace receives them. It decouples collection policy from individual machines, so one rule applied to a fleet enforces consistent telemetry. It is also where ingestion-time transformations can drop or reshape data before it is billed.
That last sentence points at a lever many teams overlook. A data collection rule can carry an ingestion-time transformation, a small KQL fragment that runs on each incoming record before it is stored, letting you drop noisy columns, filter out records you will never query, or project a high-volume stream down to what matters. Because billing is driven by what lands in the workspace, transforming or filtering at ingestion is a direct cost reduction that also keeps your tables cleaner. A chatty source that emits a verbose field you never query can have that field stripped before it is ever stored, turning a costly stream into a lean one without losing the records you actually use. The transformation is expressed in the same KQL you already know, so the skill transfers directly from query writing to ingestion shaping.
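To make the idea concrete, a transformation is written against a virtual input table named source rather than a real table name. The sketch below assumes a Syslog stream; the severity value and the dropped column are illustrative picks, and because transformations support only a subset of KQL, any operator you add is worth checking against the current documentation.

```kusto
// Ingestion-time transformation sketch for a Syslog stream.
// "source" is the incoming record stream inside a data collection rule.
source
| where SeverityLevel != "debug"   // drop records nobody will ever query
| project-away ProcessID           // strip a column that is never used (illustrative pick)
```

Inside a data collection rule this query is supplied as a single-line string in the transformKql property of a data flow, so it is worth developing it interactively against the real table first and only then pasting it into the rule.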
There is a parallel cost lever on the table side. A Log Analytics workspace can assign different table plans, where high-value tables sit on the full analytics plan that supports the complete query surface and alerting, while high-volume, low-value tables can sit on a cheaper plan intended for occasional search and investigation rather than continuous querying and alerting. The decision per table follows the same logic as the metrics-versus-logs decision one level up: tables you alert on and query constantly earn the full plan, while tables you keep mostly for forensic lookups can sit on the cheaper tier. The available plans and their exact capabilities and prices change over time, so confirm the current options against the official documentation, but the principle is durable: not every table deserves the same treatment, and matching the plan to how you actually use the table is real money.
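Deciding which tables deserve which plan starts with knowing which tables are large. One way to see that, sketched here under the assumption that the standard Usage table is populated in your workspace, is to sum billable volume per data type; the thirty-day window is an arbitrary choice:

```kusto
// Billable ingestion per table over the last 30 days, largest first.
// Usage.Quantity is reported in MB; dividing by 1000 gives an approximate GB figure.
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize BillableGB = sum(Quantity) / 1000.0 by DataType
| order by BillableGB desc
```

The tables at the top of this list that you rarely query and never alert on are the first candidates for a cheaper plan or an ingestion-time transformation.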
KQL Beyond the Basics
The core operators carry most operational work, but a second tier of KQL turns the language from a filter-and-count tool into something you can genuinely investigate and visualize with. Learning these well is what separates an engineer who can confirm a known condition from one who can discover an unknown one, and they are worth the short investment because they recur constantly in real incidents.
Which KQL operators should I learn after the basics?
After where, project, summarize, and join, learn extend to compute new columns, bin to bucket time for trends, make-series for evenly spaced time series, parse to extract fields from text, render to chart inline, and let to name reusable values. These six cover trend analysis, text extraction, and readable, reusable queries.
The extend operator adds a calculated column without discarding the existing ones, which is how you derive a value mid-pipeline, for example computing a duration in seconds from a duration in milliseconds, or flagging a row with a computed category before you aggregate. It pairs naturally with summarize, because you often compute a value with extend and then group by it. The bin function buckets a continuous value, almost always time, into fixed intervals, and it is the key to any trend query: grouping a count by bin(timestamp, 5m) gives you a count per five-minute bucket, which is exactly the shape a time chart wants. The make-series operator goes further, producing an evenly spaced series even across intervals with no data, filling gaps so a chart does not lie by omission, which matters when a quiet period would otherwise look like missing data rather than genuine silence.
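The three operators are easiest to compare side by side. The sketch below contains two separate queries, both assuming the Application Insights requests table, where duration is recorded in milliseconds; the six-hour window and five-minute step are illustrative:

```kusto
// Query 1: extend + bin. Derive seconds from milliseconds, then bucket by 5 minutes.
// Buckets with no requests are simply absent from the result.
requests
| where timestamp > ago(6h)
| extend durationSeconds = duration / 1000.0
| summarize requestCount = count(), p95Seconds = percentile(durationSeconds, 95) by bin(timestamp, 5m)
| order by timestamp asc

// Query 2: make-series. Same count, but every 5-minute slot is present,
// with quiet slots filled with 0 instead of being omitted.
requests
| where timestamp > ago(6h)
| make-series requestCount = count() default = 0 on timestamp from ago(6h) to now() step 5m
```

The first form returns one row per non-empty bucket; the second returns arrays covering the whole range, which is the shape charting and series functions expect.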
Text handling is where KQL earns its keep on messy logs. The parse operator extracts structured fields from a free-text column using a pattern, turning an unstructured log line into queryable columns without preprocessing the data at the source. When a service emits a useful value buried inside a message string, parse pulls it out at query time so you can filter and group on it. For charting, the render operator draws the result inline as a time chart, bar chart, or pie chart directly in the query view, which makes ad-hoc visual investigation a single query rather than a trip to a separate tool. And the let statement names a value, a scalar, a list, or even a whole subquery, so you can write readable queries that reference a threshold or a filtered set by name and reuse it, which is both clearer to read and easier to maintain. A query that begins by binding the time range and a couple of thresholds with let reads far better than one that repeats literals throughout.
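Those three pieces combine naturally in one query. The example below is a sketch built on a hypothetical log format: it assumes trace messages that contain text like "orderId=A123 elapsedMs=2500", which is invented for illustration, and it assumes the Application Insights traces table. The threshold and lookback values are likewise placeholders.

```kusto
// let: name the tunable values once instead of repeating literals.
let lookback = 1h;
let slowThresholdMs = 2000;
traces
| where timestamp > ago(lookback)
// parse: pull two fields out of the free-text message (hypothetical format).
| parse message with * "orderId=" orderId:string " elapsedMs=" elapsedMs:long *
| where elapsedMs > slowThresholdMs
| summarize slowOrders = count() by bin(timestamp, 5m)
// render: draw the result inline as a time chart.
| render timechart
```

Rows whose message does not match the pattern get empty extracted columns and fall out at the threshold filter, so the query tolerates mixed message formats.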
Two cross-cutting capabilities round out the practical toolkit. Cross-resource queries let a single query reach into another workspace or an Application Insights resource by naming it explicitly, so even when you have split into multiple workspaces for a legitimate reason, you can still correlate across them when an investigation demands it, at the cost of a more involved query. And the ability to save a query as a function, giving a complex piece of logic a name you can call like any built-in operator, lets a team capture institutional knowledge: the gnarly query that decodes a specific failure pattern becomes a named function any engineer can invoke, so the hard-won investigation does not have to be reinvented at the next incident. These are the features that turn KQL from a personal skill into a team asset.
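As a rough illustration of a cross-resource query, the sketch below reads the Heartbeat table from two workspaces in one pass using the workspace() expression. The workspace names contoso-prod-logs and contoso-shared-logs and the sourceWorkspace label are hypothetical.

```kusto
// Hypothetical workspace names; replace with your own.
union
    (workspace("contoso-prod-logs").Heartbeat | extend sourceWorkspace = "prod"),
    (workspace("contoso-shared-logs").Heartbeat | extend sourceWorkspace = "shared")
| where TimeGenerated > ago(1h)
| summarize lastSeen = max(TimeGenerated) by Computer, sourceWorkspace
```

The function idea can be sketched with a query-scoped function defined through let. Saving the same body as a workspace function would make it callable by name from any query; the name failedByEndpoint and its lookback parameter are illustrative.

```kusto
// Query-scoped function: the body is the logic a team would save and share.
let failedByEndpoint = (lookback: timespan) {
    requests
    | where timestamp > ago(lookback)
    | where success == false
    | summarize failures = count() by name
};
failedByEndpoint(3h)
| order by failures desc
```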
The Curated Insights: VM, Container, and Application Experiences
On top of the raw collection and query layers, Azure Monitor ships curated experiences that pre-package the right collection, the right queries, and the right visualizations for common workload types, so you are not assembling observability for a virtual machine or a Kubernetes cluster from first principles every time. These experiences are worth understanding because they save substantial setup work, but also because knowing what they collect underneath tells you what you are paying to ingest.
What are VM insights and Container insights?
VM insights and Container insights are curated monitoring experiences for virtual machines and Azure Kubernetes Service. They configure the right agent and data collection, then present prebuilt performance views, dependency maps, and health signals. They are an opinionated starting point that routes telemetry into a workspace, so the same data is also queryable with KQL and usable for custom alerts.
VM insights layers a performance and dependency experience over a virtual machine or a fleet, collecting the key guest performance counters and a map of process-level connections so you can see not just that a machine is busy but what it is talking to. It is the fast path to host observability, and because it routes its data into a workspace, everything it collects is also available to your own KQL and your own alerts, so adopting it does not lock you out of custom analysis. Container insights does the equivalent for Azure Kubernetes Service and other container platforms, collecting node and pod performance, container stdout and stderr, and inventory, and presenting cluster, node, and workload views. For anyone running containers on Azure, it is the difference between flying blind and seeing pod restarts, node pressure, and per-container resource consumption in one place, and again the underlying data lands in a workspace for custom querying.
Application Insights is the application-layer curated experience, and it deserves its own note on cost control because application telemetry can be extremely high volume. Sampling is the lever: Application Insights can retain a representative fraction of telemetry rather than every single trace, preserving statistical accuracy for rates and trends while dramatically cutting ingestion volume on high-traffic applications. Adaptive sampling adjusts the rate automatically to hold volume near a target, and understanding it is essential, because a busy application instrumented without sampling can generate a startling ingestion bill. The right posture is to sample aggressively enough to control cost while keeping enough fidelity for the questions you actually ask, and to be deliberate about it rather than discovering the volume after the fact. As with the rest of the platform, the experience is a convenience layer over data that ultimately lives in a workspace, queryable with the same KQL, which keeps the whole estate coherent.
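One way to see what sampling is doing is to look at the itemCount column, which records how many original telemetry items each stored row represents. The sketch below assumes the Application Insights requests table; the one-day window and the output column names are illustrative.

```kusto
// How much telemetry is being retained, hour by hour.
requests
| where timestamp > ago(1d)
| summarize
    retainedPercentage = 100.0 / avg(itemCount),  // 100 means nothing was sampled out
    storedRows = count(),                          // rows actually ingested
    estimatedRequests = sum(itemCount)             // sampling-corrected request total
    by bin(timestamp, 1h)
| order by timestamp asc
```

The practical consequence is that, on sampled data, rates and totals should be computed from sum(itemCount) rather than count(), or the numbers will understate real traffic.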
Visualizing Telemetry: Workbooks, Dashboards, and Metrics Explorer
Query results and metric streams become operational tools when they are arranged into views people actually watch, and Azure Monitor offers three visualization surfaces that serve different needs. Knowing which surface fits which job keeps you from forcing a tool to do something it is poor at.
Metrics Explorer is the fast path for charting the metrics store. Because platform metrics are pre-aggregated, Metrics Explorer can plot a counter over time, split it by a dimension, and apply an aggregation almost instantly, with no query writing. It is the right tool for the quick “show me CPU across these machines for the last day” question and for assembling simple metric tiles. Its limit is exactly the limit of the metrics store: it charts numbers across the available dimensions, and it cannot reach into the rich content of logs.
Workbooks are the rich, interactive reporting surface, and they are where serious operational dashboards live. A workbook combines KQL-driven log queries, metric charts, parameters that let a viewer filter the whole report, and narrative text into a single interactive document. Because a workbook can run arbitrary KQL, it can show anything a query can produce, which makes it the right tool for an incident-investigation dashboard, a capacity report, or a cost-by-source view that updates live. Workbooks are also shareable and templatable, so a team can standardize on a set of investigation views rather than each engineer rebuilding the same query collection. When you want a living, parameterized view that mixes logs and metrics, the workbook is the answer.
Azure dashboards serve the at-a-glance shared-screen need, pinning tiles from Metrics Explorer, workbooks, and other sources into a single board suitable for a wall display or a shared overview. They are less interactive than workbooks but better as a fixed operational summary. The practical division is straightforward: Metrics Explorer for quick metric charts, workbooks for interactive log-and-metric investigation and reporting, and dashboards for the fixed shared overview. Choosing the right surface for the job, rather than forcing everything into one, is what keeps your visualization layer clean and fast.
Securing and Governing the Telemetry Path
Telemetry is sensitive data. Logs can contain user identifiers, request contents, security verdicts, and infrastructure detail that an attacker would value, so the access and network controls around your workspaces matter as much as the controls around any other data store. Treating observability data as a security boundary, not just an operational convenience, is the mark of a mature design.
Access control on a workspace governs who can read which data, and it operates at the workspace level and, where configured, at a finer table level. The principle is least privilege: an engineer who needs to investigate application failures does not necessarily need to read security sign-in logs, and a workspace design that mixes both behind a single coarse access grant forces an uncomfortable choice between over-sharing and splitting the workspace. Where table-level access controls cleanly cover the case, you can consolidate and still segment access; where they do not, that is one of the few legitimate reasons to separate workspaces. Designing access alongside the workspace topology, rather than bolting it on later, avoids the painful migration of moving data between workspaces to fix an access boundary you did not plan for.
Network isolation closes the path between your resources and the ingestion endpoints. An Azure Monitor Private Link Scope lets telemetry flow to your workspaces over private network connectivity rather than the public internet, which is often a hard requirement in regulated environments and a sensible default in any security-conscious estate. Combined with data residency, the region in which a workspace lives and therefore where its data is stored and processed, these controls let you satisfy compliance obligations about where telemetry lives and how it travels. The residency requirement is the hardest of the workspace boundaries precisely because it is non-negotiable when regulation imposes it, and it must be designed in from the start rather than discovered during an audit. Governing telemetry with the same access, network, and residency rigor you apply to production data is not optional caution; it is the recognition that your logs are production data.
Alerts and Action Groups: Turning Signal Into Response
Telemetry that nobody acts on is just an expensive archive. The alerting layer is what turns a threshold crossing or a log pattern into a page, an email, a webhook, or an automated remediation, and it is built from two cooperating pieces: the alert rule that detects the condition, and the action group that defines who or what gets notified when it fires.
How do alert rules and action groups work together?
An alert rule defines the condition to watch, a metric crossing a threshold or a KQL query returning rows, and how often to evaluate it. An action group defines the response: the emails, SMS, push notifications, webhooks, or automation to trigger. The rule detects; the group responds. One action group can serve many rules, so you define escalation paths once and reuse them.
There are two principal kinds of alert rule, and choosing between them is another application of the metrics-versus-logs decision. A metric alert rule watches a platform or custom metric and fires when it crosses a threshold you set, evaluated continuously against the near-real-time metric stream. These are the right tool for fast, simple numeric conditions like CPU above a threshold or available memory below one, because they are cheap to run and they fire quickly. A log alert rule, also called a scheduled query rule, runs a KQL query on a schedule and fires based on what the query returns, for example “more than five HTTP 500 errors in the last fifteen minutes” or “any sign-in from an impossible-travel pattern.” Log alerts can express conditions that metrics simply cannot, because they have the full record to reason over, but they evaluate on a schedule rather than continuously and they carry the cost of running a query repeatedly. The rule of thumb follows directly from the earlier decision: if a metric can express the condition, use a metric alert for speed and cost; reach for a log alert when the condition needs the content of the records.
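The server-error example above might be sketched as the query behind a log alert rule like this. It assumes the Application Insights requests table, where resultCode is stored as a string; the name serverErrors is illustrative, and in practice the threshold and evaluation window are often set on the rule itself rather than inside the query.

```kusto
// Returns a row only when the condition is met, so "query returned rows" means "fire".
requests
| where timestamp > ago(15m)
| where resultCode == "500"
| summarize serverErrors = count()
| where serverErrors > 5
```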
The action group is where you encode your escalation strategy once and reuse it everywhere. A single action group might notify the on-call rotation by SMS and push, email a shared mailbox, post to a webhook that opens an incident ticket, and trigger an automation runbook or function to attempt a self-heal, all in one definition. Because many rules can point at the same action group, you change your escalation path in one place rather than editing dozens of rules. The discipline that keeps an alerting estate sane is to tune for signal: every alert that fires should require a human decision or trigger an automation, and any alert that fires repeatedly without action is noise that erodes trust in the whole system. Treat a noisy alert as a bug to fix, by tightening the threshold, lengthening the evaluation window, or moving the logic into an automation, rather than as background you learn to ignore.
The Cost Model: What Drives the Bill and How to Control It
The single most surprising line on an Azure bill, for teams new to observability, is the Log Analytics ingestion charge, and the surprise is almost always the result of enabling rich logging everywhere without understanding that ingested volume is the cost lever. Platform metrics are effectively free; logs are billed by the gigabyte you ingest and by how long you retain them. That asymmetry is the whole game.
What drives Azure Monitor and Log Analytics cost?
The dominant cost is the volume of log data ingested into Log Analytics workspaces, billed per gigabyte, plus retention beyond the included period. Platform metrics are collected at no ingestion charge. Verbose diagnostic categories, debug-level application logging, and high-traffic resources sending every event are the usual cost drivers. You control the bill by controlling what you ingest and how long you keep it.
The cost surprise has a predictable anatomy. A team enables every available diagnostic category on every resource, sets application logging to a verbose level, and routes high-traffic services that emit enormous request volumes into the workspace, all of it kept at the default interactive retention. None of those choices is wrong in isolation, but together they generate a volume nobody estimated. The fix is not to stop logging; it is to log deliberately. You select the categories that earn their place, you set application log levels to capture what an investigation needs without drowning in debug chatter outside of active debugging, and you match retention to the question horizon rather than keeping everything at interactive prices for the maximum period.
There are several levers, and they stack. The first is selectivity at the source: enable only the diagnostic categories you will actually query or alert on, because a category you enabled and never look at is pure cost. The second is the right pricing model: workspaces can run pay-as-you-go per gigabyte or on a commitment tier that buys a daily capacity at a discount, and once your steady-state volume is predictable, a commitment tier typically beats pay-as-you-go, which is one more reason to consolidate volume into fewer workspaces so a commitment is easier to justify. The third is retention tuning: shorten interactive retention to your real operational horizon and move compliance data to the cheaper archive tier rather than holding it interactively. The fourth is table-level configuration: a workspace lets you set retention per table and, for tables that support it, choose a lower-cost table plan, so noisy, low-value tables can be handled differently from the high-value ones. The exact pricing, the commitment thresholds, the included retention, and the archive economics all move over time, so model your specific volume against the current official pricing when you plan; the levers themselves are stable even as the numbers shift.
Cost control is therefore not a one-time setup but an ongoing FinOps practice that lives next to your observability practice. The teams that keep monitoring costs predictable are the ones that review ingestion by source regularly, catch a newly verbose service before it runs for a month, and treat a sudden ingestion spike as an incident to investigate rather than a bill to absorb. Wiring that review into your delivery process, so that cost visibility travels with the rest of your operational signal, is part of mature monitoring and observability for DevOps, where ingestion trends sit on the same dashboards as reliability and deployment metrics and a cost regression is caught the same way a latency regression is.
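A regular ingestion review can start from the workspace's own Usage table, which records billable volume per table. This is a minimal sketch: the thirty-day window and the top-ten cut are arbitrary, and Quantity is reported in megabytes, so the division converts it to gigabytes.

```kusto
// Billable ingestion by table over the last 30 days, largest first.
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize billableGB = sum(Quantity) / 1000.0 by DataType
| order by billableGB desc
| take 10
```

Adding bin(TimeGenerated, 1d) to the by clause turns the same query into a daily trend, which is the shape in which a newly verbose service stands out.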
The Workspace Schema: Tables You Will Actually Use
A Log Analytics workspace is a collection of typed tables, and knowing which table holds which signal is the difference between a query that finds the answer immediately and ten minutes of guessing. The tables fall into recognizable families by producer, and a handful of them account for most day-to-day work, so learning their names and contents pays off on every investigation.
Where do I find a specific signal in the workspace?
Each producer writes to its own tables. Guest performance counters land in Perf, Windows events in Event, Linux logs in Syslog, agent health in Heartbeat. Application telemetry uses requests, dependencies, exceptions, and traces. Security signals use SigninLogs and AuditLogs. Resource logs land either in service-specific tables or in the shared AzureDiagnostics table.
The guest-telemetry family comes from the Azure Monitor Agent. The Perf table holds performance counter samples, so a question about memory, processor time, or disk queue length inside a machine starts there. The Event table holds Windows event log entries and Syslog holds Linux syslog records, which is where you look for operating-system-level errors that never reach the application layer. The Heartbeat table is quietly one of the most useful, because every reporting agent writes a periodic heartbeat, so a gap in Heartbeat for a machine is the clean signal that its agent has stopped reporting, distinct from the machine itself being down. When guest telemetry goes missing, Heartbeat is the first place to confirm whether the collector is even alive.
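The Heartbeat check can be sketched as a single query that lists machines whose agent has gone quiet. The fifteen-minute silence threshold and the one-day lookback are assumptions to tune for your environment.

```kusto
// Machines that reported in the last day but have been silent for 15 minutes or more.
Heartbeat
| where TimeGenerated > ago(1d)
| summarize lastHeartbeat = max(TimeGenerated) by Computer
| where lastHeartbeat < ago(15m)
| order by lastHeartbeat asc
```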
The application family comes from Application Insights. The requests table holds incoming requests with their duration, result code, and success flag; dependencies holds outbound calls the application made, such as database queries or HTTP calls to other services, with their timing and success; exceptions holds thrown errors with their stack traces; and traces holds custom log messages your code emitted. These are the names you see when you query from the Application Insights resource; when the same data is queried directly in a Log Analytics workspace, it appears as AppRequests, AppDependencies, AppExceptions, and AppTraces, with correspondingly renamed columns such as TimeGenerated in place of timestamp. The power of having these together is correlation: a single query can find a failed request, join to the dependency that failed underneath it, and pull the exception that resulted, reconstructing the causal chain in one pass. That chain is exactly what an investigation needs and exactly what scattered logging cannot provide.
The platform family is where the most confusion lives, because resource logs land in one of two shapes. Newer resource types write to dedicated, resource-specific tables with clean schemas, while many resources historically wrote to the shared AzureDiagnostics table, a wide catch-all where records from many resource types coexist and the columns are sparsely populated depending on which resource emitted the row. When you query platform logs and cannot find a clean table, AzureDiagnostics filtered by resource type is usually where the data is, and the trend over time is toward dedicated tables that are easier to query. Platform metrics routed to a workspace via a diagnostic setting land in AzureMetrics. The security family adds SigninLogs for authentication events and AuditLogs for directory changes from Entra ID, and the container family adds tables for container logs and inventory from Container insights. Knowing this map turns “where is my data” from a hunt into a lookup.
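When you are not sure what has landed in the shared table, a quick inventory is a reasonable first step. The sketch below groups recent AzureDiagnostics rows by the resource provider, resource type, and log category that produced them; the one-hour window is an arbitrary choice.

```kusto
// What is writing to the shared AzureDiagnostics table right now?
AzureDiagnostics
| where TimeGenerated > ago(1h)
| summarize records = count() by ResourceProvider, ResourceType, Category
| order by records desc
```

Once the provider and category you care about appear in that list, adding a where clause on those columns narrows the table to the one resource type you are investigating.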
A Worked Investigation: From Symptom to Root Cause
The fastest way to internalize how the pieces fit is to walk a single realistic investigation from the first symptom to the confirmed cause, using metrics to detect, KQL to diagnose, and a join to prove the chain. The scenario is ordinary: users report that a web application is intermittently slow and occasionally errors, and the on-call engineer has to find out why with the telemetry already flowing into a shared workspace.
The investigation starts in metrics, because the question “is something actually wrong, and when” is a how-much-right-now question. A glance at the application’s server response time metric and its failed request rate over the last few hours confirms the symptom is real and localizes it to a window: response time climbed and failures spiked starting around a particular time. Metrics answered the detection question in seconds and for free, which is exactly what they are for. But metrics cannot say which requests failed or why, so the investigation now crosses the line from how-much into what-and-why, and moves to logs.
In the workspace, the first KQL query narrows to the bad window and asks which requests failed and how badly:
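// Start of the lookback window; tighten it once metrics have localized the incident.
let badWindow = ago(3h);
requests
| where timestamp > badWindow      // time filter first, so less data is scanned
| where success == false           // failed requests only
// Failure count and 95th-percentile duration (milliseconds) per request name
| summarize failures = count(), p95 = percentile(duration, 95) by name
| order by failures desc
| take 10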
That result names the specific request paths that account for the failures and shows their slow tail, turning a vague “the app is slow” into “these three endpoints are failing and their ninety-fifth percentile latency is high.” The investigation now has a target. The next question is what those failing requests were doing underneath, which is a job for a join from requests into dependencies, correlating on the operation that ties an incoming request to the outbound calls it made:
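let badWindow = ago(3h);
requests
| where timestamp > badWindow and success == false
| join kind=inner (
    dependencies
    | where timestamp > badWindow
    // Name the dependency duration explicitly instead of relying on the
    // auto-suffixed duration1 column that the join would otherwise create
    | project operation_Id, type, target, depDuration = duration
) on operation_Id
| summarize calls = count(), avgDep = avg(depDuration) by type, target
| order by avgDep desc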
If the result shows that the failing requests consistently waited on a slow database dependency, the chain is nearly proven: the requests are slow and failing because a downstream dependency is slow. A final query into the exceptions table for the same window and operation confirms what error the application actually raised, which often names the cause precisely, a timeout against the database, a connection pool exhausted, a downstream service returning errors. At that point the engineer has moved from a user complaint to a named root cause with the evidence to support it, and can act: scale the database, fix the query, add a retry with backoff, or whatever the specific cause demands.
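That final confirmation step might look like the sketch below, which restricts the exceptions table to the operations that produced failed requests and groups by exception type and message. It assumes the same Application Insights schema and three-hour window as the earlier queries; failedOps and occurrences are illustrative names.

```kusto
let badWindow = ago(3h);
// Operation IDs of the failed requests, used to scope the exception search.
let failedOps =
    requests
    | where timestamp > badWindow and success == false
    | distinct operation_Id;
exceptions
| where timestamp > badWindow
| where operation_Id in (failedOps)
| summarize occurrences = count() by type, outerMessage
| order by occurrences desc
| take 10
```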
The shape of that investigation is the shape of most of them, and it is worth abstracting. Detection is a metrics question answered fast and cheap. Localization narrows the time window. Diagnosis is a logs question that filters to the bad window, identifies the affected entities, and then joins across tables to follow the causal chain to its source. The reason it works in one place with one language is that all the signals share a workspace, which is the entire argument for consolidating telemetry rather than scattering it. An estate where requests, dependencies, exceptions, platform logs, and security events all live together is an estate where this investigation is a few queries; an estate where they are split across silos is one where the same investigation is a manual reconciliation across tools at the worst possible time. If you want to practice this exact flow against real tables until it is automatic, you can build the scenario and work it end to end with the hands-on labs referenced earlier, which is the difference between having read the method and being able to run it under pressure.
Failure Modes and How to Avoid Them
Most Azure Monitor problems fall into a small number of recognizable shapes, and knowing the shapes lets you diagnose in minutes instead of hours. The shapes recur because they all trace back to the pipeline model: a producer that is silent, a collection path that is broken, a store queried wrong, or a cost that ran ahead of attention.
The first and most common shape is total silence: a resource shows no logs whatsoever. This is the diagnostic-settings-first rule in its purest form. Before anything else, confirm a diagnostic setting exists, is enabled, selects the categories you expect, and points at the workspace you are querying. The overwhelming majority of “no data” cases end here, at a pipe that was never connected or was connected to a different workspace than the one the query runs against. Only after the pipe is confirmed do you move on.
The second shape is partial silence: some signal arrives and some does not. This is the independence of the collection paths showing itself. Application traces appear but platform logs do not, or guest performance counters appear but a specific event log does not. The diagnosis is to identify which path owns the missing signal, the diagnostic-settings path, the Azure Monitor Agent path, or the application SDK path, and check that one path on its own terms, because a healthy path tells you nothing about a silent one.
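A quick way to see which signals are arriving is to ask the workspace which tables received rows recently. This sketch scans every table, which is expensive, so the window is deliberately kept to one hour; sourceTable is an illustrative column name.

```kusto
// Which tables received data in the last hour, and when was the latest row?
union withsource = sourceTable *
| where TimeGenerated > ago(1h)
| summarize records = count(), lastRecord = max(TimeGenerated) by sourceTable
| order by lastRecord desc
```

A table you expected that is missing from the result tells you which producer, and therefore which collection path, to check.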
The third shape is the agent that has stopped reporting. Guest operating-system telemetry depends on the Azure Monitor Agent running and the data collection rule being associated with the machine. An agent that is stopped, unhealthy, or has lost its association produces a gap that looks like missing data but is really a stalled collector. The check is to verify the agent’s health and its data collection rule association, not to fiddle with diagnostic settings, which govern a different path entirely.
The fourth shape is the slow or timing-out query. This is almost never a platform fault and almost always a query that scans too much before it narrows. The fix is to push the time filter and the most selective conditions to the front of the pipeline so the engine reads less data, to avoid expensive operations over wide windows, and to summarize rather than return enormous raw result sets. A query that times out over thirty days will usually run instantly over the one hour you actually needed.
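The shape of a query that narrows before it aggregates can be sketched as follows. The counter names are the common Windows ones and are an assumption; the names collected on your machines depend on the data collection rule.

```kusto
// Time filter first, then the most selective predicates, then a summary.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```

The result is a few rows per machine rather than every raw sample, which is both faster to run and easier to read.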
The fifth shape is the data that is late rather than absent. Normal ingestion latency means a freshly emitted event is not instantly queryable, and an engineer who expects zero delay will misread that normal delay as a fault. The discipline is to distinguish “this is taking the normal few minutes” from “this is hours behind and something is wrong,” because the first needs patience and the second needs investigation. Confusing the two wastes time in both directions.
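You can measure the delay rather than guess at it. The sketch below compares each record's ingestion time, returned by the ingestion_time() function, with the time the record was generated; Heartbeat is used only because it is a steady signal, and the eight-hour window is arbitrary.

```kusto
// Median and 95th-percentile ingestion delay per machine.
Heartbeat
| where TimeGenerated > ago(8h)
| extend ingestionLatency = ingestion_time() - TimeGenerated
| summarize p50 = percentile(ingestionLatency, 50), p95 = percentile(ingestionLatency, 95) by Computer
| order by p95 desc
```

A p95 of a few minutes is the normal case; a value measured in hours is the one that deserves investigation.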
The sixth shape is the cost spike: ingestion that ran ahead of attention. A newly verbose service, a debug log level left on in production, or a traffic surge into a high-volume table can multiply ingestion before anyone notices, and at default retention the bill compounds. The avoidance is the FinOps habit above: review ingestion by source, alert on ingestion volume itself, and treat a spike as something to diagnose immediately. Watching the watcher is part of running the watcher well.
Service Health and Resource Health: Knowing When the Problem Is Azure’s
Not every incident is your fault, and a mature observability practice can tell quickly whether the cause sits in your configuration or in the platform underneath you. Azure Monitor includes health signals for exactly this question, and wiring them into your alerting alongside your own telemetry saves the wasted hours of debugging an application when the real cause is a platform issue you could have known about immediately.
How do I tell whether an outage is my fault or Azure’s?
Check Azure Resource Health for the affected resource and Azure Service Health for the region and service. Resource Health reports the platform’s view of a specific resource’s availability, while Service Health reports regional service incidents and planned maintenance. If either shows a platform problem covering your window, the cause is likely Azure’s, not your configuration.
Resource Health gives the platform’s assessment of an individual resource’s availability, distinguishing a resource that is genuinely unavailable due to a platform fault from one that is unhealthy because of how you configured or loaded it. When a virtual machine becomes unreachable, Resource Health can tell you whether the platform sees it as affected by a host issue, which is a very different starting point from assuming your own networking or application is at fault. Reading Resource Health early reframes the entire investigation, because it answers the layer question before you spend effort at the wrong layer.
Service Health operates at the regional and service level, reporting active incidents, planned maintenance that might affect your resources, and health advisories about services you use. A regional incident affecting a service your workload depends on explains a class of symptoms that would otherwise look like your own failures, and planned maintenance notifications let you anticipate disruption rather than be surprised by it. The key practice is to create alerts on these health signals just as you alert on your own metrics and logs, so that a relevant platform incident pages you the moment it is declared rather than after you have spent an hour ruling out your own code. A health alert that fires “a service you depend on has a regional incident” at the start of an investigation is worth a great deal, because it redirects effort instantly to waiting and communicating rather than fruitless debugging.
The activity log complements these health signals from the configuration angle. Because it records every control-plane operation, the activity log answers the question “did something change right before this started,” which is the cause of an enormous share of incidents. A deployment, a configuration edit, a scaling change, or a policy update logged just before symptoms began is a powerful lead, and correlating the activity log against the timing your metrics revealed often points straight at the change that broke things. Routing the activity log into your workspace so it sits alongside your other telemetry lets a single KQL query line up control-plane changes with the data-plane behavior that followed, which is how you connect “someone changed the network security group” to “connections started failing” without manual cross-referencing. Together, Resource Health, Service Health, and the activity log let you answer the three orienting questions of any incident quickly: is the platform healthy, is the specific resource healthy, and did anything change. Answering those first, before diving into application logs, is what separates a focused investigation from a flailing one.
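Once the activity log is routed to the workspace, the “did something change” question becomes a short query against the AzureActivity table. The incident time below is a placeholder, and the one-hour lookback before it is an assumption.

```kusto
// Hypothetical incident start; substitute the time your metrics pointed to.
let incidentStart = datetime(2024-01-01 02:14:00);
AzureActivity
| where TimeGenerated between ((incidentStart - 1h) .. incidentStart)
| where ActivityStatusValue == "Success"      // changes that actually took effect
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup, _ResourceId
| order by TimeGenerated desc
```

Reading the result from the top gives the most recent successful control-plane changes before the symptoms began, along with who made them.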
When to Use Azure Monitor and When to Reach for Something Else
Azure Monitor and Log Analytics are the right default for observing anything that runs on Azure, because they are integrated with the platform, they collect platform metrics and the activity log automatically, and they put every signal type into one queryable place with one query language. For Azure-native estates the question is rarely whether to use them and almost always how to use them well, which is what the rest of this guide has been about.
The honest boundaries are worth naming so you design with eyes open. Multi-cloud and hybrid estates often need a layer above any single cloud’s native tooling, and while Azure Monitor can ingest from outside Azure, an organization standardized on a vendor-neutral observability platform across clouds may centralize there and use Azure Monitor as a feeder rather than the single pane. Deep distributed-tracing needs, where you want rich span-level traces across many services, are well served by Application Insights, but some teams standardize on OpenTelemetry pipelines that can target multiple back ends; the modern Azure Monitor story embraces OpenTelemetry-based instrumentation precisely so that choice does not force you off the platform. And extremely high-volume, low-value telemetry, where you want to retain enormous raw streams cheaply for occasional forensic search, sometimes belongs in a storage-account archive or a purpose-built log lake rather than in interactive-tier Log Analytics, which is a cost decision the metrics-versus-logs table already points toward.
None of those boundaries is a reason to avoid Azure Monitor on Azure. They are reasons to be deliberate about what you route where, which loops directly back to the metrics-versus-logs decision and the cost model. The platform is the right hub; the discipline is in deciding what flows into it and what is better kept at the edge.
A Sensible Default Setup to Start From
Teams new to the platform often ask where to begin, and a sensible baseline avoids both the under-instrumented estate that goes blind during incidents and the over-instrumented one that surprises everyone with its bill. The starting posture below is deliberately modest and expands from a defensible core rather than enabling everything and trimming later.
Begin with a single central workspace per environment, placed in the region where your workloads run, and resist splitting it until a concrete residency, access, or retention requirement forces the issue. Enforce diagnostic settings through policy so that every new resource of a monitored type is wired to that workspace automatically at creation, which guarantees coverage without depending on anyone remembering. Select the diagnostic categories that you can name a use for, alerting or investigation, and leave the rest off until a need appears, because an enabled category you never query is cost with no return. For virtual machines, deploy the Azure Monitor Agent with a data collection rule that gathers the handful of guest counters and event sources you actually watch, and lean on VM insights or Container insights for the curated baseline rather than assembling it by hand.
On the alerting side, start with metric alerts for the few fast numeric conditions that genuinely warrant a page, such as elevated error rates or saturation, routed through one well-defined action group that encodes your escalation path. Add log alerts only for conditions a metric cannot express, and treat every alert that fires without prompting action as a defect to tune rather than noise to tolerate. Set interactive retention to your real operational horizon, archive only what compliance demands, and review ingestion by source on a regular cadence so a newly chatty service is caught in days rather than at the end of a billing period. Wire Service Health and Resource Health alerts in from the start so a platform incident announces itself instead of being discovered the hard way.
That baseline gives you complete coverage of what matters, fast alerting on the conditions that warrant it, the full record available for investigation, and a bill you can predict, all without the sprawl that makes an estate expensive and hard to reason about. From there you expand deliberately: a new high-value table earns the full analytics plan, a noisy one gets an ingestion transformation or the cheaper plan, a recurring investigation becomes a saved function, and a frequently watched view becomes a workbook. Growing from a defensible core in this way keeps the estate observable and affordable as it scales, which is exactly the outcome the four pillars are meant to produce.
How to Think About Azure Monitor in One Idea
If you keep one idea from this guide, keep the pipeline with two destinations. Telemetry is produced, collected, and lands in either the metrics store, which is fast, cheap, automatic, and shallow, or a Log Analytics workspace, which is rich, queryable, billed by volume, and silent until you route to it. Every feature hangs off that pipeline, every cost decision is a decision about what you route into the workspace, and every “it isn’t working” begins with finding which node in the pipeline is silent. Metrics answer how much; logs answer what and why; and nothing reaches a workspace without a diagnostic setting, an agent, or an SDK explicitly putting it there.
That single picture turns the platform from a collection of confusingly named blades into a system you can reason about. When telemetry is missing, you trace the pipe. When a query is slow, you narrow it. When a bill surprises you, you find the source of the volume. When you need to alert, you ask whether a metric can express the condition before you pay for a log query. The names will keep changing, agents will be renamed, blades will move, and prices will be revised, but the pipeline and the two destinations are the durable model underneath all of it.
The Strategic Verdict
Azure Monitor and Log Analytics reward the engineer who treats observability as a design decision rather than a checkbox. The teams that get value from the platform are not the ones who enabled the most logging; they are the ones who routed deliberately, queried efficiently, alerted on signal rather than noise, and kept ingestion under the same scrutiny as the rest of their cost. The diagnostic-settings-first rule, the metrics-versus-logs decision, a workspace topology justified by real access and residency boundaries, and a cost model you actively manage are the four pillars, and an estate built on them stays observable and affordable as it grows.
The failure mode to avoid is the opposite of all four: telemetry enabled everywhere by reflex, queries written without regard to what they scan, alerts that fire so often nobody reads them, and a bill that grows quietly until it forces a panicked cleanup. That outcome is not the platform’s fault; it is the absence of the decisions this guide laid out. Make the decisions early, encode them as policy so they survive turnover, and Azure Monitor becomes the thing it is meant to be: the place where, at 02:14 on a bad night, the answer to “what was this service doing” is one query away.
Frequently Asked Questions
Q: What is the difference between Azure Monitor and Log Analytics?
Azure Monitor is the entire observability platform on Azure: it collects telemetry, stores it in two distinct tiers, evaluates alerts, drives dashboards, and triggers automation. Log Analytics is the logs storage and query tier inside that platform, backed by a workspace you interrogate with KQL. Metrics live in a separate time-series store, not in Log Analytics. The practical way to hold the distinction is that Azure Monitor is the whole house and Log Analytics is one room in it, the room where log records are stored and queried. The confusion comes from people using the two names interchangeably, but they bill differently, fail for different reasons, and serve different signal types, so keeping them distinct in your head is the first step to using either one well.
Q: How do I query telemetry with KQL in Log Analytics?
You write a query as a top-to-bottom pipeline. Start by naming the table that holds the signal, then chain operators separated by the pipe character, each transforming the rows the next one receives. Filter to your time range and the records you care about with where placed as early as possible, reduce to the columns you need with project, aggregate with summarize to produce counts or averages grouped by some dimension, and shape the result with order by, top, or take. The same language works against platform logs, guest logs, security logs, and application telemetry because they are all tables in the workspace. The two habits that matter most are filtering on time first and filtering hard before any aggregation or join, because that is what keeps a query fast and prevents it from timing out on a large, busy table.
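The operator order described above can be sketched as a small Python helper that assembles a KQL string. This is a minimal sketch rather than an official client: the function name build_kql is hypothetical, and the table and column names (AppRequests, TimeGenerated, ResultCode, Name) are assumptions chosen for illustration, so substitute the tables that exist in your own workspace.

```python
from typing import Optional, Sequence


def build_kql(
    table: str,
    lookback: str,
    filters: Sequence[str] = (),
    columns: Sequence[str] = (),
    summarize: Optional[str] = None,
    order_by: Optional[str] = None,
) -> str:
    """Assemble a KQL query with the time filter first and shaping last."""
    # The time filter goes first so the engine reads only the window needed.
    lines = [table, f"| where TimeGenerated > ago({lookback})"]
    # Selective filters come next, before any projection or aggregation.
    lines += [f"| where {condition}" for condition in filters]
    if columns:
        lines.append("| project " + ", ".join(columns))
    if summarize:
        lines.append(f"| summarize {summarize}")
    if order_by:
        lines.append(f"| order by {order_by}")
    return "\n".join(lines)


query = build_kql(
    table="AppRequests",                # assumed table name
    lookback="1h",
    filters=['ResultCode == "500"'],    # assumed column and value
    columns=["TimeGenerated", "Name", "ResultCode"],
    summarize="Failures = count() by Name",
    order_by="Failures desc",
)
print(query)
```

The helper only enforces ordering. It does not validate KQL, so treat the output as text to paste into the query editor and adjust.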
Q: How should I design Log Analytics workspaces?
Default to as few workspaces as your hard requirements allow, often a single central workspace per environment or region, because fewer workspaces make cross-source correlation and cost control dramatically easier. A workspace is simultaneously your boundary for access control, retention, region, and cost attribution, so each one you add fragments all four. Create a separate workspace only when a specific force requires it: data residency that mandates a particular region, an access boundary where one group must never see another’s data, or a retention requirement that diverges sharply from the rest. Outside those forces, consolidation lets KQL join across all your signals in one query and makes a commitment pricing tier easier to reach. Document a clear rule for which resources report to which workspace so the next engineer routing a resource does not have to guess.
Q: When do I use metrics versus logs?
Use metrics when you need a fast, cheap numeric signal you will threshold or chart in near real time, such as CPU percentage, request latency, or queue depth. Metrics are pre-aggregated, collected automatically for most platform resources, and effectively free, which makes them ideal for quick alerting. Use logs when you need the actual content of an event: the exact error text, a specific user, a precise sequence, correlation across services, audit history, or long retention. Logs are billed by the volume you ingest and retain, so routing something to a workspace is a cost decision as well as a technical one. The deciding line is the kind of question: metrics answer how much, right now, while logs answer what exactly happened and why. The moment a need crosses from a number you threshold into a record you investigate, it belongs in logs.
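The deciding line above can be written down as a tiny decision function. This is an illustrative sketch only: the SignalNeed fields and the choose_store name are assumptions invented here to encode the questions in the answer, not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass
class SignalNeed:
    """What you need from a signal; every field is an illustrative assumption."""
    needs_event_content: bool = False              # exact error text, user, sequence
    needs_cross_service_correlation: bool = False
    needs_audit_or_long_retention: bool = False


def choose_store(need: SignalNeed) -> str:
    """Return 'logs' when the record itself is needed, otherwise 'metrics'."""
    if (
        need.needs_event_content
        or need.needs_cross_service_correlation
        or need.needs_audit_or_long_retention
    ):
        return "logs"
    # A number you threshold or chart in near real time stays in metrics.
    return "metrics"


cpu_percentage = choose_store(SignalNeed())
exact_error_text = choose_store(SignalNeed(needs_event_content=True))
```

A real decision also weighs ingestion cost, which this sketch leaves out on purpose to keep the dividing question visible.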
Q: What drives Azure Monitor and Log Analytics cost?
The dominant cost is the volume of log data ingested into Log Analytics, billed per gigabyte, plus charges for retaining data beyond the included period. Platform metrics are collected without an ingestion charge, so the bill is almost entirely a function of what you send to your workspaces and how long you keep it. The usual cost drivers are enabling every diagnostic category by reflex, leaving application logging at a verbose or debug level in production, routing high-traffic services that emit enormous event volumes, and holding everything at interactive retention for the maximum period. You control the bill by being selective about categories, matching log levels to what investigations actually need, choosing a commitment pricing tier once volume is predictable, and tuning retention to your real question horizon. Verify current pricing and commitment thresholds against the official source, since they change over time.
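A rough estimate of the two charges named above, ingestion and retention beyond the included period, can be sketched as follows. Every price and volume in the example is a hypothetical placeholder, not a published rate, and the model assumes a steady daily volume with retention billed per gigabyte per month on data held past the included window.

```python
def estimate_monthly_cost(
    gb_per_day: float,
    price_per_gb: float,
    included_retention_days: int,
    retention_days: int,
    retention_price_per_gb_month: float,
    days_in_month: int = 30,
) -> float:
    """Rough monthly estimate: ingestion plus retention beyond the included period."""
    ingestion = gb_per_day * days_in_month * price_per_gb
    # Only days held past the included window are charged for retention.
    extra_days = max(0, retention_days - included_retention_days)
    # Steady-state volume sitting in paid retention at any moment.
    retained_gb = gb_per_day * extra_days
    return ingestion + retained_gb * retention_price_per_gb_month


# Hypothetical inputs: 10 GB/day, 2.00 per GB ingested, 30 days included,
# 90 days kept, 0.25 per GB per month for the extra 60 days.
estimate = estimate_monthly_cost(10, 2.0, 30, 90, 0.25)
```

With these placeholder numbers ingestion accounts for 600 of the 750 total, which matches the point that volume sent to the workspace dominates the bill.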
Q: How do alert rules and action groups work?
An alert rule defines the condition to watch and how often to evaluate it, while an action group defines the response. A metric alert rule watches a metric and fires when it crosses a threshold, evaluated continuously and well suited to fast, simple numeric conditions. A log alert rule runs a KQL query on a schedule and fires based on what the query returns, which lets it express conditions metrics cannot. The action group is where you encode the response once: emails, SMS, push notifications, webhooks to open tickets, or automation runbooks to attempt a self-heal. Because many rules can point at one action group, you change your escalation path in a single place. The discipline that keeps alerting useful is to ensure every alert that fires demands a human decision or triggers an automation, and to treat a chronically noisy alert as a bug to fix rather than background noise.
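The relationship between rules and a shared action group can be modeled in a few lines. This is a conceptual sketch with invented class names and placeholder receivers, not the Azure resource schema; it only shows why editing one action group changes the response for every rule that references it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionGroup:
    """The response, defined once: who gets told and how."""
    name: str
    receivers: List[str] = field(default_factory=list)


@dataclass
class AlertRule:
    """The condition to watch; kind is 'metric' or 'log'."""
    name: str
    kind: str
    condition: str
    action_group: ActionGroup


on_call = ActionGroup("on-call", ["email:ops@example.com"])
rules = [
    AlertRule("high-cpu", "metric", "Percentage CPU > 90", on_call),
    AlertRule("auth-failures", "log", "KQL query returns rows", on_call),
]

# Changing the escalation path in one place updates every rule that shares it.
on_call.receivers.append("webhook:https://example.com/tickets")
notified = {rule.name: rule.action_group.receivers for rule in rules}
```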
Q: Why does my Log Analytics query time out or run slowly?
Almost always because it scans too much data before it narrows. A query over a wide time range against a large, busy table has to read a great deal before your filters take effect, and the time filter is the cheapest way to shrink what the engine reads. The fix is rarely more compute and almost always query shape: push the time filter to the very front so the engine only reads the window you need, apply the most selective where conditions early, and use summarize to collapse results rather than returning enormous raw row sets. A query that times out over thirty days frequently runs instantly over the single hour the investigation actually requires. Narrowing first is the habit that separates KQL that scales from KQL that stalls.
Q: Do Azure resources log to Log Analytics automatically?
No, and assuming they do is the single most common observability mistake on Azure. The platform collects metrics for most resources automatically and records the subscription activity log without configuration, but the rich per-resource logs, the diagnostic logs, emit nothing until a diagnostic setting explicitly routes them to a destination. A resource without a diagnostic setting keeps running and keeps producing zero queryable log rows. This is why “no data” almost always traces to a missing pipe rather than a broken query. Before you debug a query that returns nothing, confirm a diagnostic setting exists for that resource, is enabled, selects the categories you expect, and points at the same workspace your query runs against. Carry that check as a reflex and you will resolve most monitoring gaps in minutes.
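The four checks listed above can be carried as an ordered routine. The sketch below uses a simplified stand-in for a diagnostic setting (the DiagnosticSetting class, its fields, and the diagnose_no_data function are all hypothetical), so it shows the order of the checks rather than how to read a real setting from Azure.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DiagnosticSetting:
    """Simplified stand-in for a diagnostic setting; not the real API object."""
    enabled: bool
    categories: Set[str] = field(default_factory=set)
    workspace_id: Optional[str] = None


def diagnose_no_data(
    setting: Optional[DiagnosticSetting],
    expected_categories: Set[str],
    query_workspace_id: str,
) -> str:
    """Walk the checks in order and return the first one that fails."""
    if setting is None:
        return "no diagnostic setting exists"
    if not setting.enabled:
        return "diagnostic setting is disabled"
    missing = expected_categories - setting.categories
    if missing:
        return "categories not selected: " + ", ".join(sorted(missing))
    if setting.workspace_id != query_workspace_id:
        return "setting points at a different workspace"
    return "pipe looks connected; check the query and ingestion latency"
```

For example, a setting that selects only one of two expected categories is reported before the workspace is even compared, which mirrors the order in which the pipe is traced.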
Q: What is the Azure Monitor Agent and when do I need it?
The Azure Monitor Agent is the in-guest collector that gathers telemetry from inside a virtual machine: memory and process counters, Windows event logs, syslog entries, and custom performance counters. You need it whenever you want signal from inside the operating system, because the platform can report a machine’s host-level metrics from the hypervisor without ever entering the guest, but anything from within the guest requires an agent. The agent is configured through data collection rules that specify exactly which counters and event sources to gather and which workspace to send them to. It replaces the older Log Analytics agent, which is on a retirement path, so any new design should standardize on the Azure Monitor Agent and data collection rules from the start rather than carrying legacy collection forward and having to migrate later.
Q: How is the activity log different from resource logs?
The activity log is a subscription-level audit trail of control-plane operations: who started a virtual machine, who changed a network security group rule, who deleted a resource. It is recorded automatically, retained for a fixed window at no charge, and answers governance questions about what changed and who changed it. Resource logs, also called diagnostic logs, are data-plane records emitted by an individual resource about its own operation, and they emit nothing until a diagnostic setting routes them. The two answer different questions: the activity log tells you about management actions on resources, while resource logs tell you what a resource was doing internally. For long retention and query you can route the activity log into a workspace alongside resource logs, which lets you correlate a control-plane change with the data-plane behavior that followed it.
Q: Where does Application Insights data live, and can I query it with KQL?
In its modern, workspace-based form, an Application Insights resource stores its data physically in a Log Analytics workspace, and you query it with the same KQL you use for everything else, against tables like requests, dependencies, exceptions, and traces (named AppRequests, AppDependencies, AppExceptions, and AppTraces when you query the workspace directly). This means Application Insights is not a separate silo with its own query language to learn; it is a specialized producer of application telemetry that writes into the same store as your platform logs and security logs. The practical benefit is enormous: a single KQL query can join an application request to the platform log of the underlying service and the security sign-in that initiated it, all because they share a workspace. This unification is one of the strongest arguments for consolidating telemetry into a small number of workspaces rather than scattering it.
Q: What retention period should I set on a Log Analytics workspace?
Match retention to the actual horizon over which you would query the data interactively, and use the cheaper archive tier for anything you must keep but will rarely touch. Every workspace has an interactive retention period during which data is fully queryable, and beyond that you can configure archived retention that holds data at lower cost for occasional retrieval. The mistake teams make is holding everything at interactive prices for the maximum period out of caution, which is expensive for data they will never query interactively. Keep operational logs only as long as an active investigation might reach back, archive compliance data for as long as the obligation runs, and set retention per table where the workspace allows it so noisy low-value tables are not kept as long as high-value ones. Confirm current default and maximum retention windows against the official documentation, since they change.
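The split between interactive and archived retention comes down to simple arithmetic per table. The sketch below assumes hypothetical table names and day counts and a made-up plan_retention helper; it does not reflect actual default or maximum retention windows.

```python
def plan_retention(required_days: int, interactive_horizon_days: int) -> dict:
    """Split a total retention requirement into interactive and archived days."""
    interactive = min(required_days, interactive_horizon_days)
    archived = max(0, required_days - interactive_horizon_days)
    return {"interactive_days": interactive, "archived_days": archived}


# Hypothetical per-table plan: (required days, interactive horizon in days).
tables = {
    "operational_logs": (30, 30),
    "compliance_audit": (365, 90),
}
plan = {name: plan_retention(*days) for name, days in tables.items()}
```

Here the compliance table keeps 90 days interactive and pushes the remaining 275 to the cheaper tier, while the operational table archives nothing.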
Q: How do I monitor a virtual machine’s memory usage on Azure?
Memory from inside the guest requires the Azure Monitor Agent, because the platform reports a virtual machine’s host-level metrics like host CPU and disk from the hypervisor but cannot see guest memory pressure without an in-guest collector. You install the Azure Monitor Agent, associate a data collection rule that gathers the memory performance counters you want, and direct that rule to a Log Analytics workspace. Once the counters flow, you query them with KQL in the Perf table and can build a metric-style alert or a workbook tile from the result. The common confusion is expecting guest memory to appear automatically alongside host CPU; it will not, because the two come from different layers. The host metrics are free and automatic; the guest counters need the agent and a data collection rule pointed at a workspace.
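The shape of a data collection rule for guest memory can be sketched as a Python dictionary with a small consistency check. The field names mirror the data collection rule schema as best understood here and should be verified against the official reference; the counter specifier is the Windows form, the workspace resource ID is a placeholder, and the unrouted_streams helper is hypothetical.

```python
# Field names mirror the data collection rule schema as best understood here;
# verify against the official reference. The workspace ID is a placeholder.
data_collection_rule = {
    "dataSources": {
        "performanceCounters": [
            {
                "name": "guestMemory",
                "streams": ["Microsoft-Perf"],
                "samplingFrequencyInSeconds": 60,
                "counterSpecifiers": ["\\Memory\\Available Bytes"],
            }
        ]
    },
    "destinations": {
        "logAnalytics": [
            {"name": "centralWorkspace", "workspaceResourceId": "<workspace-resource-id>"}
        ]
    },
    "dataFlows": [
        {"streams": ["Microsoft-Perf"], "destinations": ["centralWorkspace"]}
    ],
}


def unrouted_streams(rule: dict) -> set:
    """Return streams that are collected but not delivered to a known destination."""
    known = {d["name"] for d in rule["destinations"].get("logAnalytics", [])}
    produced = {
        stream
        for sources in rule["dataSources"].values()
        for source in sources
        for stream in source["streams"]
    }
    routed = {
        stream
        for flow in rule["dataFlows"]
        if set(flow["destinations"]) & known
        for stream in flow["streams"]
    }
    return produced - routed
```

An empty result means every collected stream has a data flow to the workspace; a non-empty one is the same silent-pipe failure described above, just inside the rule.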
Q: Can one diagnostic setting send to multiple destinations at once?
Yes. A single diagnostic setting can route the selected log and metric categories to several destinations simultaneously: a Log Analytics workspace for query and alerting, a storage account for cheap long-term archive, and an event hub for streaming to an external system. This lets you serve different needs from one configuration, for example sending high-value categories to a workspace for interactive query while also archiving the full stream cheaply in storage for forensic retention. The thing to remember is that each destination has its own cost and purpose: the workspace destination is what makes data queryable with KQL and usable by log alerts, while storage and event hub serve archive and streaming respectively. Choose destinations by what you will actually do with the data, since routing everything everywhere multiplies cost without adding capability you use.
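One way to picture a multi-destination setting is as a single configuration object with one property per destination. The property names below mirror the diagnostic settings resource as best understood here and should be checked against the official reference; every ID is a placeholder, the category is only an example since categories vary by resource type, and the active_destinations helper is hypothetical.

```python
# Property names mirror the diagnostic settings resource as best understood
# here; every ID below is a placeholder, not a real resource.
diagnostic_setting = {
    "workspaceId": "<workspace-resource-id>",
    "storageAccountId": "<storage-account-resource-id>",
    "eventHubAuthorizationRuleId": None,  # not streaming in this example
    "logs": [
        {"category": "AuditEvent", "enabled": True},  # example category only
    ],
}

# What each destination is for, taken from the answer above.
PURPOSE = {
    "workspaceId": "interactive KQL query and log alerts",
    "storageAccountId": "cheap long-term archive",
    "eventHubAuthorizationRuleId": "streaming to an external system",
}


def active_destinations(setting: dict) -> dict:
    """Map each configured destination to the purpose it serves."""
    return {key: why for key, why in PURPOSE.items() if setting.get(key)}
```

Listing the purpose next to each destination makes it easier to spot one that is configured but serves no need you can name.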
Q: Why do I see metrics for my resource but no logs?
Because metrics and logs travel different paths, and one working tells you nothing about the other. Platform metrics for most resources are collected automatically into the metrics store with no configuration, which is why you see them. Resource logs, by contrast, emit nothing until a diagnostic setting routes them to a workspace, so their absence means the pipe was never connected. This is the partial-silence failure shape, and the diagnosis is to recognize that seeing metrics is not evidence that logging is configured. Check for a diagnostic setting on the resource, confirm it selects the log categories you expect, and confirm it points at the workspace your query runs against. The metrics arriving automatically is exactly what lulls teams into assuming the logs must be arriving too, when in fact they require a separate, explicit configuration.
Q: Should I use a metric alert or a log alert for a given condition?
If a metric can express the condition, prefer a metric alert, because it evaluates continuously against a near-real-time stream, fires quickly, and costs almost nothing to run. Use it for simple numeric thresholds like CPU over a line, available memory under one, or request latency above a target. Reach for a log alert, which runs a KQL query on a schedule, when the condition needs the content of the records: counting distinct error types, detecting a pattern across fields, correlating events from two services, or any logic a single metric cannot carry. Log alerts are more expressive but evaluate on a schedule rather than continuously and carry the cost of running a query repeatedly. The choice mirrors the broader metrics-versus-logs decision: numbers you threshold go to metric alerts for speed and cost, and conditions that require the record go to log alerts for expressiveness.
Q: How do I reduce a surprising Log Analytics bill without losing visibility?
Attack the volume, not the visibility. Start by reviewing ingestion by source to find which resources and categories generate the most data, because the bill is dominated by ingested gigabytes. Disable diagnostic categories you enabled but never query, since an unused category is pure cost. Set application logging to capture what investigations need without verbose debug output running continuously in production. Match retention to your real query horizon and move compliance data to the cheaper archive tier instead of holding it at interactive prices. Once your steady-state volume is predictable, evaluate a commitment pricing tier, which usually beats pay-as-you-go and is easier to justify when volume is consolidated into fewer workspaces. None of these reduces the visibility you actually use; they remove the data you were paying to ingest and keep but never looked at.
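The first step, reviewing ingestion by source, amounts to a group-and-rank over usage records. The sketch below works on hypothetical in-memory samples with invented source and category names; in practice the input would come from your workspace's usage data rather than a hard-coded list.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def top_ingestion_sources(
    records: Iterable[Tuple[str, str, float]], limit: int = 3
) -> List[Tuple[str, float]]:
    """Sum ingested GB per source/category pair and return the largest first."""
    totals = defaultdict(float)
    for source, category, gb in records:
        totals[f"{source}/{category}"] += gb
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical daily ingestion samples: (source, category, gigabytes).
samples = [
    ("app-gateway", "AccessLog", 40.0),
    ("key-vault", "AuditEvent", 0.5),
    ("app-gateway", "AccessLog", 45.0),
    ("web-app", "AppTraces", 12.0),
]
ranking = top_ingestion_sources(samples)
```

In this made-up sample one source and category pair accounts for most of the volume, which is the usual shape of a surprising bill and tells you where to look first.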
Q: What is the normal ingestion latency, and how do I tell late data from missing data?
There is always some delay between an event happening and that event becoming queryable in a workspace, because telemetry is batched at the source, transmitted, processed through the ingestion pipeline, and only then indexed. This latency is normal, bounded under healthy conditions, and varies by data type, but it is never zero. The skill is distinguishing normal latency from a genuine problem. If your data is a few minutes behind, that is the expected envelope and the answer is patience. If it is hours behind, or a source that was flowing has gone quiet, that is a distinct condition with its own causes worth investigating. Confusing the two wastes time in both directions: treating normal latency as an outage sends you chasing a non-problem, while treating a real stall as mere latency lets a genuine gap persist. Confirm recency before concluding either way.
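Confirming recency can be reduced to comparing the newest record's timestamp with the current time. In the sketch below the function name and both thresholds (ten minutes and one hour) are assumptions chosen for illustration, not documented latency figures; tune them to what each source normally shows.

```python
from datetime import datetime, timedelta, timezone


def classify_freshness(
    last_record: datetime,
    now: datetime,
    normal_lag: timedelta = timedelta(minutes=10),  # assumed envelope
    stall_after: timedelta = timedelta(hours=1),    # assumed threshold
) -> str:
    """Label how far behind a source is, based on its newest record."""
    lag = now - last_record
    if lag <= normal_lag:
        return "normal latency"
    if lag < stall_after:
        return "watch: slower than usual"
    return "investigate: source may have stalled"


now = datetime(2024, 1, 1, 2, 14, tzinfo=timezone.utc)
status = classify_freshness(now - timedelta(minutes=3), now)
```

The middle band exists so that a source drifting past its usual lag is noticed before it crosses into a genuine stall.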
Q: Can Azure Monitor collect telemetry from outside Azure or from other clouds?
Yes. While Azure Monitor is deepest on Azure-native resources, it can ingest telemetry from on-premises machines and from other clouds through the Azure Monitor Agent and the data collection API, and Application Insights can instrument applications wherever they run. Organizations standardized on Azure can use it as a central observability hub across a hybrid estate. That said, teams running heavily multi-cloud sometimes centralize on a vendor-neutral platform and treat Azure Monitor as a feeder for the Azure portion, and the modern story embraces OpenTelemetry-based instrumentation so that choosing a portable standard does not force you off the platform. The decision comes down to where your center of gravity is: an Azure-centric estate gets the most from making Azure Monitor the hub, while a balanced multi-cloud estate may put the single pane elsewhere and route Azure signal into it.
Q: What is KQL and is it the same as SQL?
KQL, the Kusto Query Language, is the query language for Log Analytics and several other Azure data services, and while it shares the goal of querying tabular data with SQL, it reads quite differently. Where SQL is declarative and clause-ordered, KQL is a top-to-bottom pipeline: you name a table, then pass rows through a sequence of operators separated by the pipe character, each transforming what the next receives. That pipeline shape makes investigative queries read like a funnel, which many engineers find more natural for exploratory log analysis than SQL’s structure. The core operators, where to filter, project to select columns, summarize to aggregate, and join to correlate, cover the overwhelming majority of operational work. If you know SQL the concepts transfer quickly; the main adjustment is thinking in a forward pipeline rather than in SQL’s clause order, and learning to filter early for performance.
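For readers who think in code, the forward pipeline can be mimicked in plain Python over in-memory rows. This is an analogy, not how the engine executes a query: the rows and field names are hypothetical, and each step is labeled with the KQL operator it stands in for.

```python
from collections import Counter

# Hypothetical rows standing in for a workspace table.
rows = [
    {"service": "checkout", "level": "Error", "message": "timeout"},
    {"service": "checkout", "level": "Info", "message": "ok"},
    {"service": "search", "level": "Error", "message": "timeout"},
    {"service": "checkout", "level": "Error", "message": "refused"},
]

# where: keep only the rows that match, as early as possible.
errors = (row for row in rows if row["level"] == "Error")
# project: keep only the column the next step needs.
services = (row["service"] for row in errors)
# summarize count() by service: collapse rows into one count per group.
counts = Counter(services)
# order by count desc: shape the final result.
result = counts.most_common()
```

Each line consumes what the previous one produced, which is the same funnel a chain of piped KQL operators forms.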