Fix Log Analytics Ingestion Delay: Latency vs Loss

A Log Analytics ingestion delay is the gap between the moment an event happens on your resource and the moment a query in that workspace can return the row. It feels like a fault because you know the event occurred, you can see the activity on the resource, and yet the table comes back empty when you run the search. The reflex that follows is almost always wrong: engineers conclude data is being lost, open a support case, rebuild the diagnostic pipeline, reinstall the agent, or recreate the workspace, when in the overwhelming majority of cases nothing is broken at all. The data is in flight. It has not finished its journey from the source, through collection and batching, across the ingestion pipeline, into the indexed store where queries read from. That journey has an inherent, bounded latency, and treating that latency as loss is the single most common and most expensive mistake in Azure Monitor operations.


This article diagnoses the delay to root cause. The central idea, the one to carry into every incident, is what this series calls the latency-is-expected rule: a bounded ingestion delay is normal, so the first step is always to compare the observed latency against the expected window before you treat it as a fault. Get that ordering wrong and you spend an afternoon dismantling a healthy pipeline to chase data that was about to arrive on its own. Get it right and most of these incidents close in two minutes with a single query. The distinct causes that produce a real, abnormal delay are few and each has a confirming check: a query that filters on event time rather than ingestion time and therefore excludes rows that already arrived, agent buffering and batching that holds events before sending them, ingestion-volume throttling that extends latency under heavy load, a connector or pipeline lag specific to one source, and a dashboard that refreshes inside the normal latency window and shows a hole that fills itself. By the end you will be able to tell normal latency from a genuine delay, attribute a true delay to its cause, and act only when the situation is actually abnormal.

What a Log Analytics ingestion delay actually means

When you send telemetry to a Log Analytics workspace, the data does not appear in a queryable table the instant the event fires. It passes through a sequence of stages, and each stage adds time. The source generates the event and stamps it with a time. A collection agent or a platform pipeline picks the event up, often buffering several events together rather than sending each one alone. The batch is transmitted to the Azure Monitor ingestion endpoint. The ingestion pipeline validates, parses, and routes the records to the correct table according to the schema and the data collection rule in force. Finally the records are committed to the indexed columnar store that the Kusto engine queries against. Only after that last commit can a query return the row. The total elapsed time across all those stages is the ingestion latency, and it is a property of the system working correctly, not a symptom of the system failing.

The reason this matters so much is that the symptom of normal latency is identical to the symptom of real data loss when you look only at an empty query result. A query that returns nothing for the last five minutes looks exactly the same whether the data is lost forever or simply still in transit. The empty result cannot, on its own, tell you which situation you are in. That is why the diagnosis has to start with the expected window. If the missing interval falls inside the normal latency window, the empty result is expected behavior and the correct action is to wait or to query a slightly older interval. If the missing interval extends well beyond the expected window, you have a genuine delay and you move to attributing it to a cause. The empty result is the symptom; the expected window is the ruler you measure it against; the cause is what you fix.

There is a sibling failure that looks similar and must be separated from this one. If telemetry never arrives at all, no matter how long you wait, that is not an ingestion delay, it is a no-data problem with its own distinct causes: a diagnostic setting that was never configured, an agent that is not running or not connected, telemetry routed to a different workspace than the one you are querying, or a data collection rule that does not associate the resource. That family is diagnosed separately in the companion piece on how to fix Azure Monitor missing logs and metrics, and the very first branch of any monitoring incident is deciding which family you are in. The test is simple and you will see it repeated throughout this article: does the data eventually arrive if you wait and widen the window? If yes, you have a delay, and this article is the right one. If no, you have a no-data problem, and the sibling article is the right one.

The model underneath all of this, the workspace itself and how queries read from it, is covered in the Azure Monitor and Log Analytics guide. This article assumes that model and concentrates on the timing behavior: why the data is late, how late is normal, and what makes it later than it should be.

How to read the signal before you touch anything

The discipline that separates a two-minute close from a wasted afternoon is gathering the diagnostic signal before you change anything. The instinct under pressure is to act first (restart the agent, recreate the diagnostic setting, scale the workspace) and only then look at what the data is telling you. Reverse that. Every change you make before you have read the signal destroys the evidence you need to diagnose correctly and can introduce a real problem on top of the imagined one. The signal you need is small and fast to collect, and it answers the only question that matters at this stage: is the data arriving, and if so, how late.

How do I confirm telemetry is still flowing into the workspace?

Run a query that does not restrict TimeGenerated to the last few minutes and look at the most recent row’s ingestion time. If you can see rows arriving and the newest ingestion time is within the last several minutes, telemetry is flowing and you are looking at normal latency. If the newest ingestion time is hours old or absent, you have a real delay or a no-data problem, and you move to attribution.

The fastest confirming query in the entire Azure Monitor toolkit is the one that asks the data when it actually arrived rather than when the event happened. The ingestion_time() function returns, for each record, the time the ingestion pipeline committed that record to the store. Project it alongside the event time and the difference between the two columns is the measured latency for those specific rows. This is not a guess or an estimate from documentation; it is the latency your workspace is experiencing right now, computed from the rows themselves.

Heartbeat
| extend IngestionTime = ingestion_time()
| extend LatencySeconds = datetime_diff('second', IngestionTime, TimeGenerated)
| project TimeGenerated, IngestionTime, LatencySeconds, Computer
| sort by IngestionTime desc
| take 20

The Heartbeat table is the right place to start because every connected agent writes a heartbeat record on a regular cadence, so it is a steady, predictable stream you can use as a clock for the whole ingestion path. If the heartbeat rows are arriving with a small, stable latency, the path is healthy and any “missing” application data is almost certainly normal latency on a less frequent table. If the heartbeat latency itself is large or the heartbeats have stopped arriving, the problem is upstream of any single table and affects everything that agent sends.

Read the LatencySeconds column carefully. A small positive number, on the order of seconds to a few minutes, is the system working as designed. The number being positive is expected: ingestion time should be at or after event time, because a record cannot be committed before the event it describes occurred. A negative value therefore points at a source clock running ahead of real time, not at the pipeline. What you are watching for is the magnitude. A latency that sits steady in the low minutes is your baseline. A latency that has climbed into the tens of minutes or hours, or that is growing each time you re-run the query, is the abnormal case that warrants attribution. Re-running the query a few times over a couple of minutes tells you whether the latency is stable, shrinking as a backlog drains, or growing as a backlog builds, and that trend is one of the most useful signals you can collect.

One more signal belongs in this first pass: the spread of latency across sources. If one computer or one resource shows a large latency while everything else in the same workspace is within the normal window, the problem is local to that source, which points at agent batching, a connector lag, or a source-specific issue rather than a workspace-wide condition. If every source shows the same elevated latency at the same time, the problem is downstream and shared, which points at volume throttling or a pipeline-wide event. Collecting this spread early saves you from chasing a single agent when the whole workspace is affected, and from blaming the workspace when one agent is the outlier.

Heartbeat
| where TimeGenerated > ago(1h)
| extend LatencySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize MaxLatency = max(LatencySeconds), AvgLatency = avg(LatencySeconds), Rows = count() by Computer
| sort by MaxLatency desc

This query gives you the per-source latency profile for the last hour in one shot. The computers at the top of the sorted result are your outliers, the candidates for a source-specific cause. A flat profile where every source shows similar latency tells you to look downstream. This single query frequently ends the investigation, because it either shows you a healthy workspace experiencing normal latency or points you straight at the one source that is behaving differently. Practicing this read until it is automatic is exactly the kind of drill that the scenario-based exercises on ReportMedic are built around, and being able to interpret the latency profile under incident pressure is what turns a panicked rebuild into a calm two-minute confirmation.

The latency-is-expected rule and the baseline you measure against

The latency-is-expected rule is the spine of this whole diagnosis, so it is worth stating with precision. A Log Analytics ingestion delay within the normal window is not a fault, it is the designed behavior of a buffered, batched, pipelined ingestion system, and therefore the correct first action is never to fix anything but to measure the observed latency against the expected window. Only a latency that exceeds the expected window is a genuine delay, and only a genuine delay gets attributed to a cause and acted on. Everything in the rest of this article hangs off that ordering. Measure first, attribute second, act third, and act only when the measurement says the situation is abnormal.

To apply the rule you need a baseline, an expected window for your own workspace. The exact end-to-end latency you should expect depends on the data type, the collection method, and the load, and the precise figures change as Azure evolves the ingestion platform, so the durable approach is to establish your own baseline from your own data rather than memorizing a number that will age badly. The official Azure Monitor documentation publishes a latency guideline and the data-type-specific expectations, and you should confirm the current published window against that source, but your operational baseline is the latency your ingestion_time() queries report on a normal day. Capture that baseline once, when nothing is wrong, and you will know instantly during an incident whether the latency you are seeing is ordinary or elevated.

What ingestion latency is normal versus a real problem?

Normal latency is a small, stable delay measured from your own workspace on a quiet day, typically on the order of a few minutes for most data types, with some types and some collection paths running faster or slower. A real problem is a latency that sits well above that baseline, keeps growing rather than holding steady, or affects data that should be near real time.

The reason a single universal number is the wrong thing to chase is that different telemetry travels different paths with different timing characteristics. Platform metrics, which take a dedicated near-real-time path, are queryable far faster than verbose resource logs that flow through diagnostic settings and the general log ingestion pipeline. Agent-collected data carries the additional time the agent spends buffering and batching before it transmits. Data from a connector that pulls from an external system on a schedule inherits that schedule’s cadence on top of the ingestion latency. Because the paths differ, the expected window differs by data type, and a latency that is perfectly normal for one table would be alarming for another. This is why your baseline should be captured per table, or at least per major data path, rather than as one workspace-wide figure.
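
A per-table baseline can be captured with a query along the following lines. This is a minimal sketch, not a prescribed method: the table list, the seven-day lookback, and the choice of p50 and p95 are assumptions to adapt to your own workspace. It reuses only the ingestion_time() and TimeGenerated comparison already shown in this article.

```kql
// Sketch: per-table latency baseline from a period you believe was healthy.
// Assumptions: the table list and the 7d..1h window are placeholders.
union withsource=SourceTable Heartbeat, AppRequests
| where TimeGenerated between (ago(7d) .. ago(1h))
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize BaselineP50 = percentile(DelaySeconds, 50),
            BaselineP95 = percentile(DelaySeconds, 95),
            Rows = count()
  by SourceTable
| sort by BaselineP95 desc
```

Record the output somewhere the on-call engineer can find it. A table whose BaselineP95 is several times larger than another's is not necessarily unhealthy; it may simply travel a slower path.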

The most important consequence of the rule is what it tells you not to do. When you observe a gap that falls inside the expected window, the rule forbids the destructive responses that engineers reach for under pressure. Do not recreate the diagnostic setting, because the setting is working and recreating it changes nothing except possibly introducing a real misconfiguration. Do not reinstall or reconfigure the agent, because the agent is sending and the data is in flight. Do not scale up the workspace or change its tier, because capacity is not the constraint when the latency is normal. Do not open a support case for data that has not yet exceeded its expected window, because the answer will be to wait. The rule converts a tempting list of actions into a single correct action: confirm the window, and if the gap is inside it, wait or query the slightly older interval that is already complete.

There is a counter-reading worth engaging directly, because it is the trap that catches experienced engineers, not just beginners. The argument goes: “I have waited several minutes, the data still is not there, so it must be lost.” The flaw is that “several minutes” is an impression, not a measurement, and that the query used to check is often filtering on event time, which hides rows that already arrived. The disciplined version of that same check is to run the ingestion_time() query, read the actual latency, and compare it to the baseline. Nine times out of ten the disciplined check shows the data either already present or arriving with a normal latency that the impatient impression missed. The rule does not ask you to be patient on faith; it asks you to replace an impression with a measurement.
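
One way to turn the impression into a measurement is to compute the current latency and the baseline in the same query and let the query state the verdict. The sketch below assumes Heartbeat as the reference stream, a baseline window of seven days ending one day ago, and a tolerance of twice the baseline p95. All three are illustrative choices, not published thresholds.

```kql
// Sketch: compare what was ingested in the last 15 minutes against a baseline.
// Assumptions: the reference table, baseline window, and 2x factor are placeholders.
let baselineP95 = toscalar(
    Heartbeat
    | where TimeGenerated between (ago(7d) .. ago(1d))
    | extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
    | summarize percentile(DelaySeconds, 95));
Heartbeat
| where ingestion_time() > ago(15m)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize CurrentP95 = percentile(DelaySeconds, 95), Rows = count()
| extend BaselineP95 = baselineP95
| extend Verdict = case(
    Rows == 0, "nothing ingested in 15m: check the no-data family",
    CurrentP95 <= 2 * BaselineP95, "within the expected window: wait",
    "elevated: move to attribution")
```

If the baseline window contains no rows, BaselineP95 is empty and the verdict falls through to the last branch, so read the BaselineP95 column before trusting the label.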

Why ingestion time and event time are not the same thing

The deepest source of confusion in this entire problem space is the difference between when an event happened and when its record became queryable, because Log Analytics records both and the two are used for completely different purposes. Get this distinction wrong and you will misdiagnose ingestion delays for the rest of your career; get it right and most “missing recent data” incidents dissolve on contact.

Every record in a Log Analytics workspace carries a TimeGenerated column. This is the event time, the timestamp of when the thing the record describes actually occurred at the source. It is the time you almost always want to filter and chart on, because you care about when events happened in the real world, not when the plumbing delivered them. When you write where TimeGenerated > ago(5m), you are asking for events that occurred in the last five minutes of real-world time. That is usually exactly right, and it is why TimeGenerated is the default time column for the workspace and the field the time-range picker in the portal binds to.

Separately, every record has an ingestion time, exposed not as a stored column you select by name but through the ingestion_time() function. This is the time the ingestion pipeline committed the record to the store and made it queryable. The gap between TimeGenerated and ingestion_time() is the latency for that record, and it should be non-negative, because a record cannot be made queryable before the event it describes happened. If you do see a negative gap, suspect a source clock that stamped the event ahead of real time, not a pipeline fault. The two timestamps answer two different questions. TimeGenerated answers “when did it happen.” ingestion_time() answers “when could I first see it.” Confusing the two is the root of the most common false alarm in the whole subject.

Why does my query miss the most recent rows?

Because your query filters on TimeGenerated, the event time, and the rows for the most recent events have not finished ingesting yet, so they are not in the store for the query to return. The events occurred, but their records are still in flight through the latency window. Querying by ingestion time, or simply waiting for the window to close, reveals them.

Walk through exactly how this trap springs, because it is so common that recognizing the shape of it will save you more incidents than any other single idea in this article. You run a query for the last five minutes filtered on TimeGenerated. The most recent events did occur within those five minutes, so their event time falls inside your filter. But those same events are still inside the ingestion latency window, so their records are not yet committed to the store. The query engine scans the store, finds no committed records whose TimeGenerated falls in the window, and returns an empty or short result. Nothing is lost. The events happened, the records are coming, but the query asked for them by event time before ingestion had finished, so they were invisible to that particular query at that particular moment. Wait until the latency window closes and re-run the identical query, and the rows appear, because now they are committed and their TimeGenerated still falls inside the (now slightly older) window.

The fix for this specific misdiagnosis is to make the latency visible in the query itself rather than fighting it blindly. When you are checking whether recent data is arriving, filter and inspect on ingestion time, not event time, so that you see what has actually been committed regardless of when the underlying events happened.

AppRequests
| where ingestion_time() > ago(5m)
| extend EventTime = TimeGenerated, IngestionTime = ingestion_time()
| extend LatencySeconds = datetime_diff('second', IngestionTime, EventTime)
| project EventTime, IngestionTime, LatencySeconds
| sort by IngestionTime desc

This query asks “what has been committed in the last five minutes” rather than “what happened in the last five minutes,” and the difference is decisive during a delay investigation. If this query returns rows, ingestion is working and your event-time query was simply asking for data that had not finished arriving. If this query also returns nothing, the data genuinely is not arriving and you have crossed from a latency question into a no-data question. The same idea, framing recent-data checks around ingestion time, is the practice habit that the DevOps observability article builds into dashboards and alerts so that the team stops raising false alarms on normal latency.

A subtle and important corollary: for charting and alerting on actual system behavior, you still want TimeGenerated, because you care about when events happened, not when they were ingested. Ingestion time is a diagnostic lens for delay investigations, not a replacement for event time in your day-to-day queries. The skill is knowing which lens to use for which question. Use event time to understand the system. Use ingestion time to understand the pipeline. Reach for ingestion time precisely when you suspect a delay, and switch back to event time once you have confirmed the pipeline is healthy.
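
For day-to-day charts on event time, one way to avoid plotting the still-filling tail is to end the time range slightly before now. The sketch below assumes a five-minute settle period and the AppRequests table; the settle value is a placeholder that you would set from your own measured baseline.

```kql
// Sketch: chart on event time, but stop short of the interval that is still ingesting.
// Assumption: settle is a placeholder; derive it from your measured latency baseline.
let settle = 5m;
AppRequests
| where TimeGenerated between (ago(1h) .. ago(settle))
| summarize Requests = count() by bin(TimeGenerated, 5m)
| sort by TimeGenerated asc
```

The chart trails real time by the settle period, and in exchange the most recent bin is complete, not a dip that fills itself a few minutes later.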

The collection paths and why each one carries its own timing

To baseline correctly and to attribute a real delay quickly, you need a working picture of how telemetry actually reaches a workspace, because the path determines the timing and the path differs by data type. There is no single conveyor belt; there are several, each with its own stages, and a number that is perfectly healthy on one path would be alarming on another. Knowing which path a given table travels lets you set the right expectation and rules out half the candidate causes before you run a single confirming query.

The fastest path belongs to platform metrics. Metrics are numeric, pre-aggregated, and travel a dedicated near-real-time channel that exists precisely so that autoscale and operational alerts can react quickly. When you compare a metric chart against a log query and the metric looks current while the log query looks behind, you are not seeing a fault; you are seeing two paths with two different timing profiles doing exactly what they are built to do. This is why an alert that needs to fire fast is usually built on a metric, and why a metric-versus-log discrepancy is a poor reason to suspect a problem.

Resource logs take the longer road. They are emitted by the resource, picked up by a diagnostic setting, and routed through the general log ingestion pipeline, where they are validated against the table schema, transformed according to any data collection rule, and committed to the indexed store. Every one of those stages is fast individually, but together they account for the bulk of the few-minute window most engineers observe for log data. The configuration that puts a resource on this road, and the order in which the diagnostic setting and the workspace association must be created, is the subject of the configure diagnostic settings guide, and a setting that is misconfigured does not produce a delay at all, it produces no data, which is the no-data family rather than this one.

Agent-collected data adds a stage on top of the log pipeline: the agent itself. The Azure Monitor Agent, governed by a data collection rule, gathers events on the host, buffers and batches them, and transmits the batch to the ingestion endpoint, after which the records travel the same pipeline as any other log. The agent stage is where host pressure, a slow network path, or an aggressive collection rule turns a modest, expected increment into the source-localized lag covered later as a real cause. Keep the platform shift straight: the current agent is the Azure Monitor Agent driven by data collection rules, which superseded the older Log Analytics agent (also known as the Microsoft Monitoring Agent, or the OMS agent on Linux), and the two have different configuration surfaces even though both buffer and batch in the same general way. Reason in terms of the current agent and its rules; the legacy agent matters only if you are still running an older estate.

Connector data travels the most variable road of all. A connector ingesting from an external system, a security signal feed, or a scheduled import does not stream in real time; it polls or receives on a cadence, often batches large pulls, and depends entirely on the health and rate limits of the upstream system it reads from. The expected window for a connector is therefore whatever its schedule and its upstream allow, which can be far longer than the few minutes you expect from a directly collected source and is still completely normal for that connector. Treating a connector’s scheduled cadence as a delay is one of the most common false alarms in security and integration monitoring, and the cure is to baseline each connector against its own schedule rather than against a directly collected source.
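
A connector's cadence can often be read from how its rows arrive. The sketch below bins rows by ingestion time, not event time, so scheduled pulls show up as bursts at regular intervals. CommonSecurityLog is used here only as a stand-in; substitute whichever table your connector writes to, and treat the one-day lookback and 15-minute bin as assumptions.

```kql
// Sketch: see how a connector's rows actually arrive over time.
// Assumption: CommonSecurityLog stands in for the table your connector writes to.
CommonSecurityLog
| where TimeGenerated > ago(1d)
| extend IngestionTime = ingestion_time()
| summarize Rows = count(),
            OldestEvent = min(TimeGenerated),
            NewestEvent = max(TimeGenerated)
  by IngestBin = bin(IngestionTime, 15m)
| sort by IngestBin asc
```

Evenly spaced bursts, each covering a span of event times, suggest a scheduled pull behaving normally. A burst that is missing where the pattern says one should be is the thing worth investigating.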

Holding this map in mind changes how you read an incident. A workspace-wide elevation across every path at once points downstream, at volume throttling or a pipeline event, because a shared cause is the only thing that moves every path together. An elevation confined to one host points at the agent stage on that host. An elevation confined to one connector points at that connector’s schedule or its upstream. And a metric that is current while a log is behind points at nothing wrong at all, just two paths behaving as designed. Pattern-matching the shape of the elevation against this map narrows the cause before you run a confirming query, and it is exactly the kind of system-level reasoning the DevOps observability article treats as a core operational skill rather than a piece of trivia.

The KQL toolkit for diagnosing an ingestion delay

The diagnosis lives in a handful of KQL patterns, and learning them as a small toolkit rather than reinventing a query each time is what makes the two-minute close possible. Each pattern answers one question, and run in sequence they walk you from “is anything arriving” to “exactly which source is behind and by how much.” Keep them in a saved set so that under incident pressure you are pasting and reading rather than composing.

The foundational pattern is the per-row delay measurement, which you have already met: project ingestion_time() alongside TimeGenerated, subtract them, and read the gap. The variation worth saving is the distribution rather than the raw rows, because a single slow row is noise while a shifted distribution is signal. Percentiles tell you whether the whole stream has moved or just a few outliers have, and the median against the high percentile tells you whether the elevation is broad or tail-heavy.

Heartbeat
| where TimeGenerated > ago(2h)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize p50 = percentile(DelaySeconds, 50),
            p90 = percentile(DelaySeconds, 90),
            p99 = percentile(DelaySeconds, 99),
            MaxDelay = max(DelaySeconds)
  by bin(TimeGenerated, 15m)
| sort by TimeGenerated desc

Binning by fifteen minutes turns this into a trend you can read at a glance: a p50 that has crept up across the bins is a building backlog, a p50 that is steady while p99 spikes is a few slow rows rather than a systemic shift, and a p50 that is climbing toward the p99 is a stream falling behind across the board. This single query distinguishes a transient blip from a genuine, sustained elevation, which is the distinction that decides whether you act at all.

The second pattern is the cross-table sweep, which finds the lagging source without your having to guess which table to inspect. The earlier per-computer query did this within a single table; the generalized form unions across the tables you care about and ranks them by delay, so the outlier surfaces on its own.

union withsource=SourceTable
    Heartbeat, AzureDiagnostics, AppRequests, AppTraces
| where TimeGenerated > ago(1h)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize p50 = percentile(DelaySeconds, 50),
            MaxDelay = max(DelaySeconds),
            Rows = count()
  by SourceTable
| sort by MaxDelay desc

Reading the result is the whole attribution step in one look. If every table sits near baseline, you are not in a delay at all and the empty query that started the incident was a misread. If one table towers over the rest, you have your lagging source and the cause is local to it. If every table is elevated together, the cause is downstream and shared, which sends you to the volume correlation. The union approach is also how you avoid the tunnel-vision trap of staring at the one table you happened to query first while the real outlier sits in a table you never looked at.

The third pattern is the volume correlation, which confirms or rules out throttling under load by tying the delay to the inflow that would produce it. Run the Usage aggregation described under the volume-throttling cause alongside the binned delay query above, over the same window, and compare the shapes. When the high-volume bins are the high-delay bins, throttling is confirmed and the Usage breakdown names the data type to reduce. When volume is flat while the delay is elevated, throttling is ruled out and you look elsewhere, which is just as valuable, because ruling a cause out is half of a fast diagnosis.
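
The comparison of shapes can also be done in one result set by joining volume and delay on a shared time bin. What follows is a minimal stand-in, not the article's own Usage aggregation: it assumes a 24-hour lookback, hourly bins, and Heartbeat as the delay probe, and it relies on the Usage table's Quantity column, which reports ingested volume in megabytes.

```kql
// Sketch: line up hourly ingested volume against hourly median delay.
// Assumptions: the 24h lookback and Heartbeat as the delay probe are placeholders.
let lookback = 24h;
let volume = Usage
    | where TimeGenerated > ago(lookback)
    | summarize IngestedMB = sum(Quantity) by Hour = bin(TimeGenerated, 1h);
let delay = Heartbeat
    | where TimeGenerated > ago(lookback)
    | extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
    | summarize P50Delay = percentile(DelaySeconds, 50) by Hour = bin(TimeGenerated, 1h);
volume
| join kind=inner (delay) on Hour
| project Hour, IngestedMB, P50Delay
| sort by Hour asc
```

Rows where IngestedMB and P50Delay rise together support the throttling hypothesis. Because the join is an inner join, any hour missing from either side is dropped, so the most recent hour may not appear.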

The fourth pattern is the heartbeat gap detector, which catches the boundary case where a delay is actually the early stage of a source going silent. A heartbeat that is merely late is a delay; a heartbeat that has stopped is a no-data problem in progress. Measuring the time since the last heartbeat per host separates the two before the silence is long enough to be obvious.

Heartbeat
| summarize LastHeartbeat = max(TimeGenerated),
            LastIngested = max(ingestion_time()) by Computer
| extend MinutesSinceHeartbeat = datetime_diff('minute', now(), LastHeartbeat)
| extend MinutesSinceIngested = datetime_diff('minute', now(), LastIngested)
| sort by MinutesSinceHeartbeat desc

A host whose last heartbeat is recent in event time and whose last ingestion is also recent is healthy with normal timing. A host whose newest heartbeat event time is old but whose last ingestion is recent is still delivering, only late, which is a delay. A host for which both values are old is going quiet, which is the no-data family and not a delay, and the gap detector catches it early enough to act. Building these four patterns into saved queries, and rehearsing the read of each until it is automatic, is precisely the kind of repeatable diagnostic muscle the hands-on Azure labs and command library on VaultBook exist to develop, so that the toolkit is reflex by the time a real incident lands.

The distinct causes of a genuine ingestion delay

Once the latency-is-expected rule has told you the delay is real, meaning it exceeds your baseline window, the next job is attribution. A genuine delay comes from a small set of distinct causes, and each one has a confirming check that tells you whether it is yours and a tested response that addresses it. The discipline here mirrors the discipline of the whole series: do not apply a fix until a check has confirmed the cause, because applying the wrong fix to a delay you have not confirmed is how you turn a self-resolving latency into a genuine outage. The causes are normal latency masquerading as a problem, a query filtering on event time, agent buffering and batching, ingestion-volume throttling, a connector or pipeline lag for one source, and a dashboard refreshing inside the window. The first two are not really delays at all, they are misreadings, and you have already met both. The remaining causes are real and are where the rest of this section concentrates.

Causes one and two: the delay that is not a delay

The two most frequent “delays” reported are normal latency read as loss and event-time queries that hide in-flight rows, and both have already been diagnosed above. They lead the list deliberately, because in practice the large majority of ingestion-delay incidents resolve to one of these two, and both are misreadings, not faults. Before you spend a minute on the real causes, you confirm you are not in one of these two, because the confirming check is the same fast ingestion_time() query you already ran. If that query shows data arriving with a latency at or near your baseline, you are in the not-a-delay case, and the correct action is to stop, because there is nothing to fix. The reason this is worth restating as a formal cause is that the false alarm is so seductive under pressure that engineers skip the check and jump to the real causes, then “fix” a problem that did not exist and create one that did. Confirm the not-a-delay case first, every time, with the measurement, not the impression.

Can agent buffering or batching cause the delay?

Yes. Collection agents do not send each event the instant it occurs; they buffer events and transmit them in batches on an interval and by size, which is efficient but introduces a delay between the event and its departure from the source. Under backlog, a slow link, or a misconfigured collection rule, that agent-side delay can grow well beyond the normal window.

Agent buffering and batching is the first genuine cause to consider when the latency is real and localized to one source or a set of sources sharing an agent. The agent, whether the current Azure Monitor Agent or a legacy agent, is designed to batch for efficiency, because sending one network request per event would be wasteful and would not scale. It accumulates events and flushes them on a cadence governed by time and volume. In healthy operation this adds a modest, predictable increment to the latency. The increment becomes a problem when the agent cannot flush as fast as events arrive, which happens when the source is generating a high event rate, when the network path to the ingestion endpoint is slow or intermittent, when the agent host is resource-starved and cannot process its own queue, or when a data collection rule is configured in a way that holds or transforms data inefficiently before sending.

The confirming check is the per-source latency profile you ran earlier. If one agent or one host shows a latency far above the others while the rest of the workspace is within baseline, the delay is local to that source, and agent buffering is the leading hypothesis. Confirm it further by checking the agent’s own health on the host: whether the agent service is running, whether the host is under CPU, memory, or disk pressure that would starve the agent’s queue, and whether the agent’s local buffer is backing up. A backing-up buffer is the signature of an agent that is collecting faster than it can transmit, and it produces exactly the growing-latency trend you watch for when you re-run the ingestion_time() query.
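
A per-computer version of the latency measurement makes the outlier easy to spot. What follows is a minimal sketch, assuming the agents in question write to the Heartbeat table; the one-hour lookback and the choice of percentiles are illustrative assumptions rather than values from your baseline.

```kusto
// Per-computer ingestion latency from agent heartbeats.
// Filter on ingestion time so rows that arrived late are still counted.
Heartbeat
| where ingestion_time() > ago(1h)          // assumed lookback; adjust to the incident
| extend LatencySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize P50Seconds = percentile(LatencySeconds, 50),
            P95Seconds = percentile(LatencySeconds, 95),
            Heartbeats = count()
            by Computer
| sort by P95Seconds desc
```

A host whose P95 sits far above the rest, or whose heartbeat count is well below its peers, is the one to inspect on the host itself.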

# Confirm the Azure Monitor Agent service is running and inspect host pressure
# Linux
systemctl status azuremonitoragent
top -b -n 1 | head -20

# Windows (PowerShell)
Get-Service AzureMonitorAgent
Get-Counter '\Processor(_Total)\% Processor Time'

The fix depends on which constraint you confirmed. If the host is resource-starved, the answer is to relieve the pressure on the host, because the agent cannot flush a queue it has no CPU to process; right-size the host or move the noisy neighbor causing the contention. If the network path is the constraint, the answer is to address the path, because a slow or flapping link to the ingestion endpoint will slow every batch; confirm reachability and latency to the endpoint from the host. If the event rate genuinely exceeds what a single agent can transmit, the answer is to reduce what you collect at the source through the data collection rule, filtering out the noise you do not need rather than paying latency to ship telemetry you will never query. The order of operations for setting up those collection rules correctly, so that the agent collects the right data efficiently in the first place, is covered in the guide to configuring diagnostic settings across Azure, and a collection rule tuned to send only what matters is the most durable prevention for agent-side latency.

Does high ingestion volume add latency?

Yes. The ingestion pipeline has rate limits, and when the volume of data flowing into a workspace climbs past the pipeline’s per-workspace ingestion rate, the platform applies throttling that holds the excess and processes it as capacity frees up, which extends the end-to-end latency for everything in that workspace until the backlog drains.

Ingestion-volume throttling is the cause to suspect when the latency is real, elevated, and shared across every source in the workspace at the same time, rather than localized to one agent. The signature is distinctive: a workspace-wide latency increase that correlates with a spike in ingested volume, often tied to a deployment that turned on verbose logging, a new high-volume source that was just onboarded, a debug flag left enabled in production, or a runaway component emitting far more telemetry than usual. In the case this section describes, the pipeline does not drop the excess; it queues and processes it, so the data is not lost, but the queue depth shows up as latency, and the latency persists until the inflow drops back below the rate the pipeline can sustain or the backlog drains. Sustained inflow far beyond the workspace’s ingestion rate limit is a different situation that can cross from latency into loss, so confirm the current limit and its behavior against Azure’s documentation before assuming every overage is merely queued.

Confirm it by correlating latency with volume. Chart the ingested volume over the period in question and overlay the measured latency, and if the latency rises and falls with the volume, throttling under load is the cause. The Usage table records ingested volume per data type, in hourly records, and is the right source for this correlation.

Usage
| where TimeGenerated > ago(24h)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1h), DataType
| sort by TimeGenerated desc

Run that alongside an hourly aggregation of the latency from ingestion_time() and the overlap is usually unmistakable: the hours of highest volume are the hours of highest latency. Having confirmed throttling under volume, the fix is to attack the volume, not to ask for more pipeline. Identify the data type and the source driving the spike from the Usage breakdown, then reduce it at the source: turn off the verbose or debug logging that does not earn its keep, filter high-cardinality noise in the data collection rule before it is sent, sample high-volume traces rather than ingesting every one, and route genuinely separate high-volume workloads to their own workspace so a noisy source does not extend latency for everything else. The durable principle is that ingestion latency under load is a volume problem, and volume problems are solved by collecting less of what you do not need, which also reduces the cost the volume was driving in parallel.
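
The hourly latency aggregation mentioned above could look something like the following. This is a sketch under assumptions: Heartbeat is used as a stand-in for workspace-wide timing, and the percentiles are illustrative; substitute the table whose delay you are investigating.

```kusto
// Hourly latency profile to set beside the hourly Usage volume.
// Heartbeat is an assumed stand-in; swap in the affected table.
Heartbeat
| where TimeGenerated > ago(24h)
| extend LatencySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize P50Seconds = percentile(LatencySeconds, 50),
            P95Seconds = percentile(LatencySeconds, 95)
            by bin(TimeGenerated, 1h)
| sort by TimeGenerated desc
```

Both result sets are keyed on the same one-hour bins, so they can be read side by side or charted on a shared time axis.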

Can a single connector or source lag while everything else is fine?

Yes. A data connector that pulls from an external or upstream system on its own schedule inherits that schedule’s cadence and any backlog in the upstream system, so a single connector can lag by far more than the ingestion pipeline’s own latency while every directly collected source in the same workspace stays within baseline.

A source-specific or connector lag is the cause when the per-source profile shows one data type or one connector far behind while everything else is current. Connectors that ingest from external systems, security signal sources, third-party platforms, or scheduled imports do not flow through the same near-real-time path as a metric or an agent heartbeat. They poll or receive on a cadence, they may batch large pulls, and they are subject to delays and backlogs in the upstream system they read from. When that upstream system is slow, rate-limited, or itself backlogged, the connector’s data arrives late even though the Azure ingestion pipeline downstream of it is perfectly healthy. The latency you measure for that one data type will be large while the workspace-wide latency for everything else is normal, which is the fingerprint that distinguishes a connector lag from a pipeline-wide throttle.

Confirm it the same way you confirm everything else, by reading the latency per data type rather than per workspace.

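union withsource=SourceTable *
| where TimeGenerated > ago(1d)       // generous event-time bound keeps the scan cheap
| where ingestion_time() > ago(2h)    // rows committed in the last two hours, however old the event
| extend LatencySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize MaxLatency = max(LatencySeconds), AvgLatency = avg(LatencySeconds), Rows = count() by SourceTable
| sort by MaxLatency desc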

The table at the top of that result with a latency far above the others, while the rest sit near baseline, is your lagging connector. The fix is rarely on the Azure side, because the Azure side is healthy; the lag lives in the connector’s schedule or the upstream system. Check the connector’s configured polling or collection interval, because a connector set to pull every hour will, by definition, show up to an hour of latency that is expected for that connector and abnormal only relative to near-real-time sources. Check the health and rate limits of the upstream system the connector reads from, because a throttled or backlogged source delays everything the connector can pull. And recalibrate your baseline for that data type to match the connector’s real cadence, so you stop alarming on a connector that is behaving exactly as configured. The lesson is that “normal” latency is per source, and a connector’s normal is whatever its schedule and its upstream allow, not the few-minute window you expect from a directly collected source.
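
One way to see a connector’s real cadence is to bin its rows by arrival time rather than event time. The sketch below assumes a hypothetical custom table named MyConnector_CL; the six-hour lookback and five-minute bins are also assumptions to adjust.

```kusto
// Arrival cadence for one connector's table.
// "MyConnector_CL" is a hypothetical table name; substitute your own.
MyConnector_CL
| where TimeGenerated > ago(1d)              // event-time bound to limit the scan
| extend IngestedAt = ingestion_time()
| where IngestedAt > ago(6h)
| summarize Rows = count(),
            OldestEvent = min(TimeGenerated),
            NewestEvent = max(TimeGenerated)
            by bin(IngestedAt, 5m)
| sort by IngestedAt desc
```

If rows land in bursts spaced roughly one polling interval apart, with each burst carrying events from the whole preceding interval, the connector is behaving as scheduled and the baseline for that table should reflect the interval.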

Why does my dashboard show a hole that fills itself in a few minutes?

Because the dashboard refreshes inside the ingestion latency window, querying the most recent minutes before those rows have finished ingesting, so it renders a hole at the leading edge of the time range. The hole is not missing data; it is the latency window made visible, and it fills itself as ingestion completes and the next refresh runs.

The self-healing dashboard hole is the last of the common causes, and it is purely a presentation artifact of querying too close to now. A dashboard or workbook set to show “the last 15 minutes” and refresh every minute will, on every refresh, query right up to the present moment. The most recent few minutes of that range are inside the ingestion latency window, so their rows are not yet committed and the chart shows a dip or a gap at the right-hand edge. A minute later the next refresh runs, those rows have now ingested, the gap has moved one minute to the right, and the previous gap has filled. To anyone watching the dashboard in real time, it looks like data is arriving late and then catching up, which is precisely what is happening, and precisely what is supposed to happen given the latency.

The confirmation is trivial: widen the dashboard’s time range and watch the leading-edge hole persist only at the very edge while the rest of the range is solid, or query the same data filtered on ingestion_time() and watch the hole disappear. The fix is to stop querying inside the latency window for any visualization where a leading-edge gap is alarming. Add a small offset so the dashboard’s “now” is actually a few minutes ago, past the latency window, by filtering on TimeGenerated < ago(5m) for the leading edge or by setting the visualization to lag real time by your baseline window. This trades a few minutes of recency for a chart with no self-healing holes, which is the right trade for almost every operational dashboard, because nobody makes a decision on the most recent thirty seconds of a trend line, and the false alarm the hole generates costs more than the recency it preserves. Building dashboards and alerts that respect the latency window is one of the practices the observability article treats in depth, and it is the single most effective way to stop a healthy pipeline from generating false delay alerts.
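
The offset can be expressed directly in the query behind the tile. This is a minimal sketch, assuming a five-minute latency window and a request-count chart on AppRequests; both the window and the chart span are placeholders for your own measured values.

```kusto
// Chart a 15-minute span that ends at the edge of the latency window, not at "now".
let LatencyWindow = 5m;   // assumed baseline window; use your own measured value
let ChartSpan = 15m;
AppRequests
| where TimeGenerated between (ago(ChartSpan + LatencyWindow) .. ago(LatencyWindow))
| summarize Requests = count() by bin(TimeGenerated, 1m)
| render timechart
```

Keeping the window in a single let statement means the offset can be retuned in one place when the baseline changes.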

A worked ingestion-delay incident from alarm to resolution

Theory becomes reflex only when you have walked the path once end to end, so here is a complete incident built from the patterns above, written the way it actually unfolds at a terminal rather than as a tidy summary. Treat it as a script you can follow the first few times until the sequence is automatic.

The page arrives: an on-call engineer reports that the production application’s request logs have stopped appearing for the last ten minutes, and a deployment went out twenty minutes ago, so the suspicion is that the deployment broke logging. The tempting first move, rolling back the deployment or recreating the diagnostic setting, is exactly the move the latency-is-expected rule forbids until a measurement has been taken. The discipline is to read the signal first.

The first query is the flow confirmation against the heartbeat, asking whether anything is arriving at all.

Heartbeat
| where ingestion_time() > ago(10m)
| summarize LastIngested = max(ingestion_time()), Rows = count()

The result shows heartbeats committed within the last two minutes and a healthy row count. That single result has already changed the diagnosis: the ingestion path is alive and current, so this is not a dead pipeline and not a no-data problem, which means rolling back the deployment would have fixed nothing. The gap is either normal latency or a real delay, and the next query decides which.

The second query measures the actual delay on the specific table that triggered the alarm, on ingestion time so that in-flight rows are visible.

AppRequests
| where ingestion_time() > ago(15m)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize p50 = percentile(DelaySeconds, 50),
            p99 = percentile(DelaySeconds, 99),
            Rows = count()

The result returns rows, which immediately disproves “logs have stopped,” and the p50 delay reads a little over four minutes against a documented baseline of roughly two to three minutes for this table. So the application logs are arriving, slightly slower than usual but not dramatically so, and the engineer’s “stopped for ten minutes” was an event-time query run inside the latency window that simply could not see the most recent rows yet. At this point a large fraction of incidents are already closed: the data is present, the delay is mild, and the correct action is to inform the on-call engineer that nothing is broken and to wait one window. But the p50 is above baseline, so a careful engineer spends one more query confirming whether the mild elevation is the deployment’s doing or background noise.

The third query is the volume correlation, because a deployment that turned on verbose logging is the classic driver of a volume-throttling elevation that tracks a deploy.

// Usage is aggregated hourly, so read fine-grained volume from the table itself.
AppRequests
| where TimeGenerated > ago(1h)
| summarize IngestedMB = sum(_BilledSize) / 1024.0 / 1024.0, Rows = count() by bin(TimeGenerated, 5m)
| sort by TimeGenerated desc

The result shows the application’s log data type roughly tripling in volume in the five-minute bins right after the deployment. That is the confirmation: the deployment did not break logging, it turned the logging volume up, and the increased inflow pushed the delay modestly above baseline as the pipeline absorbed the extra load. The diagnosis is complete and rests entirely on measurements: the path is healthy, the data is arriving, the delay is mild and is driven by a volume increase that the deployment introduced.

The resolution follows from the cause rather than from the panic. The immediate action is to communicate that there is no outage and no data loss, only a mild, self-resolving elevation, which stands the on-call engineer down from a rollback that would have fixed nothing. The durable action is to look at what the deployment started logging and decide whether the tripled volume is worth its latency and its cost; if it is a debug flag left enabled, the fix is to turn it off and the delay returns to baseline on its own, and if it is genuinely useful telemetry, the fix is to accept the slightly higher baseline for that table or to tune the collection rule to keep the signal and drop the noise. Either way, nothing was rebuilt, nothing was lost, and the whole incident took three queries and a few minutes, which is the difference the discipline makes. Rehearsing this exact sequence against a reproduced incident, where you generate the volume spike yourself and watch the three queries tell the story, is what the scenario-based troubleshooting drills on ReportMedic are designed to build, so that the calm, measured walk replaces the panic the very first time you meet this in production.

The InsightCrunch ingestion-latency table

The findable artifact for this article is the InsightCrunch ingestion-latency table, which collapses the whole diagnosis into one reference you can keep open during an incident. Each row is a contributor to what looks like a delay, paired with the check that confirms whether it is yours and the verdict on whether action is actually warranted. The table encodes the latency-is-expected rule directly: the first two rows are not faults and warrant no action beyond confirmation, while the real causes each carry a targeted response. Read it top to bottom during an incident and you will rarely reach the bottom, because most delays resolve in the first two rows.

| Contributor | What it looks like | How to confirm it | Action warranted? |
| --- | --- | --- | --- |
| Normal expected latency | Recent rows absent, data arrives within baseline if you wait | `ingestion_time()` shows latency at or near baseline | No. Wait or query the slightly older complete interval |
| Query filters on event time | Most recent rows missing despite events occurring | Same query with `where ingestion_time() > ago(5m)` returns the rows | No. Filter or check on ingestion time during delay investigations |
| Agent buffering and batching | One source or host lags, others are current | Per-computer latency profile shows one outlier; agent host under pressure | Yes, if real. Relieve host pressure, fix the network path, or reduce collection in the data collection rule |
| Ingestion-volume throttling | Whole workspace lags, latency tracks a volume spike | `Usage` volume correlates with elevated latency across all sources | Yes. Reduce volume at the source, sample, or split workspaces |
| Connector or source lag | One data type far behind, everything else current | Per-table latency profile shows one table as the outlier | Yes, relative to its cadence. Check connector interval and upstream health; recalibrate that table’s baseline |
| Dashboard refresh inside window | Self-healing hole at the leading edge of a live chart | Widen the range or query on ingestion time and the hole disappears | No. Offset the dashboard to lag real time by the baseline window |

The table is also the spine of a hands-on lab, because every row maps to a reproduction you can run. You can generate a controlled burst of telemetry and watch the volume-throttling row light up, throttle an agent host and watch the agent-batching row appear, and build a dashboard that queries inside the window and watch the self-healing hole form and fill. Running each row’s reproduction until you can produce and confirm it on demand is how the diagnosis becomes muscle memory rather than a checklist you consult under stress, and the hands-on Azure labs and command library on VaultBook are built to let you reproduce each contributor and read its signal in a sandbox before you meet it in production.

How to prevent ingestion-delay false alarms and real delays

Prevention splits cleanly along the same line the whole article runs on: preventing the false alarms, which are misreadings of normal latency, and preventing the real delays, which are genuine pipeline conditions. The two require different work, and most teams under-invest in the first, which is why most of their ingestion-delay incidents are false alarms that should never have been raised.

Preventing false alarms is mostly a matter of building latency awareness into the tools the team looks at every day, so that normal latency never reads as a problem. Capture a per-table latency baseline once, when the workspace is healthy, and document it where the on-call engineer will find it, because an engineer who knows the baseline is a few minutes will not panic at a few-minute gap. Build every operational dashboard to lag real time by the baseline window, so the leading-edge hole never forms and nobody chases a self-healing gap. Write any “recent data” health check to test on ingestion_time() rather than TimeGenerated, so the check sees what has actually been committed and does not fire on data that is merely in flight. Set any latency alert to trigger only when the measured latency exceeds the baseline by a real margin and stays there, not on a single sample inside the normal window, because a one-shot alert on normal latency trains the team to ignore the alert entirely. These four habits, baseline documented, dashboards offset, health checks on ingestion time, alerts thresholded above baseline, eliminate the large majority of ingestion-delay incidents because the large majority are false alarms.
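
A “recent data” health check written against ingestion time might look like the sketch below. It assumes agent heartbeats are the freshness signal; the ten-minute allowance is a hypothetical value standing in for your baseline plus a margin.

```kusto
// Sources whose most recent committed heartbeat is older than the allowed lag.
let AllowedLag = 10m;     // hypothetical: baseline plus a margin
Heartbeat
| where ingestion_time() > ago(1h)   // look back far enough to still see a stale source
| summarize LastIngested = max(ingestion_time()) by Computer
| extend LagSinceIngest = now() - LastIngested
| where LagSinceIngest > AllowedLag
| sort by LagSinceIngest desc
```

An empty result means every source seen in the last hour is current. A source that has been silent for longer than the lookback drops out of the result entirely, so that case belongs to a no-data check rather than this one.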

Preventing real delays is a matter of managing the two genuine workspace-wide and source-level conditions: volume and agent health. On volume, the durable prevention is to collect deliberately rather than collect everything, because the volume that drives throttling under load is almost always volume you did not need. Tune your data collection rules and diagnostic settings to send the log categories and the telemetry you actually query, filter high-cardinality noise before it leaves the source, sample high-volume traces, and keep verbose and debug logging out of steady-state production. Watch the Usage table for volume trends and investigate any step change in ingested volume the moment it appears, because a volume spike caught early is a configuration to fix, while a volume spike caught late is an incident. On agent health, keep the agent hosts right-sized so the agent always has the CPU and memory to flush its queue, keep the agents current, and monitor the heartbeat latency itself as a leading indicator, because rising heartbeat latency on a host is the earliest sign that that host’s agent is starting to back up.
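
Watching Usage for a step change can be reduced to a comparison of the latest full day against the days before it. The following is a sketch, assuming a seven-day comparison period and a doubling as the threshold; both numbers are hypothetical and should be tuned to how variable your volume normally is.

```kusto
// Flag data types whose volume yesterday was well above their prior-week average.
let StepFactor = 2.0;                      // hypothetical threshold for a "step change"
let Yesterday = startofday(ago(1d));
Usage
| where TimeGenerated >= startofday(ago(8d)) and TimeGenerated < startofday(now())
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by DataType, Day = startofday(TimeGenerated)
| summarize LatestGB = sumif(DailyGB, Day == Yesterday),
            PriorAvgGB = avgif(DailyGB, Day < Yesterday)
            by DataType
| where PriorAvgGB > 0 and LatestGB > PriorAvgGB * StepFactor   // data types with no history are skipped
| extend Ratio = round(LatestGB / PriorAvgGB, 1)
| sort by Ratio desc
```

Because Usage is recorded hourly, the same shape works with hourly bins if a daily comparison reacts too slowly for a given workload.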

There is a structural prevention that addresses both volume and source isolation at once: workspace topology. Routing a genuinely high-volume or latency-sensitive workload to its own workspace prevents a noisy source from extending latency for everything else and lets you tune ingestion and retention for that workload independently. The trade-off is more workspaces to manage and cross-workspace queries for correlation, so this is a deliberate design choice rather than a default, but for a workload where ingestion latency under load has been a recurring real delay, isolating it is often the cleanest fix. The workspace design model that informs that choice is laid out in the Azure Monitor and Log Analytics guide, and the at-scale configuration that keeps collection rules consistent across many resources is covered in the configure diagnostic settings article.

Monitoring the ingestion pipeline so a real delay surfaces and a fake one does not

The final operational maturity step is to monitor ingestion timing itself, turning the ad hoc queries you run during an incident into a standing signal that tells you about a genuine delay before a human notices and that never fires on normal latency. Most teams monitor what their telemetry reports and never monitor the timing of the telemetry’s own delivery, which means they learn about a real pipeline delay only when an engineer happens to query inside the window and panics. Closing that gap converts the ingestion delay from a recurring surprise into a measured property with a known normal range and an alert that fires only when the range is genuinely exceeded.

The signal to monitor is the measured delay distribution per table, computed exactly as the toolkit queries compute it, evaluated on a schedule and compared against the baseline you captured for that table. The reason to monitor the distribution rather than a single sample is the same reason you read percentiles during an incident: one slow row is noise, a shifted distribution is signal, and an alert built on a single sample will fire constantly on the tail and train the team to ignore it. Compute the median and a high percentile on a rolling window, compare each against the table’s documented baseline, and raise the signal only when the median sits above baseline by a real margin and stays there across consecutive evaluations. That persistence condition is what separates a brief, self-resolving blip during a transient volume bump from a sustained elevation that indicates a backing-up agent or genuine throttling under load.
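
The persistence condition can be approximated inside a single scheduled query by splitting the recent period into windows and requiring every window to breach. This is one possible sketch, not a prescribed rule: the baseline, the margin, and the window size are all hypothetical values.

```kusto
// Return a row only when the median delay is above baseline in every recent window.
let BaselineP50Seconds = 180;   // hypothetical baseline for this table
let Margin = 2.0;               // hypothetical: a breach is a median above 2x baseline
let WindowSize = 5m;
AppRequests
| where ingestion_time() > ago(15m)
| extend IngestedAt = ingestion_time()
| extend DelaySeconds = datetime_diff('second', IngestedAt, TimeGenerated)
| summarize P50Seconds = percentile(DelaySeconds, 50),
            P95Seconds = percentile(DelaySeconds, 95)
            by Window = bin(IngestedAt, WindowSize)
| summarize Windows = count(),
            BreachedWindows = countif(P50Seconds > BaselineP50Seconds * Margin),
            WorstP50Seconds = max(P50Seconds)
| where Windows >= 3 and BreachedWindows == Windows
```

A single slow window produces no row, which is the point: only a shifted distribution that holds across the whole period surfaces. A table that has gone completely silent also produces no row, so pair this with a freshness check.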

Think of the normal window as an informal objective for delivery timing rather than a hard guarantee, because the platform does not promise a fixed number and the durable framing is your own expectation rather than a quoted figure. Treat the baseline as the target, treat a sustained breach of it as the event worth knowing about, and treat everything inside it as success that needs no attention. Framing it this way also resolves the perennial argument about how patient to be: you are not patient on faith, you are patient up to the objective and alerted past it, which is a defensible position in a design review rather than a shrug.

The signal needs a deliberate scope, because a single workspace-wide number hides exactly the information that makes attribution fast. Compute the delay per table or per major data path, since a connector’s normal cadence and an agent table’s normal cadence are different objectives and folding them together produces a number that is meaningless for both. Compute it per source where the volume justifies the granularity, because a per-source signal is what tells you instantly whether an elevation is one host or the whole workspace, which is the first fork in the attribution tree. The cost of this granularity is more signals to manage, so scope it to the tables and sources where a delay actually matters operationally rather than instrumenting everything, and let the less critical paths ride on the workspace-wide signal.
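
Per-table scoping can be sketched by joining the measured delay against a small lookup of documented baselines. The table list and the baseline figures below are hypothetical placeholders; only the shape of the comparison is the point.

```kusto
// Compare each table's measured median delay with its own documented baseline.
let Baselines = datatable(SourceTable: string, BaselineP50Seconds: long) [
    "Heartbeat",   60,     // hypothetical baseline
    "AppRequests", 180     // hypothetical baseline
];
union withsource=SourceTable Heartbeat, AppRequests
| where TimeGenerated > ago(30m)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize P50Seconds = percentile(DelaySeconds, 50), Rows = count() by SourceTable
| join kind=inner Baselines on SourceTable
| extend OverBaselineSeconds = P50Seconds - BaselineP50Seconds
| project SourceTable, P50Seconds, BaselineP50Seconds, OverBaselineSeconds, Rows
| sort by OverBaselineSeconds desc
```

Each table is judged against its own objective, so a connector with an hourly cadence and an agent table with a one-minute cadence never share a threshold.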

The alert that rides on the signal should be written to be actionable, which means it should tell the responder which fork of the diagnosis they are already on. An alert that says only “ingestion is slow” sends the responder back to the start; an alert that says “table X delay p50 is twelve minutes against a three-minute baseline, sustained for fifteen minutes, while volume on table X has tripled” hands the responder the volume cause already half-confirmed. Encoding the baseline, the breach, the duration, and the correlated volume into the alert payload turns the page itself into the first three queries of the diagnosis, which is the difference between a responder who starts cold and one who starts on third base. This is the practice the DevOps observability article treats as the standard for any operational alert: the alert carries enough context to be the beginning of the diagnosis rather than merely an announcement that something is wrong.
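
A query that feeds such an alert could assemble the context in one row. The sketch below is illustrative only: the baseline is hypothetical, row count is used as a rough proxy for volume, and the duration would come from the persistence condition of the alert, which is not repeated here.

```kusto
// Build one row carrying the measured delay, the baseline, and a volume comparison.
let BaselineSeconds = 180;      // hypothetical baseline p50 for this table
let PriorRowsPer15m = toscalar(
    AppRequests
    | where ingestion_time() between (ago(75m) .. ago(15m))
    | count
) / 4.0;                        // average rows per 15 minutes over the prior hour
AppRequests
| where ingestion_time() > ago(15m)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize P50Seconds = percentile(DelaySeconds, 50), RecentRows = count()
| extend BaselineP50Seconds = BaselineSeconds,
         VolumeRatio = round(RecentRows / PriorRowsPer15m, 1)
| extend Summary = strcat("AppRequests delay p50 ", tostring(P50Seconds), "s against a ",
                          tostring(BaselineP50Seconds), "s baseline; row volume ",
                          tostring(VolumeRatio), "x the prior hour's 15-minute average")
| project P50Seconds, BaselineP50Seconds, RecentRows, VolumeRatio, Summary
```

The Summary column is what a responder would read first, and the numeric columns remain available for thresholds.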

There is a second-order benefit to monitoring ingestion timing that pays off independently of any delay incident. The same per-table volume and timing signals are the earliest indicators of a configuration mistake that would otherwise show up first as a cost surprise, because the volume spike that drives a throttling delay is the same volume spike that drives the bill. A team watching ingestion volume per table catches the debug flag left on in production within an hour as a timing and volume anomaly, rather than at the end of the month as a cost anomaly, which is a far cheaper place to catch it. Monitoring the pipeline’s timing, in other words, is also monitoring the pipeline’s cost, and the same instrumentation serves both, which is what makes it worth the effort to build once and keep.

The failures this is most often confused with

An ingestion delay sits in a small neighborhood of monitoring symptoms that all present as “the data I expected is not there,” and distinguishing them is most of the diagnostic skill, because the correct action for each is different and applying one family’s fix to another family’s problem makes things worse. Three confusions account for nearly all the misdiagnosis.

The first and most consequential confusion is between latency and loss. Latency means the data is in flight and will arrive; loss means the data will never arrive because something upstream failed to send it or routed it elsewhere. The two look identical in an empty query result, and the entire latency-is-expected rule exists to separate them: if the data arrives when you wait and widen the window, it was latency, and the correct action was to wait; if the data never arrives no matter how long you wait, it was loss, and you have crossed into the no-data family. The cost of confusing them runs both ways. Treat latency as loss and you rebuild a healthy pipeline, which wastes time and risks introducing a real fault. Treat loss as latency and you wait indefinitely for data that is never coming, which delays the real fix. The test that separates them is the same one stated at the top of this article: does the data eventually arrive? Run the ingestion_time() check, wait past the baseline window, and let the answer to that one question route you to the right family.
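
The eventual-arrival test can be run against a window that is comfortably older than the baseline. This is a minimal sketch, assuming AppRequests is the table in question and that a window 30 to 40 minutes old is well past its normal latency; choose a window in which you know events occurred.

```kusto
// Did rows for a past event window ever arrive, and how late?
AppRequests
| where TimeGenerated between (ago(40m) .. ago(30m))   // assumed window, older than the baseline
| summarize Rows = count(),
            FirstIngested = min(ingestion_time()),
            LastIngested = max(ingestion_time()),
            MaxDelaySeconds = max(datetime_diff('second', ingestion_time(), TimeGenerated))
```

Rows present with a plausible delay points to latency. Zero rows for a window in which events certainly happened, long after the baseline has passed, points to the loss family.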

The second confusion is between an ingestion delay and a genuine no-data problem, which is the loss family in detail. When data never arrives, the causes are not timing causes at all: a diagnostic setting that was never configured so nothing is being sent, an agent that is not running or not connected so it collects nothing, telemetry routed to a different workspace than the one you are querying so it arrives somewhere you are not looking, or a data collection rule that does not associate the resource so the agent never collects from it. None of these is a latency problem and none is fixed by waiting; each is fixed by correcting the configuration that stopped the data from being sent in the first place. The full diagnosis of that family lives in the fix Azure Monitor missing logs and metrics article, and the clean handoff between the two articles is the eventual-arrival test: this article owns the case where data arrives late, that article owns the case where data does not arrive at all.

The third confusion is between an ingestion delay and a query that is simply wrong, which is the most embarrassing to discover and the easiest to rule out. A query that filters on the wrong time column, scopes to the wrong table, applies a filter that excludes the rows you want, or runs against the wrong workspace will return nothing while the data sits perfectly ingested and queryable. This is not a delay and not a loss; it is a query defect. The fastest way to rule it out is to strip the query down to the table and a wide time range with no other filters, confirm the rows are present, and then add filters back one at a time until you find the one that was excluding your data. Most “the data is missing” reports that are neither latency nor loss turn out to be a query that asked the wrong question, and the broad-then-narrow technique finds the offending clause in seconds.
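
The broad-then-narrow technique might look like this in practice. The filters shown are hypothetical examples of clauses from a suspect query; the idea is to start with them commented out and restore them one at a time.

```kusto
// Step 1: table and a wide time range only. Confirm rows exist at all.
// Step 2: uncomment one filter at a time and re-run until the rows disappear.
AppRequests
| where TimeGenerated > ago(24h)
// | where ResultCode == "500"            // hypothetical filter from the original query
// | where Name == "GET /checkout"        // hypothetical filter from the original query
| summarize Rows = count(),
            OldestEvent = min(TimeGenerated),
            NewestEvent = max(TimeGenerated)
```

The filter whose return drops Rows to zero is the clause that was excluding the data. If step 1 itself returns nothing, check the workspace scope and the table name before suspecting ingestion.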

Closing verdict

A Log Analytics ingestion delay is, in the great majority of cases, not a fault but the normal latency of a buffered, batched, pipelined ingestion system being mistaken for data loss. The single most valuable habit you can build is the latency-is-expected rule: before you treat a gap as a problem, measure the observed latency against your baseline window, and act only when the measurement says the situation is genuinely abnormal. That ordering, measure first, attribute second, act third, converts the most common false alarm in Azure Monitor operations into a two-minute confirmation, and it stops the destructive reflex of rebuilding a healthy pipeline to chase data that was about to arrive on its own.

When the delay is real, the causes are few and each has a confirming check. A query filtering on event time hides in-flight rows, and switching the check to ingestion time reveals them. Agent buffering and batching delays one source, confirmed by the per-computer latency profile and fixed by relieving host pressure, fixing the path, or collecting less. Ingestion-volume throttling delays the whole workspace, confirmed by correlating latency with the Usage volume and fixed by reducing volume at the source. A lagging connector delays one data type, confirmed by the per-table profile and fixed by aligning expectations to the connector’s cadence. A dashboard refreshing inside the window shows a self-healing hole, fixed by offsetting the visualization past the latency window. Keep the InsightCrunch ingestion-latency table open during an incident, work it top to bottom, and you will resolve almost every case before you reach the real causes, because almost every case is a misreading the first two rows catch. Reproduce each row until the diagnosis is automatic, build your dashboards and alerts to respect the latency window, and the ingestion delay stops being an incident and becomes a number you already know.

Frequently Asked Questions

Q: Why is there a delay before logs appear in Log Analytics?

Because the data has to travel through a multi-stage pipeline before a query can read it, and each stage adds time. The source generates the event, a collection agent or platform pipeline buffers and batches it, the batch is transmitted to the ingestion endpoint, the pipeline validates and routes the records according to the schema and data collection rule, and finally the records are committed to the indexed store that queries read from. Only after that final commit is the row queryable. The total elapsed time across those stages is the ingestion latency, and it is the system working as designed, not a fault. The delay is normal and bounded; what makes it abnormal is when the measured latency exceeds your workspace’s usual baseline, which is the signal to start attributing it to a specific cause rather than waiting.

Q: What ingestion latency is normal versus a real problem in Log Analytics?

Normal latency is a small, stable delay measured from your own workspace on a quiet day, commonly on the order of a few minutes for most data types, with some collection paths faster and some slower. A real problem is a latency that sits well above that baseline, keeps growing across repeated checks rather than holding steady, or affects data that should be near real time. The right approach is to capture your own per-table baseline when nothing is wrong, because the exact expected window depends on the data type and the collection method, and the precise published figures change as Azure evolves the platform. Confirm the current official latency guidance against Azure’s documentation, but treat the latency your own ingestion_time() queries report on a normal day as your operational baseline, since that is what tells you instantly during an incident whether the latency is ordinary or elevated.

Q: How do I measure the actual ingestion latency for my workspace?

Use the ingestion_time() function in KQL, which returns the time each record was committed to the store, and compute the difference against TimeGenerated, the event time. Project both columns and subtract them with datetime_diff to get the per-row latency in seconds, then sort by ingestion time and inspect the most recent rows. The Heartbeat table is the best starting point because every connected agent writes heartbeats on a steady cadence, giving you a reliable clock for the whole path. Re-run the query a few times over a couple of minutes to read the trend: a stable small latency is your baseline, a shrinking latency is a backlog draining, and a growing latency is a backlog building. This single measurement replaces the impression of “it feels slow” with the actual number your pipeline is producing right now, which is the foundation of every correct ingestion-delay diagnosis.
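
A minimal sketch of that measurement follows. The one-hour lookback and the 50-row cap are arbitrary choices for illustration, and the column names IngestedAt and LatencySeconds are introduced here rather than taken from any built-in schema.

```kusto
Heartbeat
| where TimeGenerated > ago(1h)
| extend IngestedAt = ingestion_time()                                        // commit time
| extend LatencySeconds = datetime_diff('second', IngestedAt, TimeGenerated)  // commit time minus event time
| project Computer, TimeGenerated, IngestedAt, LatencySeconds
| sort by IngestedAt desc
| take 50
```

Re-running it a few times and comparing the LatencySeconds values at the top of the result is what gives you the trend rather than a single reading.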

Q: What is the difference between ingestion_time() and TimeGenerated?

TimeGenerated is the event time, the timestamp of when the thing the record describes actually happened at the source, and it is the column you normally filter and chart on because you care about when events occurred in the real world. ingestion_time() is a function that returns the time the ingestion pipeline committed the record and made it queryable. The gap between the two is the latency for that record, and it is normally non-negative because a record cannot be queryable before its event happened; a negative value usually means the source clock is running ahead, not that the pipeline is fast. The two answer different questions: TimeGenerated answers when it happened, ingestion_time() answers when you could first see it. Use event time for everyday queries, dashboards, and alerts about system behavior, and reach for ingestion time specifically when you are investigating a delay, because that is the lens that reveals rows that have arrived but whose event time falls outside your event-time filter window.

Q: Why does my query miss the most recent rows even though the events happened?

Because the query filters on TimeGenerated, the event time, and the records for those most recent events have not finished ingesting, so they are not yet committed to the store for the query to return. The events occurred and their event time falls inside your filter, but they are still inside the ingestion latency window, so the query engine finds no committed rows and returns an empty or short result. Nothing is lost; the records are in flight. Re-run the identical query after the latency window closes and the rows appear because they are now committed and their event time still falls in the window. To check correctly during a delay investigation, filter on ingestion_time() instead of TimeGenerated, which asks what has actually been committed rather than what happened, and returns every row that has arrived so far, including late arrivals whose event time falls outside a narrow event-time filter.
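
The two views can be compared side by side with a sketch like the one below. Heartbeat and the 15-minute and one-day windows are illustrative assumptions; the same pattern applies to whichever table looked empty.

```kusto
// Event-time view: events from the last 15 minutes that are already committed.
Heartbeat
| where TimeGenerated > ago(15m)
| summarize RowsByEventTime = count()

// Ingestion-time view: rows committed in the last 15 minutes, whatever their event time.
Heartbeat
| where TimeGenerated > ago(1d)          // wide event-time bound, only to limit the scan
| where ingestion_time() > ago(15m)
| summarize RowsByIngestionTime = count(),
            OldestEvent = min(TimeGenerated),
            NewestEvent = max(TimeGenerated)
```

If the second query shows rows steadily committing while the first looks short, the gap between NewestEvent and now is a direct reading of the current latency.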

Q: Can agent buffering or batching cause an ingestion delay?

Yes, and it is the first genuine cause to consider when the delay is real and localized to one source. Collection agents batch events for efficiency rather than sending each one immediately, accumulating them and flushing on a time and size cadence. In healthy operation that adds a modest, predictable increment to latency. It becomes a problem when the agent cannot flush as fast as events arrive, which happens when the source generates a high event rate, the network path to the ingestion endpoint is slow or intermittent, the agent host is starved for CPU or memory and cannot process its own queue, or a data collection rule holds or transforms data inefficiently. Confirm it with the per-computer latency profile: if one host is far behind while others are current, agent batching is the leading hypothesis. Fix it by relieving host pressure, addressing the network path, or reducing what the collection rule sends.
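
One way to build the per-computer latency profile is sketched here. The one-hour window and the choice of average plus 95th percentile are assumptions for illustration, not a prescribed method.

```kusto
Heartbeat
| where TimeGenerated > ago(1h)
| extend IngestedAt = ingestion_time()
| extend LatencySeconds = datetime_diff('second', IngestedAt, TimeGenerated)
| summarize
    AvgLatencySeconds = avg(LatencySeconds),
    P95LatencySeconds = percentile(LatencySeconds, 95),
    LastIngested = max(IngestedAt)
    by Computer
| sort by P95LatencySeconds desc
```

A single host at the top of the list with a latency far above the rest, while the others sit near baseline, is the pattern that points at agent-side buffering on that host.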

Q: Does high ingestion volume increase Log Analytics latency?

Yes. The ingestion pipeline has a per-workspace rate limit, and when ingested volume climbs past what the pipeline can sustain, the platform throttles by queueing the excess and processing it as capacity frees up, which extends end-to-end latency for everything in that workspace until the backlog drains. In this queueing case the data is not dropped; the queue depth shows up as latency. Sustained volume above the workspace’s documented ingestion rate limit is a different situation that can cost data, so confirm the current limit and its behavior against Azure’s documentation. The signature is a workspace-wide latency increase that rises and falls with a volume spike, often caused by a deployment enabling verbose logging, a new high-volume source, a debug flag left on in production, or a runaway component. Confirm it by charting ingested volume from the Usage table and overlaying the measured latency; if they track each other, throttling under load is the cause. The fix is to reduce volume at the source by disabling unneeded verbose logging, filtering noise in the collection rule, sampling high-volume traces, or splitting workloads into separate workspaces.

Q: How do I confirm telemetry is still flowing into the workspace?

Query a steady stream like the Heartbeat table over a wide time range, without a narrow recent-time filter, and inspect the most recent ingestion_time(). If rows are arriving and the newest ingestion time is within the last several minutes, telemetry is flowing and you are looking at normal latency on whatever table seemed empty. If the newest ingestion time is hours old or absent, you have either a real delay or a no-data problem and you move to attribution. The heartbeat is ideal for this because every connected agent writes one on a predictable cadence, so it acts as a clock for the whole ingestion path; if heartbeats are current, the path is healthy and the “missing” data on a less frequent table is almost certainly in flight. This check is the fastest way to separate a pipeline that is working with normal latency from one that has genuinely stopped delivering data.
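
A minimal sketch of that flow check follows. The one-day lookback is an assumption chosen to be comfortably wider than any plausible latency, and MinutesSinceLastIngest is a name introduced for this example.

```kusto
Heartbeat
| where TimeGenerated > ago(1d)
| extend IngestedAt = ingestion_time()
| summarize LastIngested = max(IngestedAt), LastEvent = max(TimeGenerated)
| extend MinutesSinceLastIngest = datetime_diff('minute', now(), LastIngested)
```

A small MinutesSinceLastIngest means the path is delivering; a large or empty value is the cue to move on to attribution.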

Q: Is my Log Analytics data lost or just delayed?

Run the eventual-arrival test: wait past your baseline latency window, widen the query time range, and check on ingestion_time() whether the rows have arrived. If they eventually appear, it was latency and the correct action was simply to wait. If they never arrive no matter how long you wait, it was loss, and you have crossed from a delay question into a no-data question with entirely different causes. The two look identical in an empty query result, which is why you must run the test rather than guess. Treating latency as loss leads you to rebuild a healthy pipeline and risk introducing a real fault; treating loss as latency leaves you waiting forever for data that is not coming. The one question that routes you correctly is whether the data eventually arrives, and the ingestion_time() check plus a wait past baseline answers it definitively.

Q: Why does my dashboard show a gap that fills itself a few minutes later?

Because the dashboard refreshes inside the ingestion latency window, querying right up to the present moment where the most recent rows have not finished ingesting, so it renders a hole at the leading edge of the chart. On the next refresh those rows have committed, the gap fills, and a new gap forms one refresh interval further along, which looks like data arriving late and catching up because that is exactly what is happening. It is a presentation artifact of querying too close to now, not missing data. Confirm it by widening the time range and watching the gap persist only at the very edge, or by querying on ingestion_time() and watching it disappear. Fix it by offsetting the dashboard so its “now” lags real time by your baseline window, for example filtering the leading edge with TimeGenerated < ago(5m), which trades a few minutes of recency for a chart that never shows a self-healing hole.
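
An offset tile query might look like the sketch below. The five-minute offset mirrors the example above and should be replaced with your measured baseline; the one-hour span and one-minute bins are assumptions for illustration.

```kusto
Heartbeat
| where TimeGenerated between (ago(65m) .. ago(5m))   // one hour of data, ending 5 minutes before now
| summarize Heartbeats = count() by bin(TimeGenerated, 1m)
| render timechart
```

Because the window ends before the latency window begins, every bin the chart draws has had time to finish committing.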

Q: How do I set an ingestion-delay alert that does not fire on normal latency?

Base the alert on the measured latency from ingestion_time() relative to your documented baseline, and require both a real margin above baseline and persistence across multiple samples before it triggers. A single sample inside the normal window must never fire the alert, because an alert that fires on ordinary latency trains the on-call team to ignore it, which defeats the purpose. Compute the latency per table or per major data path, since the expected window differs by data type and a connector’s normal cadence is not the same as an agent’s. Set the threshold meaningfully above the baseline for that path and add a duration condition so a brief blip during a transient volume spike does not page anyone, while a sustained elevation that indicates genuine throttling or a backing-up agent does. This turns the alert from a generator of false alarms into a signal that fires only on the abnormal latency that actually warrants attention.
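
The margin-plus-persistence idea can be sketched as an alert query like this one. All three thresholds are hypothetical values to be replaced with numbers from your own baseline, and the query assumes the alert rule is configured to fire when it returns at least one row.

```kusto
let BaselineSeconds = 180;   // assumed baseline for this data path
let MarginSeconds = 300;     // assumed margin above baseline
let RequiredBins = 3;        // assumed persistence: elevated 5-minute bins needed to fire
Heartbeat
| where TimeGenerated > ago(1d)                 // wide event-time bound, only to limit the scan
| extend IngestedAt = ingestion_time()
| where IngestedAt > ago(30m)                   // rows committed in the last 30 minutes
| extend LatencySeconds = datetime_diff('second', IngestedAt, TimeGenerated)
| summarize P95LatencySeconds = percentile(LatencySeconds, 95) by bin(IngestedAt, 5m)
| where P95LatencySeconds > BaselineSeconds + MarginSeconds
| summarize ElevatedBins = count()
| where ElevatedBins >= RequiredBins            // returns a row only on sustained elevation
```

Binning by commit time rather than event time keeps the newest bin from looking artificially healthy, since only the fastest rows for recent events have arrived. Note that this query returns nothing if ingestion stops entirely, so it complements a heartbeat-freshness check rather than replacing it.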

Q: Can one connector lag while the rest of the workspace is current?

Yes. A connector that pulls from an external or upstream system on its own schedule inherits that schedule’s cadence and any backlog in the upstream system, so it can lag by far more than the ingestion pipeline’s own latency while every directly collected source stays within baseline. Connectors do not flow through the same near-real-time path as metrics or agent heartbeats; they poll or receive on a cadence, batch large pulls, and depend on the health and rate limits of the system they read from. Confirm it with a per-table latency profile: the data type far behind while everything else is current is your lagging connector. The fix usually is not on the Azure side, since the pipeline downstream is healthy; check the connector’s configured interval, the upstream system’s health and rate limits, and recalibrate your baseline for that table to match its real cadence so you stop alarming on a connector behaving exactly as configured.
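
A per-table latency profile can be sketched by measuring each table in its own leg and combining the results. The three tables named here are assumptions; swap in the tables your workspace actually collects, including the connector's table.

```kusto
union
    (Heartbeat
        | where TimeGenerated > ago(1h)
        | extend IngestedAt = ingestion_time()
        | project TableName = "Heartbeat", TimeGenerated, IngestedAt),
    (Perf
        | where TimeGenerated > ago(1h)
        | extend IngestedAt = ingestion_time()
        | project TableName = "Perf", TimeGenerated, IngestedAt),
    (Syslog
        | where TimeGenerated > ago(1h)
        | extend IngestedAt = ingestion_time()
        | project TableName = "Syslog", TimeGenerated, IngestedAt)
| extend LatencySeconds = datetime_diff('second', IngestedAt, TimeGenerated)
| summarize P95LatencySeconds = percentile(LatencySeconds, 95),
            LastIngested = max(IngestedAt)
    by TableName
| sort by P95LatencySeconds desc
```

Calling ingestion_time() inside each leg, before the union, keeps the function evaluated against a concrete table.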

Q: Should I recreate my diagnostic setting or reinstall the agent when logs are delayed?

No, not before you have measured the latency and confirmed the cause. When the gap falls inside your baseline window, the latency-is-expected rule forbids these destructive responses because the setting and the agent are working and the data is in flight; recreating the setting or reinstalling the agent changes nothing except the risk of introducing a real misconfiguration on top of an imagined problem. Run the ingestion_time() check first. If the latency is at or near baseline, the correct action is to wait or query the slightly older complete interval. Reach for configuration changes only after the measurement shows a genuine delay and a confirming check has attributed it to a cause that the change actually addresses, such as reducing collection in a rule when an agent is backing up. Acting before measuring is how a self-resolving latency becomes a real outage.

Q: Why is ingestion latency different for metrics versus logs?

Because they travel different paths with different timing characteristics. Platform metrics take a dedicated near-real-time path optimized for low latency, so they are queryable quickly. Resource logs and agent-collected data flow through diagnostic settings and the general log ingestion pipeline, which buffers, batches, validates, routes by schema, and commits to the indexed store, all of which adds time. Agent-collected data carries the additional agent-side buffering and batching delay on top of the pipeline latency. Connector data inherits its connector’s polling schedule. Because the paths differ, the expected window differs by data type, and a latency that is perfectly normal for a verbose log table would be alarming for a metric. This is the core reason to baseline per table or per data path rather than as one workspace-wide figure: comparing a slow path against a fast path’s expectation produces false alarms about a system that is behaving exactly as its architecture dictates.

Q: How do I check ingestion volume to diagnose a workspace-wide delay?

Query the Usage table, which records ingested volume per data type over time, and aggregate it into hourly bins to see the volume trend. Filter to billable data, sum the quantity by data type and time bin, and sort by time to find spikes. Then run an hourly aggregation of the measured latency from ingestion_time() over the same period and overlay the two. If the latency rises and falls with the volume, you have confirmed ingestion-volume throttling under load, and the Usage breakdown tells you exactly which data type and source drove the spike so you know where to reduce. This correlation is the definitive confirmation for the volume cause, because it ties an elevated, workspace-wide latency directly to the inflow that produced it, and it points straight at the noisy source to fix rather than leaving you to scale the workspace blindly when capacity was never the real constraint.
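
The two halves of that correlation can be sketched as a pair of hourly queries. The 24-hour window is an assumption, and Heartbeat stands in as the latency probe; use a table on the affected path if one is more representative.

```kusto
// Hourly ingested volume per data type (Quantity is reported in MB).
Usage
| where TimeGenerated > ago(24h)
| where IsBillable == true
| summarize IngestedMB = sum(Quantity) by DataType, bin(TimeGenerated, 1h)
| sort by TimeGenerated asc

// Hourly measured latency over the same period.
Heartbeat
| where TimeGenerated > ago(24h)
| extend LatencySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize AvgLatencySeconds = avg(LatencySeconds) by bin(TimeGenerated, 1h)
| sort by TimeGenerated asc
```

Run them separately and compare the hours: a latency peak that lines up with a volume peak in one data type is the confirmation described above.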

Q: Does scaling up the workspace or changing its tier reduce ingestion latency?

Not when the latency is normal, and rarely as the right first move even when it is elevated. When the measured latency sits at baseline, capacity is not the constraint, so changing the tier addresses nothing. When the latency is elevated because of volume throttling, the durable fix is to reduce the volume at the source rather than to provision your way around a flood of telemetry you did not need, because the same volume is also driving cost in parallel. Identify the data type and source driving the spike from the Usage table, then turn off verbose or debug logging, filter high-cardinality noise in the collection rule, sample high-volume traces, and route genuinely separate high-volume workloads to their own workspace. Tier and commitment choices affect cost and retention behavior, but the lever that actually fixes volume-driven latency is collecting less of what does not earn its keep, which solves the latency and the cost together.

Q: How do I build dashboards that respect ingestion latency?

Offset every operational visualization so its leading edge lags real time by your baseline window, which keeps the chart out of the latency window where rows have not finished committing and the self-healing hole forms. In practice this means filtering the leading edge with something like TimeGenerated < ago(5m) or configuring the visualization to lag by your measured baseline, trading a few minutes of recency for a chart with no false gaps. Base health checks on ingestion_time() rather than TimeGenerated so they test what has actually been committed rather than what merely happened, and threshold latency alerts above baseline with a persistence condition so they fire only on genuine elevation. These habits build latency awareness into the tools the team watches all day, which is what stops normal latency from being read as a problem and eliminates the false alarms that make up the majority of ingestion-delay incidents.

Q: Does ingestion latency affect when a log-based alert rule fires?

Yes. A log alert rule queries the workspace on a schedule, and if it queries inside the latency window it will evaluate against data that has not finished ingesting, which can delay the alert or, for a rule looking for the absence of data, fire a false alarm because the expected rows are merely in flight rather than missing. This is why a log alert designed to detect a missing signal needs to account for the ingestion window in its logic, either by evaluating on a slightly lagged time range or by setting its threshold and lookback so that normal latency cannot trip it. Metric alerts, which ride the near-real-time metric path, are far less exposed to this because their path has a much smaller window. When you build a log alert, treat the latency window as part of the design: a rule that ignores it will either fire late on real conditions or fire falsely on normal timing, and both erode trust in the alert.

Q: What is _TimeReceived and how does it relate to ingestion time?

_TimeReceived is a field some tables expose that records when the ingestion point received the record, which sits between the event time in TimeGenerated and the final commit time returned by ingestion_time(). It can help you localize where in the path the time is being spent: a large gap between TimeGenerated and _TimeReceived points at delay before the data reached the ingestion endpoint, such as agent buffering or a connector schedule, while a gap between _TimeReceived and ingestion_time() points at delay inside the pipeline after receipt, such as processing under heavy volume. Not every table populates the field consistently, so treat it as a supplementary lens rather than the primary measure, and lean on ingestion_time() against TimeGenerated as the authoritative end-to-end delay. When the field is present, the two sub-gaps it exposes can shorten attribution by telling you which side of the receipt point the time is being lost on.
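
Where the field is populated, the two sub-gaps can be split out with a sketch like this. Heartbeat and the one-hour window are illustrative assumptions, and the three output column names are introduced for this example.

```kusto
Heartbeat
| where TimeGenerated > ago(1h)
| extend IngestedAt = ingestion_time()
| extend BeforeReceiptSeconds = datetime_diff('second', _TimeReceived, TimeGenerated),  // source to ingestion point
         AfterReceiptSeconds  = datetime_diff('second', IngestedAt, _TimeReceived),     // ingestion point to commit
         EndToEndSeconds      = datetime_diff('second', IngestedAt, TimeGenerated)
| summarize
    P95BeforeReceipt = percentile(BeforeReceiptSeconds, 95),
    P95AfterReceipt  = percentile(AfterReceiptSeconds, 95),
    P95EndToEnd      = percentile(EndToEndSeconds, 95)
    by Computer
| sort by P95EndToEnd desc
```

A large P95BeforeReceipt on one host points back at the agent or network path, while a large P95AfterReceipt across all hosts points at the pipeline.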

Q: How long should I wait before treating delayed logs as an actual incident?

Wait until the measured delay exceeds your documented baseline for that specific table by a real margin and stays there across repeated checks, rather than waiting a fixed number of minutes by feel. The right wait is defined by the baseline, not the clock: if a table’s normal window is three minutes, a gap at four minutes is not yet an incident, while a gap at twenty minutes that is still growing is one regardless of how recently it started. Run the ingestion_time() delay query, compare against baseline, and re-run it a couple of times over a few minutes to read the trend. A delay that is stable at or near baseline is not an incident, a delay that is shrinking is a backlog draining and will resolve itself, and a delay that is elevated and growing is the case that warrants attribution and action. Anchoring the decision to the baseline and the trend, rather than to an arbitrary patience threshold, is what keeps you from both over-reacting to normal timing and under-reacting to a genuine, building delay.