Fix Azure Data Factory Pipeline Failures

A red dot on a pipeline run and a single line that reads Operation on target Copy_Customers failed is where most engineers begin, and far too many of them respond by clicking rerun and walking away to get coffee. The rerun completes, the same red dot appears, and the cycle repeats until somebody senior wanders over and asks the only question that matters: which activity failed, and what did its error actually say? Azure Data Factory pipeline failures are almost never about the pipeline. The pipeline status is a roll-up, a summary verdict that inherits the worst outcome of any step inside it, and the actionable truth lives one level down, in the specific activity that broke and the structured error it recorded. Learn to read that one error and you stop guessing. You will know whether a linked service lost its credential, a copy activity hit a throttling wall against its sink, a mapping data flow could not start its Spark cluster, a source path or schema shifted underneath you, or the integration runtime that runs everything went dark. Each of those leaves a different signature, each has a confirming check you can run in under a minute, and each has a tested fix that is not “rerun and hope.”


This is a diagnosis, not a symptom catalog. By the end you will be able to open a failed run, navigate from the generic pipeline result to the failing step, read the error structure the way the platform intends, map the signal to one of a small set of recurring root causes, confirm which one is yours, apply the matching repair, and then rerun from the failed step rather than restarting work that already succeeded. The method generalizes. Orchestration in Azure, whether you build it in Data Factory or in the pipelines engine that ships inside Synapse, follows the same model, so the reading skill you build here transfers directly to either surface.

What an Azure Data Factory pipeline failure actually tells you

A pipeline in Data Factory is a container for activities, and an activity is the unit of work that does something concrete: a copy activity moves rows from a source to a sink, a data flow runs a transformation on a Spark cluster, a lookup reads a value, a stored procedure activity invokes something in a database, an execute pipeline activity calls a child pipeline. When you trigger a pipeline, the service walks the activity graph in dependency order, runs each activity, records the outcome of each one, and then rolls those outcomes up into a single pipeline run status. If every activity reports success, the pipeline reports success. If any activity in the path reports failure and nothing downstream catches it, the pipeline reports failure and stops at the point the dependency chain breaks.

That roll-up is the source of the confusion. The pipeline-level message you see first, the one that often reads Operation on target <ActivityName> failed, names the activity but does not explain it. It is a pointer, not a diagnosis. The explanation, the part that tells you what to fix, is recorded on the activity run itself, in a structured error object with an error code, a message, and frequently a nested inner detail from the underlying connector, driver, or compute engine. The single most common mistake in this whole domain is treating the pipeline-level pointer as if it were the full story, rerunning the pipeline against it, and never opening the activity that actually carries the cause.

Why does my Azure Data Factory pipeline fail?

A pipeline fails because one of its activities failed and the failure was not handled by a downstream path. The pipeline status only summarizes the worst activity outcome, so the real reason is always recorded on the specific activity that broke, in its error code and message. Open that activity, read its error, and the cause becomes concrete.

Hold that model firmly, because it dictates everything that follows. The pipeline is the orchestrator and the bookkeeper. The activity is where work happens and where work fails. When you ask “why did my pipeline fail,” you are really asking “which activity failed, and what was its error,” and the answer to that compound question is two clicks away in the monitoring experience. Everything else in this article is the discipline of asking that compound question consistently and knowing how to read the answer.
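The roll-up described above can be sketched in a few lines. This is a deliberately simplified model of the behavior this article describes, not the service's actual evaluation logic; the `handled` flag and the activity names are illustrative assumptions.

```python
def pipeline_status(activities):
    """Roll activity outcomes up into one pipeline status (simplified model).

    activities: list of dicts with 'name', 'status', and an optional
    'handled' flag marking a failure that a downstream path caught.
    Returns (pipeline_status, names_of_unhandled_failures).
    """
    unhandled = [
        activity["name"]
        for activity in activities
        if activity["status"] == "Failed" and not activity.get("handled", False)
    ]
    return ("Failed" if unhandled else "Succeeded"), unhandled


# Hypothetical run: one unhandled failure is enough to fail the pipeline.
run = [
    {"name": "Lookup_Watermark", "status": "Succeeded"},
    {"name": "Copy_Customers", "status": "Failed"},
]
print(pipeline_status(run))
```

The second element of the result is the point: the status alone says "Failed", while the list names the activity whose error you actually need to read.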

How to read the failure: the run views and the activity error

The monitoring experience in Data Factory Studio gives you two stacked views, and you need both. The pipeline runs view lists each execution of a pipeline with its status, start time, duration, and the trigger that launched it. This is where you confirm that a run failed, see how far it got, and check whether the failure is new or a repeat of something that has been failing on a schedule for hours. Click into a failed pipeline run and you drop into the activity runs view, a Gantt-style breakdown of every activity in that run with its individual status. Here is where the truth lives. The activity that failed shows a failure status, and next to it sits an error icon. Click that icon.

The error you get back is structured, and learning its shape is the whole game. At the top is an error code, a short token like 2200 for a general user-configuration error from a copy activity, or DFExecutorUserError, or a connector-specific string. Below that is a message, written in prose, describing what the activity was attempting and what went wrong. Below that, very often, is the part most people stop reading before they reach: a nested detail, frequently labeled as an inner error or a message from the underlying source, that contains the actual driver exception, the HTTP status the sink returned, the Spark error, or the precise reason a connection was refused. The pipeline-level pointer told you the activity name. The activity-level message tells you the category. The nested inner detail tells you the exact fix. Read all three, in that order, every time.
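To make the layering concrete, here is a minimal Python sketch that pulls the outer message and the innermost detail out of an error object. The sample error is hand-written to resemble a typical copy-activity message with chained `Message=...,Source=...` segments; that layout is an assumption, it varies by connector, and it should not be treated as a stable contract.

```python
import re

# Illustrative sample only: modeled on the general look of a copy-activity
# error, not captured from a real run. The user name is hypothetical.
SAMPLE_ERROR = {
    "errorCode": "2200",
    "message": (
        "ErrorCode=SqlFailedToConnect,"
        "'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,"
        "Message=Cannot connect to SQL Database.,"
        "Source=Microsoft.DataTransfer.ClientLibrary,"
        "''Type=System.Data.SqlClient.SqlException,"
        "Message=Login failed for user 'etl_user'.,"
        "Source=.Net SqlClient Data Provider,'"
    ),
}


def split_error_layers(error):
    """Return (code, outer_message, innermost_message) for an activity error."""
    text = error.get("message", "")
    # Each chained exception contributes one "Message=...,Source=" segment.
    messages = re.findall(r"Message=(.*?),Source=", text)
    if not messages:
        # No chained layout found: the whole message is the only layer.
        return error.get("errorCode"), text, text
    return error.get("errorCode"), messages[0], messages[-1]


code, outer, inner = split_error_layers(SAMPLE_ERROR)
print(f"code={code}\ncategory={outer}\nexact cause={inner}")
```

In this sample the outer message only says the connection failed, while the innermost one says the login was rejected, which is the difference between knowing the category and knowing the fix.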

Where does Data Factory show which activity failed?

The activity runs view inside a failed pipeline run shows it. Open the pipeline runs list in the monitoring section of Data Factory Studio, click the failed run, and you see every activity with its own status. The failed activity carries an error icon; clicking it reveals the error code, message, and the nested inner detail with the real cause.

You can pull the same information without the portal, which matters for automated triage and for incident runbooks. The Azure CLI exposes activity run history for a given pipeline run, and the structured error comes back in the response so you can parse it programmatically:

# The datafactory commands ship as a CLI extension: az extension add --name datafactory
# List failed pipeline runs in a time window
az datafactory pipeline-run query-by-factory \
  --factory-name "adf-prod-eastus" \
  --resource-group "rg-data-prod" \
  --last-updated-after "2023-03-05T00:00:00Z" \
  --last-updated-before "2023-03-06T00:00:00Z" \
  --filters operand="Status" operator="Equals" values="Failed"

# For a specific run, list every activity and its error
az datafactory activity-run query-by-pipeline-run \
  --factory-name "adf-prod-eastus" \
  --resource-group "rg-data-prod" \
  --run-id "<pipelineRunId>" \
  --last-updated-after "2023-03-05T00:00:00Z" \
  --last-updated-before "2023-03-06T00:00:00Z" \
  --query "value[?status=='Failed'].{activity:activityName, code:error.errorCode, message:error.message}"

That last query projects exactly the three fields you care about for each failed step: the activity name, the error code, and the message. Run it the moment a run fails and you have the diagnosis in your terminal before the portal even finishes loading. The structured error is the same object the portal renders; the only difference is that the command-line form makes it scriptable, so you can wire it into an alert handler that posts the failed activity and its code straight to a chat channel.
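As one way to wire that into an alert handler, the sketch below turns the JSON printed by the projected query into a short chat-ready message. It assumes the CLI output is a JSON array of objects with the `activity`, `code`, and `message` keys from the projection above; the pipeline name and the function name are hypothetical, and posting to a chat channel is left out.

```python
import json


def format_alert(cli_output, pipeline_name):
    """Build a plain-text alert from the projected failed-activity JSON."""
    rows = json.loads(cli_output)
    if not rows:
        return f"{pipeline_name}: no failed activities found"
    lines = [f"{pipeline_name}: {len(rows)} failed activity run(s)"]
    for row in rows:
        lines.append(f"- {row['activity']} [{row['code']}]: {row['message']}")
    return "\n".join(lines)


# Hypothetical CLI output for a single failed copy activity.
sample = '[{"activity": "Copy_Customers", "code": "2200", "message": "Login failed"}]'
print(format_alert(sample, "pl_daily_load"))
```

Keeping the formatter separate from the transport means the same text can go to chat, email, or an incident ticket without touching the parsing.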

The named principle to carry out of this section is the failed-activity rule for pipelines: the pipeline error is generic by design, so the fix lives in the specific activity that failed and in that activity’s error detail, found in the run view before you touch the rerun button. Every cause below is reached by applying that rule first. You do not start by guessing at causes. You start by reading the failed activity, and the activity tells you which cause to investigate.

The InsightCrunch pipeline-failure table

Once you can read the failed activity, the next skill is mapping its signal to a cause. The recurring failures cluster into a handful of families, and each family announces itself in a recognizable way. The table below is the InsightCrunch pipeline-failure table, a compact map from the failing component to the signal it leaves and the fix that addresses it. Keep it next to the monitoring view; it turns a wall of error text into a short decision.

| Failing component | Signal in the failed activity | Root cause family | Fix direction |
| --- | --- | --- | --- |
| Linked service / credential | Connection refused, login failed, token expired, `forbidden`, `Unauthorized` in the inner detail; fails immediately, before moving data | The connection definition or its secret is wrong, expired, or unauthorized | Repair the credential, refresh the Key Vault secret, or grant the managed identity the right role |
| Copy activity sink | `429`, throttling, request rate too large, write timeout; the activity reads fine but slows or fails on write; long duration before failure | The sink cannot absorb the write rate the copy is pushing | Lower parallelism or DIUs, add staging, or scale the sink for the window |
| Mapping data flow | `DFExecutorUserError`, cluster failed to start, timeout during startup, Spark error before transformation logic runs | The data flow could not acquire or start its Spark cluster, or the cluster choked on the data | Adjust the integration runtime compute and TTL, fix the transformation, or right-size the cluster |
| Source schema / path | File not found, path does not exist, column not found, schema mismatch, mapping error | The source moved, was renamed, or changed shape since the last successful run | Correct the path or wildcard, enable schema drift handling, or fix the mapping |
| Integration runtime | Every activity that uses the runtime fails together; runtime unavailable, node offline, connection to integration runtime failed | The integration runtime is offline or unreachable | Restore the runtime node and its connectivity, not the pipeline |

The table is deliberately small because the failures really are this concentrated. A copy activity that fails the instant it starts, before any rows move, is almost always a connection or authorization problem, not a throughput problem. A copy activity that runs for twenty minutes and then fails on the write side is almost always a sink throttle. A data flow that fails in its first seconds with a cluster message never reached your transformation logic at all. A whole pipeline where every single activity failed at the same timestamp is rarely five independent bugs; it is one shared dependency, usually the runtime, going down. The signal tells you the family, and the family tells you which of the sections below to read. Each cause now gets its own treatment: how to confirm it is yours, and the tested fix.
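For automated triage, the table can be approximated as a keyword lookup. The sketch below is a rough first-pass classifier, not a complete one: the keyword lists are taken from the signals in the table, the matching order is an assumption, and real error text will need tuning (a bare `429` match, for instance, can produce false positives).

```python
# Ordered so that the most specific families are checked first.
FAMILIES = [
    ("integration runtime",
     ["integration runtime", "runtime unavailable", "node offline"]),
    ("mapping data flow",
     ["dfexecutorusererror", "cluster", "spark"]),
    ("copy activity sink",
     ["429", "throttl", "request rate", "write timeout"]),
    ("source schema / path",
     ["not found", "does not exist", "schema mismatch", "mapping error"]),
    ("linked service / credential",
     ["login failed", "token expired", "forbidden", "unauthorized",
      "connection refused"]),
]


def classify_failure(error_text):
    """Map raw error text to a root-cause family from the table."""
    text = error_text.lower()
    for family, keywords in FAMILIES:
        if any(keyword in text for keyword in keywords):
            return family
    return "unclassified"


print(classify_failure("Login failed for user 'etl_user'."))
print(classify_failure("DFExecutorUserError: cluster failed to start"))
```

Treat the result as a pointer to the right section below, then confirm with the check that section describes before changing anything.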

Cause one: a linked service connection or credential problem

A linked service is the connection definition that an activity uses to reach a data store or compute. It holds the endpoint, the authentication method, and a reference to the secret or identity that authorizes the connection. When a linked service is wrong, the activity that depends on it fails fast, before any meaningful work happens, and the inner detail of the error names a connection or authorization problem rather than a data problem. This is the cleanest failure to diagnose precisely because of how early it fails: the activity never got far enough to throttle, time out on volume, or hit a schema issue, so the error is about reaching the store at all.

The signatures vary by store but rhyme. A SQL sink or source with a bad password produces a login-failed message in the inner detail. A storage account reached with an expired account key or a stale shared access signature produces an authentication or 403 Forbidden. A connection that uses a managed identity which lacks the right data-plane role produces a forbidden response even though the network path is wide open and the secret is perfectly valid, because the identity is authenticated but not authorized. A secret stored in Key Vault that has expired or been rotated out from under the linked service produces a failure to retrieve the secret, which then cascades into a connection failure. The common thread is that the activity fails at connect time, fast, with an authentication or authorization word in the detail.

Can a bad linked service connection fail the whole pipeline?

Yes. If an activity cannot establish its connection through its linked service, that activity fails, and unless a downstream path handles the failure, the pipeline rolls that failure up and stops. The activity error will name a connection or authorization problem in its inner detail, and it will fail almost immediately rather than after moving data.

To confirm it is yours, the fastest move is to test the connection directly rather than rerunning the pipeline. In Data Factory Studio, open the linked service and click Test connection: it exercises exactly the path the activity uses and returns the same connection or authorization error in isolation, without the noise of the rest of the pipeline. If the test fails, you have confirmed the cause and stripped away everything else. You can also narrow it down from the command line by inspecting the linked service definition and checking the secret it references:

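# Show the linked service definition to see how it authenticates.
# The CLI may flatten type-specific settings directly under "properties",
# so fall back to the whole object if there is no nested typeProperties key.
az datafactory linked-service show \
  --factory-name "adf-prod-eastus" \
  --resource-group "rg-data-prod" \
  --name "ls_sql_customers" \
  --query "properties.typeProperties || properties"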

# If it references a Key Vault secret, confirm the secret still resolves
az keyvault secret show \
  --vault-name "kv-data-prod" \
  --name "sql-customers-connstring" \
  --query "attributes.{enabled:enabled, expires:expires}"

If the secret shows as disabled or past its expiry, you found it. The fix depends on which flavor you confirmed. For an expired or rotated secret, update the secret in Key Vault and confirm the linked service references the secret by name without a pinned version, so it always reads the current value; pinning a specific secret version is a frequent and quiet cause of “it worked yesterday” failures, because the rotation produced a new version while the linked service kept pointing at the old one. For a managed-identity authorization gap, grant the identity the correct data-plane role on the target, for example the Storage Blob Data Contributor role on the container a copy activity writes to, and remember that role assignments take a short while to propagate before the next run will succeed. For a plain bad password or stale account key, correct the credential at its source and let the linked service pick it up.

# Grant the Data Factory managed identity the data-plane role on a storage container
az role assignment create \
  --assignee "<adf-managed-identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub>/resourceGroups/rg-data-prod/providers/Microsoft.Storage/storageAccounts/stdataprod/blobServices/default/containers/landing"
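The pinned-version trap mentioned above is easy to check mechanically. The following sketch walks a linked service definition and reports every Key Vault secret reference that carries a `secretVersion`. The sample definition, the linked service name `ls_keyvault`, and the version string are hypothetical; the `AzureKeyVaultSecret` reference shape follows the usual linked service JSON, but verify it against your own export.

```python
def find_pinned_secrets(node, path="properties"):
    """Return (json_path, secret_name, version) for each version-pinned secret."""
    pinned = []
    if isinstance(node, dict):
        if node.get("type") == "AzureKeyVaultSecret" and node.get("secretVersion"):
            pinned.append((path, node.get("secretName"), node["secretVersion"]))
        for key, value in node.items():
            pinned.extend(find_pinned_secrets(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            pinned.extend(find_pinned_secrets(value, f"{path}[{index}]"))
    return pinned


# Hypothetical linked service body with a secret pinned to one version.
LINKED_SERVICE = {
    "type": "AzureSqlDatabase",
    "typeProperties": {
        "connectionString": {
            "type": "AzureKeyVaultSecret",
            "store": {"referenceName": "ls_keyvault", "type": "LinkedServiceReference"},
            "secretName": "sql-customers-connstring",
            "secretVersion": "0123456789abcdef",
        }
    },
}

for json_path, name, version in find_pinned_secrets(LINKED_SERVICE):
    print(f"{json_path}: secret '{name}' is pinned to version {version}")
```

An empty result means every reference reads the current secret value, which is the state you want before relying on rotation.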

The discipline here is to resist rerunning until the test connection passes. A linked service problem is deterministic. It will fail on every rerun in exactly the same way, because nothing about the rerun changes the credential or the role. Reruns are for transient failures; this is not transient. Fix the connection, prove it with the test, then rerun once. If you want to drill into the connection model behind all of this, the relationship between linked services, datasets, and the integration runtime that physically carries the connection is covered end to end in the complete guide to Azure Data Factory, which is worth reading once so that these errors stop looking like mysteries and start looking like predictable consequences of how the connection layer is wired.

Cause two: a copy activity throttled against the sink

A copy activity has two halves, a read side that pulls from the source and a write side that pushes to the sink, and they run concurrently with the copy engine trying to move data as fast as both sides allow. When the sink cannot accept writes as fast as the copy is sending them, the sink pushes back, and that push-back surfaces as throttling. The classic signature is a copy activity that starts cleanly, reads happily for a while, and then slows dramatically or fails on the write side with a throttling indicator: a 429 from a service that speaks HTTP rate limits, a “request rate is large” message from a database under write pressure, a write timeout, or a retry-storm pattern visible in the activity’s duration and its copy details. Crucially, this failure happens after the activity has been running, not at the instant it starts, which is the tell that distinguishes it from a connection problem.

The sink determines the exact shape. A copy into a database can exhaust the provisioned throughput of the target or saturate its log write rate, and the database responds by throttling incoming writes. A copy into a service with a request-units model can blow past the provisioned units and collect a wall of 429 responses. A copy into storage can hit per-account or per-partition request limits under aggressive parallelism. In every case the cause is the same shape: the copy is pushing harder than the sink will accept, and the sink is protecting itself. The copy activity’s output, visible in the monitoring detail, exposes the throughput it achieved, the number of rows written, and often the parallel copies it used, which together let you see how hard it was pushing when the sink complained.

To confirm it is yours, open the failed copy activity and read its details rather than just its top-level message. The monitoring detail for a copy activity shows data read, data written, throughput, the duration of each stage, and the degree of parallelism. A throttled copy shows healthy read throughput and stalled or collapsing write throughput, and the inner error names a throttling or rate condition on the sink side. You then correlate that with the sink’s own metrics for the same window. If the sink is a database, its throttling or resource metrics will spike in lockstep with the copy. If the sink is a request-unit service, its rate-limited request count will climb during the copy window.
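If you pull the copy activity's output as JSON, a small helper can condense it into the numbers that matter for this check. The field names used here (`rowsRead`, `rowsCopied`, `copyDuration`, `usedParallelCopies`, `usedDataIntegrationUnits`) are the ones copy activity output typically carries, but treat them as assumptions and confirm against your own run; the sample values are invented.

```python
def summarize_copy(output):
    """Condense a copy activity output dict into throttling-relevant numbers."""
    duration = output.get("copyDuration") or 0  # seconds
    rows_read = output.get("rowsRead", 0)
    rows_copied = output.get("rowsCopied", 0)
    return {
        # Effective write rate over the whole run.
        "rows_per_second": round(rows_copied / duration, 1) if duration else None,
        # Rows that were read but never landed in the sink.
        "unwritten_rows": rows_read - rows_copied,
        "parallel_copies": output.get("usedParallelCopies"),
        "dius": output.get("usedDataIntegrationUnits"),
    }


# Hypothetical output from a copy that read far more than it managed to write.
sample_output = {
    "rowsRead": 1_200_000,
    "rowsCopied": 300_000,
    "copyDuration": 1200,
    "usedParallelCopies": 16,
    "usedDataIntegrationUnits": 32,
}
print(summarize_copy(sample_output))
```

A large gap between rows read and rows written, combined with high parallelism, is the pattern to line up against the sink's own metrics for the same window.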

The fix has several levers, and the right one depends on whether you can change the sink, change the copy, or only change the timing. The most direct copy-side lever is to reduce how hard the copy pushes. A copy activity exposes parallel copies and, when it runs on the Azure integration runtime, data integration units, and lowering them throttles the copy from your side so the sink stops complaining:

{
  "name": "Copy_Customers",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "SqlServerSource" },
    "sink": {
      "type": "AzureSqlSink",
      "writeBatchSize": 5000,
      "writeBatchTimeout": "00:30:00"
    },
    "parallelCopies": 4,
    "dataIntegrationUnits": 8,
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": { "referenceName": "ls_staging_blob", "type": "LinkedServiceReference" }
    }
  }
}

Three things in that definition fight throttling. Lowering parallelCopies and dataIntegrationUnits reduces the write pressure the copy generates, which is the bluntest and most reliable lever when a sink is rate-limited. Setting a sensible writeBatchSize for a database sink groups writes so each round trip carries more rows, which often raises effective throughput without raising the request rate that triggers throttling. Enabling staging routes the data through an intermediate blob store and then bulk-loads it into the sink, which for many database sinks is dramatically gentler than a row-by-row or small-batch insert pattern, because the bulk path is built for high-volume ingestion rather than transactional inserts. The other lever, when you control the sink, is to scale it for the load window: raise the provisioned throughput or request units before the copy runs and lower them after, so the sink can absorb the burst without throttling and you do not pay for the higher tier around the clock.

A copy activity also has built-in fault tolerance and retry settings, and tuning them matters for throttling specifically. A modest retry with a back-off lets the copy ride out brief, self-correcting throttle events instead of failing on the first 429. But retries are a treatment for transient throttling, not for a sink that is simply too small for the volume; if the sink throttles for the entire window no matter how long you wait, retries only delay the failure. The judgment call is whether the throttle is a brief spike the sink will recover from, in which case retry, or a sustained ceiling, in which case reduce the push or scale the sink. Reading the sink metrics tells you which.
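That judgment call can be roughed out from the sink's per-minute throttled-request counts for the copy window. The sketch below is a heuristic under stated assumptions: the input is a list of counts you exported from the sink's metrics yourself, and the 50 percent threshold is an arbitrary starting point, not a documented rule.

```python
def throttle_verdict(throttled_per_minute, sustained_fraction=0.5):
    """Classify a copy window as a brief spike or a sustained ceiling.

    throttled_per_minute: throttled-request counts, one entry per minute.
    sustained_fraction: share of throttled minutes treated as sustained
    (an assumed default; tune it for your sink).
    """
    if not throttled_per_minute:
        return "no data"
    throttled_minutes = sum(1 for count in throttled_per_minute if count > 0)
    if throttled_minutes == 0:
        return "not throttled"
    fraction = throttled_minutes / len(throttled_per_minute)
    return "sustained ceiling" if fraction >= sustained_fraction else "brief spike"


print(throttle_verdict([0, 0, 40, 0, 0, 0, 0, 0, 0, 0]))  # one bad minute in ten
print(throttle_verdict([30, 50, 45, 60, 0, 55]))          # throttled almost throughout
```

A "brief spike" verdict argues for a modest retry; a "sustained ceiling" verdict argues for lowering the push or scaling the sink, because waiting will not help.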

Cause three: a mapping data flow failing to start its Spark cluster

A mapping data flow is not like a copy activity. It does not run on the integration runtime’s own process the way a copy does. It compiles your visual transformation into Spark code and runs that code on a Spark cluster that the integration runtime provisions on demand. That architectural fact is the source of an entire class of failures that have nothing to do with your transformation logic and everything to do with the cluster lifecycle. A data flow activity has to acquire a cluster, start it, and only then run the transformation, and any of those early phases can fail before your logic is ever evaluated. The signature is a data flow that fails in its first seconds to a couple of minutes with a cluster, executor, or startup message, often carrying a DFExecutorUserError or a Spark-flavored error, rather than a message about your specific columns or expressions.

The dominant cause is cluster startup behavior, and it has a counterintuitive twist. The first data flow run after a period of inactivity has to start a fresh Spark cluster from cold, and cold cluster startup takes time, frequently several minutes. If your activity has a tight timeout, or if you are debugging interactively and expecting near-instant results, that startup window reads as a hang or a failure even though nothing is wrong except that you did not budget for the cold start. The integration runtime that the data flow uses has a time-to-live setting precisely for this reason: a non-zero TTL keeps a cluster warm for a configured period after a run, so subsequent runs reuse the warm cluster and skip the cold start. Set TTL to zero, or leave it at the default and run data flows infrequently, and every run pays the cold-start cost.

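# Inspect the data-flow integration runtime's compute and TTL.
# Depending on the CLI version, computeProperties may sit directly under
# "properties" rather than under a nested typeProperties key, so try both.
az datafactory integration-runtime show \
  --factory-name "adf-prod-eastus" \
  --resource-group "rg-data-prod" \
  --name "ir-dataflow-eastus" \
  --query "properties.computeProperties.dataFlowProperties || properties.typeProperties.computeProperties.dataFlowProperties"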

That command surfaces the compute type, the core count, and the TTL for the data-flow runtime. If TTL is zero or very low and your data flows fail or crawl on startup, raising the TTL so a warm cluster survives between runs is the single most effective fix for the cold-start variant. The other early-phase failures are about sizing. A cluster that starts but then fails partway through with an executor or memory error is telling you the compute it was given cannot hold the data volume or the shuffle the transformation requires, and the fix is to increase the core count or choose a compute-optimized or memory-optimized profile that matches the workload, or to reduce the shuffle by partitioning the transformation more sensibly. A cluster that fails to start at all, repeatedly, points back at the integration runtime configuration or a capacity or quota condition in the region, which is a different investigation from your transformation.
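To see why TTL matters for a given schedule, it helps to count how many runs would start cold. The sketch below is a simplified model: it assumes a cluster stays warm for exactly the TTL after each run ends and ignores startup time and any other reuse conditions the service applies. The schedule is invented for illustration.

```python
def count_cold_starts(runs, ttl_minutes):
    """Count runs that begin after the warm window has lapsed.

    runs: list of (start_minute, duration_minutes), sorted by start.
    ttl_minutes: how long a cluster is assumed to stay warm after a run ends.
    """
    cold = 0
    warm_until = None
    for start, duration in runs:
        if warm_until is None or start > warm_until:
            cold += 1  # no warm cluster available, pay the cold start
        warm_until = start + duration + ttl_minutes
    return cold


# Hypothetical schedule: four ten-minute data flows at minutes 0, 60, 75, and 240.
schedule = [(0, 10), (60, 10), (75, 10), (240, 10)]
for ttl in (0, 15, 180):
    print(f"TTL {ttl:>3} min -> {count_cold_starts(schedule, ttl)} cold start(s)")
```

Running the same count against your real trigger schedule shows whether a longer TTL would actually remove cold starts or only keep an idle cluster billed between widely spaced runs.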

To confirm the cause is the cluster and not your logic, use the data flow debug session. Turning on debug starts an interactive cluster you control, and running the data flow in debug isolates the cluster-acquisition and startup phase from the transformation phase. If debug cannot start a cluster, the problem is the runtime and its compute. If debug starts fine but your transformation errors on real data, the problem is your logic, which is a different article. The data-flow engine is the same Spark-based engine that powers the corresponding capability inside Synapse, so if you also run transformations there, the cluster lifecycle and the startup-time reasoning carry straight across; the shared engine and how Synapse unifies SQL, Spark, and pipelines is laid out in the explainer on Azure Synapse Analytics, and recognizing that the data flow runs on a managed Spark cluster is the mental shift that makes these startup failures stop being surprising.

Cause four: a source schema or path that changed

Pipelines that ran perfectly for months and then suddenly fail, with nothing changed on your side, are very often the victims of a change on the source side. A source file path that was renamed, a folder structure that was reorganized, a file that did not arrive on schedule, a column that was added or removed or retyped in a source table, a delimiter that changed in an incoming file: any of these breaks an activity that was correct against the old shape. The signature is an error about a missing path, a file that does not exist, a column that cannot be found, or a schema or mapping mismatch, and it appears suddenly on a pipeline that has a clean history, which is the tell that the world changed rather than your configuration.

The path variant is the simpler one. A copy activity or a dataset that points at a fixed path fails the moment that path no longer resolves, with a not-found or path-does-not-exist error. If the upstream system renamed a folder, moved files into a dated subfolder, or simply failed to deliver the expected file on time, the activity has nothing to read and fails. Confirming it is direct: look at the path the dataset resolves to, including any parameters or expressions that build it, and check whether that exact path exists in the source store right now. The fix is either to correct the path the dataset uses or, far more robustly, to stop hard-coding the path. A dataset can use a wildcard path or a parameterized path built from the trigger’s window, so that a daily pipeline reads landing/2023/03/06/*.csv from an expression rather than a frozen literal, which makes the pipeline resilient to the date rolling over and to the upstream system’s folder convention.
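The date-driven path is simple enough to show directly. In Data Factory this would be an expression on a dataset parameter (for example one built with `formatDateTime` over the trigger window); the Python below only mirrors that logic so you can reason about and test the folder convention. The function name and defaults are illustrative.

```python
from datetime import datetime, timezone


def landing_path(window_start, root="landing", pattern="*.csv"):
    """Build the dated wildcard path for the run that covers window_start."""
    return f"{root}/{window_start:%Y/%m/%d}/{pattern}"


# The daily run for 6 March 2023 resolves to the dated folder.
print(landing_path(datetime(2023, 3, 6, tzinfo=timezone.utc)))
```

Because the path is derived from the window rather than typed in, the same definition keeps working when the date rolls over, and a missing folder points clearly at a late or absent delivery.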

The schema variant is subtler. A copy activity with an explicit column mapping fails when a mapped column disappears or is renamed at the source, because the mapping references a column that no longer exists. A data flow with a fixed projection fails when the incoming shape no longer matches what the projection expects. The confirming check is to compare the current source schema against the mapping the activity uses; the activity error usually names the offending column, which points you straight at the change. The fix depends on whether you want the pipeline to tolerate change or to enforce a contract. To tolerate it, enable schema drift, which lets a data flow carry columns it did not explicitly define and pass through changes without failing, and use a flexible mapping in the copy activity rather than a rigid one-to-one map. To enforce a contract instead, keep the explicit mapping and treat the failure as a correct signal that the source broke its agreement, then coordinate the change with the upstream owner. Both are valid; the wrong move is to silently widen everything without deciding which you want, because schema drift that hides a genuine breaking change downstream is its own incident waiting to happen.
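The confirming comparison can be written as a small diff between the columns your mapping expects and the columns the source exposes now. How you obtain the current column list (a metadata query, a file header, a get-metadata output) is up to you; the column names below are made up.

```python
def diff_schema(expected_columns, actual_columns):
    """Report columns the mapping expects but the source lacks, and vice versa."""
    expected, actual = set(expected_columns), set(actual_columns)
    return {
        "missing": sorted(expected - actual),     # will break an explicit mapping
        "unexpected": sorted(actual - expected),  # new columns; drift candidates
    }


# Hypothetical change: 'name' was renamed to 'full_name' and 'phone' was added.
print(diff_schema(["id", "name", "email"], ["id", "full_name", "email", "phone"]))
```

Entries under "missing" are what fail a rigid mapping; entries under "unexpected" are what schema drift would silently carry through, so the two lists map onto the tolerate-or-enforce decision.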

A reliable prevention pattern for both variants is a validation activity at the top of the pipeline. A get-metadata activity that checks for the existence of the expected path, or that reads the column set and compares it to what you expect, can fail the pipeline early with a clear, intentional message instead of letting a copy activity fail later with a cryptic one. Failing fast on a validated precondition turns a confusing mid-pipeline error into an obvious “the file is not here yet” message, which is far cheaper to triage at three in the morning.

Cause five: an integration runtime outage failing every activity

When you open a failed run and find that not one activity but every activity that uses a particular integration runtime failed, often at the same timestamp, you are not looking at five independent bugs. You are looking at one shared dependency that went down. The integration runtime is the compute that physically executes activities and carries connections, and if it is offline or unreachable, every activity that depends on it fails together with a runtime-unavailable or connection-to-integration-runtime-failed message. This is the failure that masquerades as a catastrophe and is actually a single point to fix.

The pattern is the diagnostic. A copy activity failing alone points at that copy’s source, sink, or connection. A data flow failing alone points at the cluster or the transformation. But a whole pipeline where every step failed in unison, especially across multiple pipelines that share the same runtime, points at the runtime itself. For a self-hosted integration runtime, which runs on machines you manage to reach on-premises or private-network data, the runtime can go offline because its Windows service stopped, the host machine rebooted or lost power, or the node lost the outbound connectivity it needs to reach Azure. For the managed integration runtimes, an outage is rarer but a regional event or a capacity condition can still take execution offline temporarily.
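That pattern check can be automated once each activity run is tagged with the runtime that carried it. The sketch below assumes you have already built such a list; the `runtime` key is a hypothetical field you would populate yourself (for instance from the linked service each activity uses), and the runtime and activity names are illustrative.

```python
from collections import defaultdict


def shared_runtime_suspects(activity_runs, min_failures=2):
    """Return runtimes on which every activity failed (at least min_failures)."""
    stats = defaultdict(lambda: {"failed": 0, "total": 0})
    for run in activity_runs:
        entry = stats[run["runtime"]]
        entry["total"] += 1
        if run["status"] == "Failed":
            entry["failed"] += 1
    return sorted(
        name
        for name, entry in stats.items()
        if entry["failed"] >= min_failures and entry["failed"] == entry["total"]
    )


# Hypothetical run: everything on the self-hosted runtime failed together.
runs = [
    {"activity": "Copy_Customers", "runtime": "ir-selfhosted-onprem", "status": "Failed"},
    {"activity": "Copy_Orders", "runtime": "ir-selfhosted-onprem", "status": "Failed"},
    {"activity": "Lookup_Config", "runtime": "ir-selfhosted-onprem", "status": "Failed"},
    {"activity": "Copy_Blob", "runtime": "AutoResolveIntegrationRuntime", "status": "Succeeded"},
    {"activity": "Copy_Lake", "runtime": "AutoResolveIntegrationRuntime", "status": "Failed"},
]
print(shared_runtime_suspects(runs))
```

A runtime that appears in the result is a suspect, not a verdict; the direct status check is still what confirms it.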

To confirm it, check the runtime’s status directly rather than inferring it from the pipeline:

# Check the status of the integration runtime
az datafactory integration-runtime get-status \
  --factory-name "adf-prod-eastus" \
  --resource-group "rg-data-prod" \
  --name "ir-selfhosted-onprem" \
  --query "properties.{state:state, nodes:nodes[].{name:nodeName, status:status}}"

If the state comes back as something other than online, or a self-hosted node reports an offline status, you have confirmed that the runtime, not any individual activity, is the cause. The fix is to restore the runtime, not to recreate the pipeline and certainly not to rerun activities that will simply fail again against a runtime that is still down. Recreating the integration runtime when a node service merely stopped or a firewall blocked its outbound endpoints is the costly overreaction this section exists to prevent. Restart the service or the node, restore the connectivity it needs, and bring the runtime back online; the dedicated walkthrough on diagnosing a self-hosted integration runtime that shows offline covers the node-and-connectivity checks in full, because that runtime layer is its own diagnostic surface with its own recurring causes. Once the runtime is back online, your pipelines rerun and succeed without any change to the pipelines themselves, which is the proof that the runtime was the real cause all along.

How the integration runtime type shapes which failures you see

The integration runtime is worth understanding a little more deeply, because which type your activities use determines the failures you will encounter and the checks that confirm them. Data Factory offers three kinds, and they fail in characteristically different ways. The Azure integration runtime is the fully managed compute that runs in the cloud and handles activities against cloud data stores and the public network; it scales automatically and you do not run any machine for it. The self-hosted integration runtime is software you install on a machine you control, which lets activities reach data behind a firewall, on-premises, or in a private network the managed runtime cannot touch. The Azure-SSIS integration runtime is a managed cluster purpose-built to run lift-and-shift SSIS packages, a different concern from native pipeline activities.

The failures cluster by type. The managed Azure integration runtime rarely goes offline in the way a self-hosted node does, because you are not responsible for a machine; its failures show up as regional capacity conditions or transient platform events rather than a node you can restart. When every activity on a managed runtime fails together, the cause is more likely a regional event or a quota condition than a node outage, and the response is to confirm the platform status and, where the workload allows, to consider a different region rather than to restart anything. The self-hosted runtime, by contrast, fails in the concrete physical ways a server fails: its Windows service stops, the host reboots, the machine loses outbound connectivity to the Azure endpoints it must reach, or a single node goes down where high availability would have needed a second registered node. These are the failures you confirm with the node status check and fix by restoring the node and its connectivity, and they are the reason a self-hosted runtime needs the same operational care as any production server you own.

The discriminator between a managed-runtime issue and a self-hosted-node issue is which activities failed and what they touched. If the activities that failed all reach private or on-premises sources, they almost certainly run on a self-hosted runtime, and a mass failure points at that node. If the failed activities all reach cloud stores over the public network, they most likely run on a managed runtime, and a mass failure points at a regional or capacity condition rather than a node you can restart. Reading which stores the failed activities target gives you a fast first read on which runtime carried them, and the connectVia reference on each linked service settles it when the stores alone are ambiguous; knowing the runtime tells you which kind of outage you are confirming. The self-hosted runtime’s own diagnostic surface, including its service state, node health, registration, and the specific outbound connectivity it requires, is detailed enough to warrant its own treatment, and when a self-hosted runtime is the confirmed cause the node-level checks are where the actual repair happens rather than anywhere in the pipeline.

There is a capacity dimension to the self-hosted runtime that produces a quieter failure mode worth naming. A self-hosted runtime runs a bounded number of concurrent jobs per node, governed by the node’s resources and its configured concurrency. If a burst of pipelines schedules more concurrent activities than the node can run at once, activities queue, and under enough pressure they can time out waiting for a slot rather than failing on anything intrinsic to the activity. The signature is activities that fail with a timeout while the node itself reports online and healthy, which is confusing until you recognize it as a concurrency ceiling rather than a node outage. The fixes are to add nodes to spread the concurrent load, raise the node’s concurrent job capacity if the machine has headroom, or stagger the pipeline schedule so the bursts do not collide. This is distinct from a node being offline, and the check that separates them is whether the node reports online with jobs queuing, which is concurrency, or offline entirely, which is an outage.
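
The separating check above is simple enough to write down as a rule. The sketch below is only an illustration: the function name, its three inputs, and the returned labels are hypothetical, standing in for values you would read from the runtime monitoring view and the failed activity's error text.

```python
def classify_runtime_signal(node_status: str, queued_jobs: int, activity_error: str) -> str:
    """Separate a concurrency ceiling from a node outage.

    All inputs are hypothetical: the node status and queue depth as shown
    in the runtime monitoring view, and the failed activity's error text.
    """
    status = node_status.strip().lower()
    text = activity_error.lower()
    timed_out = "timeout" in text or "timed out" in text

    # Offline (or anything other than online) is an outage, whatever the error says.
    if status != "online":
        return "outage"
    # Online, jobs waiting, and a timeout on the activity: the node is saturated.
    if timed_out and queued_jobs > 0:
        return "concurrency ceiling"
    # Online and not queuing: the runtime is probably not the cause.
    return "look elsewhere"
```

A node that reports online with twelve queued jobs and a timed-out activity lands in the concurrency branch, which is the case where adding nodes or staggering schedules helps and restarting the node does not.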

Building a triage habit you can run in any order

The sections above are organized by cause, but in an incident you do not know the cause yet, so it helps to internalize the triage as a sequence of questions that converges on the answer regardless of where you start. The first question is always whether one activity failed or many, because that single distinction splits the entire problem space. One activity failing sends you into the per-activity causes: read its error and classify it. Many activities failing in unison sends you to the shared dependency, almost always the runtime, and you check the runtime status before reading any individual activity error, because fixing the runtime fixes all of them at once.

For a single failed activity, the second question is when it failed relative to when it started. A failure at the instant the activity began, before any data moved, is a connection or authorization problem, and you go to the linked service and its credential. A failure after the activity ran for a while, on the write side, is a throughput or throttle problem, and you read the copy detail and the sink metrics. A failure in the first seconds of a data flow with a cluster message is a Spark startup problem, and you check the runtime time-to-live and compute. A sudden failure on a pipeline with a clean history, with a not-found or column message, is a source change, and you check the path and the schema.

The third question, once you have a candidate cause, is how to confirm it without rerunning. Confirmation is a fast, targeted check that proves the cause in isolation: the test connection button for a linked service, the copy detail and sink metric for a throttle, the debug session for a data flow, the source store inspection for a path or schema change, the runtime status check for an outage. Confirming before fixing is what separates a diagnosis from a guess, because a confirmed cause has a single matching fix while an unconfirmed symptom invites a scattershot of changes that muddy the picture. Only after the confirmation do you apply the fix, and only after the fix do you rerun, and you rerun from the failed activity unless the fix invalidated the upstream work.

Running the triage in this order makes the diagnosis fast even on a pipeline you have never seen, because each question eliminates a large part of the space. The one-versus-many question removes either all the per-activity causes or the runtime cause in a single step. The timing question, for a single activity, removes connection problems or throughput problems depending on the answer. The confirmation step rules the surviving candidate in or out definitively. Three questions and a confirming check take you from a red dot to a known cause without a single speculative rerun, which is the entire promise of treating the failed activity as the unit of diagnosis rather than the pipeline.
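
The three questions can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not a tool the platform provides: the FailedActivity shape, the fifteen-second connect window, and the keyword checks are all hypothetical simplifications of what you read in the activity runs view.

```python
from dataclasses import dataclass

# Hypothetical cutoff: a failure faster than this counts as "at connect time".
CONNECT_WINDOW_SECONDS = 15.0


@dataclass
class FailedActivity:
    name: str
    is_data_flow: bool
    seconds_until_failure: float
    message: str
    clean_history: bool  # True if the pipeline ran cleanly until now


def triage(failed: list[FailedActivity]) -> tuple[str, str]:
    """Return (candidate cause, confirming check) for a failed run."""
    if not failed:
        return ("no failed activity", "check the trigger, not the run")

    # Question 1: did one activity fail, or many in unison?
    if len(failed) > 1:
        return ("shared dependency, likely the runtime", "runtime status check")

    activity = failed[0]
    text = activity.message.lower()

    # Question 2: what do the wording and the timing say?
    if activity.clean_history and ("not found" in text or "column" in text):
        return ("source path or schema change", "inspect the source store")
    if activity.is_data_flow and "cluster" in text:
        return ("Spark startup", "debug session, runtime time-to-live and compute")
    if activity.seconds_until_failure <= CONNECT_WINDOW_SECONDS:
        return ("connection or authorization", "linked service test connection")
    return ("sink throttle or throughput", "copy detail and sink metrics")
```

The second element of the result is question three: the check you run before any fix or rerun. A copy that ran for nine minutes and then reported a request-rate message returns the throttle family with the copy detail and sink metrics as its confirmation.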

Rerunning from the failed activity, not the start

Here is the lever that turns a long pipeline failure from an hour of wasted recompute into a two-minute recovery: you do not have to rerun the whole pipeline. Data Factory lets you rerun from the activity that failed, which keeps every successful upstream activity’s work intact and resumes execution at the broken step once you have fixed its cause. For a pipeline that copied ten datasets successfully and then failed on the eleventh, rerunning from the failed activity skips the ten that already succeeded and picks up at the one that broke, instead of recopying everything from scratch.

The mechanics live in the same monitoring view where you read the error. On a failed pipeline run, the rerun control offers a rerun-from-failed-activity option in addition to a full rerun. Choosing rerun-from-failed resumes at the failed step and runs everything from there forward; the upstream successes are preserved as part of the original run. This is the correct default for a pipeline whose early steps are expensive or whose early steps have side effects you do not want to repeat, such as a load that is not idempotent. It is also available from the command line, so an automated handler can fix a cause and resume without a human reopening the portal:

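# Rerun a failed pipeline run starting from the activity that failed.
# create-run in recovery mode groups the new run with the referenced run.
az datafactory pipeline create-run \
  --factory-name "adf-prod-eastus" \
  --resource-group "rg-data-prod" \
  --name "<pipelineName>" \
  --reference-pipeline-run-id "<failedPipelineRunId>" \
  --is-recovery true \
  --start-from-failure true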

Rerunning from a point is not always the right call, and knowing when to do a full rerun matters as much as knowing how to resume. If the failure corrupted partial state, if an early activity wrote data that a fix now invalidates, or if you are not certain the upstream successes are still valid after whatever you changed, a clean full rerun from the start is safer than resuming on top of questionable state. The decision rule is straightforward: resume from the failed activity when the upstream work is still correct and expensive to repeat, and run from the start when the fix changes the meaning of the upstream work or when partial state cannot be trusted. The wrong move, the one this whole article pushes against, is to rerun blindly without first reading why the activity failed, because a deterministic failure will simply fail again from whatever point you resume.
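
The decision rule reads naturally as a three-input function. The sketch below is illustrative only; the function and argument names are hypothetical, and the three booleans are judgments you make after reading the error, not values the service reports.

```python
def choose_rerun_mode(cause_confirmed: bool,
                      fix_invalidates_upstream: bool,
                      partial_state_trusted: bool) -> str:
    """Pick a rerun strategy from the decision rule described above."""
    # A blind rerun of a deterministic failure just fails again.
    if not cause_confirmed:
        return "do not rerun yet"
    # Questionable upstream state makes a clean start the safer choice.
    if fix_invalidates_upstream or not partial_state_trusted:
        return "full rerun"
    # Upstream work is still correct and expensive to repeat.
    return "rerun from failed activity"
```

The first branch is the important one: without a confirmed cause, neither rerun mode is the right answer.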

For pipelines that move incremental data, the most robust pattern combines a rerun-from-point with parameterization so the resumed run processes only the slice that failed. A pipeline parameterized on a date or a watermark can be rerun for exactly the failed window rather than reprocessing the entire history, which is both faster and safer than a full reload. Building the pipeline so a failed slice can be rerun in isolation, with the slice identity carried as a parameter, is the difference between a recovery that takes minutes and one that takes the rest of the shift.
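
One way to picture slice identity carried as a parameter is to generate one parameter set per day and rerun only the days that failed. The sketch below assumes a pipeline with two string parameters named windowStart and windowEnd; those names, and both helper functions, are hypothetical.

```python
from datetime import date, timedelta


def daily_windows(start: date, end: date) -> list[dict[str, str]]:
    """Build one parameter set per day in the half-open range [start, end)."""
    windows = []
    current = start
    while current < end:
        following = current + timedelta(days=1)
        windows.append({
            "windowStart": current.isoformat(),  # hypothetical parameter name
            "windowEnd": following.isoformat(),  # hypothetical parameter name
        })
        current = following
    return windows


def windows_to_rerun(windows: list[dict[str, str]], failed_starts: set[str]) -> list[dict[str, str]]:
    """Keep only the slices whose start date is in the failed set."""
    return [w for w in windows if w["windowStart"] in failed_starts]
```

Each returned dictionary is the parameter payload for one run, so a failure on a single night means one small rerun rather than a reload of the whole history.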

A worked diagnosis: from a red dot to the fix in five minutes

Theory is cheaper than practice, so walk through a concrete failure end to end the way you would handle it on a real morning. The scenario is a nightly pipeline named LoadSalesWarehouse that copies several source tables into a database and then runs a mapping data flow to build aggregates and write a summary. The on-call alert says the pipeline failed at 02:14. You open the monitoring view and see the pipeline run with a failure status, a duration of eleven minutes, and the scheduled nightly trigger as its source. The first thing you do not do is click rerun.

You click into the run and read the activity runs view. Of the seven activities, the first five copy activities show success, the sixth activity, a copy named Copy_Orders, shows failure, and the seventh, the data flow, shows a skipped status because its upstream dependency failed. Already the picture is clearer than the pipeline-level message suggested. Five steps succeeded and are not worth repeating, one step failed, and one step never ran because the failed step blocked it. The failure lives entirely in Copy_Orders, so that is the only activity whose error you need to read.

You click the error icon on Copy_Orders. The top-level message names a copy failure writing to the sink. You keep reading into the inner detail, and it carries a rate-limit indication from the database, a message about the request rate being too large during the write. That inner detail is the diagnosis: this is a sink throttle, not a connection problem, because the activity ran for minutes and failed on the write side rather than failing instantly at connect time. You confirm it in two ways. You read the copy activity’s monitoring detail and see healthy read throughput from the source against a write throughput that collapsed in the final minutes, and you glance at the sink database’s throttling metric for the 02:00 to 02:15 window and watch it spike exactly when the copy was writing hardest. The cause is confirmed: the nightly volume grew past what the sink could absorb at the parallelism the copy was using.

Now you decide the fix. You have three honest options. You can lower the copy’s parallel copies and data integration units so it pushes less hard, which trades a longer copy for a copy the sink can keep up with. You can enable staging so the copy bulk-loads into the database through a blob staging area rather than hammering it with smaller transactional batches. You can scale the database up for the load window and back down afterward. For a nightly batch where a slightly longer run is fine and you would rather not change the sink tier, lowering parallelism and enabling staging is the cleanest combination, so you edit the copy activity to drop parallel copies, add a write batch size suited to the table, and turn on staging against an existing blob linked service.

With the fix in place you do not run the whole pipeline from the start, because the five upstream copies succeeded and are still valid; their data is already in the warehouse and nothing about your change invalidates it. You rerun from the failed activity, which resumes at Copy_Orders and then runs the data flow that was skipped. The resumed run completes in four minutes, the copy finishes without throttling because it is now pushing within the sink’s capacity, and the data flow builds its aggregates on the freshly loaded orders. Total elapsed time from alert to green is well under ten minutes, and most of that was reading rather than waiting, because you never recopied the five tables that were never broken.

The whole sequence is the method in miniature. You read the run, not the pipeline message. You found the one failed activity among seven and ignored the rest. You read the inner detail to separate a throttle from a connection failure. You confirmed with the copy detail and the sink metric rather than assuming. You chose the fix that matched the confirmed cause. And you resumed from the failed step rather than restarting valid work. Every section of this article is one of those moves, and the speed comes entirely from doing them in order rather than reaching for rerun first.

Reading the copy activity output: throughput, units, and what they reveal

The copy activity records a detailed output that most engineers never open, and it is the richest single source of diagnostic signal for any copy failure or slowness. When a copy completes or fails, its output captures the data read, the data written, the number of rows copied, the throughput achieved, the duration broken into stages, the number of parallel copies actually used, and the data integration units the service applied. Reading that output is how you tell a sink-side problem from a source-side problem, and a throughput ceiling from a transient blip, without guessing.

The data read against data written comparison is the first signal. A copy where read volume climbs steadily while written volume stalls is a write-side bottleneck, which points at the sink and its capacity. A copy where read volume itself stalls is a source-side problem, a slow query, a locked table, or a source that is itself throttling the read. The throughput figure tells you the effective rate the copy sustained, and comparing it across runs shows you whether a copy that used to finish in five minutes is now crawling because the volume grew or because the sink degraded. The parallel copies and data integration units fields tell you how hard the service was actually pushing, which matters because the values you request are upper bounds the service may not fully use depending on the source and sink shapes.

The stage durations are the subtle part. A copy moving data through staging runs in distinct phases: it reads from the source into staging, then loads from staging into the sink. If the read-to-staging phase is fast and the staging-to-sink phase is slow, the sink ingestion is your bottleneck and scaling the sink or its ingestion path is the lever. If the read-to-staging phase is itself slow, the source query or the source store is the constraint and no amount of sink tuning will help. Reading which phase consumed the time routes you to the half of the copy that actually needs attention, which is the same discipline as reading which activity failed, applied one level deeper inside a single activity.
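
The phase comparison can be reduced to a ratio test. This is a rough sketch under assumptions: the two durations are values you read off the staged copy's monitoring detail, and the factor of two that counts as "clearly slower" is an arbitrary, hypothetical threshold.

```python
def slower_half(read_to_staging_s: float, staging_to_sink_s: float, ratio: float = 2.0) -> str:
    """Say which half of a staged copy consumed the time.

    ratio is a hypothetical threshold: one phase must take at least
    this many times longer than the other to be called the bottleneck.
    """
    if read_to_staging_s <= 0 and staging_to_sink_s <= 0:
        return "balanced"
    if staging_to_sink_s >= ratio * read_to_staging_s:
        return "sink"    # sink ingestion is the constraint
    if read_to_staging_s >= ratio * staging_to_sink_s:
        return "source"  # source query or store is the constraint
    return "balanced"
```

A result of "balanced" means neither phase dominates, in which case comparing throughput across earlier runs is usually more informative than tuning either side.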

The copy activity also exposes a fault tolerance capability that changes how certain failures present, and understanding it prevents a misdiagnosis. With fault tolerance enabled, a copy can skip incompatible rows, rows that violate a constraint or fail a type conversion at the sink, and continue rather than failing the whole activity. When this is on, a copy that you expect to fail on bad data instead succeeds with a count of skipped rows, and the skipped rows can be logged to a store for inspection. That is useful for resilient loads, but it also means a copy reporting success is not always a copy that moved every row, so when downstream counts look short, checking the copy output for skipped rows is the check that explains the gap. When fault tolerance is off, the same bad data fails the activity with a conversion or constraint error, which is the signal that the source shape and the sink schema disagree, a variant of the schema-change cause.
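
The skipped-row check amounts to comparing three counters in the copy output. The sketch below assumes the output carries rowsRead, rowsCopied, and rowsSkipped fields; verify the field names against your own activity output, since not every source and sink pair reports all of them. The function name and the returned messages are hypothetical.

```python
def explain_row_gap(copy_output: dict) -> str:
    """Explain a short downstream count from a copy activity's output."""
    # Field names are assumed; missing fields are treated as zero.
    rows_read = copy_output.get("rowsRead", 0)
    rows_copied = copy_output.get("rowsCopied", 0)
    rows_skipped = copy_output.get("rowsSkipped", 0)

    if rows_skipped > 0:
        return f"{rows_skipped} rows skipped by fault tolerance"
    if rows_copied < rows_read:
        return f"{rows_read - rows_copied} rows unaccounted for, no skips recorded"
    return "all rows copied"
```

A success status with a non-zero skip count is the case worth alerting on, because nothing else in the run marks it as unusual.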

Deeper on credentials: tokens, service principals, and rotation timing

The linked service failures deserve more than the single pass above, because the credential layer has several distinct flavors that fail in subtly different ways, and telling them apart saves a wrong fix. A linked service can authenticate with a stored secret such as a password or an account key, with a shared access signature, with a service principal that holds a client secret or certificate, or with a managed identity that holds no secret at all and is authorized purely by role assignment. Each of these fails at connect time, but the inner detail differs, and the fix differs with it.

A stored password or account key that is simply wrong produces a clean authentication failure, and the fix is to correct the stored value. A shared access signature that has expired produces an authentication failure that is time-bound, meaning the linked service worked until the signature’s expiry passed and then began failing on every run after; the tell is that the failure started at a clock boundary rather than on a configuration change, and the fix is to issue a new signature with a later expiry and update the secret. A service principal whose client secret expired behaves the same way, failing at the moment the secret’s validity window closed, and the fix is to rotate the client secret in the directory and update the value the linked service reads. A managed identity that lacks a role produces an authorization failure rather than an authentication failure, the distinction being that the identity proved who it is but was not permitted to do what it attempted, and the fix is a role assignment rather than a credential change.

The rotation timing trap underlies several of these and is worth stating plainly. When a secret is rotated, a new version is created, and if the linked service references a specific pinned version of the secret rather than the current version, it keeps reading the old, now-invalid value while the rest of your estate moves to the new one. The pipeline then fails with an authentication error that makes no sense, because the secret in the vault is perfectly valid, just not the version the linked service pinned. Referencing the secret by name without a version makes the linked service always read the current value and immunizes it against this class of failure. Checking whether a linked service pins a version is therefore one of the first things to verify when a credential failure appears on a pipeline that has not been touched, because a rotation on the source side is invisible to you until a run fails.
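
Checking for a pinned version can be done by walking the linked service definition. The sketch below assumes the JSON shape of a Key Vault secret reference, an object whose type is AzureKeyVaultSecret with a secretName and an optional secretVersion; the helper name and the sample linked service are hypothetical.

```python
def find_pinned_secrets(node: object, path: str = "$") -> list[tuple[str, str, str]]:
    """List (json path, secret name, pinned version) for every pinned reference."""
    hits = []
    if isinstance(node, dict):
        # A Key Vault reference with a non-empty secretVersion is pinned.
        if node.get("type") == "AzureKeyVaultSecret" and node.get("secretVersion"):
            hits.append((path, node.get("secretName", ""), node["secretVersion"]))
        for key, value in node.items():
            hits.extend(find_pinned_secrets(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_pinned_secrets(item, f"{path}[{index}]"))
    return hits


# Hypothetical linked service definition with a pinned password reference.
linked_service = {
    "name": "ls_sql",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {"referenceName": "ls_kv", "type": "LinkedServiceReference"},
                "secretName": "sql-password",
                "secretVersion": "abc123",
            }
        },
    },
}
```

Running the helper over every exported linked service gives you the short list of references that a rotation can silently break.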

There is also a propagation delay to respect after any role assignment. Granting a managed identity a data-plane role does not take effect instantly; the assignment needs a short window to propagate before the next run will see it. An engineer who grants the role and immediately reruns, sees the same authorization failure, and concludes the role did not help, is usually a victim of propagation timing rather than a wrong fix. Waiting a few minutes after the assignment before the rerun is the discipline that distinguishes a fix that needs time from a fix that did not work.

Deeper on data flows: memory, partitioning, and the broadcast trap

Beyond cluster startup, mapping data flows fail in execution for reasons rooted in how Spark processes data, and a few patterns recur often enough to name. The most common execution failure after startup is an out-of-memory or executor failure during a transformation that requires the engine to hold or shuffle more data than the cluster’s executors can manage. This is not a startup problem and not a logic error in the sense of a wrong expression; it is a sizing and partitioning problem where the shape of the work exceeds the compute. The signature is a data flow that starts fine, runs through some of its transformation, and then fails partway with an executor or memory message rather than a message about a specific column.

The first lever is compute size. A data flow integration runtime with too few cores or a general-purpose profile may not hold a large join or aggregation, and moving to more cores or a memory-optimized profile gives the executors the headroom to complete. The second lever is partitioning, which is more surgical. A transformation that shuffles all data through a single partition, or that skews data so one partition carries most of the rows, concentrates the work and exhausts a single executor while others sit idle, the same hot-partition pathology that afflicts any distributed system. Setting a sensible partitioning scheme on the transformation, so the work spreads evenly across executors, often fixes a memory failure without any change in compute size, because the problem was distribution rather than total capacity.
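
Skew is easy to quantify once you have row counts per partition. The sketch below is a generic illustration: it assumes you have collected those counts yourself, for example from the data flow's monitoring output, and the function name is hypothetical.

```python
def skew_ratio(rows_per_partition: list[int]) -> float:
    """Return the largest partition divided by the mean partition size.

    1.0 means perfectly even; larger values mean one partition carries
    a disproportionate share of the rows.
    """
    total = sum(rows_per_partition)
    if not rows_per_partition or total == 0:
        return 0.0
    mean = total / len(rows_per_partition)
    return max(rows_per_partition) / mean
```

Four partitions holding 970, 10, 10, and 10 rows give a ratio close to four, which says one executor is doing nearly all the work; that is a distribution problem to fix with partitioning before reaching for more cores.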

The broadcast trap is the specific data flow pitfall worth calling out. A join in a data flow can use a broadcast strategy, where one side of the join is small enough to copy to every executor so the join avoids a shuffle, which is fast when the broadcast side really is small. But if the engine broadcasts a side that is actually large, because an automatic broadcast decision misjudged the size or because the data grew, every executor tries to hold a large dataset in memory and the data flow fails with a memory error. The fix is to control the broadcast behavior on the join, disabling broadcast or setting it explicitly when the automatic choice is wrong, so a large side is shuffled rather than copied everywhere. Recognizing a memory failure on a join as a possible broadcast misjudgment, rather than a pure compute-size problem, is what lets you fix it by changing the join strategy instead of throwing more cores at it.

The debug session remains the confirming tool for all of these. Running the data flow in debug with data preview on each transformation shows you exactly where in the transformation chain the failure occurs, which transformation processes the rows successfully and which one falls over, and that pinpoints whether the problem is a specific join, an aggregation, or the overall volume. Debug isolates the question from the orchestration entirely, so you are testing the transformation in isolation rather than waiting for a full pipeline run to reach the data flow and fail.

Prevention that stops a recurrence

Diagnosis is the skill, but prevention is what keeps you out of the monitoring view at unsociable hours, and the prevention measures map cleanly onto the causes above. The single highest-leverage habit is to configure retries and timeouts deliberately on activities that can fail transiently, and to leave them off where a failure is deterministic. A copy activity against a sink that occasionally throttles benefits from a small retry count with a retry interval between attempts, because a brief throttle is exactly the kind of self-correcting condition a retry is built for. A data flow that occasionally pays a cold-start cost benefits from a timeout generous enough to absorb the startup. But a linked service with a permanently wrong credential gains nothing from retries except a slower failure, so do not paper over a deterministic problem with retry settings that only delay the inevitable.
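
The retry guidance can be expressed as a function that builds an activity's policy block. This is a sketch with made-up values: the retry, retryIntervalInSeconds, and timeout keys follow the activity policy settings as I understand them, while the function name, its arguments, and every number are hypothetical.

```python
def activity_policy(failure_is_transient: bool, startup_heavy: bool = False) -> dict:
    """Build a policy dictionary for one activity (illustrative values only)."""
    policy = {
        "retry": 0,                    # deterministic failures get no retries
        "retryIntervalInSeconds": 30,
        "timeout": "0.01:00:00",       # d.hh:mm:ss, one hour
    }
    if failure_is_transient:
        # A brief throttle can clear on its own, so allow a couple of attempts.
        policy["retry"] = 2
        policy["retryIntervalInSeconds"] = 60
    if startup_heavy:
        # Leave room for a cold cluster start before the timeout fires.
        policy["timeout"] = "0.02:00:00"
    return policy
```

The default of zero retries is deliberate: retries are something you opt into for a known transient condition, not a blanket setting.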

The second prevention measure is observability that tells you a pipeline failed before a downstream consumer discovers stale data. Routing Data Factory’s run and activity telemetry to a Log Analytics workspace through a diagnostic setting lets you query failures across pipelines, build alerts on failure patterns, and spot a runtime outage the moment it starts taking activities down together rather than after the morning report comes up empty. A simple query against the activity-run logs surfaces every failure in a window with its pipeline, activity, and error, which is the same triage you do by hand in the portal, now running continuously:

ADFActivityRun
| where Status == "Failed"
| where TimeGenerated > ago(24h)
| project TimeGenerated, PipelineName, ActivityName, ErrorCode, ErrorMessage
| order by TimeGenerated desc

Wiring that query into an alert means a failure pages you with the activity and the code already attached, so you arrive at the incident already knowing which family you are dealing with. The broader practice of routing, querying, and alerting on Azure telemetry, including how a diagnostic setting is the pipe that carries logs to a workspace in the first place, is covered in the guide to Azure Monitor and Log Analytics, and a pipeline estate without that observability is one where every failure is discovered by a human noticing something is wrong rather than by a system telling you precisely what broke.

The third measure is the validation-first pipeline design from the schema section: a get-metadata or validation activity at the top that confirms preconditions, so a missing file or a changed schema fails fast with an intentional message rather than slowly with a cryptic one. The fourth is credential hygiene: reference Key Vault secrets by name rather than by pinned version so rotations do not silently break linked services, and assign managed-identity roles deliberately so an authorization gap is caught in design rather than in production. None of these is exotic. Together they convert the recurring causes from this article into conditions your pipeline either tolerates gracefully or reports clearly, which is the whole point.

When the failure hides inside a loop or a child pipeline

Two structural patterns hide the failing step one level deeper than a flat pipeline, and they trip up engineers who expect the failed step to sit right there in the top-level run. The first is the loop construct, where a single iterating step runs an inner set of work once per item in a collection. When one iteration fails, the loop reports failure, and the top-level run shows the loop as failed without immediately showing which iteration broke or why. The diagnosis is to drill into the loop’s iterations, find the specific iteration that failed and the item it was processing, and read the error on the inner step within that iteration. The item value matters because it often is the cause: one bad file in a batch of two hundred, one record that violates a constraint, one source that was unreachable while the rest responded. Reading which item the failed iteration carried frequently points straight at the data that broke rather than at any defect in the loop itself.

The loop also has a concurrency setting that interacts with the runtime capacity discussed earlier. A loop configured to run its iterations in parallel will launch many inner runs at once, and if that parallelism pushes more concurrent work onto the runtime than it can absorb, iterations can queue or time out even though each individual unit of work is fine. The fix in that case is not the inner step but the loop’s batch count, lowered so the parallel iterations stay within what the runtime can run concurrently. Distinguishing a genuine per-item failure from a concurrency-induced timeout comes back to reading the inner error: a per-item failure names something about the item, while a concurrency timeout names a timeout while the runtime shows healthy, which is the same concurrency-ceiling signal that appears outside loops.

The second pattern is the child pipeline invoked by an execute-pipeline step. When a parent pipeline calls a child and the child fails, the parent’s execute-pipeline step reports failure, but the actual failing activity lives inside the child’s own run, not the parent’s. The parent run shows you the execute step failed; the child run, reachable from that step, shows you which activity inside the child broke and carries the real error. Engineers who read only the parent run see a failed execute-pipeline step with a generic message and conclude there is nothing to diagnose, when the entire diagnosis sits in the child run one navigation away. The rule extends naturally: the failed-activity principle still holds, but with nested orchestration the failed activity may live inside a child run or a loop iteration, so following the failure down through the nesting until you reach the leaf activity that actually broke is the same discipline applied through one more layer. Once you reach that leaf activity, every cause and confirming check in this article applies exactly as it does for a flat pipeline.
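
Following a failure down through the nesting is a recursive walk. The sketch below assumes you have already assembled the parent run, child runs, and loop iterations into one nested dictionary with name, status, and children keys; that shape and the function name are hypothetical, not something the service returns in this form.

```python
def find_failed_leaves(run: dict, trail: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Return the path to every failed step that has no failed step beneath it."""
    here = trail + (run["name"],)
    if run.get("status") != "Failed":
        return []
    failed_children = [c for c in run.get("children", []) if c.get("status") == "Failed"]
    # No failed child: this is the leaf whose error is worth reading.
    if not failed_children:
        return [here]
    leaves = []
    for child in failed_children:
        leaves.extend(find_failed_leaves(child, here))
    return leaves


# Hypothetical nested run: parent -> execute-pipeline -> loop -> iteration -> copy.
parent_run = {
    "name": "ParentPipeline", "status": "Failed", "children": [
        {"name": "ExecuteChild", "status": "Failed", "children": [
            {"name": "ForEachFile", "status": "Failed", "children": [
                {"name": "iteration[orders_17.csv]", "status": "Failed", "children": [
                    {"name": "Copy_File", "status": "Failed"},
                ]},
                {"name": "iteration[orders_18.csv]", "status": "Succeeded", "children": [
                    {"name": "Copy_File", "status": "Succeeded"},
                ]},
            ]},
        ]},
    ],
}
```

The returned path names both the leaf activity and the item its iteration carried, which is often the fastest pointer to the data that broke.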

Several failures look like a pipeline error but live somewhere adjacent, and distinguishing them saves you from fixing the wrong thing. The first is a trigger that never fires, which is not a pipeline failure at all. If a scheduled pipeline simply did not run, there is no failed run to read because there is no run; the problem is the trigger, its schedule, or its activation state, and you diagnose it in the trigger configuration rather than in a run’s activity errors. A reader staring at an empty run history for a pipeline that should have run nightly is looking for a failed activity that does not exist; the absence of a run is the clue that this is a trigger problem, a fundamentally different investigation from a run that started and broke.

The second is the integration runtime offline condition that takes every activity down together. It surfaces inside pipeline runs as failed activities, which is why it gets filed under pipeline errors, but the cause and the fix live entirely in the runtime layer, not in any pipeline. The discriminator is the pattern: one activity failing is a pipeline-level cause from the table above, while every activity failing in unison across one or more pipelines is the shared runtime going down, and the fix is to restore the runtime rather than to touch the pipelines.

The third is a copy activity that throttles versus one that fails on a connection, which the table separates but which engineers routinely conflate because both can present as “the copy failed.” The timing is the tell. A connection or authorization problem fails the copy immediately, before any rows move, with authentication or authorization wording in the detail. A throttle fails the copy after it has been running, on the write side, with a rate-limit message or a 429 status in the detail. Reading whether the failure came at connect time or after data started moving routes you to the right section every time, and conflating the two sends you tuning throughput when the real problem was an expired secret, or refreshing credentials when the real problem was a sink that could not keep up.

Verdict

The discipline that fixes Azure Data Factory pipeline failures is not memorizing a hundred error strings; it is the failed-activity rule applied consistently. The pipeline status is a roll-up that points at a failed activity without explaining it, so you open that activity, read its error code, its message, and its nested inner detail, and you let that structured error tell you which of a small set of recurring causes you are dealing with. A connection that fails at connect time is a linked service or credential problem. A copy that fails on the write side after running is a sink throttle. A data flow that fails in its first seconds with a cluster message is a Spark startup problem, not a transformation bug. A sudden failure on a long-clean pipeline is a source path or schema that changed. A whole pipeline failing in unison is the integration runtime going down. Confirm the family with the one-minute check, apply the matching fix, and then rerun from the failed activity rather than restarting work that already succeeded. Do that, and the red dot on a pipeline run stops being a mystery and becomes a precise pointer to a known cause with a tested repair.

The fastest way to build the reflex is to break a pipeline on purpose and read what comes back, which is exactly what a sandbox is for; you can run the hands-on Azure labs and command library on VaultBook to stand up a pipeline, fail an activity in each of these ways, and watch the structured error change shape so you recognize each signature on sight. Once you can read the errors, drilling the diagnosis under realistic conditions is what makes it stick under pressure, so work through scenario-based troubleshooting drills on ReportMedic to rehearse the move from a generic pipeline failure to the failed activity, the right cause, and the correct rerun, until reading the activity error before touching the rerun button is simply what you do.

Frequently Asked Questions

Q: Why does my Azure Data Factory pipeline fail?

A pipeline fails because one of the activities inside it failed and no downstream path handled that failure. The pipeline status you see first is only a roll-up of the worst activity outcome, so it names the problem without explaining it. The real cause is recorded on the specific activity that broke, in its error code, message, and nested inner detail. Open the failed pipeline run in the monitoring view, find the activity carrying the error icon, and read its structured error. That tells you whether you are dealing with a connection problem, a sink throttle, a data flow cluster issue, a source change, or an integration runtime outage. Until you read that activity-level error, any rerun is a guess, and if the failure is deterministic, the rerun will reproduce it exactly. The pipeline is the orchestrator and bookkeeper; the activity is where work fails, so the activity is always where the diagnosis starts.

Q: How do I find which activity failed inside a pipeline run?

Open the pipeline runs list in the monitoring section of Data Factory Studio, click into the failed run, and you land in the activity runs view, which breaks the run into every activity with its own status. The failed activity shows a failure status and an error icon; clicking the icon reveals the error code, the message, and the nested inner detail with the actual cause. You can pull the same data from the command line with the activity-run query-by-pipeline-run command, projecting the activity name, error code, and message for each failed step so it lands in your terminal in seconds. This is faster than waiting for the portal and is scriptable for automated triage. The principle to internalize is that the pipeline-level message is a pointer to the activity, and the activity-level error is the explanation, so the activity runs view is always the second click after confirming a run failed.
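A minimal sketch of the scripted triage, assuming you have saved the JSON output of the activity-run query to a variable. The payload shape used here (a `value` list whose items carry `activityName`, `status`, and an `error` object with `errorCode` and `message`) is my understanding of that output, and the sample data is invented for illustration.

```python
import json

# Invented sample in the assumed shape of the activity-run query output.
SAMPLE = json.loads("""
{"value": [
  {"activityName": "Lookup_Watermark", "status": "Succeeded",
   "error": {"errorCode": "", "message": ""}},
  {"activityName": "Copy_Customers", "status": "Failed",
   "error": {"errorCode": "2200", "message": "Login failed for user 'etl'."}}
]}
""")


def failed_activities(result: dict) -> list:
    """Return (activity name, error code, message) for every failed activity."""
    rows = []
    for run in result.get("value", []):
        if run.get("status") == "Failed":
            err = run.get("error") or {}
            rows.append((run.get("activityName", ""),
                         err.get("errorCode", ""),
                         err.get("message", "")))
    return rows
```

Calling `failed_activities(SAMPLE)` returns only the `Copy_Customers` row, which is the projection described above.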

Q: What does the error “Operation on target failed” actually mean?

That message is the pipeline-level pointer, not the diagnosis. The phrase names the activity that broke, shown in place of the target, but it does not tell you why the activity broke. It is the roll-up the pipeline produces when an activity in the dependency chain fails and nothing downstream catches the failure. Treating this message as the full story is the most common mistake in the whole domain, because it leads to reruns against a problem you have not yet read. The fix is to click into the activity it names and read that activity’s own structured error, which carries the error code, the message describing what was attempted, and frequently a nested inner detail with the connector, driver, or Spark exception that holds the real cause. Read all three levels, in order: the pipeline pointer names the activity, the activity message gives the category, and the inner detail gives the precise fix.
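If you triage many runs, pulling the activity name out of the pointer message is easy to automate. The sketch below assumes the message follows the "Operation on target NAME failed" wording quoted in this article; the function name is hypothetical.

```python
import re
from typing import Optional

# Non-greedy capture so activity names containing spaces still match.
_TARGET = re.compile(r"Operation on target (.+?) failed")


def failed_target(pipeline_message: str) -> Optional[str]:
    """Extract the activity name from the pipeline-level pointer message."""
    match = _TARGET.search(pipeline_message)
    return match.group(1) if match else None
```

The result is only the name of the activity to open next; the cause still has to be read from that activity's own error.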

Q: Why does a copy activity throttle when writing to the sink?

A throttle happens when the copy pushes writes faster than the sink will accept, so the sink protects itself by rate-limiting or rejecting the incoming requests. The signature is a copy that starts cleanly, reads fine for a while, and then slows or fails on the write side with a 429, a request-rate-too-large message, or a write timeout. This is distinct from a connection failure, which fails immediately before any data moves. To confirm, read the copy activity’s monitoring detail for stalled write throughput and correlate it with the sink’s own throttling metrics for the same window. To fix, lower parallel copies and data integration units so the copy pushes less hard, set a sensible write batch size, enable staging so the load uses the sink’s bulk path rather than transactional inserts, or scale the sink up for the load window and back down after. Retries help only when the throttle is a brief, self-correcting spike rather than a sustained ceiling.
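As a rough sketch of "pushing less hard", the helper below halves the parallelism settings in a copy activity's `typeProperties` and sets a write batch size on the sink. The property names `parallelCopies`, `dataIntegrationUnits`, and `writeBatchSize` are the copy activity settings as I understand them; the halving rule, the fallback values, and the batch size are arbitrary assumptions, and not every sink type accepts a batch size.

```python
import copy


def ease_copy_pressure(type_properties: dict, batch_size: int = 10000) -> dict:
    """Return a copy of the settings with reduced parallelism and a batch size."""
    eased = copy.deepcopy(type_properties)  # leave the caller's dict untouched
    eased["parallelCopies"] = max(1, eased.get("parallelCopies", 4) // 2)
    eased["dataIntegrationUnits"] = max(2, eased.get("dataIntegrationUnits", 4) // 2)
    eased.setdefault("sink", {})["writeBatchSize"] = batch_size
    return eased
```

Staging is left out on purpose: enabling it also requires staging storage settings, which depend on your environment.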

Q: Why does a mapping data flow fail to start its Spark cluster?

A mapping data flow runs your transformation as Spark code on a cluster that the integration runtime provisions on demand, so the activity must acquire and start a cluster before any of your logic runs. The first run after inactivity pays a cold-start cost that can take several minutes, and a tight activity timeout or an impatient debug session reads that as a failure. The fix for the cold-start variant is to raise the integration runtime time-to-live so a warm cluster survives between runs, letting subsequent runs skip the cold start. If the cluster starts but fails partway with an executor or memory error, the compute is too small for the data volume or the shuffle, so increase the core count, choose a compute or memory optimized profile, or reduce the shuffle through better partitioning. To confirm the cluster rather than your logic is at fault, use a data flow debug session, which isolates cluster startup from transformation execution.
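For orientation, the time-to-live and core count live in the data flow properties of the Azure integration runtime definition. The builder below sketches that fragment as a Python dict; the nesting and property names reflect my understanding of the runtime JSON, and the default values are placeholders rather than recommendations.

```python
def data_flow_runtime_properties(core_count: int = 8,
                                 ttl_minutes: int = 10,
                                 compute_type: str = "General") -> dict:
    """Build the data flow part of an Azure integration runtime definition."""
    return {
        "computeProperties": {
            "location": "AutoResolve",
            "dataFlowProperties": {
                "computeType": compute_type,  # for example "General" or "MemoryOptimized"
                "coreCount": core_count,      # raise for executor or memory errors
                "timeToLive": ttl_minutes,    # minutes a warm cluster is kept after a run
            },
        }
    }
```

A longer time-to-live trades idle compute cost for skipped cold starts, so size it to the gap between your runs.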

Q: Can a bad linked service connection fail the whole pipeline?

Yes. A linked service is the connection definition an activity uses to reach a store, and if the activity cannot establish that connection, the activity fails. Unless a downstream path handles the failure, the pipeline rolls it up and stops. The signature is an activity that fails almost immediately, before moving data, with a connection or authorization word in the inner detail: login failed, connection refused, forbidden, unauthorized, or a token or secret retrieval error. Confirm it by clicking the test connection button on the linked service, which exercises the exact path in isolation and returns the same error without the rest of the pipeline. The fix depends on the flavor: refresh an expired or rotated Key Vault secret and reference it by name rather than a pinned version, grant a managed identity the correct data-plane role on the target, or correct a plain bad credential. Because this failure is deterministic, reruns will not help until the connection is repaired.
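One way to catch the pinned-version variant before it bites is to scan linked service definitions for Key Vault references that carry a version. The sketch assumes the reference is a JSON object of type `AzureKeyVaultSecret` with an optional `secretVersion` field; the function name and the sample definition are hypothetical.

```python
def pinned_secrets(node, path: str = "") -> list:
    """Return the JSON paths of Key Vault secret references that pin a version."""
    found = []
    if isinstance(node, dict):
        if node.get("type") == "AzureKeyVaultSecret" and node.get("secretVersion"):
            found.append(path or "<root>")
        for key, value in node.items():
            found.extend(pinned_secrets(value, f"{path}.{key}" if path else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(pinned_secrets(value, f"{path}[{index}]"))
    return found
```

An empty result means every reference resolves to the current secret version, so a rotation does not strand the linked service on an old value.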

Q: How do I rerun a failed pipeline from the point it failed?

Use the rerun-from-failed-activity option in the monitoring view, or the corresponding command, rather than a full rerun. Resuming from the failed activity preserves every successful upstream activity’s work and picks up execution at the broken step once you have fixed its cause, so a pipeline that succeeded on ten datasets and failed on the eleventh skips the ten and resumes at the one that broke. This is the right default when early steps are expensive or have side effects you do not want to repeat. It is not always correct: if the failure corrupted partial state, or if your fix changes the meaning of upstream work, a clean full rerun from the start is safer. The decision rule is to resume when the upstream work is still valid and costly to repeat, and to restart when the fix invalidates that work. Always read why the activity failed before either rerun, because a deterministic failure repeats from any resume point.
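A hedged sketch of the command-line form: the helper below only assembles an argument list and does not run anything. The flag names are taken from my reading of the `az datafactory pipeline create-run` command in the datafactory CLI extension, so confirm them with `--help` before relying on them; the helper itself is hypothetical.

```python
def recovery_run_args(resource_group: str, factory: str,
                      pipeline: str, failed_run_id: str) -> list:
    """Assemble CLI arguments for a recovery run that resumes at the failure."""
    return [
        "az", "datafactory", "pipeline", "create-run",
        "--resource-group", resource_group,
        "--factory-name", factory,
        "--name", pipeline,
        "--is-recovery", "true",                       # group the run with the original
        "--reference-pipeline-run-id", failed_run_id,  # the run being resumed
        "--start-from-failure", "true",                # skip activities that succeeded
    ]
```

Pass the list to `subprocess.run` only after the cause is fixed; resuming a deterministic failure reproduces it.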

Q: How do I read the full error detail of a failed activity?

Click the error icon on the failed activity in the activity runs view and read the structured error top to bottom. It has three levels that matter. The error code is a short token that categorizes the failure. The message is prose describing what the activity attempted and what went wrong. The nested inner detail, which many people stop reading before they reach, carries the underlying exception from the connector, database driver, or Spark engine, and that inner detail usually contains the exact fix. From the command line, the activity-run query returns the same error object, so you can project the code and message into a script. The habit to build is reading all three levels in order rather than reacting to the top-level pointer, because the inner detail is where a generic copy failure resolves into “the SAS token expired” or “the request rate exceeded the provisioned units,” which is the difference between guessing and knowing.

Q: Why did every activity in my pipeline fail at the same time?

When every activity that uses a particular integration runtime fails together, usually at the same timestamp, you are looking at one shared dependency going down rather than many independent bugs. The integration runtime is the compute that executes activities and carries connections, and if it is offline or unreachable, every activity that depends on it fails with a runtime-unavailable or connection-to-integration-runtime-failed message. Confirm it by checking the runtime status directly with the integration-runtime get-status command; if the state is not online or a self-hosted node reports offline, the runtime is the cause. The fix is to restore the runtime, by restarting its service or node and its connectivity, not to recreate the pipeline or rerun activities that will fail again against a runtime that is still down. The diagnostic pattern is the discriminator: one activity failing is a pipeline-level cause, while every activity failing in unison is the runtime.
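The unison pattern is simple enough to check mechanically. The sketch below groups activity runs by runtime and flags any runtime on which more than one activity ran and all of them failed. The `runtime` key is a hypothetical field you would populate from your own metadata, not a guaranteed property of the run record.

```python
from collections import defaultdict


def runtimes_down(activity_runs: list) -> list:
    """Return runtimes where several activities ran and every one failed."""
    by_runtime = defaultdict(list)
    for run in activity_runs:
        by_runtime[run["runtime"]].append(run["status"])
    return sorted(
        name for name, statuses in by_runtime.items()
        if len(statuses) > 1 and all(status == "Failed" for status in statuses)
    )
```

A single failed activity never trips the check, which matches the discriminator: one failure is a pipeline-level cause, and failures in unison point at the runtime.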

Q: How do I tell a transient pipeline failure from a deterministic one?

Read the failed activity error and ask whether the cause can correct itself without intervention. A transient failure comes from a self-correcting condition: a brief sink throttle during a load spike, a momentary connectivity blip, a short capacity pause. These present as rate or timeout messages and often succeed on a retry or a later run with no change from you. A deterministic failure comes from a wrong configuration that no amount of waiting fixes: an expired credential, a missing role, a renamed source path, a dropped column, a misconfigured runtime. These present with authentication, not-found, or schema words and fail identically on every rerun. The practical test is to confirm the cause and ask whether a rerun changes anything about it. If the rerun cannot alter the condition, it is deterministic and needs a fix before any rerun. Reserve retries and reruns for transient failures, and treat deterministic ones as configuration to repair.
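The word-based test can be sketched as a first-pass classifier. The keyword lists are assumptions for illustration and will need tuning against the messages your own connectors produce; the output is a hint, not a verdict.

```python
# Hypothetical keyword lists drawn from the signatures described above.
DETERMINISTIC_WORDS = ("login failed", "unauthorized", "forbidden",
                       "not found", "does not exist", "invalid column")
TRANSIENT_WORDS = ("429", "throttl", "timeout", "timed out", "temporarily")


def failure_kind(message: str) -> str:
    """Classify an error message as deterministic, transient, or unknown."""
    text = message.lower()
    # Deterministic words win: a missing path that also timed out still needs a fix.
    if any(word in text for word in DETERMINISTIC_WORDS):
        return "deterministic"
    if any(word in text for word in TRANSIENT_WORDS):
        return "transient"
    return "unknown"
```

Checking deterministic words first is a deliberate choice: wrongly retrying a configuration problem costs more time than manually rerunning a blip.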

Q: Why does my copy activity time out on a large dataset?

A copy timing out on volume usually means the write side cannot keep pace with the read side for the size of the load, so the activity runs long and eventually exceeds its timeout, often with throttling on the sink underneath. Read the copy detail to see whether read throughput is healthy while write throughput stalls, which points at the sink as the bottleneck. The levers are the same as for throttling: lower parallel copies and data integration units to reduce pressure, set a write batch size that groups rows efficiently, and enable staging so a database sink uses its bulk-load path rather than slow transactional inserts, which often raises effective throughput by a wide margin. You can also raise the activity timeout to accommodate a genuinely large but healthy copy, but only after confirming the copy is making steady progress rather than stalling, because a longer timeout on a stalled copy only delays the same failure.

Q: How do I set retries and timeouts on a pipeline activity?

Every activity exposes a retry count, a retry interval, and a timeout in its policy settings. Set a small retry count with a retry interval long enough for the condition to clear on activities whose failures can be transient, such as a copy against a sink that occasionally throttles, so a brief self-correcting condition is ridden out rather than failed on the first attempt. Set a timeout generous enough to absorb legitimate variance, such as a data flow that sometimes pays a cold-start cost, so a slow but healthy run is not killed prematurely. Leave retries minimal or off where failures are deterministic, because retrying a wrong credential or a missing path only produces a slower failure. The judgment is matching the policy to the failure mode: retries treat transient conditions, timeouts accommodate legitimate variance, and neither substitutes for fixing a deterministic configuration problem. Configure these per activity rather than globally, because different steps in the same pipeline have different failure characteristics.
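In the pipeline JSON these settings sit in each activity's `policy` block. The sketch below builds that block as a Python dict; the property names `timeout`, `retry`, and `retryIntervalInSeconds` are the policy fields as I understand them, while the specific numbers and the `transient_prone` switch are illustrative assumptions.

```python
def activity_policy(transient_prone: bool, timeout: str = "0.02:00:00") -> dict:
    """Build an activity policy matched to the expected failure mode."""
    return {
        "timeout": timeout,                    # d.hh:mm:ss, here two hours
        "retry": 2 if transient_prone else 0,  # no retries for deterministic steps
        "retryIntervalInSeconds": 60 if transient_prone else 30,
    }
```

Generating the policy per activity keeps the choice explicit, so a copy against a throttling sink and a lookup against a fixed path do not inherit the same settings by accident.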

Q: Can a source schema or file path change break a pipeline that worked yesterday?

Yes, and it is one of the most common causes of a sudden failure on a long-clean pipeline. If an upstream system renames a folder, moves files into a dated subfolder, fails to deliver an expected file, or adds, drops, or retypes a column, an activity that was correct against the old shape now fails. The path variant produces a not-found or path-does-not-exist error; the schema variant produces a column-not-found or mapping-mismatch error that often names the offending column. Confirm by checking whether the path the dataset resolves to exists now and whether the current source schema matches the activity’s mapping. Fix the path by parameterizing it from the trigger window or using a wildcard rather than a frozen literal, and handle schema change by enabling schema drift to tolerate it or keeping an explicit mapping to enforce a contract. A validation activity at the top of the pipeline can catch these early with a clear message.
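The schema half of that confirming check amounts to a set comparison between the columns the activity maps and the columns the source has now. A minimal sketch, with a hypothetical function name and invented column lists:

```python
def schema_drift(mapped: list, current: list) -> dict:
    """Compare mapped column names against the current source columns."""
    mapped_set, current_set = set(mapped), set(current)
    return {
        "missing": sorted(mapped_set - current_set),     # mapped but gone from source
        "unexpected": sorted(current_set - mapped_set),  # new in source, not mapped
    }
```

A rename shows up as one entry in each list, which is usually enough to spot it: `schema_drift(["id", "name"], ["id", "full_name"])` reports `name` as missing and `full_name` as unexpected.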

Q: Does a failed activity always stop the whole pipeline?

Not necessarily; it depends on how the activity is wired into the dependency graph. By default a failed activity with no failure-path handling causes the dependency chain to break, and the pipeline rolls the failure up and stops at that point. But Data Factory supports dependency conditions on the links between activities, so you can route a failure to a different branch, run a cleanup or notification activity on failure, or allow the pipeline to continue past a non-critical step. An activity can also be configured so its failure is tolerated rather than fatal, depending on the activity type and its settings. The design choice is whether a given activity’s failure should halt the run, trigger a compensating path, or be ignored. Build the dependency conditions deliberately: a critical load should stop the pipeline on failure, while an optional enrichment step might be allowed to fail without taking the run down.
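A failure branch is expressed in an activity's `dependsOn` list. The sketch below builds the dependency fragment for an activity that should run only when its upstream activity fails; it omits the activity's `type` and `typeProperties`, and the activity names are hypothetical.

```python
def on_failure(activity_name: str, upstream: str) -> dict:
    """Build the dependency fragment for an activity that runs on upstream failure."""
    return {
        "name": activity_name,
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Failed"]}
        ],
    }
```

Swapping `"Failed"` for `"Completed"` runs the activity whether the upstream step succeeded or failed, which suits cleanup steps.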

Q: How do I monitor Data Factory pipeline runs to catch failures early?

Route Data Factory’s pipeline-run and activity-run telemetry to a Log Analytics workspace through a diagnostic setting, then query and alert on it. A diagnostic setting is the pipe that carries the logs to the workspace; without it there is nothing to query. Once telemetry flows, a query against the activity-run logs filtered to failed status surfaces every failure in a window with its pipeline, activity, error code, and message, which is the same triage you do by hand in the portal, now running continuously. Wire that query into an alert and a failure pages you with the activity and code already attached, so you arrive knowing which cause family you face and whether the pattern is one activity or every activity failing together. This turns failure discovery from a human noticing stale data into a system reporting precisely what broke, and it is the difference between catching a runtime outage as it starts and discovering it after the morning report comes up empty.
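The query itself is short. The helper below assembles it as a string so the time window can be varied. It assumes the diagnostic setting writes to resource-specific tables, where activity runs land in a table named `ADFActivityRun` with the columns shown; verify the table and column names in your own workspace before wiring an alert to it.

```python
def failed_activity_query(hours: int = 24) -> str:
    """Build a Log Analytics query listing failed activity runs in a window."""
    return (
        "ADFActivityRun\n"
        f"| where TimeGenerated > ago({hours}h)\n"
        '| where Status == "Failed"\n'
        "| project TimeGenerated, PipelineName, ActivityName, ErrorCode, ErrorMessage\n"
        "| order by TimeGenerated desc"
    )
```

Paste the output of `print(failed_activity_query(6))` into the workspace query editor to test it before turning it into an alert rule.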

Q: Should I use Data Factory pipelines or Synapse pipelines for orchestration?

Both use the same pipelines engine, so the orchestration model, the activity types, the integration runtime concept, and the failure-reading skill described here transfer directly between them. The practical decision is about where the rest of your work lives. If your data estate is centered on a Synapse workspace that unifies SQL pools, Spark, and analytics, building pipelines inside Synapse keeps orchestration next to the compute and storage it drives, which simplifies the architecture. If you orchestrate across many services without a single analytics workspace at the center, standalone Data Factory is the cleaner home for the pipelines. Because the engine is shared, you are not choosing different capabilities so much as choosing where the pipelines sit relative to your other resources. The troubleshooting in this article applies to both: read the failed activity, map the signal to a cause, confirm it, fix it, and rerun from the failed step, regardless of which surface hosts the pipeline.

Q: How do I make a rerun process only the data slice that failed?

Parameterize the pipeline on the dimension that defines a slice, such as a date or a watermark, and pass the failed slice’s identity when you rerun. A pipeline built this way reads and writes only the window it is given, so rerunning it for the specific failed date reprocesses just that slice rather than reloading the entire history. Combine this with rerun-from-failed-activity so the resumed run both starts at the broken step and operates on the correct narrow window. This pattern is what separates a recovery that takes minutes from one that takes the rest of the shift, and it is far safer than a full reload because it touches only the data that actually needs reprocessing. Design the pipeline for this from the start by carrying the slice identity as a parameter and building source paths and sink targets from that parameter, so a failed slice can always be rerun in isolation without disturbing slices that already succeeded.
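To make the idea concrete, here is how a slice identity might turn into a source path. Inside a pipeline you would do this with a pipeline parameter and an expression; the Python below only illustrates the mapping, and the folder layout and function name are assumptions.

```python
from datetime import datetime


def slice_path(root: str, window_start: datetime) -> str:
    """Derive the folder for one daily slice from its window start."""
    return f"{root.rstrip('/')}/{window_start:%Y/%m/%d}/"
```

Because the path is a pure function of the slice date, rerunning for 5 March 2024 touches only `landing/customers/2024/03/05/` and leaves every other slice alone.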

Q: Why does an activity time out while the integration runtime reports online?

A self-hosted integration runtime runs a bounded number of concurrent jobs per node, so when a burst of pipelines schedules more concurrent activities than the node can run at once, activities queue for a slot, and under enough pressure they time out waiting rather than failing on anything intrinsic to the work. The confusing part is that the node reports online and healthy the whole time, which makes it look unlike an outage. The tell is a timeout combined with a node that is online with jobs queuing, which is a concurrency ceiling rather than a node being down. The fixes are to add nodes so the concurrent load spreads across more machines, to raise the node’s concurrent job capacity if the host has spare resources, or to stagger the pipeline schedule so bursts do not collide. Distinguishing this from an offline node matters because the fix is capacity and scheduling, not restarting a service.
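The arithmetic behind the queue is worth seeing once. This sketch treats capacity as nodes multiplied by concurrent jobs per node, which is a simplification, and the numbers in the usage note are invented.

```python
def queued_jobs(scheduled: int, nodes: int, concurrent_per_node: int) -> int:
    """Estimate how many activities wait for a slot during a burst."""
    capacity = nodes * concurrent_per_node
    return max(0, scheduled - capacity)
```

With 40 activities scheduled at once against a single node that runs 16 concurrently, 24 wait in the queue; with three such nodes, none do. Staggering the schedule lowers `scheduled` instead of raising capacity.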

Q: Why does my copy succeed but write fewer rows than expected?

A copy activity with fault tolerance enabled can skip rows it cannot write, such as rows that fail a type conversion or violate a sink constraint, and continue rather than failing the whole activity. When this is on, the copy reports success with a count of skipped rows, and those skipped rows can be logged to a store for inspection. So a copy that completed successfully is not always a copy that moved every row, and when downstream counts come up short the explanation is often in the skipped-row count in the copy output. Open the copy activity’s output and read the rows-copied figure against the rows-skipped figure; a non-zero skip count with a success status is fault tolerance doing exactly what it was configured to do. If you want every row to move or the activity to fail loudly, turn fault tolerance off so incompatible rows fail the copy with a conversion or constraint error, which is the signal that the source shape and the sink schema disagree.
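A small check on the copy output makes the silent-skip case visible in a script. It assumes the output object exposes `rowsCopied` and `rowsSkipped`, which is my understanding of the copy activity output; the function name and the sample values in the usage note are hypothetical.

```python
def copy_row_check(output: dict) -> str:
    """Summarize whether a successful copy actually moved every row."""
    copied = output.get("rowsCopied", 0)
    skipped = output.get("rowsSkipped", 0)
    if skipped:
        return f"{skipped} rows skipped, {copied} copied; inspect the skipped-row log"
    return f"all {copied} rows copied"
```

For an output of `{"rowsCopied": 9990, "rowsSkipped": 10}` the check reports the ten skipped rows even though the activity status is success.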

Q: Does Data Factory retry transient failures automatically?

Data Factory does not silently retry an activity unless you configure it to. Each activity has a retry count and a retry interval in its policy, and by default the retry count is zero, so a single failure fails the activity. Some connectors and the copy engine include their own internal handling for certain brief conditions, but the activity-level retry you control is what rides out a transient throttle or a momentary connectivity blip across attempts. Configure a small retry count with a retry interval long enough for the condition to clear on activities whose failures can be transient, and leave it at zero where failures are deterministic, because retrying an expired credential or a missing path only produces a slower failure. The decision is matching the policy to the failure mode rather than enabling retries everywhere, since a generous retry on a deterministic problem wastes time without changing the outcome, while a missing retry on a genuinely transient throttle turns a recoverable blip into a failed run.