Azure Data Factory: The Complete Guide

Azure Data Factory is the service most engineers reach for the moment a pipeline has to move data from one place to another on a schedule, and it is also the service most engineers misread on the first try. The misreading is predictable: people treat it as an ETL tool with a drag-and-drop canvas, copy a tutorial, watch the first run succeed against a sample blob, and then hit a wall the moment the real source sits behind a firewall, the transform needs more than a column rename, or the schedule has to honor a late-arriving file. The gap between using Azure Data Factory and understanding it is the gap between a demo that copies a CSV and a production pipeline that survives a self-hosted runtime restart, a throttled sink, and a Spark cluster that takes four minutes to spin up. This guide closes that gap by treating the service as what it actually is, an orchestration engine built from a small number of coupled parts, and showing how each part decides what your pipeline can and cannot do.


The promise here is not a feature tour. Microsoft Learn already lists every activity and every connector, and a forum thread will solve one person’s exact copy failure without ever telling you why it happened. What follows instead is a mental model you can reason from. Once you can name the five moving parts, predict which integration runtime a source requires before you provision anything, and tell a copy activity apart from a flow by the work it does rather than by the icon on the canvas, most of the failures that send people to a search engine stop being mysterious. They become consequences of a choice you can see in advance.

What Azure Data Factory Actually Is

Azure Data Factory is a managed data integration and orchestration service. Its job is to coordinate the movement and transformation of data across stores and compute services, on a schedule or in response to an event, with monitoring and retry built in. The word that matters most in that sentence is orchestration. A factory does not store your records, and it does not run heavy transformations on hardware of its own. It directs other systems to do that work, and it tracks the result. When a mapping data flow transforms a few hundred million rows, the transformation runs on a Spark cluster the service provisions on your behalf; the factory is the conductor, not the orchestra.

That framing explains the single most useful thing to hold in your head: a data factory is a control plane for the work, and the question you keep asking as you build is always the same. Where does the data live, where does it need to go, and what compute touches it along the way? Every other decision in the service follows from answering that.

A data factory is itself an Azure resource. You create one in a resource group, in a region, with a name that has to be globally unique because it backs a public endpoint. Inside that factory you author a handful of entity types that fit together like parts of a small machine. Get the parts straight and the rest of the service stops feeling like a sprawling product and starts feeling like a set of decisions you already know how to make.

What are the five moving parts of a data factory?

A factory is built from pipelines, activities, datasets, linked services, and integration runtimes. Pipelines group activities into a unit of work. Activities are the individual steps. Datasets name the data. Linked services define the connections. Integration runtimes provide the compute that actually reaches and moves the data.

Hold those five and you can read any factory someone hands you. A pipeline is a logical grouping of activities that run together to accomplish a task, and it is the unit you schedule, trigger, monitor, and rerun. An activity is a single processing step inside a pipeline, and activities fall into three families that are worth keeping distinct in your mind. Data movement activities move bytes from a source to a sink, and the canonical one is the copy activity. Data transformation activities reshape data, and they include flows, stored procedure calls, Databricks notebooks, HDInsight jobs, and others. Control activities steer the flow of the pipeline itself: ForEach to iterate, If Condition and Switch to branch, Until to loop, Execute Pipeline to call a child pipeline, Lookup and Get Metadata to read values that influence later steps, Wait to pause, and Set Variable and Append Variable to hold state across steps.

A dataset is a named, reusable pointer to the data you want to use. It does not contain the data; it describes where the data is and, often, its shape: a specific table in a database, a folder or file pattern in a storage account, the structure of a delimited file. Datasets sit on top of linked services. A linked service is the connection definition, the equivalent of a connection string, and it comes in two flavors that map cleanly onto the two things a factory needs to talk to. A data store linked service connects to a place data lives, such as Azure Blob Storage, Azure SQL Database, an on-premises SQL Server, Amazon S3, or Salesforce. A compute linked service connects to a place transformations run, such as Azure Databricks or HDInsight. Every linked service, and through it every dataset built on top of it, is reached by way of an integration runtime, which is the fifth and most consequential part.
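The chain of references is easier to see when it is written down. The sketch below is a simplified model, not the factory's real JSON: every entity name in it (CopyOrders, OnPremOrders, SelfHostedIR, and so on) is invented for illustration, and the only thing it tries to show is that an activity reaches its runtime by walking dataset to linked service to integration runtime.

```python
# Hypothetical factory entities, reduced to the references between them.
# All names are invented for illustration; real definitions carry far more detail.
linked_services = {
    "OnPremSqlServer": {"connectVia": "SelfHostedIR"},
    "LakeStorage": {"connectVia": "AutoResolveIntegrationRuntime"},
}
datasets = {
    "OnPremOrders": {"linkedService": "OnPremSqlServer"},
    "LakeOrders": {"linkedService": "LakeStorage"},
}
pipeline = {
    "name": "CopyOrders",
    "activities": [
        {
            "name": "CopyOrdersToLake",
            "type": "Copy",
            "inputs": ["OnPremOrders"],
            "outputs": ["LakeOrders"],
        },
    ],
}


def runtimes_for(activity):
    """Follow dataset -> linked service -> integration runtime for one activity."""
    resolved = {}
    for dataset_name in activity["inputs"] + activity["outputs"]:
        linked_service = datasets[dataset_name]["linkedService"]
        resolved[dataset_name] = linked_services[linked_service]["connectVia"]
    return resolved


for activity in pipeline["activities"]:
    print(activity["name"], runtimes_for(activity))
```

Reading a factory someone hands you is largely this walk done by eye: pick an activity, find its datasets, find their linked services, and note which runtime each one is bound to.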

How does the orchestration model work end to end?

Authoring produces a pipeline that references datasets, which reference linked services, which run on integration runtimes. A trigger or a manual run starts the pipeline. Each activity executes in order or in parallel as the control flow dictates, the runtime does the actual data work, and the monitoring layer records a run history you can inspect and rerun.

The execution story is worth walking through once slowly, because understanding it makes the failure modes obvious later. When a pipeline starts, the factory evaluates the control flow and dispatches each activity. A copy activity tells its integration runtime to read from the source linked service and write to the sink linked service. A mapping flow asks an Azure integration runtime to provision a Spark cluster and then runs the transformation graph on it. A control activity such as ForEach evaluates an expression and fans out child activities. Throughout, the service writes pipeline runs, activity runs, and trigger runs into a monitoring store that you can query in the studio or route to Azure Monitor. Each activity has its own status, input, output, and error detail, which is why diagnosing a failed pipeline always comes down to opening the activity that failed and reading its error rather than staring at the pipeline as a whole. The pipeline only tells you that something broke. The activity tells you what.

One detail catches people off guard early. Activities pass output to later activities through the expression language, not through shared memory or a database. A Lookup activity returns rows; a downstream activity reads them with an expression like @activity('LookupTable').output.value. A pipeline parameter is read with @pipeline().parameters.fileName. This expression-driven wiring is how a factory stays declarative and serverless, and it is also why people who expect imperative scripting sometimes find the control flow awkward until the model clicks.
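A small sketch makes the wiring concrete. The pipeline fragment below is a Python dict written in roughly the JSON shape a factory stores; the activity and variable names are hypothetical, the Lookup's own settings are omitted, and the two helper functions are illustrative checks of my own rather than anything the service provides. They confirm that an activity which reads another activity's output also declares a dependency on it.

```python
import re

# Hypothetical pipeline fragment in approximately the stored JSON shape.
# The Lookup's typeProperties (source query, dataset) are left out for brevity.
pipeline = {
    "name": "WireLookupToVariable",
    "properties": {
        "parameters": {"fileName": {"type": "String"}},
        "variables": {"rowCount": {"type": "String"}},
        "activities": [
            {"name": "LookupTable", "type": "Lookup", "dependsOn": []},
            {
                "name": "RecordRowCount",
                "type": "SetVariable",
                "dependsOn": [
                    {"activity": "LookupTable", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {
                    "variableName": "rowCount",
                    # The downstream step reads the Lookup's rows by activity name.
                    "value": "@string(length(activity('LookupTable').output.value))",
                },
            },
        ],
    },
}


def referenced_activities(expression):
    """Return the activity names an expression reads through activity('...')."""
    return set(re.findall(r"activity\('([^']+)'\)", expression))


def missing_dependencies(activity):
    """Names an activity reads from but does not declare in dependsOn."""
    declared = {dep["activity"] for dep in activity["dependsOn"]}
    expression = activity.get("typeProperties", {}).get("value", "")
    return referenced_activities(expression) - declared


for activity in pipeline["properties"]["activities"]:
    print(activity["name"], missing_dependencies(activity))
```

The point of the check is the model, not the tooling: an expression is only meaningful once the activity it names has run, so the reference in the expression and the entry in dependsOn travel together.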

The Integration Runtime Is the Center of Everything

If you take one idea from this guide, take this one. The integration runtime determines what data a factory can even reach. Pick the wrong runtime and no connection string, no firewall rule, and no credential will save you, because the compute that would do the reaching is sitting in the wrong network. This is the runtime-decides-reach rule, and it reframes a whole category of failures. When a pipeline cannot connect to a source, the instinct is to suspect the password, the firewall, or the connection string. Far more often the real cause is an integration runtime that physically cannot see the source from where it runs. A connectivity failure in a factory is usually a runtime choice made wrong, not a credential problem.

There are three kinds of integration runtime, and each exists to solve a specific reachability problem. Choosing among them is the first real design decision in any non-trivial factory, so it is worth understanding each one by the network it lives in rather than by its name.

Which integration runtime do I actually need?

Use the Azure integration runtime for fully managed work between cloud data stores reachable over the public network or a managed private endpoint. Use the self-hosted integration runtime to reach on-premises sources or anything inside a private network. Use the Azure-SSIS integration runtime to lift and run existing SSIS packages.

The Azure integration runtime is fully managed and serverless. You do not install anything; the factory provisions compute on demand and tears it down when the work finishes. It handles data movement between cloud data stores, it dispatches transformation activities to public compute, and it runs mapping flows by spinning up the Spark cluster behind them. An Azure integration runtime has a region, and that region matters for two reasons. It controls where the movement compute physically runs, which is a data residency consideration, and it affects latency to your stores. The default auto-resolve setting lets the service pick a region close to the sink, which is convenient but worth overriding when residency rules require the compute to stay in a specific geography. The thing the Azure runtime cannot do is reach into a network it has no route to. It lives on the public Azure backbone and managed private endpoints, so an on-premises database behind a corporate firewall is invisible to it. That limit is not a bug to work around with a firewall exception; it is the reason the next runtime exists.

The self-hosted integration runtime is software you install on a machine that already sits where the data is, either a server inside your on-premises network or a virtual machine inside an Azure virtual network. Because it runs on a host with a route to the private source, it can read from an on-premises SQL Server, a file share, or a database that only listens on a private endpoint, and it then moves that data to the cloud sink. It runs as a service on the host, registers to your factory with a key, and needs outbound connectivity to a set of Azure endpoints over port 443 so it can receive instructions and report status. It dials out; you never open an inbound port to it. For availability and throughput you can register more than one node against the same self-hosted runtime, and the service distributes work across the healthy nodes, so a single node going down does not strand every pipeline that depends on private data. A self-hosted runtime can also be shared across multiple data factories, which avoids installing a separate agent for every factory in the same network. When a self-hosted runtime shows offline or unavailable, the downstream effect is every pipeline that touches a private source failing at once, which is why its health deserves a monitor of its own.

The Azure-SSIS integration runtime is a managed cluster of virtual machines whose entire purpose is to run SQL Server Integration Services packages in the cloud without rewriting them. If a team has years of SSIS investment and wants to retire the on-premises servers without a rebuild, this runtime lifts those packages into a factory and executes them on Azure-hosted nodes. It is billed differently from the other two because it is, in effect, a running cluster you provision rather than serverless compute that scales to zero, so leaving it running when no packages are executing is a common and avoidable cost.

The InsightCrunch integration runtime decision table below condenses that choice. Each row pairs a source location and the work to do with the runtime and activity that fit, followed by the signal that should tip you toward that row.

Source location	Work to do	Integration runtime	Activity to use	Deciding signal
Cloud store, public endpoint	Move data as-is	Azure IR (auto-resolve or pinned region)	Copy activity	Both endpoints reachable on the public backbone
Cloud store, private endpoint	Move data as-is	Azure IR with managed private endpoint, or self-hosted IR in the VNet	Copy activity	The store only answers on a private endpoint
On-premises database or file share	Move data to cloud	Self-hosted IR on a host in that network	Copy activity	The source has no public route; a firewall fronts it
Cloud store	Heavy row-level transform	Azure IR (flow compute)	Mapping flow	Joins, aggregates, derived columns at scale
Cloud store	Simple move, light shaping	Azure IR	Copy activity (with mappings)	A column map or type cast is all that is needed
On-premises, behind firewall	Transform during move	Self-hosted IR for reach, then Azure IR for the flow	Copy to staging, then mapping flow	Reach and transform are two separate problems
Existing SSIS packages	Run packages unchanged	Azure-SSIS IR	Execute SSIS Package	A package estate you do not want to rewrite

The table encodes the rule. Reach is decided by the runtime; the kind of work is decided by the activity. Most early mistakes come from collapsing those two questions into one and assuming the default Azure runtime can both reach the source and do the transform, when the source is private and the default runtime cannot see it at all.
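The table can also be read as a lookup with two inputs. The function below is a minimal sketch of that reading: the category strings it accepts ("cloud_public", "on_premises", and so on) are labels chosen here for illustration, and it deliberately refuses combinations the table does not cover instead of guessing.

```python
def choose_runtime(source, work):
    """Map (source location, work) to (runtime, activity), following the table above.

    source: "cloud_public", "cloud_private", "on_premises", or "ssis_packages"
    work:   "move" or "transform"
    """
    if source == "ssis_packages":
        return ("Azure-SSIS IR", "Execute SSIS Package")
    if source == "on_premises":
        if work == "transform":
            # Reach and transform are two separate problems.
            return (
                "Self-hosted IR for reach, then Azure IR for the flow",
                "Copy to staging, then mapping data flow",
            )
        return ("Self-hosted IR", "Copy activity")
    if source == "cloud_private" and work == "move":
        return (
            "Azure IR with managed private endpoint, or self-hosted IR in the VNet",
            "Copy activity",
        )
    if source == "cloud_public":
        if work == "transform":
            return ("Azure IR", "Mapping data flow")
        return ("Azure IR", "Copy activity")
    raise ValueError(f"combination not covered by the table: {source!r}, {work!r}")


print(choose_runtime("on_premises", "move"))
print(choose_runtime("cloud_public", "transform"))
```

Notice that the first argument alone settles the runtime in almost every branch, while the second only changes the activity. That asymmetry is the rule the table encodes.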

Why is my source unreachable even though the credentials are correct?

Because reach is a property of the runtime, not the credential. If a linked service to an on-premises database is bound to the Azure integration runtime, the connection attempt originates from public Azure compute that has no route into the private network, so it times out regardless of how correct the username and password are.

This is the single most common confusion the runtime model creates, and it produces a frustrating symptom: the credentials test fine from a developer laptop, the firewall team swears the database accepts connections, and yet the factory reports a connection timeout. The laptop works because it sits inside the corporate network. The factory’s Azure runtime sits outside it. The fix is not a new password or a broader firewall rule that punches a hole from the public internet to the database, which would be both ineffective and dangerous. The fix is to bind that linked service to a self-hosted integration runtime installed on a host that already lives in the network where the database answers. Once the runtime can route to the source, the same credentials that always worked start working through the factory too. When a private-source pipeline keeps failing to connect, the first thing to confirm is which runtime the linked service uses, and the second is that the self-hosted node is online and has outbound connectivity. The deeper treatment of that specific failure lives in our walkthrough of how to fix a self-hosted integration runtime that shows offline, which traces every cause from a stopped service to an expired registration key.
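In the linked service definition, the binding is a single reference. The sketch below models it as a Python dict in approximately the stored JSON shape; the linked service name, the runtime name SelfHostedIR, and the placeholder connection string are assumptions for illustration, and the two helpers are not part of any SDK. A linked service with no connectVia entry falls back to the default Azure runtime, which is exactly the state that produces the timeout described above.

```python
import copy

DEFAULT_RUNTIME = "AutoResolveIntegrationRuntime"

# Hypothetical linked service for an on-premises database, with no runtime bound.
on_prem_sql = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {"connectionString": "<connection string>"},
    },
}


def bound_runtime(linked_service):
    """Return the runtime a linked service connects through."""
    connect_via = linked_service["properties"].get("connectVia")
    return connect_via["referenceName"] if connect_via else DEFAULT_RUNTIME


def rebind(linked_service, runtime_name):
    """Return a copy of the linked service bound to the named runtime."""
    rebound = copy.deepcopy(linked_service)
    rebound["properties"]["connectVia"] = {
        "referenceName": runtime_name,
        "type": "IntegrationRuntimeReference",
    }
    return rebound


print(bound_runtime(on_prem_sql))                          # default Azure runtime
print(bound_runtime(rebind(on_prem_sql, "SelfHostedIR")))  # runtime inside the network
```

Nothing about the credentials changes between the two states. Only the place the connection attempt originates from does.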

Copy Activity Versus Mapping Data Flows

The two workhorses of a factory are the copy activity and the mapping flow, and choosing between them correctly saves both money and time. They look similar on the canvas and both move data, but they are built for different jobs, run on different compute, and bill on different meters. Reaching for the wrong one is the second most common early mistake after the runtime confusion, and it is more expensive because a misplaced flow quietly provisions a Spark cluster you are paying for.

The copy activity is the lightweight path. It reads from a source and writes to a sink, with optional column mapping, type conversion, and basic file-format translation in between. It does not run a Spark cluster. On an Azure integration runtime it scales through Data Integration Units, a measure of the compute the service allocates to a single copy run, and through parallelism settings that let it read partitions concurrently. It can stage through blob storage when a sink such as a dedicated SQL pool benefits from a bulk load path, and it supports a long list of connectors on both the source and sink sides. For the enormous category of work that is fundamentally “take this data and put it over there, perhaps with a column renamed or a type adjusted,” the copy activity is the right and cheap answer.

The mapping data flow is the heavyweight path, and it earns its weight when the transformation is genuinely complex: joining several sources, aggregating across groups, deriving columns from expressions, pivoting, deduplicating, applying slowly changing dimension logic, or doing anything that needs to reason about the data row by row at scale. A mapping data flow is authored visually as a graph of transformations, but underneath it compiles to Spark and executes on a cluster that the Azure integration runtime provisions for the run. That cluster is the source of both the power and the catch. It has a startup time, often a few minutes from cold, before a single row is transformed, and it is billed by the vCore-hour for as long as it runs. You choose the compute type, such as general purpose, memory optimized, or compute optimized, and the core count, and you can set a time-to-live so a warm cluster is reused across successive data flow activities instead of paying the startup cost every time.

When should I use a mapping data flow instead of a copy activity?

Use a mapping data flow only when the transformation genuinely needs distributed row-level processing, such as joins, aggregations, pivots, or derived columns at scale. For a straight move with at most a column mapping or type cast, use the copy activity, which avoids provisioning and paying for a Spark cluster.

The decision is really about whether the work needs Spark. A column rename, a data type change, a file format conversion from CSV to Parquet, or a filtered extract are all things the copy activity does without a cluster. The moment you need to join order lines to a customer table, sum revenue by region, or compute a running total, you are doing distributed compute work that the copy activity was never meant to do, and a flow is the honest choice. The expensive anti-pattern is building a mapping flow to do a job a copy activity would have done, paying a multi-minute cluster startup and a vCore-hour bill to accomplish what a serverless copy would have finished cheaply. If you find a flow whose only transformation is a select or a rename, replace it with a copy activity and watch both the run time and the cost drop. The reverse anti-pattern, forcing complex many-source logic through chained copy activities and stored procedures because someone wanted to avoid the cluster, trades a predictable cluster bill for an unmaintainable pipeline, which is usually the worse deal.
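The test in that paragraph can be stated as a rule of thumb in a few lines. The operation labels below are shorthand invented for this sketch, grouped according to the examples in the text; a real pipeline would need its own judgment about borderline cases.

```python
# Operation labels are illustrative shorthand, grouped as the text groups them.
COPY_CAN_DO = {"rename", "type_cast", "format_conversion", "filter"}
NEEDS_SPARK = {"join", "aggregate", "pivot", "derive", "deduplicate", "running_total"}


def pick_activity(operations):
    """Choose the lighter activity unless some operation needs distributed compute."""
    requested = set(operations)
    unknown = requested - COPY_CAN_DO - NEEDS_SPARK
    if unknown:
        raise ValueError(f"unclassified operations: {sorted(unknown)}")
    return "mapping data flow" if requested & NEEDS_SPARK else "copy activity"


print(pick_activity(["rename", "format_conversion"]))
print(pick_activity(["rename", "join", "aggregate"]))
```

A flow whose operation list never leaves the first set is the expensive anti-pattern in miniature.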

There is a third option worth naming so you do not overuse the first two. When the transformation logic already exists somewhere, calling out to it beats rebuilding it in a flow. A stored procedure activity runs logic that already lives in your database. A Databricks notebook activity hands the work to a cluster your data team already manages. The factory orchestrates; it does not have to own every transformation. Part of using the service well is knowing when an activity should delegate rather than transform.

Triggers: How Pipelines Actually Start

A pipeline that never runs is just a definition. Triggers are what turn a definition into a recurring or event-driven job, and there are three kinds plus the manual run, each suited to a different timing need. Choosing the wrong trigger type produces a class of “my pipeline did not fire” problems that have nothing to do with the pipeline and everything to do with the trigger’s semantics.

The schedule trigger fires on a wall-clock recurrence: every hour, every day at 02:00, every Monday. It is the trigger most people start with, and it does exactly what the name says, firing on the clock regardless of whether new data has arrived. A schedule trigger can start more than one pipeline, and one pipeline can be started by more than one trigger. The thing it does not do is reach back in time. If you create a daily schedule trigger today, it does not retroactively run yesterday’s missed windows; it starts firing from its defined start time forward.

The tumbling window trigger is the one to reach for when each run should correspond to a fixed, non-overlapping slice of time and those slices must be processed reliably and in order. Tumbling windows are contiguous and fixed size, so an hourly tumbling window trigger produces a run for 00:00 to 01:00, then 01:00 to 02:00, and so on, with each window passed to the pipeline as a parameter through system variables. This trigger supports backfill, so you can set a start time in the past and have it process every historical window to catch up. It supports dependencies between triggers, so a downstream window can wait for an upstream one to complete. It supports retry and a concurrency limit on how many windows run at once. For data that arrives in time slices and must be processed exactly once per slice, the tumbling window trigger is the correct tool, and trying to force that behavior onto a schedule trigger is where people end up writing brittle bookkeeping logic by hand.
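The window arithmetic is simple enough to sketch directly. The generator below is an illustration of what contiguous, fixed-size windows look like during a backfill; it is not the trigger's implementation, and the dates are arbitrary.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Yield contiguous, non-overlapping (window_start, window_end) pairs.

    Only complete windows that end at or before `end` are produced.
    """
    window_start = start
    while window_start + size <= end:
        yield (window_start, window_start + size)
        window_start += size


# Backfill three hourly windows starting from a point in the past.
backfill = list(
    tumbling_windows(
        start=datetime(2024, 1, 1, 0, 0),
        end=datetime(2024, 1, 1, 3, 0),
        size=timedelta(hours=1),
    )
)
for window_start, window_end in backfill:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window's end is the next window's start, so no record falls into two slices and none falls between them. That property is what a hand-rolled schedule trigger has to reinvent with bookkeeping.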

The event trigger fires in response to a storage event rather than a clock, most commonly a blob being created or deleted. It is built on Azure Event Grid, which means the Event Grid resource provider has to be registered on the subscription for the trigger to work, a prerequisite that quietly blocks the trigger from firing if it is missing. Event triggers are how you build a pipeline that runs the moment a file lands rather than polling for it on a schedule, and the file’s path and name flow into the pipeline as parameters so the run can act on exactly the file that arrived.
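A storage event trigger definition has roughly the shape sketched below, written here as a Python dict. Treat the property names as recalled from the trigger's JSON format and confirm them against the current reference before relying on them; the trigger name, container, folder, pipeline name, and the placeholder scope are all hypothetical. The small matching function is my own illustration of how the path filters narrow which blobs start a run.

```python
# Hypothetical storage event trigger, in approximately the stored JSON shape.
event_trigger = {
    "name": "OnOrdersFileLanded",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/landing/blobs/incoming/",
            "blobPathEndsWith": ".csv",
            "events": ["Microsoft.Storage.BlobCreated"],
            "scope": "<storage account resource ID>",
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessLandedFile",
                    "type": "PipelineReference",
                },
                # The arriving file's location flows into pipeline parameters.
                "parameters": {
                    "folderPath": "@triggerBody().folderPath",
                    "fileName": "@triggerBody().fileName",
                },
            }
        ],
    },
}


def would_fire(trigger, blob_path, event):
    """Illustrative check: does this blob event pass the trigger's filters?"""
    props = trigger["properties"]["typeProperties"]
    return (
        event in props["events"]
        and blob_path.startswith(props["blobPathBeginsWith"])
        and blob_path.endswith(props["blobPathEndsWith"])
    )


print(would_fire(event_trigger, "/landing/blobs/incoming/orders.csv",
                 "Microsoft.Storage.BlobCreated"))
```

A trigger that seems dead is sometimes just a filter that never matches, so the path prefix and suffix are worth checking alongside the Event Grid registration.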

Why is my trigger not firing?

The usual causes are a trigger that was authored but never started or published, a schedule whose start time is in the future, an event trigger blocked because the Event Grid provider is not registered, or a tumbling window waiting on an unmet dependency. Confirm the trigger’s state and the trigger run history before suspecting the pipeline.

A trigger that exists is not a trigger that runs. After authoring a trigger you have to start it and publish it, and a trigger left in the stopped state simply sits there. For schedule triggers, a start time set in the future means the trigger correctly does nothing until that time arrives, which reads as a failure when it is actually obedience. For event triggers, the Event Grid provider registration is the prerequisite people forget, and without it the trigger never receives the storage notification. For tumbling window triggers, a dependency on another window that has not completed will hold a run in a waiting state by design. The trigger run history in the monitoring view is the right place to look, because it shows whether the trigger fired at all, separately from whether the pipeline it started then succeeded or failed. Separating “did the trigger fire” from “did the pipeline run well” turns a vague complaint into a precise diagnosis.

Parameters, Variables, and the Expression Language

A factory becomes reusable through parameters and expressions, and the difference between a pipeline that works for one file and a pipeline that works for a thousand is almost always whether it was parameterized. The expression language is small but it is the connective tissue that lets one pipeline serve many cases, so a little fluency pays off quickly.

Parameters are values passed into a pipeline, a dataset, or a linked service at runtime. A pipeline parameter might be the name of the file to process or the date of the window. A dataset parameter might be the folder path, so a single dataset definition serves many folders. A linked service parameter might be the database name, so one connection definition reaches several databases. Each kind is read in its own scope: a pipeline parameter with the @pipeline().parameters.name form, a dataset parameter with @dataset().name inside the dataset, and a linked service parameter with @linkedService().name inside the linked service. Together they are how you avoid hardcoding values that change from run to run.

Variables differ from parameters in that they are mutable within a single pipeline run. You declare a variable, set it with a Set Variable activity, append to it with Append Variable inside a loop, and read it later. Variables are how a pipeline accumulates state as it executes, such as building a list of files processed or holding a value computed in one branch for use in another.

The expression language itself is built from functions, system variables, and references to the outputs of prior activities. System variables expose run context: @pipeline().RunId, @pipeline().TriggerTime, and for tumbling windows the window boundaries @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime. Functions cover the usual ground of string manipulation, date math, type conversion, and collection handling, so an expression like @formatDateTime(utcNow(), 'yyyy/MM/dd') builds a date-partitioned path, and @activity('GetMetadata').output.childItems reads the list of files a Get Metadata activity discovered. The wiring pattern that unlocks dynamic pipelines is the combination of Get Metadata or Lookup to discover what to process, ForEach to iterate over the result, and parameters to pass each item into the work inside the loop. That pattern, discover then iterate then act, is how a single pipeline processes a folder of arbitrary files or a list of tables read from a control table, and it is worth rebuilding until it is second nature.
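For readers more at home in a general-purpose language, it can help to see what two of those expressions compute. The Python below is an analogy, not something a factory runs: the folder name and file names are made up, and the childItems structure is a simplified stand-in for a Get Metadata output.

```python
from datetime import datetime, timezone


def date_partitioned_path(root, moment):
    """Mirror of appending @formatDateTime(<time>, 'yyyy/MM/dd') to a root folder."""
    return f"{root}/{moment.strftime('%Y/%m/%d')}"


# Simplified stand-in for @activity('GetMetadata').output.childItems.
get_metadata_output = {
    "childItems": [
        {"name": "orders_01.csv", "type": "File"},
        {"name": "archive", "type": "Folder"},
        {"name": "orders_02.csv", "type": "File"},
    ]
}

# Keep only files, which is what a ForEach over a landing folder usually wants.
files = [
    item["name"]
    for item in get_metadata_output["childItems"]
    if item["type"] == "File"
]

print(date_partitioned_path("raw/orders", datetime(2024, 3, 5, tzinfo=timezone.utc)))
print(files)
```

In a real pipeline, a list like `files` would feed the loop, and the path expression would set the sink folder for each item.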

How do I make one pipeline process many files or tables?

Read the set of items with a Lookup or Get Metadata activity, iterate with a ForEach activity, and pass each item into the inner activities through parameters and expressions. The pipeline stays a single reusable definition while the iteration count and the items vary at runtime.

This metadata-driven design is the difference between a factory that needs a new pipeline for every table and one where a single parameterized pipeline reads a control table listing the tables to move and processes all of them. The control table holds the source, the sink, and any per-table settings; a Lookup activity reads it; a ForEach iterates the rows; and the copy activity inside the loop takes the current row’s values as parameters. Adding a new table becomes a row insert rather than a new pipeline, which is how teams scale from moving ten tables to moving hundreds without the authoring effort scaling with it. The same approach drives file processing, where Get Metadata lists the files in a landing folder and the loop processes each one, often deleting or archiving it afterward so the next run does not reprocess it.
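The loop itself is compact once written out. The sketch below shows a ForEach activity in approximately the stored JSON shape, as a Python dict, and then simulates what each iteration would receive. The control table columns, dataset names, and parameter names are hypothetical, and the resolver handles only the plain @item().field form to keep the illustration honest about what it does.

```python
# Hypothetical control table rows, as a Lookup activity might return them.
control_table = [
    {"sourceTable": "dbo.Orders", "sinkFolder": "raw/orders"},
    {"sourceTable": "dbo.Customers", "sinkFolder": "raw/customers"},
]

# ForEach over the Lookup output, with one parameterized copy inside.
foreach_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "dependsOn": [
        {"activity": "LookupControlTable", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('LookupControlTable').output.value",
            "type": "Expression",
        },
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "inputs": [{
                    "referenceName": "SourceTable",
                    "type": "DatasetReference",
                    "parameters": {"tableName": "@item().sourceTable"},
                }],
                "outputs": [{
                    "referenceName": "SinkFolder",
                    "type": "DatasetReference",
                    "parameters": {"folderPath": "@item().sinkFolder"},
                }],
            }
        ],
    },
}


def resolve_item_expression(expression, item):
    """Resolve the simple '@item().field' form against one control-table row."""
    prefix = "@item()."
    if expression.startswith(prefix):
        return item[expression[len(prefix):]]
    return expression


def plan_iterations(foreach, rows):
    """List the dataset parameters each loop iteration would pass to the copy."""
    inner = foreach["typeProperties"]["activities"][0]
    source_params = inner["inputs"][0]["parameters"]
    sink_params = inner["outputs"][0]["parameters"]
    return [
        {
            "source": {k: resolve_item_expression(v, row) for k, v in source_params.items()},
            "sink": {k: resolve_item_expression(v, row) for k, v in sink_params.items()},
        }
        for row in rows
    ]


for plan in plan_iterations(foreach_activity, control_table):
    print(plan)
```

Appending a third row to `control_table` yields a third plan with no change to `foreach_activity`, which is the whole appeal of the design.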

Tiers, Limits, and the Pricing Model

A factory does not have service tiers in the way a database does; instead it has a consumption-based pricing model with several meters, and understanding those meters is how you predict and control cost. Treat every specific number as a value to confirm against the current official pricing, because Azure revises rates and raises limits regularly, but the structure of what gets billed is stable and worth knowing cold.

Cost in a factory comes from a few distinct dimensions. Orchestration is billed by activity runs, so every activity execution counts, which means a pipeline that fans out thousands of iterations through ForEach incurs an activity-run charge per iteration. Data movement through the copy activity is billed by Data Integration Unit hours, so a copy that uses more DIUs to go faster finishes sooner but bills at a higher rate per hour; the total depends on both the rate and the duration. Mapping data flow execution is billed by vCore-hours of the Spark cluster, which is why cluster startup time and time-to-live settings have a direct cost effect: you pay for every cold start, and a warm cluster reused across activities amortizes that cost. Pipeline activities that run on external compute, such as a Databricks notebook, also incur the cost of that external compute separately. The Azure-SSIS integration runtime is billed by the running VM, independent of whether packages are executing, which makes an idle SSIS runtime a pure waste. There are also charges for operations such as monitoring reads and for inactive pipelines that sit unrun. None of these numbers should be memorized as constants; the meters and their behavior are what to internalize, and the rates are what to verify against the current pricing page when a budget depends on them.
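The structure of the bill can be written as a small formula even when the rates cannot be trusted from memory. In the sketch below every rate is a placeholder chosen to make the arithmetic easy to follow, not a real Azure price, and only the three main meters from the paragraph above are modeled.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Placeholder rates. These are NOT real Azure prices."""
    per_1000_activity_runs: float
    per_diu_hour: float
    per_vcore_hour: float


def estimate_cost(rates, activity_runs, diu, copy_hours, vcores, flow_hours):
    """Break a run's cost into orchestration, data movement, and data flow."""
    orchestration = activity_runs / 1000 * rates.per_1000_activity_runs
    movement = diu * copy_hours * rates.per_diu_hour
    data_flow = vcores * flow_hours * rates.per_vcore_hour
    return {
        "orchestration": orchestration,
        "movement": movement,
        "data_flow": data_flow,
        "total": orchestration + movement + data_flow,
    }


placeholder = Rates(per_1000_activity_runs=1.0, per_diu_hour=0.25, per_vcore_hour=0.5)
estimate = estimate_cost(
    placeholder,
    activity_runs=2000,  # e.g. a ForEach fanning out across many iterations
    diu=4, copy_hours=0.5,
    vcores=8, flow_hours=0.25,
)
print(estimate)
```

Swapping in current rates from the pricing page turns this from an illustration into a rough budget check, and the breakdown shows which meter dominates before any tuning starts.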

Limits follow the same principle. There are caps on the number of entities in a factory, on concurrent pipeline runs, on activities per pipeline, on the number of self-hosted integration runtime nodes you can register against one runtime, on Data Integration Units per copy activity, and on the queue depth for runs. These limits are generous enough that most workloads never approach them, but a metadata-driven pipeline that fans out to tens of thousands of iterations can bump into a concurrency or queue limit, at which point batching the work or staging it across pipeline runs is the answer rather than assuming the platform will absorb unbounded fan-out. As with prices, the specific numbers move, so the discipline is to design with the existence of the limit in mind and confirm the current value when you are building something that pushes against it.
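Batching is the simplest way to keep fan-out bounded. The helper below is a generic sketch: the batch size of 10 and the table names are arbitrary, and how each batch is handed to a pipeline run (a child pipeline call, a separate trigger, or a queue) is left open.

```python
def batches(items, batch_size):
    """Split a list of work items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


tables = [f"table_{n:02d}" for n in range(25)]
for number, batch in enumerate(batches(tables, 10), start=1):
    print(f"batch {number}: {len(batch)} tables")
```

The batch size then becomes the knob you set against whichever limit the workload is pressing on, once you have confirmed its current value.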

What does Azure Data Factory cost and how do I keep it down?

It bills on activity runs, copy Data Integration Unit hours, flow vCore-hours, any external compute you invoke, and the Azure-SSIS runtime by the VM. The largest controllable costs are usually flow clusters and an idle SSIS runtime, so right-size flows, set a sensible cluster time-to-live, and stop the SSIS runtime when it is not running packages.

The cost levers that matter most are the ones tied to compute you provision rather than the per-run orchestration charges, which are typically small. A mapping data flow that runs on an oversized cluster, or that pays a cold start on every activity because no time-to-live keeps a cluster warm across a sequence of flows, is where money leaks. Right-sizing the flow compute to the volume, enabling a time-to-live when several flows run in succession, and replacing flows that do trivial work with copy activities together cut the largest line items. An Azure-SSIS integration runtime left running overnight while no packages execute is the other classic waste, and the fix is to schedule it to start before the package run and stop after, or to start and stop it from the pipeline itself. For copy-heavy workloads, the trade-off is between Data Integration Units and duration: more units finish faster at a higher hourly rate, so the cheapest setting is rarely the lowest unit count if it stretches the run long enough to cost more in total. Measuring the actual run, rather than guessing, is the only way to find the economical point. Routing the factory’s diagnostics to a Log Analytics workspace, as described in our guide to working with Azure Monitor and Log Analytics, gives you the run history and durations to base that measurement on.
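The Data Integration Unit trade-off is easiest to see with numbers. The durations below are invented to illustrate the shape of the curve, not measured from any real copy, and the rate is a placeholder; the method, multiplying units by measured hours and comparing, is the part that carries over.

```python
# Hypothetical measurements: DIU setting -> observed copy duration in hours.
measured_hours = {2: 3.0, 4: 1.2, 8: 0.7, 16: 0.5}

PLACEHOLDER_RATE_PER_DIU_HOUR = 1.0  # not a real price


def copy_cost(diu, hours, rate=PLACEHOLDER_RATE_PER_DIU_HOUR):
    """Cost of one copy run: units allocated x hours run x rate per DIU-hour."""
    return diu * hours * rate


def cheapest_setting(measurements):
    """Return the DIU setting with the lowest total cost for the measured runs."""
    return min(measurements, key=lambda diu: copy_cost(diu, measurements[diu]))


for diu, hours in measured_hours.items():
    print(f"{diu:>2} DIU x {hours} h -> cost {copy_cost(diu, hours):.1f}")
print("cheapest:", cheapest_setting(measured_hours))
```

With these made-up figures the lowest unit count is not the cheapest, because it stretches the run, and the highest is not either, because the speed-up flattens out. The economical point sits in between and only shows up once the runs are measured.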

Failure Modes and How to Avoid Them

Most factory failures fall into a handful of recognizable shapes, and knowing them in advance turns an incident into a checklist. The value of the orchestration model is that each failure mode maps to a specific part of the machine, so the diagnosis starts by asking which part broke.

The first and most common shape is the pipeline that reports a failure with an Operation on target ... failed message. This is not a single error; it is the pipeline telling you that one of its activities failed and naming which one. The actionable detail is never in that summary line. It is in the failed activity’s own error output, which carries the real message from the source, the sink, or the runtime: a permission denied, a throttling response, a schema mismatch, a cluster that could not start. The discipline that saves time is to open the activity that failed, read its error, and only then form a hypothesis. Blind reruns without reading the activity error are how an afternoon disappears, because a deterministic failure such as a wrong schema or a missing permission will fail identically every time. When the failure is transient, such as a brief sink throttle, a rerun from the failed activity is exactly right, and the service lets you rerun from the point of failure rather than from the top, which matters when the early activities are expensive. The full diagnostic procedure for these pipeline failures, cause by cause, is laid out in our walkthrough of how to fix Data Factory pipeline failures, which pairs each Operation on target cause with the command or view that confirms it.
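The reasoning about when a rerun helps can be sketched as a retry loop that distinguishes the two kinds of failure. This is an illustration of the judgment described above, not the factory's built-in activity retry, and the exception classes and the flaky step are invented for the example.

```python
class TransientError(Exception):
    """A failure that may clear on its own, such as a brief sink throttle."""


class DeterministicError(Exception):
    """A failure that will repeat identically, such as a schema mismatch."""


def run_with_retry(step, retries):
    """Run step, retrying only transient failures. Returns (result, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return step(), attempts
        except TransientError:
            if attempts > retries:
                raise
        # A DeterministicError is not caught: rerunning would fail the same way.


def make_flaky(failures):
    """Build a step that raises TransientError `failures` times, then succeeds."""
    remaining = {"count": failures}

    def step():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("sink throttled")
        return "ok"

    return step


print(run_with_retry(make_flaky(2), retries=3))
```

The deterministic branch is the one that saves an afternoon: when the activity error says schema or permission, the next step is to fix the cause, not to press rerun.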

The second shape is the self-hosted integration runtime that goes offline, and its signature is unmistakable: every pipeline that touches a private or on-premises source fails at once, while pipelines that only touch public cloud stores keep working. That pattern points straight at the runtime rather than at any individual pipeline. The causes cluster around the runtime host: the Windows service stopped, the node lost the outbound connectivity it needs to the Azure endpoints, the registration key expired or was rotated, a single node failed where high availability would have needed several, or a version mismatch between nodes broke the cluster. Confirming it means checking the node status and the service state on the host and running the connectivity test, and the reason it deserves its own monitor is that nothing in the pipeline is wrong; the compute that the pipeline depends on simply vanished.
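
The signature described above can be checked mechanically once run history is exported somewhere you can query. The following is a minimal sketch, assuming a hypothetical list of run records that each carry a pipeline name, the runtime the pipeline depends on, and a status; the record layout and the runtime names are illustrative and are not a Data Factory API.

```python
from collections import defaultdict

def suspect_runtimes(runs, min_pipelines=2):
    """Return runtimes where every recent run failed across several pipelines."""
    by_runtime = defaultdict(list)
    for run in runs:
        by_runtime[run["runtime"]].append(run)

    suspects = []
    for runtime, group in by_runtime.items():
        pipelines = {r["pipeline"] for r in group}
        all_failed = all(r["status"] == "Failed" for r in group)
        # One failing pipeline is a pipeline problem; many failing at once
        # on the same runtime points at the runtime itself.
        if all_failed and len(pipelines) >= min_pipelines:
            suspects.append(runtime)
    return sorted(suspects)

# Hypothetical run history: names are placeholders for illustration only.
recent_runs = [
    {"pipeline": "load_orders", "runtime": "selfhosted-ir", "status": "Failed"},
    {"pipeline": "load_customers", "runtime": "selfhosted-ir", "status": "Failed"},
    {"pipeline": "load_inventory", "runtime": "selfhosted-ir", "status": "Failed"},
    {"pipeline": "copy_blob_to_lake", "runtime": "azure-ir", "status": "Succeeded"},
    {"pipeline": "copy_events", "runtime": "azure-ir", "status": "Failed"},
    {"pipeline": "copy_events", "runtime": "azure-ir", "status": "Succeeded"},
]

print(suspect_runtimes(recent_runs))
```

The threshold of two pipelines is arbitrary; the point is that the grouping key is the runtime, not the pipeline, which is the same shift in perspective the diagnosis requires.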

The third shape is the copy activity that throttles against its sink. When the destination, such as a database or a Cosmos DB container, returns throttling responses because the copy is pushing harder than the sink’s provisioned throughput allows, the copy slows or fails with a throttling error that originates from the sink, not from the factory. The fix lives on the sink side, by raising the throughput, by tuning the copy’s parallelism and batch size down to a sustainable rate, or by staging the load through a bulk path that the sink handles more efficiently. Reading the error correctly is what points you at the sink rather than sending you to tweak factory settings that are not the cause.

The fourth shape is the mapping flow that fails to start its Spark cluster or times out waiting for it. Because a data flow depends on a cluster the runtime provisions, a capacity or configuration problem at cluster start surfaces as the data flow failing before any transformation runs. The startup latency is also a normal cost, not a fault, so a flow that simply takes a few minutes before processing begins is behaving as designed, and the way to remove that latency from a sequence of flows is a time-to-live that keeps the cluster warm rather than treating the wait as a bug.

Why does my pipeline say “Operation on target … failed” with no useful detail?

That message is only the pipeline-level summary naming the activity that failed. The actionable error is inside that activity’s own error output in the monitoring view, where the real cause appears, whether a permission denial, a sink throttle, a schema mismatch, or a runtime outage. Open the failed activity, not the pipeline, to diagnose it.

The recurring theme across all four shapes is that the orchestration model gives you a precise place to look. A connectivity failure points at a runtime. A throttle points at a sink. A data flow start failure points at the cluster. A vague pipeline error points at the activity beneath it. Diagnosis in a factory is almost always a matter of descending from the pipeline to the specific part that owns the failure, reading the error there, and matching it to one of these known shapes rather than guessing at the level of the whole pipeline.
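
The descent from pipeline to activity can be sketched in a few lines. This is an illustration under stated assumptions: the run is a hypothetical dictionary, the error text is a placeholder rather than a real service message, and the `shape` field stands for the label an engineer assigns after reading the activity error.

```python
# Which part of the machine owns each failure shape, per the discussion above.
OWNER_BY_SHAPE = {
    "connectivity": "integration runtime",
    "throttle": "sink",
    "cluster_start": "data flow cluster",
}

def diagnose(pipeline_run):
    """Descend from the pipeline-level summary to the activities that failed."""
    findings = []
    for activity in pipeline_run["activity_runs"]:
        if activity["status"] != "Failed":
            continue
        findings.append({
            "activity": activity["name"],
            "error": activity["error"],
            "owner": OWNER_BY_SHAPE.get(activity["shape"], "read the activity error"),
        })
    return findings

# Hypothetical run: activity names and error text are placeholders.
run = {
    "summary": "Operation on target LoadSales failed",
    "activity_runs": [
        {"name": "LookupWatermark", "status": "Succeeded", "error": None, "shape": None},
        {"name": "LoadSales", "status": "Failed",
         "error": "<placeholder: throttling response from the sink>",
         "shape": "throttle"},
    ],
}

for finding in diagnose(run):
    print(finding["activity"], "->", finding["owner"])
```

Note that the function never reads the summary line. That is deliberate: the summary only names the activity, and everything actionable lives on the activity run.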

Error Handling and Retries Inside a Pipeline

A pipeline that only describes the happy path is a pipeline that pages someone at 03:00, so building error handling into the control flow is part of authoring rather than an extra. The factory gives you the primitives to handle failures gracefully, and using them turns a fragile chain into a flow that recovers, cleans up, or alerts on its own terms.

The first primitive is the activity dependency condition, which controls when a downstream activity runs based on how the upstream one finished. An activity can depend on its predecessor succeeding, failing, completing regardless of outcome, or being skipped, and those four conditions are how you wire branching behavior into the flow. A cleanup activity wired on the failure condition of a load runs only when the load fails, so you can delete a partial output or write a failure record without that cleanup firing on every successful run. An activity wired on completion runs whether the predecessor succeeded or failed, which suits a step that must always happen, such as releasing a resource. Composing these conditions is how you build a try-and-then-clean-up shape out of the declarative control flow rather than wishing for an imperative exception block.
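
To make the four conditions concrete, here is a small simulation of how they gate downstream activities. It is a simplified model written for this guide, not the service's scheduler: each activity lists its upstream dependencies with one condition apiece, and an activity whose conditions are not all met is marked Skipped. The activity names are hypothetical.

```python
def satisfied(condition, upstream_status):
    """Does an upstream result meet a dependency condition?"""
    if condition == "Completed":
        # Completed means the upstream ran to an end, whatever the outcome.
        return upstream_status in ("Succeeded", "Failed")
    return condition == upstream_status  # Succeeded, Failed, or Skipped

def run_pipeline(activities, outcomes):
    """activities: ordered list of (name, [(upstream, condition), ...]).
    outcomes: the result each activity would have if it actually runs;
    anything not listed is assumed to succeed."""
    status = {}
    for name, depends_on in activities:
        if all(satisfied(cond, status[up]) for up, cond in depends_on):
            status[name] = outcomes.get(name, "Succeeded")
        else:
            status[name] = "Skipped"
    return status

activities = [
    ("load", []),
    ("cleanup_partial", [("load", "Failed")]),
    ("release_lock", [("load", "Completed")]),
    ("next_stage", [("load", "Succeeded")]),
]

print(run_pipeline(activities, {"load": "Failed"}))
print(run_pipeline(activities, {}))
```

When the load fails, the cleanup and the lock release run and the next stage is skipped; when the load succeeds, the cleanup is skipped and the other two run. That is the try-and-then-clean-up shape expressed purely through conditions.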

The second primitive is the retry policy on an individual activity, which automatically reattempts a failed activity a set number of times with a set interval before giving up. This is the right tool for transient failures, such as a brief network blip or a momentary throttle, where a second attempt a few seconds later simply succeeds. It is the wrong tool for deterministic failures, such as a missing permission or a wrong schema, where every retry fails identically and the only effect is to delay the inevitable error while consuming activity runs. Setting a sensible retry count and interval on activities that touch flaky dependencies absorbs the transient failures that would otherwise wake someone, while leaving deterministic failures to surface quickly so they can be fixed rather than retried into a longer outage.
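
The difference between the two cases shows up clearly in a small simulation. The sketch below models a retry policy as one initial attempt plus a fixed number of reattempts at a fixed interval; the function names, the error text, and the injected `sleep` parameter are all assumptions made for the illustration, and the policy retries every failure without trying to tell transient from deterministic.

```python
import time

def run_with_retry(activity, retry, interval_seconds, sleep=time.sleep):
    """Attempt once, then up to `retry` more times, waiting between attempts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"status": "Succeeded", "attempts": attempts, "output": activity()}
        except Exception as exc:
            if attempts > retry:
                return {"status": "Failed", "attempts": attempts, "error": str(exc)}
            sleep(interval_seconds)

def make_flaky(failures):
    """Build an activity that fails a fixed number of times, then succeeds."""
    state = {"left": failures}
    def activity():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("placeholder: momentary throttle")
        return "loaded"
    return activity

def missing_permission():
    """A deterministic failure: identical on every attempt."""
    raise PermissionError("placeholder: permission denied")

def no_sleep(seconds):
    """Stand-in for time.sleep so the demo runs instantly."""

transient = run_with_retry(make_flaky(1), retry=3, interval_seconds=30, sleep=no_sleep)
deterministic = run_with_retry(missing_permission, retry=3, interval_seconds=30, sleep=no_sleep)
print(transient["status"], transient["attempts"])
print(deterministic["status"], deterministic["attempts"])
```

The transient case succeeds on the second attempt. The deterministic case burns all four attempts and three waits before reporting the same error it had at the start, which is the delay the paragraph above warns about.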

The third primitive is the timeout, which caps how long an activity may run before the factory fails it. A copy or a data flow with no sensible timeout can hang indefinitely against a stuck source or a cluster that never starts, holding the pipeline open and the compute billing. A timeout converts an indefinite hang into a clean, bounded failure that the dependency conditions and retry policy can then respond to, which is why leaving the default in place is rarely the right call for a long-running activity against an external system.

How do I make a pipeline clean up after itself when an activity fails?

Wire the cleanup activity on the upstream activity’s failure dependency condition so it runs only when that activity fails, and use the completion condition for steps that must run regardless of outcome. Combine these with per-activity retry policies for transient errors and timeouts to bound hangs, and the control flow handles failure deliberately rather than just stopping.

The combination of these primitives is what lets a pipeline express intent about failure. A load activity with a retry policy absorbs transient sink hiccups, a timeout bounds it against a hang, a cleanup activity on its failure condition removes a partial write if it ultimately fails, and a notification activity on the same failure condition records the incident, while the success path continues to the next stage. None of this requires imperative code; it is the declarative control flow used deliberately, with the dependency conditions doing the branching that a try-catch block would do in a procedural language. Pipelines built this way fail safely and recover on their own where recovery is possible, which is the difference between a pipeline that needs babysitting and one that earns trust.
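
As a rough picture of how that combination looks in a pipeline definition, the sketch below assembles the activity list as Python dictionaries and prints it as JSON. The property names (`policy`, `timeout`, `retry`, `retryIntervalInSeconds`, `dependsOn`, `dependencyConditions`) follow the pipeline JSON format as best understood here and should be verified against the current schema; the activity names are hypothetical, and each activity's `typeProperties` are omitted, so this fragment is illustrative rather than deployable.

```python
import json

# The load carries the retry policy and the timeout (d.hh:mm:ss format).
load = {
    "name": "LoadSales",
    "type": "Copy",
    "policy": {"timeout": "0.02:00:00", "retry": 3, "retryIntervalInSeconds": 60},
}

def after(name, activity_type, upstream, condition):
    """Build an activity that runs when `upstream` meets `condition`."""
    return {
        "name": name,
        "type": activity_type,
        "dependsOn": [{"activity": upstream, "dependencyConditions": [condition]}],
    }

activities = [
    load,
    after("DeletePartialOutput", "Delete", "LoadSales", "Failed"),
    after("NotifyFailure", "WebActivity", "LoadSales", "Failed"),
    after("ReleaseLock", "SqlServerStoredProcedure", "LoadSales", "Completed"),
    after("NextStage", "ExecutePipeline", "LoadSales", "Succeeded"),
]

print(json.dumps({"activities": activities}, indent=2))
```

Both failure-path activities hang off the same failure condition of the load, so they fire only after the retries are exhausted and the load has finally failed, not on each individual attempt.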

Schema Drift and Mapping Mismatches

Real source data does not hold still, and a pipeline that assumes a fixed schema breaks the day a source adds, renames, or reorders a column. Handling schema change deliberately is one of the quieter marks of a mature pipeline, because the failures it prevents are the intermittent, hard-to-reproduce kind that strike only when the source changes.

The copy activity handles schema through its mapping, which pairs source columns to sink columns. An explicit mapping is precise and predictable: it states exactly which source column lands in which sink column, with any type conversion declared. The cost of an explicit mapping is that a source schema change can break it, since a renamed or removed column no longer matches the mapping. An automatic mapping, where the copy infers the pairing (by column name when the source carries names, by position when it does not), is more tolerant of additive change but can silently misroute data if columns are reordered and the pairing is by position. The right choice depends on how stable the source is and how strict the sink is: a strict sink with a fixed schema rewards an explicit mapping that fails loudly on a mismatch, while a tolerant landing zone that captures whatever arrives may prefer the flexibility of automatic mapping.
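
The trade-off is easier to see with rows in hand. The following is a toy simulation of three pairing strategies, written for this guide and not a reproduction of the copy activity's engine; the column names and values are made up.

```python
def apply_explicit(row, mapping):
    """Explicit mapping: fail loudly if a mapped source column is absent."""
    missing = [src for src in mapping if src not in row]
    if missing:
        raise KeyError(f"source columns missing from row: {missing}")
    return {sink: row[src] for src, sink in mapping.items()}

def apply_by_name(row, sink_columns):
    """Inferred pairing by name: copies whatever matches, ignores the rest."""
    return {col: row[col] for col in sink_columns if col in row}

def apply_by_position(values, sink_columns):
    """Inferred pairing by position: trusts the column order blindly."""
    return dict(zip(sink_columns, values))

mapping = {"cust_id": "CustomerId", "amt": "Amount"}
renamed_row = {"customer_id": 7, "amt": 12.5}  # source renamed cust_id

try:
    apply_explicit(renamed_row, mapping)
except KeyError as exc:
    print("explicit mapping failed loudly:", exc)

# By name: the renamed column quietly disappears from the output.
print(apply_by_name(renamed_row, ["cust_id", "amt"]))

# By position: a reordered source puts the amount in the id column.
print(apply_by_position([12.5, 7], ["cust_id", "amt"]))
```

Only the explicit mapping stops the run. The other two keep going and hand the problem to whoever reads the sink later, which is the behavior to choose on purpose rather than by default.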

Mapping flows offer a schema drift feature precisely for sources whose shape changes, allowing a flow to carry and process columns it was not explicitly designed for, so a new column flows through to the sink without a redesign. This is powerful for ingestion patterns where the source genuinely evolves, but it is a deliberate choice rather than a default to enable everywhere, because a flow that silently accepts any column also silently accepts a mistake in the source. The judgment is whether you want the pipeline to adapt to source change automatically or to halt and demand attention when the source changes, and both are legitimate depending on whether the source is a trusted, evolving feed or one where an unexpected change signals a problem upstream.

Why does my pipeline fail when the source adds or renames a column?

An explicit copy mapping pairs specific source columns to sink columns, so a renamed or removed column (or a reordered one, when columns are matched by position) no longer matches and the activity fails. Decide deliberately between an explicit mapping that fails loudly on schema change and an automatic mapping or a data flow’s schema drift handling that tolerates change, based on how stable the source is.

The reason schema handling deserves a deliberate decision is that the two failure modes are opposites and you have to choose which one you prefer. A strict explicit mapping fails the moment the source changes, which is loud and inconvenient but safe, because no data is silently misrouted or dropped. A tolerant automatic or drift-aware approach keeps running through source changes, which is convenient but can let a structural problem flow through unnoticed until it surfaces as bad data downstream. Neither is universally right. A pipeline feeding a strict analytical model that must not receive surprise columns wants the loud failure, while an ingestion pipeline whose job is to capture an evolving source into a flexible landing zone wants the tolerance. Making the choice on purpose, and documenting why, prevents the worst outcome, which is a pipeline that was assumed to be strict quietly accepting drift, or one assumed to be tolerant breaking on a change nobody expected it to reject.

Tuning Data Movement and Transformation

Getting a pipeline to run is the first milestone; getting it to run efficiently is where the reasoning pays off, and the copy activity and the mapping flow each expose levers that behave predictably once you know what they control. Tuning blindly wastes money on one side and time on the other, so the goal is to match the lever to the bottleneck rather than to turn every dial up.

The copy activity’s main throughput control is the Data Integration Unit setting, which governs how much compute the Azure runtime allocates to a single copy run. More units let the copy read and write with more parallelism, finishing sooner, but they bill at a higher hourly rate, so the cheapest run is not always the one with the fewest units: a low setting that stretches a copy across many hours can cost more in total than a higher setting that finishes quickly. Alongside the unit count, the copy activity exposes a degree of parallelism that controls how many concurrent connections it opens to read partitions of the source, which matters when the source can be partitioned, such as a large table split by a key range or a folder of many files. Reading partitions concurrently is how a single copy activity saturates the available throughput rather than trickling data through one connection. The honest way to find the right settings is to run the copy, read the actual duration and throughput the monitoring view reports, and adjust from measurement rather than from a guess about what should be fast.
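
The arithmetic behind that claim is short enough to write down. In the sketch below the rate per unit-hour is a placeholder, not a real price, and the durations stand for numbers you would read from the monitoring view after actual runs at each setting.

```python
def copy_cost(units, hours, rate_per_unit_hour):
    """Total cost of one copy run: units x duration x rate."""
    return units * hours * rate_per_unit_hour

RATE = 0.25  # placeholder price per unit-hour, for illustration only

# Hypothetical measured durations (hours) at three unit settings.
measured_runs = {4: 6.0, 16: 1.0, 32: 0.75}

costs = {units: copy_cost(units, hours, RATE) for units, hours in measured_runs.items()}
cheapest = min(costs, key=costs.get)

for units, cost in costs.items():
    print(f"{units:>2} units: {measured_runs[units]} h, cost {cost:.2f}")
print("cheapest setting:", cheapest)
```

With these made-up numbers the lowest setting and the highest setting cost the same, and the middle setting wins, because the copy scaled well up to a point and then stopped getting faster. The real curve for a given source and sink is only discoverable by measuring.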

For loads into a sink that has a bulk path, staging changes the economics. When the destination is a dedicated SQL pool, for example, a direct row-by-row insert is slow, while staging the data in blob storage and then invoking the sink’s bulk load path moves far more data per unit of time. The copy activity supports this staged copy pattern, where it lands the data in an interim blob location and then triggers the efficient bulk load into the sink. The trade-off is the transient storage and the extra hop, which is almost always worth it for large loads into a sink built for bulk ingestion and pointless for small loads where the staging overhead exceeds the gain. Recognizing when a sink rewards staging, rather than staging everything reflexively, is the judgment that separates a tuned pipeline from a superstitious one.

How do I speed up a slow copy activity?

Raise the Data Integration Units to allocate more copy compute, increase the degree of copy parallelism when the source can be partitioned so multiple connections read at once, and use a staged copy through blob storage when the sink has an efficient bulk load path. Measure the throughput the monitoring view reports and tune from that number rather than guessing.

The bottleneck determines which lever helps, which is why measurement comes first. If the source is the limit, partitioning and parallel reads help; if the sink is throttling, the answer is on the sink side through more provisioned throughput or a bulk path, not more copy units that simply push harder against a wall. If the copy compute itself is the limit, more Data Integration Units help. Throwing units at a sink-bound copy wastes money without speeding anything, because the data already arrives faster than the sink will accept it, which is exactly the throttling case discussed earlier. Reading the monitoring view to see whether the time is spent reading, writing, or waiting tells you which of these worlds you are in, and the right lever follows directly from that diagnosis.

Inside a mapping flow the tuning surface is different because the work runs on Spark. The optimize settings on each transformation control partitioning, which governs how the data is distributed across the cluster’s workers, and getting partitioning wrong is a common cause of a flow that runs far slower than the volume should require. The debug mode lets you preview the data and the row counts at each transformation while authoring, which is how you confirm a join is matching the rows you expect or an aggregate is grouping correctly before you run the flow at full scale. The compute type and core count of the flow cluster, set on the integration runtime, size the cluster to the data, and a time-to-live keeps a cluster warm across successive flows so a sequence of transformations does not pay a cold start each time. The pattern that keeps data flow cost sane is to size the cluster to the actual data volume, reuse a warm cluster across a run of several flows, and reserve flows for transformations that genuinely need Spark, sending lighter work to copy activities instead.
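
The effect of a time-to-live on a sequence of flows can be estimated with simple arithmetic. The sketch below assumes a fixed cold-start time and assumes a warm cluster is reused with negligible startup, which is a simplification; the minute values are illustrative, and the calculation covers elapsed time only, not billing for any idle time the time-to-live keeps the cluster alive.

```python
def elapsed_minutes(flow_minutes, cold_start_minutes, warm_cluster):
    """Wall-clock time for flows run back to back.

    warm_cluster=True models a time-to-live that keeps the cluster alive
    between flows, so only the first flow pays the cold start.
    """
    if not flow_minutes:
        return 0
    starts = 1 if warm_cluster else len(flow_minutes)
    return starts * cold_start_minutes + sum(flow_minutes)

flows = [6, 6, 6, 6, 6]  # hypothetical processing minutes per flow
print("cold start every time:", elapsed_minutes(flows, 4, warm_cluster=False))
print("warm cluster reused:  ", elapsed_minutes(flows, 4, warm_cluster=True))
```

With five six-minute flows and a four-minute start, the sequence drops from 50 minutes to 34. The saving grows with the number of flows in the sequence and disappears for a single isolated flow, which is why the setting pays off only when flows actually run in succession.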

Incremental Loading: The Pattern That Defines Real Pipelines

Almost every production pipeline that matters loads data incrementally rather than reloading everything on each run, and building incremental loading correctly is one of the clearest tests of whether someone understands the service or merely copied a full-load tutorial. Reloading an entire source on every run is acceptable for a tiny table and ruinous for a large one, both in run time and in the load it puts on the source, so the moment data grows, incremental loading stops being optional.

The watermark pattern is the standard approach, and it fits the factory’s parts cleanly. A watermark is a value, usually a timestamp or an incrementing key, that marks how far the last run got. The pipeline reads the stored watermark, queries the source for only the rows newer than that watermark, loads them, and then updates the stored watermark to the high value it just processed so the next run picks up exactly where this one left off. In factory terms, a Lookup activity reads the current watermark from a control table, the copy activity’s source query filters on it with an expression so only new rows are extracted, and a stored procedure or a second activity updates the watermark after a successful load. The whole pattern is parameter and expression work over the parts already covered, which is why understanding the orchestration model makes incremental loading straightforward rather than mysterious.
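
The three steps map onto ordinary SQL, which makes the pattern easy to prototype outside the factory. The sketch below uses an in-memory SQLite database to stand in for the source, the sink, and the control table; the table and column names are hypothetical, and in a real pipeline the lookup, the filtered copy, and the update would be separate activities against separate systems.

```python
import sqlite3

def make_demo_db():
    """Create a toy source, sink, and watermark control table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_orders (id INTEGER, modified_at TEXT);
        CREATE TABLE sink_orders (id INTEGER, modified_at TEXT);
        CREATE TABLE watermark (name TEXT PRIMARY KEY, value TEXT);
        INSERT INTO watermark VALUES ('orders', '2024-01-01T00:00:00');
        INSERT INTO source_orders VALUES
            (1, '2023-12-31T23:00:00'),
            (2, '2024-01-01T08:00:00'),
            (3, '2024-01-02T09:30:00');
    """)
    return conn

def incremental_load(conn):
    """Load rows newer than the stored watermark, then advance it."""
    cur = conn.cursor()
    # Step 1 (the Lookup): read how far the last successful run got.
    (last,) = cur.execute(
        "SELECT value FROM watermark WHERE name = 'orders'").fetchone()
    # Step 2 (the filtered copy): extract only rows past the watermark.
    rows = cur.execute(
        "SELECT id, modified_at FROM source_orders WHERE modified_at > ?",
        (last,)).fetchall()
    if not rows:
        return 0
    cur.executemany("INSERT INTO sink_orders (id, modified_at) VALUES (?, ?)", rows)
    # Step 3: advance the watermark only after the load has been written.
    new_high = max(modified_at for _, modified_at in rows)
    cur.execute("UPDATE watermark SET value = ? WHERE name = 'orders'", (new_high,))
    conn.commit()
    return len(rows)

conn = make_demo_db()
print(incremental_load(conn))  # first run picks up the two newer rows
print(incremental_load(conn))  # nothing new, so nothing moves
```

The ISO-formatted timestamps compare correctly as text, which is what lets the strict `>` filter work here. The ordering of steps 2 and 3 is the part that matters most: the watermark moves only after the rows are in the sink.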

The tumbling window trigger pairs naturally with incremental loading when the data partitions by time, because each window corresponds to a time slice and the trigger hands the window boundaries to the pipeline. Instead of storing and updating a watermark by hand, the pipeline filters the source to the window passed in by the trigger, so the run for the 02:00 to 03:00 window loads exactly the rows for that hour. The trigger’s backfill capability then lets you process historical windows to catch up, and its dependency and concurrency controls keep the windows processing in order without overlapping. For time-partitioned incremental loads, this combination of a tumbling window trigger and a window-filtered source query is cleaner than a hand-managed watermark, because the trigger owns the bookkeeping the watermark pattern would otherwise require you to write.
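
A short simulation shows what fixed, non-overlapping windows buy you. This sketch generates hourly windows for a backfill range and filters rows into them using a half-open interval (start inclusive, end exclusive); the function names and the sample rows are assumptions for illustration and do not represent the trigger's own interface.

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, size):
    """Split [start, end) into contiguous, non-overlapping windows."""
    windows = []
    window_start = start
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows

def rows_in_window(rows, window_start, window_end):
    """Half-open filter: a row on a boundary belongs to exactly one window."""
    return [r for r in rows if window_start <= r["event_time"] < window_end]

rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 0, 15)},
    {"id": 2, "event_time": datetime(2024, 1, 1, 2, 0)},   # exactly on a boundary
    {"id": 3, "event_time": datetime(2024, 1, 1, 2, 45)},
]

backfill = tumbling_windows(
    datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 3, 0), timedelta(hours=1))

for window_start, window_end in backfill:
    ids = [r["id"] for r in rows_in_window(rows, window_start, window_end)]
    print(window_start.strftime("%H:%M"), "to", window_end.strftime("%H:%M"), ids)
```

The row stamped exactly 02:00 lands in the 02:00 to 03:00 window and nowhere else. Using an inclusive comparison on both ends would load it twice, once per adjacent window, so the comparison operators in the source query deserve the same care as the window size.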

How do I load only new or changed data instead of everything?

Use a watermark: store the high value of the last successful load, query the source for only rows above that value, load them, and update the stored watermark. A Lookup reads the watermark, the copy activity filters on it, and a final step updates it. For time-partitioned data, a tumbling window trigger can supply the boundaries instead of a hand-managed watermark.

The reason incremental loading deserves its own design attention is that getting it subtly wrong produces silent data quality problems rather than loud failures. A watermark updated before the load completes, rather than after, loses rows if the load fails midway, because the next run reads a watermark that claims data was processed when it was not. A watermark with a boundary off by one reprocesses or skips a row at the edge of each window. A source query that filters on a column the source does not index turns the incremental query into a full scan that defeats the purpose. These are not platform failures the monitoring view flags; they are logic errors in the pattern that show up as missing or duplicated data downstream, which is why the discipline is to update the watermark only after a confirmed successful load, to be precise about the boundary condition, and to ensure the filter column is one the source can seek on efficiently. Built carefully, the pattern moves only what changed and keeps the source load and the run time proportional to the new data rather than to the total, which is what makes a pipeline scale as the data grows.

Source Control, Deployment, and Security

A factory authored only in the live service is a factory with no history and no safe path to production, so understanding the Git integration and the deployment model is part of using it properly rather than an afterthought. By default the studio publishes directly to the live service, which is fine for a first experiment and wrong for anything a team maintains.

Connecting a factory to a Git repository, either Azure Repos or GitHub, changes the authoring experience in a way that matters. In Git mode your changes are saved to a feature branch as you work, so you have version history, code review through pull requests, and the ability to collaborate without overwriting each other’s work in a shared live canvas. Publishing then becomes a deliberate act: you merge to the collaboration branch and publish, which generates the deployment artifacts the service uses to promote the factory. The promotion across environments, from development to test to production, is done by deploying those artifacts with parameters that swap the environment-specific values such as connection strings and runtime references, so the same pipeline definition runs against the right stores in each environment without hand-editing. Teams that skip Git and author directly against the live service lose the history and the review gate, and they discover the cost the first time an unreviewed change breaks a production run with no clean way to see what changed.
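
The idea of one definition promoted with different parameter values can be sketched independently of any deployment tooling. In the example below, the `PARAM:` marker is a syntax invented for this illustration (it is not a template or expression syntax the service uses), and the definition, parameter names, and URLs are placeholders.

```python
def render(definition, parameters):
    """Replace PARAM:<name> markers with environment-specific values."""
    if isinstance(definition, dict):
        return {key: render(value, parameters) for key, value in definition.items()}
    if isinstance(definition, list):
        return [render(item, parameters) for item in definition]
    if isinstance(definition, str) and definition.startswith("PARAM:"):
        # A missing parameter raises KeyError, so a gap fails the deployment
        # instead of silently shipping a placeholder.
        return parameters[definition[len("PARAM:"):]]
    return definition

# One shared definition, authored once.
linked_service = {
    "name": "SalesStore",
    "properties": {
        "url": "PARAM:storage_url",
        "connectVia": "PARAM:runtime_name",
    },
}

# Hypothetical per-environment values.
environments = {
    "test": {"storage_url": "https://example-test.invalid", "runtime_name": "ir-test"},
    "prod": {"storage_url": "https://example-prod.invalid", "runtime_name": "ir-prod"},
}

for env, params in environments.items():
    print(env, render(linked_service, params)["properties"])
```

The definition itself never changes between environments; only the parameter set does. That is the property that makes a reviewed change in development behave the same way when it reaches production.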

Security in a factory rests on two pillars worth setting up deliberately. The first is identity: a factory has a managed identity, and using it to authenticate to data stores removes secrets from the picture entirely. Rather than storing a key or a password in a linked service, you grant the factory’s managed identity the appropriate role on the target, such as a data-plane role on a storage account, and the factory authenticates as itself with no credential to leak or rotate. Where a secret is unavoidable, it belongs in Azure Key Vault and is referenced from the linked service rather than embedded in it, so the secret lives in one governed place. The second pillar is network exposure: a factory can use managed private endpoints to reach stores over a private network rather than the public endpoint, which keeps the data path off the public internet and is often a compliance requirement. Combining a managed identity for authentication with private endpoints for the network path produces a factory that holds no secrets and exposes no data to the public internet, which is the posture most production environments should aim for.

How should a factory authenticate to data stores without storing secrets?

Use the factory’s managed identity and grant it the appropriate role on each data store, so it authenticates as itself with no key or password to store. Where a secret is genuinely required, keep it in Azure Key Vault and reference it from the linked service rather than embedding the value.

The managed identity approach is both more secure and less work over time, because there is no credential to rotate, no secret that can appear in an exported template, and no shared password that outlives the person who created it. Granting the identity a scoped role, rather than a broad one, applies least privilege concretely: a factory that only reads from a container needs a reader role on that container, not owner on the storage account. The same identity can be granted distinct roles on distinct stores, so one factory reads from one place and writes to another with exactly the permissions each requires and no more. This is the configuration that turns “the pipeline can connect” into “the pipeline can connect with precisely the access it needs,” which is the difference an auditor cares about.

When to Use Data Factory and When to Reach for Something Else

A factory is the right tool for scheduled and event-driven movement and transformation of data at scale, but it overlaps with several other Azure services, and choosing well means knowing where the boundaries actually sit rather than forcing every integration job into the tool you know best.

The closest overlap is with Synapse pipelines. The pipeline engine inside Azure Synapse Analytics is, in practice, the same orchestration engine, so the pipelines, activities, datasets, linked services, and integration runtimes you have learned here carry over almost directly. The deciding factor between them is the surrounding context. If the work lives inside an analytics workspace that also uses dedicated and serverless SQL pools and Spark for analysis, building the pipelines inside Synapse keeps the orchestration next to the analytics it feeds, and our explanation of how Synapse unifies its engines covers where that integrated workspace pays off. If the data integration is a standalone concern that feeds many destinations and is not anchored to a single analytics workspace, a dedicated data factory keeps the integration layer independent and reusable across consumers. The engines are close enough that the choice is about where the work belongs organizationally, not about a capability gap.

The overlap with Azure Logic Apps and Power Automate runs the other way. Those tools excel at application and business-process integration: reacting to an email, posting to a messaging channel, calling a SaaS API, running a lightweight workflow with hundreds of connectors to business systems. A factory excels at data integration: moving and transforming large volumes of data efficiently with compute built for the job. The line is volume and intent. A workflow that moves a few records in reaction to a business event is Logic Apps work; a pipeline that moves millions of rows on a schedule with proper movement compute is factory work. Using a factory to send a notification is overkill, and using Logic Apps to move a terabyte is the wrong engine.

The overlap with Azure Databricks is about ownership of the transformation. Databricks is a full data engineering and analytics platform where teams write Spark code, manage clusters, and build sophisticated transformations in notebooks. A factory can orchestrate Databricks by calling notebooks as activities, which is the common and effective pattern: the factory schedules, monitors, and retries while Databricks does the heavy transformation the data team authored. The choice is not factory versus Databricks so much as factory orchestrating Databricks, with the factory owning the schedule and the dependency graph and Databricks owning the transformation logic.

Should I build pipelines in Data Factory or in Synapse?

Use Synapse pipelines when the orchestration belongs inside an analytics workspace that already uses Synapse SQL pools and Spark, because keeping pipelines next to the analytics they feed simplifies the whole. Use a standalone data factory when data integration is an independent concern serving many destinations rather than one workspace.

The engines being nearly identical is what makes this an organizational decision rather than a technical one. Neither choice locks you out of a capability the other has for the common cases, so the question to answer is where the integration logic should live and who maintains it. A platform team that runs data integration as a shared service across many analytics and operational consumers benefits from a dedicated factory that is not tied to any one workspace. An analytics team that owns its workspace end to end benefits from pipelines that sit inside it. Picking based on ownership and blast radius, rather than on a feature checklist, produces the architecture that stays maintainable as the number of pipelines grows.

Putting It Into Practice

The fastest way to make this model concrete is to build the smallest real version of it and watch each part behave. Create a factory, define two linked services and the datasets on top of them, drop a copy activity between them, and run it on the Azure integration runtime to see a cloud-to-cloud move with no cluster involved. Then register a self-hosted integration runtime on a machine in a private network, point a linked service at a private source, and watch the same copy activity now reach data the Azure runtime could never see, which makes the runtime-decides-reach rule tangible rather than abstract. Add a mapping data flow that joins two sources and observe the Spark cluster start, the few-minute warmup, and the vCore-hour meter, so the cost of transformation compute stops being a surprise. Wire a schedule trigger, then a tumbling window trigger with a backfill, and see the difference between firing on the clock and processing fixed time slices in order. You can stand all of this up and run it in a guided environment when you run the hands-on Azure labs and command library on VaultBook, which walks through building a pipeline, registering a self-hosted runtime, and running a copy against both public and private sources so the moving parts become muscle memory rather than diagram boxes. The error and issue reference there also pairs each common factory symptom with its root causes, which shortens the path from a failed run to the part of the machine that owns the failure.

Building the small version teaches more than reading about the large one, because the failure modes show up immediately and in isolation. Bind a linked service to the wrong runtime on purpose and watch the connection time out, so the next time it happens in production you recognize the shape instantly. Run a flow that only renames a column and see the cluster cost for work a copy activity would have done without a cluster at all, so the anti-pattern is something you have felt rather than something you have been warned about. The reasoning the model gives you is most durable when it is attached to a run you watched succeed or fail.

How to Think About Azure Data Factory

The single best summary of the service is this: a factory is a control plane that orchestrates data work, built from pipelines, activities, datasets, linked services, and integration runtimes, where the integration runtime decides reach, the activity decides the kind of work, and the trigger decides timing. Almost every decision and almost every failure resolves to one of those three questions. Can the compute reach the data? That is a runtime question. Is this a move or a transform? That is an activity question. When and on what signal should this run? That is a trigger question. The sprawling surface of connectors, settings, and blades collapses into those few coupled decisions once you stop reading the service as a feature list and start reading it as a small machine with named parts.

The discipline that follows from the model is to ask the three questions before provisioning anything. Decide reach first, because it determines the runtime and rules out whole approaches: a private source needs a runtime with a network path to it, which for an on-premises source means a self-hosted runtime, and no amount of configuration on the default Azure runtime changes that. Decide the kind of work second, because it determines the activity and the cost: a move is a copy, a real transform is a data flow or a delegated compute, and choosing the heavy path for light work is the most common way to overpay. Decide timing last, because it determines the trigger and its semantics: clock work is a schedule trigger, time-slice work is a tumbling window, and event-driven work is an event trigger, and matching the trigger to the timing need prevents the brittle bookkeeping people write when they force the wrong one. Hold those three decisions in order and the service stops being a place where pipelines mysteriously fail and becomes a place where you can predict the failure before it happens.

The Authoring and Monitoring Experience

Knowing the parts is half the job; knowing where you assemble and watch them is the other half, because the studio is where the model becomes something you can touch. The authoring surface presents a canvas where pipelines are composed, a place to define the connections and the named pointers to your sources and sinks, a section for the triggers that start runs, and the monitoring panes that show what happened. Treating the studio as a place to reason rather than a place to click through a tutorial is what makes the experience productive.

When you author a pipeline, you drag activities onto the canvas and wire their dependencies with the success, failure, completion, and skip conditions that govern the control flow. The properties of each activity expose the settings that matter for that step: the source and sink for a copy, the transformation graph for a flow, the expression for a control step, the retry policy and timeout that harden it. The debug run is the authoring counterpart to a full execution, letting you run the pipeline against real connections while you build, so you confirm a step behaves before you publish and schedule it. Building with debug runs rather than publishing and hoping shortens the loop between an idea and the confirmation that it works, which is how you avoid discovering a wiring mistake only after a trigger fires it at 02:00.

The monitoring panes are where an engineer mid-incident actually lives, and they are organized around the three things that run: the pipeline runs, the activity runs inside them, and the trigger runs that started them. A pipeline run shows the overall status and the activities it executed; each activity run carries its own status along with the input, output, and error that make diagnosis possible; and the trigger run history shows whether and when a trigger fired. The rerun capability lets you restart a failed pipeline from the activity that failed rather than from the beginning, which matters when the early steps are expensive and only a late step broke. Pairing the studio’s live monitoring with telemetry routed to a workspace gives you both the immediate view for an incident and the historical view for trends, and an engineer who knows both panes can move from a paged alert to a root cause without guessing.

Where do I see why a specific pipeline run failed?

Open the monitoring pane, find the failed pipeline run, and drill into its activity runs until you reach the activity that failed, whose error output carries the real message. The pipeline run only reports that something broke; the activity run tells you what broke and why, which is where every diagnosis should begin.

The reason the monitoring layer rewards familiarity is that it mirrors the orchestration model exactly. A pipeline is a container of activities, so a pipeline run is a container of activity runs, and the failure always belongs to a specific activity rather than to the abstract pipeline. An engineer who internalizes that descends straight to the failing activity instead of staring at a red pipeline wondering where to look. The input and output captured on each activity run also let you confirm what an expression evaluated to, which resolves the common confusion where a pipeline behaves unexpectedly because a parameter or an expression produced a value other than the one assumed. Reading the actual input an activity received, rather than the value you intended it to receive, is often the fastest way to find a logic error in the control flow.
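The descent from a red pipeline run to the activity that broke can be sketched over plain records. The record shape below (`name`, `status`, `start`, `input`, `error`) is an assumption made for this example, loosely modelled on what the monitoring pane displays. It is not the exact schema the service returns.

```python
from typing import Optional

# Hypothetical activity-run records for one pipeline run.
activity_runs = [
    {"name": "LookupTables", "status": "Succeeded",
     "start": "2024-05-01T02:00:05Z", "input": {"query": "SELECT ..."}, "error": None},
    {"name": "CopyOrders", "status": "Failed",
     "start": "2024-05-01T02:00:40Z", "input": {"table": "dbo.Orderz"},
     "error": "Invalid object name 'dbo.Orderz'"},
    {"name": "NotifyOnFailure", "status": "Succeeded",
     "start": "2024-05-01T02:01:10Z", "input": {}, "error": None},
]


def first_failed_activity(runs: list[dict]) -> Optional[dict]:
    """Return the earliest failed activity run, or None if nothing failed."""
    failed = [run for run in runs if run["status"] == "Failed"]
    if not failed:
        return None
    # ISO 8601 timestamps in one format and zone sort correctly as strings.
    return min(failed, key=lambda run: run["start"])


culprit = first_failed_activity(activity_runs)
if culprit is not None:
    # Show the input the activity actually received next to its error.
    print(culprit["name"], culprit["input"], culprit["error"])
```

In this made-up run the captured input shows the table name arrived as `dbo.Orderz`, so the fault is an upstream parameter or expression rather than the copy step itself.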

The Verdict

Azure Data Factory is the right default for scheduled and event-driven data integration on Azure, and it rewards engineers who treat it as an orchestration engine rather than as a drag-and-drop ETL toy. Its strength is the clean separation of concerns: reach, work, and timing are three independent decisions made through three independent parts, and once that separation is internalized the service is predictable and its failures are diagnosable. Its sharp edge is that two of those decisions, the runtime and the activity, carry consequences that are invisible on the canvas, where a wrong runtime looks identical to a right one until the connection times out, and a needless data flow looks identical to a copy until the cluster bill arrives. The engineers who get value from the service are the ones who decide reach before they build, choose the lightest activity that does the job, match the trigger to the timing semantics rather than the most familiar option, and authenticate with a managed identity so there is no secret to leak. Do that, lean on Git for history and safe promotion, route the diagnostics somewhere you can measure run cost and duration, and the factory becomes the reliable backbone of a data platform rather than a recurring source of incidents. The model is small. The discipline of applying it in order is what separates a pipeline that survives production from one that breaks the first time the real world stops resembling the tutorial.

Frequently Asked Questions

Q: What is Azure Data Factory and what does it orchestrate?

Azure Data Factory is a managed cloud service for data integration and orchestration. It coordinates the movement and transformation of data across stores and compute services, on a schedule or in response to an event, with built-in monitoring and retry. It does not hold your data or run heavy transformations on its own hardware in the way a database or a Spark platform does; it directs other systems to do that work and tracks the outcome. The work itself is organized into pipelines made of activities, where activities either move data with a copy step, transform it with a data flow or a delegated compute, or steer the flow of the pipeline with control activities. Think of it as a conductor that tells the orchestra what to play and when, rather than as the instruments themselves, which is why the central design questions are always about where the data lives, where it must go, and what compute touches it along the way.

Q: What is an integration runtime and which type do I need?

An integration runtime is the compute infrastructure a factory uses to reach and move data, and it is the part that determines what a factory can connect to. There are three types. The Azure integration runtime is fully managed and serverless, used for work between cloud stores on the public network or managed private endpoints, and it runs the Spark clusters behind flows. The self-hosted integration runtime is software you install on a host inside a private or on-premises network, and it is required to reach sources that have no public route, such as an on-premises database or a file share. The Azure-SSIS integration runtime is a managed cluster for running existing SSIS packages unchanged. You choose by source location: public cloud means the Azure runtime, private or on-premises means the self-hosted runtime, and an SSIS package estate means the Azure-SSIS runtime.

Q: When do I use a mapping data flow versus a copy activity?

Use the copy activity for moving data with at most light shaping such as a column rename, a type cast, or a file format conversion, because it runs without a Spark cluster and is cheap and fast. Use a mapping data flow only when the transformation genuinely needs distributed row-level processing, such as joining several sources, aggregating across groups, deriving columns at scale, pivoting, or applying slowly changing dimension logic. A data flow compiles to Spark and runs on a cluster the Azure runtime provisions, which carries a startup time of a few minutes and bills by the vCore-hour, so using one for work a copy activity could do means paying for a cluster you did not need. The common waste is a data flow whose only step is a select or a rename; replace it with a copy activity and both the run time and the cost drop noticeably.

Q: Why can my pipeline not reach a source even though the credentials are correct?

Because reach is a property of the integration runtime, not the credential. If a linked service to an on-premises or private source is bound to the Azure integration runtime, the connection originates from public Azure compute that has no network route into the private network, so it times out no matter how correct the username and password are. The credentials test fine from a laptop because the laptop sits inside the network; the Azure runtime sits outside it. The fix is not a new password or a public firewall hole, which would be ineffective and unsafe. Bind the linked service to a self-hosted integration runtime installed on a host that already lives in the network where the source answers. Once the runtime can route to the source, the same credentials start working. When a private-source pipeline fails to connect, confirm which runtime the linked service uses first.
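In a linked service definition, the runtime binding is the `connectVia` reference. When that reference is absent, the factory uses its default Azure integration runtime. The sketch below shows a trimmed SQL Server linked service as a Python dictionary with a helper that reports which runtime it would use. The linked service name, server name, and runtime name are hypothetical, and `runtime_of` and `can_reach` are illustrations written for this guide.

```python
# Name of the default Azure integration runtime a factory falls back to.
DEFAULT_AZURE_RUNTIME = "AutoResolveIntegrationRuntime"

# Hypothetical linked service to an on-premises SQL Server.
on_prem_sql = {
    "name": "OnPremSalesDb",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=sql01.corp.example;Database=Sales;"
        },
        # This reference is what decides reach.
        "connectVia": {
            "referenceName": "CorpSelfHostedIR",
            "type": "IntegrationRuntimeReference",
        },
    },
}


def runtime_of(linked_service: dict) -> str:
    """Return the runtime a linked service is bound to."""
    reference = linked_service["properties"].get("connectVia")
    return reference["referenceName"] if reference else DEFAULT_AZURE_RUNTIME


def can_reach(linked_service: dict, source_is_private: bool,
              self_hosted_names: set[str]) -> bool:
    """A private source is reachable only through a self-hosted runtime."""
    if not source_is_private:
        return True
    return runtime_of(linked_service) in self_hosted_names
```

Removing the `connectVia` entry from this definition makes `can_reach` return `False` for a private source while the credentials stay exactly the same, which is the failure described above.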

Q: How do Data Factory triggers schedule pipelines?

Triggers turn a pipeline definition into a running job, and there are three plus the manual run. A schedule trigger fires on a wall-clock recurrence such as hourly or daily at a set time, regardless of whether new data arrived, and it does not retroactively run windows from before its start time. A tumbling window trigger fires for fixed, non-overlapping, contiguous time slices, passes each window’s boundaries into the pipeline as parameters, and supports backfill, inter-trigger dependencies, retry, and a concurrency limit, which makes it the right choice when each run must process exactly one time slice reliably and in order. An event trigger fires in response to a storage event such as a blob being created, built on Azure Event Grid, so the file’s path flows into the run. One pipeline can have several triggers, and a schedule or event trigger can start several pipelines; the exception is the tumbling window trigger, which is bound to exactly one pipeline.

Q: Why is my Data Factory trigger not firing?

The usual causes are a trigger that was authored but never started or published, a schedule trigger whose start time is still in the future, an event trigger blocked because the Event Grid resource provider is not registered on the subscription, or a tumbling window trigger waiting on a dependency that has not completed. A trigger that exists is not the same as a trigger that runs; after authoring it you must start and publish it, and a stopped trigger simply sits idle. Check the trigger run history in the monitoring view, which shows whether the trigger fired at all, separately from whether the pipeline it started then succeeded. Separating "did the trigger fire" from "did the pipeline run well" turns a vague "my pipeline did not run" complaint into a precise diagnosis you can act on quickly.

Q: What is the difference between a dataset and a linked service?

A linked service is the connection definition, the equivalent of a connection string, and it tells the factory how to connect to a data store or a compute service. A dataset sits on top of a linked service and names the specific data you want to use within that connection, such as a particular table, a folder, or a file pattern, along with its shape where relevant. The relationship is layered: a linked service connects to Azure Blob Storage as a whole, while a dataset built on it points at one container and folder inside that storage. This separation is what makes both reusable. One linked service to a database serves many datasets, each naming a different table, and a parameterized dataset can serve many folders through a single definition, so you are not redefining the connection every time you reference a new piece of data.
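The layering shows up directly in the definitions: a dataset carries a reference to its linked service by name and adds only the location. The sketch below uses trimmed definitions as Python dictionaries. The names `RawBlobStore` and `RawCsvFolder`, the container, and the `folder` parameter are hypothetical, and the linked service leaves out its authentication settings.

```python
# Hypothetical linked service: the connection to the storage account as a whole.
linked_service = {
    "name": "RawBlobStore",
    "properties": {
        "type": "AzureBlobStorage",
        # Authentication settings omitted for brevity.
        "typeProperties": {},
    },
}

# Hypothetical parameterized dataset: one definition, many folders.
dataset = {
    "name": "RawCsvFolder",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "RawBlobStore",
            "type": "LinkedServiceReference",
        },
        "parameters": {"folder": {"type": "String"}},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                # Resolved per run from the dataset parameter.
                "folderPath": {"value": "@dataset().folder", "type": "Expression"},
            }
        },
    },
}


def datasets_on(linked_service_name: str, datasets: list[dict]) -> list[str]:
    """List the datasets that sit on top of one linked service."""
    return [
        d["name"]
        for d in datasets
        if d["properties"]["linkedServiceName"]["referenceName"] == linked_service_name
    ]
```

Because the dataset holds only a reference, repointing every dataset at a different storage account means editing the one linked service rather than each dataset.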

Q: Does a mapping data flow really start a Spark cluster every time?

A mapping data flow runs on a Spark cluster that the Azure integration runtime provisions, and from a cold state that startup takes a few minutes before any rows are transformed. That latency is normal behavior, not a fault, so a flow that pauses for a few minutes before processing begins is working as designed. To avoid paying that startup on every flow in a sequence, set a time-to-live on the integration runtime’s data flow compute, which keeps the cluster warm and reuses it across successive flow activities instead of provisioning a fresh one each time. The cluster also bills by the vCore-hour while it runs, so right-sizing the compute type and core count to the data volume, and reusing a warm cluster, are the two levers that control data flow cost. A copy activity, by contrast, never starts a cluster.

Q: How is Azure Data Factory priced?

It uses a consumption model with several meters rather than fixed tiers. Orchestration is billed by activity runs, so each activity execution counts, which matters for pipelines that fan out many iterations. Data movement through the copy activity is billed by Data Integration Unit hours, where more units finish faster at a higher hourly rate. Mapping data flow execution is billed by Spark vCore-hours, so cluster size and run duration drive that cost. External compute you invoke, such as a Databricks notebook, is billed separately by that service. The Azure-SSIS integration runtime is billed by the running VM whether or not packages execute, which makes an idle SSIS runtime pure waste. Treat any specific rate as a value to confirm against the current pricing, since rates change; the meters and their behavior are the stable part to understand.
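The way the meters combine is easier to see as arithmetic. The sketch below is a rough estimator for the three factory-side meters. It assumes orchestration is priced per thousand activity runs, and every rate in the usage note is a placeholder chosen to make the arithmetic easy to follow, not a real price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder rates; confirm real values against current pricing."""
    per_1000_activity_runs: float
    per_diu_hour: float
    per_vcore_hour: float


def estimate_run_cost(rates: Rates, activity_runs: int,
                      diu: int, copy_hours: float,
                      vcores: int, flow_hours: float) -> dict[str, float]:
    """Split one pipeline run's cost across the three factory meters."""
    # Orchestration scales with how many activities executed.
    orchestration = activity_runs / 1000 * rates.per_1000_activity_runs
    # Data movement scales with units multiplied by copy duration.
    movement = diu * copy_hours * rates.per_diu_hour
    # Data flow scales with cluster size multiplied by run duration.
    data_flow = vcores * flow_hours * rates.per_vcore_hour
    return {
        "orchestration": orchestration,
        "movement": movement,
        "data_flow": data_flow,
        "total": orchestration + movement + data_flow,
    }
```

With placeholder rates of `Rates(1.0, 0.25, 0.5)`, a run with 2000 activity executions, a 4-unit copy lasting half an hour, and an 8-core flow lasting a quarter of an hour comes to 2.0 for orchestration, 0.5 for movement, and 1.0 for the data flow. In that example the fan-out, not the data volume, is the largest line.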

Q: How do I move data from an on-premises SQL Server to Azure?

Install a self-hosted integration runtime on a machine inside the network where the SQL Server lives, because the Azure integration runtime cannot route to a private on-premises source. Register the self-hosted runtime to your factory with its key, confirm it is online and has outbound connectivity to the Azure endpoints it needs over port 443, then create a linked service to the SQL Server bound to that self-hosted runtime. Build a dataset for the source table and a dataset for the Azure sink, drop a copy activity between them, and run it. The copy activity now reads from the on-premises database through the self-hosted runtime and writes to the cloud sink. For availability and throughput on a busy pipeline, register more than one node against the same self-hosted runtime so a single node failure does not strand the work.

Q: Can the Azure integration runtime reach data behind a private endpoint?

It can when you use a managed private endpoint, which gives the Azure integration runtime a private route to a store that only answers privately, keeping the data path off the public internet. Without a managed private endpoint or another private path, the Azure runtime reaches stores over their public endpoints, so a store locked to private access only is unreachable from the plain Azure runtime. The alternative for sources inside a virtual network is a self-hosted integration runtime installed on a VM in that network, which routes to private resources the same way it routes to on-premises ones. The choice between a managed private endpoint and a self-hosted runtime in the VNet depends on whether the source is a managed Azure service that supports managed private endpoints or a resource that needs an agent inside the network to reach it.

Q: What is the difference between a schedule trigger and a tumbling window trigger?

A schedule trigger fires on a wall-clock recurrence and is stateless about the data: it runs at the appointed times whether or not new data has arrived, and it does not go back and process windows from before it started. A tumbling window trigger is stateful about time: it produces a run for each fixed, contiguous, non-overlapping slice, hands the window boundaries to the pipeline as parameters, and tracks each window so it can backfill historical slices, enforce dependencies between triggers, retry a failed window, and limit how many windows run concurrently. Choose a schedule trigger for simple recurring jobs where each run does the same thing on the clock. Choose a tumbling window trigger when each run corresponds to a specific time slice that must be processed exactly once and in order, which is common for incremental data loads partitioned by time.
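The slices a tumbling window trigger produces are fixed in size, contiguous, and non-overlapping, and they can start in the past. The sketch below generates them in plain Python. The function is written for this guide and only mimics how the trigger carves up time; in a real pipeline the boundaries arrive as the trigger's window start and end outputs.

```python
from datetime import datetime, timedelta


def tumbling_windows(start: datetime, end: datetime,
                     size: timedelta) -> list[tuple[datetime, datetime]]:
    """Return every complete [window_start, window_end) slice between start and end."""
    if size <= timedelta(0):
        raise ValueError("window size must be positive")
    windows = []
    window_start = start
    # Only emit windows that have fully elapsed by `end`.
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


# Backfill: a start time in the past yields one run per historical slice.
slices = tumbling_windows(
    start=datetime(2024, 1, 1, 0, 0),
    end=datetime(2024, 1, 1, 3, 30),
    size=timedelta(hours=1),
)
for window_start, window_end in slices:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window's end is the next window's start, so a load that filters on `window_start <= t < window_end` processes every row exactly once. A schedule trigger offers no equivalent guarantee, because it hands the pipeline a fire time rather than a bounded slice.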

Q: How do I process many files or tables with one pipeline?

Build a metadata-driven pipeline. Read the set of items to process with a Lookup activity against a control table, or with a Get Metadata activity that lists the files in a folder, then iterate the result with a ForEach activity, passing each item into the inner activities through parameters and expressions. The copy activity inside the loop takes the current item’s source and sink as parameters, so a single pipeline definition handles any number of files or tables. Adding a new table becomes a row in the control table rather than a new pipeline, and a new file is picked up by the next run automatically. This discover-then-iterate-then-act pattern is how teams scale from moving a handful of tables to moving hundreds without the authoring effort growing with the count, and it is worth building once and reusing until it becomes routine.
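The shape of that pattern can be simulated outside the service. The sketch below stands in for the Lookup, ForEach, and copy steps with plain Python. The control-table columns (`source_table`, `sink_path`) and the `fake_copy` function are invented for illustration, so it shows the control flow rather than any real data movement.

```python
from typing import Callable

# Stand-in for the rows a Lookup activity would read from a control table.
control_rows = [
    {"source_table": "dbo.Orders", "sink_path": "raw/orders"},
    {"source_table": "dbo.Customers", "sink_path": "raw/customers"},
]


def fake_copy(source: str, sink: str) -> str:
    """Stand-in for a parameterized copy activity."""
    return f"copied {source} -> {sink}"


def run_metadata_driven(rows: list[dict],
                        copy: Callable[[str, str], str]) -> list[str]:
    """Discover, iterate, act: one definition handles any number of items."""
    results = []
    for item in rows:  # ForEach over the Lookup output
        # Each iteration passes the current item's values in as parameters.
        results.append(copy(item["source_table"], item["sink_path"]))
    return results


for line in run_metadata_driven(control_rows, fake_copy):
    print(line)
```

Onboarding a third table means appending one row to `control_rows`; `run_metadata_driven` does not change, which mirrors how the control table absorbs growth in the real pipeline.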

Q: How should I authenticate a factory to data stores securely?

Use the factory’s managed identity wherever the target supports it, and grant that identity a scoped role on each store rather than storing a key or password. This removes secrets from linked services entirely, since the factory authenticates as itself, and it applies least privilege when you grant only the role the work needs, such as a read role on the one container the pipeline reads from. Where a secret is genuinely unavoidable, keep it in Azure Key Vault and reference it from the linked service rather than embedding the value, so the secret lives in one governed location with rotation and access control. Combining a managed identity for authentication with managed private endpoints for the network path produces a factory that holds no secrets and keeps the data off the public internet, which is the posture most production environments should target.
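When a secret cannot be avoided, the linked service holds a pointer to the vault instead of the value. The sketch below shows that reference shape as a Python dictionary and adds a small check that flags secrets embedded inline. The linked service names, the secret name, and the `SECRET_FIELDS` list are assumptions for this example, and `inline_secrets` is a hypothetical lint rather than a service feature.

```python
# Property names treated as secrets by this sketch (illustrative, not exhaustive).
SECRET_FIELDS = {"password", "accountKey", "servicePrincipalKey"}

# Hypothetical linked service whose password lives in Azure Key Vault.
sql_with_vault = {
    "name": "WarehouseDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=example.database.windows.net;Database=dw;",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "CorpKeyVault",
                    "type": "LinkedServiceReference",
                },
                "secretName": "warehouse-sql-password",
            },
        },
    },
}


def inline_secrets(linked_service: dict) -> list[str]:
    """Return secret fields that are not references to a Key Vault secret."""
    properties = linked_service["properties"]["typeProperties"]
    return sorted(
        key
        for key, value in properties.items()
        if key in SECRET_FIELDS
        and not (isinstance(value, dict) and value.get("type") == "AzureKeyVaultSecret")
    )
```

Running a check like `inline_secrets` over every linked service in the Git repository is one way to keep embedded values from reaching a published factory, though it only covers the fields it has been told to look for.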

Q: Why should I connect a factory to Git instead of editing it live?

Authoring directly against the live service gives you no version history, no review gate, and no safe rollback, so an unreviewed change can break a production run with no clean record of what changed. Connecting the factory to Azure Repos or GitHub puts your work on feature branches with version history and pull request review, lets a team collaborate without overwriting a shared canvas, and makes publishing a deliberate merge-and-publish step rather than an accidental save. It also enables clean promotion across environments, where the same pipeline definitions deploy to development, test, and production with parameters swapping the environment-specific connection values, so you are not hand-editing pipelines per environment. For anything a team maintains over time, Git mode is the difference between a manageable factory and one whose history and safety depend on nobody making a mistake.

Q: Should I use Data Factory or Logic Apps for integration?

Use a factory for data integration: moving and transforming large volumes of data on a schedule or an event with compute built for that job. Use Logic Apps for application and business-process integration: reacting to an email, calling a SaaS API, posting to a chat channel, or running a lightweight workflow across the many business-system connectors Logic Apps offers. The dividing line is volume and intent. A workflow that moves a few records in reaction to a business event belongs in Logic Apps, while a pipeline that moves millions of rows with proper data movement compute belongs in a factory. Using a factory to send a notification is heavier than the task warrants, and using Logic Apps to move large data sets puts the work on an engine that was not designed for that throughput, so matching the tool to the volume and the purpose keeps both the cost and the reliability sensible.

Q: Can one self-hosted integration runtime serve more than one data factory?

A self-hosted integration runtime can be shared across multiple data factories, which avoids installing a separate agent on the same host for every factory that needs to reach the same private network. You register the runtime once and grant other factories permission to use it as a linked, shared runtime. This is useful when several factories in an organization all need to reach the same on-premises sources, because it reduces the number of agents to install, patch, and monitor. For availability and throughput you can also register multiple nodes against a single self-hosted runtime, and the service distributes work across the healthy nodes, so sharing and high availability are independent decisions: sharing controls how many factories use the runtime, while multiple nodes control how resilient and how scalable that runtime is for the work it handles.

Q: How do I monitor and observe my pipelines over time?

The studio’s monitoring view shows pipeline runs, activity runs, and trigger runs with their status, input, output, and error detail, which is enough for diagnosing an individual run. For history beyond the built-in retention, for alerting, and for analysis across many runs, route the factory’s diagnostic logs and metrics to a Log Analytics workspace through a diagnostic setting, then query them with KQL and build alerts on failure rates, durations, or specific error patterns. This is also where you find the run durations and Data Integration Unit usage needed to attribute and control cost, since the per-run detail tells you which pipelines and which data flows consume the most compute. Centralizing the telemetry turns one-off run inspection into a durable observability layer, so you can spot a creeping increase in run time or a rising failure rate before it becomes an incident.

Q: Does Data Factory store any of my data?

A factory does not persistently store your business records as part of moving them; it orchestrates the movement between your source and your sink, and the data lands in the sink you specify. During a copy that stages through an intermediate path, such as a blob staging step for a bulk load into a dedicated SQL pool, the payload passes through that staging location transiently as part of the operation rather than being retained by the factory itself. The factory does store metadata about your pipelines, the run history, and the entity definitions, and it records logs and outputs you can inspect in monitoring. The practical implication is that data residency and protection are governed by where your source, sink, and any staging location live and by the region of the integration runtime doing the movement, so those are the things to control when residency rules apply, rather than expecting the factory to be a data store you secure directly.