When a system grows past a handful of services, the wiring between them quietly becomes the hardest part of the design. Order processing calls inventory, inventory calls notifications, notifications call the audit log, and every one of those calls is a synchronous request that has to succeed right now or the whole chain stalls. Add a sixth service and the call graph turns into a web that no single engineer can hold in their head. Event-driven architecture on Azure is the answer to that sprawl: instead of services calling each other directly, a service announces that something happened and walks away, and any number of other services react to that announcement on their own schedule. This guide is about designing such a system deliberately, choosing the right messaging backbone for each kind of event, and building consumers that survive the delivery guarantees Azure actually offers rather than the ones engineers wish it offered.

The promise is decoupling, and the trap is that decoupling is not free. The moment a producer stops waiting for a consumer, you inherit a new class of problems: the same event can arrive twice, events can arrive out of order, a consumer can fail halfway through and leave the system in a partial state, and a failed event can vanish if nothing catches it. None of these problems are exotic. They are the ordinary, daily reality of every event-driven system in production, and the difference between a design that works and one that pages an engineer at three in the morning is whether those realities were engineered for on day one or discovered later under load.
This article walks through the pattern itself, the three Azure services that realize it, a reference design you can adapt, the failure modes you must handle, the line between where the pattern earns its complexity and where it is overkill, and how to evolve a system as it grows. Throughout, one rule does most of the heavy lifting, and it is worth stating up front so everything else hangs off it.
The at-least-once-means-idempotent rule
Here is the claim this whole guide is built around, and it is the single idea most worth carrying away. Azure event delivery is at-least-once, which means every event your system cares about will be delivered one or more times, never guaranteed exactly once. Therefore every consumer must be idempotent: processing the same event twice must produce the same result as processing it once. Skipping that discipline is the root of the most common and most expensive class of bug in event-driven systems, the duplicate-processing bug, where a customer is charged twice, an email is sent twice, or an inventory count drifts because a decrement ran more than once.
Call it the at-least-once-means-idempotent rule. It sounds like a footnote and it is actually the design constraint that should shape your consumer code, your data model, and your choice of backbone. Engineers who internalize it build systems that heal under retry. Engineers who assume exactly-once delivery build systems that corrupt data the first time a network blip causes a redelivery, and then spend weeks hunting a bug that the platform documentation warned about on page one. The rest of this guide returns to this rule again and again, because nearly every design decision in event-driven architecture is downstream of it.
What event-driven architecture actually means
Strip away the diagrams and event-driven architecture is a single inversion of control. In a request-driven system, the caller decides what happens next: it knows which downstream service to invoke, it waits for the response, and it owns the outcome. In an event-driven system, the producer of a change does not know or care who reacts. It states a fact about the world, “order 4471 was placed,” “blob photo.jpg was uploaded,” “payment 9920 settled,” and publishes that fact to a messaging backbone. Whoever is interested subscribes. The producer is released from the chain the instant the fact is recorded.
That inversion changes the unit of design from the call to the event. An event is an immutable record that something happened in the past. The grammar matters: events are named in the past tense because they describe completed facts, not commands to be obeyed. “OrderPlaced” is an event; “PlaceOrder” is a command, and the distinction is not pedantry. A command is a request directed at a specific handler that is expected to act, and it carries an implicit assumption that exactly one consumer will process it. An event is a broadcast about something already true, and any number of consumers, including zero, may react without the producer knowing. Mixing these two up is the source of a surprising amount of architectural confusion, and naming discipline alone prevents a great deal of it.
Why does decoupling matter beyond cleaner diagrams?
Decoupling matters because it changes what can fail independently. When a producer no longer waits on a consumer, the consumer can be down, slow, or redeployed without the producer noticing, and the work simply waits in the backbone until the consumer recovers. That isolation is the entire point of the pattern and the reason it scales.
The deeper consequence is temporal decoupling. In a synchronous chain, the slowest service sets the latency of the whole operation, and the least available service sets the availability of the whole operation, because availability multiplies down a synchronous path. Five services at 99.9 percent availability chained synchronously yield roughly 99.5 percent for the combined path, and the math only gets worse as the chain lengthens. Break the chain with events and each service’s availability stops compounding into its neighbors. The producer records the order and returns; if the notification service is having a bad afternoon, the notifications queue up and flush when it recovers, and the customer who placed the order never sees the outage at all.
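The arithmetic behind that figure is a one-line product, assuming the five services fail independently of one another (an assumption; correlated failures change the numbers):

$$
A_{\text{chain}} = \prod_{i=1}^{5} A_i = 0.999^{5} \approx 0.99501 \approx 99.5\%
$$

Extending the same chain to ten services gives $0.999^{10} \approx 0.99004$, or roughly 99.0 percent. Over a year, 99.9 percent allows about 8.8 hours of downtime, while 99.5 percent allows about 43.7 hours.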
The second consequence is independent scaling. A synchronous chain forces every service to scale to the peak throughput of the busiest one, because a request flows through all of them at once. With events between them, a slow consumer can fall behind during a spike and catch up afterward, absorbing the burst in the backbone’s buffer rather than in a wall of failed requests. This is why event-driven designs handle spiky, uneven load gracefully while synchronous chains tend to shed load the moment one link saturates.
The third consequence is extensibility. Because the producer broadcasts a fact and does not enumerate its consumers, adding a new reaction to an existing event requires no change to the producer at all. A new analytics pipeline, a new fraud check, a new downstream cache can subscribe to “OrderPlaced” and the order service never learns of its existence. This is the property that lets event-driven systems grow new capabilities without the producer accumulating an ever-longer list of services to call.
What is publish-subscribe and how does fan-out work?
Publish-subscribe is the messaging model where a producer publishes to a topic and the backbone delivers a copy to every subscriber independently. Fan-out is the same idea seen from the event’s side: one published event spreads to many consumers at once, each receiving its own copy and processing it on its own schedule without affecting the others.
The mechanics differ by backbone, but the shape is consistent. A topic or an event subscription sits between producer and consumers. The producer writes once. The backbone is responsible for delivering a copy to each registered subscriber, tracking each subscriber’s progress separately, and retrying delivery to a subscriber that fails without holding up the others. The critical design property is that subscribers are isolated from one another. If the audit subscriber is slow and the email subscriber is fast, the email goes out immediately and the audit catches up later; neither blocks the other, and neither blocks the producer. Contrast this with a point-to-point queue, where a single message is consumed by exactly one reader and then gone. Point-to-point is for distributing work among competing workers; publish-subscribe is for broadcasting a fact to independent reactors. Most real systems use both, and choosing correctly per interaction is half of getting the design right.
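The difference between the two shapes can be sketched in a few lines. This is a minimal in-memory illustration, not Azure SDK code: the `subscriptions` dictionary, the `workQueue`, and the `Subscribe` and `Publish` helpers are hypothetical stand-ins for a topic with per-subscriber buffers and a point-to-point queue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-ins; no Azure SDK is involved.
var subscriptions = new Dictionary<string, Queue<string>>(); // one buffer per subscriber
var workQueue = new Queue<string>();                          // point-to-point queue

void Subscribe(string name) => subscriptions[name] = new Queue<string>();

// Publish-subscribe: the producer writes once, every subscriber gets its own copy.
void Publish(string evt)
{
    foreach (var buffer in subscriptions.Values) buffer.Enqueue(evt);
}

Subscribe("email");
Subscribe("audit");
Publish("OrderPlaced:4471");
Publish("OrderPlaced:4472");

// The fast subscriber drains immediately; the slow one keeps its own backlog.
var emailFirst = subscriptions["email"].Dequeue();
var emailSecond = subscriptions["email"].Dequeue();
var auditBacklog = subscriptions["audit"].Count;
Console.WriteLine($"email handled {emailFirst} and {emailSecond}; audit backlog = {auditBacklog}");

// Point-to-point: competing workers share one queue, each message goes to exactly one.
workQueue.Enqueue("ReserveInventory:4471");
var takenByWorkerA = workQueue.Dequeue();
Console.WriteLine($"worker A took {takenByWorkerA}; left for worker B = {workQueue.Count}");
```

The email subscriber finishing both events leaves the audit backlog untouched at two, which is the isolation property; the work queue, by contrast, is empty once one worker has taken the message.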
The three Azure messaging backbones
Azure does not offer one event service; it offers three, and the single most consequential decision in an event-driven design is matching each interaction to the right one. The three are Event Grid, Event Hubs, and Service Bus, and they exist as separate products because they solve genuinely different problems. Engineers who treat them as interchangeable, picking whichever they used last, end up forcing a square peg into a round hole and then fighting the consequences. The services serve different event shapes, and the shape of your event tells you which one to reach for.
Event Grid: discrete reactive events
Event Grid is built for discrete, reactive notifications. The model is a single fact, small in payload, that demands a reaction somewhere: a blob was created, a resource group was deployed, a custom business event fired. Event Grid’s job is to take that notification and push it to subscribers fast, with the routing logic, filtering, and retry handling managed for you. It is push-based by default, meaning Event Grid calls your handler’s endpoint when an event arrives rather than waiting for your handler to ask, which makes it a natural fit for serverless reactions where a function spins up only when there is something to do.
The mental model for Event Grid is the reactive glue between Azure services and your own code. It knows how to emit events for dozens of Azure resource types natively, so a blob upload can trigger a function without you writing any polling loop, and it accepts custom events from your own applications through topics you define. Its strength is reach and reactivity: many sources, many handlers, low latency from event to reaction, and no infrastructure for you to run. Its delivery guarantee is at-least-once, and it retries failed deliveries on a schedule with exponential backoff. Once the retry policy is exhausted, it writes the event to a dead-letter destination if one is configured; without one, the undelivered event is dropped. That retry-then-dead-letter behavior is the safety net, and configuring the dead-letter destination is not optional in a serious design.
Event Grid is the wrong choice when the event is part of a high-volume telemetry stream or when strict ordering and transactional handling are required. It is a notification router, not a stream processor and not an enterprise message broker, and trying to push a million telemetry events a second through it, or to demand strict first-in-first-out ordering from it, is asking it to be something it is not.
Event Hubs: high-throughput streaming
Event Hubs is built for ingesting and processing high-throughput streams of data. The model here is not a discrete fact demanding a reaction; it is a firehose of records, telemetry from a fleet of devices, clickstream from a web property, application logs, metrics, anything that arrives continuously and in volume and needs to be processed in order within a partition. Event Hubs can take in millions of events per second and hand them to consumers that read through the stream at their own pace.
The defining concept in Event Hubs is the partition. A hub is divided into partitions, and each partition is an ordered, append-only log. Events written to a partition are read back in the order they were written, and a consumer tracks its position in each partition with a checkpoint, a stored offset that says “I have processed up to here.” Because consumers track their own position, multiple independent consumer groups can read the same stream at different speeds for different purposes: one consumer group feeds a real-time dashboard, another batches the same events into a data lake, and neither interferes with the other. The throughput model is built around the partition too, since parallelism is bounded by partition count, and choosing the partition count and the partition key, the property that decides which partition an event lands in, is the central tuning decision. A poorly chosen partition key, one with few distinct values or a heavily skewed distribution, concentrates load on one partition and caps throughput regardless of how many partitions you provisioned, the same hot-partition trap that haunts every partitioned store.
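A small simulation makes the partition-key effect concrete. This is a sketch under stated assumptions: the FNV-1a hash in `PartitionFor` is a stand-in chosen for determinism, not the hashing Event Hubs uses internally, and the partition lists are plain in-memory collections.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int partitionCount = 4;
var partitions = new List<string>[partitionCount];
for (var i = 0; i < partitionCount; i++) partitions[i] = new List<string>();

// Deterministic FNV-1a hash; a stand-in for the service's internal hashing.
int PartitionFor(string key)
{
    unchecked
    {
        uint hash = 2166136261;
        foreach (var c in key)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return (int)(hash % partitionCount);
    }
}

void Send(string key, string payload) => partitions[PartitionFor(key)].Add($"{key}:{payload}");

// Good key: the device id. Each device's readings stay in order on one partition.
foreach (var seq in new[] { 1, 2, 3 })
    foreach (var device in new[] { "dev-a", "dev-b", "dev-c", "dev-d" })
        Send(device, $"reading-{seq}");

var devA = partitions[PartitionFor("dev-a")].Where(e => e.StartsWith("dev-a:")).ToList();
Console.WriteLine(string.Join(", ", devA));

// Poor key: a single shared value such as a region name lands everything on one partition.
var load = new int[partitionCount];
for (var n = 0; n < 1000; n++) load[PartitionFor("eu-west")]++;
Console.WriteLine($"busiest partition holds {load.Max()} of 1000 events");
```

Whatever partition the hash picks, every event with the same key goes to it, which is both what preserves per-key order and what creates the hot partition when the key has too few distinct values.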
Event Hubs is the wrong choice for discrete notifications that need to trigger a single handler, and it is the wrong choice for enterprise messaging that needs per-message acknowledgment, dead-lettering of individual poison messages, and transactional semantics. It is a streaming pipe optimized for volume and ordered replay, not a broker optimized for reliable command delivery.
Service Bus: ordered enterprise messaging
Service Bus is the enterprise message broker. Its model is reliable, ordered, transactional delivery of messages that represent work to be done, typically commands, often within a business workflow that cannot tolerate lost or duplicated effort without explicit handling. Where Event Grid optimizes for reactive reach and Event Hubs for streaming volume, Service Bus optimizes for delivery guarantees and message-level control.
Service Bus offers queues for point-to-point work distribution and topics with subscriptions for publish-subscribe fan-out, so it covers both interaction shapes within one broker. Its distinguishing features are the ones that enterprise workflows need. It supports sessions, which guarantee ordered processing of a related group of messages by routing them to a single consumer in sequence, the mechanism you reach for when ordering genuinely matters, such as processing all the events for one customer’s account in the order they occurred. It supports dead-letter queues natively on every queue and subscription, so a message that cannot be processed after the configured number of attempts moves automatically to a dead-letter sub-queue where it waits for inspection rather than disappearing or blocking the line. It supports duplicate detection within a configurable time window, scheduled delivery, message deferral, and transactions that let a consumer atomically complete one message and send another. These features are why Service Bus is the backbone for the messaging that runs a business rather than the telemetry that observes it.
Service Bus is the wrong choice for a firehose of millions of telemetry events per second, where its per-message machinery becomes overhead rather than value, and it is overkill for a simple reactive notification that Event Grid would route with less ceremony. It is the right choice precisely when the message represents work whose ordering, reliability, and individual fate you must control.
The InsightCrunch event-backbone decision table
The decision among the three reduces to one question asked well: what shape is this event? The following table is the artifact to bookmark. Read the event shape on the left, take the backbone in the middle, and confirm with the deciding signal on the right. When two rows seem to fit, the deciding signal breaks the tie.
| Event shape | Backbone | Deciding signal |
|---|---|---|
| Discrete reactive notification (“X happened, react”) | Event Grid | One fact, small payload, needs to trigger one or more handlers fast; no ordering or high volume required |
| High-throughput continuous stream (telemetry, logs, clickstream) | Event Hubs | Millions of records, ordered replay within a partition, multiple consumers reading the same stream at different speeds |
| Ordered enterprise command or workflow message | Service Bus | Work to be done, needs reliable delivery, per-message dead-lettering, sessions for ordering, or transactions |
| Fan-out a business event to several independent reactors | Service Bus topics, or Event Grid for reactive subscribers | Fan-out where each subscriber needs reliable, possibly ordered delivery favors Service Bus topics; lightweight reactive fan-out favors Event Grid |
| Reaction to a native Azure resource change (blob, resource deploy) | Event Grid | Azure emits the event natively; no polling, push to a function or webhook |
| Pull-based consumption with consumer-controlled pace and ordering | Service Bus (commands) or Event Hubs (streams) | Consumer wants to read on its own schedule and acknowledge per message (Service Bus) or checkpoint per partition (Event Hubs) |
The deciding signal column is what saves you from the most common mistake, which is choosing a backbone by familiarity rather than by fit. If you cannot articulate the deciding signal for your choice, you have not yet established that the choice is right. The team that picks Event Hubs for a discrete order notification because they already run Event Hubs for telemetry will find themselves writing checkpoint-management code for an event that needed none, and the team that pushes ordered financial commands through Event Grid will discover it has no sessions and no native dead-letter queue when the first poison message arrives.
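One way to force the deciding signal into the open is to write the table down as a function whose parameters are the signals. The helper below is a hypothetical simplification of the table, not an official selection algorithm, and its three boolean parameters are names invented for this sketch.

```csharp
using System;

// Hypothetical helper: the table's deciding signals asked in order.
string ChooseBackbone(bool isContinuousStream, bool isWorkToBeDone, bool needsSessionsOrTransactions)
{
    // A firehose of records with ordered replay per partition.
    if (isContinuousStream) return "Event Hubs";

    // Commands or workflow messages whose individual fate must be controlled.
    if (isWorkToBeDone || needsSessionsOrTransactions) return "Service Bus";

    // A discrete fact that only needs to trigger reactions quickly.
    return "Event Grid";
}

var telemetry = ChooseBackbone(isContinuousStream: true, isWorkToBeDone: false, needsSessionsOrTransactions: false);
var payment = ChooseBackbone(isContinuousStream: false, isWorkToBeDone: true, needsSessionsOrTransactions: true);
var blobCreated = ChooseBackbone(isContinuousStream: false, isWorkToBeDone: false, needsSessionsOrTransactions: false);

Console.WriteLine($"telemetry -> {telemetry}, payment -> {payment}, blob created -> {blobCreated}");
```

If you cannot fill in the arguments for an interaction, that is the same gap the table exposes: the deciding signal has not been articulated yet.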
A reference design walked through
Abstract principles are easier to trust once you have seen them carry a concrete flow, so consider a retail order system, the canonical example because it exercises every interaction shape at once. A customer places an order. That single act needs to do many things: reserve inventory, charge a payment, send a confirmation email, update an analytics warehouse, and write an audit record. In a request-driven design, the order service would call each of these in turn, owning the success of all five and failing the customer’s request if any one of them stumbled. The event-driven design inverts that completely.
When the order is accepted and persisted, the order service publishes a single business event, “OrderPlaced,” carrying the order identifier and the minimal payload other services need. The order service then returns to the customer immediately. It does not call inventory, payment, email, analytics, or audit. It records a fact and moves on. From that one published event, five independent reactions unfold, each on its own backbone chosen by the shape of its work.
Inventory reservation is a command that must be reliable and must not double-decrement stock, so it flows through Service Bus. The inventory service subscribes to OrderPlaced, reserves the items, and because Service Bus delivers at-least-once, the inventory consumer is written to be idempotent: it records that it has already reserved against order 4471, and a redelivered OrderPlaced finds the reservation already done and returns without decrementing again. Payment is likewise a reliable command on Service Bus, with the same idempotency discipline keyed on the order identifier so a redelivery never charges the card twice.
The confirmation email is a discrete reactive notification with no ordering requirement and a tolerance for the rare duplicate, so it can ride Event Grid, which pushes the event to a function that composes and sends the message. Even here idempotency earns its keep: the email function records that it has sent the confirmation for order 4471, so a redelivery does not send a second email. Analytics is a stream concern; rather than reacting to each order in isolation, the analytics pipeline consumes a continuous stream of order events from Event Hubs, reading through partitions in order and checkpointing its progress, feeding a warehouse in batches. Audit is a durable record that must never be lost, so it consumes from a Service Bus subscription with a dead-letter queue catching anything that fails to write.
How does the producer stay ignorant of its consumers?
The producer stays ignorant because it publishes to a topic or an event endpoint, never to a named consumer. It writes the OrderPlaced fact to the backbone and the backbone owns delivery. Adding, removing, or changing a consumer is a subscription change on the backbone side, invisible to the order service, which is exactly the decoupling the pattern promises.
This ignorance is the property that makes the design extensible without risk to the core. Six months later, the business wants a fraud check on every order. In a request-driven design that means editing the order service to add a sixth call, redeploying the most critical service in the system, and accepting the new failure mode that the order path now depends on the fraud service. In the event-driven design, the fraud team stands up a new consumer, subscribes it to OrderPlaced, and ships. The order service is not touched, not redeployed, and not made newly dependent on anything. The cost of a new capability is borne entirely by the team adding it, which is the organizational property that lets event-driven systems scale across many teams without a coordination bottleneck on the producer.
Where does the design place its boundaries?
A good event-driven design draws its boundaries around business capabilities, not around technical layers. Inventory, payment, fulfillment, and notification are each a bounded context that owns its data and reacts to events on its own terms. Events cross those boundaries; internal calls stay within them. The boundary is where the event lives.
Getting the boundaries right is harder than getting the plumbing right, and it is where most of the long-term value or pain comes from. If two services constantly need each other’s data synchronously, they are probably one capability that was split too aggressively, and the events flying between them are just function calls wearing a costume. If a single service is reacting to a dozen unrelated events and owning a dozen unrelated concerns, it is probably several capabilities crammed into one box. The events you publish should read like a narrative of the business: an order was placed, a payment settled, an item shipped, a refund issued. When the event log reads like a story a domain expert would recognize, the boundaries are usually right, and the technical choices about backbones and idempotency fall into place beneath them.
Delivery guarantees and the idempotency they force
Now to the part of the design that decides whether the system corrupts data under load or survives it. The delivery guarantee a backbone offers is not a detail to skim; it is the constraint that dictates how every consumer must be written. Azure’s event services deliver at-least-once. They will deliver each event one or more times. They do not, in their default and most-used configurations, deliver exactly once, and assuming they do is the original sin of event-driven design.
Why at-least-once and not exactly-once? Because exactly-once delivery across a network is, in the general case, impossible to guarantee, and the systems that appear to offer it do so by combining at-least-once delivery with idempotent processing on the consumer side. The fundamental problem is the gap between delivering a message and recording that it was delivered. Suppose the backbone hands an event to a consumer, the consumer processes it, and then the consumer’s acknowledgment back to the backbone is lost to a network failure. The backbone never heard the acknowledgment, so it must assume the worst and redeliver, because the alternative, assuming success and not redelivering, would lose events whenever an acknowledgment genuinely failed to send. Faced with the choice between occasionally delivering twice and occasionally losing an event entirely, every serious messaging system chooses to deliver twice, because a duplicate is recoverable and a lost event is not. At-least-once is not a weakness; it is the correct choice, and it pushes the responsibility for handling duplicates onto the consumer, which is the only place it can correctly live.
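The lost-acknowledgment scenario is easy to simulate. The sketch below is an in-memory model, not SDK code: `DeliverOnce` plays a non-idempotent consumer, the loop plays the backbone, and `ackLostOnAttempt` is a hypothetical fault injector that drops the first acknowledgment.

```csharp
using System;
using System.Collections.Generic;

var chargeCount = 0;                            // side effect of a non-idempotent consumer
var ackLostOnAttempt = new HashSet<int> { 1 };  // hypothetical fault: the first ack is dropped

// The consumer does the work, then acknowledges.
// Returns whether the acknowledgment actually reached the backbone.
bool DeliverOnce(int attempt)
{
    chargeCount++;                              // the work happens regardless of the ack's fate
    return !ackLostOnAttempt.Contains(attempt);
}

// Backbone side: with no acknowledgment received, it must redeliver.
var deliveries = 0;
var acked = false;
while (!acked)
{
    deliveries++;
    acked = DeliverOnce(deliveries);
}

Console.WriteLine($"deliveries = {deliveries}, charges = {chargeCount}");
```

The backbone behaved correctly and the consumer still charged twice, which is why the fix has to live in the consumer.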
What does it mean for a consumer to be idempotent?
A consumer is idempotent when processing the same event twice leaves the system in the same state as processing it once. The second processing is a safe no-op. The consumer recognizes that it has already handled this event, by its unique identifier, and declines to repeat the side effect, so a redelivery cannot charge a card twice, send a duplicate email, or double-decrement stock.
The implementation patterns are well established, and the right one depends on the side effect. The most direct pattern is the deduplication store: the consumer keys on the event’s unique identifier and records, transactionally with the work it performs, that this identifier has been processed. On redelivery it finds the identifier already recorded and returns without acting. The subtlety is that the deduplication record and the side effect must be committed together; if you do the work and then crash before recording the identifier, the redelivery does the work again, and if you record the identifier and then crash before doing the work, the event is lost. The standard solution is to make the side effect itself idempotent where possible and to fall back on a transactional dedup store where it is not.
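
    // Requires: using Microsoft.EntityFrameworkCore;
    // Idempotent consumer keyed on the event id, with the dedup
    // record committed in the same transaction as the side effect.
    // Assumes _db is an EF Core DbContext and that ReserveInventoryAsync
    // writes through it, so the reservation joins the transaction.
    public async Task HandleAsync(OrderPlaced evt)
    {
        await using var tx = await _db.Database.BeginTransactionAsync();

        // Has this event already been processed?
        var already = await _db.ProcessedEvents
            .AnyAsync(p => p.EventId == evt.EventId);
        if (already)
        {
            return; // safe no-op on redelivery; nothing was written
        }

        await ReserveInventoryAsync(evt.OrderId, evt.Items);
        _db.ProcessedEvents.Add(new ProcessedEvent { EventId = evt.EventId });

        try
        {
            // The unique index on ProcessedEvents.EventId enforces the
            // guarantee under a race: the losing delivery fails here.
            await _db.SaveChangesAsync();
            await tx.CommitAsync();
        }
        catch (DbUpdateException)
        {
            // Undo our reservation, then check whether a concurrent
            // delivery recorded the event. If not, the failure was
            // something else, so rethrow and let the backbone retry.
            await tx.RollbackAsync();
            var wonByOther = await _db.ProcessedEvents
                .AnyAsync(p => p.EventId == evt.EventId);
            if (!wonByOther) throw;
        }
    }
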
A second pattern leans on the natural idempotency of the operation. Setting a value is idempotent; incrementing one is not. “Set order 4471 status to shipped” can run a thousand times and the status is shipped; “increment the shipped counter” run twice over-counts. Where you can express the side effect as a set rather than an increment, as an upsert rather than an insert, as a conditional update guarded by the event’s identifier or a version number, you get idempotency for free from the data model and need no separate dedup store. A third pattern uses the backbone’s own features: Service Bus duplicate detection will drop a message whose identifier it has seen within a configured window, which handles producer-side duplicates cheaply, though it does not replace consumer-side idempotency because the detection window is finite and redeliveries can fall outside it.
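The set-versus-increment distinction and the version-guarded update can be shown side by side. This is a minimal in-memory sketch; the `status` dictionary, the counter, the balance, and the `ApplyDebit` helper are all hypothetical names standing in for rows in a real data store.

```csharp
using System;
using System.Collections.Generic;

var status = new Dictionary<int, string>();
var shippedCounter = 0;
var balance = 100m;
var lastAppliedVersion = 0;

void SetShipped(int orderId) => status[orderId] = "shipped";   // idempotent: a set
void CountShipped() => shippedCounter++;                        // not idempotent: an increment

// Conditional update guarded by a version number: a repeated or stale version is ignored.
bool ApplyDebit(int version, decimal amount)
{
    if (version <= lastAppliedVersion) return false;
    balance -= amount;
    lastAppliedVersion = version;
    return true;
}

// Simulate the same event being delivered twice.
for (var delivery = 0; delivery < 2; delivery++)
{
    SetShipped(4471);
    CountShipped();
    ApplyDebit(version: 1, amount: 30m);
}

Console.WriteLine($"status = {status[4471]}, counter = {shippedCounter}, balance = {balance}");
```

After two deliveries the status and the balance are correct and only the raw counter has drifted. Note that the version guard in this sketch also discards older versions that arrive late, which is usually what you want for a last-writer-wins field but is a design choice rather than a given.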
Why does exactly-once thinking cause data corruption?
Exactly-once thinking causes corruption because it leads engineers to write non-idempotent consumers, ones that blindly apply each delivery as if it were the first. The first redelivery, caused by a lost acknowledgment, a consumer restart, or a backbone failover, then applies the side effect a second time, double-charging, double-shipping, or drifting a count, and the bug surfaces far from its cause.
The treacherous part is that these systems test clean. In development and in light testing, redeliveries are rare, so a non-idempotent consumer appears to work perfectly for weeks. Then traffic rises, a deployment restarts a consumer mid-batch, a transient network fault drops an acknowledgment, and suddenly duplicates appear in production with no obvious trigger. The engineer who assumed exactly-once now hunts a phantom, because the duplicate processing happened correctly according to the platform’s contract; the platform did exactly what at-least-once promised. The only durable fix is to make the consumers idempotent, which is why the discipline belongs in the design from the first line of code rather than retrofitted after the first incident. This is the at-least-once-means-idempotent rule earning its place as the organizing principle of the whole architecture.
Does ordering survive in an event-driven system?
Ordering survives only where you deliberately preserve it, and it costs throughput to do so. Event Grid gives no ordering guarantee. Event Hubs preserves order within a partition but not across partitions. Service Bus preserves order within a session, because each session is locked to a single consumer at a time, but not across sessions. If a flow needs strict order, you choose the backbone and the mechanism that provides it and accept the reduced parallelism.
The practical guidance is to need ordering as rarely as possible, because every ordering guarantee is a serialization point that caps how much you can parallelize. Often the apparent need for ordering dissolves under inspection: if consumers are idempotent and operations are commutative or expressed as sets rather than sequences, the order of arrival stops mattering, and you regain full parallelism. Where ordering genuinely matters, scope it as narrowly as possible. You rarely need all events ordered globally; you need the events for one aggregate, one customer, one account, ordered relative to each other. Event Hubs gives you that by partitioning on the aggregate’s key so all of one customer’s events land on one ordered partition. Service Bus gives you that with sessions keyed on the aggregate. Both confine the serialization to the aggregate and let different aggregates process in parallel, which is the right granularity: order within the thing that needs it, parallelism across the things that do not.
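The granularity described above, order within an aggregate and parallelism across aggregates, can be sketched with one sequential lane per key. This is an in-memory illustration of the idea that sessions and partition keys implement, not code against either service; the customer names and sequence numbers are made up for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Interleaved arrival order across three hypothetical customers.
var events = new (string Customer, int Seq)[]
{
    ("alice", 1), ("bob", 1), ("alice", 2), ("carol", 1), ("bob", 2), ("alice", 3),
};
var processed = new ConcurrentDictionary<string, List<int>>();

// One sequential lane per aggregate key; the lanes themselves run in parallel.
var lanes = events
    .GroupBy(e => e.Customer)               // GroupBy keeps each group's elements in source order
    .Select(lane => Task.Run(() =>
    {
        var seen = new List<int>();
        foreach (var item in lane) seen.Add(item.Seq);   // strictly in arrival order for this key
        processed[lane.Key] = seen;
    }))
    .ToArray();

Task.WaitAll(lanes);

foreach (var customer in processed.Keys.OrderBy(k => k))
    Console.WriteLine($"{customer}: {string.Join(", ", processed[customer])}");
```

No ordering is promised between alice's events and bob's, and none is needed; the serialization point is confined to the key.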
Dead-lettering: where failed events go to be seen
Every event-driven system eventually produces an event it cannot process. The payload is malformed, a downstream dependency is permanently rejecting it, a bug in the consumer throws on a particular shape of data, or the event references an entity that no longer exists. The question is not whether this happens but what the system does when it does, and the answer separates resilient designs from fragile ones. A naive consumer that throws on a bad event and lets the backbone redeliver it forever creates a poison message: the same event is delivered, fails, gets redelivered, fails again, and blocks or churns indefinitely, sometimes stalling the entire flow behind it. Dead-lettering is the mechanism that breaks that loop.
A dead-letter destination is a side channel where events that cannot be processed after a configured number of attempts are set aside for inspection rather than retried forever or silently dropped. Service Bus provides a dead-letter queue automatically on every queue and every subscription; after the maximum delivery count is reached, the message moves to the dead-letter sub-queue with metadata explaining why, and it waits there for an operator or an automated process to examine it. Event Grid supports a dead-letter destination too, though it must be configured on the event subscription: a blob container in a storage account where it deposits events it failed to deliver after exhausting its retry schedule. Event Hubs, being a stream, handles the concept differently: because consumers control their own position through checkpoints, a consumer that hits a bad event must decide whether to skip it, route it elsewhere, or stop, and many stream-processing designs write unprocessable records to a separate “dead-letter” stream or store explicitly in consumer code.
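The delivery-count mechanism can be modeled in a few lines. The sketch below is an in-memory simulation of the behavior described for a broker-managed dead-letter queue, not Service Bus SDK code; the `maxDeliveryCount` of 3, the `Handle` function, and the reason text are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

const int maxDeliveryCount = 3;   // assumed value for the sketch; configurable in a real broker
var queue = new Queue<(string Body, int DeliveryCount)>();
var deadLetter = new List<(string Body, string Reason)>();
var handled = new List<string>();

// Hypothetical handler: throws on a payload it cannot parse.
void Handle(string payload)
{
    if (!payload.StartsWith("OrderPlaced:")) throw new FormatException("unrecognized payload");
    handled.Add(payload);
}

queue.Enqueue(("OrderPlaced:4471", 0));
queue.Enqueue(("???", 0));                 // the poison message
queue.Enqueue(("OrderPlaced:4472", 0));

while (queue.Count > 0)
{
    var (body, count) = queue.Dequeue();
    count++;
    try
    {
        Handle(body);
    }
    catch (Exception ex)
    {
        if (count >= maxDeliveryCount)
            deadLetter.Add((body, $"delivery count exceeded: {ex.Message}"));   // set aside, with a reason
        else
            queue.Enqueue((body, count));                                        // redeliver
    }
}

Console.WriteLine($"handled = {handled.Count}, dead-lettered = {deadLetter.Count}");
```

The poison message is attempted three times and then set aside with its failure reason, while the two healthy orders behind it complete instead of stalling.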
Why do ignored dead-letter queues become incidents?
Dead-letter queues become incidents because they are invisible until someone looks, and nobody looks until something breaks. Events pile up silently, the orders they represent never complete, and the first sign is a customer complaint or a reconciliation mismatch weeks later. A dead-letter queue with no alert on its depth is a failure the system already had and has not told anyone about.
The discipline that prevents this is treating dead-letter queue depth as a first-class operational signal, monitored and alerted on like error rate or latency. A nonzero and growing dead-letter count is a defect report the system is filing against itself, and it deserves the same urgency. The operational loop has three parts. First, alert when the dead-letter depth crosses a low threshold, because in a healthy system that depth should be zero or near it, and any sustained growth is news. Second, make the dead-lettered events inspectable, with enough metadata, the failure reason, the delivery count, the original event, to diagnose why each one failed without guesswork. Third, build a path to reprocess, because many dead-lettered events failed for a transient or now-fixed reason and can be replayed successfully once the underlying issue is resolved. A dead-letter queue you can inspect and replay turns a class of permanent failures into recoverable ones; a dead-letter queue nobody watches turns a recoverable failure into a silent loss.
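The three parts of that loop, alert, inspect, and replay, can be sketched against an in-memory list. Everything here is hypothetical scaffolding: the threshold of 1, the reason strings, and the `inventoryHealthy` flag stand in for a real metric alert, real dead-letter metadata, and a real health check.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int alertThreshold = 1;   // assumed: in a healthy system the depth should sit near zero
var deadLetter = new List<(string Body, string Reason, int DeliveryCount)>
{
    ("OrderPlaced:4471", "Timeout calling inventory", 10),
    ("OrderPlaced:4472", "Timeout calling inventory", 10),
    ("???", "Deserialization failed", 1),
};
var inventoryHealthy = true;    // the dependency has since recovered
var replayed = new List<string>();

// 1. Alert on depth.
var shouldAlert = deadLetter.Count >= alertThreshold;

// 2. Inspect: group by failure reason so the cause is visible without guesswork.
var byReason = deadLetter
    .GroupBy(entry => entry.Reason)
    .ToDictionary(g => g.Key, g => g.Count());

// 3. Replay what failed for a now-fixed reason; leave the rest for a human.
foreach (var msg in deadLetter.ToList())
{
    if (msg.Reason.StartsWith("Timeout") && inventoryHealthy)
    {
        replayed.Add(msg.Body);   // a real system would resubmit to the original queue
        deadLetter.Remove(msg);
    }
}

Console.WriteLine($"alert = {shouldAlert}, replayed = {replayed.Count}, remaining = {deadLetter.Count}");
```

Replay is only safe because the consumers are idempotent: a replayed event that had in fact partially succeeded the first time must not repeat its side effect.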
How many retries before dead-lettering, and how should they back off?
The retry count and backoff should match the failure you expect. Transient faults, a momentary timeout or a brief dependency blip, deserve a handful of retries with exponential backoff so the system rides through the blip without human intervention. Permanent faults, a malformed payload or a logic error, deserve to dead-letter quickly because retrying them only wastes effort and delays the alert.
The art is distinguishing the two at the point of failure. A consumer that catches its exceptions can often tell a transient fault (a timeout, a 503 from a dependency, a throttling response) from a permanent one (a deserialization failure, a validation error, a 400). The disciplined pattern routes them differently: retry the transient class with backoff and jitter so that retries from many consumers do not synchronize into a thundering herd and hammer a recovering dependency, and dead-letter the permanent class immediately so it surfaces for a human rather than burning the retry budget. Service Bus lets a consumer dead-letter a message explicitly with a reason rather than waiting for the delivery count to exhaust, which is exactly what you want for a recognized permanent failure: do not retry a message you already know will never succeed, move it to the dead-letter queue now with a clear reason, and let the retry budget protect against the faults that retries can actually fix. This connects directly to the broader resilience toolkit of retry with backoff and circuit breaking that any distributed system needs.
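A minimal sketch of that routing, assuming the consumer has already mapped its failures onto two exception classes. `TransientError`, `PermanentError`, `process_with_retry`, and the `dead_letter` callback are hypothetical names for illustration; the callback stands in for whatever explicit dead-letter call your messaging client offers. The backoff uses the common "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap.

```python
import random
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Timeouts, 503s, throttling: a retry may succeed."""


class PermanentError(Exception):
    """Malformed payloads, validation failures, 400s: a retry cannot help."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def process_with_retry(
    handler: Callable[[dict], None],
    event: dict,
    dead_letter: Callable[[dict, str], None],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the event was handled, False if it was dead-lettered."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except PermanentError as exc:
            # Known-hopeless: surface it now instead of spending the retry budget.
            dead_letter(event, f"permanent: {exc}")
            return False
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letter(event, f"retries exhausted: {last_error}")
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waits; in production the default `time.sleep` applies.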
Choreography versus orchestration
Once events flow between many services, a second design question surfaces: who, if anyone, is in charge of the overall workflow? There are two answers, and the choice between them shapes how a multi-step process is understood, debugged, and changed. The two answers are choreography and orchestration, and most real systems use a blend, but they pull in opposite directions and it pays to know which one a given workflow is leaning on.
In choreography, no service is in charge. Each service reacts to events and emits its own events, and the overall workflow is an emergent property of those reactions, like dancers each responding to the music and to each other with no conductor. The order service publishes OrderPlaced; the inventory service reacts and publishes InventoryReserved; the payment service reacts to that and publishes PaymentSettled; the fulfillment service reacts and publishes OrderShipped. No single component knows or controls the whole sequence. The workflow exists only in the chain of reactions, distributed across the services that participate in it.
In orchestration, one component is in charge. An orchestrator holds the workflow definition and directs each step explicitly: it tells inventory to reserve, waits for the result, tells payment to charge, waits, tells fulfillment to ship. The workflow lives in one place, the orchestrator, which makes it visible and modifiable as a unit. On Azure, Durable Functions is the common orchestration engine, letting you write the workflow as code that survives across the long waits between steps, and Logic Apps offers a lower-code orchestration surface for integration-heavy flows.
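The contrast between the two styles shows up clearly even in a toy. The sketch below is an in-memory illustration only: `Bus`, `wire_choreography`, and `orchestrate_order` are hypothetical names, the bus dispatches synchronously for simplicity, and nothing here uses Durable Functions or any Azure SDK. In the choreographed half the sequence exists only as a chain of subscriptions; in the orchestrated half it is one readable function.

```python
from collections import defaultdict
from typing import Callable


class Bus:
    """Toy publish-subscribe bus that records every event type it carries."""

    def __init__(self) -> None:
        self.handlers: dict = defaultdict(list)
        self.log: list = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> None:
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(data)


def wire_choreography(bus: Bus) -> None:
    """Each service reacts to one event and emits the next; nobody owns the sequence."""
    bus.subscribe("OrderPlaced", lambda d: bus.publish("InventoryReserved", d))
    bus.subscribe("InventoryReserved", lambda d: bus.publish("PaymentSettled", d))
    bus.subscribe("PaymentSettled", lambda d: bus.publish("OrderShipped", d))


def orchestrate_order(
    order: dict,
    reserve: Callable[[dict], None],
    charge: Callable[[dict], None],
    ship: Callable[[dict], None],
) -> list:
    """One function holds the whole workflow and directs each step in turn."""
    steps = []
    reserve(order)
    steps.append("reserved")
    charge(order)
    steps.append("charged")
    ship(order)
    steps.append("shipped")
    return steps
```

To learn the order of steps in the choreographed version you have to read every subscription; in the orchestrated version you read one function body.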
When does choreography sprawl into something nobody understands?
Choreography sprawls when the workflow grows past a few steps and no single artifact describes it. Each service knows only its own reaction, so understanding the end-to-end flow means tracing events across many services and many repositories, reconstructing in your head a process that lives nowhere as a whole. At that point a change to the flow becomes archaeology.
This is the central trade-off. Choreography’s strength is loose coupling and independent evolution: each service owns its reactions, teams ship without coordinating through a central workflow owner, and the system stays decoupled in exactly the way the event-driven pattern intends. Its weakness is that the workflow becomes invisible. There is no one place to look to see what happens when an order is placed, no single component to put a breakpoint in, and no obvious owner when the flow misbehaves between services. For a short, stable reaction chain, that invisibility is a fair price, and choreography is the cleaner choice. For a long, evolving, business-critical workflow with compensation logic and conditional branches, the invisibility becomes a liability, and an orchestrator that holds the whole flow as a readable, debuggable, single-owner artifact is worth the central coupling it introduces.
The practical guidance is to choreograph the small and orchestrate the complex. When a reaction is a simple “this happened, so do that,” choreography keeps it decoupled and clean. When a process has multiple steps that must happen in a defined order, with compensating actions if a later step fails, an orchestrator gives you a place to express that logic coherently, to see its state, and to drive its recovery. A common mature pattern uses orchestration within a bounded context for its internal multi-step workflows and choreography between contexts for the loosely coupled business events that cross boundaries, getting the visibility of orchestration where the complexity lives and the decoupling of choreography where the boundaries are.
How do you compensate when a step fails midway?
You compensate by issuing an event or command that undoes the effect of the steps already completed, because in a distributed workflow there is no shared transaction to roll back. If payment fails after inventory was reserved, the system publishes an event that releases the reservation. The undo is itself an event, handled idempotently like every other.
This is the saga pattern, and it is how distributed workflows maintain consistency without a distributed transaction spanning multiple services and data stores. A saga is a sequence of local transactions, each with a defined compensating action, and when a step fails the saga runs the compensations for the steps that already succeeded, in reverse, to return the system to a consistent state. Orchestrated sagas put the compensation logic in the orchestrator, which knows what has run and what to undo. Choreographed sagas distribute it: each service that performed work also listens for the failure events that mean it should compensate. Either way, the compensations are ordinary events subject to at-least-once delivery, so they too must be idempotent, and the system must tolerate the window between a failure and the completion of its compensations, during which the data is temporarily inconsistent. Designing that window deliberately, deciding what a reader sees while a saga is unwinding, is part of designing the workflow, not an afterthought.
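An orchestrated saga can be sketched in a few lines. This is an illustrative in-process model, not a durable orchestration engine: `run_saga` and the step tuple shape are hypothetical, there is no persistence across crashes, and the compensations are assumed to be idempotent as the text requires.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list, ctx: dict) -> list:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns a journal of what happened, which is also what a reader would
    inspect to see how far a saga got and what it undid.
    """
    journal: list = []
    completed: list = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
        except Exception:
            journal.append(f"{name}:failed")
            # Undo only what actually succeeded, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate(ctx)
                journal.append(f"{done_name}:compensated")
            return journal
        completed.append(step)
        journal.append(f"{name}:done")
    return journal
```

Between the `failed` entry and the last `compensated` entry the data is inconsistent; that span is the window the workflow design has to account for.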
The trade-offs and failure modes you must handle
Event-driven architecture trades a set of problems you understand for a set of problems you may not. The synchronous chain was hard to scale and brittle under partial failure, but it was easy to reason about: you could read the code top to bottom and see what happened. The event-driven system scales and tolerates partial failure, but it scatters the logic across services and introduces a category of distributed-systems problems that the synchronous design never had. Naming those failure modes lets you design against them instead of discovering them.
The first is duplicate processing, already covered at length, and the one that causes the most production damage because it corrupts data rather than merely failing. The defense is idempotent consumers, applied without exception, keyed on event identifiers or expressed through naturally idempotent operations. There is no shortcut around this; it is the cost of admission to at-least-once delivery, and a system that skips it is not finished, only untested under the load that will eventually expose it.
The second is eventual consistency. Because reactions happen asynchronously, there is a window after an event is published during which some services have reacted and others have not. The order exists but the inventory has not yet been reserved; the payment settled but the confirmation email has not yet gone out. For most flows this window is milliseconds and harmless, but it is real, and any part of the system that reads across services must tolerate it. The classic symptom is a user who completes an action and then refreshes a page that has not yet caught up, seeing stale data and concluding the action failed. The design must account for this, by showing pending states, by reading from the service that owns the data rather than from a projection that may lag, or by setting user expectations that some effects are not instantaneous. Eventual consistency is not a bug to be fixed; it is a property to be designed for, and pretending the system is synchronous when it is not produces confusing behavior at the edges.
Why do all-synchronous reactions defeat the purpose?
All-synchronous reactions defeat the purpose because they reintroduce the coupling the events were meant to remove. If a consumer reacts to an event by synchronously calling three other services and waiting on all of them, the failure of any one of them fails the reaction, and the event-driven facade hides a synchronous chain that has all the brittleness the pattern was supposed to eliminate.
This is one of the most common ways an event-driven migration disappoints. A team puts a backbone between the producer and the first consumer, declares the system event-driven, and then writes that consumer to synchronously orchestrate everything downstream. The producer is decoupled, but the consumer has become a synchronous monolith with one event-shaped door. The fix is to push the decoupling all the way through: where a reaction needs to trigger further work, it should usually do so by publishing another event, not by calling a service and waiting. There are legitimate synchronous calls inside a consumer, such as a query for reference data or a validation against an authoritative source, but a reaction that fans out synchronously to multiple services and depends on all of them is a design that has kept the chain and merely changed its first link.
What happens when an event schema changes?
When an event schema changes, every consumer that reads it is at risk, because the producer and consumers deploy independently and you cannot update them all at once. A producer that adds a required field, renames a property, or changes a type can break consumers that were happily reading the old shape, and because the coupling is now through the event’s structure, schema is the new contract.
The discipline is to treat the event schema as a published interface with the same care you would give a public API. Evolve it additively: add optional fields, never remove or repurpose existing ones, and never change the meaning of a field that consumers already depend on. Version the event when a breaking change is unavoidable, publishing both the old and new versions during a transition window so consumers migrate on their own schedule rather than being forced to deploy in lockstep with the producer. A schema registry or a shared contract definition makes the interface explicit and lets producers and consumers validate against it. The failure mode this prevents is the silent one: a producer ships a schema change, a consumer three teams away starts throwing on every event, the events dead-letter, and nobody connects the two changes until the dead-letter queue alert fires. Schema governance is the unglamorous discipline that keeps a decoupled system from coupling through the back door of its data shapes.
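On the consumer side, additive evolution pairs naturally with a tolerant reader: require only the fields this consumer needs, default the optional ones, ignore the unknown ones, and fail loudly on a version it has not been taught. The sketch below is an assumption-laden illustration: the `OrderPlaced` shape, the `schema_version` field, and the `currency` and `coupon` fields are hypothetical and not taken from any real contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    amount_cents: int
    currency: str = "USD"          # hypothetical optional field added in a later version
    coupon: Optional[str] = None   # hypothetical optional field added in a later version


def parse_order_placed(payload: dict) -> OrderPlaced:
    """Tolerant reader: take what this consumer needs and ignore everything else."""
    version = payload.get("schema_version", 1)
    if version > 2:
        # A breaking version this consumer does not understand: raise so the
        # event dead-letters with a clear reason instead of being misread.
        raise ValueError(f"unsupported OrderPlaced schema_version {version}")
    return OrderPlaced(
        order_id=payload["order_id"],
        amount_cents=payload["amount_cents"],
        currency=payload.get("currency", "USD"),
        coupon=payload.get("coupon"),
    )
```

A producer that adds a new optional field leaves this consumer untouched, which is the whole point of evolving additively.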
Why is observability harder, and what restores it?
Observability is harder because a single business operation no longer lives in one call stack; it is spread across many services reacting to events at different times, so a stack trace shows only one fragment. Without deliberate tracing you cannot follow an order through inventory, payment, and fulfillment, and debugging becomes guesswork across disconnected logs.
What restores it is correlation. Every event carries a correlation identifier that flows from the originating action through every reaction it triggers, so the logs, traces, and metrics of all the services that touched one operation can be stitched back into a single timeline. Distributed tracing, propagated through the event metadata, lets you see the whole flow as one trace even though it executed across five services and three backbones. Beyond per-operation tracing, the system needs aggregate signals that are specific to event-driven designs: backbone queue depth and age of the oldest unprocessed message, which reveal a consumer falling behind; dead-letter queue depth, which reveals events failing; consumer lag against the stream in Event Hubs, which reveals a stream consumer that cannot keep up. These signals have no equivalent in a synchronous system and are precisely the ones that tell you the health of an asynchronous one. Investing in them is not optional polish; it is the instrumentation without which an event-driven system is a black box that works until it does not and gives you nothing to go on when it stops.
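Correlation is mostly a matter of discipline in how events are constructed. The sketch below assumes a simple dictionary envelope; `new_event`, `timeline`, and the `causation_id` field are hypothetical names for illustration, and a real system would more likely carry these values in the backbone's message metadata or a tracing header.

```python
import uuid
from typing import Optional


def new_event(event_type: str, data: dict, caused_by: Optional[dict] = None) -> dict:
    """Build an event envelope that inherits the correlation id of its cause."""
    event_id = str(uuid.uuid4())
    return {
        "id": event_id,
        "type": event_type,
        # The first event in a flow starts the correlation; every reaction copies it.
        "correlation_id": caused_by["correlation_id"] if caused_by else event_id,
        # Hypothetical extra field: the id of the event that directly triggered this one.
        "causation_id": caused_by["id"] if caused_by else None,
        "data": data,
    }


def timeline(events: list, correlation_id: str) -> list:
    """Stitch one operation back together from a mixed log of events."""
    return [e["type"] for e in events if e["correlation_id"] == correlation_id]
```

With every reaction built through `new_event(..., caused_by=incoming)`, a log search on one identifier recovers the whole operation regardless of how many services it crossed.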
When the pattern fits and when it is overkill
Event-driven architecture is a tool, not a virtue, and applying it everywhere is its own failure mode. The pattern earns its complexity when the system has genuine independence between its parts, when load is spiky or uneven, when the set of reactions to an event grows over time, and when teams need to ship independently without coordinating through a shared call graph. It is overkill when the system is small, the flow is a simple linear sequence that always runs the same way, the components are tightly bound and deploy together anyway, and the asynchrony buys nothing but the operational weight of running backbones and reasoning about eventual consistency.
The honest test is whether the decoupling solves a problem you actually have. A three-service application owned by one team, deployed together, with a synchronous flow that is fast and reliable enough, gains little from being pulled apart with backbones between every step, and it pays the full price: more infrastructure, eventual consistency to reason about, idempotency to implement, distributed tracing to build, and a workflow that no longer reads top to bottom. The same three services, if they belong to three teams that want to ship independently, or if one of them is an unreliable third-party integration that should not be allowed to fail the others, or if the load on one spikes while the others stay flat, are a strong case for events. The pattern’s value is proportional to the independence and the variability in the system, and where those are low, a synchronous design is not a failure of ambition; it is the right call.
There is also a middle path that engineers often miss. You do not have to make a system fully event-driven to get the benefits of events where they matter. A mostly synchronous application can introduce a backbone at exactly the seams that need it (the slow third-party call, the spiky background job, the audit trail that must never block the request) and keep the rest synchronous and simple. Reaching for events surgically, at the boundaries where decoupling pays, is more mature than converting everything to events on principle. The architecture that uses events where they earn their keep and direct calls where they do not is usually better than the one that picks a single style and applies it dogmatically.
Which backbone for a reactive event versus a stream?
For a discrete reactive event that should trigger a handler, reach for Event Grid, which pushes the notification to your function or webhook with retry and dead-lettering managed for you. For a continuous high-volume stream that consumers read through in order, reach for Event Hubs, which ingests at scale and lets multiple consumer groups replay the same data independently. The shape of the data decides.
The mistake that this question heads off is using one backbone for both shapes because it is the one already in place. A team running Event Hubs for telemetry, asked to add an order-confirmation notification, may be tempted to publish the order event to Event Hubs too, and then they are writing partition keys, consumer groups, and checkpoint logic for a single discrete notification that Event Grid would have pushed to a function with no infrastructure at all. The reverse mistake sends a million-per-second telemetry stream through Event Grid, which is built to route discrete notifications, not to ingest a firehose. Each backbone is excellent at its shape and awkward at the others, and the cost of the mismatch is paid continuously in code that fights the tool. The decision table earlier in this guide exists precisely to make this choice by fit rather than by habit.
When does fan-out justify a topic over a queue?
Fan-out justifies a topic the moment more than one independent consumer needs its own copy of the same event. A queue delivers each message to exactly one reader, which is right for distributing work among interchangeable workers, but wrong when several distinct services each need to react to the same fact. A topic gives each subscriber its own copy and its own progress, so the email, audit, and analytics consumers all see every order without competing for it.
The signal that you have outgrown a queue is the appearance of a second reason to consume the same event. The first consumer of OrderPlaced was inventory; now audit needs it too, and analytics after that. If those share a single queue, they steal messages from one another, each seeing only the fraction the others did not grab. A topic with a subscription per consumer fixes this cleanly: inventory, audit, and analytics each have a subscription, each receives every event, and each tracks its own position and its own dead-letters. Service Bus topics give you this with the reliability and per-subscription dead-lettering that enterprise fan-out wants, while Event Grid gives you a lighter reactive fan-out where subscribers are functions or webhooks reacting to notifications. The choice between them returns to the shape of the event and the delivery guarantees the subscribers need.
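The message-stealing effect is easy to demonstrate with a toy model. The functions below are hypothetical and purely in-memory; round-robin assignment is a simplification of competing consumers, which in reality take whichever message they manage to lock first.

```python
from itertools import cycle


def deliver_via_queue(events: list, consumers: list) -> dict:
    """Competing consumers: each event reaches exactly one consumer (round-robin here)."""
    received = {name: [] for name in consumers}
    for event, name in zip(events, cycle(consumers)):
        received[name].append(event)
    return received


def deliver_via_topic(events: list, subscriptions: list) -> dict:
    """Topic: every subscription receives its own copy of every event."""
    return {name: list(events) for name in subscriptions}
```

With three orders and the inventory, audit, and analytics consumers sharing one queue, each sees a single order; with a subscription apiece on a topic, each sees all three.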
When do ordering needs force Service Bus sessions?
Ordering needs force Service Bus sessions when a group of related messages must be processed strictly in sequence by a single consumer, such as all the commands for one bank account applied in the order they were issued. A session ties those messages together and routes them to one consumer in order, trading the parallelism of competing consumers for the guarantee that the sequence holds.
The judgment is to scope the ordering to the narrowest group that needs it, because a session is a serialization point. You almost never need every message in the system ordered; you need the messages for one aggregate ordered relative to each other, and different aggregates can proceed in parallel. Keying the session on the aggregate (the account, the order, the customer) confines the serial processing to that aggregate and lets the consumer pool process many aggregates at once, one session each. This is the same principle as partitioning a stream by aggregate key in Event Hubs: order within the boundary that requires it, parallelism across the boundaries that do not. Reaching for global ordering when aggregate-scoped ordering would do is the common error, and it caps throughput far below what the workload actually requires.
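Aggregate-scoped ordering can be illustrated without any messaging client. In the sketch below, `group_by_session`, `apply_session`, and `process_all` are hypothetical names, the `account_id` key and the deposit/withdraw commands are invented for the example, and a thread pool stands in for a pool of consumers that each hold one session at a time.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_session(messages: list, key: str = "account_id") -> dict:
    """Group messages by session key, preserving arrival order inside each group."""
    sessions: dict = defaultdict(list)
    for message in messages:
        sessions[message[key]].append(message)
    return dict(sessions)


def apply_session(commands: list) -> int:
    """Apply one account's commands strictly in order and return the final balance."""
    balance = 0
    for command in commands:
        if command["op"] == "deposit":
            balance += command["amount"]
        elif command["op"] == "withdraw":
            if command["amount"] > balance:
                # This outcome depends on ordering: the same commands reordered would pass.
                raise ValueError("insufficient funds")
            balance -= command["amount"]
    return balance


def process_all(messages: list, workers: int = 4) -> dict:
    """Serial within a session, parallel across sessions."""
    sessions = group_by_session(messages)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {sid: pool.submit(apply_session, cmds) for sid, cmds in sessions.items()}
        return {sid: future.result() for sid, future in futures.items()}
```

Only the commands for a single account are serialized; two accounts never wait on each other, which is the throughput that global ordering would throw away.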
How to evolve an event-driven system
A system never becomes event-driven all at once, and the realistic path is incremental. The most reliable way to introduce the pattern into an existing synchronous application is to find the single seam where decoupling pays the most, usually a slow or unreliable downstream step that should not block the main flow, and place a backbone there first. The producer publishes an event instead of calling the slow step synchronously, a consumer reacts to it, and the main flow returns faster and stops failing when the slow step stumbles. One seam, one backbone, one consumer, measured against the latency and failure rate before and after. This proves the pattern in your context and builds the operational muscle (idempotency, dead-letter monitoring, tracing) before you depend on it broadly.
From that first seam, the system grows by adding consumers to existing events and by introducing new events at new seams, never by converting everything in one disruptive rewrite. Because the producer is ignorant of consumers, each new reaction is additive and low-risk, which is the property that makes incremental growth safe. As the number of events and consumers rises, the disciplines that were optional at small scale become mandatory: a schema registry to govern the event contracts, correlation identifiers and distributed tracing to keep the system observable, dead-letter monitoring on every subscription, and a clear convention for naming events and scoping ordering. These are the load-bearing practices that let an event-driven system stay comprehensible as it scales from one seam to dozens, and adding them deliberately as you grow is far easier than retrofitting them after the system has sprawled.
There is a point at which choreographed reactions accumulate into workflows complex enough to deserve an orchestrator, and recognizing that point is part of evolving the system well. When a business process spans many events, carries compensation logic, and no longer fits in anyone’s head as a chain of reactions, lifting that process into a Durable Functions orchestration or a Logic App gives it a home, a visible state, and an owner, without abandoning the choreographed events that connect bounded contexts. The mature system is rarely purely choreographed or purely orchestrated; it choreographs across boundaries and orchestrates the complex flows within them, and it arrives at that shape by evolving toward it rather than by decree.
To build and test these flows hands-on, including wiring an event through Event Grid, Event Hubs, or Service Bus and proving a consumer is idempotent under redelivery, VaultBook provides Azure labs and a command library where you can stand up the backbones, publish test events, force redeliveries, and watch dead-lettering behave, so the idempotency discipline this guide insists on is something you verify in a sandbox rather than discover in production.
How the design connects to the rest of the platform
An event-driven architecture is not a self-contained island; it sits on top of decisions about which messaging service to use, which patterns to apply, how to handle a high-volume stream, and what compute reacts to the events. Each of those is a deep topic in its own right, and treating them as part of one coherent design is what separates a system that hangs together from a collection of services that happen to share a backbone.
The first connected decision is the messaging-service choice itself, the Event Grid versus Event Hubs versus Service Bus question that the decision table in this guide compresses into a few rows. That comparison deserves its own careful treatment because the three services differ along more axes than event shape alone, including pricing models, throughput ceilings, ordering guarantees, and protocol support, and a thorough side-by-side of Service Bus, Event Hubs, and Event Grid is the reference to consult when a choice is genuinely close and the deciding signal needs more nuance than a single column can hold. The decision table here is the fast path; the full comparison is where you go when the fast path leaves you between two options.
The second connected topic is the catalog of messaging patterns the backbones enable. Publish-subscribe and fan-out are the patterns this guide leans on, but the full vocabulary runs deeper: competing consumers, the claim-check for large payloads, the outbox pattern for recording an event in the same transaction as a database write so that it is reliably published afterward, and the saga for distributed workflows. The broader treatment of asynchronous messaging patterns on Azure lays out that vocabulary, and an event-driven design is mostly an exercise in composing those patterns correctly: a fan-out here, an outbox there to close the gap between persisting state and publishing the event about it, a saga to span the workflow. Knowing the patterns by name turns design from invention into selection.
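Of those patterns, the outbox is the one most often described and least often shown. The sketch below uses SQLite from the standard library as a stand-in for the service's own database; the table layout, `place_order`, `relay_outbox`, and the `publish` callback are all hypothetical. The state change and the event row commit in one local transaction, and a separate relay publishes unsent rows afterward, marking each as sent only once the publish call has returned.

```python
import json
import sqlite3
from typing import Callable


def init_schema(db: sqlite3.Connection) -> None:
    db.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(db: sqlite3.Connection, order_id: str, amount_cents: int) -> None:
    """Write the order and its event row atomically: both commit or neither does."""
    with db:  # commits on success, rolls back if an exception escapes
        db.execute("INSERT INTO orders (id, amount_cents) VALUES (?, ?)", (order_id, amount_cents))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id, "amount_cents": amount_cents})),
        )


def relay_outbox(db: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    """Publish unsent rows in order; return how many rows were relayed."""
    rows = db.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        # A crash after publish but before the update repeats the publish on the
        # next run, so delivery is at-least-once and consumers must be idempotent.
        publish(event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

The relay never invents an event for an order that failed to commit, and it never loses one for an order that did, which is the gap the pattern exists to close.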
The third connected topic is streaming at scale, which is where Event Hubs earns its own deep dive. The analytics arm of the reference design, the continuous stream of order events feeding a warehouse, is a streaming problem with its own concerns: partition count and key selection, consumer-group management, checkpoint frequency, throughput units and the auto-inflate behavior, and the capture feature that lands the stream in storage automatically. The Event Hubs engineering guide goes into the partition mathematics and the tuning that keeps a high-throughput stream healthy, depth that a general architecture guide can only gesture at. When the streaming arm of an event-driven system is the part under pressure, that is the reference that addresses it directly.
The fourth connected topic is the compute that reacts to the events, which for most event-driven Azure systems means serverless functions. The reactive consumers in the reference design (the email function triggered by Event Grid, the inventory and payment handlers) are a natural fit for serverless because they run only when there is an event to process and scale with the event volume. The relationship runs both ways: serverless compute is the most common consumer of events, and events are the most common trigger for serverless compute, so serverless architecture with Azure Functions and event-driven architecture are two views of the same system from different angles. Designing the events well and designing the functions that consume them well are the same project, and the cold-start, state, and cost considerations of the serverless side shape how the consumers in an event-driven design behave under load.
What does capacity planning look like across the backbones?
Capacity planning differs by backbone because each meters throughput differently. Event Hubs sells throughput units or processing units that cap ingress and egress rates, and you size them to peak stream volume with headroom. Service Bus tiers govern message throughput and features, with the premium tier providing dedicated capacity and predictable performance. Event Grid scales its push throughput largely transparently, billed per operation.
The planning discipline is to find the dimension that constrains each backbone and size against the peak of that dimension, not the average. For Event Hubs the constraint is the throughput unit against ingress megabytes and events per second, and a stream that averages comfortably can still throttle during a spike that exceeds the provisioned units, so you size to the spike or enable auto-inflate to raise the ceiling automatically. For Service Bus the premium tier’s messaging units set the throughput, and the planning question is how many messages per second at peak and whether features like sessions and large messages push you toward premium regardless of raw volume. For Event Grid the cost scales with operations rather than provisioned capacity, so the planning is mostly about the operation count at scale and the dead-letter storage. Across all three, the recurring mistake is sizing to average load and being surprised by the spike, because event-driven systems exist partly to absorb spikes, and a backbone provisioned only for the average defeats the buffering that justified the asynchrony in the first place.
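Sizing to the peak rather than the average is simple arithmetic once the constraining dimension is named. The sketch below assumes the commonly documented standard-tier ingress allowance of roughly 1 MB per second or 1,000 events per second per throughput unit, whichever is reached first; treat those figures, the 25 percent headroom, and the function name `throughput_units_needed` as assumptions to check against current Azure documentation and your own tier.

```python
import math


def throughput_units_needed(
    peak_events_per_sec: float,
    avg_event_bytes: float,
    headroom: float = 1.25,          # assumed safety margin over the measured peak
    mb_per_unit: float = 1.0,        # assumed ingress MB/s per throughput unit
    events_per_unit: float = 1000.0, # assumed ingress events/s per throughput unit
) -> int:
    """Size against whichever ingress limit binds first at peak, then add headroom."""
    peak_mb_per_sec = peak_events_per_sec * avg_event_bytes / 1_000_000
    units_by_bytes = peak_mb_per_sec / mb_per_unit
    units_by_count = peak_events_per_sec / events_per_unit
    return math.ceil(max(units_by_bytes, units_by_count) * headroom)
```

For a stream of 400-byte events that averages 1,500 per second but spikes to 6,000, the average suggests 2 units while the spike needs 8; sizing to the average would throttle for the entire duration of the spike.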
How does security work across an event-driven system?
Security in an event-driven system follows the same identity-first principle as the rest of Azure: services authenticate to the backbones with managed identities rather than connection strings or shared keys wherever the backbone supports it, and authorization is scoped so a producer can publish to its topic and a consumer can read from its subscription, nothing more. The event payload is data in transit and at rest in the backbone, so it inherits the encryption the backbone provides and should never carry secrets or more personal data than the consumers need.
The subtle security concern unique to events is that the event becomes a new data-flow path that audit and governance must account for. A business event carrying customer data now travels through a backbone and into every subscriber, which expands the set of places that data lives and the set of identities that can read it. The discipline is to treat the event schema as a data-classification decision, keeping sensitive fields out of broadly fanned-out events and using the claim-check pattern, where the event carries a reference and the consumer fetches the sensitive payload from a secured store with its own authorization, when a reaction genuinely needs protected data. Least-privilege on the backbone, managed identities for every producer and consumer, and deliberate restraint about what goes into a widely broadcast event are the three habits that keep a decoupled system from quietly becoming a data-exposure surface.
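The claim-check habit can be sketched with an in-memory stand-in for the secured store. Everything named here is hypothetical: `SecureStore` models per-object authorization with a plain set of identity names, whereas a real deployment would rely on the store's own access control and managed identities.

```python
from typing import Callable


class SecureStore:
    """Hypothetical secured store: each object carries its own set of allowed readers."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def put(self, key: str, payload: dict, readers: set) -> None:
        self._blobs[key] = (payload, frozenset(readers))

    def get(self, key: str, identity: str) -> dict:
        payload, readers = self._blobs[key]
        if identity not in readers:
            raise PermissionError(f"{identity} may not read {key}")
        return payload


def publish_order_placed(
    store: SecureStore,
    publish: Callable[[dict], None],
    order_id: str,
    payment_details: dict,
) -> None:
    """Park the sensitive payload in the store and broadcast only a reference to it."""
    claim = f"orders/{order_id}/payment-details"
    store.put(claim, payment_details, readers={"payment-service"})
    # The widely fanned-out event carries no sensitive fields, only the claim.
    publish({"type": "OrderPlaced", "order_id": order_id, "claim_check": claim})
```

Every subscriber still learns that the order was placed, but only the identity that genuinely needs the protected data can redeem the claim for it.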
The recurring misdiagnoses, named
Patterns of failure repeat across event-driven systems with enough regularity that naming them is the fastest way to avoid them. Each one below is a diagnosis you will eventually make, either in advance by design or in retrospect by incident, and making it in advance is cheaper.
The assumed-exactly-once misdiagnosis is the deepest. A consumer is written as though each event arrives once, the system runs clean in testing, and then production redeliveries, inevitable under at-least-once, double-apply side effects and corrupt data. The diagnosis is that the consumer was never idempotent, and the fix is to make it so, keyed on the event identifier or expressed through naturally idempotent operations. This is what happens when the at-least-once-means-idempotent rule goes unobserved, and it accounts for more event-driven data-corruption incidents than any other single cause.
The one-backbone-for-everything misdiagnosis is the second. A team standardizes on the backbone they know and forces every event shape through it, writing checkpoint code for discrete notifications or pushing telemetry firehoses through a notification router. The diagnosis is a shape mismatch, and the fix is the decision table: match the backbone to the event shape and accept running more than one service because the three exist for genuinely different jobs. The discomfort of operating multiple backbones is smaller than the friction of fighting one backbone to do work it was not built for.
The ignored-dead-letters misdiagnosis is the third and the quietest. Events fail, dead-letter, and pile up unwatched, and the loss surfaces weeks later as a reconciliation gap or a customer complaint. The diagnosis is missing operational instrumentation, and the fix is to alert on dead-letter depth and build an inspection-and-replay path so that a dead-lettered event is a tracked, recoverable condition rather than a silent disappearance. A dead-letter queue without an alert is a defect the system is hiding from its operators.
The hidden-synchronous-chain misdiagnosis is the fourth. A system is declared event-driven because a backbone sits at the front, but the first consumer synchronously orchestrates everything behind it, preserving the coupling the events were meant to dissolve. The diagnosis is that the decoupling stopped at the first hop, and the fix is to publish events between the downstream steps too, pushing the asynchrony all the way through rather than letting it terminate at a single event-shaped door in front of a synchronous monolith.
The premature-event-everything misdiagnosis is the fifth, and the counterweight to the others. A small, tightly coupled system owned by one team is pulled apart with backbones between every step on the principle that events are better, and it inherits eventual consistency, idempotency work, and operational weight that buy nothing because the independence and variability that justify events were never present. The diagnosis is the pattern applied where it does not pay, and the fix is restraint: reach for events at the seams that need decoupling and keep the rest synchronous and simple.
Testing an asynchronous design before it reaches production
A system built on at-least-once delivery has to be tested for the conditions that only appear under load, and the cheapest place to find those conditions is a test harness rather than a live incident. The first test every consumer deserves is the redelivery test: hand the same message to the consumer twice and assert that the second pass changes nothing observable, no second charge, no duplicate record, no drifted count. If that assertion fails, the consumer is not idempotent, and no amount of clean behavior in a happy-path test will save it once the platform redelivers in production. This single test, run for every consumer, catches the most damaging defect class before it ships.
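The redelivery test can be sketched as a small reusable helper. This assumes the consumer is a function taking a mutable state dictionary and a message; `assert_redelivery_is_noop`, `charge_once`, and `charge_every_time` are hypothetical names invented for the illustration, and a real harness would compare whatever the consumer actually writes to (a database, an outbox, a ledger).

```python
import copy
from typing import Callable


def assert_redelivery_is_noop(handler: Callable[[dict, dict], None],
                              state: dict, message: dict) -> None:
    """Deliver `message` twice and fail if the second pass changes state."""
    handler(state, message)
    after_first = copy.deepcopy(state)
    handler(state, message)  # simulated redelivery
    if state != after_first:
        raise AssertionError(
            f"second delivery changed state: {after_first} -> {state}")


def charge_once(state: dict, message: dict) -> None:
    # Keyed on the message id, so a repeat overwrites with the same value.
    state.setdefault("charges", {})[message["id"]] = message["amount"]


def charge_every_time(state: dict, message: dict) -> None:
    # Appends on every delivery, so a repeat creates a second charge.
    state.setdefault("charges", []).append(message["amount"])


msg = {"id": "msg-42", "amount": 1999}

assert_redelivery_is_noop(charge_once, {}, msg)   # passes silently

try:
    assert_redelivery_is_noop(charge_every_time, {}, msg)
    caught_defect = False
except AssertionError:
    caught_defect = True   # the harness flags the non-idempotent consumer
```

Running the same helper against every consumer in the codebase is one way to make the redelivery test a routine gate rather than a one-off check.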
The second test is fault injection at the boundary. Kill a consumer mid-processing and confirm that the in-flight message is redelivered and handled correctly on restart, which validates that the acknowledgment and the side effect are committed together rather than leaving a gap where a crash loses work or double-applies it. Drop a dependency the consumer calls and confirm that transient faults retry with backoff while permanent faults dead-letter promptly with a clear reason. Send a malformed payload and confirm it dead-letters rather than wedging the line as a poison message redelivered forever.
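One way to make the transient-versus-permanent distinction testable is to classify faults by exception type. The sketch below assumes two hypothetical exception classes, `TransientError` and `PermanentError`, and an in-memory list standing in for a dead-letter destination. The attempt limit and delays are arbitrary illustration values, and the `sleep` parameter is injectable so a test can record delays instead of waiting.

```python
import json
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (timeouts, throttling)."""


class PermanentError(Exception):
    """Hypothetical marker for faults that will never succeed on retry."""


def process_with_retry(raw, handler, dead_letters,
                       max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Return True on success; otherwise record a dead letter and return False."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A malformed payload can never succeed, so it is set aside at once.
        dead_letters.append({"body": raw, "reason": f"malformed payload: {exc.msg}"})
        return False

    for attempt in range(1, max_attempts + 1):
        try:
            handler(payload)
            return True
        except PermanentError as exc:
            dead_letters.append({"body": raw, "reason": f"permanent: {exc}"})
            return False
        except TransientError as exc:
            if attempt == max_attempts:
                dead_letters.append({"body": raw, "reason": f"retries exhausted: {exc}"})
                return False
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return False


calls = {"n": 0}


def flaky(payload):
    # Fails twice with a transient fault, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("dependency timeout")


def rejects(payload):
    raise PermanentError("unknown customer")


delays, dlq = [], []
ok = process_with_retry('{"id": 1}', flaky, dlq, sleep=delays.append)
bad = process_with_retry('{not json', flaky, dlq, sleep=delays.append)
perm = process_with_retry('{"id": 2}', rejects, dlq, sleep=delays.append)
```

The flaky dependency is retried with growing delays and eventually succeeds, while the malformed payload and the permanent fault each land in the dead-letter list with a reason attached.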
The third test exercises the operational signals themselves. Force a backlog by pausing a consumer while the producer keeps publishing, and confirm that queue-depth and message-age alerts fire as designed, because an alert that has never been triggered in a test is an alert you are trusting blind. Force a dead-letter by feeding an unprocessable message and confirm the depth alert fires and the inspection path surfaces the failure reason. These tests prove that when the system degrades in production, it will tell you, which is the difference between a quiet failure discovered weeks later and a paged engineer who can act while the backlog is still small.
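The alert logic itself can be exercised without any cloud resources. The sketch below assumes a hypothetical `BacklogSnapshot` of three signals and arbitrary thresholds; in a real deployment these values would come from the backbone's metrics and the thresholds from the team's own service-level targets.

```python
from dataclasses import dataclass


@dataclass
class BacklogSnapshot:
    queue_depth: int
    oldest_message_age_s: float
    dead_letter_depth: int


def snapshot(queue, dead_letters, now):
    # Age of the oldest message still waiting, or zero when the queue is empty.
    oldest = now - queue[0]["enqueued_at"] if queue else 0.0
    return BacklogSnapshot(len(queue), oldest, len(dead_letters))


def firing_alerts(snap, max_depth=500, max_age_s=120.0, max_dead_letters=0):
    """Return the names of the alerts that should fire for this snapshot."""
    alerts = []
    if snap.queue_depth > max_depth:
        alerts.append("queue-depth")
    if snap.oldest_message_age_s > max_age_s:
        alerts.append("message-age")
    if snap.dead_letter_depth > max_dead_letters:
        alerts.append("dead-letter-depth")
    return alerts


# Producer published one message per second for ten minutes; consumer was paused.
paused_queue = [{"id": i, "enqueued_at": float(i)} for i in range(600)]

healthy = firing_alerts(snapshot([], [], now=600.0))
backlog = firing_alerts(snapshot(paused_queue, [], now=600.0))
poisoned = firing_alerts(snapshot([], [{"id": "bad"}], now=600.0))
```

The healthy snapshot fires nothing, the paused consumer trips both depth and age, and a single dead letter trips the dead-letter alert, which is the behavior the forced-backlog and forced-dead-letter tests are meant to confirm.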
Local development benefits from emulators where they exist and from disposable cloud resources where they do not, so that a developer can publish a test message, force a redelivery, and watch a dead-letter land without touching a shared environment. The discipline this builds, treating redelivery, partial failure, and backlog as ordinary conditions to assert against rather than rare surprises, is exactly the mindset the at-least-once-means-idempotent rule demands, and a team that tests this way ships consumers that survive the load that breaks the consumers nobody tested under failure.
Closing verdict
Event-driven architecture on Azure is the right design when a system has real independence between its parts, spiky or uneven load, a growing set of reactions to its events, and teams that need to ship without coordinating through a shared call graph. In those conditions it delivers what the synchronous chain cannot: isolation of failures, independent scaling, and extension without touching the producer. The platform gives you three backbones to realize it, and the central skill is matching each interaction to the right one by the shape of its event, Event Grid for discrete reactive notifications, Event Hubs for high-throughput ordered streams, Service Bus for reliable ordered enterprise messaging, with the decision table as the artifact that makes that match by fit rather than by habit.
The deciding discipline, the one that determines whether the system heals or corrupts under the load it was built for, is the at-least-once-means-idempotent rule. Azure delivers at-least-once, so every consumer must be idempotent, without exception, designed in from the first line rather than retrofitted after the first duplicate-processing incident. Around that rule sit the supporting disciplines: dead-letter every subscription and alert on its depth, version event schemas as the published contracts they are, propagate correlation identifiers so the system stays observable, scope ordering to the narrowest aggregate that needs it, and choose between choreography and orchestration by the complexity of the workflow rather than by fashion. Get the rule and these disciplines right and event-driven architecture is a durable foundation that grows gracefully; skip them and it is a distributed system that fails in ways the synchronous design never could. The pattern is not better or worse than request-driven design in the abstract; it is the correct tool for a specific and common set of conditions, and applied to those conditions with the idempotency discipline intact, it is how serious systems on Azure stay decoupled, scalable, and resilient as they grow.
Frequently asked questions
What is event-driven architecture on Azure?
Event-driven architecture on Azure is a design where services communicate by producing and reacting to events rather than calling each other directly. A producer records that something happened, such as an order being placed, and publishes that fact to a messaging backbone. Any number of consumers subscribe and react on their own schedule, with the producer unaware of who they are. Azure realizes the pattern with three services: Event Grid for discrete reactive notifications, Event Hubs for high-throughput streams, and Service Bus for reliable ordered messaging. The benefit is decoupling, so services can fail, scale, and deploy independently, and new reactions can be added without changing the producer. The cost is a set of distributed-systems concerns, chiefly duplicate delivery, eventual consistency, and harder observability, that the design must handle deliberately rather than discover under load.
When should I use Event Grid, Event Hubs, or Service Bus?
Match the service to the shape of your event. Use Event Grid when a discrete fact needs to trigger one or more handlers quickly, such as a blob upload firing a function, with no ordering or high-volume requirement. Use Event Hubs when you are ingesting a continuous high-throughput stream, such as telemetry, clickstream, or logs, that consumers read through in order within a partition and that multiple consumer groups replay independently. Use Service Bus when the message represents work to be done that needs reliable delivery, per-message dead-lettering, sessions for ordering, or transactions, which is the enterprise-messaging case. The deciding signal is the event shape: a notification, a stream, or a command. If you cannot name which of those three you have, you are not ready to choose, and the most common mistake is picking by familiarity rather than by fit, which forces one service to do work it was never built for.
How do publish-subscribe patterns work on Azure?
Publish-subscribe is a model where a producer publishes to a topic and the backbone delivers an independent copy to every subscriber. Each subscriber processes at its own pace and tracks its own progress, so a slow subscriber never blocks a fast one and neither blocks the producer. On Service Bus you create a topic and add a subscription per consumer, with each subscription getting its own copy and its own dead-letter queue. On Event Grid you create a topic and add event subscriptions that push to functions or webhooks. The defining property is subscriber isolation, which is what makes fan-out, one event reaching many independent reactors, safe and clean. This contrasts with a point-to-point queue, where a single message is consumed by exactly one reader. You reach for publish-subscribe when several distinct services each need every event, and for a queue when interchangeable workers share one stream of work.
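Subscriber isolation is easy to see in a toy model. The `Topic` class below is an in-memory stand-in written for this illustration, not an Azure client, and the subscription names are hypothetical.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a publish-subscribe topic (not an Azure client)."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        # Each subscription owns a private backlog.
        self.subscriptions[name] = deque()
        return self.subscriptions[name]

    def publish(self, event):
        # Every subscription receives its own copy of the event.
        for backlog in self.subscriptions.values():
            backlog.append(dict(event))


topic = Topic()
billing = topic.subscribe("billing")
shipping = topic.subscribe("shipping")

topic.publish({"type": "OrderPlaced", "order_id": "o-1"})
topic.publish({"type": "OrderPlaced", "order_id": "o-2"})

# Billing drains its backlog; shipping is slow and has read nothing yet.
billing_seen = [billing.popleft()["order_id"] for _ in range(len(billing))]

# Point-to-point queue for contrast: each message goes to exactly one worker.
work_queue = deque(["job-1", "job-2"])
worker_a_job = work_queue.popleft()
worker_b_job = work_queue.popleft()
```

Billing finishing early leaves shipping's two copies untouched, while the two queue workers split the jobs between them instead of each seeing both.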
Why must event consumers be idempotent?
Consumers must be idempotent because Azure delivers events at-least-once, meaning every event is delivered one or more times, with no guarantee that it arrives exactly once. A redelivery happens whenever an acknowledgment is lost, a consumer restarts mid-processing, or a backbone fails over, and these are routine, not rare. An idempotent consumer that processes the same event twice produces the same result as processing it once, so a redelivery is a safe no-op rather than a double charge, a duplicate email, or a drifted count. The practical implementations are a deduplication store keyed on the event identifier and committed transactionally with the side effect, or an operation expressed as a naturally idempotent set or upsert rather than an increment or insert. Skipping idempotency is the single most common and most damaging mistake in event-driven systems, because the resulting duplicate-processing bug corrupts data and surfaces only under the production load that finally triggers a redelivery.
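A deduplication store committed in the same transaction as the side effect might look like the following sketch. It uses an in-memory SQLite database purely so the example is self-contained; the table names, the `handle_payment_event` function, and the event fields are assumptions made for the illustration.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
    conn.commit()
    return conn


def handle_payment_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the charge once; return False if the event was already processed."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            # The primary key rejects an event id that was seen before.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (event["order_id"], event["amount_cents"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: nothing was written


conn = open_store()
event = {"id": "evt-7", "order_id": "o-1", "amount_cents": 2500}

first = handle_payment_event(conn, event)
second = handle_payment_event(conn, event)   # simulated redelivery
charge_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Because the deduplication row and the charge share one transaction, a crash between the two cannot leave a recorded event without its charge or a charge without its record.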
Why does Azure deliver at-least-once instead of exactly-once?
Azure delivers at-least-once because exactly-once delivery across a network is, in general, impossible to guarantee. The unavoidable gap is between delivering a message and recording that delivery succeeded. If a consumer processes an event but its acknowledgment is lost to a network fault, the backbone cannot tell success from failure and must redeliver, because the alternative would lose events whenever an acknowledgment failed to send. Faced with delivering twice or occasionally losing an event, every serious messaging system chooses to deliver twice, since a duplicate is recoverable and a lost event is not. Systems that appear to offer exactly-once achieve it by pairing at-least-once delivery with idempotent processing on the consumer side, which is exactly the discipline this guide insists on. At-least-once is therefore not a limitation to work around but the correct semantic, and it places the responsibility for handling duplicates on the consumer, the only place that responsibility can correctly live.
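The lost-acknowledgment gap can be reproduced with a toy broker. Everything here is hypothetical: the `Broker` class and its `lost` flag exist only to simulate an acknowledgment that never arrives, and do not model any specific Azure service.

```python
class Broker:
    """Toy broker: a message stays pending until an acknowledgment arrives."""

    def __init__(self, messages):
        self.pending = list(messages)

    def deliver(self):
        # Hand out the oldest pending message, or None when nothing is left.
        return self.pending[0] if self.pending else None

    def ack(self, message, lost=False):
        # A lost acknowledgment never reaches the broker, so nothing changes.
        if not lost:
            self.pending.remove(message)


broker = Broker(["evt-1"])
deliveries = []   # every delivery the consumer received
applied = set()   # the consumer's idempotency record

# First delivery: the consumer succeeds but its acknowledgment is lost.
msg = broker.deliver()
deliveries.append(msg)
applied.add(msg)
broker.ack(msg, lost=True)

# The broker cannot tell success from failure, so it delivers again.
msg = broker.deliver()
deliveries.append(msg)
was_duplicate = msg in applied   # the consumer detects the repeat
applied.add(msg)
broker.ack(msg)
```

The consumer sees the event twice, recognizes the second delivery from its own record, and the broker's backlog empties only once an acknowledgment actually gets through.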
How do I handle failed events and dead-lettering?
Dead-lettering sets aside events that cannot be processed after a configured number of attempts so they can be inspected rather than retried forever or lost silently. Service Bus provides a dead-letter queue automatically on every queue and subscription; after the maximum delivery count, the message moves there with metadata explaining why. Event Grid supports a storage-account dead-letter destination for events it fails to deliver after its retry schedule. Event Hubs, being a stream, leaves the decision to consumer code, which typically routes unprocessable records to a separate store. The operational discipline that matters most is alerting on dead-letter queue depth, because in a healthy system that depth should be near zero, and any sustained growth is a defect the system is reporting against itself. Pair the alert with an inspection path that exposes the failure reason and a replay path for events that failed transiently, which turns a class of permanent losses into recoverable conditions.
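The delivery-count mechanism can be sketched in a few lines. This is an in-memory imitation of the behavior described above, not Service Bus code; `MAX_DELIVERY_COUNT` is set to an arbitrary value for the illustration and the message shape is hypothetical.

```python
from collections import deque

MAX_DELIVERY_COUNT = 3  # arbitrary value chosen for this sketch


def drain(queue, handler, dead_letters):
    """Process until the queue is empty, dead-lettering repeat failures."""
    while queue:
        message = queue.popleft()
        message["delivery_count"] += 1
        try:
            handler(message["body"])
        except Exception as exc:
            if message["delivery_count"] >= MAX_DELIVERY_COUNT:
                # Set the message aside with the reason it failed.
                dead_letters.append(
                    {**message, "reason": f"{type(exc).__name__}: {exc}"})
            else:
                queue.append(message)  # make it available for another attempt


handled = []


def handler(body):
    if body == "poison":
        raise ValueError("cannot parse")
    handled.append(body)


queue = deque([
    {"body": "ok", "delivery_count": 0},
    {"body": "poison", "delivery_count": 0},
])
dead_letters = []
drain(queue, handler, dead_letters)
```

The poison message is attempted three times and then parked with its failure reason, so the loop terminates instead of retrying forever, and the dead-letter list is the thing an alert would watch.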
Should I use choreography or orchestration for my workflow?
Choreograph the simple and orchestrate the complex. In choreography, services react to events and emit their own, and the workflow is emergent with no central controller, which keeps services loosely coupled and lets teams ship independently. It suits short, stable reaction chains. Its weakness is that a long workflow becomes invisible, living nowhere as a whole, so understanding it means tracing events across many services. In orchestration, one component, such as a Durable Functions orchestration or a Logic App, holds the workflow definition and directs each step, which makes the flow visible, debuggable, and owned. It suits long, evolving, business-critical processes with conditional branches and compensation logic. A mature system often blends the two: it orchestrates the complex multi-step workflows inside a bounded context and choreographs the loosely coupled business events that cross context boundaries, getting visibility where the complexity is and decoupling where the boundaries are.
What is eventual consistency and how do I design for it?
Eventual consistency is the property that, because reactions happen asynchronously, there is a window after an event is published during which some services have reacted and others have not, so the system is briefly in an inconsistent intermediate state before all reactions complete. For most flows this window is short, often milliseconds, but it is real, it widens whenever a consumer falls behind or retries, and any code that reads across services must tolerate it. Design for it by showing pending states in the user interface, by reading from the service that owns the data rather than from a projection that may lag, and by setting expectations that some effects are not instantaneous. The classic failure is treating an eventually consistent system as if it were synchronous, which produces confusing edge behavior such as a user refreshing immediately after an action and seeing stale data. Eventual consistency is a property to design for, not a bug to eliminate, and pretending it is absent is how systems develop their most baffling intermittent issues.
How do I keep event schemas from breaking consumers?
Treat the event schema as a published interface with the same care you would give a public API, because producers and consumers deploy independently and you cannot update them in lockstep. Evolve schemas additively: add optional fields, never remove or repurpose existing ones, and never silently change the meaning of a field consumers already read. When a breaking change is unavoidable, version the event and publish both the old and new shapes during a transition window so consumers migrate on their own schedule. A schema registry or shared contract definition makes the interface explicit and lets both sides validate against it. The failure this prevents is silent: a producer ships a schema change, a consumer on another team starts throwing on every event, those events dead-letter, and nobody connects the two deployments until the dead-letter alert fires. Schema governance is the unglamorous discipline that stops a decoupled system from re-coupling through the back door of its data shapes.
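On the consumer side, these rules translate into a tolerant reader. The sketch below assumes a hypothetical `OrderPlaced` event with a `schema_version` field, an optional `gift_note` added additively, and a version 2 that restructures the total. All field names and the default currency are invented for the illustration.

```python
def read_order_placed(event: dict) -> dict:
    """Tolerant reader: accepts v1 and v2 shapes and ignores unknown fields."""
    version = event.get("schema_version", 1)  # events without the field are v1
    if version == 1:
        total_cents = event["total_cents"]
        currency = "USD"  # assumed default for events that predate the field
    elif version == 2:
        # Breaking change published as a new version during a transition window.
        total_cents = event["total"]["amount_cents"]
        currency = event["total"]["currency"]
    else:
        raise ValueError(f"unsupported schema_version {version}")
    return {
        "order_id": event["order_id"],
        "total_cents": total_cents,
        "currency": currency,
        "gift_note": event.get("gift_note"),  # optional, added additively
    }


v1 = {"order_id": "o-1", "total_cents": 2500}
v1_with_extras = {"order_id": "o-2", "total_cents": 900,
                  "gift_note": "Enjoy", "loyalty_tier": "gold"}
v2 = {"schema_version": 2, "order_id": "o-3",
      "total": {"amount_cents": 4200, "currency": "EUR"}}

parsed = [read_order_placed(e) for e in (v1, v1_with_extras, v2)]
```

The reader picks out only the fields it knows, so a producer adding `loyalty_tier` does not break it, and an unrecognized version fails loudly with a clear reason rather than being misread.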
How do I preserve ordering in an event-driven system?
Preserve ordering only where you genuinely need it, and scope it as narrowly as possible, because every ordering guarantee is a serialization point that limits parallelism. Event Grid gives no ordering guarantee. Event Hubs preserves order within a partition, so partition on the aggregate key, such as customer or account, to keep that aggregate’s events ordered while different aggregates process in parallel. Service Bus preserves order within a session, so key the session on the aggregate to get the same effect, one consumer handling one aggregate’s messages in sequence while the consumer pool processes many aggregates at once. You almost never need global ordering across all events; you need the events for one aggregate ordered relative to each other. Often the apparent need for ordering dissolves entirely when consumers are idempotent and operations are expressed as commutative sets rather than ordered sequences, at which point arrival order stops mattering and you regain full parallelism.
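The effect of keying on the aggregate can be shown with a simple hash-based router. The hashing scheme below is an assumption chosen for the illustration and is not the algorithm Event Hubs uses internally; the only property that matters here is that the same key always maps to the same partition.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition deterministically (illustrative hash only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


PARTITIONS = 4
events = [
    ("cust-a", "Opened"),
    ("cust-b", "Opened"),
    ("cust-a", "Funded"),
    ("cust-b", "Funded"),
    ("cust-a", "Closed"),
]

# Route each event to the partition chosen by its aggregate key.
partitions = {i: [] for i in range(PARTITIONS)}
for customer, change in events:
    partitions[partition_for(customer, PARTITIONS)].append((customer, change))


def history(customer: str) -> list:
    # One aggregate's events, read in order from its single partition.
    lane = partitions[partition_for(customer, PARTITIONS)]
    return [change for owner, change in lane if owner == customer]
```

Each customer's events sit in one partition in publish order, so `history("cust-a")` reads back the same sequence that was published, regardless of which partition the other customer landed in.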
What is the saga pattern and when do I need it?
A saga maintains consistency across a multi-step distributed workflow without a single distributed transaction spanning multiple services and data stores. It is a sequence of local transactions, each with a defined compensating action that undoes its effect. When a later step fails, the saga runs the compensations for the steps that already succeeded, in reverse, returning the system to a consistent state. For example, if payment fails after inventory was reserved, a compensating action releases the reservation. Orchestrated sagas place the compensation logic in a central orchestrator that knows what has run; choreographed sagas distribute it, with each service listening for the failure events that mean it should compensate. You need a saga whenever a business process spans multiple services that each own their own data and the process must stay consistent despite partial failure. The compensations are ordinary events under at-least-once delivery, so they too must be idempotent, and the system must tolerate the temporary inconsistency while a saga unwinds.
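An orchestrated saga reduces to a loop that remembers what succeeded and unwinds it in reverse on failure. The `run_saga` function and the three steps below are hypothetical names for the illustration, with local functions standing in for calls to separate services.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # Undo the steps that already succeeded, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


state = {"order": None, "reserved": 0}


def create_order():
    state["order"] = "o-1"


def cancel_order():
    state["order"] = None      # a set, so running it twice is harmless


def reserve_inventory():
    state["reserved"] = 2


def release_inventory():
    state["reserved"] = 0      # a set, so running it twice is harmless


def charge_payment():
    raise RuntimeError("card declined")


def refund_payment():
    pass


ok, log = run_saga([
    ("create_order", create_order, cancel_order),
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, refund_payment),
])
```

When the payment step fails, the reservation is released first and the order cancelled second, and because both compensations assign a value rather than adjust one, a redelivered compensation leaves the state unchanged.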
How do I make an event-driven system observable?
Make it observable with correlation and event-specific signals, because a single business operation no longer lives in one call stack but is spread across many services reacting at different times. Propagate a correlation identifier from the originating action through every event it triggers, so the logs and traces of all services that touched one operation stitch into a single timeline, and use distributed tracing carried in the event metadata to see the whole flow as one trace. Beyond per-operation tracing, monitor signals unique to asynchronous systems: backbone queue depth and the age of the oldest unprocessed message, which reveal a consumer falling behind; dead-letter queue depth, which reveals failing events; and consumer lag against an Event Hubs stream, which reveals a stream consumer that cannot keep up. These signals have no equivalent in a synchronous design and are exactly what tells you the health of an asynchronous one, so investing in them is the instrumentation without which the system is a black box when it stops working.
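Correlation propagation comes down to copying one identifier from the triggering event into every event it causes. The envelope fields below (`correlation_id`, `causation_id`) and the `new_event` helper are assumptions made for the sketch rather than a prescribed Azure schema.

```python
import uuid


def new_event(event_type, data, caused_by=None):
    """Build an event that inherits the correlation id of its cause."""
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        # Root events start a new correlation; reactions reuse the existing one.
        "correlation_id": caused_by["correlation_id"] if caused_by else str(uuid.uuid4()),
        "causation_id": caused_by["id"] if caused_by else None,
        "data": data,
    }


log_lines = []


def log(service, event):
    # Every service writes the correlation id into its structured log line.
    log_lines.append({"service": service, "event": event["type"],
                      "correlation_id": event["correlation_id"]})


placed = new_event("OrderPlaced", {"order_id": "o-1"})
log("orders", placed)
reserved = new_event("InventoryReserved", {"order_id": "o-1"}, caused_by=placed)
log("inventory", reserved)
shipped = new_event("ShipmentCreated", {"order_id": "o-1"}, caused_by=reserved)
log("shipping", shipped)

unrelated = new_event("OrderPlaced", {"order_id": "o-2"})
log("orders", unrelated)

# Stitch one operation's timeline out of the interleaved logs.
timeline = [line["service"] for line in log_lines
            if line["correlation_id"] == placed["correlation_id"]]
```

Filtering the combined log on a single correlation identifier yields the orders, inventory, shipping sequence for that one operation, with the unrelated order excluded.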
When is event-driven architecture overkill?
It is overkill when the system is small and tightly coupled, owned by one team that deploys everything together, with a simple linear flow that is already fast and reliable enough. In that case the asynchrony buys nothing and you pay the full price: more infrastructure to run, eventual consistency to reason about, idempotency to implement, and distributed tracing to build, plus a workflow that no longer reads top to bottom. The value of events is proportional to the independence and variability in the system, so where those are low, a synchronous design is the right call rather than a failure of ambition. A useful middle path is to stay mostly synchronous and introduce a backbone only at the specific seams that need it, such as a slow third-party call or an audit trail that must never block the request, keeping the rest simple. Reaching for events surgically where they pay is more mature than converting everything on principle.
How do I migrate an existing synchronous system to events?
Migrate incrementally, never in one rewrite. Find the single seam where decoupling pays most, usually a slow or unreliable downstream step that should not block the main flow, and place a backbone there first: the producer publishes an event instead of calling that step synchronously, a consumer reacts, and you measure latency and failure rate before and after. That one seam proves the pattern in your context and builds the operational muscle, idempotency, dead-letter monitoring, tracing, before you depend on it broadly. From there the system grows by adding consumers to existing events and introducing new events at new seams, each addition low-risk because the producer is ignorant of its consumers. As the count of events and consumers rises, the disciplines that were optional become mandatory: schema governance, correlation identifiers, dead-letter alerts, and naming conventions. Adding these deliberately as you grow is far easier than retrofitting them after the system has sprawled into something nobody fully understands.
What is the difference between an event and a command?
An event is an immutable record that something happened in the past, named in the past tense, such as OrderPlaced or PaymentSettled. It is a broadcast of a fact, and any number of consumers, including zero, may react without the producer knowing or caring who they are. A command is a request directed at a specific handler that is expected to act, named in the imperative, such as PlaceOrder or ChargePayment, and it carries an implicit assumption that exactly one consumer will process it. The distinction is not pedantry: it shapes the design. Events suit publish-subscribe and fan-out, where the producer is decoupled from an open set of reactors, and they suit Event Grid and Event Hubs naturally. Commands suit point-to-point delivery to a known handler and the reliable, ordered, transactional semantics of Service Bus queues. Confusing the two, treating a command as a broadcast or an event as a directed request, is a frequent source of architectural muddle that naming discipline alone prevents.
Do I need a service mesh or API gateway in an event-driven system?
Usually not for the event flows themselves, because the messaging backbone is the communication fabric and it already provides the routing, buffering, and delivery guarantees that a mesh or gateway would handle for synchronous calls. An API gateway still has a place at the edge, fronting the synchronous APIs that clients call to issue commands or read data, but the event traffic between services flows through Event Grid, Event Hubs, or Service Bus rather than through a gateway. A service mesh addresses synchronous service-to-service calls with mTLS, retries, and traffic control, so its value is proportional to how much synchronous inter-service calling remains, which an event-driven design deliberately minimizes. The mistake is adding a mesh or gateway to manage traffic that is actually flowing asynchronously through a backbone, where it adds operational weight without addressing the real communication path. Secure and observe the backbone itself, with managed identities and dead-letter and lag monitoring, rather than layering synchronous-era infrastructure over asynchronous flows.
How does serverless compute fit with event-driven architecture?
Serverless compute and event-driven architecture are two views of the same system. Azure Functions are the most common consumer of events because they run only when there is an event to process and scale with the event volume, so a function triggered by Event Grid, Event Hubs, or a Service Bus queue is the natural reactive consumer. The relationship is symmetric: events are the most common trigger for serverless functions, and functions are the most common reactors to events, so designing the events and designing the functions that consume them is the same project. The serverless side brings its own considerations that shape consumer behavior, chiefly cold starts that affect first-response latency, the statelessness that forces externalized state, and the consumption cost model that rewards spiky event-driven workloads and penalizes steady high-volume ones. Where the event volume is steady and high, a plan-based hosting model may beat consumption, and Durable Functions add orchestration for the multi-step workflows that outgrow simple choreography.
What payload should an event carry?
Carry the minimum that consumers need to react, plus a stable identifier they can use to fetch more if required. A lean event, the identifier of the entity that changed, the type of change, a timestamp, and a few fields most consumers need, keeps the event small, reduces coupling to the producer’s internal data shape, and limits how much sensitive data spreads to every subscriber. When some consumers need a large or sensitive payload, use the claim-check pattern: the event carries a reference, and the consumer fetches the full payload from a secured store with its own authorization, which keeps the broadcast event light and the protected data behind access control. Avoid the temptation to stuff the entire entity into every event, because that couples consumers to the producer’s full schema and spreads data widely. The balance is enough in the event for common reactions to proceed without a callback, and a reference for the uncommon reactions that need more, with sensitive fields kept out of widely fanned-out events deliberately.
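The claim-check pattern can be sketched with a lean event and a guarded store. `PayloadStore` is an in-memory stand-in for a secured blob store, its `allowed` list is a hypothetical substitute for real authorization, and the event fields are illustrative.

```python
import uuid
from datetime import datetime, timezone


class PayloadStore:
    """Stand-in for a secured store; `allowed` is a hypothetical access list."""

    def __init__(self, allowed):
        self._blobs = {}
        self._allowed = set(allowed)

    def put(self, payload) -> str:
        ref = f"payloads/{uuid.uuid4()}"
        self._blobs[ref] = payload
        return ref

    def get(self, ref, caller):
        # The store enforces access itself, independently of the broadcast.
        if caller not in self._allowed:
            raise PermissionError(f"{caller} may not read {ref}")
        return self._blobs[ref]


store = PayloadStore(allowed={"billing"})
full_order = {"order_id": "o-1", "card_last4": "4242",
              "lines": [{"sku": "widget", "qty": 2}]}

# Lean event: identifier, type, timestamp, and a reference to the full payload.
event = {
    "id": str(uuid.uuid4()),
    "type": "OrderPlaced",
    "order_id": full_order["order_id"],
    "occurred_at": datetime.now(timezone.utc).isoformat(),
    "payload_ref": store.put(full_order),
}

billing_view = store.get(event["payload_ref"], caller="billing")

try:
    store.get(event["payload_ref"], caller="analytics")
    analytics_blocked = False
except PermissionError:
    analytics_blocked = True
```

Every subscriber receives the same small event, but only the caller the store authorizes can exchange the reference for the sensitive fields.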