Overview
Orchestration uses a central “conductor” to call tasks, enforce order, and track state. Choreography lets services react to events independently, with no central coordinator. Use orchestration when steps, deadlines, and approvals matter; use choreography for loosely coupled signals that need minimal coordination. Model the human and system steps in BPMN 2.0, decisions in DMN, and case work in CMMN.
When to orchestrate (vs. choreography)
Pick orchestration when…
- There are approvals, SLAs, deadlines, or human reviews
- Steps must run in strict order with compensation rules
- Executives need a single place to see status and evidence
Pick choreography when…
- Independent services react to events with weak ordering
- Loose coupling matters more than central control
- Transient failures can be handled locally
Hybrid
Central orchestration for the critical path; event choreography for peripheral updates (notifications, analytics).
Core components
Orchestrator
- Workflow engine/state machine; correlation IDs and timers
- Compensations, retries, timeouts, and escalation hooks
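As a rough illustration of the orchestrator bullets above, here is a minimal in-memory state machine that tags each workflow instance with a correlation ID and rejects out-of-order transitions. The state names, the TRANSITIONS table, and the WorkflowInstance class are hypothetical and chosen only for this sketch; a real workflow engine would also persist state and drive timers.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical states; each maps to the states it may move to next.
TRANSITIONS: Dict[str, Set[str]] = {
    "received": {"validated", "failed"},
    "validated": {"approved", "rejected", "failed"},
    "approved": {"completed", "failed"},
    "rejected": set(),
    "completed": set(),
    "failed": set(),
}

@dataclass
class WorkflowInstance:
    # The correlation ID ties together every task, message, and log line.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "received"
    history: List[str] = field(default_factory=lambda: ["received"])

    def advance(self, new_state: str) -> None:
        """Move to new_state only if the transition table allows it."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

wf = WorkflowInstance()
wf.advance("validated")
wf.advance("approved")
wf.advance("completed")
print(wf.correlation_id, wf.history)
```

Keeping the allowed transitions in one table is what gives the orchestrator its enforced ordering: a worker cannot mark a request completed before it has been approved.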
Workers
- Stateless, idempotent handlers; safe to retry
- Access via allowlisted tools/APIs; role-scoped credentials
Queues & storage
- At-least-once delivery (AMQP, Kafka) with dead-letter queues (DLQs)
- Idempotency keys; effectively-once processing by design (at-least-once delivery plus deduplication)
- Audit-grade state store; immutable logs
Patterns (saga, retries, idempotency)
Saga & compensation
- Split long transactions; define compensating steps for partial failure
- Prefer logical undo over database-level XA
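The saga idea above can be sketched as a loop that runs steps in order and, on the first failure, calls the compensations of the already-completed steps in reverse. This is a simplified in-process sketch; the step names and the run_saga helper are assumptions for illustration, and a production saga would persist progress so it can resume after a crash.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all three are hypothetical placeholders.
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            # Logical undo: newest completed step is compensated first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log

def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")

ok, log = run_saga([
    ("reserve_stock", lambda: None, lambda: None),
    ("charge_card", lambda: None, lambda: None),
    ("book_shipping", fail_shipping, lambda: None),
])
print(ok, log)
```

Compensations here are business-level actions (release stock, refund the card) rather than database rollbacks, which is the sense of "logical undo" in the list above.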
Retries & backoff
- Exponential backoff + jitter; cap retries, then send exhausted messages to the DLQ
- Timeouts and circuit breakers for unstable dependencies
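One possible shape for the retry policy above, assuming a "full jitter" variant in which each delay is drawn uniformly between zero and a capped exponential bound. The function names, the default base and cap values, and the plain list standing in for a DLQ are all illustrative assumptions; circuit breakers are not shown.

```python
import random
import time
from typing import Any, Callable, List

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def process_with_retries(
    message: Any,
    handler: Callable[[Any], None],
    dead_letters: List[Any],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call handler once plus up to max_retries retries, then dead-letter."""
    for attempt in range(max_retries + 1):
        try:
            handler(message)
            return True
        except Exception:
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    # Retries exhausted: park the message for inspection instead of looping.
    dead_letters.append(message)
    return False

def flaky(message: Any) -> None:
    raise ConnectionError("dependency unavailable")

dlq: List[Any] = []
# A no-op sleep keeps this demo instant; real callers would use time.sleep.
delivered = process_with_retries("msg-1", flaky, dlq, sleep=lambda s: None)
print(delivered, dlq)
```

The jitter spreads retries from many workers over time, so a recovering dependency is not hit by a synchronized wave of requests.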
Idempotency
- Use correlation/idempotency keys; ignore duplicate work
- Design handlers to be repeat-safe (HTTP semantics: RFC 9110)
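A minimal sketch of duplicate suppression with idempotency keys: the first result for a key is stored, and any redelivery with the same key returns the stored result without redoing the work. The IdempotentHandler class and the in-memory dictionary are assumptions for illustration; a real deployment would need a durable store with an atomic check-and-set.

```python
from typing import Any, Callable, Dict, List

class IdempotentHandler:
    """Wraps a handler so repeated calls with the same key run it only once."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # stand-in for a durable store

    def handle(self, idempotency_key: str, payload: Any) -> Any:
        if idempotency_key in self._results:
            # Duplicate delivery: return the earlier result, do no new work.
            return self._results[idempotency_key]
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result

charges: List[int] = []

def charge(payload: Dict[str, int]) -> str:
    charges.append(payload["amount"])
    return f"charged {payload['amount']}"

handler = IdempotentHandler(charge)
first = handler.handle("order-42", {"amount": 100})
second = handler.handle("order-42", {"amount": 100})  # simulated redelivery
print(first, second, charges)
```

With at-least-once delivery the duplicate call is expected, not exceptional, so the handler treats it as a normal path and the side effect happens once.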
References
- Saga pattern — microservices.io
- AMQP — OASIS
- Kafka — kafka.apache.org
- HTTP idempotency — RFC 9110
Routing, queues & SLAs
Work routing
- Queues by priority, skill, region; FIFO within class
- Assignment: round-robin, load, or skill-based
- Limit WIP to protect lead time (Little’s Law)
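The routing rules above can be sketched with a heap keyed on priority and arrival order, which gives FIFO within each priority class. The WorkQueue class and the item names are hypothetical. The second function applies Little's Law (average lead time = average WIP / average throughput), which holds for long-run averages of a stable system.

```python
import heapq
import itertools
from typing import List, Tuple

class WorkQueue:
    """Lower priority number is served first; FIFO within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # arrival order breaks priority ties

    def put(self, item: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), item))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)

def expected_lead_time(wip: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time (days) = WIP / throughput per day."""
    return wip / throughput_per_day

q = WorkQueue()
q.put("routine-1", priority=2)
q.put("urgent-1", priority=1)
q.put("routine-2", priority=2)
q.put("urgent-2", priority=1)
order = [q.get() for _ in range(len(q))]
print(order)
print(expected_lead_time(wip=30, throughput_per_day=10))
```

Read the other way, the same formula gives a WIP limit: at 10 items per day and a 3-day lead-time target, average WIP should stay at or below 30 items.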
SLAs
- Define per step; timers and escalations; visible aging
- Auto-reassign stuck work; notify owners
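A small sketch of a per-step SLA timer, assuming three aging states: on track, at risk (past a warning fraction of the SLA), and breached. The Task fields, the 0.8 warning ratio, and the status labels are illustrative assumptions; the escalation or reassignment action itself is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Task:
    task_id: str
    started_at: datetime
    sla: timedelta

def sla_status(task: Task, now: datetime, warn_ratio: float = 0.8) -> str:
    """Classify a task's age against its SLA window."""
    elapsed = now - task.started_at
    if elapsed >= task.sla:
        return "breached"   # escalate and consider auto-reassignment
    if elapsed >= task.sla * warn_ratio:
        return "at_risk"    # notify the owner before the deadline passes
    return "ok"

start = datetime(2024, 1, 1, 9, 0)
task = Task("review-17", started_at=start, sla=timedelta(hours=4))
for hours in (1.0, 3.5, 4.0):
    print(hours, sla_status(task, start + timedelta(hours=hours)))
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test and to replay against historical data.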
Metrics
- Lead time, queue time, throughput, first-pass yield
- Backlog aging; reassignments; breach counts
Human-in-the-loop (HITL) design & thresholds
Thresholds
- Confidence × impact grid: auto-approve, review, block
- Dual-control for high-risk steps (four-eyes)
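One way to encode a confidence × impact grid is a small lookup table of thresholds per impact level. The numbers below are placeholders, not recommendations, and the rule that high-impact items are never auto-approved is an assumption made for this sketch; real thresholds would come from measured error rates and risk appetite.

```python
from typing import Dict, Optional, Tuple

# impact -> (min confidence to auto-approve, min confidence to send to review)
# None means "never auto-approve". All values are illustrative assumptions.
THRESHOLDS: Dict[str, Tuple[Optional[float], float]] = {
    "low": (0.80, 0.50),
    "medium": (0.90, 0.60),
    "high": (None, 0.70),
}

def route_decision(confidence: float, impact: str) -> str:
    """Return 'auto_approve', 'review', or 'block' for one decision."""
    if impact not in THRESHOLDS:
        raise ValueError(f"unknown impact level: {impact}")
    auto_min, review_min = THRESHOLDS[impact]
    if auto_min is not None and confidence >= auto_min:
        return "auto_approve"
    if confidence >= review_min:
        return "review"
    return "block"

for confidence, impact in [(0.95, "low"), (0.95, "high"), (0.40, "medium")]:
    print(confidence, impact, route_decision(confidence, impact))
```

Keeping the grid as data rather than nested conditionals makes it easy to review, version, and adjust as sampling results come in.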
Reviewer UX
- Show sources, diffs, and suggested actions
- One-click edits; capture rationale; next-best steps
Workforce management
- Size queues and schedule shifts to meet SLA windows
- Sampling of “auto” decisions for quality
- Feedback loops to improve rules/models
Evidence, audit & controls
Observability & SLOs
Tracing & metrics
- Distributed traces across orchestrator, workers, and queues
- SLIs: success rate, latency, error types, retries, DLQ size
SLOs
- Targets for latency/success; error budgets to govern change pace
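The error-budget idea can be made concrete with a little arithmetic: a success-rate SLO of 99% over 10,000 requests allows about 100 failures, and changes slow down once that allowance is spent. The function name, the returned fields, and the "freeze when the budget is exhausted" rule are assumptions for this sketch.

```python
from typing import Dict, Union

def error_budget(slo_target: float, total: int, failed: int) -> Dict[str, Union[float, bool]]:
    """Compare observed failures with the failures the SLO allows."""
    allowed = (1.0 - slo_target) * total   # failures the SLO tolerates
    remaining = allowed - failed
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        "consumed_ratio": failed / allowed if allowed > 0 else float("inf"),
        "freeze_changes": remaining <= 0,  # hypothetical policy
    }

healthy = error_budget(slo_target=0.99, total=10_000, failed=40)
exhausted = error_budget(slo_target=0.99, total=10_000, failed=120)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in healthy.items()})
print({k: round(v, 3) if isinstance(v, float) else v for k, v in exhausted.items()})
```

The values are floats, so compare them with a tolerance rather than exact equality when writing checks around them.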
References
- OpenTelemetry — opentelemetry.io
- Google SRE: SLOs — sre.google
90-day starter
Days 0–30: Model & scope
- Draft BPMN models at Level 2/Level 3 detail; list approvals and SLAs
- Pick orchestrated vs. choreographed segments
- Define compensations and idempotency keys
Days 31–60: Build & guard
- Implement retries, backoff, DLQ; add correlation IDs
- Add HITL thresholds and reviewer UX
- Wire tracing; set SLOs; create runbooks
Days 61–90: Pilot & prove
- Pilot one corridor; track lead time, breach rate, rework
- Fix hotspots; publish deltas; plan scale-out
References
- OMG BPMN 2.0.2 — omg.org
- DMN / CMMN — omg.org
- Saga pattern — microservices.io
- AMQP / Kafka — OASIS · kafka.apache.org
- RFC 9110 (HTTP idempotency) — rfc-editor.org
- OpenTelemetry — opentelemetry.io
- Google SRE / SLOs — sre.google
- ISO/IEC 27001 · NIST SP 800-53 — iso.org · nist.gov
Coordinate the flow. Route the edge cases. Keep evidence tight.