Why Connect Sync Fails Quietly on Real Projects
Sitecore Connect sits between your PIM, DAM, commerce catalog, and the XM content tree. On paper it is event driven and idempotent. In production it fails in boring ways: duplicate items after a webhook retry, orphaned media after a partial asset push, SKU fields overwritten because two integrations claim the same Sitecore field, and dead letters piling up while dashboards still show green.
This post documents sync patterns we used on XM 10.4 and Connect 5.x deployments where Hub fed assets, a commerce middleware fed SKUs, and Sidekick owned marketing copy on the same item templates. The goal is reliable delivery, not perfect real time. You want every inbound event to land exactly once in Sitecore semantics, with a traceable external identity and a runbook when something stalls.
Webhook Ingestion and the Idempotency Contract
Connect receives webhooks from Hub, commerce platforms, or custom middleware. Treat every POST as a retry candidate. Hub and most SaaS sources will redeliver on timeout or 5xx. Your handler must not create a second item when the first attempt succeeded but the ACK never arrived.
Minimum handler shape
We standardize on a three step ingest pipeline inside the Connect flow or a thin custom API in front of it:
- Validate signature and schema version.
- Compute a deterministic idempotency key.
- Either skip as duplicate, enqueue for processing, or return 409 when state is ambiguous.
The idempotency key is not the webhook delivery ID alone, because some vendors issue a new delivery ID on every retry. We concatenate source system, entity type, external ID, event type, and payload version, then hash:
idempotencyKey = SHA256(source + "|" + entityType + "|" + externalId + "|" + eventType + "|" + payloadVersion)
Store processed keys in SQL or Redis with a TTL longer than the vendor retry window. We use 72 hours for Hub asset events and 7 days for commerce catalog deltas because some ERP batches replay weekly.
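The key and its TTL store can be sketched in a few lines of Python. This is a minimal illustration, not our production code: `idempotency_key`, `ProcessedKeyStore`, and the in-memory dictionary are hypothetical stand-ins for whatever SQL or Redis store you actually use.

```python
import hashlib
import time


def idempotency_key(source, entity_type, external_id, event_type, payload_version):
    """Hash the pipe-joined identity fields into a stable hex key."""
    raw = "|".join([source, entity_type, external_id, event_type, payload_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ProcessedKeyStore:
    """In-memory stand-in (assumption) for the SQL or Redis store of processed keys."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def seen(self, key):
        """Return True if the key was recorded and has not yet expired."""
        expiry = self._expires_at.get(key)
        if expiry is None:
            return False
        if expiry <= self._clock():
            del self._expires_at[key]  # expired: forget it and treat as new
            return False
        return True

    def record(self, key):
        """Remember the key until now + TTL."""
        self._expires_at[key] = self._clock() + self._ttl


HUB_ASSET_TTL = 72 * 60 * 60            # 72 hours for Hub asset events
COMMERCE_DELTA_TTL = 7 * 24 * 60 * 60   # 7 days for commerce catalog deltas
```

If any of the identity fields can contain a pipe character, escape it or length-prefix each field before joining, otherwise two different tuples could collapse to the same key.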
HTTP response rules
- 200 when the event is accepted and either applied or safely deduplicated.
- 202 when queued for async work; include a correlation ID in the body.
- 409 when the same key maps to a conflicting payload hash (see below).
- 422 for schema errors; do not retry automatically.
- 5xx only for transient infrastructure faults you want retried.
Returning 200 on duplicate is correct. Teams that return 409 on duplicate cause infinite retry loops from naive publishers.
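One way to keep these rules from drifting between flows is a single outcome-to-status table. The sketch below is illustrative only: the `IngestOutcome` names and the choice of 503 for transient faults are assumptions, not part of Connect.

```python
from enum import Enum


class IngestOutcome(Enum):
    APPLIED = "applied"
    DUPLICATE = "duplicate"
    QUEUED = "queued"
    CONFLICT = "conflict"
    SCHEMA_ERROR = "schema_error"
    TRANSIENT_FAULT = "transient_fault"


STATUS_BY_OUTCOME = {
    IngestOutcome.APPLIED: 200,
    IngestOutcome.DUPLICATE: 200,        # duplicates are acknowledged, never rejected
    IngestOutcome.QUEUED: 202,
    IngestOutcome.CONFLICT: 409,
    IngestOutcome.SCHEMA_ERROR: 422,
    IngestOutcome.TRANSIENT_FAULT: 503,  # assumption: any retryable 5xx would do
}


def build_response(outcome, correlation_id):
    """Return (status_code, body) for a handler outcome."""
    body = {"outcome": outcome.value}
    if outcome is IngestOutcome.QUEUED:
        # 202 responses carry the correlation ID so callers can trace async work.
        body["correlationId"] = correlation_id
    return STATUS_BY_OUTCOME[outcome], body
```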
ExternalId Mapping Table
Every synced entity needs a durable map between the upstream identifier and the Sitecore item GUID. Do not rely on item names or paths. Marketing renames folders. Commerce SKUs get recycled. The mapping table is the source of truth for reconciliation jobs.
Recommended columns
- SourceSystem (Hub, Commerce, PIM)
- EntityType (Asset, Product, Category, Variant)
- ExternalId (vendor native ID, string)
- SitecoreItemId (GUID)
- SitecorePath (cached, rebuildable)
- LastPayloadHash (SHA-256 hex)
- LastSyncedUtc
- SyncStatus (Active, Tombstoned, Conflict)
On create, insert the row before publishing the item when possible. On update, load by ExternalId, not by path search. Tombstone rows when upstream sends delete events; soft delete in Sitecore or move to an archive branch per governance policy.
We expose a small internal API GET /sync/map?source=Hub&externalId=... for support engineers during incidents. Connect flows should use the same store, not a parallel dictionary in flow variables.
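As a rough sketch of the mapping store, the following uses SQLite from the Python standard library. The table name `sync_map` and the helper functions are hypothetical; a real deployment would more likely use SQL Server, but the unique constraint and the lookup by external ID are the parts that matter.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sync_map (
    SourceSystem    TEXT NOT NULL,
    EntityType      TEXT NOT NULL,
    ExternalId      TEXT NOT NULL,
    SitecoreItemId  TEXT NOT NULL,
    SitecorePath    TEXT,
    LastPayloadHash TEXT,
    LastSyncedUtc   TEXT,
    SyncStatus      TEXT NOT NULL DEFAULT 'Active',
    UNIQUE (SourceSystem, EntityType, ExternalId)
)
"""

KEY_FILTER = "SourceSystem = ? AND EntityType = ? AND ExternalId = ?"


def open_store(path=":memory:"):
    """Open the store and create the table (in-memory by default for the sketch)."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def insert_mapping(conn, source, entity_type, external_id, item_id):
    """Insert a row; raises sqlite3.IntegrityError if the external ID is already mapped."""
    with conn:
        conn.execute(
            "INSERT INTO sync_map (SourceSystem, EntityType, ExternalId, SitecoreItemId) "
            "VALUES (?, ?, ?, ?)",
            (source, entity_type, external_id, item_id),
        )


def find_mapping(conn, source, entity_type, external_id):
    """Load by external ID, never by path search. Returns None when unmapped."""
    query = "SELECT * FROM sync_map WHERE " + KEY_FILTER
    return conn.execute(query, (source, entity_type, external_id)).fetchone()


def tombstone(conn, source, entity_type, external_id):
    """Mark the row Tombstoned when upstream sends a delete event."""
    with conn:
        conn.execute(
            "UPDATE sync_map SET SyncStatus = 'Tombstoned' WHERE " + KEY_FILTER,
            (source, entity_type, external_id),
        )
```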
Payload SHA-256 and Conflict Detection
Idempotency keys stop duplicate work. Payload hashing stops silent drift and detects conflicting retries.
Canonicalize JSON before hash: sort keys, normalize line endings in embedded text, strip fields marked connect:ignore in your contract. Store LastPayloadHash on the mapping row after successful apply.
incomingHash = SHA256(canonicalJson(payload))
if (mapping exists && incomingHash == mapping.LastPayloadHash) {
    return OK_DUPLICATE;
}
if (mapping exists && incomingHash != mapping.LastPayloadHash && sameIdempotencyKey) {
    return CONFLICT; // log, route to DLQ, alert
}
Conflicts happen when a vendor replays an old event after a newer one, or when two environments share a webhook URL by mistake. Never auto merge. Queue for human or rules engine review.
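The canonicalize, hash, and classify steps could look roughly like this in Python. It is a sketch under assumptions: the ignored field names are passed in as a set (how your contract marks connect:ignore fields is up to you), and `payload_hash` and `classify` are hypothetical helper names.

```python
import hashlib
import json


def _normalize(value, ignored_fields):
    """Drop ignored keys at any depth and normalize line endings in strings."""
    if isinstance(value, dict):
        return {
            key: _normalize(item, ignored_fields)
            for key, item in value.items()
            if key not in ignored_fields
        }
    if isinstance(value, list):
        return [_normalize(item, ignored_fields) for item in value]
    if isinstance(value, str):
        return value.replace("\r\n", "\n").replace("\r", "\n")
    return value


def payload_hash(payload, ignored_fields=frozenset()):
    """SHA-256 hex digest of the canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(
        _normalize(payload, ignored_fields),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(incoming_hash, last_payload_hash, same_idempotency_key):
    """Mirror the decision rules above; last_payload_hash is None when no mapping exists."""
    if last_payload_hash is None:
        return "APPLY"
    if incoming_hash == last_payload_hash:
        return "OK_DUPLICATE"
    if same_idempotency_key:
        return "CONFLICT"  # caller logs, routes to DLQ, and alerts
    return "APPLY"
```

Any change to `_normalize` changes every hash, so treat it as part of the contract and version it alongside the flow.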
Dead-Letter Queues and Replay
Not every failure should retry forever. After N attempts with exponential backoff, move the message to a dead-letter queue with the full payload, correlation ID, mapping row snapshot, and exception stack.
DLQ entry requirements
- Original webhook headers (redact secrets)
- Parsed entity payload
- Connect flow execution ID
- Sitecore item ID if partially written
- Suggested action: retry, skip, manual fix
Replay tooling must accept a DLQ message ID and re-inject the message with a forceReplay=true flag that bypasses idempotency only when an operator confirms. Blind replay without hash checks reintroduces duplicates.
We partition DLQs by entity type so commerce SKU failures do not block Hub asset fixes. Azure Service Bus or RabbitMQ both work; Connect’s own retry is not a substitute for an operations visible DLQ.
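A bounded retry loop that hands off to a partitioned DLQ might be sketched as follows. The message shape (`entityType`, `payload`, `correlationId`), the dictionary used as a DLQ, and the default attempt count are all assumptions for illustration; in practice the dead letters would go to Service Bus or RabbitMQ.

```python
import time
import traceback


def process_with_retry(message, handler, dead_letters,
                       max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Try handler up to max_attempts (>= 1) times, doubling the delay between tries.

    On final failure, append a DLQ entry to the partition for the message's
    entity type and return None.
    """
    last_error = None
    last_trace = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception as exc:  # sketch: real code should retry only transient faults
            last_error = exc
            last_trace = traceback.format_exc()
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay

    # Partition by entity type so SKU failures do not block asset fixes.
    dead_letters.setdefault(message["entityType"], []).append({
        "payload": message["payload"],
        "correlationId": message["correlationId"],
        "attempts": max_attempts,
        "error": repr(last_error),
        "stack": last_trace,
        "suggestedAction": "manual fix",  # operators reclassify during triage
    })
    return None
```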
Hub to XM Asset Sync
Hub pushes asset metadata and binary URLs. XM stores media items under /sitecore/media library/... with alt text and rights fields authors depend on.
Patterns that held up
- Binary handoff: Connect downloads from Hub signed URL, uploads to Sitecore media library via MediaCreator API or chunked upload for files over 50 MB.
- Stable external ID: Hub asset ID in mapping table; never key on filename.
- Field mapping document: One row per Hub property to Sitecore field ID; version controlled in git.
- Rights and expiry: Map Hub license end date to a Sitecore datetime field; scheduled job unpublishes expired assets.
- Variant handling: Hub renditions become separate media items linked by a shared AssetFamilyId field for Scriban image pickers.
Partial failure mode: binary uploaded but metadata write failed. Use a two phase status on the mapping row: BinaryReady, MetadataReady, Published. Reconciliation job completes stuck rows.
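The reconciliation job's selection step can be sketched as a filter over mapping rows. The row keys (`phase`, `updated_utc`), the 30 minute threshold, and the `NEXT_STEP` labels are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_PHASE = "Published"

# What the job should attempt next for a row stuck in each phase (hypothetical labels).
NEXT_STEP = {
    "BinaryReady": "write_metadata",
    "MetadataReady": "publish",
}


def find_stuck_rows(rows, now, max_age=timedelta(minutes=30)):
    """Return rows that stopped before Published and have not moved within max_age."""
    return [
        row for row in rows
        if row["phase"] != TERMINAL_PHASE and now - row["updated_utc"] > max_age
    ]
```

The age threshold keeps the job from racing a flow that is still mid-upload.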
Commerce SKU Sync
Commerce integrations differ from Hub. SKUs change price, inventory, and attributes hourly. Marketing fields on the same product item must not be clobbered.
Template split strategy
We use a dedicated commerce section on the product template: fields prefixed or placed under a Commerce folder in the Content Editor. Connect has write access only to that section via item security or field level rules where supported.
- Connect writes: SKU, price, inventory flag, category IDs, ERP attributes.
- Sidekick and authors write: headline, body, SEO, campaign tags.
- Scriban on delivery reads commerce fields for price and marketing fields for copy; never merge upstream into a single rich text field.
SKU delete events should tombstone, not hard delete, when order history or analytics references the Sitecore item ID.
Field Ownership: Sidekick vs Connect
The most common production argument is who owns the product description field. Resolve this in a field ownership matrix before go live, not during a P1.
| Field group | Owner | Connect behavior | Sidekick behavior |
|---|---|---|---|
| SKU, price, stock | Commerce | Overwrite on event | Read only in UI |
| Marketing headline | Marketing | No write | Suggest and apply on approve |
| Short description | Hybrid | Seed once on create | Refresh on author request |
| Alt text on hero image | Marketing | No write | Suggest from image context |
| Technical specs table | PIM | Full overwrite | No access |
Enforce with Sitecore security roles, Connect flow guards that skip fields outside the contract, and Sidekick prompt scope limited to allowed field IDs. Authors should see read only styling on Connect owned fields in Experience Editor.
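A flow guard of the kind described above can be as small as an allowlist check. The field names below are invented for illustration, and a real guard would key on Sitecore field IDs rather than display names.

```python
# Hypothetical field names standing in for Sitecore field IDs.
CONNECT_WRITABLE = {"SKU", "Price", "Stock", "TechnicalSpecs"}
SEED_ON_CREATE_ONLY = {"ShortDescription"}  # hybrid fields: Connect seeds once


def guard_fields(incoming, is_create):
    """Split an incoming field dict into (fields to write, sorted names skipped)."""
    allowed = {}
    skipped = []
    for name, value in incoming.items():
        if name in CONNECT_WRITABLE or (is_create and name in SEED_ON_CREATE_ONLY):
            allowed[name] = value
        else:
            skipped.append(name)  # outside the contract: never written by Connect
    return allowed, sorted(skipped)
```

Logging the skipped names makes it visible when an upstream system starts sending fields it does not own.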
Observability Counters
Export metrics to Application Insights, Prometheus, or Datadog. Minimum counters:
- connect_webhook_received_total by source and entity type
- connect_webhook_deduplicated_total
- connect_sync_success_total and connect_sync_failure_total
- connect_sync_duration_ms histogram
- connect_dlq_depth
- connect_conflict_total
- connect_partial_write_total
Alert on DLQ depth over threshold for 15 minutes, conflict rate spike above baseline, and sync duration p95 doubling. Green success counts alone hide duplicate suppression doing too much work.
Log correlation: propagate X-Correlation-Id from webhook to Connect flow to Sitecore log scope. Support can trace one Hub asset ID end to end.
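Most monitoring tools express "over threshold for 15 minutes" natively, but the logic is worth spelling out. This sketch assumes `connect_dlq_depth` is sampled as (timestamp in seconds, depth) pairs in ascending time order; `dlq_alert_firing` is a hypothetical name.

```python
def dlq_alert_firing(samples, threshold, window_seconds=15 * 60):
    """Return True if depth has stayed above threshold for at least window_seconds.

    samples: list of (timestamp_seconds, depth) tuples, oldest first.
    """
    if not samples:
        return False
    breach_start = None
    for timestamp, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current unbroken breach
        else:
            breach_start = None           # any dip resets the window
    latest_timestamp = samples[-1][0]
    return breach_start is not None and latest_timestamp - breach_start >= window_seconds
```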
Sandbox Replay
Before promoting a Connect flow change, replay production shaped payloads against a sandbox XM instance with a copy of the mapping table schema but anonymized external IDs.
- Export last 7 days of webhook payloads from secure storage (not production logs with secrets).
- Scrub PII and rotate any embedded tokens.
- Point sandbox Connect at sandbox XM connection.
- Run replay script with rate limit matching peak production TPS.
- Compare item counts, mapping row counts, and hash distribution to expected baselines.
Sandbox replay caught a Hub rendition rename that would have created duplicate media items because our idempotency key omitted rendition ID. Fix the key, replay again, then promote.
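For the scrub step, one approach is to replace external IDs with a keyed hash so the same upstream ID always maps to the same anonymized value and dedup behavior stays realistic. The field names, the `anon-` prefix, and the flat payload shape are assumptions in this sketch.

```python
import copy
import hashlib
import hmac


def anonymize_external_id(external_id, salt):
    """Deterministically map an external ID to an opaque token using HMAC-SHA256."""
    digest = hmac.new(salt, external_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "anon-" + digest[:16]


def scrub_payload(payload, salt, pii_fields, id_field="externalId"):
    """Return a copy with top-level PII fields redacted and the ID anonymized."""
    scrubbed = copy.deepcopy(payload)  # never mutate the captured original
    for name in pii_fields:
        if name in scrubbed:
            scrubbed[name] = "[redacted]"
    if id_field in scrubbed:
        scrubbed[id_field] = anonymize_external_id(str(scrubbed[id_field]), salt)
    return scrubbed
```

Keep the salt in the sandbox secret store only; anyone holding it can confirm guesses at production IDs.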
Operator Runbook
Symptom: DLQ growing, no user visible errors
- Check connect_sync_failure_total by error code.
- Sample 5 DLQ messages; classify schema vs Sitecore vs network.
- If schema: open ticket with vendor, pause subscription temporarily.
- If Sitecore: verify item locks, workflow state, disk space on CM.
- Replay one message with operator flag after fix; monitor mapping row.
Symptom: duplicate items in media library
- Query mapping table for duplicate ExternalId rows (should be unique constraint).
- Search items created without mapping insert (race in flow).
- Merge duplicates; backfill mapping; patch flow to insert map before media create.
Symptom: commerce price correct in CM, wrong on site
- Confirm publish pipeline ran for target database.
- Check field ownership: author may have overwritten commerce field in EE.
- Verify CD cache and edge cache keys include SKU version or timestamp.
Symptom: conflict rate spike after deploy
- Compare payload canonicalization before and after deploy.
- Roll back Connect flow version if hash logic changed.
- Reprocess conflicts from DLQ after fix with manual approval.
Webhook Authentication and Network Boundaries
Connect endpoints sit on the public internet or behind a vendor-facing API gateway. Do not rely on obscurity. Hub supports HMAC signatures on the raw body. Commerce middleware may use OAuth client credentials with short-lived tokens. Document the verification step in the same repo as your idempotency key definition so a new engineer cannot disable the signature check during debugging and forget to re-enable it.
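A minimal HMAC check over the raw body might look like the sketch below. It assumes the header carries `sha256=` followed by a hex digest; confirm the exact header name and encoding against your vendor's documentation before relying on it.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, header_value: str, secret: bytes) -> bool:
    """Check an HMAC-SHA256 signature computed over the unparsed request body."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    provided = header_value[len(prefix):]
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))
```

Verify against the bytes exactly as received. Re-serializing parsed JSON before hashing changes whitespace and key order and will fail valid signatures.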
We terminate TLS at Application Gateway and forward only to Connect runtime subnets with NSG rules. Webhook URLs use environment specific hostnames: connect-webhook-qa.contoso.com vs production. A classic failure mode is Hub still configured with QA URL after UAT sign off. Include webhook URL in your sandbox replay checklist and verify Hub subscription points at the correct hostname the day before cutover.
Rate limiting and backpressure
Black Friday catalog updates can spike webhook volume tenfold. Without backpressure, Connect threads exhaust and Sitecore CM item saves queue behind long transactions. Return 429 with Retry-After when internal queue depth exceeds threshold. Vendors that honor backoff will spread load. Pair with autoscale on Connect workers but cap max instances to protect Mongo or SQL backing the mapping table.
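The admission decision can be sketched as a pure function of queue depth. The drain-rate estimate used to size Retry-After is an assumption of this example; a fixed value works too if you cannot measure throughput.

```python
import math


def admit(queue_depth, max_depth, drain_rate_per_second):
    """Return (status, headers): 202 while under the threshold, else 429 with Retry-After."""
    if queue_depth <= max_depth:
        return 202, {}
    excess = queue_depth - max_depth
    # Rough time for workers to drain the excess, never less than one second.
    retry_after = max(1, math.ceil(excess / drain_rate_per_second))
    return 429, {"Retry-After": str(retry_after)}
```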
Serialization and Package Conflicts
Connect creates items under predictable paths. Unicorn or CLI serialization teams sometimes exclude /sitecore/content/Commerce/ from git because data is volatile. That is correct for SKU fields but dangerous for template changes on commerce sections. Template updates must serialize; item data must not. Document the split in README so a developer does not force sync commerce items from an old branch and wipe Connect mappings on next deploy.
After item create via Connect, run a lightweight event handler that verifies mapping row exists. We saw race conditions where media binary upload succeeded, item saved, but mapping insert failed on SQL timeout. The reconciliation job queries Sitecore for items with Hub external ID field populated but no mapping row, then backfills or flags conflict.
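The reconciliation pass described above reduces to a comparison between Sitecore items and mapping rows. In this sketch the item dictionaries and the `hub_external_id` key are hypothetical; how you query Sitecore for the items is left out.

```python
def plan_reconciliation(sitecore_items, mapping_by_external_id):
    """Decide which items need a mapping backfill and which need a conflict flag.

    sitecore_items: dicts with "item_id" and an optional "hub_external_id".
    mapping_by_external_id: external ID -> Sitecore item ID already in the map.
    """
    backfill = []
    conflicts = []
    for item in sitecore_items:
        external_id = item.get("hub_external_id")
        if not external_id:
            continue  # not a synced item
        mapped_item_id = mapping_by_external_id.get(external_id)
        if mapped_item_id is None:
            backfill.append(item["item_id"])   # item exists, mapping insert was lost
        elif mapped_item_id != item["item_id"]:
            conflicts.append(item["item_id"])  # same external ID, different item
    return backfill, conflicts
```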
Testing Connect Flows Locally
Sitecore Connect developer experience varies by version. Capture webhook payloads to JSON files during QA and replay them with curl through a local tunnel. Validate idempotency by sending the same file twice and confirming that you end up with one mapping row and the item version increments you expect. Test out-of-order delivery as well, such as a delete (tombstone) event that arrives before a retried create. Commerce vendors especially replay creates after updates when their queue reordering bugs surface.
curl -X POST https://localhost:443/connect/hub/asset -H "Content-Type: application/json" -H "X-Hub-Signature: sha256=..." --data @fixtures/hub-asset-create-001.json
Keep fixtures sanitized in git beside Connect flow definitions. Tag fixture set version when schema changes.
Sidekick and Connect on the Same Save
An author approves Sidekick copy on a product item while a commerce price event arrives mid-save. Sitecore item locking usually serializes writes, but async Connect flows may read stale item state. One option is a field-level merge in the Connect flow: load the item, compare field timestamps or rely on Sitecore event queue ordering, and, if business rules allow, skip the commerce overwrite on fields touched in the last N minutes by non-service accounts. The simpler pattern: commerce writes only to commerce section fields that are never edited in EE, and Sidekick never touches those field IDs. Simpler is better when legal asks who won the conflict.
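If you do adopt the time-window rule, the check itself is small. The 10 minute default and the service account name in the usage below are placeholders, and where the last editor and edit time come from depends on how you track field changes.

```python
from datetime import datetime, timedelta, timezone


def should_skip_overwrite(last_editor, last_edited_utc, now, service_accounts,
                          quiet_window=timedelta(minutes=10)):
    """Return True when a non-service account touched the field within the window."""
    if last_editor in service_accounts:
        return False  # Connect's own writes never block the next sync
    return now - last_edited_utc < quiet_window
```

For example, with `service_accounts = {"sitecore\\connect-svc"}`, an edit by an author five minutes ago makes the flow skip the field, while the same edit an hour ago lets the overwrite through.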
Closing Checklist
Use this before go live and after any Connect flow or template change.
- Idempotency key documented and implemented with vendor retry window TTL
- ExternalId mapping table with unique constraint on (SourceSystem, EntityType, ExternalId)
- Payload SHA-256 stored on successful apply; conflict path routes to DLQ
- Dead-letter queue with operator replay tool and force flag
- Hub asset sync uses two phase status for binary vs metadata
- Commerce SKU fields isolated; Sidekick and authors cannot overwrite price or stock
- Field ownership matrix signed by marketing and commerce leads
- Observability counters and alerts on DLQ depth, conflicts, and p95 duration
- Sandbox replay of 7 day payload sample after each Connect promotion
- Runbook linked from on-call wiki with correlation ID lookup steps