Sitecore Content Hub: Enrichment Profiles, Taxonomy, and XM Sync

Content Hub AI enrichment scales until it does not: taxonomy drift, over-tagged assets, and confidence scores authors learn to ignore. The goal is structured metadata that flows through Connect into XM picklists and search facets without turning DAM into a junk drawer. This runbook covers taxonomy design, enrichment profiles, batch jobs, sampling methodology, and ops controls I use before turning AI tagging loose on more than a pilot corpus.

Content Hub taxonomy foundations

Start with a shallow taxonomy, not an ontology science project. Three layers work for most enterprises:

  • Business line: product family or region (single select per asset in pilot)
  • Content type: photo, illustration, logo, video, document
  • Topic tags: controlled vocabulary, max 5 per asset in pilot phase

Store taxonomy in Content Hub with explicit relations, not free-text keywords copied from legacy DAM. Every tag needs a definition owners agree on. “Office” means corporate workplace photography, not Microsoft Office product shots.

Example taxonomy fragment:

Taxonomy: /Taxonomy/Corp/
  BusinessLine/
    Healthcare
    Financial Services
    Retail
  ContentType/
    Photography
    Icon
    Video
  Topic/
    Leadership
    Product UI
    Event
    Compliance

Assign stewards per branch. Stewards approve new tag requests weekly; AI does not invent tags outside the vocabulary during pilot.

Enrichment profiles

An enrichment profile defines which AI models run, which fields they write, and minimum confidence thresholds. Separate profiles by asset type: running video transcription on PNGs wastes money and adds noise.

Profile: Photography

  • Auto caption (alt text candidate)
  • Topic tags from controlled vocabulary
  • Dominant colors (optional, for search facets)
  • Detected objects (internal only, not synced to XM in pilot)

Profile: Document PDF

  • Title suggestion from first heading
  • Summary (max 300 chars)
  • Language detection
  • Compliance class (public, internal, restricted) via rules plus AI hint

Profile configuration (conceptual):

{
  "profileId": "corp-photo-v1",
  "entityType": "M.Asset",
  "operations": [
    { "field": "Description", "model": "caption", "minConfidence": 0.75 },
    { "field": "TopicTags", "model": "classifier", "vocabulary": "Topic", "minConfidence": 0.82, "maxTags": 5 }
  ],
  "onFailure": "skip_field",
  "auditChannel": "enrichment-audit"
}

Use onFailure: skip_field rather than failing the entire asset. Partial enrichment beats blocking the ingest pipeline.
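To make the skip_field behavior concrete, here is a minimal Python sketch of how a profile like the one above could be applied to one asset. It is not Content Hub code: the apply_profile function and the shape of model_output (model name mapped to value and confidence pairs, or None when the model call failed) are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any

# Trimmed copy of the conceptual profile shown earlier.
PROFILE = {
    "profileId": "corp-photo-v1",
    "operations": [
        {"field": "Description", "model": "caption", "minConfidence": 0.75},
        {"field": "TopicTags", "model": "classifier", "minConfidence": 0.82, "maxTags": 5},
    ],
    "onFailure": "skip_field",
}


def apply_profile(profile: dict, model_output: dict) -> tuple[dict, list[str]]:
    """Return (fields_to_write, skipped_fields) for one asset.

    model_output is a hypothetical shape: model name -> list of
    (value, confidence) pairs, or None when that model call failed.
    """
    writes: dict[str, Any] = {}
    skipped: list[str] = []
    for op in profile["operations"]:
        candidates = model_output.get(op["model"])
        if candidates is None:
            if profile["onFailure"] != "skip_field":
                raise RuntimeError(f"model {op['model']} failed")
            skipped.append(op["field"])  # skip this field, keep the asset
            continue
        # Highest confidence first, then drop anything under the threshold.
        ranked = sorted(candidates, key=lambda vc: vc[1], reverse=True)
        kept = [value for value, conf in ranked if conf >= op["minConfidence"]]
        if not kept:
            skipped.append(op["field"])
        elif "maxTags" in op:
            writes[op["field"]] = kept[: op["maxTags"]]
        else:
            writes[op["field"]] = kept[0]
    return writes, skipped
```

In this sketch a failed caption call still lets qualifying topic tags through, which is the partial-enrichment outcome the profile asks for.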

Digital asset management workspace
Keep enrichment profiles scoped by DAM entity type from day one. Photo: Campaign Creators / Unsplash. Reference: Sitecore Documentation, Content Hub DAM.

Confidence scores authors can trust

Authors ignore scores they do not understand. Use three bands with explicit UI labels:

Score          | Label  | Action
0.85 and above | High   | Auto-apply in batch; spot-check 5% sample
0.70 to 0.84   | Review | Queue for human approval in Content Hub
Below 0.70     | Low    | Do not write; log for model tuning

Display confidence per field, not per asset aggregate. One strong caption and five weak tags is common; averaging hides the problem.
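The banding above can be expressed as a small per-field routing function. This is a sketch, not a Content Hub API: the function names are made up, and it assumes a score between 0.84 and 0.85 falls into the Review band.

```python
from __future__ import annotations


def band(score: float) -> str:
    """Map one field-level confidence score to its UI label."""
    if score >= 0.85:
        return "High"
    if score >= 0.70:
        return "Review"
    return "Low"


def route_fields(field_scores: dict[str, float]) -> dict[str, str]:
    """Band every field separately instead of averaging per asset."""
    return {field: band(score) for field, score in field_scores.items()}
```

For an asset with a 0.91 caption and 0.72 topic tags, the per-field view returns High for the caption and Review for the tags. The asset average of about 0.82 would have reported a single Review label and hidden the strong caption.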

Calibrate thresholds on a labeled set of 200 assets (see sampling below) before batch jobs. If reviewers override High tags more than 10% of the time, raise the threshold by 0.05 and re-run the sample.

Batch jobs without chaos

Batch enrichment jobs should be bounded:

  • Filter by ingest date or folder, not entire repository on first run
  • Cap concurrency to protect API limits and reviewer queues
  • Idempotent: re-run safe; overwrite only fields marked AI-managed
  • Dry-run mode writes to shadow fields for comparison

Job manifest example:

Job: enrich-2026-q2-pilot
Query: EntityType=M.Asset AND Folder=/Corp/Photography/2026 AND Status=Final
Profile: corp-photo-v1
BatchSize: 50
MaxAssets: 500
Schedule: off-peak UTC 02:00
Notify: [email protected]

Track job metrics: processed, skipped, failed, average confidence, human override rate. Stop job automatically if failure rate exceeds 5% in any batch of 50.
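The automatic stop can be sketched as a loop that checks the failure rate after each batch. The run_batches function and the enrich callable are hypothetical stand-ins for whatever actually calls the enrichment service; the defaults mirror the manifest values.

```python
from __future__ import annotations

from typing import Callable


def run_batches(
    asset_ids: list[str],
    enrich: Callable[[str], bool],  # hypothetical: returns True on success
    batch_size: int = 50,
    max_assets: int = 500,
    max_failure_rate: float = 0.05,
) -> dict:
    """Process assets in bounded batches; stop when one batch fails too often."""
    metrics = {"processed": 0, "failed": 0, "stopped": False}
    capped = asset_ids[:max_assets]  # MaxAssets cap from the manifest
    for start in range(0, len(capped), batch_size):
        batch = capped[start : start + batch_size]
        failures = 0
        for asset_id in batch:
            try:
                ok = enrich(asset_id)
            except Exception:
                ok = False  # an exception counts as a failed asset
            if not ok:
                failures += 1
        metrics["processed"] += len(batch)
        metrics["failed"] += failures
        if failures / len(batch) > max_failure_rate:
            metrics["stopped"] = True  # breaker trips; later batches never run
            break
    return metrics
```

With a batch size of 50, three failures in one batch (6%) trip the breaker, while two failures (4%) do not.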

Sampling 200 assets for calibration

Before the production batch, draw a stratified sample:

  1. 50 assets: simple product on white background
  2. 50 assets: people and events
  3. 50 assets: screenshots and UI
  4. 50 assets: legacy scans or low quality
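Drawing the four strata can be scripted so that the sample is reproducible. The sketch below assumes asset IDs have already been grouped by stratum (how that grouping is done is outside its scope); the function name and the seed value are illustrative.

```python
from __future__ import annotations

import random


def stratified_sample(
    assets_by_stratum: dict[str, list[str]],
    per_stratum: int = 50,
    seed: int = 2026,
) -> dict[str, list[str]]:
    """Pick per_stratum asset IDs from every stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reviewers can re-create the set
    sample: dict[str, list[str]] = {}
    for stratum, asset_ids in assets_by_stratum.items():
        if len(asset_ids) < per_stratum:
            raise ValueError(f"stratum {stratum!r} has fewer than {per_stratum} assets")
        sample[stratum] = rng.sample(asset_ids, per_stratum)
    return sample
```

Failing loudly on an undersized stratum is a deliberate choice here: silently sampling fewer low-quality scans would skew the calibration toward the easy cases.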

Two reviewers independently label ground truth tags and captions. Compare AI output to ground truth:

  • Precision and recall per tag at candidate thresholds
  • Caption edit distance median
  • False positive rate on restricted content (people, logos, trademarks)

Document results in a calibration sheet signed by DAM lead and legal for external-facing alt text rules.
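Per-tag precision and recall at a candidate threshold can be computed directly from the reviewer labels. The data shapes below (predictions as asset ID to tag to confidence, truth as asset ID to a set of tags) are assumptions for this sketch.

```python
from __future__ import annotations


def precision_recall(
    predictions: dict[str, dict[str, float]],
    truth: dict[str, set[str]],
    tag: str,
    threshold: float,
) -> tuple[float | None, float | None]:
    """Precision and recall for one tag; None when the ratio is undefined."""
    tp = fp = fn = 0
    for asset_id, true_tags in truth.items():
        predicted = predictions.get(asset_id, {}).get(tag, 0.0) >= threshold
        actual = tag in true_tags
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

Running this for each tag at, say, 0.75, 0.80, and 0.85 gives the rows of the calibration sheet. A higher threshold usually trades recall for precision, so the sheet makes that trade visible per tag.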

Synonym guide for search and enrichment

AI models and authors use different words for the same concept. Maintain a synonym guide mapped to taxonomy IDs:

Tag: Topic/Leadership
  Synonyms: executive, C-suite, management team, board
  Not: team photo (use Topic/Event unless executives featured)

Tag: Topic/Product UI
  Synonyms: screenshot, interface, dashboard, application
  Not: device mockup (use ContentType/Photography + object phone)

Inject synonym guide into enrichment prompt context where Content Hub supports custom NLP configuration. At minimum, post-process classifier output to map synonyms to canonical tag IDs before write.
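The post-processing step can be as simple as a case-insensitive lookup built from the synonym guide. This sketch uses tag paths as canonical IDs for readability; in practice the values would be Content Hub tag identifiers. Labels with no mapping are dropped, on the assumption that the pilot does not accept tags outside the vocabulary.

```python
from __future__ import annotations

# Synonym guide from the example above, keyed by canonical tag.
SYNONYMS = {
    "Topic/Leadership": ["executive", "C-suite", "management team", "board"],
    "Topic/Product UI": ["screenshot", "interface", "dashboard", "application"],
}


def build_lookup(guide: dict[str, list[str]]) -> dict[str, str]:
    """Map every synonym (and the canonical name itself) to the canonical tag."""
    lookup: dict[str, str] = {}
    for canonical, synonyms in guide.items():
        lookup[canonical.casefold()] = canonical
        for synonym in synonyms:
            lookup[synonym.casefold()] = canonical
    return lookup


def canonicalize(labels: list[str], lookup: dict[str, str]) -> list[str]:
    """Translate raw classifier labels to canonical tags, dropping unknowns."""
    result: list[str] = []
    for label in labels:
        canonical = lookup.get(label.strip().casefold())
        if canonical is not None and canonical not in result:
            result.append(canonical)  # keep first-seen order, no duplicates
    return result
```

With this guide, the raw labels "Executive", "board", "Dashboard", and "sunset" collapse to Topic/Leadership and Topic/Product UI, and "sunset" is discarded.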

Connect sync to XM picklists

XM authors pick DAM assets and metadata via fields backed by Content Hub or synced picklists. Sync only approved, published taxonomy values:

Content Hub Taxonomy (published) --Connect sync-->
XM /sitecore/content/Corp/Settings/Taxonomy/Topic --source-->
Template droplist field "Topic Tag"

Rules:

  • One direction of truth: Content Hub taxonomy publishes downward; XM does not create tags locally
  • Sync job runs after taxonomy steward approval, not on every AI suggestion
  • Map each Content Hub tag GUID to its XM item through an external ID field, so the mapping stays stable when a tag is renamed

Picklist item template fields:

Taxonomy Tag Item
  TagId (Single-Line Text, Content Hub GUID)
  DisplayName (Single-Line Text)
  Synonyms (Multi-Line Text, pipe-separated)
  Retired (Checkbox)

Retire tags in Content Hub first; sync sets Retired on XM picklist items; existing content keeps historical value but authors cannot select retired tags for new items.

Search facets in Content Hub and XM

Facets should mirror taxonomy branches authors already know:

  • Business line (single select facet)
  • Content type
  • Topic (multi select, max 5)
  • Approval status (Final, In Review)
  • Enrichment status (None, Partial, Complete, Needs Review)

Do not facet raw AI object detection labels in pilot; noise overwhelms useful filters. Keep internal labels in a separate field not exposed to authors.

XM media library search (if integrated) should use same facet names as Content Hub portal to reduce training burden.

DAM entity types

Align entity types with enrichment profiles:

Entity type  | Typical use         | Enrichment profile
M.Asset      | Images, video files | corp-photo-v1 / corp-video-v1
M.File       | PDF, DOCX           | corp-document-v1
M.Collection | Campaign bundles    | rollup tags from members, manual only in pilot

Collections inherit tags slowly; do not auto-roll up AI tags from children until child precision exceeds 0.9 on sample.

Ops runbook

Weekly

  • Review enrichment override queue (Review band)
  • Approve or reject pending taxonomy tag requests
  • Check batch job failure logs and API quota

Monthly

  • Re-sample 50 random enriched assets for drift
  • Update synonym guide from support tickets
  • Retire unused tags (zero selections in 90 days)

Incident: tag storm

Symptoms: thousands of assets tagged “Office” incorrectly after model update.

  1. Disable enrichment profile immediately
  2. Identify job ID and timestamp range
  3. Run rollback script clearing AI-managed fields where confidence below new threshold or job ID matches
  4. Re-calibrate on the 200-asset sample before re-enabling
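Step 3 of the tag-storm procedure can be sketched as a function that decides, per asset, which AI-managed fields to clear. It assumes EnrichmentConfidenceJson holds a JSON object mapping field name to a single confidence score, which is a guess at the format; the function name is also hypothetical.

```python
from __future__ import annotations

import json


def fields_to_clear(
    asset: dict,
    bad_job_id: str,
    new_threshold: float,
    ai_managed: set[str],
) -> list[str]:
    """List AI-managed fields to clear on one asset during a rollback."""
    # Assumed format: {"TopicTags": 0.88, "Description": 0.80}
    confidences = json.loads(asset.get("EnrichmentConfidenceJson") or "{}")
    written = [field for field in ai_managed if field in confidences]
    if asset.get("EnrichmentJobId") == bad_job_id:
        return sorted(written)  # everything the bad job wrote
    return sorted(f for f in written if confidences[f] < new_threshold)
```

The sketch only selects fields. The actual clear should go through the same service account and audit fields as enrichment, so the rollback shows up in the audit trail.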

Incident: Connect sync stall

XM picklists stale; authors pick outdated tags.

  1. Check Connect agent health and last successful sync
  2. Verify taxonomy publish state in Content Hub
  3. Manual delta sync for affected branch after fix

Audit log fields on asset:

EnrichmentJobId
EnrichmentProfileVersion
EnrichmentTimestamp
EnrichmentConfidenceJson
HumanReviewedBy
HumanReviewedAt

Team reviewing documents at a table
Human review queue is a feature, not a backlog failure, for mid-confidence enrichment. Photo: Scott Graham / Unsplash. Reference: Sitecore Documentation, Content Hub Connect.

Governance and legal

Alt text and document summaries may constitute compliance copy. Route external-facing fields through the legal banned-terms check (same list as Sitecore Sidekick prompts). Block auto-publish of Description to public XM fields until legal signs off on the pilot profile.

Face detection and biometric labels: disable for HR and event photography unless privacy review complete. Store EXIF strip policy separately from AI enrichment.

Integration with Sitecore XM authoring

Authors should see enrichment provenance in XM media item or linked datasource:

  • “Alt text suggested by AI, reviewed by {user}”
  • Link back to Content Hub asset detail
  • Warning badge if enrichment status Needs Review

Do not sync Low confidence fields to XM picklists or required template fields. Empty is better than wrong.

Metrics dashboard

  • Assets enriched vs pending by profile
  • Override rate by field and confidence band
  • Tag precision on monthly 50-asset audit
  • Connect sync lag time
  • Author search success (zero-result rate on DAM search)

Content Hub entity permissions and AI

Enrichment jobs run as a service account with write on metadata fields only, not Final approval status. Separate role EnrichmentBot from DAMAdmin. Authors retain approve rights on Review band items. Audit trail must show bot vs human writes distinctly.

Custom vocabulary import

When marketing uploads a synonym CSV from legacy DAM, validate before merge:

Import rules:
- Reject rows with empty canonical tag ID
- Reject duplicate synonym mapping to two canonical tags
- Flag synonyms that match banned terms list
- Require steward approval before production vocabulary update

Bad imports poison classifier training for weeks.
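The first three import rules lend themselves to an automated pre-check before a steward sees the file. The sketch below assumes a CSV with canonical_tag_id and synonym columns (hypothetical names) and a banned-terms set that is already lower-cased. When a synonym maps to two canonical tags, it keeps the first row and rejects the later one; rejecting both would be an equally valid policy.

```python
from __future__ import annotations

import csv
import io


def validate_synonym_rows(csv_text: str, banned_terms: set[str]):
    """Split rows into (accepted, rejected, flagged) ahead of steward review."""
    accepted: list[tuple[str, str]] = []
    rejected: list[tuple[str, str]] = []
    flagged: list[str] = []
    owner: dict[str, str] = {}  # synonym -> canonical tag ID first seen
    for row in csv.DictReader(io.StringIO(csv_text)):
        tag_id = (row.get("canonical_tag_id") or "").strip()
        synonym = (row.get("synonym") or "").strip()
        key = synonym.casefold()
        if not tag_id:
            rejected.append((synonym, "empty canonical tag ID"))
            continue
        if key in owner and owner[key] != tag_id:
            rejected.append((synonym, "maps to two canonical tags"))
            continue
        owner[key] = tag_id
        if key in banned_terms:
            flagged.append(synonym)  # held back for steward and legal review
            continue
        accepted.append((tag_id, synonym))
    return accepted, rejected, flagged
```

Accepted rows still wait for steward approval; the script only removes the rows that should never reach the steward in the first place.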

Video enrichment specifics

Video profiles add transcript, chapter markers, and thumbnail key frame selection. Transcript confidence below 0.80 goes to Review for external captions. Do not auto-generate subtitles for broadcast without legal review. Store transcript in separate field from public Description until approved.

Duplicate detection before enrich

Run perceptual hash or file checksum job on ingest before AI enrichment spend. Duplicate assets should inherit tags from canonical copy, not re-run classifier. Saves cost and keeps facets clean.
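The checksum variant can be sketched with the standard library alone. It only catches byte-identical files; resized or re-encoded copies need a perceptual hash, which is not shown here. The plan_enrichment function and its input shape are illustrative.

```python
from __future__ import annotations

import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of the file bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_enrichment(assets: list[tuple[str, bytes]]) -> tuple[list[str], dict[str, str]]:
    """Return (ids_to_enrich, duplicate_id -> canonical_id).

    The first asset seen with a given checksum is treated as canonical.
    """
    canonical_by_hash: dict[str, str] = {}
    to_enrich: list[str] = []
    inherit_from: dict[str, str] = {}
    for asset_id, data in assets:
        digest = checksum(data)
        if digest in canonical_by_hash:
            inherit_from[asset_id] = canonical_by_hash[digest]  # copy tags, skip AI call
        else:
            canonical_by_hash[digest] = asset_id
            to_enrich.append(asset_id)
    return to_enrich, inherit_from
```

Only the IDs in the first list are sent to the classifier; the mapping tells a later step which canonical asset each duplicate should copy its tags from.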

Connect mapping for picklists

A typical Connect flow follows this pattern:

  1. Trigger on taxonomy publish event in Content Hub
  2. Transform tag to XM item create-or-update message
  3. Write under /sitecore/content/.../Settings/Taxonomy/
  4. Log sync batch ID to ops dashboard

Idempotency key: Content Hub tag GUID plus publish version number. Retries must not create duplicate picklist items.
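The idempotency rule can be illustrated with an in-memory stand-in for the XM picklist folder. PicklistStore is a hypothetical class, not a Sitecore or Connect API; a real flow would persist the applied keys somewhere durable.

```python
from __future__ import annotations


class PicklistStore:
    """In-memory stand-in for the XM taxonomy picklist items."""

    def __init__(self) -> None:
        self.items: dict[str, dict] = {}  # TagId (Content Hub GUID) -> item fields
        self.applied: set[str] = set()    # idempotency keys already processed

    def upsert(
        self,
        tag_guid: str,
        publish_version: int,
        display_name: str,
        retired: bool = False,
    ) -> str:
        """Create or update one picklist item; a replayed message is a no-op."""
        key = f"{tag_guid}:{publish_version}"
        if key in self.applied:
            return "skipped"
        action = "updated" if tag_guid in self.items else "created"
        self.items[tag_guid] = {
            "TagId": tag_guid,
            "DisplayName": display_name,
            "Retired": retired,
        }
        self.applied.add(key)
        return action
```

Because items are keyed by the GUID rather than the display name, a rename arrives as an update to the existing item, and a retried message with the same GUID and version changes nothing.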

Author search facet tuning

After the enrichment pilot, review zero-result searches weekly. If authors search “logo” but the taxonomy uses “Brand Asset”, add a synonym mapping; do not rename the tag without a migration script for existing assets.

Retention and rollback of AI metadata

Store the previous field value in a history table before an AI overwrite, including when a Review-band value is applied after human approval. That makes it possible to roll back one asset without a full job replay. This is required for regulated industries when a wrong tag hits the public site.

Cost controls on batch enrichment

Set monthly API budget alert. Profile-level caps: max 10k images per month in pilot. Queue overflow waits until next month or manual approval. Finance cares about this slide more than precision metrics.

Handoff to Sitecore Sidekick and RAZL

DAM alt text and topic tags synced to XM improve Sidekick retrieval context on media-heavy pages. Align taxonomy tag IDs between Content Hub, XM picklists, and RAZL prompt retrieval scopes. One tag registry in git referenced by all three systems.

Quarterly taxonomy health review

Every quarter, export tags with zero asset assignments and tags with more than 500 assets (potential overly broad labels). Retire or split tags with steward approval. Re-run 50-asset sample after taxonomy change before next batch enrichment job. Prevents classifier from learning obsolete vocabulary.
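The quarterly export reduces to counting assignments per tag. A minimal sketch, assuming the export provides the vocabulary as a list and the assignments as asset ID to tag list (both shapes are illustrative):

```python
from __future__ import annotations

from collections import Counter


def taxonomy_health(
    vocabulary: list[str],
    asset_tags: dict[str, list[str]],
    broad_limit: int = 500,
) -> tuple[list[str], list[str]]:
    """Return (unused_tags, overly_broad_tags) for steward review."""
    # Count each tag at most once per asset.
    counts = Counter(tag for tags in asset_tags.values() for tag in set(tags))
    unused = sorted(tag for tag in vocabulary if counts[tag] == 0)
    broad = sorted(tag for tag in vocabulary if counts[tag] > broad_limit)
    return unused, broad
```

The two lists are candidates only. Whether a tag is retired or split remains a steward decision.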

Pilot exit criteria

Exit pilot when override rate on High confidence fields stays below 8% for 30 days, Connect sync lag under 15 minutes P95, and legal signs off on external alt text profile. Promote enrichment profiles from pilot flag to production with version bump and communication to all DAM users.
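The exit criteria can be checked mechanically once the metrics exist. This sketch assumes one override-rate value per day and a list of sync lag samples in minutes, and it uses the nearest-rank method for P95; other percentile definitions give slightly different numbers on small samples.

```python
from __future__ import annotations


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def ready_to_exit(
    daily_override_rates: list[float],
    sync_lag_minutes: list[float],
    legal_signed_off: bool,
) -> bool:
    """True when all three pilot exit criteria hold."""
    return (
        len(daily_override_rates) >= 30
        and all(rate < 0.08 for rate in daily_override_rates[-30:])
        and p95(sync_lag_minutes) < 15
        and legal_signed_off
    )
```

A single day at or above 8% in the last 30 resets the clock in this reading of "stays below"; a rolling 30-day average would be a looser alternative.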

Training DAM users on enrichment badges

Authors ignore metadata they do not understand. UI badges: green check for human-approved High confidence, yellow clock for Review queue, gray dash for no enrichment. One 30-minute training session showing badge meanings reduces support tickets more than a written policy PDF.

Record training attendance in DAM ops wiki. Re-run session when enrichment profile version increments with visible UI changes.

Actionable checklist

  • Define three-layer taxonomy with stewards and written tag definitions.
  • Create separate enrichment profiles per DAM entity type (M.Asset, M.File).
  • Calibrate confidence thresholds on 200-asset stratified sample with dual review.
  • Publish synonym guide mapped to canonical taxonomy IDs.
  • Run batch jobs with caps, dry-run shadow fields, and automatic stop on 5% failure rate.
  • Sync only steward-approved taxonomy to XM picklists via Connect with stable TagId.
  • Expose facets aligned between Content Hub and XM search; hide raw object detection.
  • Route mid-confidence output to human review queue with per-field scores.
  • Document weekly and monthly ops cadence plus tag-storm rollback procedure.
  • Block Low confidence AI values from required XM fields and legal-facing copy.