Testing Sitecore Sidekick Prompts: Fixtures, Rubrics, and Release Control

Prompts Are Code: Treat Them Like Code

Sidekick prompts look like copy documents. In production they drive field values on legal pages, product claims, and healthcare disclaimers. A prompt tweak without regression testing is equivalent to deploying an untested event handler on item:saved. We learned this after a single adjective change in a system instruction caused every CTA to exceed the template max length and fail publish validation on forty items.

This post describes the fixture branch, git-stored prompt JSON, rubric scoring, and CI hooks we use before promoting Sidekick prompt packs from QA to production XM.

Figure: developer workstation with code editor. Prompt definitions belong in source control beside serialization and Scriban templates.

Fixture Branch: /sitecore/content/QA/SidekickFixtures/

Create a dedicated QA content branch isolated from marketing campaigns. Authors do not browse here during normal work. The platform team seeds it once and refreshes it quarterly from production-shaped samples.

Branch rules

Path root: /sitecore/content/QA/SidekickFixtures/
Workflow: auto-approve for machine runs; no publish to web databases
Security: Sidekick service account has write access; authors are read-only
Label items with a FixtureId text field matching the git test case ID

Each fixture item mirrors a real template with representative field values, edge cases, and empty optional fields. Do not use lorem ipsum alone. Include awkward real patterns: nested tables in rich text, trademark symbols, empty CTA links, and parent items missing headlines.

Ten Fixture Item Types

Minimum set that caught most regressions in our programs:

ArticleLongForm with a 1,200-word body, inline images, and footnotes.
ArticleShortNews that stresses the 280-character teaser limit.
ProductDetail with read-only commerce fields and Sidekick-owned marketing fields.
LandingHero with a populated General Link CTA, plus a sibling with an empty CTA.
CardGridParent with three child cards missing teasers.
LegalDisclaimer with forbidden-term bait phrases in the source notes field.
HealthcareProvider with credentials acronym soup.
FAQBlock with twelve Q and A pairs at max field count.
NavigationLabel with a 12-character menu label cap.
EmptyShell, a new item with only required fields and nothing optional filled.

Store fixture metadata in git as YAML listing item path, template ID, and which fields each prompt test touches.

fixtures:
  - id: FIX-ART-001
    path: /sitecore/content/QA/SidekickFixtures/Articles/LongForm-A
    template: "{GUID}"   # quoted so YAML does not parse the braces as a flow mapping
    prompts: [headline-v3, teaser-v2, meta-description-v1]
  - id: FIX-PROD-003
    path: /sitecore/content/QA/SidekickFixtures/Products/SKU-Edge-Empty-CTA
    template: "{GUID}"
    prompts: [product-short-desc-v4]

Prompt JSON in Git

Export Sidekick prompt definitions to JSON files under /prompts/sitecore/ in your platform repo. One file per prompt or one file per pack with version field inside.

{
  "promptId": "headline-v3",
  "version": "3.2.0",
  "model": "gpt-4.1",
  "system": "You write headlines for regulated financial content...",
  "userTemplate": "Source notes: {{SourceNotes}}. Max length: {{MaxLength}}.",
  "outputField": "Headline",
  "temperature": 0.3,
  "forbiddenPatterns": ["guaranteed returns", "risk-free"]
}

Tag releases, for example prompt-pack-2026.06.1. The CM deployment script imports the JSON into Sidekick config and writes the tag name to a Sitecore settings item for audit.

Rubric Scoring

Automated LLM output is nondeterministic. Rubric scoring reduces noise. Human reviewers score the first run after a prompt change; automation scores subsequent CI runs against frozen golden outputs where possible.

Sample rubric dimensions (1 to 5 each)

Length compliance: within template max and min.
Factual grounding: no claims not present in source notes.
Tone: matches brand voice guide snippet in prompt.
Field format: plain text vs rich text rules honored.
Accessibility: no “click here”; meaningful link text if a CTA is generated.
Regulatory: no forbidden terms; required disclaimer present when flag set.

Pass threshold: average 4.0 with no dimension below 3. Block promotion if Legal fixture fails regulatory dimension.

Store rubric results in JSON next to prompt version for trend charts. Sporadic 3 on tone is noise; repeated 2 on length compliance is a prompt bug.
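The pass threshold above can be expressed as a small gate function. This is a minimal sketch, not the Sidekick API: the function name, the shape of the scores object, and the reading of "fails regulatory dimension" as a regulatory score below 3 are all assumptions made for illustration.

```javascript
// Hypothetical helper: `scores` maps each rubric dimension to a 1..5 score,
// e.g. { length: 5, grounding: 4, tone: 4, format: 4, accessibility: 4, regulatory: 3 }.
function evaluateRubric(scores, { isLegalFixture = false } = {}) {
  const values = Object.values(scores);
  if (values.length === 0) {
    throw new Error("no rubric scores supplied");
  }
  const average = values.reduce((sum, v) => sum + v, 0) / values.length;
  const lowest = Math.min(...values);

  // Threshold from the text: average of at least 4.0 and no dimension below 3.
  const pass = average >= 4.0 && lowest >= 3;

  // Assumption: a Legal fixture "fails" the regulatory dimension when the
  // score is below 3 or missing, and that alone blocks promotion.
  const blockPromotion = isLegalFixture && !(scores.regulatory >= 3);

  return { pass, average, lowest, blockPromotion };
}
```

Returning the average and lowest score alongside the verdict makes it easy to write the same object into the rubric JSON kept next to each prompt version.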

Forbidden Term Grep

Maintain a forbidden-terms.txt per industry vertical. CI runs a case-insensitive grep across all Sidekick outputs from fixture runs.

# forbidden-terms-pharma.txt
cure
miracle
FDA-approved unless source contains exact approval text
# lines starting with # are comments

Also grep for competitor names unless competitor comparison template flag is true. False positives go in an allowlist file with ticket reference.

Run grep after Sidekick apply simulation, not only on final published HTML. Field level catch is faster to triage.
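If the grep step runs inside a Node-based CI script rather than a shell, a field-level scan might look like the sketch below. The output record shape ({ fixtureId, field, value }) and the allowlist key format "fixtureId:term" are assumptions for illustration. The sketch treats every non-comment line as a literal term and does not interpret conditional rules such as "unless source contains ...", which would need separate handling.

```javascript
// Parse a forbidden-terms file: skip blank lines and lines starting with "#".
function parseTermList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
}

// Escape regex metacharacters so terms are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical record shape: { fixtureId, field, value } per generated field.
// Allowlist entries use the assumed form "fixtureId:term".
function findForbidden(outputs, terms, allowlist = []) {
  const allowed = new Set(allowlist.map((entry) => entry.toLowerCase()));
  const hits = [];
  for (const { fixtureId, field, value } of outputs) {
    for (const term of terms) {
      if (allowed.has(`${fixtureId}:${term}`.toLowerCase())) continue;
      // Word boundaries are a design choice here: unlike a plain substring
      // grep, "cure" will not match inside "secure".
      const pattern = new RegExp(`\\b${escapeRegExp(term)}\\b`, "i");
      if (pattern.test(value)) hits.push({ fixtureId, field, term });
    }
  }
  return hits;
}
```

Each hit carries the fixture ID and field name, which is what makes field-level triage faster than scanning published HTML.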

Template Validation and Max Length Checks

Sitecore template validators enforce max length on Single-Line Text and Multi-Line Text fields. Sidekick must know the limits before the call, not after a save failure.

Pre flight extraction

Serialize template standard values and validation rules.
Build a map from field ID to max length using the Max Length validation rule.
Inject {{MaxLength}} into the prompt user template from the map, not hardcoded in prose.
The post-generation script truncates and logs only in dev; in CI, truncation is a failure.
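The map-and-inject steps can be sketched as two small functions. The input shape (an array of { fieldId, name, maxLength } records already extracted from serialized templates) is an assumption; how you read the Max Length rule out of serialization depends on your setup and is not shown.

```javascript
// Hypothetical input: one record per template field, with maxLength present
// only when a Max Length validation rule applies to that field.
function buildMaxLengthMap(fields) {
  const map = new Map();
  for (const f of fields) {
    if (Number.isInteger(f.maxLength) && f.maxLength > 0) {
      map.set(f.fieldId, f.maxLength);
    }
  }
  return map;
}

// Replace {{Token}} placeholders in a prompt user template.
// Throws on a missing value so a forgotten limit fails loudly instead of
// sending a literal "{{MaxLength}}" to the model.
function renderUserTemplate(template, values) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`missing template value: ${key}`);
    }
    return String(values[key]);
  });
}
```

For example, rendering "Max length: {{MaxLength}}." with the limit looked up from the map keeps the number in one place, so a template change only has to be picked up by the extraction step.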

// Throws when a generated value exceeds the template's max length.
function assertLength(fieldName, value, maxLen) {
  if (value.length > maxLen) {
    throw new Error(`${fieldName} length ${value.length} exceeds ${maxLen}`);
  }
}

Rich text fields need a length check on the HTML-stripped text, separate from the raw HTML length. Otherwise a 200-character plain-text limit may let through 800 characters of markup.
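A rough way to measure the plain-text length of a rich text value is shown below. It is a sketch under stated assumptions: regex tag stripping and a handful of entity replacements are good enough for a CI length assert on well-formed editor output, but they are not a substitute for a real HTML parser.

```javascript
// Approximate visible-text length of a rich text field value.
// Assumption: input is reasonably well-formed HTML from the rich text editor.
function plainTextLength(html) {
  const text = html
    .replace(/<[^>]*>/g, "")   // drop tags
    .replace(/&nbsp;/g, " ")   // decode a few common entities
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&")    // decode &amp; last to avoid double decoding
    .replace(/\s+/g, " ")      // collapse whitespace runs
    .trim();
  return text.length;
}
```

Run the length assert against this number for rich text fields, and against the raw string length for plain text fields.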

CI Ideas

You do not need Sidekick in GitHub Actions calling production models on every PR. Tier the pipeline.

PR validation (no model call)

Validate prompt files against a JSON schema.
Require a legal review label when the diff touches a forbidden term list.
Verify that each prompt's outputField exists on its declared template in the serialized templates.
Require a version bump whenever the content of a promptId changes.
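Two of these checks are easy to sketch without any model call. The code below is a hand-rolled stand-in for a schema validator, not a JSON Schema implementation. The set of field names for the declared template is assumed to come from your serialized templates, and both function names are hypothetical.

```javascript
// Minimal structural check plus the outputField lookup.
// `templateFieldNames` is an assumed Set of field names on the declared template.
function validatePrompt(prompt, templateFieldNames) {
  const errors = [];
  for (const key of ["promptId", "version", "system", "userTemplate", "outputField"]) {
    if (typeof prompt[key] !== "string" || prompt[key].trim() === "") {
      errors.push(`missing or empty "${key}"`);
    }
  }
  if (typeof prompt.version === "string" && !/^\d+\.\d+\.\d+$/.test(prompt.version)) {
    errors.push(`version "${prompt.version}" is not MAJOR.MINOR.PATCH`);
  }
  if (typeof prompt.outputField === "string" && !templateFieldNames.has(prompt.outputField)) {
    errors.push(`outputField "${prompt.outputField}" not found on template`);
  }
  return errors;
}

// True when anything other than `version` changed but `version` did not.
// Note: JSON.stringify is key-order sensitive, so this assumes both files
// keep their keys in the same order.
function requiresVersionBump(before, after) {
  const withoutVersion = ({ version, ...rest }) => JSON.stringify(rest);
  return withoutVersion(before) !== withoutVersion(after) && before.version === after.version;
}
```

In a PR job, `before` would be the prompt file from the base branch and `after` the file from the PR head.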

Nightly (calls QA Sidekick endpoint)

Run full fixture suite against QA CM.
Grep outputs for forbidden terms.
Assert max lengths.
Compare embedding similarity to golden outputs; flag drift below threshold.
Upload rubric JSON artifact to pipeline storage.
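The embedding comparison in the nightly list reduces to cosine similarity between two vectors. The sketch below assumes the embeddings have already been fetched from whichever embedding endpoint you use; the result record shape and the idea of passing the threshold as a parameter are illustrative, and the right threshold has to be tuned on your own golden outputs.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical record shape: { fixtureId, golden: number[], current: number[] }.
// Returns the fixtures whose similarity fell below the chosen threshold.
function flagDrift(results, threshold) {
  return results
    .map((r) => ({ fixtureId: r.fixtureId, similarity: cosineSimilarity(r.golden, r.current) }))
    .filter((r) => r.similarity < threshold);
}
```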

Pre prod promotion gate

Manual approver on prompt pack tag.
Replay nightly artifact from same commit hash.
Import to staging CM; authors smoke test three real items outside fixture branch.

Figure: CI pipeline visualization on screen. Nightly fixture runs catch prompt drift before authors hit publish errors in the Experience Editor (EE).

Release Notes for Authors

Authors ignore git tags. Ship a short release note in Confluence or Sitecore gutter help whenever a prompt pack is promoted.

Prompt pack version and date
Templates affected
Behavior change in plain language: shorter headlines, new disclaimer rule
Fields that are now suggest-only versus auto-apply on approve
Who to ping for bad suggestions

Example: “Product Short Description v4 no longer mentions competitor names. If comparison is required, use Comparison Page template and legal workflow.”

Regression After Template Change

Template changes break prompts silently when field renames occur or validation rules tighten.

Trigger matrix

Change type	Required action
Field rename	Update prompt JSON outputField; rerun full fixture suite
Max length decrease	Rerun length asserts; adjust prompt system instructions
New required field	Add fixture item; add prompt or explicit skip doc
Template split	Map prompts to new templates; retire old IDs
Workflow state change	Verify Sidekick apply still allowed in Draft vs AI Review

Hook Sitecore PowerShell or the Sitecore CLI into the template serialization PR in git. If anything under /sitecore/templates/... changes, CI fails unless a prompts/CHANGELOG.md entry exists.
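One way to wire that gate is to feed the list of changed files from the PR diff into a check like the one below. This is a sketch with assumptions: the on-disk marker "sitecore/templates/" depends on how your serialization folders are laid out, and "entry exists" is approximated as prompts/CHANGELOG.md being modified in the same change set.

```javascript
// `changedFiles` is assumed to be the output of something like
// `git diff --name-only` against the base branch, split into lines.
function changelogGate(
  changedFiles,
  { templatePathMarker = "sitecore/templates/", changelogPath = "prompts/CHANGELOG.md" } = {}
) {
  // Normalize Windows separators so the check behaves the same on any agent.
  const normalized = changedFiles.map((f) => f.replace(/\\/g, "/"));
  const templateChanges = normalized.filter((f) => f.includes(templatePathMarker));
  const changelogTouched = normalized.includes(changelogPath);
  return {
    ok: templateChanges.length === 0 || changelogTouched,
    templateChanges,
  };
}
```

A CI step can print `templateChanges` and exit non-zero when `ok` is false, which tells the author exactly which template files triggered the requirement.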

Operational Tips

Keep golden outputs for Legal and Healthcare fixtures only; refresh golden when legal approves new baseline wording.
Log promptId, pack version, and model name on every Sidekick apply in item history custom field.
Rate limit nightly runs to avoid QA model quota exhaustion.
Rotate fixture source notes when real production phrasing evolves; stale fixtures miss new failure modes.

Scoring Sidekick Output with Human in the Loop

Automation handles length and forbidden terms. Tone and factual grounding often need spot checks until rubric weights stabilize. Assign two reviewers per vertical for the first prompt pack version: one from marketing, one from compliance. They each score 10 fixture outputs using the rubric spreadsheet. Export the scores to CSV and compute inter-rater reliability. Low agreement on “factual grounding” means the prompt system instruction is ambiguous, not that reviewers are wrong.
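The text does not name a reliability statistic. One common choice for two raters is Cohen's kappa, sketched below on the assumption that both reviewers scored the same fixture outputs in the same order for a single rubric dimension. Unweighted kappa treats a 4 versus 5 disagreement the same as 1 versus 5; a weighted variant may suit ordinal 1 to 5 scores better.

```javascript
// Cohen's kappa for two raters over the same items (unweighted).
// raterA[i] and raterB[i] are the two scores given to item i.
function cohensKappa(raterA, raterB) {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("both raters must score the same non-empty set of items");
  }
  const n = raterA.length;
  const countsA = new Map();
  const countsB = new Map();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (raterA[i] === raterB[i]) agreements++;
    countsA.set(raterA[i], (countsA.get(raterA[i]) || 0) + 1);
    countsB.set(raterB[i], (countsB.get(raterB[i]) || 0) + 1);
  }
  const observed = agreements / n;
  // Expected agreement by chance, from each rater's marginal distribution.
  let expected = 0;
  for (const [category, countA] of countsA) {
    expected += (countA / n) * ((countsB.get(category) || 0) / n);
  }
  if (expected === 1) return 1; // both raters used one identical category throughout
  return (observed - expected) / (1 - expected);
}
```

A kappa near 1 means the reviewers agree well beyond chance; a value near 0 on a dimension is the signal to revisit the system instruction before tuning anything else.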

After three pack versions, shift to sampling: nightly CI fails hard on mechanical rules; human review runs weekly on 5 random fixture outputs. Escalate to full suite review when template or forbidden term list changes.

Prompt Diff Review in Pull Requests

Git diffs on prompt JSON should be readable by legal without an IDE. Use these pull request template sections:

Prompt IDs changed
Templates affected
Reason for change
Fixture test evidence link
Legal label required yes or no

For system instruction changes over 200 characters, attach side-by-side before and after output for Legal fixture FIX-LEG-001. This prevents “we only changed one word” surprises in prod.

Multi Language and Locale Fixtures

If Sitecore language versions share prompts, add fixture items for fr-CA and es-US with diacritics and length expansion cases. German compound words break max length asserts tuned on English. Forbidden term lists must be per locale file. CI runs grep per language output file, not one mixed blob.

fixtures/i18n:
  - id: FIX-ART-FR-001
    language: fr-CA
    path: /sitecore/content/QA/SidekickFixtures/Articles/LongForm-FR
    prompts: [headline-v3-fr]

Handling Model Upgrades

Azure OpenAI deployment upgrades do not follow your prompt pack's semver. When the platform team bumps the model version, rerun the full fixture suite even if the prompt JSON is unchanged. Store the model deployment ID in the nightly artifact. Compare rubric scores week over week; a shift of more than 0.3 in the average tone score means revisiting temperature or the few-shot examples embedded in the user template.
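The week-over-week comparison is simple arithmetic on the stored rubric scores. A minimal sketch, assuming each week's tone scores are available as a plain array and treating a change in either direction as drift:

```javascript
// Arithmetic mean of a non-empty array of scores.
function meanScore(values) {
  if (values.length === 0) throw new Error("no scores to average");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compare this week's average tone score to last week's.
// The 0.3 default comes from the text; direction-agnostic drift is an assumption.
function toneDrift(previousWeek, currentWeek, threshold = 0.3) {
  const delta = meanScore(currentWeek) - meanScore(previousWeek);
  return { delta, drifted: Math.abs(delta) > threshold };
}
```

Storing the model deployment ID next to each week's scores lets you tell a drift caused by a model upgrade apart from one caused by stale fixtures.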

Failure Triage Playbook

Identify failing fixture ID and prompt version from CI artifact.
Reproduce manually in QA CM with same source notes field values.
Classify: mechanical (length, grep), model drift (rubric), template bug (wrong field), Sidekick service error.
Mechanical: fix prompt or validation map; no legal needed.
Model drift: adjust system instruction or pin previous deployment; notify editorial.
Template bug: platform fix; rerun suite after serialization merge.

Integration with Release Train

Align prompt pack tags with the fortnightly CM deploy. Never import new prompts on Friday without hypercare coverage. Sequence: merge the prompt PR, wait for a green nightly, import to staging CM, send the author advisory, import to prod CM during the window, then smoke test three live templates post deploy. Rollback means redeploying the previous tag, not hand-editing CM fields.

Cost and Quota Management

Nightly fixture runs consume model tokens. Estimate cost per run: fixture count times average output tokens times model price. Cap concurrent Sidekick calls in QA to avoid throttling that produces false failures. Track monthly spend, and alert when 80 percent of the budget is consumed by mid-month, for example because someone duplicated fixtures without cleanup.
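The estimate and the budget alert are both one-line calculations. The sketch below follows the formula in the text, which counts output tokens only; input tokens would add to the real bill. The per-thousand-token price unit and the function names are assumptions, and no actual model prices are implied.

```javascript
// Cost of one nightly run, following the formula in the text.
// Price is expressed per 1,000 tokens (an assumed unit; adjust to your billing).
function estimateRunCost(fixtureCount, avgOutputTokens, pricePerThousandTokens) {
  return (fixtureCount * avgOutputTokens * pricePerThousandTokens) / 1000;
}

// True once spend reaches the alert fraction of the monthly budget.
function budgetAlert(spentSoFar, monthlyBudget, alertFraction = 0.8) {
  return spentSoFar >= monthlyBudget * alertFraction;
}
```

Multiplying the per-run estimate by the number of nightly runs in a month gives a baseline to compare against tracked spend, so a sudden jump points at duplicated fixtures rather than a price change.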

Author Sandbox vs Fixture Branch

Authors want to play with Sidekick before prod promotion. Provide a sandbox site under QA where they may experiment without polluting fixture items. Sandbox items are excluded from CI grep suites. Clear sandbox weekly. Authors must not copy sandbox wording directly to production legal templates without workflow. Fixture branch stays machine owned and stable.

Documenting Skipped Fields

Not every field gets Sidekick. Maintain prompts/SKIPPED.md listing field IDs with a reason: regulatory lock, Connect owned, computed field, or deprecated template. CI fails if a new prompt targets a skipped field without a SKIPPED.md update and an approver label. This prevents orphan prompts after template deprecation.

Measuring Prompt Quality Over Time

Track rubric dimension averages per promptId across nightly runs in a simple time series database or spreadsheet. When headline-v3 tone score drops from 4.2 to 3.5 over two weeks without prompt change, suspect model deployment drift or fixture source notes that no longer represent production inputs. Refresh fixtures before tuning prompts. Teams that only react to publish failures miss slow drift that erodes brand voice until marketing escalates.

Publish a monthly internal quality report: top three failing fixtures, forbidden term hits count, mean length overrun attempts blocked. Share with editorial leads so they trust the gate is working, not hiding Sidekick from them arbitrarily.

Pair quality metrics with an author feedback loop: when rubric tone scores slip, an editorial workshop beats prompt hacking in isolation. The platform team provides data; marketing owns the voice guide updates reflected in the next prompt pack semver.

Store nightly CI artifacts for at least one year even when sandbox body retention is 90 days. Metadata only artifacts prove control operation for auditors without retaining full model outputs indefinitely.

Closing Checklist

Fixture branch /sitecore/content/QA/SidekickFixtures/ exists with ten item types seeded
Fixture metadata YAML in git linked to template and prompt IDs
Prompt JSON exported to repo with version field and git tags per pack
Rubric defined with pass threshold and Legal fixture gate
Forbidden term lists per vertical with CI grep on nightly outputs
Max length map generated from template validation rules; CI asserts on all fixture fields
PR pipeline validates JSON schema and template field references without model calls
Nightly QA Sidekick fixture run publishes artifacts and rubric scores
Author release note template ready for each prompt pack promotion
Template change trigger matrix wired to prompt CHANGELOG requirement