Operator dashboard for the live Spaceduckling stack.

Incident mode active · Auto-refresh paused for operator triage.
Change freeze active
Governance actions are visually frozen while this operator hold is active.
Reason and freeze window pending.
Operator anomalies detected
  1. Waiting for live status snapshot…
Loading…
Registered: Loading live account metrics…
Certified: Loading hatch completion metrics…
Bonded: Loading bonded spaceduck metrics…
Galaxy: Loading live galaxy version…

Audit Activity · Last 24h

Loading…

Live 24h audit volume from POST /beak/audit, with explicit fallback when a category is not yet emitted by the backend.

Signups: Waiting for audit signal…
Certs: Waiting for audit signal…
Pecks: Waiting for audit signal…
Reports: Waiting for audit signal…
Mission Control loading

Live system posture

Loading Beak system status…

Auth surface: Checking auth routes…
Runtime state: Checking lambda metadata…

Operator Freshness

Loading…

Age of the operator-side caches that drive fast triage: the latest status snapshot, audit baseline, latency log, and fleet cache.

Status Checked: — · Audit Baseline: — · Latency Log: — · Fleet Cache: —
Status snapshot: Waiting for live status…
Audit baseline: Waiting for local baseline…
Latency log: Waiting for local probes…
Fleet cache: Waiting for fleet cache…

Auth Status

Loading…

Live from GET /beak/system/status.

Backend: GET /beak/system/status · Fields: auth.cognito_pool, auth.ses_sender, auth.signout/signup/verify/phone_verify/forgot_reset
Cognito pool ID
SES sender
signout signup verify phone_verify forgot_reset
Health

Communication Status

Loading…

Live SES/SNS provider posture from GET /beak/system/status.

Backend: GET /beak/system/status · Fields: ses.sandbox, sns.sandbox
SES: Loading…
SNS: Loading…
Health

Sandbox Exit Readiness

Loading…

Live SES/SNS exit blockers from GET /beak/system/status — exact next operator action.

Backend: GET /beak/system/status · Fields: ses.sandbox, ses.daily_quota, sns.sandbox, sns.monthly_spend_limit
SES mode
SES daily quota
SNS mode
SNS spend limit
Daily Quota Usage
Next SES action: checking…
Next SNS action: checking…

Webhook Health

Operator-set

EventBridge wiring currently documented from deployed infra.

EventBridge bus: spaceduck-events
Status: Connected
Events: duck.hatched, duck.bonded, duck.pulse, duck.unpecked
Logs: CloudWatch /spaceduck/events (90-day retention)
Health: Event bus connected

Runtime State

Loading…

Live Lambda metadata from system/status.lambda.

Backend: GET /beak/system/status · Fields: lambda.version, lambda.runtime, lambda.memory_mb, lambda.timeout_s, lambda.function, lambda.last_modified
Function
Runtime
Memory
Timeout
Region: us-east-1
Online since: last deploy
Current UTC
Uptime: Live
Concurrency
Health

Agent Fleet

Loading…

Live bonded agent telemetry from the production status route.

Backend: GET /beak/system/status · Fields: agents.alive, agents.slow, agents.dead, agents.total_bonded
Alive
Slow
Dead
Total bonded
Health

Database Posture

Loading…

Live DynamoDB table counts from the system status route.

Backend: GET /beak/system/status · Fields: database.* (eggs, ducklings, birth_certificates, …) + GET /beak/metrics for growth strip
Loading table posture…
Ducklings
Certs issued
Connections
Peck requests
Health

DynamoDB Billing

Loading…

Live DynamoDB billing modes from system/status.

Backend: GET /beak/system/status · Fields: database_billing.*
Loading billing modes…
Health

Pending Peck Requests

Loading…

Live from GET /beak/metrics.

Backend: GET /beak/metrics · Fields: spaceducks_bonded
Spaceducks bonded
History: Peck request history available via /beak/audit
Health

🔗 Peck Protocol

Loading…

Live Step Functions execution counts from peck-approval-workflow.

Backend: GET /beak/system/status · Fields: peck_protocol.running, peck_protocol.succeeded, peck_protocol.failed
🔄 Running
✅ Succeeded
❌ Failed

Lambda Cost Estimator

Loading…

Projected monthly cost from invocation count and duration.

Backend: POST /beak/audit (for invocation count), GET /beak/system/status (for memory)
Monthly Invocations
Average Duration: — ms
Configured Memory: — MB
Projected Cost: $— est.
Excludes free tier. Prices for us-east-1.
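
The projection described on this card can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the function name and the two price constants are assumptions (the constants are the commonly published on-demand x86 rates for us-east-1 and should be checked against current AWS pricing), and the free tier is ignored, as the card states.

```python
# Assumed on-demand rates for x86 Lambda in us-east-1; verify before relying on them.
PRICE_PER_REQUEST = 0.20 / 1_000_000   # USD per invocation
PRICE_PER_GB_SECOND = 0.0000166667     # USD per GB-second of compute


def projected_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Return a projected monthly cost in USD, excluding the free tier."""
    # Compute time billed = invocations x seconds per call x memory in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    return invocations * PRICE_PER_REQUEST + gb_seconds * PRICE_PER_GB_SECOND


if __name__ == "__main__":
    # Hypothetical inputs: 1M invocations, 100 ms average, 1024 MB configured.
    print(f"${projected_monthly_cost(1_000_000, 100, 1024):.2f} est.")
```

The invocation count would come from the audit route and the memory size from lambda.memory_mb, per the Backend line on this card.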

Peck Failure Analysis

Loading…

Analysis of failed peck requests from the Step Functions workflow, split into denied, expired, and callback-error buckets.

Backend: GET /beak/system/status · Fields: peck_protocol.failed, peck_protocol.succeeded, peck_protocol.failure_breakdown.*
Failed Pecks
Success Rate: —%
Failure Rate: —%
Denied —%
Expired —%
Callback errors —%
View audit →
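
The rates on this card can be derived from the peck_protocol counters roughly as sketched here. The bucket key names (denied, expired, callback_error) are hypothetical, since the page only lists the field as peck_protocol.failure_breakdown.*; each bucket is expressed as a share of all failures.

```python
def peck_failure_summary(succeeded: int, failed: int, breakdown: dict[str, int]) -> dict[str, float]:
    """Return success/failure rates plus each failure bucket's share, in percent."""

    def pct(part: int, whole: int) -> float:
        # Guard against an empty denominator when there is no data yet.
        return round(100.0 * part / whole, 1) if whole else 0.0

    total = succeeded + failed
    summary = {
        "success_rate": pct(succeeded, total),
        "failure_rate": pct(failed, total),
    }
    # Hypothetical bucket names; adjust to whatever failure_breakdown actually exposes.
    for bucket in ("denied", "expired", "callback_error"):
        summary[bucket] = pct(breakdown.get(bucket, 0), failed)
    return summary


if __name__ == "__main__":
    print(peck_failure_summary(90, 10, {"denied": 5, "expired": 3, "callback_error": 2}))
```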

EventBridge Health

Loading…

Health of the EventBridge bus, based on the timestamp of the last event from /beak/audit.

Backend: POST /beak/audit · Fields: entries[0].timestamp
Last event
Health

Lambda Config Audit

Loading…

Configuration of the deployed Lambda function from /beak/system/status.

Backend: GET /beak/system/status · Fields: lambda, lambda_alias_version, lambda_alias_parity
Lambda Version
Alias Version
Parity
Health

Alias Drift Investigation

Loading…

Direct prod alias investigation: function version, alias version, explicit parity verdict, and exact remediation command.

Prod alias target (get_alias)
Invoked version (context)
Reported lambda.version
Parity verdict
Status generated at
Status age
Freshness: Loading…
Remediation
Last checked: —

Deploy Readiness Checklist

Loading…

Pre-promotion checklist comparing prod alias state, reported status version, sandbox posture, and the latest deploy log snapshot.

Prod alias parity: Waiting for status…
Status freshness: Waiting for generated timestamp…
Sandbox blockers: Waiting for provider posture…
Recent deploy log: Waiting for static log snapshot…
Recent deploy entries will appear here.
Last checked: —

Route Latency Strip

Loading…

Live fetch durations for the three operator-critical routes with green / amber / red thresholds.

GET /beak: Waiting for probe…
GET /beak/metrics: Waiting for probe…
GET /beak/system/status: Waiting for probe…
Last checked: —
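
A green / amber / red classification for a probe could look like the sketch below. The red cut-off of 1000 ms mirrors the ">1000ms" rule in the Operator Drill Checklist; the 500 ms amber cut-off and the function name are assumptions, because the page does not state the amber threshold.

```python
AMBER_MS = 500   # hypothetical amber threshold; not stated on the page
RED_MS = 1000    # the drill checklist treats any route >1000ms as degraded


def latency_colour(duration_ms: float, ok: bool = True) -> str:
    """Classify one probe result; a failed request is always red."""
    if not ok or duration_ms > RED_MS:
        return "red"
    if duration_ms > AMBER_MS:
        return "amber"
    return "green"


if __name__ == "__main__":
    for route, ms in [("/beak", 120), ("/beak/metrics", 750), ("/beak/system/status", 1500)]:
        print(route, latency_colour(ms))
```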

Operator Watchlist

0 pinned

Pin up to 5 duckling IDs or cert IDs in localStorage for one-click support lookup.

No watchlist items pinned yet.

Incident Mode

Inactive

Freeze auto-refresh during active incidents, highlight critical cards, and stamp the UTC incident start time in localStorage.

Start time: —

Pending Operator Actions

Loading…

Unresolved items from sd_todos and sd_gov_actions localStorage queues.

Data Freshness Scoreboard

Loading…

Age of the last successful system/status, metrics, and audit fetch with stale thresholds.

System status: Waiting for fetch…
Metrics: Waiting for fetch…
Audit: Waiting for fetch…

Operator Handoff Bundle

Ready

Export a markdown handoff with build stamp, drift state, sandbox state, peck failure stats, and local shift notes.

Includes shift log localStorage notes automatically.

System Health Score

Loading…

Computed client-side from GET /beak/system/status.

Backend: GET /beak/system/status · Fields: agents.dead, agents.slow, database.ducklings + POST /beak/audit for recent event count
— / 100
Waiting for live data…

Cert Pipeline

Loading…

Live system throughput from eggs to bonded spaceducks. → Cert Monitor

Backend: GET /beak/system/status · Fields: database.eggs/ducklings/birth_certificates, agents.total_bonded + GET /beak/metrics for trend
Eggs
Ducklings
Certified
Spaceducks
Total certs issued: Waiting for metrics…
Last cert issued: Timestamp pending…
Trend: Weekly movement pending…

Cert Issuance Latency

Loading…

Last-issued age plus rolling 24h certificate volume from live metrics and audit signals.

Backend: GET /beak/metrics · Fields: last_cert_issued_at (+ fallback POST /beak/audit for duck.cert_issued / duck.hatched events)
Last issued age: Waiting for issuance telemetry…
Issued in last 24h: Waiting for audit volume…
Latest cert timestamp: Using /beak/metrics when present, then audit fallback.

Mission Control API

Loading…

Request health for the operator-facing APIs used by this page.

Backend: GET /beak/system/status (latency + fields), GET /beak/metrics, POST /beak/audit — all three fetched each refresh cycle
Status route: /beak/system/status
Metrics route: /beak/metrics
Audit route: /beak/audit
Observed latency
Health

Governance Actions

Manual approval required

Production changes move through the governance lane: open a request ID, record the target version cycle, and capture approval before any alias change. Frozen surfaces still require T-JOSH sign-off. See GOVERNANCE-LOG.md for the full audit trail.

Lambda Environment Diff

Baseline pending

Local operator diff of known Lambda environment variable names only — no values are surfaced. Baseline is stored in this browser for additive drift awareness.

Added since previous snapshot
Capturing baseline…
Removed since previous snapshot
None recorded
Current known env var names
Loading…
Last env-name change: Baseline pending
Snapshot source: localStorage missionControlEnvVarSnapshotV1
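
The name-only diff amounts to two set differences between the stored snapshot and the current list. A minimal sketch, assuming both are available as plain lists of variable names (the function name and the example variable names are illustrative):

```python
def diff_env_names(previous: list[str], current: list[str]) -> dict[str, list[str]]:
    """Compare two snapshots of env var names; values are never involved."""
    prev, curr = set(previous), set(current)
    return {
        "added": sorted(curr - prev),     # present now, absent in the baseline
        "removed": sorted(prev - curr),   # present in the baseline, absent now
    }


if __name__ == "__main__":
    # Hypothetical variable names, for illustration only.
    baseline = ["TABLE_NAME", "SES_SENDER", "STAGE"]
    latest = ["TABLE_NAME", "SES_SENDER", "EVENT_BUS"]
    print(diff_env_names(baseline, latest))
```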

Alert Thresholds

Reference

Static operator thresholds for quick review in the control surface.

API error rate: alert if > 5%
Lambda memory: alert if > 100MB
Cert age: alert if > 60 days
Peck failures: alert if > 10 in 24h
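
These four thresholds could be checked against a metrics snapshot as in this sketch. The dictionary keys are hypothetical names chosen for illustration; only the numeric limits come from the card.

```python
# Limits copied from the card; the key names are illustrative assumptions.
THRESHOLDS = {
    "api_error_rate_pct": 5,
    "lambda_memory_mb": 100,
    "cert_age_days": 60,
    "peck_failures_24h": 10,
}


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics strictly above their alert threshold."""
    return [name for name, limit in THRESHOLDS.items() if metrics.get(name, 0) > limit]


if __name__ == "__main__":
    print(breached({"api_error_rate_pct": 6.2, "lambda_memory_mb": 100, "peck_failures_24h": 11}))
```

A value exactly at the limit does not alert, matching the "alert if >" wording.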

Failure Budget

Within budget

24h SLA targets — Galaxy 1.1

Auth flow (signup/signin): 99.9% target · est. OK
Cert issuance: 99.5% target · est. OK
Peck approval flow: 99.0% target · est. OK
SMS delivery: sandbox only · limited

Manual review — update after incidents

CloudWatch Alarms

No alarms

Configured alarm thresholds (checked manually via AWS Console)

Lambda errors > 5%: OK
Lambda duration > 5s: OK
API Gateway 5xx > 1%: OK
DynamoDB throttle: OK

Last verified: 2026-03-21 · Update manually after incidents

Recent Deploys

Static snapshot

Last 5 Lambda deployments

2026-03-21 05:40 UTC · v39 · prod
2026-03-21 05:33 UTC · v38 · prod
2026-03-21 05:31 UTC · v37 · prod
2026-03-21 05:24 UTC · v36 · prod
2026-03-21 05:18 UTC · v35 · prod

Update this card after each deploy. Refresh it periodically because this is a one-time static snapshot from coordination/DEPLOY-LOG.md.

Lambda Version History

Static snapshot

Last 5 prod promotions from coordination/DEPLOY-LOG.md.

v39 | 2026-03-21 05:40 UTC | DG-033 /beak/pageview route added
v38 | 2026-03-21 05:33 UTC | DC-038 lambda_alias_version + SD-034/035/038 prior batch merged
v37 | 2026-03-21 05:31 UTC | DC-038 lambda_alias_version in status response
v36 | 2026-03-21 05:24 UTC | SD-034 AgentMail cert email wired into /beak/cert/issue
v35 | 2026-03-21 05:18 UTC | birth certificate issuance now captures ip_address, user_agent, and country metadata at issuance time; prod alias promoted to v35

Lambda Deploy Timeline

Loading…

Last 10 Lambda deploys from sd_deployment_log localStorage. Latest version highlighted amber. View diff viewer →

Auth Dependencies

Loading…

Mixed live + operator-confirmed auth provider posture.

Cognito
Turnstile: ✓ Production keys active
AgentMail: ✓ Inbox configured
SES

Surface Audit

Documented

DC-019 audit: nothing below is decorative.

Live sections: Metrics strip, hero summary, Auth Status, Runtime State, Agent Fleet, Database Posture, Pending Peck Requests, Mission Control API, Recent Events.
Static sections: Communication Status and Webhook Health are explicit operator-truth tiles until dedicated API fields exist. Quick links are utility navigation, not health tiles.
Removed: Decorative/dead-weight tiles from the previous layout were replaced by real data surfaces or explicit static operator state.

Recent Events

Loading…

Last five audit entries from POST /beak/audit.

Backend: POST /beak/audit (body: {}) · Fields: entries[].event_type, entries[].timestamp
Loading recent events…

Route Health

Checking

Live probe of key API routes

Backend: direct HTTP probes — GET /beak, GET /beak/metrics, GET /beak/system/status — independent of main refresh cycle
GET /beak
GET /beak/metrics
GET /beak/system/status
Last checked: —

Recent Activity

Loading…

Last five entries from GET /beak/system/status recent_audit when present.

Loading recent activity…

Cert Inventory

Loading…

Duckling and certificate state distribution computed from live database counts.

Backend: GET /beak/system/status · Fields: database.eggs, database.ducklings, database.birth_certificates, agents.total_bonded
Certs issued
Ducklings pending cert
Unverified eggs
Bonded (fully certified)
Completion rate: certs / ducklings
Pending backlog: ducklings awaiting cert
Pipeline yield: eggs → bonded ratio
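
The three derived figures can be computed from the four live counts as sketched below. Two readings are assumptions: the backlog is taken as ducklings minus certs, and pipeline yield as bonded divided by eggs; the function and parameter names are illustrative.

```python
def cert_inventory(eggs: int, ducklings: int, certs: int, bonded: int) -> dict[str, float]:
    """Derive completion rate, pending backlog, and pipeline yield from raw counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Avoid division by zero before any rows exist.
        return round(numerator / denominator, 3) if denominator else 0.0

    return {
        "completion_rate": ratio(certs, ducklings),     # certs / ducklings
        "pending_backlog": max(ducklings - certs, 0),   # assumed: ducklings without a cert
        "pipeline_yield": ratio(bonded, eggs),          # assumed: bonded / eggs
    }


if __name__ == "__main__":
    print(cert_inventory(eggs=200, ducklings=100, certs=80, bonded=50))
```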

Operator Acknowledgements

0 / 4 acknowledged

Pre-shift checklist — tick each item before making production changes. State persists across refresh with UTC acknowledgement stamps until reset.

Confirmed current deploy matches expected version and behaviour. Acknowledged at: —
Previous known-good alias version identified and rollback command documented. Acknowledged at: —
CloudWatch alarms checked; no unexpected alerts present. Acknowledged at: —
Database table counts and cert pipeline values look expected. Acknowledged at: —

Persisted in localStorage only. For long-term records, log to GOVERNANCE-LOG.md.

Operator Notes

Shift Handoff

Standing shift context and next priorities. Source of truth: coordination/OPERATOR-NOTES.md.

Current Shift Notes
Galaxy 1.1 Beta is live. SES and SNS remain in sandbox mode — production exit requests are pending.
Lambda prod alias is at v30 ($LATEST drift expected; check drift alarm on load).
Cert pipeline: eggs → ducklings → certs → bonded. Monitor Cert Issuance Latency tile for any backlog.
Peck flow is the primary audit event source. Signups and certs do not yet emit audit events.
CloudWatch alarm review required before any production alias promotion.
Standing Operator Actions
1. SES: Request production access via AWS SES console → Sending Statistics → Request Production Access.
2. SNS: Request SMS sandbox exit via AWS SNS console → SMS Sandbox → Request exit.
3. After each deploy: update coordination/DEPLOY-LOG.md and record version + alias promotion.
4. After incidents: log to coordination/GOVERNANCE-LOG.md with request ID + approver.
Next Priorities
Wire signup/cert audit events in Lambda so the Audit Activity strip shows real counts.
Expose agents.stale_list in /beak/system/status for the Stale Agent Spotlight card.
Expose last_cert_issued_at in /beak/metrics to enable real cert latency display.
Source: coordination/OPERATOR-NOTES.md · Last updated: 2026-03-21 UTC · Batch DC-055–DC-057

Operator Shift Log

Ready

Local shift notes for handoff continuity. Saves the latest three UTC-stamped notes to your browser only.

Newest first · UTC timestamps

Last saved: never

Stale Agent Spotlight

Loading…

Dead or never-pulsed agent summary from GET /beak/system/status. Individual agent records are shown when the API exposes them; otherwise summary counts are rendered.

Backend: GET /beak/system/status · Fields: agents.dead, agents.alive, agents.slow, agents.total_bonded, agents.stale_list (when available)
Loading agent health…

⌨️ Operator Command Palette

One-click copyable diagnostics for rapid operator investigation — curl and aws CLI commands pre-filled for prod.

Commands pre-loaded · Click Copy to grab

📊 Metrics Delta Strip

Waiting…

Change since last refresh — green for growth, red for drops, grey for no change.

Last checked: —
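
The colour rule for this strip reduces to a sign check on each metric's change between two polls. A minimal sketch, assuming both polls are flat dictionaries of numeric metrics (the names here are illustrative):

```python
def delta_colour(old: float, new: float) -> str:
    """Green for growth, red for a drop, grey for no change."""
    if new > old:
        return "green"
    if new < old:
        return "red"
    return "grey"


def metric_deltas(previous: dict[str, float], current: dict[str, float]) -> dict[str, tuple[float, str]]:
    """Return (change, colour) for every metric present in both polls."""
    return {
        name: (current[name] - previous[name], delta_colour(previous[name], current[name]))
        for name in current
        if name in previous
    }


if __name__ == "__main__":
    print(metric_deltas({"eggs": 10, "ducklings": 5}, {"eggs": 12, "ducklings": 4}))
```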

⚠️ Degraded-Service Impact Matrix

Loading…

Translates live sandbox state, dead-agent count, and peck failures into plain-English operator risk.

Last checked: —

🔗 Connection Pressure

Loading…

Connections per bonded agent, peck pending vs failed ratio, and overload thresholds.

Last checked: —

🔄 What Changed Since Last Refresh

Metrics that moved on the latest live poll with old → new values.

Waiting for first refresh comparison…

📋 Route Failure Journal

Last route probe failures persisted in localStorage — UTC stamp, endpoint, status code.

No route failures recorded yet.

🕐 Incident Timeline

Chronological handoff view combining deploy-log tail, recent audit events, and operator notes.

📦 Incident Handoff Pack v2

Exports diagnostics JSON, anomaly summary, operator notes, route health, and parity state as a downloadable Markdown + JSON pair.

🚨 Operator Drill Checklist

Incident triage, alias verification, route probe review, and rollback preparation steps.

Step 1 — Confirm live status: Run curl .../beak/system/status and verify lambda.version matches prod alias via aws lambda get-alias.
Step 2 — Check route health: Use Route Latency strip — any route >1000ms or error indicates a degradation. Check Route Failure Journal for recent failures.
Step 3 — Review anomaly summary: Check the Anomaly Banner and Impact Matrix. Classify as Degraded / Partial Outage / Full Outage.
Step 4 — Rollback readiness: Check Deploy Readiness Checklist. If rollback needed: list versions with aws lambda list-versions-by-function, then use Promote Alias command from palette.
Step 5 — Handoff: Export Incident Handoff Pack v2 (Markdown + JSON) for shift handoff or incident ticket. Save a Shift Log note with current state.

📈 Route Latency History

Last few probe latencies per route — movement over time for degradation detection.

No latency history yet — waiting for route probes.

🕵️ Stale Data Detector

Checking…

Cards that have not refreshed within the last 120 seconds are flagged here so operators don't mistake cached UI for live truth.

Waiting for first refresh…
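
The 120-second rule can be expressed as a filter over per-card refresh timestamps. A minimal sketch, assuming each card records its last refresh as epoch seconds (the mapping and function name are illustrative):

```python
import time

STALE_AFTER_S = 120  # threshold stated on the card


def stale_cards(last_refreshed: dict[str, float], now: float) -> list[str]:
    """Return the cards whose last refresh is older than the stale threshold."""
    return sorted(card for card, ts in last_refreshed.items() if now - ts > STALE_AFTER_S)


if __name__ == "__main__":
    now = time.time()
    # Hypothetical refresh times: one card 150 s old, one 50 s old.
    print(stale_cards({"Auth Status": now - 150, "Agent Fleet": now - 50}, now))
```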

🚧 Operational Blockers

Loading…

Precise operational constraints for SES, SNS, cert pipeline, and GitHub — not vague waiting states.

Last checked: —

🔄 Board / Runtime Reconciliation

Compare board task state against live runtime truth — exports a compact Markdown handoff for operator review.

Cold-start trend

Last 24h cold-starts

SES Exit Readiness

Checklist for leaving AWS SES sandbox

—/6

SES sandbox exit is pending AWS review

DynamoDB Tables

Table health sourced from sd_dynamo_status localStorage key.

DB Write Activity

Last 10 audit log entries from sd_audit_cache localStorage. Auto-refreshes every 30s.

No recent DB activity

Lambda Memory

Local log

Estimated MB per invocation — last 10 data points from this session.

Latest: — MB ⚠ Amber threshold: 100 MB
Sourced from localStorage missionControlMemoryLog

Lambda Timeout Risk

Estimating…

Estimated p95 execution time from sd_latency_log vs the 30s Lambda limit.

p95 latency
0s · 20s · 25s · 30s limit
Timeout risk: LOW — No latency data yet
Sourced from localStorage sd_latency_log · 30s Lambda limit

Pipeline Yield

Loading…

Certified / total pipeline entries — from /beak/system/status.

— certified / — total Trend: —
Snapshot: —

Peck Approval Timeline

Loading…

Last 5 peck outcomes (approved / denied / expired) from POST /beak/audit with UTC timestamps and outcome reason.

Loading peck outcomes…
Last checked: — View full peck history →
Backend: POST /beak/audit (body: {}) · Filters: duck.peck_approved, duck.peck_denied, duck.peck_expired

Runtime Anomaly History

Last 10 detected anomalies (alias drift, dead fleet, peck failures) logged on each page load. Stored in missionControlAnomalyLog localStorage.

No anomalies logged yet.
Source: derived from live status + metrics each refresh cycle · key: missionControlAnomalyLog

Operator Todos

0 / 0 done

Persistent checklist for this operator session. Stored in localStorage.

Loading…

Recent Emails

Last 5 cert issuance and signup emails sent via SES. Email addresses are masked for privacy.

⚡ Lambda Cold Start Tracker

Outlier detections from sd_latency_log — responses >3s flagged as cold start candidates

Scanning…
Source: sd_latency_log · threshold: 3000ms · wired to main refresh cycle · Deployment Comparison →

🎓 Cert Issuance Rate

Live certs/hour and certs/day from audit_log delta snapshots · 7-day trend

Computing…
Certs / Hour
Certs / Day
7-Day Trend
Collecting baseline data…
Source: sd_cert_rate_log delta snapshots · wired to mc:status · min 2 snapshots required

⚡ Lambda Invocation Trend (7-day)

Daily invocation counts from localStorage.sd_invocation_log. Amber at 2× baseline · Red at 5× baseline.

Today
invocations
7-day total
Baseline
avg/day
Loading…
Collecting data — check back once sd_invocation_log has entries.
Normal · Amber: ≥2× baseline · Red: ≥5× baseline
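
The amber and red rules for this card can be sketched as a comparison of today's count against the average of earlier days. Treating the baseline as a simple mean of the logged daily counts is an assumption, as is the "collecting" state returned when no baseline exists yet.

```python
def invocation_status(today: int, history: list[int]) -> str:
    """Compare today's invocations with the average of previous daily counts."""
    if not history:
        return "collecting"
    baseline = sum(history) / len(history)  # avg/day over the logged days
    if baseline == 0:
        return "collecting"
    if today >= 5 * baseline:
        return "red"
    if today >= 2 * baseline:
        return "amber"
    return "normal"


if __name__ == "__main__":
    # Hypothetical week of daily counts.
    print(invocation_status(260, [100, 120, 90, 110, 105, 95, 100]))
```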

📬 SES/SNS Sandbox Exit Readiness

0% complete

Step-by-step checklist for requesting SES production access and SNS sandbox exit. Progress is saved in localStorage.

📧 SES Production Access

📱 SNS Sandbox Exit

Progress persisted in localStorage.sd_sandbox_exit_checklist.

🐤 Canary Deploy Tracker

Compares Lambda $LATEST against the prod alias version. Green = in sync · Amber = drift detected.

$LATEST version
Invoked version this request
Prod Alias → version
Currently promoted alias
Drift Status
Checking…
About $LATEST drift: Lambda's $LATEST always points to the most recent deployment — even if it hasn't been promoted to the prod alias. When you see "$LATEST vs prod = v41", it means a new upload exists but your production traffic is still safely routed to v41 via the alias. This is cosmetic drift, not a production incident. Only promote after testing.
📋 Version history
Not yet checked.

⚠️ Anomaly Severity Escalation Matrix

Maps anomaly severity scores to P0–P3 response tiers with SLAs and escalation contacts.

Priority | Anomaly Score | Description | Response SLA | On-Call | Action
P0 CRITICAL | ≥ 90 | Full fleet offline, Lambda down, complete service failure, data loss risk | 15 min | Josh (primary) · backup@spaceduckling.com | 🚨 Escalate to Josh
P1 HIGH | 70–89 | Partial fleet dead (>50%), SES quota critical, peck failure rate >30% | 1 hour | Josh (primary) · ops@spaceduckling.com | 📋 View Runbook
P2 MEDIUM | 40–69 | Some agents slow/dead, SES near quota, peck failures 10–30%, alias drift | 4 hours | On-call operator · ops@spaceduckling.com | 📝 Log Incident
P3 LOW | < 40 | Minor drift, cosmetic issues, informational alerts, sandbox warnings | 24 hours | Operator (self-serve) · next shift | 📊 Alert History
Current anomaly assessment: Loading…
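
The score bands in the matrix translate directly into a lookup. A minimal sketch with SLAs expressed in minutes (the function name and tuple layout are illustrative; how the anomaly score itself is computed is not shown on this page):

```python
# (minimum score, tier, response SLA in minutes), highest band first.
TIERS = [
    (90, "P0", 15),
    (70, "P1", 60),
    (40, "P2", 240),
    (0, "P3", 1440),
]


def response_tier(score: float) -> tuple[str, int]:
    """Map an anomaly score to its priority tier and response SLA in minutes."""
    for floor, tier, sla_minutes in TIERS:
        if score >= floor:
            return tier, sla_minutes
    return "P3", 1440  # scores below zero fall through to the lowest tier


if __name__ == "__main__":
    for score in (95, 72, 40, 12):
        print(score, response_tier(score))
```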

🦆 Fleet Bond-Age Histogram

Loading fleet…

Spaceducks binned by time since last pulse. Sourced from sd_fleet_cache localStorage.

Fleet cache not yet populated — data appears after first metrics poll.

🥶 Lambda Cold-Start Detection

Collecting data…

Init duration estimates derived from audit_log timing deltas stored in sd_cold_starts localStorage. Responses >800ms flagged as potential cold starts.

P50 Est.
median init
P95 Est.
95th percentile
7-Day Count
cold start events
7-Day Cold Start Sparkline
7d ago · Today
Last cold start: not yet detected this session

📊 Platform SLA Scorecard

Loading…

Rolling 24h and 7d SLA metrics. Targets: API success ≥99.5% · Peck success ≥95% · Cert issuance ≥90%.

API Success Rate
24h · target: 99.5%
7d: —
Peck Success Rate
24h · target: 95%
7d: —
Cert Issuance Rate
24h · target: 90%
7d: —
Uptime Estimate
24h · target: 99.5%
7d: —
SLA metrics computed from route probe history and peck audit data. Uptime estimate derived from successful /beak/system/status probes.

📧 SES Sending Rate

Loading…

Emails/hour tracked from sd_ses_rate_log audit deltas · Amber at 70% daily quota · Red at 90%

Emails / Hour
Today Sent / Quota
Daily Quota
200
24h Sparkline (hourly deltas from sd_ses_rate_log)
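
The quota colouring can be sketched as a percentage check against the daily quota of 200 shown on this card. Whether the 70% and 90% boundaries are inclusive is an assumption, and the function name is illustrative.

```python
def quota_status(sent_today: int, daily_quota: int = 200) -> str:
    """Colour the SES usage tile: amber at 70% of quota, red at 90%."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    # Integer comparison avoids floating-point edge cases at the boundaries.
    if sent_today * 100 >= 90 * daily_quota:
        return "red"
    if sent_today * 100 >= 70 * daily_quota:
        return "amber"
    return "green"


if __name__ == "__main__":
    for sent in (100, 140, 180):
        print(sent, quota_status(sent))
```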

📅 Today's Activity

Since 00:00 UTC

Platform activity since midnight UTC · Sources: /beak/system/status + localStorage baseline

New Eggs
New Ducklings
Cert Issuances
Peck Attempts
Audit Events
Lambda Invocations
Deltas computed from sd_activity_baseline (set at session start · resets at midnight UTC)

CloudFront Cache Health

Loading…

Recent invalidation metadata, estimated freshness, and cache risk sourced from deploy metadata fallbacks plus localStorage snapshots.

Invalidations tracked
Last invalidation ID
Estimated freshness
Cache risk: Unknown
Waiting for cache metadata…
Last checked: —

✉️ SES Analytics

Loading…

Bounce and complaint rate tracking from sd_ses_events. Auto-refreshes every 60s.

Bounce Rate
Complaint Rate
7-Day Trend
View SES Details → Last refresh:

🔐 Security Posture

5 live security indicators — refreshes every 60s

Cognito MFA
SES SPF
SNS Auth
Lambda Timeout
Peck Failure

🚨 Incident Escalation

Track and escalate platform incidents. P1 = critical outage, P2 = degraded, P3 = minor.

Active Incidents 0 open
No active incidents — all clear ✓

🚨 Create Incident

✅ Resolve Incident

🐦 Lambda Canary Deploy

Gradually shift traffic to a new Lambda version. T-JOSH governance required to execute.

Prod alias: —
Current prod alias: beak-api:prod → version ?
Canary version: Not set — enter below
No split selected
Rollback if error rate exceeds this value
Prerequisites
✓ Lambda function published with versioning enabled
✓ Alias "prod" exists on target function
⚠ CloudWatch alarm configured for error rate monitoring
⚠ Rollback runbook reviewed and on-call notified
What this does

Executes update-alias --routing-config to shift a percentage of Lambda invocations to the new version. 100% promote removes the routing config entirely, making the new version the sole prod target.

# Step 1: Update alias with weighted routing (e.g. 10% to new version)
aws lambda update-alias --function-name beak-api --name prod \
  --routing-config '{"AdditionalVersionWeights":{"NEW_VER":0.10}}'

# Step 2: Monitor error rate in CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name Errors \
  --dimensions Name=FunctionName,Value=beak-api \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Sum

# Step 3: Promote to 100% (or rollback to previous)
aws lambda update-alias --function-name beak-api --name prod \
  --function-version NEW_VER --routing-config '{}'

# Rollback: restore previous version
aws lambda update-alias --function-name beak-api --name prod \
  --function-version PREV_VER --routing-config '{}'

🔐 T-JOSH Governance Required

Lambda canary deployments require Josh (T-JOSH) approval. Enter your operator identity to proceed — this will be logged.

⚠ This will modify live Lambda traffic routing. Ensure CloudWatch alarms are armed and rollback is ready.