# Aether Executable Backlog

**Built from:** 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition
**Date:** 2026-01-12

---

## Backlog Overview

This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.

**Total Scope:**

- **Capabilities:** 9 (all complete)
- **Features:** 14
- **Issues:** 67
- **Contexts:** 5
- **Implementation Phases:** 4

**Build Order (by value and dependencies):**

1. **Phase 1: Event Sourcing Foundation** (Capabilities 1-3) - Issues: 17 - Enables all other work
2. **Phase 2: Local Event Bus** (Capability 8) - Issues: 9 - Enables local pub/sub before clustering
3. **Phase 3: Cluster Coordination** (Capabilities 5-7) - Issues: 20 - Enables distributed deployment
4. **Phase 4: Namespace & NATS** (Capabilities 4, 9) - Issues: 21 - Enables multi-tenancy and cross-node delivery

---

## Phase 1: Event Sourcing Foundation

### Feature Set 1a: Event Storage with Version Conflict Detection

**Capability:** Store Events Durably with Conflict Detection

**Description:** Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.

**Success Condition:** Multiple writers attempt to update same actor; first wins, others see VersionConflictError with details; all writes land in immutable history.

---

#### Issue 1.1: [Command] Implement SaveEvent with monotonic version validation

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely

**User Story**

As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.
**Acceptance Criteria**

- [ ] SaveEvent accepts event with Version > current for actor
- [ ] SaveEvent rejects event with Version <= current (returns VersionConflictError)
- [ ] VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
- [ ] First event for new actor must have Version > 0 (typically 1)
- [ ] Version gaps are allowed (1, 3, 5 is valid)
- [ ] Validation happens before persistence (fail-fast)
- [ ] InMemoryEventStore and JetStreamEventStore both implement validation

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Core)
**Aggregate:** ActorEventStream (implicit; each actor has independent version sequence)
**Command:** SaveEvent(event)
**Validation Rules:**
- If no events exist for actor: version must be > 0
- If events exist: new version must be > latest version

**Success Event:** EventStored (published when SaveEvent succeeds)
**Error Event:** VersionConflict (triggered when version validation fails)

**Technical Notes**

- Version validation is the core invariant; everything else depends on it
- Use `GetLatestVersion()` to implement validation
- No database-level locks; optimistic validation only
- Conflict should fail in <1ms

**Test Cases**

- New actor, version 1: succeeds
- Same actor, version 2 (after 1): succeeds
- Same actor, version 2 (after 1, concurrent): second call fails
- Same actor, version 1 (duplicate): fails
- Same actor, version 0 or negative: fails
- 100 concurrent writers: 1 succeeds, 99 fail

**Dependencies**

- None (foundation)
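A minimal sketch of the validation rule above, assuming an in-memory store; the `Event` shape and the internal maps are illustrative, and the actual types in `/aether/event.go` may differ:

```go
package aether

import (
	"errors"
	"fmt"
	"sync"
)

// ErrVersionConflict is the sentinel checked with errors.Is (see Issue 1.4).
var ErrVersionConflict = errors.New("version conflict")

// VersionConflictError carries the context an application needs to retry.
type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for actor %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// Unwrap lets errors.Is(err, ErrVersionConflict) return true.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

// Event is the minimal shape assumed by this sketch.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

// InMemoryEventStore keeps one append-only slice of events per actor.
type InMemoryEventStore struct {
	mu     sync.RWMutex
	events map[string][]*Event // actorID -> events in version order
	latest map[string]int64    // actorID -> latest version (read model for Issue 1.5)
}

func NewInMemoryEventStore() *InMemoryEventStore {
	return &InMemoryEventStore{events: map[string][]*Event{}, latest: map[string]int64{}}
}

// SaveEvent enforces the monotonic-version invariant before persisting (fail-fast).
// For a new actor, latest is 0, so any Version <= 0 is rejected as required.
func (s *InMemoryEventStore) SaveEvent(e *Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	current := s.latest[e.ActorID]
	if e.Version <= current {
		return &VersionConflictError{
			ActorID:          e.ActorID,
			AttemptedVersion: e.Version,
			CurrentVersion:   current,
		}
	}
	s.events[e.ActorID] = append(s.events[e.ActorID], e)
	s.latest[e.ActorID] = e.Version
	return nil
}
```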
---

#### Issue 1.2: [Rule] Enforce append-only and immutability invariants

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Enforce event immutability and append-only semantics

**User Story**

As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.

**Acceptance Criteria**

- [ ] EventStore interface has no Update or Delete methods
- [ ] Events cannot be modified after persistence
- [ ] Replay of same events always produces same state
- [ ] Corrupted events are reported (not silently skipped)
- [ ] JetStream stream configuration prevents deletes (retention policy only)

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Core Invariant)
**Aggregate:** ActorEventStream
**Invariant:** Events are immutable; stream is append-only; no modifications to EventStore interface

**Implementation:**
- Event struct has no setters (only getters)
- SaveEvent is the only public persistence method
- JetStream streams configured with `NoDelete` policy

**Technical Notes**

- This is enforced at interface level (no Update/Delete in EventStore)
- JetStream configuration prevents accidental deletes
- ReplayError allows visibility into corruption without losing good data

**Test Cases**

- Attempt to modify Event.Data after creation: compile error (if immutable)
- Attempt to call UpdateEvent: interface doesn't exist
- JetStream stream created with correct retention policy
- ReplayError captured when event unmarshaling fails

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent implementation)

---

#### Issue 1.3: [Event] Publish EventStored after successful save

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Emit EventStored event for persistence observability

**User Story**

As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).

**Acceptance Criteria**

- [ ] EventStored event published after SaveEvent succeeds
- [ ] EventStored contains: EventID, ActorID, Version, Timestamp
- [ ] No EventStored published if SaveEvent fails
- [ ] EventBus receives EventStored in same transaction context
- [ ] Metrics increment for each EventStored

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature
**Event:** EventStored(eventID, actorID, version, timestamp)
**Triggered by:** Successful SaveEvent call
**Consumers:** Metrics collectors, projections, audit systems

**Technical Notes**

- EventStored is an internal event (Aether infrastructure)
- Published to local EventBus (Phase 2); cross-node delivery arrives in Phase 4
- Allows observability without coupling application code

**Test Cases**

- Save event → EventStored published
- Version conflict → no EventStored published
- Multiple saves → multiple EventStored events in order

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent)
- Depends on: Phase 2, Issue 2.1 (EventBus.Publish)

---

#### Issue 1.4: [Event] Publish VersionConflict error with full context

**Type:** New Feature
**Bounded Context:** Event Sourcing, Optimistic Concurrency Control
**Priority:** P0
**Title:** Return detailed version conflict information for retry logic

**User Story**

As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).
**Acceptance Criteria**

- [ ] VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
- [ ] Error message is human-readable with all context
- [ ] errors.Is(err, ErrVersionConflict) returns true for sentinel check
- [ ] errors.As(err, &versionErr) allows unpacking to VersionConflictError
- [ ] Application can read CurrentVersion to decide retry strategy

**Bounded Context:** Event Sourcing + OCC

**DDD Implementation Guidance**

**Type:** New Feature
**Error Type:** VersionConflictError (wraps ErrVersionConflict sentinel)
**Data:** ActorID, AttemptedVersion, CurrentVersion
**Use:** Application uses this to implement retry strategies

**Technical Notes**

- Already implemented in `/aether/event.go` (VersionConflictError struct)
- Document standard retry patterns in examples/

**Test Cases**

- Conflict with detailed error: ActorID, versions present
- Application reads CurrentVersion: succeeds
- errors.Is(err, ErrVersionConflict): true
- errors.As(err, &versionErr): works
- Manual test: log the error, see all context

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent)

---

#### Issue 1.5: [Read Model] Implement GetLatestVersion query

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Provide efficient version lookup for optimistic locking

**User Story**

As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.

**Acceptance Criteria**

- [ ] GetLatestVersion(actorID) returns latest version or 0 if no events
- [ ] Execution time is O(1) or O(log n), not O(n)
- [ ] InMemoryEventStore implements with map lookup
- [ ] JetStreamEventStore caches latest version per actor
- [ ] Cache is invalidated after each SaveEvent
- [ ] Multiple calls for same actor within 1s hit cache
- [ ] Namespace isolation: GetLatestVersion scoped to namespace

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Read Model:** ActorVersionIndex
**Source Events:** SaveEvent (updates cache)
**Data:** ActorID → LatestVersion
**Performance:** O(1) lookup after SaveEvent

**Technical Notes**

- InMemoryEventStore: use map[actorID]int64
- JetStreamEventStore: query JetStream metadata OR maintain cache
- Cache invalidation: update after every SaveEvent
- Thread-safe with RWMutex (read-heavy)

**Test Cases**

- New actor: GetLatestVersion returns 0
- After SaveEvent(version: 1): GetLatestVersion returns 1
- After SaveEvent(version: 3): GetLatestVersion returns 3
- Concurrent reads from same actor: all return consistent value
- Namespace isolation: "tenant-a" and "tenant-b" have independent versions

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent)
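Continuing the in-memory sketch from Issue 1.1, the ActorVersionIndex read model can be a guarded map read; the RWMutex matches the read-heavy access pattern noted above, and the names are illustrative:

```go
// GetLatestVersion is an O(1) lookup against the index that SaveEvent keeps
// current. It returns 0 when the actor has no events, per the acceptance criteria.
func (s *InMemoryEventStore) GetLatestVersion(actorID string) (int64, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.latest[actorID], nil
}
```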
---

### Feature Set 1b: State Rebuild from Event History

**Capability:** Rebuild State from Event History

**Description:** Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.

**Success Condition:** GetEvents(actorID, 0) returns all events in order; replaying produces consistent state every time; snapshots reduce replay time from O(n) to O(1).

---

#### Issue 1.6: [Command] Implement GetEvents for replay

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Load events from store for state replay

**User Story**

As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.

**Acceptance Criteria**

- [ ] GetEvents(actorID, fromVersion) returns []*Event in version order
- [ ] Events are ordered by version (ascending)
- [ ] fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
- [ ] If no events exist, returns empty slice (not error)
- [ ] If actorID has no events >= fromVersion, returns empty slice
- [ ] Namespace isolation: GetEvents scoped to namespace
- [ ] Large result sets don't cause memory issues (stream if >10k events)

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Command:** GetEvents(actorID, fromVersion)
**Returns:** []*Event ordered by version
**Invariant:** Order is deterministic (version order always)

**Technical Notes**

- InMemoryEventStore: filter and sort by version
- JetStreamEventStore: query JetStream subject and order results
- Consider pagination for very large actor histories
- fromVersion=0 means "start from beginning"

**Test Cases**

- GetEvents(actorID, 0) with 5 events: returns all 5 in order
- GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
- GetEvents(nonexistent, 0): returns empty slice
- GetEvents with gap (versions 1, 3, 5): returns only those 3
- Order is guaranteed (version order, not insertion order)

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent)

---

#### Issue 1.7: [Rule] Define and enforce snapshot validity

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Implement snapshot invalidation policy

**User Story**

As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.

**Acceptance Criteria**

- [ ] Snapshot valid until Version + MaxVersionGap (default 1000)
- [ ] GetLatestSnapshot returns nil if no snapshot or invalid
- [ ] Application can override MaxVersionGap in config
- [ ] Snapshot timestamp recorded for debugging
- [ ] No automatic cleanup; application calls SaveSnapshot to create
- [ ] Tests confirm snapshot invalidation logic

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Policy)
**Aggregate:** ActorSnapshot + SnapshotPolicy
**Policy:** Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap

**Implementation:**
- SnapshotStore.GetLatestSnapshot validates before returning
- If invalid, returns nil; application must replay

**Technical Notes**

- This is a safety policy; prevents stale snapshots
- Application owns decision to create snapshots (no auto-triggering)
- MaxVersionGap is tunable per deployment

**Test Cases**

- Snapshot at version 10, MaxGap=100, current=50: valid
- Snapshot at version 10, MaxGap=100, current=111: invalid
- Snapshot at version 10, MaxGap=100, current=110: valid
- GetLatestSnapshot returns nil for invalid snapshot

**Dependencies**

- Depends on: Issue 1.6 (GetEvents)
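To connect Issues 1.6 and 1.7, here is a sketch of the intended rebuild flow, extending the in-memory sketch above (add "time" to its imports). The ActorSnapshot fields mirror Issue 1.10; ReplaySource and the apply callback are illustrative assumptions, not Aether's actual API:

```go
// ActorSnapshot mirrors the value object from Issues 1.7/1.10: point-in-time
// state plus the version it covers.
type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

// ReplaySource is the subset of the store needed for a rebuild.
type ReplaySource interface {
	GetLatestVersion(actorID string) (int64, error)
	GetEvents(actorID string, fromVersion int64) ([]*Event, error)
}

// snapshotValid applies the Issue 1.7 policy: a snapshot is usable only while
// the version gap stays within MaxVersionGap.
func snapshotValid(snap *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	return snap != nil && currentVersion-snap.Version <= maxVersionGap
}

// RebuildState starts from a valid snapshot when one exists, otherwise from
// version 0, then applies events in version order. Because fromVersion is
// inclusive, replay resumes at snap.Version+1.
func RebuildState(store ReplaySource, snap *ActorSnapshot, actorID string,
	maxVersionGap int64, apply func(state []byte, e *Event) []byte) ([]byte, error) {

	current, err := store.GetLatestVersion(actorID)
	if err != nil {
		return nil, err
	}

	var state []byte
	fromVersion := int64(0)
	if snapshotValid(snap, current, maxVersionGap) {
		state = snap.State
		fromVersion = snap.Version + 1
	}

	events, err := store.GetEvents(actorID, fromVersion)
	if err != nil {
		return nil, err
	}
	for _, e := range events {
		state = apply(state, e)
	}
	return state, nil
}
```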
---

#### Issue 1.8: [Event] Publish SnapshotCreated for observability

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Emit snapshot creation event for lifecycle tracking

**User Story**

As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.

**Acceptance Criteria**

- [ ] SnapshotCreated event published after SaveSnapshot succeeds
- [ ] Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
- [ ] Metrics increment for snapshot creation
- [ ] No event if SaveSnapshot fails
- [ ] Example: Snapshot created every 1000 versions

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** SnapshotCreated(actorID, version, timestamp, replayDurationMs)
**Triggered by:** SaveSnapshot call succeeds
**Consumers:** Metrics, monitoring dashboards

**Technical Notes**

- SnapshotCreated is an infrastructure event (like EventStored)
- ReplayDuration helps identify slow actors needing snapshots more frequently

**Test Cases**

- SaveSnapshot succeeds → SnapshotCreated published
- SaveSnapshot fails → no event published
- ReplayDuration recorded accurately

**Dependencies**

- Depends on: Issue 1.10 (SnapshotStore interface)

---

#### Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Handle corrupted events during replay without data loss

**User Story**

As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.

**Acceptance Criteria**

- [ ] GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
- [ ] ReplayResult contains: []*Event (good) and []ReplayError (bad)
- [ ] Good events are returned in order despite errors
- [ ] ReplayError contains: SequenceNumber, RawData, UnmarshalError
- [ ] Application decides how to handle corrupted events
- [ ] Metrics track corruption frequency

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Interface:** EventStoreWithErrors extends EventStore
**Method:** GetEventsWithErrors(actorID, fromVersion) → ReplayResult
**Data:**
- ReplayResult.Events: successfully deserialized events
- ReplayResult.Errors: corruption records
- ReplayResult.HasErrors(): convenience check

**Technical Notes**

- Already defined in event.go (ReplayError, ReplayResult)
- JetStreamEventStore should implement EventStoreWithErrors
- Application uses HasErrors() to decide on recovery action

**Test Cases**

- All good events: ReplayResult.Events populated, no errors
- Corrupted event in middle: good events before/after, one error recorded
- Multiple corruptions: all recorded with context
- Application can inspect RawData for forensics

**Dependencies**

- Depends on: Issue 1.6 (GetEvents)

---

#### Issue 1.10: [Interface] Implement SnapshotStore interface

**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Define snapshot storage contract

**User Story**

As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).
**Acceptance Criteria**

- [ ] SnapshotStore extends EventStore
- [ ] GetLatestSnapshot(actorID) returns ActorSnapshot or nil
- [ ] SaveSnapshot(snapshot) persists snapshot
- [ ] ActorSnapshot contains: ActorID, Version, State, Timestamp
- [ ] Namespace isolation: snapshots scoped to namespace
- [ ] Tests verify interface contract

**Bounded Context:** Event Sourcing

**DDD Implementation Guidance**

**Type:** New Feature (Interface)
**Interface:** SnapshotStore extends EventStore
**Methods:**
- GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
- SaveSnapshot(snapshot) → error

**Aggregates:** ActorSnapshot (value object)

**Technical Notes**

- Already defined in event.go
- Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
- Keep snapshots in same store as events (co-located)

**Test Cases**

- SaveSnapshot persists; GetLatestSnapshot retrieves it
- New actor: GetLatestSnapshot returns nil
- Multiple snapshots: only latest returned
- Namespace isolation: snapshots from tenant-a don't appear in tenant-b

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent + storage foundation)

---

### Feature Set 1c: Optimistic Concurrency Control

**Capability:** Enable Safe Concurrent Writes

**Description:** Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.

**Success Condition:** Two concurrent writers race; one succeeds, other sees VersionConflictError; application retries without locks.

---

#### Issue 1.11: [Rule] Enforce fail-fast on version conflict

**Type:** New Feature
**Bounded Context:** Optimistic Concurrency Control
**Priority:** P0
**Title:** Fail immediately on version conflict; no auto-retry

**User Story**

As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).

**Acceptance Criteria**

- [ ] SaveEvent returns VersionConflictError immediately on mismatch
- [ ] No built-in retry loop in SaveEvent
- [ ] No database-level locks held
- [ ] Application reads VersionConflictError and decides retry
- [ ] Default retry strategy documented (examples/)

**Bounded Context:** Optimistic Concurrency Control

**DDD Implementation Guidance**

**Type:** New Feature (Policy)
**Invariant:** Conflicts trigger immediate failure; application owns retry

**Implementation:**
- SaveEvent: version check, return error if mismatch, done
- No loop, no backoff, no retries
- Clean error with context for caller

**Technical Notes**

- This is a design choice: fail-fast enables flexible retry strategies
- Application can choose exponential backoff, jitter, circuit-breaker, etc.

**Test Cases**

- SaveEvent(version: 2) when current=2: fails immediately
- No retry attempted by library
- Application can retry if desired
- Example patterns in examples/retry.go

**Dependencies**

- Depends on: Issue 1.1 (SaveEvent)
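One possible caller-side retry pattern for this fail-fast behaviour (and a starting point for the examples requested in Issue 1.12). It extends the earlier in-memory sketch and additionally needs "math/rand" and "time" imports; the build callback, attempt limit, and backoff constants are illustrative assumptions, not Aether API:

```go
// SaveWithRetry is a caller-owned retry loop for idempotent writes: on a
// version conflict it re-reads the current version, rebuilds the event, and
// retries with exponential backoff plus jitter. The library itself never retries.
func SaveWithRetry(store *InMemoryEventStore, actorID string,
	build func(nextVersion int64) *Event, maxAttempts int) error {

	backoff := 10 * time.Millisecond

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		current, err := store.GetLatestVersion(actorID)
		if err != nil {
			return err
		}

		if err = store.SaveEvent(build(current + 1)); err == nil {
			return nil // write landed
		}

		var conflict *VersionConflictError
		if !errors.As(err, &conflict) {
			return err // not a conflict: do not retry blindly
		}

		// Another writer won the race; back off with jitter and try again
		// against whatever version the next read observes.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return fmt.Errorf("gave up after %d conflicting attempts: %w", maxAttempts, ErrVersionConflict)
}
```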
---

#### Issue 1.12: [Documentation] Document concurrent write patterns

**Type:** New Feature
**Bounded Context:** Optimistic Concurrency Control
**Priority:** P1
**Title:** Provide retry strategy examples (backoff, circuit-breaker, queue)

**User Story**

As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.

**Acceptance Criteria**

- [ ] examples/retry_exponential_backoff.go
- [ ] examples/retry_circuit_breaker.go
- [ ] examples/retry_queue_based.go
- [ ] examples/concurrent_write_test.go showing patterns
- [ ] README mentions OCC patterns
- [ ] Each example is >100 lines with explanation

**Bounded Context:** Optimistic Concurrency Control

**DDD Implementation Guidance**

**Type:** Documentation
**Artifacts:**
- examples/retry_exponential_backoff.go
- examples/retry_circuit_breaker.go
- examples/retry_queue_based.go
- examples/concurrent_write_test.go

**Content:**
- How to read VersionConflictError
- When to retry (idempotent operations)
- When not to retry (non-idempotent)
- Backoff strategies
- Monitoring

**Technical Notes**

- Real, runnable code (not pseudocode)
- Show metrics collection
- Show when to give up

**Test Cases**

- Examples compile without error
- Examples use idempotent operations
- Test coverage for examples

**Dependencies**

- Depends on: Issue 1.11 (fail-fast behavior)

---

## Phase 2: Local Event Bus

### Feature Set 2a: Event Routing and Filtering

**Capability:** Route and Filter Domain Events

**Description:** Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.

**Success Condition:** Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.

---

#### Issue 2.1: [Command] Implement Publish to local subscribers

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Publish events to local subscribers

**User Story**

As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.

**Acceptance Criteria**

- [ ] Publish(namespaceID, event) sends to all subscribers of that namespace
- [ ] Exact subscribers (namespace="orders") receive event
- [ ] Wildcard subscribers (namespace="order*") receive matching events
- [ ] Events delivered in-process (no NATS yet)
- [ ] Buffered channels (100-event buffer) prevent blocking
- [ ] Events for subscribers with full buffers are dropped non-blocking (no deadlock)
- [ ] Metrics track publish count, receive count, dropped count

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** Publish(namespaceID, event)
**Invariant:** All subscribers matching namespace receive event

**Implementation:**
- Iterate exact subscribers for namespace
- Iterate wildcard subscribers matching pattern
- Deliver to each (non-blocking, buffered)
- Count drops

**Technical Notes**

- EventBus in eventbus.go already implements this
- Ensure buffered channels don't cause memory leaks
- Metrics important for observability

**Test Cases**

- Publish to "orders": exact subscriber of "orders" receives
- Publish to "orders.new": wildcard subscriber of "order*" receives
- Publish to "payments": subscriber to "orders" does NOT receive
- Subscriber with full buffer: event dropped (non-blocking)
- 1000 publishes: metrics accurate

**Dependencies**

- Depends on: Issue 2.2 (Subscribe)
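A sketch of the non-blocking delivery path described here and in Issue 2.8, assuming a simple registry of exact subscriptions; the real eventbus.go structures (wildcard handling, metrics collector) will differ:

```go
// localBus is a minimal in-process bus: exact subscriptions keyed by
// namespace, each backed by a buffered channel.
type localBus struct {
	mu    sync.RWMutex
	exact map[string][]chan *Event // namespace -> subscriber channels
	drops map[string]int64         // namespace -> dropped-event count
}

func newLocalBus() *localBus {
	return &localBus{exact: map[string][]chan *Event{}, drops: map[string]int64{}}
}

// Subscribe registers a buffered channel (100 events) for an exact namespace.
func (b *localBus) Subscribe(namespace string) <-chan *Event {
	ch := make(chan *Event, 100)
	b.mu.Lock()
	b.exact[namespace] = append(b.exact[namespace], ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers to every exact subscriber without ever blocking the
// publisher: a full buffer means the event is dropped and counted.
func (b *localBus) Publish(namespace string, e *Event) {
	b.mu.RLock()
	subs := b.exact[namespace]
	b.mu.RUnlock()

	for _, ch := range subs {
		select {
		case ch <- e: // subscriber has room
		default: // buffer full: drop rather than block (Issue 2.8)
			b.mu.Lock()
			b.drops[namespace]++
			b.mu.Unlock()
		}
	}
}
```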
---

#### Issue 2.2: [Command] Implement Subscribe with optional filter

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Register subscriber with optional event filter

**User Story**

As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.

**Acceptance Criteria**

- [ ] Subscribe(namespacePattern) returns <-chan *Event
- [ ] SubscribeWithFilter(namespacePattern, filter) returns filtered channel
- [ ] Filter supports EventTypes ([]string) and ActorPattern (string)
- [ ] Filters applied client-side (subscriber decides)
- [ ] Wildcard patterns work: "*" matches single token, ">" matches multiple
- [ ] Subscription channel is buffered (100 events)
- [ ] Unsubscribe(namespacePattern, ch) removes subscription

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)
**Invariants:**
- Namespace pattern determines which namespaces
- Filter determines which events within namespace
- Both work together (AND logic)

**Filter Types:**
- EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
- ActorPattern: string (e.g., "order-customer-*")

**Technical Notes**

- Pattern matching follows NATS conventions
- Filters are optional (nil filter = all events)
- Filtering is client-side on the local bus; NATS server-side subject filtering arrives in Phase 4

**Test Cases**

- Subscribe("orders"): exact match only
- Subscribe("order*"): wildcard match
- Subscribe("order.*"): NATS-style wildcard
- SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
- SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
- Unsubscribe closes channel

**Dependencies**

- Depends on: Issue 1.1 (events structure)
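Issue 2.3 below references a MatchNamespacePattern helper; the sketch here covers only the NATS-style "*" and ">" tokens from the acceptance criteria (the prefix form like "order*" used in some test cases would need extra handling), so treat it as an assumption about the helper's behaviour rather than its actual implementation. It needs a "strings" import:

```go
// MatchNamespacePattern implements NATS-style token matching: "*" matches
// exactly one dot-separated token, ">" matches one or more trailing tokens.
func MatchNamespacePattern(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")

	for i, tok := range p {
		if tok == ">" {
			return i < len(n) // ">" must cover at least one remaining token
		}
		if i >= len(n) {
			return false
		}
		if tok != "*" && tok != n[i] {
			return false
		}
	}
	return len(p) == len(n) // no trailing namespace tokens left unmatched
}
```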
---

#### Issue 2.3: [Rule] Enforce exact subscription isolation

**Type:** New Feature
**Bounded Context:** Event Bus + Namespace Isolation
**Priority:** P1
**Title:** Guarantee exact namespace subscriptions are isolated

**User Story**

As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.

**Acceptance Criteria**

- [ ] Subscriber to "tenant-a" receives events from "tenant-a" only
- [ ] Subscriber to "tenant-a" does NOT receive from "tenant-b"
- [ ] Wildcard subscriber to "tenant*" receives from both
- [ ] Exact-match subscriptions remain isolated even when wildcard subscriptions exist
- [ ] Tests verify isolation with multi-namespace setup
- [ ] Documentation warns about wildcard security implications

**Bounded Context:** Event Bus + Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Policy/Invariant)
**Invariant:** Exact subscriptions are isolated

**Implementation:**
- exactSubscribers map[namespace][]*subscription
- Wildcard subscriptions separate collection
- Publish checks exact first, then wildcard patterns

**Security Note:** Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)

**Technical Notes**

- Enforced at EventBus.Publish level
- Exact match is simple string equality
- Wildcard uses MatchNamespacePattern helper

**Test Cases**

- Publish to "tenant-a": only "tenant-a" exact subscribers get it
- Publish to "tenant-b": only "tenant-b" exact subscribers get it
- Publish to "tenant-a": "tenant*" wildcard subscriber gets it
- Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it

**Dependencies**

- Depends on: Issue 2.2 (Subscribe)

---

#### Issue 2.4: [Rule] Document wildcard subscription security

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Document that wildcard subscriptions bypass isolation

**User Story**

As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.

**Acceptance Criteria**

- [ ] eventbus.go comments explain wildcard behavior
- [ ] Security warning in Subscribe godoc
- [ ] Example showing wildcard usage for logging
- [ ] Example showing why wildcard is dangerous (if not restricted)
- [ ] README mentions namespace isolation caveats
- [ ] Examples show proper patterns (monitoring, auditing)

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** Documentation
**Content:**
- Wildcard subscriptions receive all matching events
- Use for cross-cutting concerns (logging, monitoring, audit)
- Restrict access to trusted components
- Never expose wildcard pattern to untrusted users

**Examples:**
- Monitoring system subscribes to ">"
- Audit system subscribes to "tenant-*"
- Application logic uses exact subscriptions only

**Technical Notes**

- Intentional design; not a bug
- Different from NATS server-side filtering

**Test Cases**

- Examples compile
- Documentation is clear and accurate

**Dependencies**

- Depends on: Issue 2.3 (exact isolation)

---

#### Issue 2.5: [Event] Publish SubscriptionCreated for tracking

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Track subscription lifecycle

**User Story**

As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.

**Acceptance Criteria**

- [ ] SubscriptionCreated event published on Subscribe
- [ ] SubscriptionDestroyed event published on Unsubscribe
- [ ] Event contains: namespacePattern, filterCriteria, timestamp
- [ ] Metrics increment on subscribe/unsubscribe
- [ ] SubscriberCount(namespace) returns current count

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** SubscriptionCreated(namespacePattern, filter, timestamp)
**Event:** SubscriptionDestroyed(namespacePattern, timestamp)
**Metrics:** Subscriber count per namespace

**Technical Notes**

- SubscriberCount already in eventbus.go
- Add events to EventBus.Subscribe and EventBus.Unsubscribe
- Internal events (infrastructure)

**Test Cases**

- Subscribe → metrics increment
- Unsubscribe → metrics decrement
- SubscriberCount correct

**Dependencies**

- Depends on: Issue 2.2 (Subscribe/Unsubscribe)

---

#### Issue 2.6: [Event] Publish EventPublished for delivery tracking

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Record event publication metrics

**User Story**

As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.
**Acceptance Criteria**

- [ ] EventPublished event published on Publish
- [ ] Metrics track: published count, delivered count, dropped count per namespace
- [ ] Dropped events (full channel) recorded
- [ ] Application can query metrics via Metrics()
- [ ] Example: 1000 events published, 995 delivered, 5 dropped

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** New Feature (Event/Metrics)
**Event:** EventPublished (infrastructure event)
**Metrics:**
- PublishCount[namespace]
- DeliveryCount[namespace]
- DroppedCount[namespace]

**Implementation:**
- RecordPublish(namespace)
- RecordReceive(namespace)
- RecordDroppedEvent(namespace)

**Technical Notes**

- Metrics already in DefaultMetricsCollector
- RecordDroppedEvent signals subscriber backpressure
- Can be used to auto-scale subscribers

**Test Cases**

- Publish 100 events: metrics show 100 published
- All delivered: metrics show 100 delivered
- Full subscriber: next event dropped, metrics show 1 dropped
- Query via bus.Metrics(): values accurate

**Dependencies**

- Depends on: Issue 2.1 (Publish)

---

#### Issue 2.7: [Read Model] Implement GetSubscriptions query

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Query active subscriptions for operational visibility

**User Story**

As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.

**Acceptance Criteria**

- [ ] GetSubscriptions() returns []SubscriptionInfo
- [ ] SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
- [ ] Works for both exact and wildcard subscriptions
- [ ] Metrics accessible via SubscriberCount(namespace)
- [ ] Example: "What subscriptions are listening to 'orders'?"

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Read Model:** SubscriptionRegistry
**Data:**
- Pattern: namespace pattern (e.g., "tenant-*")
- Filter: optional filter criteria
- SubscriberID: unique ID for each subscription
- CreatedAt: timestamp

**Implementation:**
- Track subscriptions in eventbus.go
- Expose via GetSubscriptions() method

**Technical Notes**

- Useful for debugging
- Optional feature; not critical

**Test Cases**

- Subscribe to "orders": GetSubscriptions shows it
- Subscribe to "order*": GetSubscriptions shows it
- Unsubscribe: GetSubscriptions removes it
- Multiple subscribers: all listed

**Dependencies**

- Depends on: Issue 2.2 (Subscribe)

---

### Feature Set 2b: Buffering and Backpressure

**Capability:** Route and Filter Domain Events (non-blocking delivery)

**Description:** Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).

**Success Condition:** Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.

---

#### Issue 2.8: [Rule] Implement non-blocking event delivery

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Ensure event publication never blocks

**User Story**

As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.
**Acceptance Criteria**

- [ ] Publish(namespace, event) returns immediately
- [ ] If subscriber channel full, event dropped (non-blocking)
- [ ] Dropped events counted in metrics
- [ ] Buffered channel size is 100 (tunable)
- [ ] Publisher never waits for subscriber
- [ ] Metrics alert on high drop rate

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** New Feature (Policy)
**Invariant:** Publishers not blocked by slow subscribers

**Implementation:**
- `select { case ch <- event: ... default: ... }`
- Count drops in default case

**Trade-off:**
- Pro: Publisher never blocks
- Con: Events may be lost if subscriber can't keep up
- Mitigation: Metrics alert on drops; subscriber can increase buffer or retry

**Technical Notes**

- Already implemented in eventbus.go (deliverToSubscriber)
- 100-event buffer is reasonable default

**Test Cases**

- Subscribe, receive 100 events: no drops
- Publish 101st event immediately: dropped
- Metrics show drop count
- Publisher latency < 1ms regardless of subscribers

**Dependencies**

- Depends on: Issue 2.1 (Publish)

---

#### Issue 2.9: [Documentation] Document EventBus backpressure handling

**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Explain buffer management and recovery from drops

**User Story**

As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.

**Acceptance Criteria**

- [ ] Document buffer size (100 events default)
- [ ] Explain what happens on overflow (event dropped)
- [ ] Document recovery patterns (subscriber restarts, re-syncs)
- [ ] Example: Subscriber catches up from JetStream after restart
- [ ] Metrics to monitor (drop rate)
- [ ] README section on backpressure

**Bounded Context:** Event Bus

**DDD Implementation Guidance**

**Type:** Documentation
**Content:**
- Buffer size and behavior
- Drop semantics
- Recovery patterns
- Metrics to monitor
- When to increase buffer size

**Examples:**
- Slow subscriber: increase buffer or fix handler
- Network latency: events may be dropped
- Handler panics: subscriber must restart and re-sync

**Technical Notes**

- Events are lost if dropped; only durable via JetStream
- Phase 4 (cross-node delivery via NATS JetStream) addresses durability

**Test Cases**

- Documentation is clear
- Examples work

**Dependencies**

- Depends on: Issue 2.8 (non-blocking delivery)

---

## Phase 3: Cluster Coordination

### Feature Set 3a: Cluster Topology and Leadership

**Capability:** Coordinate Cluster Topology

**Description:** Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.

**Success Condition:** Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.

---

#### Issue 3.1: [Command] Implement JoinCluster protocol

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Enable node discovery via cluster join

**User Story**

As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.
**Acceptance Criteria**

- [ ] JoinCluster() announces node via NATS
- [ ] Node info contains: NodeID, Address, Timestamp, Status
- [ ] Other nodes receive join announcement
- [ ] Cluster topology updated atomically
- [ ] Rejoining node detected and updated
- [ ] Tests verify multi-node discovery

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** JoinCluster()
**Aggregates:** Cluster (group of nodes)
**Events:** NodeJoined(nodeID, address, timestamp)

**Technical Notes**

- NATS subject: "aether.cluster.nodes"
- NodeDiscovery subscribes to announcements
- ClusterManager.Start() initiates join

**Test Cases**

- Single node joins: topology = [node-a]
- Second node joins: topology = [node-a, node-b]
- Third node joins: topology = [node-a, node-b, node-c]
- Node rejoins: updates existing entry

**Dependencies**

- None (first cluster feature)

---

#### Issue 3.2: [Command] Implement LeaderElection

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Elect single leader via NATS-based voting

**User Story**

As a cluster, I want one node to be elected leader, so that it can coordinate shard assignments and rebalancing.

**Acceptance Criteria**

- [ ] LeaderElection holds election every HeartbeatInterval (5s)
- [ ] Nodes announce their own candidacy; there is no ballot counting (first announcement wins)
- [ ] One leader elected per term
- [ ] Leader holds lease (TTL = 2 * HeartbeatInterval)
- [ ] All nodes converge on same leader
- [ ] Lease renewal happens automatically

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** ElectLeader()
**Aggregates:** LeadershipLease (time-bound authority)
**Events:** LeaderElected(leaderID, term, leaseExpiration)

**Technical Notes**

- NATS subject: "aether.cluster.election"
- Each node publishes heartbeat with NodeID, Timestamp
- First node to publish becomes leader
- Lease expires if no heartbeat for TTL

**Test Cases**

- Single node: elected immediately
- Three nodes: exactly one elected
- Leader dies: remaining nodes elect new leader within 2*interval
- Leader comes back: may or may not stay leader

**Dependencies**

- Depends on: Issue 3.1 (node discovery)
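A sketch of the lease mechanics behind Issues 3.2 and 3.3, assuming heartbeats observed on "aether.cluster.election"; the struct and method names are illustrative and need "sync" and "time" imports:

```go
// leadershipLease tracks the time-bound authority described above: whoever
// heartbeated most recently holds the lease, and the lease silently expires
// after 2x the heartbeat interval.
type leadershipLease struct {
	mu                sync.RWMutex
	leaderID          string
	term              int64
	lastHeartbeat     time.Time
	heartbeatInterval time.Duration // e.g. 5 * time.Second
}

// observeHeartbeat records a leader heartbeat received over NATS.
func (l *leadershipLease) observeHeartbeat(nodeID string, at time.Time) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if nodeID != l.leaderID {
		l.term++ // a different node took over: new term
		l.leaderID = nodeID
	}
	l.lastHeartbeat = at
}

// currentLeader returns "" once the lease TTL (2x interval) has elapsed,
// which is the condition that triggers a new election.
func (l *leadershipLease) currentLeader(now time.Time) string {
	l.mu.RLock()
	defer l.mu.RUnlock()
	if l.leaderID == "" || now.Sub(l.lastHeartbeat) > 2*l.heartbeatInterval {
		return "" // lease expired: no leader until re-election
	}
	return l.leaderID
}
```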
---

#### Issue 3.3: [Rule] Enforce single leader invariant

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee exactly one leader at any time

**User Story**

As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.

**Acceptance Criteria**

- [ ] At most one leader at any time (lease-based)
- [ ] If leader lease expires, no leader until re-election
- [ ] All nodes see same leader (or none)
- [ ] Tests verify invariant under various failure scenarios
- [ ] Split-brain window bounded by lease TTL (a stale leader stops coordinating once its lease expires)

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)
**Invariant:** At most one leader (enforced by lease TTL)

**Mechanism:**
- Leader publishes heartbeat every HeartbeatInterval
- Other nodes trust leader if heartbeat < HeartbeatInterval old
- If no heartbeat for 2*HeartbeatInterval, lease expired
- New election begins

**Technical Notes**

- Lease-based; not consensus-based (simpler)
- Allows temporary split-brain until lease expires
- Acceptable for Aether (eventual consistency)

**Test Cases**

- Simulate leader death: lease expires, new leader elected
- Simulate network partition: partition may have >1 leader until lease expires
- Verify no coordination during lease expiration

**Dependencies**

- Depends on: Issue 3.2 (leader election)

---

#### Issue 3.4: [Event] Publish LeaderElected on election

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Record leadership election outcomes

**User Story**

As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.

**Acceptance Criteria**

- [ ] LeaderElected event published after successful election
- [ ] Event contains: LeaderID, Term, LeaseExpiration, Timestamp
- [ ] Metrics increment on election
- [ ] Helpful for debugging split-brain scenarios
- [ ] Track election frequency (ideally < 1 per minute)

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** LeaderElected(leaderID, term, leaseExpiration, timestamp)
**Triggered by:** Successful election
**Consumers:** Metrics, audit logs

**Technical Notes**

- Event published locally to all observers
- Infrastructure event (not domain event)

**Test Cases**

- Election happens: event published
- Term increments: event reflects new term
- Metrics accurate

**Dependencies**

- Depends on: Issue 3.2 (election)

---

#### Issue 3.5: [Event] Publish LeadershipLost on lease expiration

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track leadership transitions

**User Story**

As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.

**Acceptance Criteria**

- [ ] LeadershipLost event published when lease expires
- [ ] Event contains: PreviousLeaderID, Timestamp, Reason
- [ ] Metrics track leadership transitions
- [ ] Helpful for debugging cascading failures

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** LeadershipLost(previousLeaderID, timestamp, reason)
**Reason:** "lease_expired", "node_failed", etc.
**Technical Notes**

- Published when lease TTL expires
- Useful for observability

**Test Cases**

- Leader lease expires: LeadershipLost published
- Metrics show transition

**Dependencies**

- Depends on: Issue 3.2 (election)

---

#### Issue 3.6: [Read Model] Implement GetClusterTopology query

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Query current cluster members and status

**User Story**

As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.

**Acceptance Criteria**

- [ ] GetNodes() returns map[nodeID]*NodeInfo
- [ ] NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
- [ ] Status is: Active, Degraded, Failed
- [ ] LastSeen is accurate heartbeat timestamp
- [ ] ShardIDs show shard ownership (filled in Phase 3b)
- [ ] Example: "node-a is active; node-b failed 30s ago"

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Read Model:** ClusterTopology
**Data:**
- NodeID → NodeInfo (status, heartbeat, shards)
- LeaderID (current leader)
- Term (election term)

**Technical Notes**

- ClusterManager maintains topology in-memory
- Update on each heartbeat/announcement

**Test Cases**

- GetNodes() returns active nodes
- Status accurate (Active, Failed, etc.)
- LastSeen updates on heartbeat
- Rejoining node updates existing entry

**Dependencies**

- Depends on: Issue 3.1 (node discovery)

---

#### Issue 3.7: [Read Model] Implement GetLeader query

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Query current leader

**User Story**

As a client, I want to know who the leader is, so that I can route coordination requests to the right node.

**Acceptance Criteria**

- [ ] GetLeader() returns current leader NodeID or ""
- [ ] IsLeader() returns true if this node is leader
- [ ] Both consistent with LeaderElection state
- [ ] Updated immediately on election
- [ ] Example: "node-b is leader (term 5)"

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Read Model:** LeadershipRegistry
**Data:** CurrentLeader, CurrentTerm, LeaseExpiration

**Implementation:**
- LeaderElection maintains this
- ClusterManager queries it

**Technical Notes**

- Critical for routing coordination work
- Must be consistent across cluster

**Test Cases**

- No leader: GetLeader returns ""
- Leader elected: GetLeader returns leader ID
- IsLeader true on leader, false on others
- Changes on re-election

**Dependencies**

- Depends on: Issue 3.2 (election)

---

### Feature Set 3b: Shard Distribution

**Capability:** Distribute Actors Across Cluster Nodes

**Description:** Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.

**Success Condition:** 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.

---

#### Issue 3.8: [Command] Implement consistent hash ring

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Distribute shards across nodes with minimal reshuffling

**User Story**

As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.
**Acceptance Criteria**

- [ ] ConsistentHashRing(numShards=1024) creates ring
- [ ] GetShard(actorID) returns consistent shard [0, 1024)
- [ ] AddNode(nodeID) rebalances ~numShards/numNodes shards
- [ ] RemoveNode(nodeID) rebalances shards evenly
- [ ] Same actor always maps to same shard
- [ ] Reshuffling < 40% on node add/remove

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** AssignShards(nodes)
**Aggregates:** ConsistentHashRing (distribution algorithm)
**Invariants:**
- Each shard [0, 1024) assigned to exactly one node
- ActorID hashes consistently to shard
- Topology changes minimize reassignment

**Technical Notes**

- hashring.go already implements this
- Use crypto/md5 or compatible hash
- 1024 shards is tunable (P1 default)

**Test Cases**

- Single node: all shards assigned to it
- Two nodes: ~512 shards each
- Three nodes: ~341 shards each
- Add fourth node: ~256 shards each (~25% reshuffled)
- Remove node: remaining nodes rebalance evenly
- Same actor-id always hashes to same shard

**Dependencies**

- Depends on: Issue 3.1 (node discovery)

---

#### Issue 3.9: [Rule] Enforce single shard owner invariant

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee each shard has exactly one owner

**User Story**

As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.

**Acceptance Criteria**

- [ ] ShardMap tracks shard → nodeID assignment
- [ ] No shard is unassigned (every shard has owner)
- [ ] No shard assigned to multiple nodes
- [ ] Reassignment is atomic (no in-between state)
- [ ] Tests verify invariant after topology changes

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)
**Invariant:** Each shard [0, 1024) assigned to exactly one active node

**Mechanism:**
- ShardMap[shardID] = [nodeID]
- Maintained by leader
- Updated atomically on rebalancing

**Technical Notes**

- shard.go implements ShardManager
- Validated after each rebalancing

**Test Cases**

- After rebalancing: all shards assigned
- No orphaned shards
- No multiply-assigned shards
- Reassignment is atomic

**Dependencies**

- Depends on: Issue 3.8 (consistent hashing)
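A sketch of the routing side of these two issues: actorID → shard via a stable hash, then shard → node via the shard map the leader maintains. FNV-1a is used here only for brevity (hashring.go may use a different hash), and the map shape is an assumption; needs a "hash/fnv" import:

```go
// getShard reduces a stable hash of the actor ID modulo the shard count,
// so the same actor always lands on the same shard.
func getShard(actorID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID)) // fnv's Write never returns an error
	return h.Sum32() % numShards
}

// routeActor resolves the full routing decision actorID -> shard -> owning
// node using the shard map produced by AssignShards.
func routeActor(actorID string, numShards uint32, shardMap map[uint32]string) (string, bool) {
	nodeID, ok := shardMap[getShard(actorID, numShards)]
	return nodeID, ok
}
```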
---

#### Issue 3.10: [Event] Publish ShardAssigned on assignment

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track shard-to-node assignments

**User Story**

As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.

**Acceptance Criteria**

- [ ] ShardAssigned event published after assignment
- [ ] Event contains: ShardID, NodeID, Timestamp
- [ ] Metrics track: shards per node, rebalancing frequency
- [ ] Example: Shard 42 assigned to node-b

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** ShardAssigned(shardID, nodeID, timestamp)
**Triggered by:** AssignShards command succeeds
**Metrics:** Shards per node, distribution evenness

**Technical Notes**

- Infrastructure event
- Useful for monitoring load distribution

**Test Cases**

- Assignment published on rebalancing
- Metrics reflect distribution

**Dependencies**

- Depends on: Issue 3.9 (shard ownership)

---

#### Issue 3.11: [Read Model] Implement GetShardAssignments query

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Query shard-to-node mapping

**User Story**

As a client, I want to know which node owns a shard, so that I can route actor requests correctly.

**Acceptance Criteria**

- [ ] GetShardAssignments() returns ShardMap
- [ ] ShardMap[shardID] returns owning nodeID
- [ ] GetShard(actorID) returns shard for actor
- [ ] Routing decision: actorID → shard → nodeID
- [ ] Cached locally; refreshed on each rebalancing

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Query)
**Read Model:** ShardMap
**Data:**
- ShardID → NodeID (primary owner)
- Version (incremented on rebalancing)
- UpdateTime

**Implementation:**
- ClusterManager.GetShardMap()
- Cached; updated on assignment changes

**Technical Notes**

- Critical for routing
- Must be consistent across cluster
- Version helps detect stale caches

**Test Cases**

- GetShardAssignments returns current map
- GetShard(actorID) returns consistent shard
- Routing: actor ID → shard → node owner

**Dependencies**

- Depends on: Issue 3.9 (shard ownership)

---

### Feature Set 3c: Failure Detection and Recovery

**Capability:** Recover from Node Failures

**Description:** Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.

**Success Condition:** Node dies → failure detected within 90s → shards reassigned → actors replay automatically.

---

#### Issue 3.12: [Command] Implement node health checks

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Detect node failures via heartbeat timeout

**User Story**

As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.
**Acceptance Criteria**

- [ ] Each node publishes heartbeat every 30s
- [ ] Nodes without heartbeat for 90s marked as Failed
- [ ] checkNodeHealth() runs every 30s
- [ ] Failed node's status updates atomically
- [ ] Tests verify failure detection timing
- [ ] Failed node can rejoin cluster

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** MarkNodeFailed(nodeID)
**Trigger:** monitorNodes detects missing heartbeat
**Events:** NodeFailed(nodeID, lastSeenTimestamp)

**Technical Notes**

- monitorNodes() loop in manager.go
- Check LastSeen timestamp
- Update status if stale (>90s)

**Test Cases**

- Active node: status stays Active
- No heartbeat for 90s: status → Failed
- Rejoin: status → Active
- Failure detected within 90-120s of the last heartbeat (90s timeout plus up to one 30s check interval)

**Dependencies**

- Depends on: Issue 3.1 (node discovery)

---

#### Issue 3.13: [Command] Implement RebalanceShards after node failure

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Reassign failed node's shards to healthy nodes

**User Story**

As the cluster, I want to reassign failed node's shards automatically, so that actors are available on new nodes.

**Acceptance Criteria**

- [ ] Leader detects node failure
- [ ] Leader triggers RebalanceShards
- [ ] Failed node's shards reassigned evenly
- [ ] No shard left orphaned
- [ ] ShardMap updated atomically
- [ ] Rebalancing completes within 5 seconds

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)
**Command:** RebalanceShards(failedNodeID)
**Aggregates:** ShardMap, ConsistentHashRing
**Events:** RebalanceStarted, ShardMigrated

**Technical Notes**

- Leader only (IsLeader() check)
- Use consistent hashing to assign
- Calculate new assignments atomically

**Test Cases**

- Node-a fails with shards [1, 2, 3]
- Leader reassigns [1, 2, 3] to remaining nodes
- No orphaned shards
- Rebalancing < 5s

**Dependencies**

- Depends on: Issue 3.8 (consistent hashing)
- Depends on: Issue 3.12 (failure detection)

---

#### Issue 3.14: [Rule] Enforce no-orphan invariant

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee all shards have owners after rebalancing

**User Story**

As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.

**Acceptance Criteria**

- [ ] Before rebalancing: verify no orphaned shards
- [ ] After rebalancing: verify all shards assigned
- [ ] Tests fail if invariant violated
- [ ] Rebalancing aborted if invariant would be violated

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)
**Invariant:** All shards [0, 1024) have owners after any rebalancing

**Check:**
- Count assigned shards
- Verify = 1024
- Abort if not

**Technical Notes**

- Validate before committing ShardMap
- Logs errors but doesn't assert (graceful degradation)

**Test Cases**

- Rebalancing completes: all shards assigned
- Orphaned shard detected: rebalancing rolled back
- Tests verify count = 1024

**Dependencies**

- Depends on: Issue 3.13 (rebalancing)
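A sketch of the health-check loop from Issue 3.12, returning the failed node IDs so the leader can trigger the rebalancing in Issue 3.13; NodeInfo follows the read model in Issue 3.6, and the function shape is an assumption (needs a "time" import):

```go
// NodeInfo mirrors the cluster topology read model: identity, status, and
// the timestamp of the most recent heartbeat.
type NodeInfo struct {
	ID       string
	Address  string
	Status   string // "Active", "Degraded", "Failed"
	LastSeen time.Time
}

// checkNodeHealth marks any node whose heartbeat is older than the failure
// timeout (e.g. 90s) as Failed and returns the affected node IDs.
func checkNodeHealth(nodes map[string]*NodeInfo, now time.Time, failureTimeout time.Duration) []string {
	var failed []string
	for id, n := range nodes {
		if n.Status != "Failed" && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = "Failed"
			failed = append(failed, id) // leader will rebalance these nodes' shards
		}
	}
	return failed
}
```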
---

#### Issue 3.15: [Event] Publish NodeFailed on failure detection

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Record node failure for observability

**User Story**

As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.

**Acceptance Criteria**

- [ ] NodeFailed event published when failure detected
- [ ] Event contains: NodeID, LastSeenTimestamp, AffectedShards
- [ ] Metrics track failure frequency
- [ ] Example: "node-a failed; 341 shards affected"

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)
**Triggered by:** checkNodeHealth marks node failed
**Consumers:** Metrics, alerts, audit logs

**Technical Notes**

- Infrastructure event
- AffectedShards helps assess impact

**Test Cases**

- Node failure detected: event published
- Metrics show affected shard count

**Dependencies**

- Depends on: Issue 3.12 (failure detection)

---

#### Issue 3.16: [Event] Publish ShardMigrated on shard movement

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track shard migrations

**User Story**

As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.

**Acceptance Criteria**

- [ ] ShardMigrated event published on each shard movement
- [ ] Event contains: ShardID, FromNodeID, ToNodeID, Status
- [ ] Status: "Started", "InProgress", "Completed", "Failed"
- [ ] Metrics track migration count and duration
- [ ] Example: "Shard 42 migrated from node-a to node-b (2.3s)"

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)
**Status:** Started → InProgress → Completed
**Consumers:** Metrics, progress tracking

**Technical Notes**

- Published for each shard move
- Helps track rebalancing progress
- Useful for SLO monitoring

**Test Cases**

- Shard moves: event published
- Metrics track duration
- Status transitions correct

**Dependencies**

- Depends on: Issue 3.13 (rebalancing)

---

#### Issue 3.17: [Documentation] Document actor migration and replay

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Explain how actors move and recover state

**User Story**

As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.

**Acceptance Criteria**

- [ ] Design doc: cluster/ACTOR_MIGRATION.md
- [ ] Explain shard reassignment process
- [ ] Explain state rebuild via GetEvents + replay
- [ ] Explain snapshot optimization
- [ ] Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
- [ ] Explain out-of-order message handling

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** Documentation
**Content:**
- Shard assignment (consistent hashing)
- Actor discovery (routing via shard map)
- State rebuild (replay from JetStream)
- Snapshots (optional optimization)
- In-flight messages (may arrive before replay completes)

**Examples:**
- Manual failover: reassign shards manually
- Auto failover: leader initiates on failure detection

**Technical Notes**

- Complex topic; good documentation prevents bugs

**Test Cases**

- Documentation is clear
- Examples correct

**Dependencies**

- Depends on: Issue 3.13 (rebalancing)
- Depends on: Phase 1 (event replay)

---

## Phase 4: Namespace Isolation and NATS Event Delivery

### Feature Set 4a: Namespace Storage Isolation

**Capability:** Isolate Logical Domains Using Namespaces

**Description:** Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at persistence layer.
---

## Phase 4: Namespace Isolation and NATS Event Delivery

### Feature Set 4a: Namespace Storage Isolation

**Capability:** Isolate Logical Domains Using Namespaces

**Description:** Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at the persistence layer.

**Success Condition:** Two stores with namespaces "tenant-a" and "tenant-b"; an event saved in "tenant-a" is invisible to "tenant-b" queries.

---

#### Issue 4.1: [Rule] Enforce namespace-based stream naming

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1
**Title:** Use namespace prefixes in JetStream stream names

**User Story**
As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.

**Acceptance Criteria**
- [ ] Namespace "tenant-a" → stream "tenant-a_events"
- [ ] Namespace "tenant-b" → stream "tenant-b_events"
- [ ] Empty namespace → stream "events" (default)
- [ ] JetStreamConfig.Namespace sets the prefix
- [ ] NewJetStreamEventStoreWithNamespace convenience function
- [ ] Tests verify stream names have the namespace prefix

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Configuration)
**Value Object:** Namespace (string identifier)
**Implementation:**
- JetStreamConfig.Namespace field
- StreamName = namespace + "_events" if namespace set
- StreamName = "events" if namespace empty

**Technical Notes**
- Already partially implemented in jetstream.go
- Ensure safe characters (sanitize spaces, dots, wildcards)

**Test Cases**
- NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
- NewJetStreamEventStoreWithNamespace(""): creates stream "events"
- Stream name verified

**Dependencies**
- None (orthogonal to other contexts)

---

#### Issue 4.2: [Rule] Enforce storage-level namespace isolation

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P0
**Title:** Prevent cross-namespace data leakage at the storage layer

**User Story**
As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.

**Acceptance Criteria**
- [ ] SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
- [ ] GetEvents("tenant-a") queries the "tenant-a_events" stream only
- [ ] No possibility of accidental cross-namespace leakage
- [ ] JetStream subject filtering enforces isolation
- [ ] Integration tests verify with multiple namespaces

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)
**Invariant:** Events from namespace X are invisible to namespace Y
**Mechanism:**
- Separate JetStream streams per namespace
- Subject prefixing: "tenant-a.events.actor-123"
- Subscribe filters by subject prefix

**Technical Notes**
- jetstream.go: SubscribeToActorEvents uses the subject prefix
- Consumer created with a subject filter matching the namespace

**Test Cases**
- SaveEvent to tenant-a: visible in tenant-a queries
- Same event invisible to tenant-b queries
- GetLatestVersion scoped to namespace
- GetEvents scoped to namespace
- Multi-namespace integration test

**Dependencies**
- Depends on: Issue 4.1 (stream naming)
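As a point of reference for Issues 4.1 and 4.2, the naming rules reduce to a couple of pure functions. This is an illustrative sketch only; the real logic lives in jetstream.go, and the package and function names here are assumptions.

```go
// Illustrative sketch of the stream-naming and subject-prefix rules from
// Issues 4.1 and 4.2; names are assumptions, not the code in jetstream.go.
package eventstore

// streamName derives the JetStream stream name from a namespace.
// An empty namespace falls back to the shared default stream "events".
func streamName(namespace string) string {
	if namespace == "" {
		return "events"
	}
	return namespace + "_events"
}

// actorSubject builds a namespaced subject such as "tenant-a.events.actor-123",
// matching the subject-prefix isolation described in Issue 4.2.
func actorSubject(namespace, actorID string) string {
	if namespace == "" {
		return "events." + actorID
	}
	return namespace + ".events." + actorID
}
```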
---

#### Issue 4.3: [Documentation] Document namespace design patterns

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1
**Title:** Provide guidance on namespace naming and use

**User Story**
As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.

**Acceptance Criteria**
- [ ] Design doc: NAMESPACE_DESIGN_PATTERNS.md
- [ ] Pattern 1: "tenant-{id}" (per-customer)
- [ ] Pattern 2: "env.domain" (per-env, per-bounded-context)
- [ ] Pattern 3: "env.domain.customer" (most granular)
- [ ] Examples of each pattern
- [ ] Guidance on choosing granularity
- [ ] Anti-patterns (wildcards, spaces, leading/trailing dots)

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** Documentation
**Content:**
- Multi-tenant patterns
- Granularity decisions
- Namespace naming rules
- Examples
- Anti-patterns
- Performance implications

**Examples:**
- SaaS: "tenant-uuid"
- Microservices: "service.orders"
- Complex: "env.service.tenant"

**Technical Notes**
- No hard restrictions; naming is flexible
- Sanitization (spaces → underscores)

**Test Cases**
- Documentation is clear
- Examples valid

**Dependencies**
- Depends on: Issue 4.1 (stream naming)

---

#### Issue 4.4: [Validation] Add namespace format validation (P2)

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P2
**Title:** Validate namespace names to prevent invalid streams

**User Story**
As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.

**Acceptance Criteria**
- [ ] ValidateNamespace(ns string) returns an error for invalid names
- [ ] Rejects: "tenant-*", "tenant a", "tenant."
- [ ] Accepts: "tenant-abc", "prod.orders", "tenant_123"
- [ ] Called on NewJetStreamEventStoreWithNamespace
- [ ] Clear error messages
- [ ] Tests verify validation rules

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Validation)
**Validation Rules:**
- No wildcards (*, >)
- No spaces
- No leading/trailing dots
- Alphanumeric, hyphens, underscores, dots only

**Implementation:**
- ValidateNamespace regex (see the sketch below)
- Called before stream creation

**Technical Notes**
- Nice-to-have; currently strings are accepted as-is
- Could sanitize instead of rejecting (e.g., replace spaces with underscores)

**Test Cases**
- Valid: "tenant-abc", "prod.orders"
- Invalid: "tenant-*", "tenant a", ".prod"
- Error messages clear

**Dependencies**
- Depends on: Issue 4.1 (stream naming)
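A minimal sketch of what the ValidateNamespace regex could look like, assuming the rules and test cases above; the regex and error wording are assumptions, not the shipped implementation.

```go
// Sketch of the ValidateNamespace check proposed in Issue 4.4; the regex and
// error wording are assumptions, not the actual implementation.
package eventstore

import (
	"fmt"
	"regexp"
)

// namespacePattern allows alphanumerics, hyphens, underscores, and interior dots;
// it rejects wildcards, spaces, and leading/trailing dots.
var namespacePattern = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace returns a descriptive error for names that would produce
// invalid JetStream stream names or subjects.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil // empty namespace falls back to the default "events" stream
	}
	if !namespacePattern.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: only alphanumerics, hyphens, underscores, and interior dots are allowed", ns)
	}
	return nil
}
```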
---

### Feature Set 4b: Cross-Node Event Delivery via NATS

**Capability:** Deliver Events Across Cluster Nodes

**Description:** Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.

**Success Condition:** Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).

---

#### Issue 4.5: [Command] Implement NATSEventBus wrapper

**Type:** New Feature
**Bounded Context:** Event Bus (with NATS)
**Priority:** P1
**Title:** Extend EventBus with NATS-native pub/sub

**User Story**
As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.

**Acceptance Criteria**
- [ ] NATSEventBus embeds EventBus
- [ ] Publish(namespace, event) sends to local EventBus AND NATS
- [ ] NATS subject: "aether.events.{namespace}"
- [ ] SubscribeWithFilter works across nodes
- [ ] Self-published events not re-delivered (avoid loops)
- [ ] Tests verify cross-node delivery

**Bounded Context:** Event Bus (NATS extension)

**DDD Implementation Guidance**

**Type:** New Feature (Extension)
**Aggregate:** EventBus extended with NATSEventBus
**Commands:** Publish(namespace, event) [same interface, distributed]
**Implementation:**
- NATSEventBus composes EventBus
- Override Publish to also publish to NATS
- Subscribe to NATS subjects matching the namespace

**Technical Notes**
- nats_eventbus.go already partially implemented
- NATS subject: "aether.events.orders" for namespace "orders"
- Include sourceNodeID in the event to prevent redelivery

**Test Cases**
- Publish on node-a: local subscribers on node-a receive
- Same publish: node-b subscribers receive via NATS
- Self-loop prevented: node-a doesn't re-receive its own publish
- Multi-node: all nodes converge on the same events

**Dependencies**
- Depends on: Issue 2.1 (EventBus.Publish)
- Depends on: Issue 3.1 (cluster setup for multi-node tests)

---

#### Issue 4.6: [Rule] Enforce exactly-once delivery across cluster

**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1
**Title:** Guarantee events delivered to all cluster subscribers

**User Story**
As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.

**Acceptance Criteria**
- [ ] Event published to NATS with a JetStream consumer
- [ ] Consumer acknowledges delivery
- [ ] Redelivery on network failure (JetStream handles)
- [ ] No duplicate delivery to the same subscriber
- [ ] All nodes see the same events in the same order

**Bounded Context:** Event Bus (NATS)

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)
**Invariant:** Exactly-once delivery to each subscriber
**Mechanism:**
- JetStream consumer per subscriber group
- Acknowledgment on delivery
- Automatic redelivery on timeout

**Technical Notes**
- JetStream handles durability and ordering
- Consumer name = subscriber ID
- Push consumer model (events pushed to subscriber)

**Test Cases**
- Publish event: all subscribers receive it once
- Network failure: redelivery after timeout
- No duplicates on subscriber
- Order preserved across nodes

**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
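Since nats_eventbus.go is only partially implemented per the note above, the following is a hedged sketch of the publish and self-loop-prevention flow from Issues 4.5 and 4.6. The `localBus` and `natsConn` interfaces and all field names are stand-ins invented for this example, not the actual Aether or NATS client API.

```go
// Hedged sketch of the NATSEventBus.Publish flow: publish locally first, then
// mirror to a namespaced NATS subject, tagging the source node so the bus can
// drop its own messages when they come back. Names are assumptions only.
package eventbus

import (
	"encoding/json"
	"fmt"
)

// Event carries the minimum this sketch needs; the real event type is richer.
type Event struct {
	ID           string `json:"id"`
	Namespace    string `json:"namespace"`
	SourceNodeID string `json:"sourceNodeId"`
	Payload      []byte `json:"payload"`
}

// localBus and natsConn stand in for the local EventBus and the NATS client.
type localBus interface {
	Publish(namespace string, e Event) error
}

type natsConn interface {
	Publish(subject string, data []byte) error
}

// NATSEventBus wraps the local bus and mirrors publishes to NATS.
type NATSEventBus struct {
	local  localBus
	nats   natsConn
	nodeID string
}

// Publish delivers to local subscribers and to "aether.events.{namespace}".
func (b *NATSEventBus) Publish(namespace string, e Event) error {
	e.SourceNodeID = b.nodeID
	if err := b.local.Publish(namespace, e); err != nil {
		return err
	}
	data, err := json.Marshal(e)
	if err != nil {
		return err
	}
	return b.nats.Publish(fmt.Sprintf("aether.events.%s", namespace), data)
}

// handleRemote is the NATS subscription callback: events that originated on
// this node are dropped to avoid the self-delivery loop.
func (b *NATSEventBus) handleRemote(data []byte) {
	var e Event
	if err := json.Unmarshal(data, &e); err != nil || e.SourceNodeID == b.nodeID {
		return
	}
	_ = b.local.Publish(e.Namespace, e)
}
```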
---

#### Issue 4.7: [Event] Publish EventPublished (via NATS)

**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P2
**Title:** Route published events to NATS subjects

**User Story**
As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.

**Acceptance Criteria**
- [ ] EventPublished event published to NATS
- [ ] Subject: "aether.events.{namespace}.published"
- [ ] Message contains: eventID, timestamp, sourceNodeID
- [ ] Metrics track: events published, delivered, dropped
- [ ] Helps identify partition/latency issues

**Bounded Context:** Event Bus (NATS)

**DDD Implementation Guidance**

**Type:** New Feature (Event)
**Event:** EventPublished (infrastructure)
**Subject:** aether.events.{namespace}.published
**Consumers:** Metrics, monitoring

**Technical Notes**
- Published after the NATS publish succeeds
- Separate from the local EventPublished (for clarity)

**Test Cases**
- Publish event: EventPublished message on NATS
- Metrics count delivery
- Cross-node visibility works

**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)

---

#### Issue 4.8: [Read Model] Implement cross-node subscription

**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1
**Title:** Receive events from other nodes via NATS

**User Story**
As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.

**Acceptance Criteria**
- [ ] NATSEventBus.Subscribe(namespace) receives local + NATS events
- [ ] SubscribeWithFilter works with NATS
- [ ] Events from the local node: delivered via the local EventBus
- [ ] Events from remote nodes: delivered via the NATS consumer
- [ ] Subscriber sees a unified stream (no duplication)

**Bounded Context:** Event Bus (NATS)

**DDD Implementation Guidance**

**Type:** New Feature (Query/Subscription)
**Read Model:** UnifiedEventStream (local + remote)
**Implementation:**
- Subscribe creates a local channel
- NATSEventBus subscribes to the NATS subject
- Both feed into the subscriber channel

**Technical Notes**
- Unified view is transparent to the subscriber
- No need to know whether an event is local or remote

**Test Cases**
- Subscribe to namespace: receive local events
- Subscribe to namespace: receive remote events
- Filter works across both sources
- No duplication

**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)

---

## Summary

This backlog contains **67 executable issues** across **5 bounded contexts** organized into **4 implementation phases**. Each issue:

- Is decomposed using DDD-informed order (commands → rules → events → reads)
- References domain concepts (aggregates, commands, events, value objects)
- Includes acceptance criteria (testable, specific)
- States dependencies (enabling parallel work)
- Is sized to 1-3 days of work

**Recommended Build Order:**

1. **Phase 1** (17 issues): Event Sourcing Foundation - everything depends on this
2. **Phase 2** (9 issues): Local Event Bus - enables observability before clustering
3. **Phase 3** (20 issues): Cluster Coordination - enables distributed deployment
4. **Phase 4** (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery

**Total Scope:** ~67-200 dev-days of work (67 issues at 1-3 days each; conservatively 10-15 weeks for a small team)

---

## Next Steps

1. Create Gitea issues from this backlog
2. Assign them to team members
3. Set up dependency tracking in Gitea
4. Use the `/spawn-issues` skill to parallelize implementation
5. Iterate on acceptance criteria with domain experts

See the `/issue-writing` skill for proper issue formatting in Gitea.