# Aether Executable Backlog
**Built from:** 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition
**Date:** 2026-01-12
---
## Backlog Overview
This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.
**Total Scope:**
- **Capabilities:** 9 (all complete)
- **Features:** 14
- **Issues:** 67
- **Contexts:** 5
- **Implementation Phases:** 4
**Build Order (by value and dependencies):**
1. **Phase 1: Event Sourcing Foundation** (Capabilities 1-3)
- Issues: 17
- Enables all other work
2. **Phase 2: Local Event Bus** (Capability 8)
- Issues: 9
- Enables local pub/sub before clustering
3. **Phase 3: Cluster Coordination** (Capabilities 5-7)
- Issues: 20
- Enables distributed deployment
4. **Phase 4: Namespace & NATS** (Capabilities 4, 9)
- Issues: 21
- Enables multi-tenancy and cross-node delivery
---
## Phase 1: Event Sourcing Foundation
### Feature Set 1a: Event Storage with Version Conflict Detection
**Capability:** Store Events Durably with Conflict Detection
**Description:** Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.
**Success Condition:** Multiple writers attempt to update the same actor; the first wins, the others see a VersionConflictError with details; every accepted write lands in the immutable history.
---
#### Issue 1.1: [Command] Implement SaveEvent with monotonic version validation
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely
**User Story**
As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.
**Acceptance Criteria**
- [ ] SaveEvent accepts event with Version > current for actor
- [ ] SaveEvent rejects event with Version <= current (returns VersionConflictError)
- [ ] VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
- [ ] First event for new actor must have Version > 0 (typically 1)
- [ ] Version gaps are allowed (1, 3, 5 is valid)
- [ ] Validation happens before persistence (fail-fast)
- [ ] InMemoryEventStore and JetStreamEventStore both implement validation
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Core)
**Aggregate:** ActorEventStream (implicit; each actor has independent version sequence)
**Command:** SaveEvent(event)
**Validation Rules:**
- If no events exist for actor: version must be > 0
- If events exist: new version must be > latest version
**Success Event:** EventStored (published when SaveEvent succeeds)
**Error Event:** VersionConflict (triggered when version validation fails)
**Technical Notes**
- Version validation is the core invariant; everything else depends on it
- Use `GetLatestVersion()` to implement validation
- No database-level locks; optimistic validation only
- Conflict should fail in <1ms
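A minimal Go sketch of the validation path for the in-memory store, assuming illustrative type shapes; the canonical `Event`, `VersionConflictError`, and store implementations live in the aether package and may differ:

```go
package sketch

import "fmt"

// Illustrative types; the canonical definitions live in the aether package.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

type VersionConflictError struct {
	ActorID                          string
	AttemptedVersion, CurrentVersion int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

type InMemoryEventStore struct {
	events        map[string][]*Event
	latestVersion map[string]int64
}

// SaveEvent enforces the monotonic-version invariant before persisting.
// Synchronization is omitted for brevity (see Issue 1.5 for the RWMutex notes).
func (s *InMemoryEventStore) SaveEvent(e *Event) error {
	current := s.latestVersion[e.ActorID] // 0 when the actor has no events yet
	if e.Version <= current {
		// Fail fast with full context; no retry loop, no database lock.
		return &VersionConflictError{
			ActorID:          e.ActorID,
			AttemptedVersion: e.Version,
			CurrentVersion:   current,
		}
	}
	s.events[e.ActorID] = append(s.events[e.ActorID], e)
	s.latestVersion[e.ActorID] = e.Version
	return nil
}
```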
**Test Cases**
- New actor, version 1: succeeds
- Same actor, version 2 (after 1): succeeds
- Same actor, version 2 (after 1, concurrent): second call fails
- Same actor, version 1 (duplicate): fails
- Same actor, version 0 or negative: fails
- 100 concurrent writers: 1 succeeds, 99 fail
**Dependencies**
- None (foundation)
---
#### Issue 1.2: [Rule] Enforce append-only and immutability invariants
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Enforce event immutability and append-only semantics
**User Story**
As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.
**Acceptance Criteria**
- [ ] EventStore interface has no Update or Delete methods
- [ ] Events cannot be modified after persistence
- [ ] Replay of same events always produces same state
- [ ] Corrupted events are reported (not silently skipped)
- [ ] JetStream stream configuration prevents deletes (retention policy only)
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Core Invariant)
**Aggregate:** ActorEventStream
**Invariant:** Events are immutable; stream is append-only; no modifications to EventStore interface
**Implementation:**
- Event struct has no Setters (only getters)
- SaveEvent is the only public persistence method
- JetStream streams configured with `NoDelete` policy
**Technical Notes**
- This is enforced at interface level (no Update/Delete in EventStore)
- JetStream configuration prevents accidental deletes
- ReplayError allows visibility into corruption without losing good data
**Test Cases**
- Attempt to modify Event.Data after creation: compile error (if immutable)
- Attempt to call UpdateEvent: interface doesn't exist
- JetStream stream created with correct retention policy
- ReplayError captured when event unmarshaling fails
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent implementation)
---
#### Issue 1.3: [Event] Publish EventStored after successful save
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Emit EventStored event for persistence observability
**User Story**
As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).
**Acceptance Criteria**
- [ ] EventStored event published after SaveEvent succeeds
- [ ] EventStored contains: EventID, ActorID, Version, Timestamp
- [ ] No EventStored published if SaveEvent fails
- [ ] EventBus receives EventStored in same transaction context
- [ ] Metrics increment for each EventStored
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature
**Event:** EventStored(eventID, actorID, version, timestamp)
**Triggered by:** Successful SaveEvent call
**Consumers:** Metrics collectors, projections, audit systems
**Technical Notes**
- EventStored is an internal event (Aether infrastructure)
- Published to local EventBus (see Phase 2 for cross-node)
- Allows observability without coupling application code
**Test Cases**
- Save event → EventStored published
- Version conflict → no EventStored published
- Multiple saves → multiple EventStored events in order
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
- Depends on: Phase 2, Issue 2.1 (EventBus.Publish)
---
#### Issue 1.4: [Event] Publish VersionConflict error with full context
**Type:** New Feature
**Bounded Context:** Event Sourcing, Optimistic Concurrency Control
**Priority:** P0
**Title:** Return detailed version conflict information for retry logic
**User Story**
As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).
**Acceptance Criteria**
- [ ] VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
- [ ] Error message is human-readable with all context
- [ ] errors.Is(err, ErrVersionConflict) returns true for the sentinel check
- [ ] errors.As(err, &versionErr) allows unpacking to VersionConflictError
- [ ] Application can read CurrentVersion to decide retry strategy
**Bounded Context:** Event Sourcing + OCC
**DDD Implementation Guidance**
**Type:** New Feature
**Error Type:** VersionConflictError (wraps ErrVersionConflict sentinel)
**Data:** ActorID, AttemptedVersion, CurrentVersion
**Use:** Application uses this to implement retry strategies
**Technical Notes**
- Already implemented in `/aether/event.go` (VersionConflictError struct)
- Document standard retry patterns in examples/
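A caller-side sketch of the intended error handling, assuming the sentinel/wrapper relationship described above (field names illustrative; the real definitions are in `/aether/event.go`):

```go
package sketch

import (
	"errors"
	"fmt"
)

// Assumed shapes mirroring the acceptance criteria above.
var ErrVersionConflict = errors.New("version conflict")

type VersionConflictError struct {
	ActorID                          string
	AttemptedVersion, CurrentVersion int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// Unwrap lets errors.Is match the sentinel through the wrapper.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

// handleSaveError shows the caller-side pattern: sentinel check, then unpack.
func handleSaveError(err error) {
	if !errors.Is(err, ErrVersionConflict) {
		return // some other failure
	}
	var vce *VersionConflictError
	if errors.As(err, &vce) {
		// CurrentVersion drives the retry decision (reload state, re-apply, retry).
		fmt.Printf("retry from version %d for actor %s\n", vce.CurrentVersion, vce.ActorID)
	}
}
```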
**Test Cases**
- Conflict with detailed error: ActorID, versions present
- Application reads CurrentVersion: succeeds
- errors.Is(err, ErrVersionConflict): true
- errors.As(err, &versionErr): works
- Manual test: log the error, see all context
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
#### Issue 1.5: [Read Model] Implement GetLatestVersion query
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Provide efficient version lookup for optimistic locking
**User Story**
As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.
**Acceptance Criteria**
- [ ] GetLatestVersion(actorID) returns latest version or 0 if no events
- [ ] Execution time is O(1) or O(log n), not O(n)
- [ ] InMemoryEventStore implements with map lookup
- [ ] JetStreamEventStore caches latest version per actor
- [ ] Cache is invalidated after each SaveEvent
- [ ] Multiple calls for same actor within 1s hit cache
- [ ] Namespace isolation: GetLatestVersion scoped to namespace
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** ActorVersionIndex
**Source Events:** SaveEvent (updates cache)
**Data:** ActorID → LatestVersion
**Performance:** O(1) lookup after SaveEvent
**Technical Notes**
- InMemoryEventStore: use map[actorID]int64
- JetStreamEventStore: query JetStream metadata OR maintain cache
- Cache invalidation: update after every SaveEvent
- Thread-safe with RWMutex (read-heavy)
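A minimal sketch of the cached version index behind GetLatestVersion, assuming an in-memory map guarded by an RWMutex (names illustrative):

```go
package sketch

import "sync"

// Illustrative in-memory version index; the real store may differ.
type actorVersionIndex struct {
	mu       sync.RWMutex
	versions map[string]int64 // actorID -> latest version
}

// Get returns the latest version, or 0 when the actor has no events.
func (i *actorVersionIndex) Get(actorID string) int64 {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.versions[actorID] // missing key yields the zero value, 0
}

// Update is called after every successful SaveEvent to keep the cache fresh.
func (i *actorVersionIndex) Update(actorID string, version int64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if version > i.versions[actorID] {
		i.versions[actorID] = version
	}
}
```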
**Test Cases**
- New actor: GetLatestVersion returns 0
- After SaveEvent(version: 1): GetLatestVersion returns 1
- After SaveEvent(version: 3): GetLatestVersion returns 3
- Concurrent reads from same actor: all return consistent value
- Namespace isolation: "tenant-a" and "tenant-b" have independent versions
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
### Feature Set 1b: State Rebuild from Event History
**Capability:** Rebuild State from Event History
**Description:** Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.
**Success Condition:** GetEvents(actorID, 0) returns all events in order; replaying produces consistent state every time; snapshots reduce replay time from O(n) to O(1).
---
#### Issue 1.6: [Command] Implement GetEvents for replay
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Load events from store for state replay
**User Story**
As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.
**Acceptance Criteria**
- [ ] GetEvents(actorID, fromVersion) returns []*Event in version order
- [ ] Events are ordered by version (ascending)
- [ ] fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
- [ ] If no events exist, returns empty slice (not error)
- [ ] If actorID has no events >= fromVersion, returns empty slice
- [ ] Namespace isolation: GetEvents scoped to namespace
- [ ] Large result sets don't cause memory issues (stream if >10k events)
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Command:** GetEvents(actorID, fromVersion)
**Returns:** []*Event ordered by version
**Invariant:** Order is deterministic (version order always)
**Technical Notes**
- InMemoryEventStore: filter and sort by version
- JetStreamEventStore: query JetStream subject and order results
- Consider pagination for very large actor histories
- fromVersion=0 means "start from beginning"
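A sketch of the replay pattern GetEvents enables, assuming an illustrative EventStore signature and a hypothetical OrderState aggregate:

```go
package sketch

// Illustrative shapes; the real EventStore interface lives in the aether package.
type Event struct {
	ActorID string
	Version int64
	Type    string
	Data    []byte
}

type EventStore interface {
	GetEvents(actorID string, fromVersion int64) ([]*Event, error)
}

// OrderState is a hypothetical application aggregate rebuilt by replay.
type OrderState struct {
	Version int64
	Placed  bool
}

// rebuild replays events from the beginning (fromVersion = 0 means start of history).
func rebuild(store EventStore, actorID string) (*OrderState, error) {
	events, err := store.GetEvents(actorID, 0)
	if err != nil {
		return nil, err
	}
	state := &OrderState{}
	for _, e := range events { // returned in ascending version order
		switch e.Type {
		case "OrderPlaced":
			state.Placed = true
		}
		state.Version = e.Version
	}
	return state, nil
}
```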
**Test Cases**
- GetEvents(actorID, 0) with 5 events: returns all 5 in order
- GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
- GetEvents(nonexistent, 0): returns empty slice
- GetEvents with gap (versions 1, 3, 5): returns only those 3
- Order is guaranteed (version order, not insertion order)
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
#### Issue 1.7: [Rule] Define and enforce snapshot validity
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Implement snapshot invalidation policy
**User Story**
As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.
**Acceptance Criteria**
- [ ] Snapshot valid until Version + MaxVersionGap (default 1000)
- [ ] GetLatestSnapshot returns nil if no snapshot or invalid
- [ ] Application can override MaxVersionGap in config
- [ ] Snapshot timestamp recorded for debugging
- [ ] No automatic cleanup; application calls SaveSnapshot to create
- [ ] Tests confirm snapshot invalidation logic
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Policy)
**Aggregate:** ActorSnapshot + SnapshotPolicy
**Policy:** Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap
**Implementation:**
- SnapshotStore.GetLatestSnapshot validates before returning
- If invalid, returns nil; application must replay
**Technical Notes**
- This is a safety policy; prevents stale snapshots
- Application owns decision to create snapshots (no auto-triggering)
- MaxVersionGap is tunable per deployment
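A sketch of the validity check, assuming illustrative snapshot fields; GetLatestSnapshot would apply this before returning:

```go
package sketch

// Illustrative snapshot shape; the real ActorSnapshot lives in event.go.
type ActorSnapshot struct {
	ActorID string
	Version int64
	State   []byte
}

// snapshotValid applies the policy: a snapshot is usable only while the
// version gap stays within maxVersionGap.
// Example: snapshot at 10, maxVersionGap 100 -> valid at current 110, invalid at 111.
func snapshotValid(snap *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	if snap == nil {
		return false
	}
	return currentVersion-snap.Version <= maxVersionGap
}
```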
**Test Cases**
- Snapshot at version 10, MaxGap=100, current=50: valid
- Snapshot at version 10, MaxGap=100, current=111: invalid
- Snapshot at version 10, MaxGap=100, current=110: valid
- GetLatestSnapshot returns nil for invalid snapshot
**Dependencies**
- Depends on: Issue 1.6 (GetEvents)
---
#### Issue 1.8: [Event] Publish SnapshotCreated for observability
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Emit snapshot creation event for lifecycle tracking
**User Story**
As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.
**Acceptance Criteria**
- [ ] SnapshotCreated event published after SaveSnapshot succeeds
- [ ] Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
- [ ] Metrics increment for snapshot creation
- [ ] No event if SaveSnapshot fails
- [ ] Example: Snapshot created every 1000 versions
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** SnapshotCreated(actorID, version, timestamp, replayDurationMs)
**Triggered by:** SaveSnapshot call succeeds
**Consumers:** Metrics, monitoring dashboards
**Technical Notes**
- SnapshotCreated is infrastructure event (like EventStored)
- ReplayDuration helps identify slow actors needing snapshots more frequently
**Test Cases**
- SaveSnapshot succeeds → SnapshotCreated published
- SaveSnapshot fails → no event published
- ReplayDuration recorded accurately
**Dependencies**
- Depends on: Issue 1.7 (SnapshotStore interface)
---
#### Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Handle corrupted events during replay without data loss
**User Story**
As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.
**Acceptance Criteria**
- [ ] GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
- [ ] ReplayResult contains: []*Event (good) and []ReplayError (bad)
- [ ] Good events are returned in order despite errors
- [ ] ReplayError contains: SequenceNumber, RawData, UnmarshalError
- [ ] Application decides how to handle corrupted events
- [ ] Metrics track corruption frequency
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Interface:** EventStoreWithErrors extends EventStore
**Method:** GetEventsWithErrors(actorID, fromVersion) → ReplayResult
**Data:**
- ReplayResult.Events: successfully deserialized events
- ReplayResult.Errors: corruption records
- ReplayResult.HasErrors(): convenience check
**Technical Notes**
- Already defined in event.go (ReplayError, ReplayResult)
- JetStreamEventStore should implement EventStoreWithErrors
- Application uses HasErrors() to decide on recovery action
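A caller-side sketch of consuming a ReplayResult, assuming shapes that mirror the description above (the canonical types live in event.go):

```go
package sketch

import "log"

// Assumed shapes mirroring ReplayResult / ReplayError described above.
type Event struct {
	ActorID string
	Version int64
}

type ReplayError struct {
	SequenceNumber uint64
	RawData        []byte
	UnmarshalError error
}

type ReplayResult struct {
	Events []*Event
	Errors []ReplayError
}

func (r *ReplayResult) HasErrors() bool { return len(r.Errors) > 0 }

// processReplay shows the intended pattern: use the good events, surface the
// corrupt ones for forensics instead of aborting the replay.
func processReplay(res *ReplayResult, apply func(*Event)) {
	for _, e := range res.Events {
		apply(e) // apply good events to state as usual
	}
	if res.HasErrors() {
		for _, re := range res.Errors {
			log.Printf("corrupt event at seq %d: %v (raw %d bytes)",
				re.SequenceNumber, re.UnmarshalError, len(re.RawData))
		}
	}
}
```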
**Test Cases**
- All good events: ReplayResult.Events populated, no errors
- Corrupted event in middle: good events before/after, one error recorded
- Multiple corruptions: all recorded with context
- Application can inspect RawData for forensics
**Dependencies**
- Depends on: Issue 1.6 (GetEvents)
---
#### Issue 1.10: [Interface] Implement SnapshotStore interface
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Define snapshot storage contract
**User Story**
As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).
**Acceptance Criteria**
- [ ] SnapshotStore extends EventStore
- [ ] GetLatestSnapshot(actorID) returns ActorSnapshot or nil
- [ ] SaveSnapshot(snapshot) persists snapshot
- [ ] ActorSnapshot contains: ActorID, Version, State, Timestamp
- [ ] Namespace isolation: snapshots scoped to namespace
- [ ] Tests verify interface contract
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Interface)
**Interface:** SnapshotStore extends EventStore
**Methods:**
- GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
- SaveSnapshot(snapshot) → error
**Aggregates:** ActorSnapshot (value object)
**Technical Notes**
- Already defined in event.go
- Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
- Keep snapshots in same store as events (co-located)
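An approximate mirror of the contract plus one caller pattern (snapshot first, then replay the tail); the signatures here are assumptions, and event.go remains the source of truth:

```go
package sketch

import "time"

// Approximate shapes; event.go is the source of truth.
type Event struct {
	ActorID string
	Version int64
}

type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

type EventStore interface {
	SaveEvent(e *Event) error
	GetEvents(actorID string, fromVersion int64) ([]*Event, error)
	GetLatestVersion(actorID string) (int64, error)
}

type SnapshotStore interface {
	EventStore
	// GetLatestSnapshot returns nil (no error) when no valid snapshot exists.
	GetLatestSnapshot(actorID string) (*ActorSnapshot, error)
	SaveSnapshot(snapshot *ActorSnapshot) error
}

// loadState prefers a snapshot and replays only the events after it.
func loadState(store SnapshotStore, actorID string) (*ActorSnapshot, []*Event, error) {
	snap, err := store.GetLatestSnapshot(actorID)
	if err != nil {
		return nil, nil, err
	}
	from := int64(0)
	if snap != nil {
		from = snap.Version + 1 // replay only the tail beyond the snapshot
	}
	tail, err := store.GetEvents(actorID, from)
	return snap, tail, err
}
```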
**Test Cases**
- SaveSnapshot persists; GetLatestSnapshot retrieves it
- New actor: GetLatestSnapshot returns nil
- Multiple snapshots: only latest returned
- Namespace isolation: snapshots from tenant-a don't appear in tenant-b
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent + storage foundation)
---
### Feature Set 1c: Optimistic Concurrency Control
**Capability:** Enable Safe Concurrent Writes
**Description:** Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.
**Success Condition:** Two concurrent writers race; one succeeds, other sees VersionConflictError; application retries without locks.
---
#### Issue 1.11: [Rule] Enforce fail-fast on version conflict
**Type:** New Feature
**Bounded Context:** Optimistic Concurrency Control
**Priority:** P0
**Title:** Fail immediately on version conflict; no auto-retry
**User Story**
As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).
**Acceptance Criteria**
- [ ] SaveEvent returns VersionConflictError immediately on mismatch
- [ ] No built-in retry loop in SaveEvent
- [ ] No database-level locks held
- [ ] Application reads VersionConflictError and decides retry
- [ ] Default retry strategy documented (examples/)
**Bounded Context:** Optimistic Concurrency Control
**DDD Implementation Guidance**
**Type:** New Feature (Policy)
**Invariant:** Conflicts trigger immediate failure; application owns retry
**Implementation:**
- SaveEvent: version check, return error if mismatch, done
- No loop, no backoff, no retries
- Clean error with context for caller
**Technical Notes**
- This is a design choice: fail-fast enables flexible retry strategies
- Application can choose exponential backoff, jitter, circuit-breaker, etc.
**Test Cases**
- SaveEvent(version: 2) when current=2: fails immediately
- No retry attempted by library
- Application can retry if desired
- Example patterns in examples/retry.go
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
#### Issue 1.12: [Documentation] Document concurrent write patterns
**Type:** New Feature
**Bounded Context:** Optimistic Concurrency Control
**Priority:** P1
**Title:** Provide retry strategy examples (backoff, circuit-breaker, queue)
**User Story**
As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.
**Acceptance Criteria**
- [ ] examples/retry_exponential_backoff.go
- [ ] examples/retry_circuit_breaker.go
- [ ] examples/retry_queue_based.go
- [ ] examples/concurrent_write_test.go showing patterns
- [ ] README mentions OCC patterns
- [ ] Each example is >100 lines with explanation
**Bounded Context:** Optimistic Concurrency Control
**DDD Implementation Guidance**
**Type:** Documentation
**Artifacts:**
- examples/retry_exponential_backoff.go
- examples/retry_circuit_breaker.go
- examples/retry_queue_based.go
- examples/concurrent_write_test.go
**Content:**
- How to read VersionConflictError
- When to retry (idempotent operations)
- When not to retry (non-idempotent)
- Backoff strategies
- Monitoring
**Technical Notes**
- Real, runnable code (not pseudocode)
- Show metrics collection
- Show when to give up
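A compact sketch of the exponential-backoff pattern these examples should demonstrate, assuming the ErrVersionConflict sentinel and an idempotent attempt function supplied by the application:

```go
package sketch

import (
	"errors"
	"math/rand"
	"time"
)

// Assumed sentinel; the real one lives in the aether package.
var ErrVersionConflict = errors.New("version conflict")

// saveWithBackoff retries an idempotent save on version conflict with
// exponential backoff plus jitter, giving up after maxAttempts.
// attempt() is expected to reload current state, re-apply the command,
// and call SaveEvent with a fresh version.
func saveWithBackoff(attempt func() error, maxAttempts int) error {
	delay := 10 * time.Millisecond
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = attempt()
		if err == nil || !errors.Is(err, ErrVersionConflict) {
			return err // success, or a non-retryable failure
		}
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)))) // full jitter
		delay *= 2
	}
	return err // still conflicting after maxAttempts; caller decides what next
}
```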
**Test Cases**
- Examples compile without error
- Examples use idempotent operations
- Test coverage for examples
**Dependencies**
- Depends on: Issue 1.11 (fail-fast behavior)
---
## Phase 2: Local Event Bus
### Feature Set 2a: Event Routing and Filtering
**Capability:** Route and Filter Domain Events
**Description:** Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.
**Success Condition:** Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.
---
#### Issue 2.1: [Command] Implement Publish to local subscribers
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Publish events to local subscribers
**User Story**
As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.
**Acceptance Criteria**
- [ ] Publish(namespaceID, event) sends to all subscribers of that namespace
- [ ] Exact subscribers (namespace="orders") receive event
- [ ] Wildcard subscribers (namespace="order*") receive matching events
- [ ] Events delivered in-process (no NATS yet)
- [ ] Buffered channels (100-event buffer) prevent blocking
- [ ] Events to subscribers with full buffers are dropped non-blocking (no deadlock)
- [ ] Metrics track publish count, receive count, dropped count
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** Publish(namespaceID, event)
**Invariant:** All subscribers matching namespace receive event
**Implementation:**
- Iterate exact subscribers for namespace
- Iterate wildcard subscribers matching pattern
- Deliver to each (non-blocking, buffered)
- Count drops
**Technical Notes**
- EventBus in eventbus.go already implements this
- Ensure buffered channels don't cause memory leaks
- Metrics important for observability
**Test Cases**
- Publish to "orders": exact subscriber of "orders" receives
- Publish to "orders.new": wildcard subscriber of "order*" receives
- Publish to "payments": subscriber to "orders" does NOT receive
- Subscriber with full buffer: event dropped (non-blocking)
- 1000 publishes: metrics accurate
**Dependencies**
- Depends on: Issue 2.2 (Subscribe)
---
#### Issue 2.2: [Command] Implement Subscribe with optional filter
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Register subscriber with optional event filter
**User Story**
As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.
**Acceptance Criteria**
- [ ] Subscribe(namespacePattern) returns <-chan *Event
- [ ] SubscribeWithFilter(namespacePattern, filter) returns filtered channel
- [ ] Filter supports EventTypes ([]string) and ActorPattern (string)
- [ ] Filters applied client-side (subscriber decides)
- [ ] Wildcard patterns work: "*" matches single token, ">" matches multiple
- [ ] Subscription channel is buffered (100 events)
- [ ] Unsubscribe(namespacePattern, ch) removes subscription
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)
**Invariants:**
- Namespace pattern determines which namespaces
- Filter determines which events within namespace
- Both work together (AND logic)
**Filter Types:**
- EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
- ActorPattern: string (e.g., "order-customer-*")
**Technical Notes**
- Pattern matching follows NATS conventions
- Filters are optional (nil filter = all events)
- Client-side filtering is sufficient for the local bus; NATS adds server-side subject filtering when cross-node delivery lands (Phase 4)
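A usage sketch under assumed API shapes (the real signatures live in eventbus.go); it shows the AND composition of namespace pattern and filter:

```go
package sketch

// Assumed shapes based on the criteria above; eventbus.go is the source of truth.
type Event struct {
	ActorID string
	Type    string
}

type EventFilter struct {
	EventTypes   []string // e.g. []string{"OrderPlaced", "OrderShipped"}
	ActorPattern string   // e.g. "order-customer-*"
}

type EventBus interface {
	Subscribe(namespacePattern string) <-chan *Event
	SubscribeWithFilter(namespacePattern string, filter *EventFilter) <-chan *Event
	Unsubscribe(namespacePattern string, ch <-chan *Event)
}

// consumeOrderEvents receives only OrderPlaced events from the "orders"
// namespace; the pattern selects namespaces, the filter selects events.
func consumeOrderEvents(bus EventBus, done <-chan struct{}) {
	ch := bus.SubscribeWithFilter("orders", &EventFilter{EventTypes: []string{"OrderPlaced"}})
	defer bus.Unsubscribe("orders", ch)
	for {
		select {
		case e := <-ch:
			_ = e // handle the event
		case <-done:
			return
		}
	}
}
```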
**Test Cases**
- Subscribe("orders"): exact match only
- Subscribe("order*"): wildcard match
- Subscribe("order.*"): NATS-style wildcard
- SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
- SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
- Unsubscribe closes channel
**Dependencies**
- Depends on: Issue 1.1 (events structure)
---
#### Issue 2.3: [Rule] Enforce exact subscription isolation
**Type:** New Feature
**Bounded Context:** Event Bus + Namespace Isolation
**Priority:** P1
**Title:** Guarantee exact namespace subscriptions are isolated
**User Story**
As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.
**Acceptance Criteria**
- [ ] Subscriber to "tenant-a" receives events from "tenant-a" only
- [ ] Subscriber to "tenant-a" does NOT receive from "tenant-b"
- [ ] Wildcard subscriber to "tenant*" receives from both
- [ ] Exact match subscribers are isolated from wildcard
- [ ] Tests verify isolation with multi-namespace setup
- [ ] Documentation warns about wildcard security implications
**Bounded Context:** Event Bus + Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Policy/Invariant)
**Invariant:** Exact subscriptions are isolated
**Implementation:**
- exactSubscribers map[namespace][]*subscription
- Wildcard subscriptions separate collection
- Publish checks exact first, then wildcard patterns
**Security Note:** Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)
**Technical Notes**
- Enforced at EventBus.Publish level
- Exact match is simple string equality
- Wildcard uses MatchNamespacePattern helper
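A sketch of NATS-style token matching ("*" matches one token, ">" matches the remaining tokens), assuming "."-separated namespaces; the real MatchNamespacePattern helper may additionally support prefix globs such as "order*":

```go
package sketch

import "strings"

// matchNamespacePattern compares a subscription pattern to a namespace,
// token by token, following NATS conventions.
func matchNamespacePattern(pattern, namespace string) bool {
	pt := strings.Split(pattern, ".")
	nt := strings.Split(namespace, ".")
	for i, p := range pt {
		if p == ">" {
			return true // ">" swallows the remaining tokens
		}
		if i >= len(nt) {
			return false
		}
		if p != "*" && p != nt[i] {
			return false
		}
	}
	return len(pt) == len(nt)
}
```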
**Test Cases**
- Publish to "tenant-a": only "tenant-a" exact subscribers get it
- Publish to "tenant-b": only "tenant-b" exact subscribers get it
- Publish to "tenant-a": "tenant*" wildcard subscriber gets it
- Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it
**Dependencies**
- Depends on: Issue 2.2 (Subscribe)
---
#### Issue 2.4: [Rule] Document wildcard subscription security
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Document that wildcard subscriptions bypass isolation
**User Story**
As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.
**Acceptance Criteria**
- [ ] eventbus.go comments explain wildcard behavior
- [ ] Security warning in Subscribe godoc
- [ ] Example showing wildcard usage for logging
- [ ] Example showing why wildcard is dangerous (if not restricted)
- [ ] README mentions namespace isolation caveats
- [ ] Examples show proper patterns (monitoring, auditing)
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Wildcard subscriptions receive all matching events
- Use for cross-cutting concerns (logging, monitoring, audit)
- Restrict access to trusted components
- Never expose wildcard pattern to untrusted users
**Examples:**
- Monitoring system subscribes to ">"
- Audit system subscribes to "tenant-*"
- Application logic uses exact subscriptions only
**Technical Notes**
- Intentional design; not a bug
- Different from NATS server-side filtering
**Test Cases**
- Examples compile
- Documentation is clear and accurate
**Dependencies**
- Depends on: Issue 2.3 (exact isolation)
---
#### Issue 2.5: [Event] Publish SubscriptionCreated for tracking
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Track subscription lifecycle
**User Story**
As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.
**Acceptance Criteria**
- [ ] SubscriptionCreated event published on Subscribe
- [ ] SubscriptionDestroyed event published on Unsubscribe
- [ ] Event contains: namespacePattern, filterCriteria, timestamp
- [ ] Metrics increment on subscribe/unsubscribe
- [ ] SubscriberCount(namespace) returns current count
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** SubscriptionCreated(namespacePattern, filter, timestamp)
**Event:** SubscriptionDestroyed(namespacePattern, timestamp)
**Metrics:** Subscriber count per namespace
**Technical Notes**
- SubscriberCount already in eventbus.go
- Add events to EventBus.Subscribe and EventBus.Unsubscribe
- Internal events (infrastructure)
**Test Cases**
- Subscribe → metrics increment
- Unsubscribe → metrics decrement
- SubscriberCount correct
**Dependencies**
- Depends on: Issue 2.2 (Subscribe/Unsubscribe)
---
#### Issue 2.6: [Event] Publish EventPublished for delivery tracking
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Record event publication metrics
**User Story**
As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.
**Acceptance Criteria**
- [ ] EventPublished event published on Publish
- [ ] Metrics track: published count, delivered count, dropped count per namespace
- [ ] Dropped events (full channel) recorded
- [ ] Application can query metrics via Metrics()
- [ ] Example: 1000 events published, 995 delivered, 5 dropped
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Event/Metrics)
**Event:** EventPublished (infrastructure event)
**Metrics:**
- PublishCount[namespace]
- DeliveryCount[namespace]
- DroppedCount[namespace]
**Implementation:**
- RecordPublish(namespace)
- RecordReceive(namespace)
- RecordDroppedEvent(namespace)
**Technical Notes**
- Metrics already in DefaultMetricsCollector
- RecordDroppedEvent signals subscriber backpressure
- Can be used to auto-scale subscribers
**Test Cases**
- Publish 100 events: metrics show 100 published
- All delivered: metrics show 100 delivered
- Full subscriber: next event dropped, metrics show 1 dropped
- Query via bus.Metrics(): values accurate
**Dependencies**
- Depends on: Issue 2.1 (Publish)
---
#### Issue 2.7: [Read Model] Implement GetSubscriptions query
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Query active subscriptions for operational visibility
**User Story**
As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.
**Acceptance Criteria**
- [ ] GetSubscriptions() returns []SubscriptionInfo
- [ ] SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
- [ ] Works for both exact and wildcard subscriptions
- [ ] Metrics accessible via SubscriberCount(namespace)
- [ ] Example: "What subscriptions are listening to 'orders'?"
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** SubscriptionRegistry
**Data:**
- Pattern: namespace pattern (e.g., "tenant-*")
- Filter: optional filter criteria
- SubscriberID: unique ID for each subscription
- CreatedAt: timestamp
**Implementation:**
- Track subscriptions in eventbus.go
- Expose via GetSubscriptions() method
**Technical Notes**
- Useful for debugging
- Optional feature; not critical
**Test Cases**
- Subscribe to "orders": GetSubscriptions shows it
- Subscribe to "order*": GetSubscriptions shows it
- Unsubscribe: GetSubscriptions removes it
- Multiple subscribers: all listed
**Dependencies**
- Depends on: Issue 2.2 (Subscribe)
---
### Feature Set 2b: Buffering and Backpressure
**Capability:** Route and Filter Domain Events (non-blocking delivery)
**Description:** Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).
**Success Condition:** Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.
---
#### Issue 2.8: [Rule] Implement non-blocking event delivery
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Ensure event publication never blocks
**User Story**
As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.
**Acceptance Criteria**
- [ ] Publish(namespace, event) returns immediately
- [ ] If subscriber channel full, event dropped (non-blocking)
- [ ] Dropped events counted in metrics
- [ ] Buffered channel size is 100 (tunable)
- [ ] Publisher never waits for subscriber
- [ ] Metrics alert on high drop rate
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Policy)
**Invariant:** Publishers not blocked by slow subscribers
**Implementation:**
- select { case ch <- event: ... default: ... }
- Count drops in default case
**Trade-off:**
- Pro: Publisher never blocks
- Con: Events may be lost if subscriber can't keep up
- Mitigation: Metrics alert on drops; subscriber can increase buffer or retry
**Technical Notes**
- Already implemented in eventbus.go (deliverToSubscriber)
- 100-event buffer is reasonable default
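A minimal sketch of the non-blocking send, assuming the metrics hooks named in Issue 2.6; the real helper is deliverToSubscriber in eventbus.go:

```go
package sketch

// Illustrative delivery helper; eventbus.go is the source of truth.
type Event struct{ Type string }

type metrics interface {
	RecordReceive(namespace string)
	RecordDroppedEvent(namespace string)
}

// deliver never blocks the publisher: if the subscriber's buffered channel
// is full, the event is dropped and counted instead of queued.
func deliver(ch chan<- *Event, e *Event, namespace string, m metrics) {
	select {
	case ch <- e:
		m.RecordReceive(namespace)
	default:
		m.RecordDroppedEvent(namespace) // backpressure signal; alert on high drop rate
	}
}
```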
**Test Cases**
- Subscribe, receive 100 events: no drops
- Publish 101st event immediately: dropped
- Metrics show drop count
- Publisher latency < 1ms regardless of subscribers
**Dependencies**
- Depends on: Issue 2.1 (Publish)
---
#### Issue 2.9: [Documentation] Document EventBus backpressure handling
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Explain buffer management and recovery from drops
**User Story**
As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.
**Acceptance Criteria**
- [ ] Document buffer size (100 events default)
- [ ] Explain what happens on overflow (event dropped)
- [ ] Document recovery patterns (subscriber restarts, re-syncs)
- [ ] Example: Subscriber catches up from JetStream after restart
- [ ] Metrics to monitor (drop rate)
- [ ] README section on backpressure
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Buffer size and behavior
- Drop semantics
- Recovery patterns
- Metrics to monitor
- When to increase buffer size
**Examples:**
- Slow subscriber: increase buffer or fix handler
- Network latency: events may be dropped
- Handler panics: subscriber must restart and re-sync
**Technical Notes**
- Events are lost if dropped; only durable via JetStream
- Phase 4 (NATS event delivery) addresses durability
**Test Cases**
- Documentation is clear
- Examples work
**Dependencies**
- Depends on: Issue 2.8 (non-blocking delivery)
---
## Phase 3: Cluster Coordination
### Feature Set 3a: Cluster Topology and Leadership
**Capability:** Coordinate Cluster Topology
**Description:** Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.
**Success Condition:** Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.
---
#### Issue 3.1: [Command] Implement JoinCluster protocol
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Enable node discovery via cluster join
**User Story**
As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.
**Acceptance Criteria**
- [ ] JoinCluster() announces node via NATS
- [ ] Node info contains: NodeID, Address, Timestamp, Status
- [ ] Other nodes receive join announcement
- [ ] Cluster topology updated atomically
- [ ] Rejoining node detected and updated
- [ ] Tests verify multi-node discovery
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** JoinCluster()
**Aggregates:** Cluster (group of nodes)
**Events:** NodeJoined(nodeID, address, timestamp)
**Technical Notes**
- NATS subject: "aether.cluster.nodes"
- NodeDiscovery subscribes to announcements
- ClusterManager.Start() initiates join
**Test Cases**
- Single node joins: topology = [node-a]
- Second node joins: topology = [node-a, node-b]
- Third node joins: topology = [node-a, node-b, node-c]
- Node rejoins: updates existing entry
**Dependencies**
- None (first cluster feature)
---
#### Issue 3.2: [Command] Implement LeaderElection
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Elect single leader via NATS-based voting
**User Story**
As a cluster, I want one node to be elected leader so that it can coordinate shard assignments and rebalancing.
**Acceptance Criteria**
- [ ] LeaderElection holds election every HeartbeatInterval (5s)
- [ ] Nodes announce their own candidacy (no ballot counting; the first announcement wins)
- [ ] One leader elected per term
- [ ] Leader holds lease (TTL = 2 * HeartbeatInterval)
- [ ] All nodes converge on same leader
- [ ] Lease renewal happens automatically
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** ElectLeader()
**Aggregates:** LeadershipLease (time-bound authority)
**Events:** LeaderElected(leaderID, term, leaseExpiration)
**Technical Notes**
- NATS subject: "aether.cluster.election"
- Each node publishes heartbeat with NodeID, Timestamp
- First node to publish becomes leader
- Lease expires if no heartbeat for TTL
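A sketch of the lease bookkeeping only (the NATS publish/subscribe wiring is omitted); interval values follow the criteria above, other names are illustrative:

```go
package sketch

import "time"

// Illustrative lease bookkeeping; the real election logic is NATS-driven.
type leadershipLease struct {
	LeaderID      string
	Term          uint64
	LastHeartbeat time.Time
}

const heartbeatInterval = 5 * time.Second
const leaseTTL = 2 * heartbeatInterval

// expired reports whether the lease has lapsed and a new election may begin.
func (l *leadershipLease) expired(now time.Time) bool {
	return now.Sub(l.LastHeartbeat) > leaseTTL
}

// renew is called when the current leader's heartbeat arrives.
func (l *leadershipLease) renew(leaderID string, now time.Time) {
	if leaderID == l.LeaderID {
		l.LastHeartbeat = now
	}
}
```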
**Test Cases**
- Single node: elected immediately
- Three nodes: exactly one elected
- Leader dies: remaining nodes elect new leader within 2*interval
- Former leader rejoins: may or may not regain leadership
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.3: [Rule] Enforce single leader invariant
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee exactly one leader at any time
**User Story**
As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.
**Acceptance Criteria**
- [ ] At most one leader at any time (lease-based)
- [ ] If leader lease expires, no leader until re-election
- [ ] All nodes see same leader (or none)
- [ ] Tests verify invariant under various failure scenarios
- [ ] Split-brain window bounded by lease TTL (a second leader is possible only until the stale lease expires)
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** At most one leader (enforced by lease TTL)
**Mechanism:**
- Leader publishes heartbeat every HeartbeatInterval
- Other nodes trust leader if heartbeat < HeartbeatInterval old
- If no heartbeat for 2*HeartbeatInterval, lease expired
- New election begins
**Technical Notes**
- Lease-based; not consensus-based (simpler)
- Allows temporary split-brain until lease expires
- Acceptable for Aether (eventual consistency)
**Test Cases**
- Simulate leader death: lease expires, new leader elected
- Simulate network partition: partition may have >1 leader until lease expires
- Verify no coordination during lease expiration
**Dependencies**
- Depends on: Issue 3.2 (leader election)
---
#### Issue 3.4: [Event] Publish LeaderElected on election
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Record leadership election outcomes
**User Story**
As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.
**Acceptance Criteria**
- [ ] LeaderElected event published after successful election
- [ ] Event contains: LeaderID, Term, LeaseExpiration, Timestamp
- [ ] Metrics increment on election
- [ ] Helpful for debugging split-brain scenarios
- [ ] Track election frequency (ideally < 1 per minute)
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** LeaderElected(leaderID, term, leaseExpiration, timestamp)
**Triggered by:** Successful election
**Consumers:** Metrics, audit logs
**Technical Notes**
- Event published locally to all observers
- Infrastructure event (not domain event)
**Test Cases**
- Election happens: event published
- Term increments: event reflects new term
- Metrics accurate
**Dependencies**
- Depends on: Issue 3.2 (election)
---
#### Issue 3.5: [Event] Publish LeadershipLost on lease expiration
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track leadership transitions
**User Story**
As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.
**Acceptance Criteria**
- [ ] LeadershipLost event published when lease expires
- [ ] Event contains: PreviousLeaderID, Timestamp, Reason
- [ ] Metrics track leadership transitions
- [ ] Helpful for debugging cascading failures
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** LeadershipLost(previousLeaderID, timestamp, reason)
**Reason:** "lease_expired", "node_failed", etc.
**Technical Notes**
- Published when lease TTL expires
- Useful for observability
**Test Cases**
- Leader lease expires: LeadershipLost published
- Metrics show transition
**Dependencies**
- Depends on: Issue 3.2 (election)
---
#### Issue 3.6: [Read Model] Implement GetClusterTopology query
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Query current cluster members and status
**User Story**
As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.
**Acceptance Criteria**
- [ ] GetNodes() returns map[nodeID]*NodeInfo
- [ ] NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
- [ ] Status is: Active, Degraded, Failed
- [ ] LastSeen is accurate heartbeat timestamp
- [ ] ShardIDs show shard ownership (filled in Phase 3b)
- [ ] Example: "node-a is active; node-b failed 30s ago"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** ClusterTopology
**Data:**
- NodeID → NodeInfo (status, heartbeat, shards)
- LeaderID (current leader)
- Term (election term)
**Technical Notes**
- ClusterManager maintains topology in-memory
- Update on each heartbeat/announcement
**Test Cases**
- GetNodes() returns active nodes
- Status accurate (Active, Failed, etc.)
- LastSeen updates on heartbeat
- Rejoining node updates existing entry
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.7: [Read Model] Implement GetLeader query
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Query current leader
**User Story**
As a client, I want to know who the leader is, so that I can route coordination requests to the right node.
**Acceptance Criteria**
- [ ] GetLeader() returns current leader NodeID or ""
- [ ] IsLeader() returns true if this node is leader
- [ ] Both consistent with LeaderElection state
- [ ] Updated immediately on election
- [ ] Example: "node-b is leader (term 5)"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** LeadershipRegistry
**Data:** CurrentLeader, CurrentTerm, LeaseExpiration
**Implementation:**
- LeaderElection maintains this
- ClusterManager queries it
**Technical Notes**
- Critical for routing coordination work
- Must be consistent across cluster
**Test Cases**
- No leader: GetLeader returns ""
- Leader elected: GetLeader returns leader ID
- IsLeader true on leader, false on others
- Changes on re-election
**Dependencies**
- Depends on: Issue 3.2 (election)
---
### Feature Set 3b: Shard Distribution
**Capability:** Distribute Actors Across Cluster Nodes
**Description:** Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.
**Success Condition:** 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.
---
#### Issue 3.8: [Command] Implement consistent hash ring
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Distribute shards across nodes with minimal reshuffling
**User Story**
As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.
**Acceptance Criteria**
- [ ] ConsistentHashRing(numShards=1024) creates ring
- [ ] GetShard(actorID) returns consistent shard [0, 1024)
- [ ] AddNode(nodeID) rebalances ~numShards/numNodes shards
- [ ] RemoveNode(nodeID) rebalances shards evenly
- [ ] Same actor always maps to same shard
- [ ] Reshuffling < 40% on node add/remove
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** AssignShards(nodes)
**Aggregates:** ConsistentHashRing (distribution algorithm)
**Invariants:**
- Each shard [0, 1024) assigned to exactly one node
- ActorID hashes consistently to shard
- Topology changes minimize reassignment
**Technical Notes**
- hashring.go already implements this
- Use crypto/md5 or compatible hash
- 1024 shards is tunable (P1 default)
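A sketch of the actor-to-shard mapping, assuming MD5 as mentioned above; hashring.go may use a different but compatible hash, and the shard-to-node step goes through the ring:

```go
package sketch

import (
	"crypto/md5"
	"encoding/binary"
)

const numShards = 1024

// GetShard maps an actor ID to a stable shard in [0, numShards).
// The hash only needs to be deterministic and well distributed.
func GetShard(actorID string) uint32 {
	sum := md5.Sum([]byte(actorID))
	return binary.BigEndian.Uint32(sum[:4]) % numShards
}
```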
**Test Cases**
- Single node: all shards assigned to it
- Two nodes: ~512 shards each
- Three nodes: ~341 shards each
- Add fourth node: ~256 shards each (~25% reshuffled)
- Remove node: remaining nodes rebalance evenly
- Same actor-id always hashes to same shard
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.9: [Rule] Enforce single shard owner invariant
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee each shard has exactly one owner
**User Story**
As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.
**Acceptance Criteria**
- [ ] ShardMap tracks shard → nodeID assignment
- [ ] No shard is unassigned (every shard has owner)
- [ ] No shard assigned to multiple nodes
- [ ] Reassignment is atomic (no in-between state)
- [ ] Tests verify invariant after topology changes
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** Each shard [0, 1024) assigned to exactly one active node
**Mechanism:**
- ShardMap[shardID] = [nodeID]
- Maintained by leader
- Updated atomically on rebalancing
**Technical Notes**
- shard.go implements ShardManager
- Validated after each rebalancing
**Test Cases**
- After rebalancing: all shards assigned
- No orphaned shards
- No multiply-assigned shards
- Reassignment is atomic
**Dependencies**
- Depends on: Issue 3.8 (consistent hashing)
---
#### Issue 3.10: [Event] Publish ShardAssigned on assignment
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track shard-to-node assignments
**User Story**
As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.
**Acceptance Criteria**
- [ ] ShardAssigned event published after assignment
- [ ] Event contains: ShardID, NodeID, Timestamp
- [ ] Metrics track: shards per node, rebalancing frequency
- [ ] Example: Shard 42 assigned to node-b
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** ShardAssigned(shardID, nodeID, timestamp)
**Triggered by:** AssignShards command succeeds
**Metrics:** Shards per node, distribution evenness
**Technical Notes**
- Infrastructure event
- Useful for monitoring load distribution
**Test Cases**
- Assignment published on rebalancing
- Metrics reflect distribution
**Dependencies**
- Depends on: Issue 3.9 (shard ownership)
---
#### Issue 3.11: [Read Model] Implement GetShardAssignments query
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Query shard-to-node mapping
**User Story**
As a client, I want to know which node owns a shard, so that I can route actor requests correctly.
**Acceptance Criteria**
- [ ] GetShardAssignments() returns ShardMap
- [ ] ShardMap[shardID] returns owning nodeID
- [ ] GetShard(actorID) returns shard for actor
- [ ] Routing decision: actorID → shard → nodeID
- [ ] Cached locally; refreshed on each rebalancing
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** ShardMap
**Data:**
- ShardID → NodeID (primary owner)
- Version (incremented on rebalancing)
- UpdateTime
**Implementation:**
- ClusterManager.GetShardMap()
- Cached; updated on assignment changes
**Technical Notes**
- Critical for routing
- Must be consistent across cluster
- Version helps detect stale caches
**Test Cases**
- GetShardAssignments returns current map
- GetShard(actorID) returns consistent shard
- Routing: actor ID → shard → node owner
**Dependencies**
- Depends on: Issue 3.9 (shard ownership)
---
### Feature Set 3c: Failure Detection and Recovery
**Capability:** Recover from Node Failures
**Description:** Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.
**Success Condition:** Node dies → failure detected within 90s → shards reassigned → actors replay automatically.
---
#### Issue 3.12: [Command] Implement node health checks
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Detect node failures via heartbeat timeout
**User Story**
As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.
**Acceptance Criteria**
- [ ] Each node publishes heartbeat every 30s
- [ ] Nodes without heartbeat for 90s marked as Failed
- [ ] checkNodeHealth() runs every 30s
- [ ] Failed node's status updates atomically
- [ ] Tests verify failure detection timing
- [ ] Failed node can rejoin cluster
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** MarkNodeFailed(nodeID)
**Trigger:** monitorNodes detects missing heartbeat
**Events:** NodeFailed(nodeID, lastSeenTimestamp)
**Technical Notes**
- monitorNodes() loop in manager.go
- Check LastSeen timestamp
- Update status if stale (>90s)
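A sketch of the health check, assuming illustrative manager fields; the real loop is monitorNodes in manager.go:

```go
package sketch

import (
	"sync"
	"time"
)

// Illustrative node bookkeeping; manager.go is the source of truth.
type NodeInfo struct {
	ID       string
	Status   string // "Active" or "Failed"
	LastSeen time.Time
}

type clusterManager struct {
	mu    sync.Mutex
	nodes map[string]*NodeInfo
}

const failureTimeout = 90 * time.Second

// checkNodeHealth marks nodes Failed when their heartbeat is older than the
// failure timeout; it is expected to run every 30s from the monitor loop.
func (m *clusterManager) checkNodeHealth(now time.Time) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var failed []string
	for id, n := range m.nodes {
		if n.Status != "Failed" && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = "Failed"
			failed = append(failed, id) // caller publishes NodeFailed for these
		}
	}
	return failed
}
```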
**Test Cases**
- Active node: status stays Active
- No heartbeat for 90s: status → Failed
- Rejoin: status → Active
- Failure detected within roughly 90-120s of the last heartbeat
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.13: [Command] Implement RebalanceShards after node failure
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Reassign failed node's shards to healthy nodes
**User Story**
As the cluster, I want to reassign failed node's shards automatically, so that actors are available on new nodes.
**Acceptance Criteria**
- [ ] Leader detects node failure
- [ ] Leader triggers RebalanceShards
- [ ] Failed node's shards reassigned evenly
- [ ] No shard left orphaned
- [ ] ShardMap updated atomically
- [ ] Rebalancing completes within 5 seconds
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** RebalanceShards(failedNodeID)
**Aggregates:** ShardMap, ConsistentHashRing
**Events:** RebalanceStarted, ShardMigrated
**Technical Notes**
- Leader only (IsLeader() check)
- Use consistent hashing to assign
- Calculate new assignments atomically
**Test Cases**
- Node-a fails with shards [1, 2, 3]
- Leader reassigns [1, 2, 3] to remaining nodes
- No orphaned shards
- Rebalancing < 5s
**Dependencies**
- Depends on: Issue 3.8 (consistent hashing)
- Depends on: Issue 3.12 (failure detection)
---
#### Issue 3.14: [Rule] Enforce no-orphan invariant
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee all shards have owners after rebalancing
**User Story**
As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.
**Acceptance Criteria**
- [ ] Before rebalancing: verify no orphaned shards
- [ ] After rebalancing: verify all shards assigned
- [ ] Tests fail if invariant violated
- [ ] Rebalancing aborted if invariant would be violated
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** All shards [0, 1024) have owners after any rebalancing
**Check:**
- Count assigned shards
- Verify = 1024
- Abort if not
**Technical Notes**
- Validate before committing ShardMap
- Logs errors but doesn't assert (graceful degradation)
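A sketch of the invariant check run before committing a new ShardMap (the map shape is illustrative):

```go
package sketch

// validateNoOrphans verifies every shard in [0, numShards) has exactly one
// owner before the new assignment is committed; the caller aborts or rolls
// back the rebalancing if this returns false.
func validateNoOrphans(shardMap map[int]string, numShards int) bool {
	if len(shardMap) != numShards {
		return false // missing or extra shard entries
	}
	for shard := 0; shard < numShards; shard++ {
		if shardMap[shard] == "" {
			return false // orphaned shard
		}
	}
	return true
}
```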
**Test Cases**
- Rebalancing completes: all shards assigned
- Orphaned shard detected: rebalancing rolled back
- Tests verify count = 1024
**Dependencies**
- Depends on: Issue 3.13 (rebalancing)
---
#### Issue 3.15: [Event] Publish NodeFailed on failure detection
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Record node failure for observability
**User Story**
As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.
**Acceptance Criteria**
- [ ] NodeFailed event published when failure detected
- [ ] Event contains: NodeID, LastSeenTimestamp, AffectedShards
- [ ] Metrics track failure frequency
- [ ] Example: "node-a failed; 341 shards affected"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)
**Triggered by:** checkNodeHealth marks node failed
**Consumers:** Metrics, alerts, audit logs
**Technical Notes**
- Infrastructure event
- AffectedShards helps assess impact
**Test Cases**
- Node failure detected: event published
- Metrics show affected shard count
**Dependencies**
- Depends on: Issue 3.12 (failure detection)
---
#### Issue 3.16: [Event] Publish ShardMigrated on shard movement
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track shard migrations
**User Story**
As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.
**Acceptance Criteria**
- [ ] ShardMigrated event published on each shard movement
- [ ] Event contains: ShardID, FromNodeID, ToNodeID, Status
- [ ] Status: "Started", "InProgress", "Completed", "Failed"
- [ ] Metrics track migration count and duration
- [ ] Example: "Shard 42 migrated from node-a to node-b (2.3s)"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)
**Status:** Started → InProgress → Completed
**Consumers:** Metrics, progress tracking
**Technical Notes**
- Published for each shard move
- Helps track rebalancing progress
- Useful for SLO monitoring
**Test Cases**
- Shard moves: event published
- Metrics track duration
- Status transitions correct
**Dependencies**
- Depends on: Issue 3.13 (rebalancing)
---
#### Issue 3.17: [Documentation] Document actor migration and replay
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Explain how actors move and recover state
**User Story**
As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.
**Acceptance Criteria**
- [ ] Design doc: cluster/ACTOR_MIGRATION.md
- [ ] Explain shard reassignment process
- [ ] Explain state rebuild via GetEvents + replay
- [ ] Explain snapshot optimization
- [ ] Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
- [ ] Explain out-of-order message handling
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Shard assignment (consistent hashing)
- Actor discovery (routing via shard map)
- State rebuild (replay from JetStream)
- Snapshots (optional optimization)
- In-flight messages (may arrive before replay completes)
**Examples:**
- Manual failover: reassign shards manually
- Auto failover: leader initiates on failure detection
**Technical Notes**
- Complex topic; good documentation prevents bugs
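A minimal replay sketch in Go, assuming narrowed `EventStore` and `Actor` interfaces (illustrative only). It shows the rebuild path the document should explain: fetch the actor's history on the new owner node and fold each event back into state; a snapshot would simply shorten the loop.

```go
package cluster

import "context"

// Event is a stand-in for Aether's persisted event type.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

// EventStore is an assumed, narrowed view of the store used for replay.
type EventStore interface {
	GetEvents(ctx context.Context, actorID string) ([]Event, error)
}

// Actor is a stand-in for an application actor that folds events into state.
type Actor interface {
	Apply(e Event) error
}

// rehydrate rebuilds an actor's state on its new owner node by replaying
// the full event history (or the tail after the latest snapshot).
func rehydrate(ctx context.Context, store EventStore, actorID string, a Actor) error {
	events, err := store.GetEvents(ctx, actorID)
	if err != nil {
		return err
	}
	for _, e := range events {
		if err := a.Apply(e); err != nil {
			return err
		}
	}
	return nil
}
```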
**Test Cases**
- Documentation is clear
- Examples correct
**Dependencies**
- Depends on: Issue 3.13 (rebalancing)
- Depends on: Phase 1 (event replay)
---
## Phase 4: Namespace Isolation and NATS Event Delivery
### Feature Set 4a: Namespace Storage Isolation
**Capability:** Isolate Logical Domains Using Namespaces
**Description:** Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at persistence layer.
**Success Condition:** Two stores with namespaces "tenant-a", "tenant-b"; event saved in "tenant-a" invisible to "tenant-b" queries.
---
#### Issue 4.1: [Rule] Enforce namespace-based stream naming
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1
**Title:** Use namespace prefixes in JetStream stream names
**User Story**
As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.
**Acceptance Criteria**
- [ ] Namespace "tenant-a" → stream "tenant-a_events"
- [ ] Namespace "tenant-b" → stream "tenant-b_events"
- [ ] Empty namespace → stream "events" (default)
- [ ] JetStreamConfig.Namespace sets prefix
- [ ] NewJetStreamEventStoreWithNamespace convenience function
- [ ] Tests verify stream names have namespace prefix
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Configuration)
**Value Object:** Namespace (string identifier)
**Implementation:**
- JetStreamConfig.Namespace field
- StreamName = namespace + "_events" if namespace set
- StreamName = "events" if namespace empty
**Technical Notes**
- Already partially implemented in jetstream.go
- Ensure safe characters in stream names (sanitize spaces, dots, and wildcards); see the sketch below
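A sketch of the naming rule in Go; `streamName` is an illustrative helper, not the existing jetstream.go API, and the sanitization shown (replacing spaces, dots, and wildcards) is one possible policy.

```go
package eventstore

import "strings"

// streamName derives the JetStream stream name from a namespace.
// Assumed convention: "<namespace>_events", falling back to "events"
// when no namespace is configured. Unsafe characters are replaced because
// JetStream stream names may not contain spaces, dots, or wildcards.
func streamName(namespace string) string {
	if namespace == "" {
		return "events"
	}
	sanitized := strings.NewReplacer(" ", "_", ".", "_", "*", "_", ">", "_").Replace(namespace)
	return sanitized + "_events"
}
```

For example, `streamName("tenant-a")` yields `"tenant-a_events"` and `streamName("")` yields `"events"`.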
**Test Cases**
- NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
- NewJetStreamEventStoreWithNamespace(""): creates stream "events"
- Stream name verified
**Dependencies**
- None (orthogonal to other contexts)
---
#### Issue 4.2: [Rule] Enforce storage-level namespace isolation
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P0
**Title:** Prevent cross-namespace data leakage at storage layer
**User Story**
As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.
**Acceptance Criteria**
- [ ] SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
- [ ] GetEvents("tenant-a") queries "tenant-a_events" stream only
- [ ] No possibility of accidental cross-namespace leakage
- [ ] JetStream subject filtering enforces isolation
- [ ] Integration tests verify with multiple namespaces
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** Events from namespace X are invisible to namespace Y
**Mechanism:**
- Separate JetStream streams per namespace
- Subject prefixing: "tenant-a.events.actor-123"
- Subscribe filters by subject prefix
**Technical Notes**
- jetstream.go: SubscribeToActorEvents uses subject prefix
- Consumer created with subject filter matching namespace
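A sketch of the subject layout in Go, assuming the "{namespace}.events.{actorID}" convention above; the helper names are illustrative. Because each namespace has its own stream and consumers filter on this prefix, an event written under one namespace can never match another namespace's filter.

```go
package eventstore

import "fmt"

// eventSubject builds the per-actor subject inside a namespace,
// e.g. "tenant-a.events.actor-123".
func eventSubject(namespace, actorID string) string {
	return fmt.Sprintf("%s.events.%s", namespace, actorID)
}

// namespaceFilter is the assumed subject filter applied when creating a
// consumer scoped to a single namespace.
func namespaceFilter(namespace string) string {
	return namespace + ".events.>"
}
```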
**Test Cases**
- SaveEvent to tenant-a: visible in tenant-a queries
- Same event invisible to tenant-b queries
- GetLatestVersion scoped to namespace
- GetEvents scoped to namespace
- Multi-namespace integration test
**Dependencies**
- Depends on: Issue 4.1 (stream naming)
---
#### Issue 4.3: [Documentation] Document namespace design patterns
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1
**Title:** Provide guidance on namespace naming and use
**User Story**
As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.
**Acceptance Criteria**
- [ ] Design doc: NAMESPACE_DESIGN_PATTERNS.md
- [ ] Pattern 1: "tenant-{id}" (per-customer)
- [ ] Pattern 2: "env.domain" (per-env, per-bounded-context)
- [ ] Pattern 3: "env.domain.customer" (most granular)
- [ ] Examples of each pattern
- [ ] Guidance on choosing granularity
- [ ] Anti-patterns (wildcards, spaces, leading/trailing dots)
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Multi-tenant patterns
- Granularity decisions
- Namespace naming rules
- Examples
- Anti-patterns
- Performance implications
**Examples:**
- SaaS: "tenant-uuid"
- Microservices: "service.orders"
- Complex: "env.service.tenant"
**Technical Notes**
- No hard restrictions; naming is flexible
- Sanitization (spaces → underscores)
**Test Cases**
- Documentation is clear
- Examples valid
**Dependencies**
- Depends on: Issue 4.1 (stream naming)
---
#### Issue 4.4: [Validation] Add namespace format validation (P2)
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P2
**Title:** Validate namespace names to prevent invalid streams
**User Story**
As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.
**Acceptance Criteria**
- [ ] ValidateNamespace(ns string) returns error for invalid names
- [ ] Rejects: "tenant-*", "tenant a", "tenant."
- [ ] Accepts: "tenant-abc", "prod.orders", "tenant_123"
- [ ] Called on NewJetStreamEventStoreWithNamespace
- [ ] Clear error messages
- [ ] Tests verify validation rules
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Validation)
**Validation Rules:**
- No wildcards (*, >)
- No spaces
- No leading/trailing dots
- Alphanumeric, hyphens, underscores, dots only
**Implementation:**
- ValidateNamespace regex
- Called before stream creation
**Technical Notes**
- Nice-to-have; namespace strings are currently accepted as-is
- Could sanitize instead of rejecting (e.g. replace spaces with underscores); a validation sketch follows below
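A sketch of the proposed validation in Go; the regex encodes the rules listed above (alphanumerics, hyphens, underscores, interior dots only), and while the function name matches the acceptance criteria, the exact pattern is an assumption.

```go
package eventstore

import (
	"fmt"
	"regexp"
)

// namespacePattern allows alphanumerics, hyphens, underscores, and interior
// dots; it rejects wildcards, spaces, and leading/trailing dots.
var namespacePattern = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace rejects namespace names that would produce invalid
// streams or subjects. The empty namespace stays valid because it maps to
// the default "events" stream.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil
	}
	if !namespacePattern.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: only alphanumerics, '-', '_', and interior '.' are allowed", ns)
	}
	return nil
}
```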
**Test Cases**
- Valid: "tenant-abc", "prod.orders"
- Invalid: "tenant-*", "tenant a", ".prod"
- Error messages clear
**Dependencies**
- Depends on: Issue 4.1 (stream naming)
---
### Feature Set 4b: Cross-Node Event Delivery via NATS
**Capability:** Deliver Events Across Cluster Nodes
**Description:** Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.
**Success Condition:** Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).
---
#### Issue 4.5: [Command] Implement NATSEventBus wrapper
**Type:** New Feature
**Bounded Context:** Event Bus (with NATS)
**Priority:** P1
**Title:** Extend EventBus with NATS-native pub/sub
**User Story**
As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.
**Acceptance Criteria**
- [ ] NATSEventBus embeds EventBus
- [ ] Publish(namespace, event) sends to local EventBus AND NATS
- [ ] NATS subject: "aether.events.{namespace}"
- [ ] SubscribeWithFilter works across nodes
- [ ] Self-published events not re-delivered (avoid loops)
- [ ] Tests verify cross-node delivery
**Bounded Context:** Event Bus (NATS extension)
**DDD Implementation Guidance**
**Type:** New Feature (Extension)
**Aggregate:** EventBus extended with NATSEventBus
**Commands:** Publish(namespace, event) [same interface, distributed]
**Implementation:**
- NATSEventBus composes EventBus
- Override Publish to also publish to NATS
- Subscribe to NATS subjects matching namespace
**Technical Notes**
- nats_eventbus.go already partially implemented
- NATS subject: "aether.events.orders" for namespace "orders"
- Include sourceNodeID in event to prevent redelivery
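A sketch of the composition in Go using the nats.go client; the `Event` and `LocalBus` types are stand-ins for Aether's actual types, and the subject format follows the note above. Stamping `SourceNodeID` before publishing is what lets remote nodes drop our own events.

```go
package eventbus

import (
	"encoding/json"
	"fmt"

	"github.com/nats-io/nats.go"
)

// Event is a stand-in for Aether's event type; SourceNodeID lets remote
// nodes drop events that originated locally, preventing redelivery loops.
type Event struct {
	ID           string `json:"id"`
	Namespace    string `json:"namespace"`
	SourceNodeID string `json:"source_node_id"`
	Data         []byte `json:"data"`
}

// LocalBus is an assumed narrow view of the in-process EventBus from Phase 2.
type LocalBus interface {
	Publish(namespace string, e Event) error
}

// NATSEventBus composes the local bus with a NATS connection.
type NATSEventBus struct {
	local  LocalBus
	nc     *nats.Conn
	nodeID string
}

// Publish delivers the event locally and then forwards it to the cluster
// on the assumed subject "aether.events.<namespace>".
func (b *NATSEventBus) Publish(namespace string, e Event) error {
	e.SourceNodeID = b.nodeID
	if err := b.local.Publish(namespace, e); err != nil {
		return err
	}
	payload, err := json.Marshal(e)
	if err != nil {
		return err
	}
	return b.nc.Publish(fmt.Sprintf("aether.events.%s", namespace), payload)
}
```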
**Test Cases**
- Publish on node-a: local subscribers on node-a receive
- Same publish: node-b subscribers receive via NATS
- Self-loop prevented: node-a doesn't re-receive own publish
- Multi-node: all nodes converge on same events
**Dependencies**
- Depends on: Issue 2.1 (EventBus.Publish)
- Depends on: Issue 3.1 (cluster setup for multi-node tests)
---
#### Issue 4.6: [Rule] Enforce exactly-once delivery across cluster
**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1
**Title:** Guarantee events delivered to all cluster subscribers
**User Story**
As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.
**Acceptance Criteria**
- [ ] Event published to NATS with JetStream consumer
- [ ] Consumer acknowledges delivery
- [ ] Redelivery on network failure (JetStream handles)
- [ ] No duplicate delivery to same subscriber
- [ ] All nodes see same events in same order
**Bounded Context:** Event Bus (NATS)
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** Exactly-once delivery to each subscriber
**Mechanism:**
- JetStream consumer per subscriber group
- Acknowledgment on delivery
- Automatic redelivery on timeout
**Technical Notes**
- JetStream handles durability and ordering
- Consumer name = subscriber ID
- Push consumer model (events pushed to subscriber)
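A sketch of a durable, manually acknowledged JetStream subscription using the nats.go client; subject and durable names are placeholders. Note that acknowledgment plus redelivery on its own gives at-least-once semantics; meeting the exactly-once goal additionally requires idempotent handlers or JetStream message de-duplication.

```go
package eventbus

import (
	"log"

	"github.com/nats-io/nats.go"
)

// subscribeDurable creates a durable JetStream consumer for one subscriber
// group, with manual acks so unacknowledged messages are redelivered after
// the ack wait elapses.
func subscribeDurable(nc *nats.Conn, subject, subscriberID string, handle func(data []byte) error) (*nats.Subscription, error) {
	js, err := nc.JetStream()
	if err != nil {
		return nil, err
	}
	return js.Subscribe(subject, func(msg *nats.Msg) {
		if err := handle(msg.Data); err != nil {
			// No ack: JetStream redelivers this message after the timeout.
			log.Printf("handler failed, leaving message for redelivery: %v", err)
			return
		}
		if err := msg.Ack(); err != nil {
			log.Printf("ack failed: %v", err)
		}
	}, nats.Durable(subscriberID), nats.ManualAck())
}
```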
**Test Cases**
- Publish event: all subscribers receive once
- Network failure: redelivery after timeout
- No duplicates on subscriber
- Order preserved across nodes
**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
---
#### Issue 4.7: [Event] Publish EventPublished (via NATS)
**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P2
**Title:** Route published events to NATS subjects
**User Story**
As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.
**Acceptance Criteria**
- [ ] EventPublished event published to NATS
- [ ] Subject: "aether.events.{namespace}.published"
- [ ] Message contains: eventID, timestamp, sourceNodeID
- [ ] Metrics track: events published, delivered, dropped
- [ ] Helps identify partition/latency issues
**Bounded Context:** Event Bus (NATS)
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** EventPublished (infrastructure)
**Subject:** aether.events.{namespace}.published
**Consumers:** Metrics, monitoring
**Technical Notes**
- Published after NATS publish succeeds
- Separate from local EventPublished (for clarity)
**Test Cases**
- Publish event: EventPublished message on NATS
- Metrics count delivery
- Cross-node visibility works
**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
---
#### Issue 4.8: [Read Model] Implement cross-node subscription
**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1
**Title:** Receive events from other nodes via NATS
**User Story**
As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.
**Acceptance Criteria**
- [ ] NATSEventBus.Subscribe(namespace) receives local + NATS events
- [ ] SubscribeWithFilter works with NATS
- [ ] Events from local node: delivered via local EventBus
- [ ] Events from remote nodes: delivered via NATS consumer
- [ ] Subscriber sees unified stream (no duplication)
**Bounded Context:** Event Bus (NATS)
**DDD Implementation Guidance**
**Type:** New Feature (Query/Subscription)
**Read Model:** UnifiedEventStream (local + remote)
**Implementation:**
- Subscribe creates local channel
- NATSEventBus subscribes to NATS subject
- Both feed into subscriber channel
**Technical Notes**
- Unified view is transparent to subscriber
- No need to know if event is local or remote
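A sketch of the fan-in that gives subscribers one unified channel; it reuses the `Event` stand-in from the Issue 4.5 sketch and assumes the local bus and the NATS consumer each expose a channel of decoded events. Dropping self-published events on the remote path is what prevents duplication.

```go
package eventbus

// mergeStreams fans the local and remote sources into one output channel,
// applying the subscriber's filter and skipping events this node published
// itself (those already arrived via the local path).
func mergeStreams(localCh, remoteCh <-chan Event, nodeID string, filter func(Event) bool) <-chan Event {
	out := make(chan Event, 64)

	forward := func(e Event) {
		if filter == nil || filter(e) {
			out <- e
		}
	}

	// Local path: events published on this node.
	go func() {
		for e := range localCh {
			forward(e)
		}
	}()

	// Remote path: events arriving from other nodes via NATS.
	go func() {
		for e := range remoteCh {
			if e.SourceNodeID == nodeID {
				continue
			}
			forward(e)
		}
	}()

	return out
}
```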
**Test Cases**
- Subscribe to namespace: receive local events
- Subscribe to namespace: receive remote events
- Filter works across both sources
- No duplication
**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
---
## Summary
This backlog contains **67 executable issues** across **5 bounded contexts** organized into **4 implementation phases**. Each issue:
- Is decomposed using DDD-informed order (commands → rules → events → reads)
- References domain concepts (aggregates, commands, events, value objects)
- Includes acceptance criteria (testable, specific)
- States dependencies (enabling parallel work)
- Is sized to 1-3 days of work
**Recommended Build Order:**
1. **Phase 1** (17 issues): Event Sourcing Foundation - everything depends on this
2. **Phase 2** (9 issues): Local Event Bus - enables observability before clustering
3. **Phase 3** (20 issues): Cluster Coordination - enables distributed deployment
4. **Phase 4** (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery
**Total Scope:** roughly 67-200 developer-days of work at 1-3 days per issue (conservative estimate: 10-15 weeks for a small team working in parallel)
---
## Next Steps
1. Create Gitea issues from this backlog
2. Assign to team members
3. Set up dependency tracking in Gitea
4. Use `/spawn-issues` skill to parallelize implementation
5. Iterate on acceptance criteria with domain experts
See `/issue-writing` skill for proper issue formatting in Gitea.