Aether Executable Backlog
Built from: 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition
Date: 2026-01-12
Backlog Overview
This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.
Total Scope:
- Capabilities: 9 (all complete)
- Features: 14
- Issues: 67
- Contexts: 5
- Implementation Phases: 4
Build Order (by value and dependencies):
- Phase 1: Event Sourcing Foundation (Capabilities 1-3)
  - Issues: 17
  - Enables all other work
- Phase 2: Local Event Bus (Capability 8)
  - Issues: 9
  - Enables local pub/sub before clustering
- Phase 3: Cluster Coordination (Capabilities 5-7)
  - Issues: 20
  - Enables distributed deployment
- Phase 4: Namespace & NATS (Capabilities 4, 9)
  - Issues: 21
  - Enables multi-tenancy and cross-node delivery
Phase 1: Event Sourcing Foundation
Feature Set 1a: Event Storage with Version Conflict Detection
Capability: Store Events Durably with Conflict Detection
Description: Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.
Success Condition: Multiple writers attempt to update the same actor; the first wins, the others see VersionConflictError with details; every accepted write lands in the immutable history.
Issue 1.1: [Command] Implement SaveEvent with monotonic version validation
Type: New Feature Bounded Context: Event Sourcing Priority: P0
Title: As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely
User Story
As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.
Acceptance Criteria
- SaveEvent accepts event with Version > current for actor
- SaveEvent rejects event with Version <= current (returns VersionConflictError)
- VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
- First event for new actor must have Version > 0 (typically 1)
- Version gaps are allowed (1, 3, 5 is valid)
- Validation happens before persistence (fail-fast)
- InMemoryEventStore and JetStreamEventStore both implement validation
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Core)
Aggregate: ActorEventStream (implicit; each actor has independent version sequence)
Command: SaveEvent(event)
Validation Rules:
- If no events exist for actor: version must be > 0
- If events exist: new version must be > latest version
Success Event: EventStored (published when SaveEvent succeeds)
Error Event: VersionConflict (triggered when version validation fails)
Technical Notes
- Version validation is the core invariant; everything else depends on it
- Use GetLatestVersion() to implement validation
- No database-level locks; optimistic validation only
- Conflict should fail in <1ms
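A minimal Go sketch of the validation path above, using illustrative names (Event, InMemoryEventStore) rather than the final API; the real store may track the latest version differently.

```go
package aether

import (
	"fmt"
	"sync"
)

// Event is a simplified stand-in for the real event type.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

// VersionConflictError carries the context the caller needs to retry (Issue 1.4).
type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for actor %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// InMemoryEventStore keeps per-actor streams and enforces monotonic versions.
type InMemoryEventStore struct {
	mu     sync.Mutex
	events map[string][]*Event // actorID -> ordered events
	latest map[string]int64    // actorID -> latest version
}

func NewInMemoryEventStore() *InMemoryEventStore {
	return &InMemoryEventStore{
		events: make(map[string][]*Event),
		latest: make(map[string]int64),
	}
}

// SaveEvent validates before persisting (fail-fast) and never retries.
func (s *InMemoryEventStore) SaveEvent(e *Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	current := s.latest[e.ActorID] // 0 when the actor has no events yet
	if e.Version <= current {
		return &VersionConflictError{
			ActorID:          e.ActorID,
			AttemptedVersion: e.Version,
			CurrentVersion:   current,
		}
	}
	s.events[e.ActorID] = append(s.events[e.ActorID], e)
	s.latest[e.ActorID] = e.Version // gaps are allowed; only monotonicity matters
	return nil
}
```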
Test Cases
- New actor, version 1: succeeds
- Same actor, version 2 (after 1): succeeds
- Same actor, version 2 (after 1, concurrent): second call fails
- Same actor, version 1 (duplicate): fails
- Same actor, version 0 or negative: fails
- 100 concurrent writers: 99 fail, 1 succeeds
Dependencies
- None (foundation)
Issue 1.2: [Rule] Enforce append-only and immutability invariants
Type: New Feature Bounded Context: Event Sourcing Priority: P0
Title: Enforce event immutability and append-only semantics
User Story
As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.
Acceptance Criteria
- EventStore interface has no Update or Delete methods
- Events cannot be modified after persistence
- Replay of same events always produces same state
- Corrupted events are reported (not silently skipped)
- JetStream stream configuration prevents deletes (retention policy only)
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Core Invariant)
Aggregate: ActorEventStream
Invariant: Events are immutable; stream is append-only; no modifications to EventStore interface
Implementation:
- Event struct has no Setters (only getters)
- SaveEvent is the only public persistence method
- JetStream streams configured with a NoDelete policy
Technical Notes
- This is enforced at interface level (no Update/Delete in EventStore)
- JetStream configuration prevents accidental deletes
- ReplayError allows visibility into corruption without losing good data
Test Cases
- Attempt to modify Event.Data after creation: compile error (if immutable)
- Attempt to call UpdateEvent: interface doesn't exist
- JetStream stream created with correct retention policy
- ReplayError captured when event unmarshaling fails
Dependencies
- Depends on: Issue 1.1 (SaveEvent implementation)
Issue 1.3: [Event] Publish EventStored after successful save
Type: New Feature Bounded Context: Event Sourcing Priority: P0
Title: Emit EventStored event for persistence observability
User Story
As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).
Acceptance Criteria
- EventStored event published after SaveEvent succeeds
- EventStored contains: EventID, ActorID, Version, Timestamp
- No EventStored published if SaveEvent fails
- EventBus receives EventStored in same transaction context
- Metrics increment for each EventStored
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature
Event: EventStored(eventID, actorID, version, timestamp)
Triggered by: Successful SaveEvent call
Consumers: Metrics collectors, projections, audit systems
Technical Notes
- EventStored is an internal event (Aether infrastructure)
- Published to the local EventBus (see Phase 4 for cross-node delivery via NATS)
- Allows observability without coupling application code
Test Cases
- Save event → EventStored published
- Version conflict → no EventStored published
- Multiple saves → multiple EventStored events in order
Dependencies
- Depends on: Issue 1.1 (SaveEvent)
- Depends on: Phase 2, Issue 2.1 (EventBus.Publish)
Issue 1.4: [Event] Publish VersionConflict error with full context
Type: New Feature Bounded Context: Event Sourcing, Optimistic Concurrency Control Priority: P0
Title: Return detailed version conflict information for retry logic
User Story
As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).
Acceptance Criteria
- VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
- Error message is human-readable with all context
- Errors.Is(err, ErrVersionConflict) returns true for sentinel check
- Errors.As(err, &versionErr) allows unpacking to VersionConflictError
- Application can read CurrentVersion to decide retry strategy
Bounded Context: Event Sourcing + OCC
DDD Implementation Guidance
Type: New Feature
Error Type: VersionConflictError (wraps ErrVersionConflict sentinel)
Data: ActorID, AttemptedVersion, CurrentVersion
Use: Application uses this to implement retry strategies
Technical Notes
- Already implemented in /aether/event.go (VersionConflictError struct)
- Document standard retry patterns in examples/
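A sketch of the caller-side unpacking described in the acceptance criteria, assuming the concrete error wraps an ErrVersionConflict sentinel via Unwrap; names mirror the issue text and are not a verified API.

```go
package aether

import (
	"errors"
	"fmt"
	"log"
)

// ErrVersionConflict is the sentinel the concrete error wraps.
var ErrVersionConflict = errors.New("version conflict")

// VersionConflictError mirrors the fields listed in the acceptance criteria.
type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("%v: actor=%s attempted=%d current=%d",
		ErrVersionConflict, e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// Unwrap lets errors.Is(err, ErrVersionConflict) succeed.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

// HandleSaveError shows how an application reads the conflict context.
func HandleSaveError(err error) {
	if err == nil {
		return
	}
	if errors.Is(err, ErrVersionConflict) {
		var vc *VersionConflictError
		if errors.As(err, &vc) {
			// CurrentVersion tells the caller where to re-read before retrying.
			log.Printf("lost race on %s: retry from version %d", vc.ActorID, vc.CurrentVersion)
			return
		}
	}
	log.Printf("unrecoverable save error: %v", err)
}
```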
Test Cases
- Conflict with detailed error: ActorID, versions present
- Application reads CurrentVersion: succeeds
- Errors.Is(err, ErrVersionConflict): true
- Errors.As(err, &versionErr): works
- Manual test: log the error, see all context
Dependencies
- Depends on: Issue 1.1 (SaveEvent)
Issue 1.5: [Read Model] Implement GetLatestVersion query
Type: New Feature Bounded Context: Event Sourcing Priority: P0
Title: Provide efficient version lookup for optimistic locking
User Story
As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.
Acceptance Criteria
- GetLatestVersion(actorID) returns latest version or 0 if no events
- Execution time is O(1) or O(log n), not O(n)
- InMemoryEventStore implements with map lookup
- JetStreamEventStore caches latest version per actor
- Cache is invalidated after each SaveEvent
- Multiple calls for same actor within 1s hit cache
- Namespace isolation: GetLatestVersion scoped to namespace
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Query)
Read Model: ActorVersionIndex
Source Events: SaveEvent (updates cache)
Data: ActorID → LatestVersion
Performance: O(1) lookup after SaveEvent
Technical Notes
- InMemoryEventStore: use map[actorID]int64
- JetStreamEventStore: query JetStream metadata OR maintain cache
- Cache invalidation: update after every SaveEvent
- Thread-safe with RWMutex (read-heavy)
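A possible shape for the cached version index suggested in the notes above (map plus RWMutex); type and method names are illustrative.

```go
package aether

import "sync"

// actorVersionIndex is a read-heavy cache: ActorID -> latest version.
// Assumed shape only; the JetStream store may derive this from stream metadata.
type actorVersionIndex struct {
	mu       sync.RWMutex
	versions map[string]int64
}

func newActorVersionIndex() *actorVersionIndex {
	return &actorVersionIndex{versions: make(map[string]int64)}
}

// Get returns the latest version, or 0 if the actor has no events.
func (i *actorVersionIndex) Get(actorID string) int64 {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.versions[actorID]
}

// Set is called after every successful SaveEvent to keep the cache fresh.
func (i *actorVersionIndex) Set(actorID string, version int64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if version > i.versions[actorID] {
		i.versions[actorID] = version
	}
}
```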
Test Cases
- New actor: GetLatestVersion returns 0
- After SaveEvent(version: 1): GetLatestVersion returns 1
- After SaveEvent(version: 3): GetLatestVersion returns 3
- Concurrent reads from same actor: all return consistent value
- Namespace isolation: "tenant-a" and "tenant-b" have independent versions
Dependencies
- Depends on: Issue 1.1 (SaveEvent)
Feature Set 1b: State Rebuild from Event History
Capability: Rebuild State from Event History
Description: Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.
Success Condition: GetEvents(actorID, 0) returns all events in order; replaying produces consistent state every time; snapshots cut replay from all n events to a snapshot load plus the events recorded since it.
Issue 1.6: [Command] Implement GetEvents for replay
Type: New Feature Bounded Context: Event Sourcing Priority: P0
Title: Load events from store for state replay
User Story
As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.
Acceptance Criteria
- GetEvents(actorID, fromVersion) returns []*Event in version order
- Events are ordered by version (ascending)
- fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
- If no events exist, returns empty slice (not error)
- If actorID has no events >= fromVersion, returns empty slice
- Namespace isolation: GetEvents scoped to namespace
- Large result sets don't cause memory issues (stream if >10k events)
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Query)
Command: GetEvents(actorID, fromVersion)
Returns: []*Event ordered by version
Invariant: Order is deterministic (version order always)
Technical Notes
- InMemoryEventStore: filter and sort by version
- JetStreamEventStore: query JetStream subject and order results
- Consider pagination for very large actor histories
- fromVersion=0 means "start from beginning"
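An in-memory sketch of the query semantics above (inclusive fromVersion, ascending version order, empty slice when nothing matches); the JetStream implementation would page through the stream instead.

```go
package aether

import "sort"

// Event is a simplified stand-in for the persisted event.
type Event struct {
	ActorID string
	Version int64
}

// getEvents returns events for an actor with Version >= fromVersion,
// ordered ascending by version; fromVersion=0 means "from the beginning".
func getEvents(all []*Event, actorID string, fromVersion int64) []*Event {
	out := make([]*Event, 0)
	for _, e := range all {
		if e.ActorID == actorID && e.Version >= fromVersion {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(a, b int) bool { return out[a].Version < out[b].Version })
	return out // empty slice (not an error) when nothing matches
}
```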
Test Cases
- GetEvents(actorID, 0) with 5 events: returns all 5 in order
- GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
- GetEvents(nonexistent, 0): returns empty slice
- GetEvents with gap (versions 1, 3, 5): returns only those 3
- Order is guaranteed (version order, not insertion order)
Dependencies
- Depends on: Issue 1.1 (SaveEvent)
Issue 1.7: [Rule] Define and enforce snapshot validity
Type: New Feature Bounded Context: Event Sourcing Priority: P1
Title: Implement snapshot invalidation policy
User Story
As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.
Acceptance Criteria
- Snapshot valid until Version + MaxVersionGap (default 1000)
- GetLatestSnapshot returns nil if no snapshot or invalid
- Application can override MaxVersionGap in config
- Snapshot timestamp recorded for debugging
- No automatic cleanup; application calls SaveSnapshot to create
- Tests confirm snapshot invalidation logic
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Policy)
Aggregate: ActorSnapshot + SnapshotPolicy
Policy: Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap
Implementation:
- SnapshotStore.GetLatestSnapshot validates before returning
- If invalid, returns nil; application must replay
Technical Notes
- This is a safety policy; prevents stale snapshots
- Application owns decision to create snapshots (no auto-triggering)
- MaxVersionGap is tunable per deployment
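A sketch of the validity check implied by the policy above; ActorSnapshot fields follow Issue 1.10 and the helper name is illustrative.

```go
package aether

import "time"

// ActorSnapshot mirrors the fields named in Issue 1.10.
type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

// snapshotIsValid applies the policy: a snapshot is usable only while the
// version gap to the current stream head stays within maxVersionGap.
// Example: snapshot at v10, gap 100 -> valid at current v110, invalid at v111.
func snapshotIsValid(s *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	if s == nil {
		return false
	}
	return currentVersion-s.Version <= maxVersionGap
}
```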
Test Cases
- Snapshot at version 10, MaxGap=100, current=50: valid
- Snapshot at version 10, MaxGap=100, current=111: invalid
- Snapshot at version 10, MaxGap=100, current=110: valid
- GetLatestSnapshot returns nil for invalid snapshot
Dependencies
- Depends on: Issue 1.6 (GetEvents)
Issue 1.8: [Event] Publish SnapshotCreated for observability
Type: New Feature Bounded Context: Event Sourcing Priority: P1
Title: Emit snapshot creation event for lifecycle tracking
User Story
As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.
Acceptance Criteria
- SnapshotCreated event published after SaveSnapshot succeeds
- Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
- Metrics increment for snapshot creation
- No event if SaveSnapshot fails
- Example: Snapshot created every 1000 versions
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Event)
Event: SnapshotCreated(actorID, version, timestamp, replayDurationMs)
Triggered by: SaveSnapshot call succeeds
Consumers: Metrics, monitoring dashboards
Technical Notes
- SnapshotCreated is infrastructure event (like EventStored)
- ReplayDuration helps identify slow actors needing snapshots more frequently
Test Cases
- SaveSnapshot succeeds → SnapshotCreated published
- SaveSnapshot fails → no event published
- ReplayDuration recorded accurately
Dependencies
- Depends on: Issue 1.7 (SnapshotStore interface)
Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay
Type: New Feature Bounded Context: Event Sourcing Priority: P1
Title: Handle corrupted events during replay without data loss
User Story
As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.
Acceptance Criteria
- GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
- ReplayResult contains: []*Event (good) and []ReplayError (bad)
- Good events are returned in order despite errors
- ReplayError contains: SequenceNumber, RawData, UnmarshalError
- Application decides how to handle corrupted events
- Metrics track corruption frequency
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Query)
Interface: EventStoreWithErrors extends EventStore
Method: GetEventsWithErrors(actorID, fromVersion) → ReplayResult
Data:
- ReplayResult.Events: successfully deserialized events
- ReplayResult.Errors: corruption records
- ReplayResult.HasErrors(): convenience check
Technical Notes
- Already defined in event.go (ReplayError, ReplayResult)
- JetStreamEventStore should implement EventStoreWithErrors
- Application uses HasErrors() to decide on recovery action
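A sketch of the shapes described above, mirroring the names in the issue text; the authoritative definitions live in event.go.

```go
package aether

// Event is a simplified stand-in for the persisted event type.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

// ReplayError records one event that could not be deserialized.
type ReplayError struct {
	SequenceNumber uint64
	RawData        []byte // kept for forensics
	UnmarshalError error
}

// ReplayResult separates good events from corruption records so callers can
// keep processing clean data.
type ReplayResult struct {
	Events []*Event
	Errors []ReplayError
}

// HasErrors is the convenience check applications use to decide whether a
// recovery action is needed.
func (r *ReplayResult) HasErrors() bool { return len(r.Errors) > 0 }
```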
Test Cases
- All good events: ReplayResult.Events populated, no errors
- Corrupted event in middle: good events before/after, one error recorded
- Multiple corruptions: all recorded with context
- Application can inspect RawData for forensics
Dependencies
- Depends on: Issue 1.6 (GetEvents)
Issue 1.10: [Interface] Implement SnapshotStore interface
Type: New Feature Bounded Context: Event Sourcing Priority: P0
Title: Define snapshot storage contract
User Story
As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).
Acceptance Criteria
- SnapshotStore extends EventStore
- GetLatestSnapshot(actorID) returns ActorSnapshot or nil
- SaveSnapshot(snapshot) persists snapshot
- ActorSnapshot contains: ActorID, Version, State, Timestamp
- Namespace isolation: snapshots scoped to namespace
- Tests verify interface contract
Bounded Context: Event Sourcing
DDD Implementation Guidance
Type: New Feature (Interface)
Interface: SnapshotStore extends EventStore
Methods:
- GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
- SaveSnapshot(snapshot) → error
Aggregates: ActorSnapshot (value object)
Technical Notes
- Already defined in event.go
- Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
- Keep snapshots in same store as events (co-located)
Test Cases
- SaveSnapshot persists; GetLatestSnapshot retrieves it
- New actor: GetLatestSnapshot returns nil
- Multiple snapshots: only latest returned
- Namespace isolation: snapshots from tenant-a don't appear in tenant-b
Dependencies
- Depends on: Issue 1.1 (SaveEvent + storage foundation)
Feature Set 1c: Optimistic Concurrency Control
Capability: Enable Safe Concurrent Writes
Description: Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.
Success Condition: Two concurrent writers race; one succeeds, other sees VersionConflictError; application retries without locks.
Issue 1.11: [Rule] Enforce fail-fast on version conflict
Type: New Feature Bounded Context: Optimistic Concurrency Control Priority: P0
Title: Fail immediately on version conflict; no auto-retry
User Story
As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).
Acceptance Criteria
- SaveEvent returns VersionConflictError immediately on mismatch
- No built-in retry loop in SaveEvent
- No database-level locks held
- Application reads VersionConflictError and decides retry
- Default retry strategy documented (examples/)
Bounded Context: Optimistic Concurrency Control
DDD Implementation Guidance
Type: New Feature (Policy)
Invariant: Conflicts trigger immediate failure; application owns retry
Implementation:
- SaveEvent: version check, return error if mismatch, done
- No loop, no backoff, no retries
- Clean error with context for caller
Technical Notes
- This is a design choice: fail-fast enables flexible retry strategies
- Application can choose exponential backoff, jitter, circuit-breaker, etc.
Test Cases
- SaveEvent(version: 2) when current=2: fails immediately
- No retry attempted by library
- Application can retry if desired
- Example patterns in examples/retry.go
Dependencies
- Depends on: Issue 1.1 (SaveEvent)
Issue 1.12: [Documentation] Document concurrent write patterns
Type: New Feature Bounded Context: Optimistic Concurrency Control Priority: P1
Title: Provide retry strategy examples (backoff, circuit-breaker, queue)
User Story
As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.
Acceptance Criteria
- examples/retry_exponential_backoff.go
- examples/retry_circuit_breaker.go
- examples/retry_queue_based.go
- examples/concurrent_write_test.go showing patterns
- README mentions OCC patterns
- Each example is >100 lines with explanation
Bounded Context: Optimistic Concurrency Control
DDD Implementation Guidance
Type: Documentation
Artifacts:
- examples/retry_exponential_backoff.go
- examples/retry_circuit_breaker.go
- examples/retry_queue_based.go
- examples/concurrent_write_test.go
Content:
- How to read VersionConflictError
- When to retry (idempotent operations)
- When not to retry (non-idempotent)
- Backoff strategies
- Monitoring
Technical Notes
- Real, runnable code (not pseudocode)
- Show metrics collection
- Show when to give up
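One possible shape for the exponential-backoff example, retrying only on version conflicts against a hypothetical save callback; this is a sketch, not the final examples/ code.

```go
package aether

import (
	"errors"
	"math/rand"
	"time"
)

// ErrVersionConflict stands in for the library's sentinel error.
var ErrVersionConflict = errors.New("version conflict")

// SaveWithBackoff retries an idempotent save on version conflicts only,
// with exponential backoff plus jitter, and gives up after maxAttempts.
// The save callback is expected to re-read current state and rebuild the
// event on every attempt.
func SaveWithBackoff(save func() error, maxAttempts int) error {
	delay := 10 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = save()
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrVersionConflict) {
			return err // non-conflict errors are not retried
		}
		jitter := time.Duration(rand.Int63n(int64(delay)))
		time.Sleep(delay + jitter)
		delay *= 2
	}
	return err // still conflicting after maxAttempts; surface to the caller
}
```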
Test Cases
- Examples compile without error
- Examples use idempotent operations
- Test coverage for examples
Dependencies
- Depends on: Issue 1.11 (fail-fast behavior)
Phase 2: Local Event Bus
Feature Set 2a: Event Routing and Filtering
Capability: Route and Filter Domain Events
Description: Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.
Success Condition: Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.
Issue 2.1: [Command] Implement Publish to local subscribers
Type: New Feature Bounded Context: Event Bus Priority: P1
Title: Publish events to local subscribers
User Story
As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.
Acceptance Criteria
- Publish(namespaceID, event) sends to all subscribers of that namespace
- Exact subscribers (namespace="orders") receive event
- Wildcard subscribers (namespace="order*") receive matching events
- Events delivered in-process (no NATS yet)
- Buffered channels (100-event buffer) prevent blocking
- Events for subscribers with full buffers are dropped non-blocking (no deadlock)
- Metrics track publish count, receive count, dropped count
Bounded Context: Event Bus
DDD Implementation Guidance
Type: New Feature (Command)
Command: Publish(namespaceID, event)
Invariant: All subscribers matching namespace receive event
Implementation:
- Iterate exact subscribers for namespace
- Iterate wildcard subscribers matching pattern
- Deliver to each (non-blocking, buffered)
- Count drops
Technical Notes
- EventBus in eventbus.go already implements this
- Ensure buffered channels don't cause memory leaks
- Metrics important for observability
Test Cases
- Publish to "orders": exact subscriber of "orders" receives
- Publish to "orders.new": wildcard subscriber of "order*" receives
- Publish to "payments": subscriber to "orders" does NOT receive
- Subscriber with full buffer: event dropped (non-blocking)
- 1000 publishes: metrics accurate
Dependencies
- Depends on: Issue 2.2 (Subscribe)
Issue 2.2: [Command] Implement Subscribe with optional filter
Type: New Feature Bounded Context: Event Bus Priority: P1
Title: Register subscriber with optional event filter
User Story
As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.
Acceptance Criteria
- Subscribe(namespacePattern) returns <-chan *Event
- SubscribeWithFilter(namespacePattern, filter) returns filtered channel
- Filter supports EventTypes ([]string) and ActorPattern (string)
- Filters applied client-side (subscriber decides)
- Wildcard patterns work: "*" matches single token, ">" matches multiple
- Subscription channel is buffered (100 events)
- Unsubscribe(namespacePattern, ch) removes subscription
Bounded Context: Event Bus
DDD Implementation Guidance
Type: New Feature (Command)
Command: Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)
Invariants:
- Namespace pattern determines which namespaces
- Filter determines which events within namespace
- Both work together (AND logic)
Filter Types:
- EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
- ActorPattern: string (e.g., "order-customer-*")
Technical Notes
- Pattern matching follows NATS conventions
- Filters are optional (nil filter = all events)
- Filtering here is client-side; the NATS-backed bus (Phase 4) can additionally filter server-side via subjects
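A sketch of the client-side filter matching (AND logic) described above; the trailing-'*' prefix semantics for ActorPattern are an assumption for illustration.

```go
package aether

import "strings"

// EventFilter mirrors the filter fields named in the acceptance criteria.
type EventFilter struct {
	EventTypes   []string // empty means "all types"
	ActorPattern string   // empty means "all actors"; trailing '*' treated as a prefix wildcard here
}

// Matches applies the AND logic: an event passes only if both criteria pass.
func (f *EventFilter) Matches(eventType, actorID string) bool {
	if f == nil {
		return true // nil filter = all events
	}
	if len(f.EventTypes) > 0 {
		ok := false
		for _, t := range f.EventTypes {
			if t == eventType {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	if f.ActorPattern != "" {
		if strings.HasSuffix(f.ActorPattern, "*") {
			return strings.HasPrefix(actorID, strings.TrimSuffix(f.ActorPattern, "*"))
		}
		return actorID == f.ActorPattern
	}
	return true
}
```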
Test Cases
- Subscribe("orders"): exact match only
- Subscribe("order*"): wildcard match
- Subscribe("order.*"): NATS-style wildcard
- SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
- SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
- Unsubscribe closes channel
Dependencies
- Depends on: Issue 1.1 (events structure)
Issue 2.3: [Rule] Enforce exact subscription isolation
Type: New Feature Bounded Context: Event Bus + Namespace Isolation Priority: P1
Title: Guarantee exact namespace subscriptions are isolated
User Story
As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.
Acceptance Criteria
- Subscriber to "tenant-a" receives events from "tenant-a" only
- Subscriber to "tenant-a" does NOT receive from "tenant-b"
- Wildcard subscriber to "tenant*" receives from both
- Exact match subscribers are isolated from wildcard
- Tests verify isolation with multi-namespace setup
- Documentation warns about wildcard security implications
Bounded Context: Event Bus + Namespace Isolation
DDD Implementation Guidance
Type: New Feature (Policy/Invariant)
Invariant: Exact subscriptions are isolated
Implementation:
- exactSubscribers map[namespace][]*subscription
- Wildcard subscriptions separate collection
- Publish checks exact first, then wildcard patterns
Security Note: Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)
Technical Notes
- Enforced at EventBus.Publish level
- Exact match is simple string equality
- Wildcard uses MatchNamespacePattern helper
Test Cases
- Publish to "tenant-a": only "tenant-a" exact subscribers get it
- Publish to "tenant-b": only "tenant-b" exact subscribers get it
- Publish to "tenant-a": "tenant*" wildcard subscriber gets it
- Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it
Dependencies
- Depends on: Issue 2.2 (Subscribe)
Issue 2.4: [Rule] Document wildcard subscription security
Type: New Feature Bounded Context: Event Bus Priority: P1
Title: Document that wildcard subscriptions bypass isolation
User Story
As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.
Acceptance Criteria
- eventbus.go comments explain wildcard behavior
- Security warning in Subscribe godoc
- Example showing wildcard usage for logging
- Example showing why wildcard is dangerous (if not restricted)
- README mentions namespace isolation caveats
- Examples show proper patterns (monitoring, auditing)
Bounded Context: Event Bus
DDD Implementation Guidance
Type: Documentation
Content:
- Wildcard subscriptions receive all matching events
- Use for cross-cutting concerns (logging, monitoring, audit)
- Restrict access to trusted components
- Never expose wildcard pattern to untrusted users
Examples:
- Monitoring system subscribes to ">"
- Audit system subscribes to "tenant-*"
- Application logic uses exact subscriptions only
Technical Notes
- Intentional design; not a bug
- Different from NATS server-side filtering
Test Cases
- Examples compile
- Documentation is clear and accurate
Dependencies
- Depends on: Issue 2.3 (exact isolation)
Issue 2.5: [Event] Publish SubscriptionCreated for tracking
Type: New Feature Bounded Context: Event Bus Priority: P2
Title: Track subscription lifecycle
User Story
As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.
Acceptance Criteria
- SubscriptionCreated event published on Subscribe
- SubscriptionDestroyed event published on Unsubscribe
- Event contains: namespacePattern, filterCriteria, timestamp
- Metrics increment on subscribe/unsubscribe
- SubscriberCount(namespace) returns current count
Bounded Context: Event Bus
DDD Implementation Guidance
Type: New Feature (Event)
Event: SubscriptionCreated(namespacePattern, filter, timestamp)
Event: SubscriptionDestroyed(namespacePattern, timestamp)
Metrics: Subscriber count per namespace
Technical Notes
- SubscriberCount already in eventbus.go
- Add events to EventBus.Subscribe and EventBus.Unsubscribe
- Internal events (infrastructure)
Test Cases
- Subscribe → metrics increment
- Unsubscribe → metrics decrement
- SubscriberCount correct
Dependencies
- Depends on: Issue 2.2 (Subscribe/Unsubscribe)
Issue 2.6: [Event] Publish EventPublished for delivery tracking
Type: New Feature Bounded Context: Event Bus Priority: P2
Title: Record event publication metrics
User Story
As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.
Acceptance Criteria
- EventPublished event published on Publish
- Metrics track: published count, delivered count, dropped count per namespace
- Dropped events (full channel) recorded
- Application can query metrics via Metrics()
- Example: 1000 events published, 995 delivered, 5 dropped
Bounded Context: Event Bus
DDD Implementation Guidance
Type: New Feature (Event/Metrics)
Event: EventPublished (infrastructure event)
Metrics:
- PublishCount[namespace]
- DeliveryCount[namespace]
- DroppedCount[namespace]
Implementation:
- RecordPublish(namespace)
- RecordReceive(namespace)
- RecordDroppedEvent(namespace)
Technical Notes
- Metrics already in DefaultMetricsCollector
- RecordDroppedEvent signals subscriber backpressure
- Can be used to auto-scale subscribers
Test Cases
- Publish 100 events: metrics show 100 published
- All delivered: metrics show 100 delivered
- Full subscriber: next event dropped, metrics show 1 dropped
- Query via bus.Metrics(): values accurate
Dependencies
- Depends on: Issue 2.1 (Publish)
Issue 2.7: [Read Model] Implement GetSubscriptions query
Type: New Feature Bounded Context: Event Bus Priority: P2
Title: Query active subscriptions for operational visibility
User Story
As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.
Acceptance Criteria
- GetSubscriptions() returns []SubscriptionInfo
- SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
- Works for both exact and wildcard subscriptions
- Metrics accessible via SubscriberCount(namespace)
- Example: "What subscriptions are listening to 'orders'?"
Bounded Context: Event Bus
DDD Implementation Guidance
Type: New Feature (Query)
Read Model: SubscriptionRegistry
Data:
- Pattern: namespace pattern (e.g., "tenant-*")
- Filter: optional filter criteria
- SubscriberID: unique ID for each subscription
- CreatedAt: timestamp
Implementation:
- Track subscriptions in eventbus.go
- Expose via GetSubscriptions() method
Technical Notes
- Useful for debugging
- Optional feature; not critical
Test Cases
- Subscribe to "orders": GetSubscriptions shows it
- Subscribe to "order*": GetSubscriptions shows it
- Unsubscribe: GetSubscriptions removes it
- Multiple subscribers: all listed
Dependencies
- Depends on: Issue 2.2 (Subscribe)
Feature Set 2b: Buffering and Backpressure
Capability: Route and Filter Domain Events (non-blocking delivery)
Description: Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).
Success Condition: Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.
Issue 2.8: [Rule] Implement non-blocking event delivery
Type: New Feature Bounded Context: Event Bus Priority: P1
Title: Ensure event publication never blocks
User Story
As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.
Acceptance Criteria
- Publish(namespace, event) returns immediately
- If subscriber channel full, event dropped (non-blocking)
- Dropped events counted in metrics
- Buffered channel size is 100 (tunable)
- Publisher never waits for subscriber
- Metrics alert on high drop rate
Bounded Context: Event Bus
DDD Implementation Guidance
Type: New Feature (Policy)
Invariant: Publishers not blocked by slow subscribers
Implementation:
- select { case ch <- event: ... default: ... }
- Count drops in default case
Trade-off:
- Pro: Publisher never blocks
- Con: Events may be lost if subscriber can't keep up
- Mitigation: Metrics alert on drops; subscriber can increase buffer or retry
Technical Notes
- Already implemented in eventbus.go (deliverToSubscriber)
- 100-event buffer is reasonable default
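The non-blocking send reduces to a select with a default branch, roughly as sketched below; names are illustrative stand-ins for the existing eventbus.go code.

```go
package aether

// Event is a simplified stand-in for the bus event type.
type Event struct {
	Type string
	Data []byte
}

// deliverToSubscriber sketches the non-blocking send: if the subscriber's
// buffered channel is full, the event is dropped and counted, and the
// publisher returns immediately.
func deliverToSubscriber(ch chan<- *Event, e *Event, onDrop func()) {
	select {
	case ch <- e:
		// delivered within buffer capacity
	default:
		onDrop() // buffer full: drop rather than block the publisher
	}
}
```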
Test Cases
- Subscribe, receive 100 events: no drops
- Publish 101st event immediately: dropped
- Metrics show drop count
- Publisher latency < 1ms regardless of subscribers
Dependencies
- Depends on: Issue 2.1 (Publish)
Issue 2.9: [Documentation] Document EventBus backpressure handling
Type: New Feature Bounded Context: Event Bus Priority: P2
Title: Explain buffer management and recovery from drops
User Story
As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.
Acceptance Criteria
- Document buffer size (100 events default)
- Explain what happens on overflow (event dropped)
- Document recovery patterns (subscriber restarts, re-syncs)
- Example: Subscriber catches up from JetStream after restart
- Metrics to monitor (drop rate)
- README section on backpressure
Bounded Context: Event Bus
DDD Implementation Guidance
Type: Documentation
Content:
- Buffer size and behavior
- Drop semantics
- Recovery patterns
- Metrics to monitor
- When to increase buffer size
Examples:
- Slow subscriber: increase buffer or fix handler
- Network latency: events may be dropped
- Handler panics: subscriber must restart and re-sync
Technical Notes
- Dropped events are lost on the local bus; durability requires JetStream
- Phase 4 (NATS/JetStream) addresses durability
Test Cases
- Documentation is clear
- Examples work
Dependencies
- Depends on: Issue 2.8 (non-blocking delivery)
Phase 3: Cluster Coordination
Feature Set 3a: Cluster Topology and Leadership
Capability: Coordinate Cluster Topology
Description: Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.
Success Condition: Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.
Issue 3.1: [Command] Implement JoinCluster protocol
Type: New Feature Bounded Context: Cluster Coordination Priority: P1
Title: Enable node discovery via cluster join
User Story
As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.
Acceptance Criteria
- JoinCluster() announces node via NATS
- Node info contains: NodeID, Address, Timestamp, Status
- Other nodes receive join announcement
- Cluster topology updated atomically
- Rejoining node detected and updated
- Tests verify multi-node discovery
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Command)
Command: JoinCluster()
Aggregates: Cluster (group of nodes)
Events: NodeJoined(nodeID, address, timestamp)
Technical Notes
- NATS subject: "aether.cluster.nodes"
- NodeDiscovery subscribes to announcements
- ClusterManager.Start() initiates join
Test Cases
- Single node joins: topology = [node-a]
- Second node joins: topology = [node-a, node-b]
- Third node joins: topology = [node-a, node-b, node-c]
- Node rejoins: updates existing entry
Dependencies
- None (first cluster feature)
Issue 3.2: [Command] Implement LeaderElection
Type: New Feature Bounded Context: Cluster Coordination Priority: P0
Title: Elect single leader via NATS-based voting
User Story
As a cluster, I want one node to be elected leader so that it can coordinate shard assignments and rebalancing.
Acceptance Criteria
- LeaderElection holds election every HeartbeatInterval (5s)
- No vote counting; nodes announce candidacy and the first announcement wins
- One leader elected per term
- Leader holds lease (TTL = 2 * HeartbeatInterval)
- All nodes converge on same leader
- Lease renewal happens automatically
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Command)
Command: ElectLeader()
Aggregates: LeadershipLease (time-bound authority)
Events: LeaderElected(leaderID, term, leaseExpiration)
Technical Notes
- NATS subject: "aether.cluster.election"
- Each node publishes heartbeat with NodeID, Timestamp
- First node to publish becomes leader
- Lease expires if no heartbeat for TTL
Test Cases
- Single node: elected immediately
- Three nodes: exactly one elected
- Leader dies: remaining nodes elect new leader within 2*interval
- Leader comes back: may or may not stay leader
Dependencies
- Depends on: Issue 3.1 (node discovery)
Issue 3.3: [Rule] Enforce single leader invariant
Type: New Feature Bounded Context: Cluster Coordination Priority: P0
Title: Guarantee exactly one leader at any time
User Story
As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.
Acceptance Criteria
- At most one leader at any time (lease-based)
- If leader lease expires, no leader until re-election
- All nodes see same leader (or none)
- Tests verify invariant under various failure scenarios
- Split-brain window bounded by the lease TTL (a stale leader stops acting once its lease expires)
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Invariant)
Invariant: At most one leader (enforced by lease TTL)
Mechanism:
- Leader publishes heartbeat every HeartbeatInterval
- Other nodes trust leader if heartbeat < HeartbeatInterval old
- If no heartbeat for 2*HeartbeatInterval, lease expired
- New election begins
Technical Notes
- Lease-based; not consensus-based (simpler)
- Allows temporary split-brain until lease expires
- Acceptable for Aether (eventual consistency)
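A sketch of the lease bookkeeping implied above, assuming the 2x heartbeat-interval TTL; the real election loop may differ.

```go
package cluster

import "time"

// leaseState tracks the last heartbeat seen from the current leader.
type leaseState struct {
	leaderID          string
	lastHeartbeat     time.Time
	heartbeatInterval time.Duration
}

// leaseExpired reports whether the leader's lease has lapsed
// (no heartbeat for 2x the heartbeat interval).
func (l *leaseState) leaseExpired(now time.Time) bool {
	return now.Sub(l.lastHeartbeat) > 2*l.heartbeatInterval
}

// shouldStartElection: a node only campaigns once the lease has expired,
// which bounds any split-brain window to the lease TTL.
func (l *leaseState) shouldStartElection(now time.Time) bool {
	return l.leaderID == "" || l.leaseExpired(now)
}
```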
Test Cases
- Simulate leader death: lease expires, new leader elected
- Simulate network partition: partition may have >1 leader until lease expires
- Verify no coordination during lease expiration
Dependencies
- Depends on: Issue 3.2 (leader election)
Issue 3.4: [Event] Publish LeaderElected on election
Type: New Feature Bounded Context: Cluster Coordination Priority: P1
Title: Record leadership election outcomes
User Story
As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.
Acceptance Criteria
- LeaderElected event published after successful election
- Event contains: LeaderID, Term, LeaseExpiration, Timestamp
- Metrics increment on election
- Helpful for debugging split-brain scenarios
- Track election frequency (ideally < 1 per minute)
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Event)
Event: LeaderElected(leaderID, term, leaseExpiration, timestamp)
Triggered by: Successful election
Consumers: Metrics, audit logs
Technical Notes
- Event published locally to all observers
- Infrastructure event (not domain event)
Test Cases
- Election happens: event published
- Term increments: event reflects new term
- Metrics accurate
Dependencies
- Depends on: Issue 3.2 (election)
Issue 3.5: [Event] Publish LeadershipLost on lease expiration
Type: New Feature Bounded Context: Cluster Coordination Priority: P2
Title: Track leadership transitions
User Story
As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.
Acceptance Criteria
- LeadershipLost event published when lease expires
- Event contains: PreviousLeaderID, Timestamp, Reason
- Metrics track leadership transitions
- Helpful for debugging cascading failures
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Event)
Event: LeadershipLost(previousLeaderID, timestamp, reason)
Reason: "lease_expired", "node_failed", etc.
Technical Notes
- Published when lease TTL expires
- Useful for observability
Test Cases
- Leader lease expires: LeadershipLost published
- Metrics show transition
Dependencies
- Depends on: Issue 3.2 (election)
Issue 3.6: [Read Model] Implement GetClusterTopology query
Type: New Feature Bounded Context: Cluster Coordination Priority: P1
Title: Query current cluster members and status
User Story
As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.
Acceptance Criteria
- GetNodes() returns map[nodeID]*NodeInfo
- NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
- Status is: Active, Degraded, Failed
- LastSeen is accurate heartbeat timestamp
- ShardIDs show shard ownership (filled in Phase 3b)
- Example: "node-a is active; node-b failed 30s ago"
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Query)
Read Model: ClusterTopology
Data:
- NodeID → NodeInfo (status, heartbeat, shards)
- LeaderID (current leader)
- Term (election term)
Technical Notes
- ClusterManager maintains topology in-memory
- Update on each heartbeat/announcement
Test Cases
- GetNodes() returns active nodes
- Status accurate (Active, Failed, etc.)
- LastSeen updates on heartbeat
- Rejoining node updates existing entry
Dependencies
- Depends on: Issue 3.1 (node discovery)
Issue 3.7: [Read Model] Implement GetLeader query
Type: New Feature Bounded Context: Cluster Coordination Priority: P0
Title: Query current leader
User Story
As a client, I want to know who the leader is, so that I can route coordination requests to the right node.
Acceptance Criteria
- GetLeader() returns current leader NodeID or ""
- IsLeader() returns true if this node is leader
- Both consistent with LeaderElection state
- Updated immediately on election
- Example: "node-b is leader (term 5)"
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Query)
Read Model: LeadershipRegistry
Data: CurrentLeader, CurrentTerm, LeaseExpiration
Implementation:
- LeaderElection maintains this
- ClusterManager queries it
Technical Notes
- Critical for routing coordination work
- Must be consistent across cluster
Test Cases
- No leader: GetLeader returns ""
- Leader elected: GetLeader returns leader ID
- IsLeader true on leader, false on others
- Changes on re-election
Dependencies
- Depends on: Issue 3.2 (election)
Feature Set 3b: Shard Distribution
Capability: Distribute Actors Across Cluster Nodes
Description: Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.
Success Condition: 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.
Issue 3.8: [Command] Implement consistent hash ring
Type: New Feature Bounded Context: Cluster Coordination Priority: P1
Title: Distribute shards across nodes with minimal reshuffling
User Story
As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.
Acceptance Criteria
- ConsistentHashRing(numShards=1024) creates ring
- GetShard(actorID) returns consistent shard [0, 1024)
- AddNode(nodeID) rebalances ~numShards/numNodes shards
- RemoveNode(nodeID) rebalances shards evenly
- Same actor always maps to same shard
- Reshuffling < 40% on node add/remove
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Command)
Command: AssignShards(nodes)
Aggregates: ConsistentHashRing (distribution algorithm)
Invariants:
- Each shard [0, 1024) assigned to exactly one node
- ActorID hashes consistently to shard
- Topology changes minimize reassignment
Technical Notes
- hashring.go already implements this
- Use crypto/md5 or compatible hash
- 1024 shards is tunable (P1 default)
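A sketch of the actorID → shard → node lookup, assuming an md5-based hash over 1024 shards as the notes suggest; hashring.go may slice the digest differently.

```go
package cluster

import (
	"crypto/md5"
	"encoding/binary"
)

const numShards = 1024 // tunable default from the backlog

// shardFor hashes an actor ID to a stable shard in [0, numShards).
func shardFor(actorID string) uint32 {
	sum := md5.Sum([]byte(actorID))
	return uint32(binary.BigEndian.Uint64(sum[:8]) % numShards)
}

// ownerFor resolves the routing chain actorID -> shard -> owning node,
// given the current shard map maintained by the leader.
func ownerFor(actorID string, shardMap map[uint32]string) (nodeID string, ok bool) {
	nodeID, ok = shardMap[shardFor(actorID)]
	return nodeID, ok
}
```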
Test Cases
- Single node: all shards assigned to it
- Two nodes: ~512 shards each
- Three nodes: ~341 shards each
- Add fourth node: ~256 shards each (~25% reshuffled)
- Remove node: remaining nodes rebalance evenly
- Same actor-id always hashes to same shard
Dependencies
- Depends on: Issue 3.1 (node discovery)
Issue 3.9: [Rule] Enforce single shard owner invariant
Type: New Feature Bounded Context: Cluster Coordination Priority: P0
Title: Guarantee each shard has exactly one owner
User Story
As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.
Acceptance Criteria
- ShardMap tracks shard → nodeID assignment
- No shard is unassigned (every shard has owner)
- No shard assigned to multiple nodes
- Reassignment is atomic (no in-between state)
- Tests verify invariant after topology changes
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Invariant)
Invariant: Each shard [0, 1024) assigned to exactly one active node
Mechanism:
- ShardMap[shardID] = [nodeID]
- Maintained by leader
- Updated atomically on rebalancing
Technical Notes
- shard.go implements ShardManager
- Validated after each rebalancing
Test Cases
- After rebalancing: all shards assigned
- No orphaned shards
- No multiply-assigned shards
- Reassignment is atomic
Dependencies
- Depends on: Issue 3.8 (consistent hashing)
Issue 3.10: [Event] Publish ShardAssigned on assignment
Type: New Feature Bounded Context: Cluster Coordination Priority: P2
Title: Track shard-to-node assignments
User Story
As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.
Acceptance Criteria
- ShardAssigned event published after assignment
- Event contains: ShardID, NodeID, Timestamp
- Metrics track: shards per node, rebalancing frequency
- Example: Shard 42 assigned to node-b
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Event)
Event: ShardAssigned(shardID, nodeID, timestamp)
Triggered by: AssignShards command succeeds
Metrics: Shards per node, distribution evenness
Technical Notes
- Infrastructure event
- Useful for monitoring load distribution
Test Cases
- Assignment published on rebalancing
- Metrics reflect distribution
Dependencies
- Depends on: Issue 3.9 (shard ownership)
Issue 3.11: [Read Model] Implement GetShardAssignments query
Type: New Feature Bounded Context: Cluster Coordination Priority: P1
Title: Query shard-to-node mapping
User Story
As a client, I want to know which node owns a shard, so that I can route actor requests correctly.
Acceptance Criteria
- GetShardAssignments() returns ShardMap
- ShardMap[shardID] returns owning nodeID
- GetShard(actorID) returns shard for actor
- Routing decision: actorID → shard → nodeID
- Cached locally; refreshed on each rebalancing
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Query)
Read Model: ShardMap
Data:
- ShardID → NodeID (primary owner)
- Version (incremented on rebalancing)
- UpdateTime
Implementation:
- ClusterManager.GetShardMap()
- Cached; updated on assignment changes
Technical Notes
- Critical for routing
- Must be consistent across cluster
- Version helps detect stale caches
Test Cases
- GetShardAssignments returns current map
- GetShard(actorID) returns consistent shard
- Routing: actor ID → shard → node owner
Dependencies
- Depends on: Issue 3.9 (shard ownership)
Feature Set 3c: Failure Detection and Recovery
Capability: Recover from Node Failures
Description: Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.
Success Condition: Node dies → failure detected within 90s → shards reassigned → actors replay automatically.
Issue 3.12: [Command] Implement node health checks
Type: New Feature Bounded Context: Cluster Coordination Priority: P1
Title: Detect node failures via heartbeat timeout
User Story
As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.
Acceptance Criteria
- Each node publishes heartbeat every 30s
- Nodes without heartbeat for 90s marked as Failed
- checkNodeHealth() runs every 30s
- Failed node's status updates atomically
- Tests verify failure detection timing
- Failed node can rejoin cluster
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Command)
Command: MarkNodeFailed(nodeID)
Trigger: monitorNodes detects missing heartbeat
Events: NodeFailed(nodeID, lastSeenTimestamp)
Technical Notes
- monitorNodes() loop in manager.go
- Check LastSeen timestamp
- Update status if stale (>90s)
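A sketch of the health sweep described above, assuming a 90s failure timeout and a simple status enum; the real monitorNodes loop also publishes events and notifies the leader.

```go
package cluster

import "time"

type NodeStatus int

const (
	StatusActive NodeStatus = iota
	StatusFailed
)

// NodeInfo is a simplified stand-in for the topology entry.
type NodeInfo struct {
	ID       string
	Status   NodeStatus
	LastSeen time.Time
}

// checkNodeHealth marks any node silent for longer than failureTimeout
// (90s in the backlog) as Failed; it is meant to run every 30s.
// It returns the IDs that transitioned so the caller can publish NodeFailed.
func checkNodeHealth(nodes map[string]*NodeInfo, now time.Time, failureTimeout time.Duration) []string {
	var failed []string
	for id, n := range nodes {
		if n.Status == StatusActive && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = StatusFailed
			failed = append(failed, id)
		}
	}
	return failed
}
```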
Test Cases
- Active node: status stays Active
- No heartbeat for 90s: status → Failed
- Rejoin: status → Active
- Failure detected within 90-120s of the last heartbeat
Dependencies
- Depends on: Issue 3.1 (node discovery)
Issue 3.13: [Command] Implement RebalanceShards after node failure
Type: New Feature Bounded Context: Cluster Coordination Priority: P0
Title: Reassign failed node's shards to healthy nodes
User Story
As the cluster, I want to reassign failed node's shards automatically, so that actors are available on new nodes.
Acceptance Criteria
- Leader detects node failure
- Leader triggers RebalanceShards
- Failed node's shards reassigned evenly
- No shard left orphaned
- ShardMap updated atomically
- Rebalancing completes within 5 seconds
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Command)
Command: RebalanceShards(failedNodeID)
Aggregates: ShardMap, ConsistentHashRing
Events: RebalanceStarted, ShardMigrated
Technical Notes
- Leader only (IsLeader() check)
- Use consistent hashing to assign
- Calculate new assignments atomically
Test Cases
- Node-a fails with shards [1, 2, 3]
- Leader reassigns [1, 2, 3] to remaining nodes
- No orphaned shards
- Rebalancing < 5s
Dependencies
- Depends on: Issue 3.8 (consistent hashing)
- Depends on: Issue 3.12 (failure detection)
Issue 3.14: [Rule] Enforce no-orphan invariant
Type: New Feature Bounded Context: Cluster Coordination Priority: P0
Title: Guarantee all shards have owners after rebalancing
User Story
As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.
Acceptance Criteria
- Before rebalancing: verify no orphaned shards
- After rebalancing: verify all shards assigned
- Tests fail if invariant violated
- Rebalancing aborted if invariant would be violated
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Invariant)
Invariant: All shards [0, 1024) have owners after any rebalancing
Check:
- Count assigned shards
- Verify = 1024
- Abort if not
Technical Notes
- Validate before committing ShardMap
- Logs errors but doesn't assert (graceful degradation)
Test Cases
- Rebalancing completes: all shards assigned
- Orphaned shard detected: rebalancing rolled back
- Tests verify count = 1024
Dependencies
- Depends on: Issue 3.13 (rebalancing)
Issue 3.15: [Event] Publish NodeFailed on failure detection
Type: New Feature Bounded Context: Cluster Coordination Priority: P2
Title: Record node failure for observability
User Story
As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.
Acceptance Criteria
- NodeFailed event published when failure detected
- Event contains: NodeID, LastSeenTimestamp, AffectedShards
- Metrics track failure frequency
- Example: "node-a failed; 341 shards affected"
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Event)
Event: NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)
Triggered by: checkNodeHealth marks node failed
Consumers: Metrics, alerts, audit logs
Technical Notes
- Infrastructure event
- AffectedShards helps assess impact
Test Cases
- Node failure detected: event published
- Metrics show affected shard count
Dependencies
- Depends on: Issue 3.12 (failure detection)
Issue 3.16: [Event] Publish ShardMigrated on shard movement
Type: New Feature Bounded Context: Cluster Coordination Priority: P2
Title: Track shard migrations
User Story
As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.
Acceptance Criteria
- ShardMigrated event published on each shard movement
- Event contains: ShardID, FromNodeID, ToNodeID, Status
- Status: "Started", "InProgress", "Completed", "Failed"
- Metrics track migration count and duration
- Example: "Shard 42 migrated from node-a to node-b (2.3s)"
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: New Feature (Event)
Event: ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)
Status: Started → InProgress → Completed
Consumers: Metrics, progress tracking
Technical Notes
- Published for each shard move
- Helps track rebalancing progress
- Useful for SLO monitoring
Test Cases
- Shard moves: event published
- Metrics track duration
- Status transitions correct
Dependencies
- Depends on: Issue 3.13 (rebalancing)
Issue 3.17: [Documentation] Document actor migration and replay
Type: New Feature Bounded Context: Cluster Coordination Priority: P2
Title: Explain how actors move and recover state
User Story
As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.
Acceptance Criteria
- Design doc: cluster/ACTOR_MIGRATION.md
- Explain shard reassignment process
- Explain state rebuild via GetEvents + replay
- Explain snapshot optimization
- Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
- Explain out-of-order message handling
Bounded Context: Cluster Coordination
DDD Implementation Guidance
Type: Documentation
Content:
- Shard assignment (consistent hashing)
- Actor discovery (routing via shard map)
- State rebuild (replay from JetStream)
- Snapshots (optional optimization)
- In-flight messages (may arrive before replay completes)
Examples:
- Manual failover: reassign shards manually
- Auto failover: leader initiates on failure detection
Technical Notes
- Complex topic; good documentation prevents bugs
Test Cases
- Documentation is clear
- Examples correct
Dependencies
- Depends on: Issue 3.13 (rebalancing)
- Depends on: Phase 1 (event replay)
Phase 4: Namespace Isolation and NATS Event Delivery
Feature Set 4a: Namespace Storage Isolation
Capability: Isolate Logical Domains Using Namespaces
Description: Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at persistence layer.
Success Condition: Two stores with namespaces "tenant-a", "tenant-b"; event saved in "tenant-a" invisible to "tenant-b" queries.
Issue 4.1: [Rule] Enforce namespace-based stream naming
Type: New Feature Bounded Context: Namespace Isolation Priority: P1
Title: Use namespace prefixes in JetStream stream names
User Story
As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.
Acceptance Criteria
- Namespace "tenant-a" → stream "tenant-a_events"
- Namespace "tenant-b" → stream "tenant-b_events"
- Empty namespace → stream "events" (default)
- JetStreamConfig.Namespace sets prefix
- NewJetStreamEventStoreWithNamespace convenience function
- Tests verify stream names have namespace prefix
Bounded Context: Namespace Isolation
DDD Implementation Guidance
Type: New Feature (Configuration)
Value Object: Namespace (string identifier)
Implementation:
- JetStreamConfig.Namespace field
- StreamName = namespace + "_events" if namespace set
- StreamName = "events" if namespace empty
Technical Notes
- Already partially implemented in jetstream.go
- Ensure safe characters (sanitize spaces, dots, wildcards)
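A sketch of the stream-name derivation and the sanitization hinted at above; the exact character replacement rules are an assumption (see also Issue 4.4).

```go
package aether

import "strings"

// streamNameFor derives the JetStream stream name from a namespace:
// "tenant-a" -> "tenant-a_events", "" -> "events". The replacement set
// (spaces, dots, wildcards -> underscores) is illustrative only.
func streamNameFor(namespace string) string {
	if namespace == "" {
		return "events"
	}
	safe := strings.NewReplacer(" ", "_", ".", "_", "*", "_", ">", "_").Replace(namespace)
	return safe + "_events"
}
```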
Test Cases
- NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
- NewJetStreamEventStoreWithNamespace(""): creates stream "events"
- Stream name verified
Dependencies
- None (orthogonal to other contexts)
Issue 4.2: [Rule] Enforce storage-level namespace isolation
Type: New Feature Bounded Context: Namespace Isolation Priority: P0
Title: Prevent cross-namespace data leakage at storage layer
User Story
As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.
Acceptance Criteria
- SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
- GetEvents("tenant-a") queries "tenant-a_events" stream only
- No possibility of accidental cross-namespace leakage
- JetStream subject filtering enforces isolation
- Integration tests verify with multiple namespaces
Bounded Context: Namespace Isolation
DDD Implementation Guidance
Type: New Feature (Invariant)
Invariant: Events from namespace X are invisible to namespace Y
Mechanism:
- Separate JetStream streams per namespace
- Subject prefixing: "tenant-a.events.actor-123"
- Subscribe filters by subject prefix
Technical Notes
- jetstream.go: SubscribeToActorEvents uses subject prefix
- Consumer created with subject filter matching namespace
Test Cases
- SaveEvent to tenant-a: visible in tenant-a queries
- Same event invisible to tenant-b queries
- GetLatestVersion scoped to namespace
- GetEvents scoped to namespace
- Multi-namespace integration test
Dependencies
- Depends on: Issue 4.1 (stream naming)
Issue 4.3: [Documentation] Document namespace design patterns
Type: New Feature Bounded Context: Namespace Isolation Priority: P1
Title: Provide guidance on namespace naming and use
User Story
As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.
Acceptance Criteria
- Design doc: NAMESPACE_DESIGN_PATTERNS.md
- Pattern 1: "tenant-{id}" (per-customer)
- Pattern 2: "env.domain" (per-env, per-bounded-context)
- Pattern 3: "env.domain.customer" (most granular)
- Examples of each pattern
- Guidance on choosing granularity
- Anti-patterns (wildcards, spaces, leading/trailing dots)
Bounded Context: Namespace Isolation
DDD Implementation Guidance
Type: Documentation
Content:
- Multi-tenant patterns
- Granularity decisions
- Namespace naming rules
- Examples
- Anti-patterns
- Performance implications
Examples:
- SaaS: "tenant-uuid"
- Microservices: "service.orders"
- Complex: "env.service.tenant"
Technical Notes
- No hard restrictions; naming is flexible
- Sanitization (spaces → underscores)
Test Cases
- Documentation is clear
- Examples valid
Dependencies
- Depends on: Issue 4.1 (stream naming)
Issue 4.4: [Validation] Add namespace format validation (P2)
Type: New Feature Bounded Context: Namespace Isolation Priority: P2
Title: Validate namespace names to prevent invalid streams
User Story
As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.
Acceptance Criteria
- ValidateNamespace(ns string) returns error for invalid names
- Rejects: "tenant-*", "tenant a", "tenant."
- Accepts: "tenant-abc", "prod.orders", "tenant_123"
- Called on NewJetStreamEventStoreWithNamespace
- Clear error messages
- Tests verify validation rules
Bounded Context: Namespace Isolation
DDD Implementation Guidance
Type: New Feature (Validation)
Validation Rules:
- No wildcards (*, >)
- No spaces
- No leading/trailing dots
- Alphanumeric, hyphens, underscores, dots only
Implementation:
- ValidateNamespace regex
- Called before stream creation
Technical Notes
- Nice-to-have; namespace strings are currently accepted as-is
- Could sanitize instead of rejecting (e.g., replace spaces with underscores)
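A possible shape for the validator, assuming the rules above; the regex and error text are illustrative:

```go
package aether

import (
	"fmt"
	"regexp"
)

// namespacePattern: dot-separated segments of alphanumerics, hyphens, and
// underscores. This rejects wildcards, spaces, and leading/trailing dots.
var namespacePattern = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace returns an error when the namespace would produce an
// invalid stream or subject name.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil // empty namespace falls back to the default "events" stream
	}
	if !namespacePattern.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: use alphanumerics, hyphens, underscores, and interior dots only", ns)
	}
	return nil
}
```

This accepts "tenant-abc", "prod.orders", and "tenant_123" while rejecting "tenant-*", "tenant a", and ".prod", matching the test cases below.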
Test Cases
- Valid: "tenant-abc", "prod.orders"
- Invalid: "tenant-*", "tenant a", ".prod"
- Error messages clear
Dependencies
- Depends on: Issue 4.1 (stream naming)
Feature Set 4b: Cross-Node Event Delivery via NATS
Capability: Deliver Events Across Cluster Nodes
Description: Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.
Success Condition: Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).
Issue 4.5: [Command] Implement NATSEventBus wrapper
Type: New Feature Bounded Context: Event Bus (with NATS) Priority: P1
Title: Extend EventBus with NATS-native pub/sub
User Story
As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.
Acceptance Criteria
- NATSEventBus embeds EventBus
- Publish(namespace, event) sends to local EventBus AND NATS
- NATS subject: "aether.events.{namespace}"
- SubscribeWithFilter works across nodes
- Self-published events not re-delivered (avoid loops)
- Tests verify cross-node delivery
Bounded Context: Event Bus (NATS extension)
DDD Implementation Guidance
Type: New Feature (Extension)
Aggregate: EventBus extended with NATSEventBus
Commands: Publish(namespace, event) [same interface, distributed]
Implementation:
- NATSEventBus composes EventBus
- Override Publish to also publish to NATS
- Subscribe to NATS subjects matching namespace
Technical Notes
- nats_eventbus.go already partially implemented
- NATS subject: "aether.events.orders" for namespace "orders"
- Include sourceNodeID in event to prevent redelivery
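A minimal sketch of the wrapper, assuming a simplified Event shape and an in-process bus interface; all names here are illustrative, not the existing nats_eventbus.go API:

```go
package aether

import (
	"encoding/json"

	"github.com/nats-io/nats.go"
)

// Event is a placeholder for the project's event type; only the fields
// this sketch needs are shown.
type Event struct {
	ID           string `json:"id"`
	SourceNodeID string `json:"source_node_id"`
	Payload      []byte `json:"payload"`
}

// localBus stands in for the existing in-process EventBus.
type localBus interface {
	Publish(namespace string, event Event) error
}

// NATSEventBus wraps the local bus and mirrors publishes to NATS.
type NATSEventBus struct {
	local  localBus
	js     nats.JetStreamContext
	nodeID string // stamped on outgoing events so nodes can drop self-published copies
}

// Publish delivers the event locally first, then to "aether.events.{namespace}".
func (b *NATSEventBus) Publish(namespace string, event Event) error {
	if err := b.local.Publish(namespace, event); err != nil {
		return err
	}
	event.SourceNodeID = b.nodeID
	data, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = b.js.Publish("aether.events."+namespace, data)
	return err
}
```

Stamping SourceNodeID before the NATS publish is what lets the subscription path drop self-published copies (see Issue 4.8).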
Test Cases
- Publish on node-a: local subscribers on node-a receive
- Same publish: node-b subscribers receive via NATS
- Self-loop prevented: node-a doesn't re-receive own publish
- Multi-node: all nodes converge on same events
Dependencies
- Depends on: Issue 2.1 (EventBus.Publish)
- Depends on: Issue 3.1 (cluster setup for multi-node tests)
Issue 4.6: [Rule] Enforce exactly-once delivery across cluster
Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1
Title: Guarantee events delivered to all cluster subscribers
User Story
As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.
Acceptance Criteria
- Event published to NATS with JetStream consumer
- Consumer acknowledges delivery
- Redelivery on network failure (JetStream handles)
- No duplicate delivery to same subscriber
- All nodes see same events in same order
Bounded Context: Event Bus (NATS)
DDD Implementation Guidance
Type: New Feature (Invariant)
Invariant: Exactly-once delivery to each subscriber
Mechanism:
- JetStream consumer per subscriber group
- Acknowledgment on delivery
- Automatic redelivery on timeout
Technical Notes
- JetStream handles durability and ordering; acknowledgements give at-least-once delivery, so publish-side deduplication (Nats-Msg-Id within the duplicate window) is needed to approach exactly-once
- Consumer name = subscriber ID
- Push consumer model (events pushed to subscriber)
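A sketch of the durable, explicitly acknowledged consumer, continuing the hypothetical package from the Issue 4.5 sketch; on the publish side, passing `nats.MsgId(event.ID)` lets JetStream deduplicate retried publishes within its duplicate window:

```go
// subscribeDurable creates a durable push consumer for one subscriber group.
// Messages are redelivered after the ack-wait timeout unless acknowledged,
// so the handler acks only after successful processing.
func subscribeDurable(js nats.JetStreamContext, namespace, subscriberID string, handle func(Event) error) (*nats.Subscription, error) {
	return js.Subscribe(
		"aether.events."+namespace,
		func(msg *nats.Msg) {
			var ev Event
			if err := json.Unmarshal(msg.Data, &ev); err != nil {
				_ = msg.Term() // poison message: never redeliver
				return
			}
			if err := handle(ev); err != nil {
				return // leave unacked so JetStream redelivers
			}
			_ = msg.Ack() // acknowledge only after successful handling
		},
		nats.Durable(subscriberID), // consumer name = subscriber ID
		nats.ManualAck(),
	)
}
```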
Test Cases
- Publish event: all subscribers receive once
- Network failure: redelivery after timeout
- No duplicates on subscriber
- Order preserved across nodes
Dependencies
- Depends on: Issue 4.5 (NATSEventBus)
Issue 4.7: [Event] Publish EventPublished (via NATS)
Type: New Feature Bounded Context: Event Bus (NATS) Priority: P2
Title: Route published events to NATS subjects
User Story
As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.
Acceptance Criteria
- EventPublished event published to NATS
- Subject: "aether.events.{namespace}.published"
- Message contains: eventID, timestamp, sourceNodeID
- Metrics track: events published, delivered, dropped
- Helps identify partition/latency issues
Bounded Context: Event Bus (NATS)
DDD Implementation Guidance
Type: New Feature (Event)
Event: EventPublished (infrastructure)
Subject: aether.events.{namespace}.published
Consumers: Metrics, monitoring
Technical Notes
- Published after NATS publish succeeds
- Separate from local EventPublished (for clarity)
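A small sketch of the monitoring message, continuing the same hypothetical package; field and function names are illustrative:

```go
// publishedNotice is the payload sent to "aether.events.{namespace}.published".
type publishedNotice struct {
	EventID      string    `json:"event_id"`
	Timestamp    time.Time `json:"timestamp"`
	SourceNodeID string    `json:"source_node_id"`
}

// notifyPublished emits an EventPublished notice after the NATS publish of
// the event itself has succeeded.
func notifyPublished(js nats.JetStreamContext, namespace string, ev Event, nodeID string) error {
	data, err := json.Marshal(publishedNotice{
		EventID:      ev.ID,
		Timestamp:    time.Now().UTC(),
		SourceNodeID: nodeID,
	})
	if err != nil {
		return err
	}
	_, err = js.Publish("aether.events."+namespace+".published", data)
	return err
}
```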
Test Cases
- Publish event: EventPublished message on NATS
- Metrics count delivery
- Cross-node visibility works
Dependencies
- Depends on: Issue 4.5 (NATSEventBus)
Issue 4.8: [Read Model] Implement cross-node subscription
Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1
Title: Receive events from other nodes via NATS
User Story
As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.
Acceptance Criteria
- NATSEventBus.Subscribe(namespace) receives local + NATS events
- SubscribeWithFilter works with NATS
- Events from local node: delivered via local EventBus
- Events from remote nodes: delivered via NATS consumer
- Subscriber sees unified stream (no duplication)
Bounded Context: Event Bus (NATS)
DDD Implementation Guidance
Type: New Feature (Query/Subscription)
Read Model: UnifiedEventStream (local + remote)
Implementation:
- Subscribe creates local channel
- NATSEventBus subscribes to NATS subject
- Both feed into subscriber channel
Technical Notes
- Unified view is transparent to subscriber
- No need to know if event is local or remote
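A sketch of the unified subscription, continuing the NATSEventBus sketch from Issue 4.5; `localSubscribe` is an assumed hook into the in-process bus, not an existing method:

```go
// SubscribeUnified returns one channel carrying both locally published
// events and events arriving from other nodes via NATS.
func (b *NATSEventBus) SubscribeUnified(namespace string) (<-chan Event, error) {
	out := make(chan Event, 64)

	// Local path: forward events from the in-process bus.
	go func() {
		for ev := range b.localSubscribe(namespace) {
			out <- ev
		}
	}()

	// Remote path: NATS delivery; drop our own publishes to avoid duplicates.
	_, err := b.js.Subscribe("aether.events."+namespace, func(msg *nats.Msg) {
		var ev Event
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			return
		}
		if ev.SourceNodeID == b.nodeID {
			return // already delivered through the local path
		}
		out <- ev
	})
	if err != nil {
		return nil, err
	}
	return out, nil
}
```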
Test Cases
- Subscribe to namespace: receive local events
- Subscribe to namespace: receive remote events
- Filter works across both sources
- No duplication
Dependencies
- Depends on: Issue 4.5 (NATSEventBus)
Summary
This backlog contains 67 executable issues across 5 bounded contexts organized into 4 implementation phases. Each issue:
- Is decomposed using DDD-informed order (commands → rules → events → reads)
- References domain concepts (aggregates, commands, events, value objects)
- Includes acceptance criteria (testable, specific)
- States dependencies (enabling parallel work)
- Is sized to 1-3 days of work
Recommended Build Order:
- Phase 1 (17 issues): Event Sourcing Foundation - everything depends on this
- Phase 2 (9 issues): Local Event Bus - enables observability before clustering
- Phase 3 (20 issues): Cluster Coordination - enables distributed deployment
- Phase 4 (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery
Total Scope: ~67-200 developer-days of work (conservative estimate: 10-15 dev-weeks for a small team)
Next Steps
- Create Gitea issues from this backlog
- Assign to team members
- Set up dependency tracking in Gitea
- Use /spawn-issues skill to parallelize implementation
- Iterate on acceptance criteria with domain experts
See /issue-writing skill for proper issue formatting in Gitea.