# Aether Executable Backlog
**Built from:** 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition
**Date:** 2026-01-12
---
## Backlog Overview
This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.
**Total Scope:**
- **Capabilities:** 9 (all complete)
- **Features:** 14
- **Issues:** 67
- **Contexts:** 5
- **Implementation Phases:** 4
**Build Order (by value and dependencies):**
1. **Phase 1: Event Sourcing Foundation** (Capabilities 1-3)
- Issues: 17
- Enables all other work
2. **Phase 2: Local Event Bus** (Capability 8)
- Issues: 9
- Enables local pub/sub before clustering
3. **Phase 3: Cluster Coordination** (Capabilities 5-7)
- Issues: 20
- Enables distributed deployment
4. **Phase 4: Namespace & NATS** (Capabilities 4, 9)
- Issues: 21
- Enables multi-tenancy and cross-node delivery
---
## Phase 1: Event Sourcing Foundation
### Feature Set 1a: Event Storage with Version Conflict Detection
**Capability:** Store Events Durably with Conflict Detection
**Description:** Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.
**Success Condition:** Multiple writers attempt to update the same actor; the first wins, the others see a VersionConflictError with details; every accepted write lands in the immutable history.
---
#### Issue 1.1: [Command] Implement SaveEvent with monotonic version validation
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely
**User Story**
As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.
**Acceptance Criteria**
- [ ] SaveEvent accepts event with Version > current for actor
- [ ] SaveEvent rejects event with Version <= current (returns VersionConflictError)
- [ ] VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
- [ ] First event for new actor must have Version > 0 (typically 1)
- [ ] Version gaps are allowed (1, 3, 5 is valid)
- [ ] Validation happens before persistence (fail-fast)
- [ ] InMemoryEventStore and JetStreamEventStore both implement validation
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Core)
**Aggregate:** ActorEventStream (implicit; each actor has independent version sequence)
**Command:** SaveEvent(event)
**Validation Rules:**
- If no events exist for actor: version must be > 0
- If events exist: new version must be > latest version
**Success Event:** EventStored (published when SaveEvent succeeds)
**Error Event:** VersionConflict (triggered when version validation fails)
**Technical Notes**
- Version validation is the core invariant; everything else depends on it
- Use `GetLatestVersion()` to implement validation
- No database-level locks; optimistic validation only
- Conflict should fail in <1ms
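A minimal Go sketch of the validation path for the in-memory store, assuming illustrative type shapes; the canonical `Event`, `VersionConflictError`, and store implementations live in the aether package and may differ:

```go
package sketch

import "fmt"

// Illustrative types; the canonical definitions live in the aether package.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

type VersionConflictError struct {
	ActorID                          string
	AttemptedVersion, CurrentVersion int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

type InMemoryEventStore struct {
	events        map[string][]*Event
	latestVersion map[string]int64
}

// SaveEvent enforces the monotonic-version invariant before persisting.
// Synchronization is omitted for brevity (see Issue 1.5 for the RWMutex notes).
func (s *InMemoryEventStore) SaveEvent(e *Event) error {
	current := s.latestVersion[e.ActorID] // 0 when the actor has no events yet
	if e.Version <= current {
		// Fail fast with full context; no retry loop, no database lock.
		return &VersionConflictError{
			ActorID:          e.ActorID,
			AttemptedVersion: e.Version,
			CurrentVersion:   current,
		}
	}
	s.events[e.ActorID] = append(s.events[e.ActorID], e)
	s.latestVersion[e.ActorID] = e.Version
	return nil
}
```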
**Test Cases**
- New actor, version 1: succeeds
- Same actor, version 2 (after 1): succeeds
- Same actor, version 2 (after 1, concurrent): second call fails
- Same actor, version 1 (duplicate): fails
- Same actor, version 0 or negative: fails
- 100 concurrent writers: 1 succeeds, 99 fail
**Dependencies**
- None (foundation)
---
#### Issue 1.2: [Rule] Enforce append-only and immutability invariants
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Enforce event immutability and append-only semantics
**User Story**
As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.
**Acceptance Criteria**
- [ ] EventStore interface has no Update or Delete methods
- [ ] Events cannot be modified after persistence
- [ ] Replay of same events always produces same state
- [ ] Corrupted events are reported (not silently skipped)
- [ ] JetStream stream configuration prevents deletes (retention policy only)
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Core Invariant)
**Aggregate:** ActorEventStream
**Invariant:** Events are immutable; stream is append-only; no modifications to EventStore interface
**Implementation:**
- Event struct has no Setters (only getters)
- SaveEvent is the only public persistence method
- JetStream streams configured with `NoDelete` policy
**Technical Notes**
- This is enforced at interface level (no Update/Delete in EventStore)
- JetStream configuration prevents accidental deletes
- ReplayError allows visibility into corruption without losing good data
**Test Cases**
- Attempt to modify Event.Data after creation: compile error (if immutable)
- Attempt to call UpdateEvent: interface doesn't exist
- JetStream stream created with correct retention policy
- ReplayError captured when event unmarshaling fails
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent implementation)
---
#### Issue 1.3: [Event] Publish EventStored after successful save
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Emit EventStored event for persistence observability
**User Story**
As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).
**Acceptance Criteria**
- [ ] EventStored event published after SaveEvent succeeds
- [ ] EventStored contains: EventID, ActorID, Version, Timestamp
- [ ] No EventStored published if SaveEvent fails
- [ ] EventBus receives EventStored in same transaction context
- [ ] Metrics increment for each EventStored
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature
**Event:** EventStored(eventID, actorID, version, timestamp)
**Triggered by:** Successful SaveEvent call
**Consumers:** Metrics collectors, projections, audit systems
**Technical Notes**
- EventStored is an internal event (Aether infrastructure)
- Published to local EventBus (see Phase 2 for cross-node)
- Allows observability without coupling application code
**Test Cases**
- Save event → EventStored published
- Version conflict → no EventStored published
- Multiple saves → multiple EventStored events in order
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
- Depends on: Phase 2, Issue 2.1 (EventBus.Publish)
---
#### Issue 1.4: [Event] Publish VersionConflict error with full context
**Type:** New Feature
**Bounded Context:** Event Sourcing, Optimistic Concurrency Control
**Priority:** P0
**Title:** Return detailed version conflict information for retry logic
**User Story**
As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).
**Acceptance Criteria**
- [ ] VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
- [ ] Error message is human-readable with all context
- [ ] errors.Is(err, ErrVersionConflict) returns true for the sentinel check
- [ ] errors.As(err, &versionErr) allows unpacking to VersionConflictError
- [ ] Application can read CurrentVersion to decide retry strategy
**Bounded Context:** Event Sourcing + OCC
**DDD Implementation Guidance**
**Type:** New Feature
**Error Type:** VersionConflictError (wraps ErrVersionConflict sentinel)
**Data:** ActorID, AttemptedVersion, CurrentVersion
**Use:** Application uses this to implement retry strategies
**Technical Notes**
- Already implemented in `/aether/event.go` (VersionConflictError struct)
- Document standard retry patterns in examples/
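A caller-side sketch of the intended error handling, assuming the sentinel/wrapper relationship described above (field names illustrative; the real definitions are in `/aether/event.go`):

```go
package sketch

import (
	"errors"
	"fmt"
)

// Assumed shapes mirroring the acceptance criteria above.
var ErrVersionConflict = errors.New("version conflict")

type VersionConflictError struct {
	ActorID                          string
	AttemptedVersion, CurrentVersion int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// Unwrap lets errors.Is match the sentinel through the wrapper.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

// handleSaveError shows the caller-side pattern: sentinel check, then unpack.
func handleSaveError(err error) {
	if !errors.Is(err, ErrVersionConflict) {
		return // some other failure
	}
	var vce *VersionConflictError
	if errors.As(err, &vce) {
		// CurrentVersion drives the retry decision (reload state, re-apply, retry).
		fmt.Printf("retry from version %d for actor %s\n", vce.CurrentVersion, vce.ActorID)
	}
}
```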
**Test Cases**
- Conflict with detailed error: ActorID, versions present
- Application reads CurrentVersion: succeeds
- errors.Is(err, ErrVersionConflict): true
- errors.As(err, &versionErr): works
- Manual test: log the error, see all context
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
#### Issue 1.5: [Read Model] Implement GetLatestVersion query
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Provide efficient version lookup for optimistic locking
**User Story**
As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.
**Acceptance Criteria**
- [ ] GetLatestVersion(actorID) returns latest version or 0 if no events
- [ ] Execution time is O(1) or O(log n), not O(n)
- [ ] InMemoryEventStore implements with map lookup
- [ ] JetStreamEventStore caches latest version per actor
- [ ] Cache is invalidated after each SaveEvent
- [ ] Multiple calls for same actor within 1s hit cache
- [ ] Namespace isolation: GetLatestVersion scoped to namespace
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** ActorVersionIndex
**Source Events:** SaveEvent (updates cache)
**Data:** ActorID → LatestVersion
**Performance:** O(1) lookup after SaveEvent
**Technical Notes**
- InMemoryEventStore: use map[actorID]int64
- JetStreamEventStore: query JetStream metadata OR maintain cache
- Cache invalidation: update after every SaveEvent
- Thread-safe with RWMutex (read-heavy)
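A minimal sketch of the cached version index behind GetLatestVersion, assuming an in-memory map guarded by an RWMutex (names illustrative):

```go
package sketch

import "sync"

// Illustrative in-memory version index; the real store may differ.
type actorVersionIndex struct {
	mu       sync.RWMutex
	versions map[string]int64 // actorID -> latest version
}

// Get returns the latest version, or 0 when the actor has no events.
func (i *actorVersionIndex) Get(actorID string) int64 {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.versions[actorID] // missing key yields the zero value, 0
}

// Update is called after every successful SaveEvent to keep the cache fresh.
func (i *actorVersionIndex) Update(actorID string, version int64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if version > i.versions[actorID] {
		i.versions[actorID] = version
	}
}
```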
**Test Cases**
- New actor: GetLatestVersion returns 0
- After SaveEvent(version: 1): GetLatestVersion returns 1
- After SaveEvent(version: 3): GetLatestVersion returns 3
- Concurrent reads from same actor: all return consistent value
- Namespace isolation: "tenant-a" and "tenant-b" have independent versions
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
### Feature Set 1b: State Rebuild from Event History
**Capability:** Rebuild State from Event History
**Description:** Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.
**Success Condition:** GetEvents(actorID, 0) returns all events in order; replaying produces consistent state every time; snapshots reduce replay time from O(n) to O(1).
---
#### Issue 1.6: [Command] Implement GetEvents for replay
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Load events from store for state replay
**User Story**
As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.
**Acceptance Criteria**
- [ ] GetEvents(actorID, fromVersion) returns []*Event in version order
- [ ] Events are ordered by version (ascending)
- [ ] fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
- [ ] If no events exist, returns empty slice (not error)
- [ ] If actorID has no events >= fromVersion, returns empty slice
- [ ] Namespace isolation: GetEvents scoped to namespace
- [ ] Large result sets don't cause memory issues (stream if >10k events)
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Command:** GetEvents(actorID, fromVersion)
**Returns:** []*Event ordered by version
**Invariant:** Order is deterministic (version order always)
**Technical Notes**
- InMemoryEventStore: filter and sort by version
- JetStreamEventStore: query JetStream subject and order results
- Consider pagination for very large actor histories
- fromVersion=0 means "start from beginning"
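A sketch of the replay pattern GetEvents enables, assuming an illustrative EventStore signature and a hypothetical OrderState aggregate:

```go
package sketch

// Illustrative shapes; the real EventStore interface lives in the aether package.
type Event struct {
	ActorID string
	Version int64
	Type    string
	Data    []byte
}

type EventStore interface {
	GetEvents(actorID string, fromVersion int64) ([]*Event, error)
}

// OrderState is a hypothetical application aggregate rebuilt by replay.
type OrderState struct {
	Version int64
	Placed  bool
}

// rebuild replays events from the beginning (fromVersion = 0 means start of history).
func rebuild(store EventStore, actorID string) (*OrderState, error) {
	events, err := store.GetEvents(actorID, 0)
	if err != nil {
		return nil, err
	}
	state := &OrderState{}
	for _, e := range events { // returned in ascending version order
		switch e.Type {
		case "OrderPlaced":
			state.Placed = true
		}
		state.Version = e.Version
	}
	return state, nil
}
```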
**Test Cases**
- GetEvents(actorID, 0) with 5 events: returns all 5 in order
- GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
- GetEvents(nonexistent, 0): returns empty slice
- GetEvents with gap (versions 1, 3, 5): returns only those 3
- Order is guaranteed (version order, not insertion order)
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
#### Issue 1.7: [Rule] Define and enforce snapshot validity
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Implement snapshot invalidation policy
**User Story**
As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.
**Acceptance Criteria**
- [ ] Snapshot valid until Version + MaxVersionGap (default 1000)
- [ ] GetLatestSnapshot returns nil if no snapshot or invalid
- [ ] Application can override MaxVersionGap in config
- [ ] Snapshot timestamp recorded for debugging
- [ ] No automatic cleanup; application calls SaveSnapshot to create
- [ ] Tests confirm snapshot invalidation logic
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Policy)
**Aggregate:** ActorSnapshot + SnapshotPolicy
**Policy:** Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap
**Implementation:**
- SnapshotStore.GetLatestSnapshot validates before returning
- If invalid, returns nil; application must replay
**Technical Notes**
- This is a safety policy; prevents stale snapshots
- Application owns decision to create snapshots (no auto-triggering)
- MaxVersionGap is tunable per deployment
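A sketch of the validity check, assuming illustrative snapshot fields; GetLatestSnapshot would apply this before returning:

```go
package sketch

// Illustrative snapshot shape; the real ActorSnapshot lives in event.go.
type ActorSnapshot struct {
	ActorID string
	Version int64
	State   []byte
}

// snapshotValid applies the policy: a snapshot is usable only while the
// version gap stays within maxVersionGap.
// Example: snapshot at 10, maxVersionGap 100 -> valid at current 110, invalid at 111.
func snapshotValid(snap *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	if snap == nil {
		return false
	}
	return currentVersion-snap.Version <= maxVersionGap
}
```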
**Test Cases**
- Snapshot at version 10, MaxGap=100, current=50: valid
- Snapshot at version 10, MaxGap=100, current=111: invalid
- Snapshot at version 10, MaxGap=100, current=110: valid
- GetLatestSnapshot returns nil for invalid snapshot
**Dependencies**
- Depends on: Issue 1.6 (GetEvents)
---
#### Issue 1.8: [Event] Publish SnapshotCreated for observability
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Emit snapshot creation event for lifecycle tracking
**User Story**
As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.
**Acceptance Criteria**
- [ ] SnapshotCreated event published after SaveSnapshot succeeds
- [ ] Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
- [ ] Metrics increment for snapshot creation
- [ ] No event if SaveSnapshot fails
- [ ] Example: Snapshot created every 1000 versions
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** SnapshotCreated(actorID, version, timestamp, replayDurationMs)
**Triggered by:** SaveSnapshot call succeeds
**Consumers:** Metrics, monitoring dashboards
**Technical Notes**
- SnapshotCreated is infrastructure event (like EventStored)
- ReplayDuration helps identify slow actors needing snapshots more frequently
**Test Cases**
- SaveSnapshot succeeds → SnapshotCreated published
- SaveSnapshot fails → no event published
- ReplayDuration recorded accurately
**Dependencies**
- Depends on: Issue 1.7 (SnapshotStore interface)
---
#### Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P1
**Title:** Handle corrupted events during replay without data loss
**User Story**
As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.
**Acceptance Criteria**
- [ ] GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
- [ ] ReplayResult contains: []*Event (good) and []ReplayError (bad)
- [ ] Good events are returned in order despite errors
- [ ] ReplayError contains: SequenceNumber, RawData, UnmarshalError
- [ ] Application decides how to handle corrupted events
- [ ] Metrics track corruption frequency
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Interface:** EventStoreWithErrors extends EventStore
**Method:** GetEventsWithErrors(actorID, fromVersion) → ReplayResult
**Data:**
- ReplayResult.Events: successfully deserialized events
- ReplayResult.Errors: corruption records
- ReplayResult.HasErrors(): convenience check
**Technical Notes**
- Already defined in event.go (ReplayError, ReplayResult)
- JetStreamEventStore should implement EventStoreWithErrors
- Application uses HasErrors() to decide on recovery action
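A caller-side sketch of consuming a ReplayResult, assuming shapes that mirror the description above (the canonical types live in event.go):

```go
package sketch

import "log"

// Assumed shapes mirroring ReplayResult / ReplayError described above.
type Event struct {
	ActorID string
	Version int64
}

type ReplayError struct {
	SequenceNumber uint64
	RawData        []byte
	UnmarshalError error
}

type ReplayResult struct {
	Events []*Event
	Errors []ReplayError
}

func (r *ReplayResult) HasErrors() bool { return len(r.Errors) > 0 }

// processReplay shows the intended pattern: use the good events, surface the
// corrupt ones for forensics instead of aborting the replay.
func processReplay(res *ReplayResult, apply func(*Event)) {
	for _, e := range res.Events {
		apply(e) // apply good events to state as usual
	}
	if res.HasErrors() {
		for _, re := range res.Errors {
			log.Printf("corrupt event at seq %d: %v (raw %d bytes)",
				re.SequenceNumber, re.UnmarshalError, len(re.RawData))
		}
	}
}
```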
**Test Cases**
- All good events: ReplayResult.Events populated, no errors
- Corrupted event in middle: good events before/after, one error recorded
- Multiple corruptions: all recorded with context
- Application can inspect RawData for forensics
**Dependencies**
- Depends on: Issue 1.6 (GetEvents)
---
#### Issue 1.10: [Interface] Implement SnapshotStore interface
**Type:** New Feature
**Bounded Context:** Event Sourcing
**Priority:** P0
**Title:** Define snapshot storage contract
**User Story**
As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).
**Acceptance Criteria**
- [ ] SnapshotStore extends EventStore
- [ ] GetLatestSnapshot(actorID) returns ActorSnapshot or nil
- [ ] SaveSnapshot(snapshot) persists snapshot
- [ ] ActorSnapshot contains: ActorID, Version, State, Timestamp
- [ ] Namespace isolation: snapshots scoped to namespace
- [ ] Tests verify interface contract
**Bounded Context:** Event Sourcing
**DDD Implementation Guidance**
**Type:** New Feature (Interface)
**Interface:** SnapshotStore extends EventStore
**Methods:**
- GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
- SaveSnapshot(snapshot) → error
**Aggregates:** ActorSnapshot (value object)
**Technical Notes**
- Already defined in event.go
- Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
- Keep snapshots in same store as events (co-located)
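An approximate mirror of the contract plus one caller pattern (snapshot first, then replay the tail); the signatures here are assumptions, and event.go remains the source of truth:

```go
package sketch

import "time"

// Approximate shapes; event.go is the source of truth.
type Event struct {
	ActorID string
	Version int64
}

type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

type EventStore interface {
	SaveEvent(e *Event) error
	GetEvents(actorID string, fromVersion int64) ([]*Event, error)
	GetLatestVersion(actorID string) (int64, error)
}

type SnapshotStore interface {
	EventStore
	// GetLatestSnapshot returns nil (no error) when no valid snapshot exists.
	GetLatestSnapshot(actorID string) (*ActorSnapshot, error)
	SaveSnapshot(snapshot *ActorSnapshot) error
}

// loadState prefers a snapshot and replays only the events after it.
func loadState(store SnapshotStore, actorID string) (*ActorSnapshot, []*Event, error) {
	snap, err := store.GetLatestSnapshot(actorID)
	if err != nil {
		return nil, nil, err
	}
	from := int64(0)
	if snap != nil {
		from = snap.Version + 1 // replay only the tail beyond the snapshot
	}
	tail, err := store.GetEvents(actorID, from)
	return snap, tail, err
}
```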
**Test Cases**
- SaveSnapshot persists; GetLatestSnapshot retrieves it
- New actor: GetLatestSnapshot returns nil
- Multiple snapshots: only latest returned
- Namespace isolation: snapshots from tenant-a don't appear in tenant-b
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent + storage foundation)
---
### Feature Set 1c: Optimistic Concurrency Control
**Capability:** Enable Safe Concurrent Writes
**Description:** Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.
**Success Condition:** Two concurrent writers race; one succeeds, other sees VersionConflictError; application retries without locks.
---
#### Issue 1.11: [Rule] Enforce fail-fast on version conflict
**Type:** New Feature
**Bounded Context:** Optimistic Concurrency Control
**Priority:** P0
**Title:** Fail immediately on version conflict; no auto-retry
**User Story**
As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).
**Acceptance Criteria**
- [ ] SaveEvent returns VersionConflictError immediately on mismatch
- [ ] No built-in retry loop in SaveEvent
- [ ] No database-level locks held
- [ ] Application reads VersionConflictError and decides retry
- [ ] Default retry strategy documented (examples/)
**Bounded Context:** Optimistic Concurrency Control
**DDD Implementation Guidance**
**Type:** New Feature (Policy)
**Invariant:** Conflicts trigger immediate failure; application owns retry
**Implementation:**
- SaveEvent: version check, return error if mismatch, done
- No loop, no backoff, no retries
- Clean error with context for caller
**Technical Notes**
- This is a design choice: fail-fast enables flexible retry strategies
- Application can choose exponential backoff, jitter, circuit-breaker, etc.
**Test Cases**
- SaveEvent(version: 2) when current=2: fails immediately
- No retry attempted by library
- Application can retry if desired
- Example patterns in examples/retry.go
**Dependencies**
- Depends on: Issue 1.1 (SaveEvent)
---
#### Issue 1.12: [Documentation] Document concurrent write patterns
**Type:** New Feature
**Bounded Context:** Optimistic Concurrency Control
**Priority:** P1
**Title:** Provide retry strategy examples (backoff, circuit-breaker, queue)
**User Story**
As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.
**Acceptance Criteria**
- [ ] examples/retry_exponential_backoff.go
- [ ] examples/retry_circuit_breaker.go
- [ ] examples/retry_queue_based.go
- [ ] examples/concurrent_write_test.go showing patterns
- [ ] README mentions OCC patterns
- [ ] Each example is >100 lines with explanation
**Bounded Context:** Optimistic Concurrency Control
**DDD Implementation Guidance**
**Type:** Documentation
**Artifacts:**
- examples/retry_exponential_backoff.go
- examples/retry_circuit_breaker.go
- examples/retry_queue_based.go
- examples/concurrent_write_test.go
**Content:**
- How to read VersionConflictError
- When to retry (idempotent operations)
- When not to retry (non-idempotent)
- Backoff strategies
- Monitoring
**Technical Notes**
- Real, runnable code (not pseudocode)
- Show metrics collection
- Show when to give up
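A compact sketch of the exponential-backoff pattern these examples should demonstrate, assuming the ErrVersionConflict sentinel and an idempotent attempt function supplied by the application:

```go
package sketch

import (
	"errors"
	"math/rand"
	"time"
)

// Assumed sentinel; the real one lives in the aether package.
var ErrVersionConflict = errors.New("version conflict")

// saveWithBackoff retries an idempotent save on version conflict with
// exponential backoff plus jitter, giving up after maxAttempts.
// attempt() is expected to reload current state, re-apply the command,
// and call SaveEvent with a fresh version.
func saveWithBackoff(attempt func() error, maxAttempts int) error {
	delay := 10 * time.Millisecond
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = attempt()
		if err == nil || !errors.Is(err, ErrVersionConflict) {
			return err // success, or a non-retryable failure
		}
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)))) // full jitter
		delay *= 2
	}
	return err // still conflicting after maxAttempts; caller decides what next
}
```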
**Test Cases**
- Examples compile without error
- Examples use idempotent operations
- Test coverage for examples
**Dependencies**
- Depends on: Issue 1.11 (fail-fast behavior)
---
## Phase 2: Local Event Bus
### Feature Set 2a: Event Routing and Filtering
**Capability:** Route and Filter Domain Events
**Description:** Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.
**Success Condition:** Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.
---
#### Issue 2.1: [Command] Implement Publish to local subscribers
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Publish events to local subscribers
**User Story**
As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.
**Acceptance Criteria**
- [ ] Publish(namespaceID, event) sends to all subscribers of that namespace
- [ ] Exact subscribers (namespace="orders") receive event
- [ ] Wildcard subscribers (namespace="order*") receive matching events
- [ ] Events delivered in-process (no NATS yet)
- [ ] Buffered channels (100-event buffer) prevent blocking
- [ ] Events to subscribers with full buffers are dropped non-blocking (no deadlock)
- [ ] Metrics track publish count, receive count, dropped count
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** Publish(namespaceID, event)
**Invariant:** All subscribers matching namespace receive event
**Implementation:**
- Iterate exact subscribers for namespace
- Iterate wildcard subscribers matching pattern
- Deliver to each (non-blocking, buffered)
- Count drops
**Technical Notes**
- EventBus in eventbus.go already implements this
- Ensure buffered channels don't cause memory leaks
- Metrics important for observability
**Test Cases**
- Publish to "orders": exact subscriber of "orders" receives
- Publish to "orders.new": wildcard subscriber of "order*" receives
- Publish to "payments": subscriber to "orders" does NOT receive
- Subscriber with full buffer: event dropped (non-blocking)
- 1000 publishes: metrics accurate
**Dependencies**
- Depends on: Issue 2.2 (Subscribe)
---
#### Issue 2.2: [Command] Implement Subscribe with optional filter
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Register subscriber with optional event filter
**User Story**
As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.
**Acceptance Criteria**
- [ ] Subscribe(namespacePattern) returns <-chan *Event
- [ ] SubscribeWithFilter(namespacePattern, filter) returns filtered channel
- [ ] Filter supports EventTypes ([]string) and ActorPattern (string)
- [ ] Filters applied client-side (subscriber decides)
- [ ] Wildcard patterns work: "*" matches single token, ">" matches multiple
- [ ] Subscription channel is buffered (100 events)
- [ ] Unsubscribe(namespacePattern, ch) removes subscription
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)
**Invariants:**
- Namespace pattern determines which namespaces
- Filter determines which events within namespace
- Both work together (AND logic)
**Filter Types:**
- EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
- ActorPattern: string (e.g., "order-customer-*")
**Technical Notes**
- Pattern matching follows NATS conventions
- Filters are optional (nil filter = all events)
- Client-side filtering is sufficient for the local bus; NATS adds server-side subject filtering when cross-node delivery lands (Phase 4)
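A usage sketch under assumed API shapes (the real signatures live in eventbus.go); it shows the AND composition of namespace pattern and filter:

```go
package sketch

// Assumed shapes based on the criteria above; eventbus.go is the source of truth.
type Event struct {
	ActorID string
	Type    string
}

type EventFilter struct {
	EventTypes   []string // e.g. []string{"OrderPlaced", "OrderShipped"}
	ActorPattern string   // e.g. "order-customer-*"
}

type EventBus interface {
	Subscribe(namespacePattern string) <-chan *Event
	SubscribeWithFilter(namespacePattern string, filter *EventFilter) <-chan *Event
	Unsubscribe(namespacePattern string, ch <-chan *Event)
}

// consumeOrderEvents receives only OrderPlaced events from the "orders"
// namespace; the pattern selects namespaces, the filter selects events.
func consumeOrderEvents(bus EventBus, done <-chan struct{}) {
	ch := bus.SubscribeWithFilter("orders", &EventFilter{EventTypes: []string{"OrderPlaced"}})
	defer bus.Unsubscribe("orders", ch)
	for {
		select {
		case e := <-ch:
			_ = e // handle the event
		case <-done:
			return
		}
	}
}
```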
**Test Cases**
- Subscribe("orders"): exact match only
- Subscribe("order*"): wildcard match
- Subscribe("order.*"): NATS-style wildcard
- SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
- SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
- Unsubscribe closes channel
**Dependencies**
- Depends on: Issue 1.1 (events structure)
---
#### Issue 2.3: [Rule] Enforce exact subscription isolation
**Type:** New Feature
**Bounded Context:** Event Bus + Namespace Isolation
**Priority:** P1
**Title:** Guarantee exact namespace subscriptions are isolated
**User Story**
As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.
**Acceptance Criteria**
- [ ] Subscriber to "tenant-a" receives events from "tenant-a" only
- [ ] Subscriber to "tenant-a" does NOT receive from "tenant-b"
- [ ] Wildcard subscriber to "tenant*" receives from both
- [ ] Exact match subscribers are isolated from wildcard
- [ ] Tests verify isolation with multi-namespace setup
- [ ] Documentation warns about wildcard security implications
**Bounded Context:** Event Bus + Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Policy/Invariant)
**Invariant:** Exact subscriptions are isolated
**Implementation:**
- exactSubscribers map[namespace][]*subscription
- Wildcard subscriptions separate collection
- Publish checks exact first, then wildcard patterns
**Security Note:** Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)
**Technical Notes**
- Enforced at EventBus.Publish level
- Exact match is simple string equality
- Wildcard uses MatchNamespacePattern helper
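A sketch of NATS-style token matching ("*" matches one token, ">" matches the remaining tokens), assuming "."-separated namespaces; the real MatchNamespacePattern helper may additionally support prefix globs such as "order*":

```go
package sketch

import "strings"

// matchNamespacePattern compares a subscription pattern to a namespace,
// token by token, following NATS conventions.
func matchNamespacePattern(pattern, namespace string) bool {
	pt := strings.Split(pattern, ".")
	nt := strings.Split(namespace, ".")
	for i, p := range pt {
		if p == ">" {
			return true // ">" swallows the remaining tokens
		}
		if i >= len(nt) {
			return false
		}
		if p != "*" && p != nt[i] {
			return false
		}
	}
	return len(pt) == len(nt)
}
```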
**Test Cases**
- Publish to "tenant-a": only "tenant-a" exact subscribers get it
- Publish to "tenant-b": only "tenant-b" exact subscribers get it
- Publish to "tenant-a": "tenant*" wildcard subscriber gets it
- Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it
**Dependencies**
- Depends on: Issue 2.2 (Subscribe)
---
#### Issue 2.4: [Rule] Document wildcard subscription security
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Document that wildcard subscriptions bypass isolation
**User Story**
As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.
**Acceptance Criteria**
- [ ] eventbus.go comments explain wildcard behavior
- [ ] Security warning in Subscribe godoc
- [ ] Example showing wildcard usage for logging
- [ ] Example showing why wildcard is dangerous (if not restricted)
- [ ] README mentions namespace isolation caveats
- [ ] Examples show proper patterns (monitoring, auditing)
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Wildcard subscriptions receive all matching events
- Use for cross-cutting concerns (logging, monitoring, audit)
- Restrict access to trusted components
- Never expose wildcard pattern to untrusted users
**Examples:**
- Monitoring system subscribes to ">"
- Audit system subscribes to "tenant-*"
- Application logic uses exact subscriptions only
**Technical Notes**
- Intentional design; not a bug
- Different from NATS server-side filtering
**Test Cases**
- Examples compile
- Documentation is clear and accurate
**Dependencies**
- Depends on: Issue 2.3 (exact isolation)
---
#### Issue 2.5: [Event] Publish SubscriptionCreated for tracking
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Track subscription lifecycle
**User Story**
As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.
**Acceptance Criteria**
- [ ] SubscriptionCreated event published on Subscribe
- [ ] SubscriptionDestroyed event published on Unsubscribe
- [ ] Event contains: namespacePattern, filterCriteria, timestamp
- [ ] Metrics increment on subscribe/unsubscribe
- [ ] SubscriberCount(namespace) returns current count
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** SubscriptionCreated(namespacePattern, filter, timestamp)
**Event:** SubscriptionDestroyed(namespacePattern, timestamp)
**Metrics:** Subscriber count per namespace
**Technical Notes**
- SubscriberCount already in eventbus.go
- Add events to EventBus.Subscribe and EventBus.Unsubscribe
- Internal events (infrastructure)
**Test Cases**
- Subscribe → metrics increment
- Unsubscribe → metrics decrement
- SubscriberCount correct
**Dependencies**
- Depends on: Issue 2.2 (Subscribe/Unsubscribe)
---
#### Issue 2.6: [Event] Publish EventPublished for delivery tracking
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Record event publication metrics
**User Story**
As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.
**Acceptance Criteria**
- [ ] EventPublished event published on Publish
- [ ] Metrics track: published count, delivered count, dropped count per namespace
- [ ] Dropped events (full channel) recorded
- [ ] Application can query metrics via Metrics()
- [ ] Example: 1000 events published, 995 delivered, 5 dropped
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Event/Metrics)
**Event:** EventPublished (infrastructure event)
**Metrics:**
- PublishCount[namespace]
- DeliveryCount[namespace]
- DroppedCount[namespace]
**Implementation:**
- RecordPublish(namespace)
- RecordReceive(namespace)
- RecordDroppedEvent(namespace)
**Technical Notes**
- Metrics already in DefaultMetricsCollector
- RecordDroppedEvent signals subscriber backpressure
- Can be used to auto-scale subscribers
**Test Cases**
- Publish 100 events: metrics show 100 published
- All delivered: metrics show 100 delivered
- Full subscriber: next event dropped, metrics show 1 dropped
- Query via bus.Metrics(): values accurate
**Dependencies**
- Depends on: Issue 2.1 (Publish)
---
#### Issue 2.7: [Read Model] Implement GetSubscriptions query
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Query active subscriptions for operational visibility
**User Story**
As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.
**Acceptance Criteria**
- [ ] GetSubscriptions() returns []SubscriptionInfo
- [ ] SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
- [ ] Works for both exact and wildcard subscriptions
- [ ] Metrics accessible via SubscriberCount(namespace)
- [ ] Example: "What subscriptions are listening to 'orders'?"
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** SubscriptionRegistry
**Data:**
- Pattern: namespace pattern (e.g., "tenant-*")
- Filter: optional filter criteria
- SubscriberID: unique ID for each subscription
- CreatedAt: timestamp
**Implementation:**
- Track subscriptions in eventbus.go
- Expose via GetSubscriptions() method
**Technical Notes**
- Useful for debugging
- Optional feature; not critical
**Test Cases**
- Subscribe to "orders": GetSubscriptions shows it
- Subscribe to "order*": GetSubscriptions shows it
- Unsubscribe: GetSubscriptions removes it
- Multiple subscribers: all listed
**Dependencies**
- Depends on: Issue 2.2 (Subscribe)
---
### Feature Set 2b: Buffering and Backpressure
**Capability:** Route and Filter Domain Events (non-blocking delivery)
**Description:** Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).
**Success Condition:** Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.
---
#### Issue 2.8: [Rule] Implement non-blocking event delivery
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P1
**Title:** Ensure event publication never blocks
**User Story**
As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.
**Acceptance Criteria**
- [ ] Publish(namespace, event) returns immediately
- [ ] If subscriber channel full, event dropped (non-blocking)
- [ ] Dropped events counted in metrics
- [ ] Buffered channel size is 100 (tunable)
- [ ] Publisher never waits for subscriber
- [ ] Metrics alert on high drop rate
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** New Feature (Policy)
**Invariant:** Publishers not blocked by slow subscribers
**Implementation:**
- select { case ch <- event: ... default: ... }
- Count drops in default case
**Trade-off:**
- Pro: Publisher never blocks
- Con: Events may be lost if subscriber can't keep up
- Mitigation: Metrics alert on drops; subscriber can increase buffer or retry
**Technical Notes**
- Already implemented in eventbus.go (deliverToSubscriber)
- 100-event buffer is reasonable default
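A minimal sketch of the non-blocking send, assuming the metrics hooks named in Issue 2.6; the real helper is deliverToSubscriber in eventbus.go:

```go
package sketch

// Illustrative delivery helper; eventbus.go is the source of truth.
type Event struct{ Type string }

type metrics interface {
	RecordReceive(namespace string)
	RecordDroppedEvent(namespace string)
}

// deliver never blocks the publisher: if the subscriber's buffered channel
// is full, the event is dropped and counted instead of queued.
func deliver(ch chan<- *Event, e *Event, namespace string, m metrics) {
	select {
	case ch <- e:
		m.RecordReceive(namespace)
	default:
		m.RecordDroppedEvent(namespace) // backpressure signal; alert on high drop rate
	}
}
```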
**Test Cases**
- Subscribe, receive 100 events: no drops
- Publish 101st event immediately: dropped
- Metrics show drop count
- Publisher latency < 1ms regardless of subscribers
**Dependencies**
- Depends on: Issue 2.1 (Publish)
---
#### Issue 2.9: [Documentation] Document EventBus backpressure handling
**Type:** New Feature
**Bounded Context:** Event Bus
**Priority:** P2
**Title:** Explain buffer management and recovery from drops
**User Story**
As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.
**Acceptance Criteria**
- [ ] Document buffer size (100 events default)
- [ ] Explain what happens on overflow (event dropped)
- [ ] Document recovery patterns (subscriber restarts, re-syncs)
- [ ] Example: Subscriber catches up from JetStream after restart
- [ ] Metrics to monitor (drop rate)
- [ ] README section on backpressure
**Bounded Context:** Event Bus
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Buffer size and behavior
- Drop semantics
- Recovery patterns
- Metrics to monitor
- When to increase buffer size
**Examples:**
- Slow subscriber: increase buffer or fix handler
- Network latency: events may be dropped
- Handler panics: subscriber must restart and re-sync
**Technical Notes**
- Events are lost if dropped; only durable via JetStream
- Phase 4 (NATS event delivery) addresses durability
**Test Cases**
- Documentation is clear
- Examples work
**Dependencies**
- Depends on: Issue 2.8 (non-blocking delivery)
---
## Phase 3: Cluster Coordination
### Feature Set 3a: Cluster Topology and Leadership
**Capability:** Coordinate Cluster Topology
**Description:** Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.
**Success Condition:** Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.
---
#### Issue 3.1: [Command] Implement JoinCluster protocol
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Enable node discovery via cluster join
**User Story**
As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.
**Acceptance Criteria**
- [ ] JoinCluster() announces node via NATS
- [ ] Node info contains: NodeID, Address, Timestamp, Status
- [ ] Other nodes receive join announcement
- [ ] Cluster topology updated atomically
- [ ] Rejoining node detected and updated
- [ ] Tests verify multi-node discovery
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** JoinCluster()
**Aggregates:** Cluster (group of nodes)
**Events:** NodeJoined(nodeID, address, timestamp)
**Technical Notes**
- NATS subject: "aether.cluster.nodes"
- NodeDiscovery subscribes to announcements
- ClusterManager.Start() initiates join
**Test Cases**
- Single node joins: topology = [node-a]
- Second node joins: topology = [node-a, node-b]
- Third node joins: topology = [node-a, node-b, node-c]
- Node rejoins: updates existing entry
**Dependencies**
- None (first cluster feature)
---
#### Issue 3.2: [Command] Implement LeaderElection
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Elect single leader via NATS-based voting
**User Story**
As a cluster, I want one node to be elected leader so that it can coordinate shard assignments and rebalancing.
**Acceptance Criteria**
- [ ] LeaderElection holds election every HeartbeatInterval (5s)
- [ ] Nodes announce their own candidacy (no ballot counting; the first announcement wins)
- [ ] One leader elected per term
- [ ] Leader holds lease (TTL = 2 * HeartbeatInterval)
- [ ] All nodes converge on same leader
- [ ] Lease renewal happens automatically
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** ElectLeader()
**Aggregates:** LeadershipLease (time-bound authority)
**Events:** LeaderElected(leaderID, term, leaseExpiration)
**Technical Notes**
- NATS subject: "aether.cluster.election"
- Each node publishes heartbeat with NodeID, Timestamp
- First node to publish becomes leader
- Lease expires if no heartbeat for TTL
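A sketch of the lease bookkeeping only (the NATS publish/subscribe wiring is omitted); interval values follow the criteria above, other names are illustrative:

```go
package sketch

import "time"

// Illustrative lease bookkeeping; the real election logic is NATS-driven.
type leadershipLease struct {
	LeaderID      string
	Term          uint64
	LastHeartbeat time.Time
}

const heartbeatInterval = 5 * time.Second
const leaseTTL = 2 * heartbeatInterval

// expired reports whether the lease has lapsed and a new election may begin.
func (l *leadershipLease) expired(now time.Time) bool {
	return now.Sub(l.LastHeartbeat) > leaseTTL
}

// renew is called when the current leader's heartbeat arrives.
func (l *leadershipLease) renew(leaderID string, now time.Time) {
	if leaderID == l.LeaderID {
		l.LastHeartbeat = now
	}
}
```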
**Test Cases**
- Single node: elected immediately
- Three nodes: exactly one elected
- Leader dies: remaining nodes elect new leader within 2*interval
- Former leader rejoins: may or may not regain leadership
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.3: [Rule] Enforce single leader invariant
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee exactly one leader at any time
**User Story**
As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.
**Acceptance Criteria**
- [ ] At most one leader at any time (lease-based)
- [ ] If leader lease expires, no leader until re-election
- [ ] All nodes see same leader (or none)
- [ ] Tests verify invariant under various failure scenarios
- [ ] Split-brain window bounded by lease TTL (a second leader is possible only until the stale lease expires)
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** At most one leader (enforced by lease TTL)
**Mechanism:**
- Leader publishes heartbeat every HeartbeatInterval
- Other nodes trust leader if heartbeat < HeartbeatInterval old
- If no heartbeat for 2*HeartbeatInterval, lease expired
- New election begins
**Technical Notes**
- Lease-based; not consensus-based (simpler)
- Allows temporary split-brain until lease expires
- Acceptable for Aether (eventual consistency)
**Test Cases**
- Simulate leader death: lease expires, new leader elected
- Simulate network partition: partition may have >1 leader until lease expires
- Verify no coordination during lease expiration
**Dependencies**
- Depends on: Issue 3.2 (leader election)
---
#### Issue 3.4: [Event] Publish LeaderElected on election
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Record leadership election outcomes
**User Story**
As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.
**Acceptance Criteria**
- [ ] LeaderElected event published after successful election
- [ ] Event contains: LeaderID, Term, LeaseExpiration, Timestamp
- [ ] Metrics increment on election
- [ ] Helpful for debugging split-brain scenarios
- [ ] Track election frequency (ideally < 1 per minute)
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** LeaderElected(leaderID, term, leaseExpiration, timestamp)
**Triggered by:** Successful election
**Consumers:** Metrics, audit logs
**Technical Notes**
- Event published locally to all observers
- Infrastructure event (not domain event)
**Test Cases**
- Election happens: event published
- Term increments: event reflects new term
- Metrics accurate
**Dependencies**
- Depends on: Issue 3.2 (election)
---
#### Issue 3.5: [Event] Publish LeadershipLost on lease expiration
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track leadership transitions
**User Story**
As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.
**Acceptance Criteria**
- [ ] LeadershipLost event published when lease expires
- [ ] Event contains: PreviousLeaderID, Timestamp, Reason
- [ ] Metrics track leadership transitions
- [ ] Helpful for debugging cascading failures
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** LeadershipLost(previousLeaderID, timestamp, reason)
**Reason:** "lease_expired", "node_failed", etc.
**Technical Notes**
- Published when lease TTL expires
- Useful for observability
**Test Cases**
- Leader lease expires: LeadershipLost published
- Metrics show transition
**Dependencies**
- Depends on: Issue 3.2 (election)
---
#### Issue 3.6: [Read Model] Implement GetClusterTopology query
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Query current cluster members and status
**User Story**
As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.
**Acceptance Criteria**
- [ ] GetNodes() returns map[nodeID]*NodeInfo
- [ ] NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
- [ ] Status is: Active, Degraded, Failed
- [ ] LastSeen is accurate heartbeat timestamp
- [ ] ShardIDs show shard ownership (filled in Phase 3b)
- [ ] Example: "node-a is active; node-b failed 30s ago"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** ClusterTopology
**Data:**
- NodeID → NodeInfo (status, heartbeat, shards)
- LeaderID (current leader)
- Term (election term)
**Technical Notes**
- ClusterManager maintains topology in-memory
- Update on each heartbeat/announcement
**Test Cases**
- GetNodes() returns active nodes
- Status accurate (Active, Failed, etc.)
- LastSeen updates on heartbeat
- Rejoining node updates existing entry
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.7: [Read Model] Implement GetLeader query
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Query current leader
**User Story**
As a client, I want to know who the leader is, so that I can route coordination requests to the right node.
**Acceptance Criteria**
- [ ] GetLeader() returns current leader NodeID or ""
- [ ] IsLeader() returns true if this node is leader
- [ ] Both consistent with LeaderElection state
- [ ] Updated immediately on election
- [ ] Example: "node-b is leader (term 5)"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** LeadershipRegistry
**Data:** CurrentLeader, CurrentTerm, LeaseExpiration
**Implementation:**
- LeaderElection maintains this
- ClusterManager queries it
**Technical Notes**
- Critical for routing coordination work
- Must be consistent across cluster
**Test Cases**
- No leader: GetLeader returns ""
- Leader elected: GetLeader returns leader ID
- IsLeader true on leader, false on others
- Changes on re-election
**Dependencies**
- Depends on: Issue 3.2 (election)
---
### Feature Set 3b: Shard Distribution
**Capability:** Distribute Actors Across Cluster Nodes
**Description:** Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.
**Success Condition:** 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.
---
#### Issue 3.8: [Command] Implement consistent hash ring
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Distribute shards across nodes with minimal reshuffling
**User Story**
As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.
**Acceptance Criteria**
- [ ] ConsistentHashRing(numShards=1024) creates ring
- [ ] GetShard(actorID) returns consistent shard [0, 1024)
- [ ] AddNode(nodeID) rebalances ~numShards/numNodes shards
- [ ] RemoveNode(nodeID) rebalances shards evenly
- [ ] Same actor always maps to same shard
- [ ] Reshuffling < 40% on node add/remove
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** AssignShards(nodes)
**Aggregates:** ConsistentHashRing (distribution algorithm)
**Invariants:**
- Each shard [0, 1024) assigned to exactly one node
- ActorID hashes consistently to shard
- Topology changes minimize reassignment
**Technical Notes**
- hashring.go already implements this
- Use crypto/md5 or compatible hash
- 1024 shards is tunable (P1 default)
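A sketch of the actor-to-shard mapping, assuming MD5 as mentioned above; hashring.go may use a different but compatible hash, and the shard-to-node step goes through the ring:

```go
package sketch

import (
	"crypto/md5"
	"encoding/binary"
)

const numShards = 1024

// GetShard maps an actor ID to a stable shard in [0, numShards).
// The hash only needs to be deterministic and well distributed.
func GetShard(actorID string) uint32 {
	sum := md5.Sum([]byte(actorID))
	return binary.BigEndian.Uint32(sum[:4]) % numShards
}
```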
**Test Cases**
- Single node: all shards assigned to it
- Two nodes: ~512 shards each
- Three nodes: ~341 shards each
- Add fourth node: ~256 shards each (~25% reshuffled)
- Remove node: remaining nodes rebalance evenly
- Same actor-id always hashes to same shard
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.9: [Rule] Enforce single shard owner invariant
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee each shard has exactly one owner
**User Story**
As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.
**Acceptance Criteria**
- [ ] ShardMap tracks shard → nodeID assignment
- [ ] No shard is unassigned (every shard has owner)
- [ ] No shard assigned to multiple nodes
- [ ] Reassignment is atomic (no in-between state)
- [ ] Tests verify invariant after topology changes
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** Each shard [0, 1024) assigned to exactly one active node
**Mechanism:**
- ShardMap[shardID] = [nodeID]
- Maintained by leader
- Updated atomically on rebalancing
**Technical Notes**
- shard.go implements ShardManager
- Validated after each rebalancing
**Test Cases**
- After rebalancing: all shards assigned
- No orphaned shards
- No multiply-assigned shards
- Reassignment is atomic
**Dependencies**
- Depends on: Issue 3.8 (consistent hashing)
---
#### Issue 3.10: [Event] Publish ShardAssigned on assignment
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track shard-to-node assignments
**User Story**
As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.
**Acceptance Criteria**
- [ ] ShardAssigned event published after assignment
- [ ] Event contains: ShardID, NodeID, Timestamp
- [ ] Metrics track: shards per node, rebalancing frequency
- [ ] Example: Shard 42 assigned to node-b
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** ShardAssigned(shardID, nodeID, timestamp)
**Triggered by:** AssignShards command succeeds
**Metrics:** Shards per node, distribution evenness
**Technical Notes**
- Infrastructure event
- Useful for monitoring load distribution
**Test Cases**
- Assignment published on rebalancing
- Metrics reflect distribution
**Dependencies**
- Depends on: Issue 3.9 (shard ownership)
---
#### Issue 3.11: [Read Model] Implement GetShardAssignments query
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Query shard-to-node mapping
**User Story**
As a client, I want to know which node owns a shard, so that I can route actor requests correctly.
**Acceptance Criteria**
- [ ] GetShardAssignments() returns ShardMap
- [ ] ShardMap[shardID] returns owning nodeID
- [ ] GetShard(actorID) returns shard for actor
- [ ] Routing decision: actorID → shard → nodeID
- [ ] Cached locally; refreshed on each rebalancing
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Query)
**Read Model:** ShardMap
**Data:**
- ShardID → NodeID (primary owner)
- Version (incremented on rebalancing)
- UpdateTime
**Implementation:**
- ClusterManager.GetShardMap()
- Cached; updated on assignment changes
**Technical Notes**
- Critical for routing
- Must be consistent across cluster
- Version helps detect stale caches
**Test Cases**
- GetShardAssignments returns current map
- GetShard(actorID) returns consistent shard
- Routing: actor ID → shard → node owner
**Dependencies**
- Depends on: Issue 3.9 (shard ownership)
---
### Feature Set 3c: Failure Detection and Recovery
**Capability:** Recover from Node Failures
**Description:** Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.
**Success Condition:** Node dies → failure detected within 90s → shards reassigned → actors replay automatically.
---
#### Issue 3.12: [Command] Implement node health checks
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1
**Title:** Detect node failures via heartbeat timeout
**User Story**
As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.
**Acceptance Criteria**
- [ ] Each node publishes heartbeat every 30s
- [ ] Nodes without heartbeat for 90s marked as Failed
- [ ] checkNodeHealth() runs every 30s
- [ ] Failed node's status updates atomically
- [ ] Tests verify failure detection timing
- [ ] Failed node can rejoin cluster
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** MarkNodeFailed(nodeID)
**Trigger:** monitorNodes detects missing heartbeat
**Events:** NodeFailed(nodeID, lastSeenTimestamp)
**Technical Notes**
- monitorNodes() loop in manager.go
- Check LastSeen timestamp
- Update status if stale (>90s)
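A sketch of the health check, assuming illustrative manager fields; the real loop is monitorNodes in manager.go:

```go
package sketch

import (
	"sync"
	"time"
)

// Illustrative node bookkeeping; manager.go is the source of truth.
type NodeInfo struct {
	ID       string
	Status   string // "Active" or "Failed"
	LastSeen time.Time
}

type clusterManager struct {
	mu    sync.Mutex
	nodes map[string]*NodeInfo
}

const failureTimeout = 90 * time.Second

// checkNodeHealth marks nodes Failed when their heartbeat is older than the
// failure timeout; it is expected to run every 30s from the monitor loop.
func (m *clusterManager) checkNodeHealth(now time.Time) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var failed []string
	for id, n := range m.nodes {
		if n.Status != "Failed" && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = "Failed"
			failed = append(failed, id) // caller publishes NodeFailed for these
		}
	}
	return failed
}
```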
**Test Cases**
- Active node: status stays Active
- No heartbeat for 90s: status → Failed
- Rejoin: status → Active
- Failure detected within roughly 90-120s of the last heartbeat
**Dependencies**
- Depends on: Issue 3.1 (node discovery)
---
#### Issue 3.13: [Command] Implement RebalanceShards after node failure
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Reassign failed node's shards to healthy nodes
**User Story**
As the cluster, I want to reassign failed node's shards automatically, so that actors are available on new nodes.
**Acceptance Criteria**
- [ ] Leader detects node failure
- [ ] Leader triggers RebalanceShards
- [ ] Failed node's shards reassigned evenly
- [ ] No shard left orphaned
- [ ] ShardMap updated atomically
- [ ] Rebalancing completes within 5 seconds
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Command)
**Command:** RebalanceShards(failedNodeID)
**Aggregates:** ShardMap, ConsistentHashRing
**Events:** RebalanceStarted, ShardMigrated
**Technical Notes**
- Leader only (IsLeader() check)
- Use consistent hashing to assign
- Calculate new assignments atomically
**Test Cases**
- Node-a fails with shards [1, 2, 3]
- Leader reassigns [1, 2, 3] to remaining nodes
- No orphaned shards
- Rebalancing < 5s
**Dependencies**
- Depends on: Issue 3.8 (consistent hashing)
- Depends on: Issue 3.12 (failure detection)
---
#### Issue 3.14: [Rule] Enforce no-orphan invariant
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0
**Title:** Guarantee all shards have owners after rebalancing
**User Story**
As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.
**Acceptance Criteria**
- [ ] Before rebalancing: verify no orphaned shards
- [ ] After rebalancing: verify all shards assigned
- [ ] Tests fail if invariant violated
- [ ] Rebalancing aborted if invariant would be violated
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** All shards [0, 1024) have owners after any rebalancing
**Check:**
- Count assigned shards
- Verify = 1024
- Abort if not
**Technical Notes**
- Validate before committing ShardMap
- Logs errors but doesn't assert (graceful degradation)
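A sketch of the invariant check run before committing a new ShardMap (the map shape is illustrative):

```go
package sketch

// validateNoOrphans verifies every shard in [0, numShards) has exactly one
// owner before the new assignment is committed; the caller aborts or rolls
// back the rebalancing if this returns false.
func validateNoOrphans(shardMap map[int]string, numShards int) bool {
	if len(shardMap) != numShards {
		return false // missing or extra shard entries
	}
	for shard := 0; shard < numShards; shard++ {
		if shardMap[shard] == "" {
			return false // orphaned shard
		}
	}
	return true
}
```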
**Test Cases**
- Rebalancing completes: all shards assigned
- Orphaned shard detected: rebalancing rolled back
- Tests verify count = 1024
**Dependencies**
- Depends on: Issue 3.13 (rebalancing)
---
#### Issue 3.15: [Event] Publish NodeFailed on failure detection
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Record node failure for observability
**User Story**
As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.
**Acceptance Criteria**
- [ ] NodeFailed event published when failure detected
- [ ] Event contains: NodeID, LastSeenTimestamp, AffectedShards
- [ ] Metrics track failure frequency
- [ ] Example: "node-a failed; 341 shards affected"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)
**Triggered by:** checkNodeHealth marks node failed
**Consumers:** Metrics, alerts, audit logs
**Technical Notes**
- Infrastructure event
- AffectedShards helps assess impact
**Test Cases**
- Node failure detected: event published
- Metrics show affected shard count
**Dependencies**
- Depends on: Issue 3.12 (failure detection)
---
#### Issue 3.16: [Event] Publish ShardMigrated on shard movement
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Track shard migrations
**User Story**
As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.
**Acceptance Criteria**
- [ ] ShardMigrated event published on each shard movement
- [ ] Event contains: ShardID, FromNodeID, ToNodeID, Status
- [ ] Status: "Started", "InProgress", "Completed", "Failed"
- [ ] Metrics track migration count and duration
- [ ] Example: "Shard 42 migrated from node-a to node-b (2.3s)"
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)
**Status:** Started → InProgress → Completed
**Consumers:** Metrics, progress tracking
**Technical Notes**
- Published for each shard move
- Helps track rebalancing progress
- Useful for SLO monitoring
**Test Cases**
- Shard moves: event published
- Metrics track duration
- Status transitions correct
**Dependencies**
- Depends on: Issue 3.13 (rebalancing)
---
#### Issue 3.17: [Documentation] Document actor migration and replay
**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2
**Title:** Explain how actors move and recover state
**User Story**
As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.
**Acceptance Criteria**
- [ ] Design doc: cluster/ACTOR_MIGRATION.md
- [ ] Explain shard reassignment process
- [ ] Explain state rebuild via GetEvents + replay
- [ ] Explain snapshot optimization
- [ ] Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
- [ ] Explain out-of-order message handling
**Bounded Context:** Cluster Coordination
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Shard assignment (consistent hashing)
- Actor discovery (routing via shard map)
- State rebuild (replay from JetStream)
- Snapshots (optional optimization)
- In-flight messages (may arrive before replay completes)
**Examples:**
- Manual failover: reassign shards manually
- Auto failover: leader initiates on failure detection
**Technical Notes**
- Complex topic; good documentation prevents bugs
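A minimal replay sketch in Go, assuming narrowed `EventStore` and `Actor` interfaces (illustrative only). It shows the rebuild path the document should explain: fetch the actor's history on the new owner node and fold each event back into state; a snapshot would simply shorten the loop.

```go
package cluster

import "context"

// Event is a stand-in for Aether's persisted event type.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

// EventStore is an assumed, narrowed view of the store used for replay.
type EventStore interface {
	GetEvents(ctx context.Context, actorID string) ([]Event, error)
}

// Actor is a stand-in for an application actor that folds events into state.
type Actor interface {
	Apply(e Event) error
}

// rehydrate rebuilds an actor's state on its new owner node by replaying
// the full event history (or the tail after the latest snapshot).
func rehydrate(ctx context.Context, store EventStore, actorID string, a Actor) error {
	events, err := store.GetEvents(ctx, actorID)
	if err != nil {
		return err
	}
	for _, e := range events {
		if err := a.Apply(e); err != nil {
			return err
		}
	}
	return nil
}
```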
**Test Cases**
- Documentation is clear
- Examples correct
**Dependencies**
- Depends on: Issue 3.13 (rebalancing)
- Depends on: Phase 1 (event replay)
---
## Phase 4: Namespace Isolation and NATS Event Delivery
### Feature Set 4a: Namespace Storage Isolation
**Capability:** Isolate Logical Domains Using Namespaces
**Description:** Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at persistence layer.
**Success Condition:** Two stores with namespaces "tenant-a", "tenant-b"; event saved in "tenant-a" invisible to "tenant-b" queries.
---
#### Issue 4.1: [Rule] Enforce namespace-based stream naming
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1
**Title:** Use namespace prefixes in JetStream stream names
**User Story**
As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.
**Acceptance Criteria**
- [ ] Namespace "tenant-a" → stream "tenant-a_events"
- [ ] Namespace "tenant-b" → stream "tenant-b_events"
- [ ] Empty namespace → stream "events" (default)
- [ ] JetStreamConfig.Namespace sets prefix
- [ ] NewJetStreamEventStoreWithNamespace convenience function
- [ ] Tests verify stream names have namespace prefix
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Configuration)
**Value Object:** Namespace (string identifier)
**Implementation:**
- JetStreamConfig.Namespace field
- StreamName = namespace + "_events" if namespace set
- StreamName = "events" if namespace empty
**Technical Notes**
- Already partially implemented in jetstream.go
- Ensure safe characters in stream names (sanitize spaces, dots, and wildcards); see the sketch below
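A sketch of the naming rule in Go; `streamName` is an illustrative helper, not the existing jetstream.go API, and the sanitization shown (replacing spaces, dots, and wildcards) is one possible policy.

```go
package eventstore

import "strings"

// streamName derives the JetStream stream name from a namespace.
// Assumed convention: "<namespace>_events", falling back to "events"
// when no namespace is configured. Unsafe characters are replaced because
// JetStream stream names may not contain spaces, dots, or wildcards.
func streamName(namespace string) string {
	if namespace == "" {
		return "events"
	}
	sanitized := strings.NewReplacer(" ", "_", ".", "_", "*", "_", ">", "_").Replace(namespace)
	return sanitized + "_events"
}
```

For example, `streamName("tenant-a")` yields `"tenant-a_events"` and `streamName("")` yields `"events"`.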
**Test Cases**
- NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
- NewJetStreamEventStoreWithNamespace(""): creates stream "events"
- Stream name verified
**Dependencies**
- None (orthogonal to other contexts)
---
#### Issue 4.2: [Rule] Enforce storage-level namespace isolation
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P0
**Title:** Prevent cross-namespace data leakage at storage layer
**User Story**
As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.
**Acceptance Criteria**
- [ ] SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
- [ ] GetEvents("tenant-a") queries "tenant-a_events" stream only
- [ ] No possibility of accidental cross-namespace leakage
- [ ] JetStream subject filtering enforces isolation
- [ ] Integration tests verify with multiple namespaces
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** Events from namespace X are invisible to namespace Y
**Mechanism:**
- Separate JetStream streams per namespace
- Subject prefixing: "tenant-a.events.actor-123"
- Subscribe filters by subject prefix
**Technical Notes**
- jetstream.go: SubscribeToActorEvents uses subject prefix
- Consumer created with subject filter matching namespace
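A sketch of the subject layout in Go, assuming the "{namespace}.events.{actorID}" convention above; the helper names are illustrative. Because each namespace has its own stream and consumers filter on this prefix, an event written under one namespace can never match another namespace's filter.

```go
package eventstore

import "fmt"

// eventSubject builds the per-actor subject inside a namespace,
// e.g. "tenant-a.events.actor-123".
func eventSubject(namespace, actorID string) string {
	return fmt.Sprintf("%s.events.%s", namespace, actorID)
}

// namespaceFilter is the assumed subject filter applied when creating a
// consumer scoped to a single namespace.
func namespaceFilter(namespace string) string {
	return namespace + ".events.>"
}
```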
**Test Cases**
- SaveEvent to tenant-a: visible in tenant-a queries
- Same event invisible to tenant-b queries
- GetLatestVersion scoped to namespace
- GetEvents scoped to namespace
- Multi-namespace integration test
**Dependencies**
- Depends on: Issue 4.1 (stream naming)
---
#### Issue 4.3: [Documentation] Document namespace design patterns
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1
**Title:** Provide guidance on namespace naming and use
**User Story**
As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.
**Acceptance Criteria**
- [ ] Design doc: NAMESPACE_DESIGN_PATTERNS.md
- [ ] Pattern 1: "tenant-{id}" (per-customer)
- [ ] Pattern 2: "env.domain" (per-env, per-bounded-context)
- [ ] Pattern 3: "env.domain.customer" (most granular)
- [ ] Examples of each pattern
- [ ] Guidance on choosing granularity
- [ ] Anti-patterns (wildcards, spaces, leading/trailing dots)
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** Documentation
**Content:**
- Multi-tenant patterns
- Granularity decisions
- Namespace naming rules
- Examples
- Anti-patterns
- Performance implications
**Examples:**
- SaaS: "tenant-uuid"
- Microservices: "service.orders"
- Complex: "env.service.tenant"
**Technical Notes**
- No hard restrictions; naming is flexible
- Sanitization (spaces → underscores)
**Test Cases**
- Documentation is clear
- Examples valid
**Dependencies**
- Depends on: Issue 4.1 (stream naming)
---
#### Issue 4.4: [Validation] Add namespace format validation (P2)
**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P2
**Title:** Validate namespace names to prevent invalid streams
**User Story**
As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.
**Acceptance Criteria**
- [ ] ValidateNamespace(ns string) returns error for invalid names
- [ ] Rejects: "tenant-*", "tenant a", "tenant."
- [ ] Accepts: "tenant-abc", "prod.orders", "tenant_123"
- [ ] Called on NewJetStreamEventStoreWithNamespace
- [ ] Clear error messages
- [ ] Tests verify validation rules
**Bounded Context:** Namespace Isolation
**DDD Implementation Guidance**
**Type:** New Feature (Validation)
**Validation Rules:**
- No wildcards (*, >)
- No spaces
- No leading/trailing dots
- Alphanumeric, hyphens, underscores, dots only
**Implementation:**
- ValidateNamespace regex
- Called before stream creation
**Technical Notes**
- Nice-to-have; namespace strings are currently accepted as-is
- Could sanitize instead of rejecting (e.g. replace spaces with underscores); a validation sketch follows below
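A sketch of the proposed validation in Go; the regex encodes the rules listed above (alphanumerics, hyphens, underscores, interior dots only), and while the function name matches the acceptance criteria, the exact pattern is an assumption.

```go
package eventstore

import (
	"fmt"
	"regexp"
)

// namespacePattern allows alphanumerics, hyphens, underscores, and interior
// dots; it rejects wildcards, spaces, and leading/trailing dots.
var namespacePattern = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace rejects namespace names that would produce invalid
// streams or subjects. The empty namespace stays valid because it maps to
// the default "events" stream.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil
	}
	if !namespacePattern.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: only alphanumerics, '-', '_', and interior '.' are allowed", ns)
	}
	return nil
}
```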
**Test Cases**
- Valid: "tenant-abc", "prod.orders"
- Invalid: "tenant-*", "tenant a", ".prod"
- Error messages clear
**Dependencies**
- Depends on: Issue 4.1 (stream naming)
---
### Feature Set 4b: Cross-Node Event Delivery via NATS
**Capability:** Deliver Events Across Cluster Nodes
**Description:** Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.
**Success Condition:** Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).
---
#### Issue 4.5: [Command] Implement NATSEventBus wrapper
**Type:** New Feature
**Bounded Context:** Event Bus (with NATS)
**Priority:** P1
**Title:** Extend EventBus with NATS-native pub/sub
**User Story**
As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.
**Acceptance Criteria**
- [ ] NATSEventBus embeds EventBus
- [ ] Publish(namespace, event) sends to local EventBus AND NATS
- [ ] NATS subject: "aether.events.{namespace}"
- [ ] SubscribeWithFilter works across nodes
- [ ] Self-published events not re-delivered (avoid loops)
- [ ] Tests verify cross-node delivery
**Bounded Context:** Event Bus (NATS extension)
**DDD Implementation Guidance**
**Type:** New Feature (Extension)
**Aggregate:** EventBus extended with NATSEventBus
**Commands:** Publish(namespace, event) [same interface, distributed]
**Implementation:**
- NATSEventBus composes EventBus
- Override Publish to also publish to NATS
- Subscribe to NATS subjects matching namespace
**Technical Notes**
- nats_eventbus.go already partially implemented
- NATS subject: "aether.events.orders" for namespace "orders"
- Include sourceNodeID in event to prevent redelivery
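A sketch of the composition in Go using the nats.go client; the `Event` and `LocalBus` types are stand-ins for Aether's actual types, and the subject format follows the note above. Stamping `SourceNodeID` before publishing is what lets remote nodes drop our own events.

```go
package eventbus

import (
	"encoding/json"
	"fmt"

	"github.com/nats-io/nats.go"
)

// Event is a stand-in for Aether's event type; SourceNodeID lets remote
// nodes drop events that originated locally, preventing redelivery loops.
type Event struct {
	ID           string `json:"id"`
	Namespace    string `json:"namespace"`
	SourceNodeID string `json:"source_node_id"`
	Data         []byte `json:"data"`
}

// LocalBus is an assumed narrow view of the in-process EventBus from Phase 2.
type LocalBus interface {
	Publish(namespace string, e Event) error
}

// NATSEventBus composes the local bus with a NATS connection.
type NATSEventBus struct {
	local  LocalBus
	nc     *nats.Conn
	nodeID string
}

// Publish delivers the event locally and then forwards it to the cluster
// on the assumed subject "aether.events.<namespace>".
func (b *NATSEventBus) Publish(namespace string, e Event) error {
	e.SourceNodeID = b.nodeID
	if err := b.local.Publish(namespace, e); err != nil {
		return err
	}
	payload, err := json.Marshal(e)
	if err != nil {
		return err
	}
	return b.nc.Publish(fmt.Sprintf("aether.events.%s", namespace), payload)
}
```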
**Test Cases**
- Publish on node-a: local subscribers on node-a receive
- Same publish: node-b subscribers receive via NATS
- Self-loop prevented: node-a doesn't re-receive own publish
- Multi-node: all nodes converge on same events
**Dependencies**
- Depends on: Issue 2.1 (EventBus.Publish)
- Depends on: Issue 3.1 (cluster setup for multi-node tests)
---
#### Issue 4.6: [Rule] Enforce exactly-once delivery across cluster
**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1
**Title:** Guarantee events delivered to all cluster subscribers
**User Story**
As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.
**Acceptance Criteria**
- [ ] Event published to NATS with JetStream consumer
- [ ] Consumer acknowledges delivery
- [ ] Redelivery on network failure (JetStream handles)
- [ ] No duplicate delivery to same subscriber
- [ ] All nodes see same events in same order
**Bounded Context:** Event Bus (NATS)
**DDD Implementation Guidance**
**Type:** New Feature (Invariant)
**Invariant:** Exactly-once delivery to each subscriber
**Mechanism:**
- JetStream consumer per subscriber group
- Acknowledgment on delivery
- Automatic redelivery on timeout
**Technical Notes**
- JetStream handles durability and ordering
- Consumer name = subscriber ID
- Push consumer model (events pushed to subscriber)
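A sketch of a durable, manually acknowledged JetStream subscription using the nats.go client; subject and durable names are placeholders. Note that acknowledgment plus redelivery on its own gives at-least-once semantics; meeting the exactly-once goal additionally requires idempotent handlers or JetStream message de-duplication.

```go
package eventbus

import (
	"log"

	"github.com/nats-io/nats.go"
)

// subscribeDurable creates a durable JetStream consumer for one subscriber
// group, with manual acks so unacknowledged messages are redelivered after
// the ack wait elapses.
func subscribeDurable(nc *nats.Conn, subject, subscriberID string, handle func(data []byte) error) (*nats.Subscription, error) {
	js, err := nc.JetStream()
	if err != nil {
		return nil, err
	}
	return js.Subscribe(subject, func(msg *nats.Msg) {
		if err := handle(msg.Data); err != nil {
			// No ack: JetStream redelivers this message after the timeout.
			log.Printf("handler failed, leaving message for redelivery: %v", err)
			return
		}
		if err := msg.Ack(); err != nil {
			log.Printf("ack failed: %v", err)
		}
	}, nats.Durable(subscriberID), nats.ManualAck())
}
```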
**Test Cases**
- Publish event: all subscribers receive once
- Network failure: redelivery after timeout
- No duplicates on subscriber
- Order preserved across nodes
**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
---
#### Issue 4.7: [Event] Publish EventPublished (via NATS)
**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P2
**Title:** Route published events to NATS subjects
**User Story**
As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.
**Acceptance Criteria**
- [ ] EventPublished event published to NATS
- [ ] Subject: "aether.events.{namespace}.published"
- [ ] Message contains: eventID, timestamp, sourceNodeID
- [ ] Metrics track: events published, delivered, dropped
- [ ] Helps identify partition/latency issues
**Bounded Context:** Event Bus (NATS)
**DDD Implementation Guidance**
**Type:** New Feature (Event)
**Event:** EventPublished (infrastructure)
**Subject:** aether.events.{namespace}.published
**Consumers:** Metrics, monitoring
**Technical Notes**
- Published after NATS publish succeeds
- Separate from local EventPublished (for clarity)
**Test Cases**
- Publish event: EventPublished message on NATS
- Metrics count delivery
- Cross-node visibility works
**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
---
#### Issue 4.8: [Read Model] Implement cross-node subscription
**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1
**Title:** Receive events from other nodes via NATS
**User Story**
As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.
**Acceptance Criteria**
- [ ] NATSEventBus.Subscribe(namespace) receives local + NATS events
- [ ] SubscribeWithFilter works with NATS
- [ ] Events from local node: delivered via local EventBus
- [ ] Events from remote nodes: delivered via NATS consumer
- [ ] Subscriber sees unified stream (no duplication)
**Bounded Context:** Event Bus (NATS)
**DDD Implementation Guidance**
**Type:** New Feature (Query/Subscription)
**Read Model:** UnifiedEventStream (local + remote)
**Implementation:**
- Subscribe creates local channel
- NATSEventBus subscribes to NATS subject
- Both feed into subscriber channel
**Technical Notes**
- Unified view is transparent to subscriber
- No need to know if event is local or remote
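A sketch of the fan-in that gives subscribers one unified channel; it reuses the `Event` stand-in from the Issue 4.5 sketch and assumes the local bus and the NATS consumer each expose a channel of decoded events. Dropping self-published events on the remote path is what prevents duplication.

```go
package eventbus

// mergeStreams fans the local and remote sources into one output channel,
// applying the subscriber's filter and skipping events this node published
// itself (those already arrived via the local path).
func mergeStreams(localCh, remoteCh <-chan Event, nodeID string, filter func(Event) bool) <-chan Event {
	out := make(chan Event, 64)

	forward := func(e Event) {
		if filter == nil || filter(e) {
			out <- e
		}
	}

	// Local path: events published on this node.
	go func() {
		for e := range localCh {
			forward(e)
		}
	}()

	// Remote path: events arriving from other nodes via NATS.
	go func() {
		for e := range remoteCh {
			if e.SourceNodeID == nodeID {
				continue
			}
			forward(e)
		}
	}()

	return out
}
```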
**Test Cases**
- Subscribe to namespace: receive local events
- Subscribe to namespace: receive remote events
- Filter works across both sources
- No duplication
**Dependencies**
- Depends on: Issue 4.5 (NATSEventBus)
---
## Summary
This backlog contains **67 executable issues** across **5 bounded contexts** organized into **4 implementation phases**. Each issue:
- Is decomposed using DDD-informed order (commands → rules → events → reads)
- References domain concepts (aggregates, commands, events, value objects)
- Includes acceptance criteria (testable, specific)
- States dependencies (enabling parallel work)
- Is sized to 1-3 days of work
**Recommended Build Order:**
1. **Phase 1** (17 issues): Event Sourcing Foundation - everything depends on this
2. **Phase 2** (9 issues): Local Event Bus - enables observability before clustering
3. **Phase 3** (20 issues): Cluster Coordination - enables distributed deployment
4. **Phase 4** (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery
**Total Scope:** roughly 67-200 developer-days of work at 1-3 days per issue (conservative estimate: 10-15 weeks for a small team working in parallel)
---
## Next Steps
1. Create Gitea issues from this backlog
2. Assign to team members
3. Set up dependency tracking in Gitea
4. Use `/spawn-issues` skill to parallelize implementation
5. Iterate on acceptance criteria with domain experts
See `/issue-writing` skill for proper issue formatting in Gitea.