# Aether Executable Backlog
**Built from:** 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition

**Date:** 2026-01-12

---
## Backlog Overview
This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.

**Total Scope:**

- **Capabilities:** 9 (all complete)
- **Features:** 14
- **Issues:** 67
- **Contexts:** 5
- **Implementation Phases:** 4

**Build Order (by value and dependencies):**

1. **Phase 1: Event Sourcing Foundation** (Capabilities 1-3)
   - Issues: 17
   - Enables all other work

2. **Phase 2: Local Event Bus** (Capability 8)
   - Issues: 9
   - Enables local pub/sub before clustering

3. **Phase 3: Cluster Coordination** (Capabilities 5-7)
   - Issues: 20
   - Enables distributed deployment

4. **Phase 4: Namespace & NATS** (Capabilities 4, 9)
   - Issues: 21
   - Enables multi-tenancy and cross-node delivery

---
## Phase 1: Event Sourcing Foundation
|
|
|
|
### Feature Set 1a: Event Storage with Version Conflict Detection
|
|
|
|
**Capability:** Store Events Durably with Conflict Detection
|
|
|
|
**Description:** Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.
|
|
|
|
**Success Condition:** Multiple writers attempt to update the same actor; the first wins, the others see a VersionConflictError with details; every accepted write lands in the immutable history.
|
|
|
|
---
|
|
|
|
#### Issue 1.1: [Command] Implement SaveEvent with monotonic version validation
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P0
|
|
|
|
**Title:** As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely
|
|
|
|
**User Story**
|
|
|
|
As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] SaveEvent accepts event with Version > current for actor
|
|
- [ ] SaveEvent rejects event with Version <= current (returns VersionConflictError)
|
|
- [ ] VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
|
|
- [ ] First event for new actor must have Version > 0 (typically 1)
|
|
- [ ] Version gaps are allowed (1, 3, 5 is valid)
|
|
- [ ] Validation happens before persistence (fail-fast)
|
|
- [ ] InMemoryEventStore and JetStreamEventStore both implement validation
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Core)
|
|
|
|
**Aggregate:** ActorEventStream (implicit; each actor has independent version sequence)
|
|
|
|
**Command:** SaveEvent(event)
|
|
|
|
**Validation Rules:**
|
|
- If no events exist for actor: version must be > 0
|
|
- If events exist: new version must be > latest version
|
|
|
|
**Success Event:** EventStored (published when SaveEvent succeeds)
|
|
|
|
**Error Event:** VersionConflict (triggered when version validation fails)
|
|
|
|
**Technical Notes**
|
|
|
|
- Version validation is the core invariant; everything else depends on it
|
|
- Use `GetLatestVersion()` to implement validation
|
|
- No database-level locks; optimistic validation only
|
|
- Conflict should fail in <1ms
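
A minimal sketch of the validation described above, assuming an in-memory store and the `Event`/`VersionConflictError` shapes listed in this issue (field names beyond those stated, such as `Type`, are assumptions). Later sketches in this backlog reuse these assumed types.

```go
package aether

import (
	"fmt"
	"sync"
)

// Assumed shapes; the real aether types may carry more fields
// (e.g. Timestamp, Namespace).
type Event struct {
	ID      string
	ActorID string
	Type    string
	Version int64
	Data    []byte
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for actor %q: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// InMemoryEventStore keeps an append-only slice per actor plus a
// latest-version index so the conflict check is O(1).
type InMemoryEventStore struct {
	mu             sync.RWMutex
	events         map[string][]*Event
	latestVersions map[string]int64
}

func NewInMemoryEventStore() *InMemoryEventStore {
	return &InMemoryEventStore{
		events:         make(map[string][]*Event),
		latestVersions: make(map[string]int64),
	}
}

// SaveEvent enforces the monotonic-version invariant before persisting:
// the new version must be strictly greater than the latest stored version
// (0 for a new actor). On mismatch it fails fast with a VersionConflictError
// and performs no retry of its own.
func (s *InMemoryEventStore) SaveEvent(e *Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	current := s.latestVersions[e.ActorID] // zero value when the actor is new
	if e.Version <= current {
		return &VersionConflictError{
			ActorID:          e.ActorID,
			AttemptedVersion: e.Version,
			CurrentVersion:   current,
		}
	}
	s.events[e.ActorID] = append(s.events[e.ActorID], e)
	s.latestVersions[e.ActorID] = e.Version
	return nil
}
```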
|
|
|
|
**Test Cases**
|
|
|
|
- New actor, version 1: succeeds
|
|
- Same actor, version 2 (after 1): succeeds
|
|
- Same actor, version 2 (after 1, concurrent): second call fails
|
|
- Same actor, version 1 (duplicate): fails
|
|
- Same actor, version 0 or negative: fails
|
|
- 100 concurrent writers racing on the same version: 1 succeeds, 99 fail
|
|
|
|
**Dependencies**
|
|
|
|
- None (foundation)
|
|
|
|
---
|
|
|
|
#### Issue 1.2: [Rule] Enforce append-only and immutability invariants
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P0
|
|
|
|
**Title:** Enforce event immutability and append-only semantics
|
|
|
|
**User Story**
|
|
|
|
As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] EventStore interface has no Update or Delete methods
|
|
- [ ] Events cannot be modified after persistence
|
|
- [ ] Replay of same events always produces same state
|
|
- [ ] Corrupted events are reported (not silently skipped)
|
|
- [ ] JetStream stream configuration prevents deletes (retention policy only)
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Core Invariant)
|
|
|
|
**Aggregate:** ActorEventStream
|
|
|
|
**Invariant:** Events are immutable; stream is append-only; no modifications to EventStore interface
|
|
|
|
**Implementation:**
|
|
- Event struct has no Setters (only getters)
|
|
- SaveEvent is the only public persistence method
|
|
- JetStream streams configured with `NoDelete` policy
|
|
|
|
**Technical Notes**
|
|
|
|
- This is enforced at interface level (no Update/Delete in EventStore)
|
|
- JetStream configuration prevents accidental deletes
|
|
- ReplayError allows visibility into corruption without losing good data
|
|
|
|
**Test Cases**
|
|
|
|
- Attempt to modify Event.Data after creation: compile error (if immutable)
|
|
- Attempt to call UpdateEvent: interface doesn't exist
|
|
- JetStream stream created with correct retention policy
|
|
- ReplayError captured when event unmarshaling fails
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent implementation)
|
|
|
|
---
|
|
|
|
#### Issue 1.3: [Event] Publish EventStored after successful save
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P0
|
|
|
|
**Title:** Emit EventStored event for persistence observability
|
|
|
|
**User Story**
|
|
|
|
As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] EventStored event published after SaveEvent succeeds
|
|
- [ ] EventStored contains: EventID, ActorID, Version, Timestamp
|
|
- [ ] No EventStored published if SaveEvent fails
|
|
- [ ] EventBus receives EventStored in same transaction context
|
|
- [ ] Metrics increment for each EventStored
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature
|
|
|
|
**Event:** EventStored(eventID, actorID, version, timestamp)
|
|
|
|
**Triggered by:** Successful SaveEvent call
|
|
|
|
**Consumers:** Metrics collectors, projections, audit systems
|
|
|
|
**Technical Notes**
|
|
|
|
- EventStored is an internal event (Aether infrastructure)
|
|
- Published to local EventBus (see Phase 2 for cross-node)
|
|
- Allows observability without coupling application code
|
|
|
|
**Test Cases**
|
|
|
|
- Save event → EventStored published
|
|
- Version conflict → no EventStored published
|
|
- Multiple saves → multiple EventStored events in order
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent)
|
|
- Depends on: Phase 2, Issue 2.1 (EventBus.Publish)
|
|
|
|
---
|
|
|
|
#### Issue 1.4: [Event] Publish VersionConflict error with full context
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing, Optimistic Concurrency Control
|
|
**Priority:** P0
|
|
|
|
**Title:** Return detailed version conflict information for retry logic
|
|
|
|
**User Story**
|
|
|
|
As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
|
|
- [ ] Error message is human-readable with all context
|
|
- [ ] `errors.Is(err, ErrVersionConflict)` returns true for the sentinel check
- [ ] `errors.As(err, &versionErr)` unpacks the error to a *VersionConflictError
|
|
- [ ] Application can read CurrentVersion to decide retry strategy
|
|
|
|
**Bounded Context:** Event Sourcing + OCC
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature
|
|
|
|
**Error Type:** VersionConflictError (wraps ErrVersionConflict sentinel)
|
|
|
|
**Data:** ActorID, AttemptedVersion, CurrentVersion
|
|
|
|
**Use:** Application uses this to implement retry strategies
|
|
|
|
**Technical Notes**
|
|
|
|
- Already implemented in `/aether/event.go` (VersionConflictError struct)
|
|
- Document standard retry patterns in examples/
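
A hedged sketch of the caller-side unwrapping, building on the Issue 1.1 sketch; the sentinel wiring (`ErrVersionConflict` plus an `Unwrap` method) is an assumption about how `event.go` makes both `errors.Is` and `errors.As` work, and `errors` and `log` are the only extra imports.

```go
// Assumed wiring (not confirmed against the real event.go):
var ErrVersionConflict = errors.New("version conflict")

// Unwrap lets errors.Is(err, ErrVersionConflict) match the sentinel while
// errors.As still recovers the typed details.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

func handleSave(store interface{ SaveEvent(*Event) error }, e *Event) {
	err := store.SaveEvent(e)
	var conflict *VersionConflictError
	switch {
	case err == nil:
		// persisted
	case errors.As(err, &conflict):
		// Typed check: read CurrentVersion to pick a retry strategy.
		log.Printf("lost race on %s: current version is %d (attempted %d)",
			conflict.ActorID, conflict.CurrentVersion, conflict.AttemptedVersion)
	case errors.Is(err, ErrVersionConflict):
		log.Printf("version conflict: %v", err) // sentinel-only check
	default:
		log.Printf("save failed: %v", err)
	}
}
```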
|
|
|
|
**Test Cases**
|
|
|
|
- Conflict with detailed error: ActorID, versions present
|
|
- Application reads CurrentVersion: succeeds
|
|
- `errors.Is(err, ErrVersionConflict)`: true
- `errors.As(err, &versionErr)`: works
|
|
- Manual test: log the error, see all context
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent)
|
|
|
|
---
|
|
|
|
#### Issue 1.5: [Read Model] Implement GetLatestVersion query
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P0
|
|
|
|
**Title:** Provide efficient version lookup for optimistic locking
|
|
|
|
**User Story**
|
|
|
|
As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] GetLatestVersion(actorID) returns latest version or 0 if no events
|
|
- [ ] Execution time is O(1) or O(log n), not O(n)
|
|
- [ ] InMemoryEventStore implements with map lookup
|
|
- [ ] JetStreamEventStore caches latest version per actor
|
|
- [ ] Cache is invalidated after each SaveEvent
|
|
- [ ] Multiple calls for same actor within 1s hit cache
|
|
- [ ] Namespace isolation: GetLatestVersion scoped to namespace
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Query)
|
|
|
|
**Read Model:** ActorVersionIndex
|
|
|
|
**Source Events:** SaveEvent (updates cache)
|
|
|
|
**Data:** ActorID → LatestVersion
|
|
|
|
**Performance:** O(1) lookup after SaveEvent
|
|
|
|
**Technical Notes**
|
|
|
|
- InMemoryEventStore: use map[actorID]int64
|
|
- JetStreamEventStore: query JetStream metadata OR maintain cache
|
|
- Cache invalidation: update after every SaveEvent
|
|
- Thread-safe with RWMutex (read-heavy)
|
|
|
|
**Test Cases**
|
|
|
|
- New actor: GetLatestVersion returns 0
|
|
- After SaveEvent(version: 1): GetLatestVersion returns 1
|
|
- After SaveEvent(version: 3): GetLatestVersion returns 3
|
|
- Concurrent reads from same actor: all return consistent value
|
|
- Namespace isolation: "tenant-a" and "tenant-b" have independent versions
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent)
|
|
|
|
---
|
|
|
|
### Feature Set 1b: State Rebuild from Event History
|
|
|
|
**Capability:** Rebuild State from Event History
|
|
|
|
**Description:** Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.
|
|
|
|
**Success Condition:** GetEvents(actorID, 0) returns all events in order; replaying produces the same state every time; snapshots cut replay cost from O(all events) to O(events since the last snapshot).
|
|
|
|
---
|
|
|
|
#### Issue 1.6: [Command] Implement GetEvents for replay
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P0
|
|
|
|
**Title:** Load events from store for state replay
|
|
|
|
**User Story**
|
|
|
|
As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] GetEvents(actorID, fromVersion) returns []*Event in version order
|
|
- [ ] Events are ordered by version (ascending)
|
|
- [ ] fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
|
|
- [ ] If no events exist, returns empty slice (not error)
|
|
- [ ] If actorID has no events >= fromVersion, returns empty slice
|
|
- [ ] Namespace isolation: GetEvents scoped to namespace
|
|
- [ ] Large result sets don't cause memory issues (stream if >10k events)
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Query)
|
|
|
|
**Command:** GetEvents(actorID, fromVersion)
|
|
|
|
**Returns:** []*Event ordered by version
|
|
|
|
**Invariant:** Order is deterministic (version order always)
|
|
|
|
**Technical Notes**
|
|
|
|
- InMemoryEventStore: filter and sort by version
|
|
- JetStreamEventStore: query JetStream subject and order results
|
|
- Consider pagination for very large actor histories
|
|
- fromVersion=0 means "start from beginning"
|
|
|
|
**Test Cases**
|
|
|
|
- GetEvents(actorID, 0) with 5 events: returns all 5 in order
|
|
- GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
|
|
- GetEvents(nonexistent, 0): returns empty slice
|
|
- GetEvents with gap (versions 1, 3, 5): returns only those 3
|
|
- Order is guaranteed (version order, not insertion order)
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent)
|
|
|
|
---
|
|
|
|
#### Issue 1.7: [Rule] Define and enforce snapshot validity
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P1
|
|
|
|
**Title:** Implement snapshot invalidation policy
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] Snapshot valid until Version + MaxVersionGap (default 1000)
|
|
- [ ] GetLatestSnapshot returns nil if no snapshot or invalid
|
|
- [ ] Application can override MaxVersionGap in config
|
|
- [ ] Snapshot timestamp recorded for debugging
|
|
- [ ] No automatic cleanup; application calls SaveSnapshot to create
|
|
- [ ] Tests confirm snapshot invalidation logic
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Policy)
|
|
|
|
**Aggregate:** ActorSnapshot + SnapshotPolicy
|
|
|
|
**Policy:** Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap
|
|
|
|
**Implementation:**
|
|
- SnapshotStore.GetLatestSnapshot validates before returning
|
|
- If invalid, returns nil; application must replay
|
|
|
|
**Technical Notes**
|
|
|
|
- This is a safety policy; prevents stale snapshots
|
|
- Application owns decision to create snapshots (no auto-triggering)
|
|
- MaxVersionGap is tunable per deployment
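
The policy reduces to a single comparison; a sketch assuming the `ActorSnapshot` shape from Issue 1.10 and a configurable `MaxVersionGap` (default 1000).

```go
// snapshotIsValid applies the invalidation policy: a snapshot is usable only
// while the actor has advanced at most maxVersionGap versions past it.
func snapshotIsValid(snap *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	if snap == nil {
		return false
	}
	return currentVersion-snap.Version <= maxVersionGap
}
```

For example, a snapshot taken at version 10 with MaxVersionGap=100 is still valid at version 110 and rejected at 111, matching the test cases below.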
|
|
|
|
**Test Cases**
|
|
|
|
- Snapshot at version 10, MaxGap=100, current=50: valid
|
|
- Snapshot at version 10, MaxGap=100, current=111: invalid
|
|
- Snapshot at version 10, MaxGap=100, current=110: valid
|
|
- GetLatestSnapshot returns nil for invalid snapshot
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.6 (GetEvents)
|
|
|
|
---
|
|
|
|
#### Issue 1.8: [Event] Publish SnapshotCreated for observability
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P1
|
|
|
|
**Title:** Emit snapshot creation event for lifecycle tracking
|
|
|
|
**User Story**
|
|
|
|
As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] SnapshotCreated event published after SaveSnapshot succeeds
|
|
- [ ] Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
|
|
- [ ] Metrics increment for snapshot creation
|
|
- [ ] No event if SaveSnapshot fails
|
|
- [ ] Example: Snapshot created every 1000 versions
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Event)
|
|
|
|
**Event:** SnapshotCreated(actorID, version, timestamp, replayDurationMs)
|
|
|
|
**Triggered by:** SaveSnapshot call succeeds
|
|
|
|
**Consumers:** Metrics, monitoring dashboards
|
|
|
|
**Technical Notes**
|
|
|
|
- SnapshotCreated is infrastructure event (like EventStored)
|
|
- ReplayDuration helps identify slow actors needing snapshots more frequently
|
|
|
|
**Test Cases**
|
|
|
|
- SaveSnapshot succeeds → SnapshotCreated published
|
|
- SaveSnapshot fails → no event published
|
|
- ReplayDuration recorded accurately
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.10 (SnapshotStore interface)
|
|
|
|
---
|
|
|
|
#### Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P1
|
|
|
|
**Title:** Handle corrupted events during replay without data loss
|
|
|
|
**User Story**
|
|
|
|
As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
|
|
- [ ] ReplayResult contains: []*Event (good) and []ReplayError (bad)
|
|
- [ ] Good events are returned in order despite errors
|
|
- [ ] ReplayError contains: SequenceNumber, RawData, UnmarshalError
|
|
- [ ] Application decides how to handle corrupted events
|
|
- [ ] Metrics track corruption frequency
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Query)
|
|
|
|
**Interface:** EventStoreWithErrors extends EventStore
|
|
|
|
**Method:** GetEventsWithErrors(actorID, fromVersion) → ReplayResult
|
|
|
|
**Data:**
|
|
- ReplayResult.Events: successfully deserialized events
|
|
- ReplayResult.Errors: corruption records
|
|
- ReplayResult.HasErrors(): convenience check
|
|
|
|
**Technical Notes**
|
|
|
|
- Already defined in event.go (ReplayError, ReplayResult)
|
|
- JetStreamEventStore should implement EventStoreWithErrors
|
|
- Application uses HasErrors() to decide on recovery action
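
The shapes referenced above, as assumed for this backlog (the definitions already in event.go are authoritative if they differ):

```go
// ReplayError records one event that could not be deserialized, with enough
// context for forensics.
type ReplayError struct {
	SequenceNumber uint64
	RawData        []byte
	UnmarshalError error
}

// ReplayResult separates clean events from corruption so callers can keep
// processing good data and decide separately what to do with the rest.
type ReplayResult struct {
	Events []*Event
	Errors []ReplayError
}

func (r *ReplayResult) HasErrors() bool { return len(r.Errors) > 0 }
```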
|
|
|
|
**Test Cases**
|
|
|
|
- All good events: ReplayResult.Events populated, no errors
|
|
- Corrupted event in middle: good events before/after, one error recorded
|
|
- Multiple corruptions: all recorded with context
|
|
- Application can inspect RawData for forensics
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.6 (GetEvents)
|
|
|
|
---
|
|
|
|
#### Issue 1.10: [Interface] Implement SnapshotStore interface
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Sourcing
|
|
**Priority:** P0
|
|
|
|
**Title:** Define snapshot storage contract
|
|
|
|
**User Story**
|
|
|
|
As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] SnapshotStore extends EventStore
|
|
- [ ] GetLatestSnapshot(actorID) returns ActorSnapshot or nil
|
|
- [ ] SaveSnapshot(snapshot) persists snapshot
|
|
- [ ] ActorSnapshot contains: ActorID, Version, State, Timestamp
|
|
- [ ] Namespace isolation: snapshots scoped to namespace
|
|
- [ ] Tests verify interface contract
|
|
|
|
**Bounded Context:** Event Sourcing
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Interface)
|
|
|
|
**Interface:** SnapshotStore extends EventStore
|
|
|
|
**Methods:**
|
|
- GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
|
|
- SaveSnapshot(snapshot) → error
|
|
|
|
**Aggregates:** ActorSnapshot (value object)
|
|
|
|
**Technical Notes**
|
|
|
|
- Already defined in event.go
|
|
- Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
|
|
- Keep snapshots in same store as events (co-located)
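
An assumed rendering of the contract above, reusing the Issue 1.1 types and an `EventStore` interface with the SaveEvent/GetEvents/GetLatestVersion methods from Issues 1.1-1.6; `time` is the only extra import.

```go
// ActorSnapshot is a point-in-time materialization of actor state at a
// specific version.
type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

// SnapshotStore layers snapshot persistence on top of the event store so
// snapshots stay co-located with the events they summarize.
type SnapshotStore interface {
	EventStore
	GetLatestSnapshot(actorID string) (*ActorSnapshot, error) // nil when absent or invalid
	SaveSnapshot(snapshot *ActorSnapshot) error
}
```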
|
|
|
|
**Test Cases**
|
|
|
|
- SaveSnapshot persists; GetLatestSnapshot retrieves it
|
|
- New actor: GetLatestSnapshot returns nil
|
|
- Multiple snapshots: only latest returned
|
|
- Namespace isolation: snapshots from tenant-a don't appear in tenant-b
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent + storage foundation)
|
|
|
|
---
|
|
|
|
### Feature Set 1c: Optimistic Concurrency Control
|
|
|
|
**Capability:** Enable Safe Concurrent Writes
|
|
|
|
**Description:** Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.
|
|
|
|
**Success Condition:** Two concurrent writers race; one succeeds, other sees VersionConflictError; application retries without locks.
|
|
|
|
---
|
|
|
|
#### Issue 1.11: [Rule] Enforce fail-fast on version conflict
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Optimistic Concurrency Control
|
|
**Priority:** P0
|
|
|
|
**Title:** Fail immediately on version conflict; no auto-retry
|
|
|
|
**User Story**
|
|
|
|
As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] SaveEvent returns VersionConflictError immediately on mismatch
|
|
- [ ] No built-in retry loop in SaveEvent
|
|
- [ ] No database-level locks held
|
|
- [ ] Application reads VersionConflictError and decides retry
|
|
- [ ] Default retry strategy documented (examples/)
|
|
|
|
**Bounded Context:** Optimistic Concurrency Control
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Policy)
|
|
|
|
**Invariant:** Conflicts trigger immediate failure; application owns retry
|
|
|
|
**Implementation:**
|
|
- SaveEvent: version check, return error if mismatch, done
|
|
- No loop, no backoff, no retries
|
|
- Clean error with context for caller
|
|
|
|
**Technical Notes**
|
|
|
|
- This is a design choice: fail-fast enables flexible retry strategies
|
|
- Application can choose exponential backoff, jitter, circuit-breaker, etc.
|
|
|
|
**Test Cases**
|
|
|
|
- SaveEvent(version: 2) when current=2: fails immediately
|
|
- No retry attempted by library
|
|
- Application can retry if desired
|
|
- Example patterns in examples/retry.go
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (SaveEvent)
|
|
|
|
---
|
|
|
|
#### Issue 1.12: [Documentation] Document concurrent write patterns
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Optimistic Concurrency Control
|
|
**Priority:** P1
|
|
|
|
**Title:** Provide retry strategy examples (backoff, circuit-breaker, queue)
|
|
|
|
**User Story**
|
|
|
|
As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] examples/retry_exponential_backoff.go
|
|
- [ ] examples/retry_circuit_breaker.go
|
|
- [ ] examples/retry_queue_based.go
|
|
- [ ] examples/concurrent_write_test.go showing patterns
|
|
- [ ] README mentions OCC patterns
|
|
- [ ] Each example is >100 lines with explanation
|
|
|
|
**Bounded Context:** Optimistic Concurrency Control
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** Documentation
|
|
|
|
**Artifacts:**
|
|
- examples/retry_exponential_backoff.go
|
|
- examples/retry_circuit_breaker.go
|
|
- examples/retry_queue_based.go
|
|
- examples/concurrent_write_test.go
|
|
|
|
**Content:**
|
|
- How to read VersionConflictError
|
|
- When to retry (idempotent operations)
|
|
- When not to retry (non-idempotent)
|
|
- Backoff strategies
|
|
- Monitoring
|
|
|
|
**Technical Notes**
|
|
|
|
- Real, runnable code (not pseudocode)
|
|
- Show metrics collection
|
|
- Show when to give up
|
|
|
|
**Test Cases**
|
|
|
|
- Examples compile without error
|
|
- Examples use idempotent operations
|
|
- Test coverage for examples
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.11 (fail-fast behavior)
|
|
|
|
---
|
|
|
|
## Phase 2: Local Event Bus
|
|
|
|
### Feature Set 2a: Event Routing and Filtering
|
|
|
|
**Capability:** Route and Filter Domain Events
|
|
|
|
**Description:** Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.
|
|
|
|
**Success Condition:** Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.
|
|
|
|
---
|
|
|
|
#### Issue 2.1: [Command] Implement Publish to local subscribers
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P1
|
|
|
|
**Title:** Publish events to local subscribers
|
|
|
|
**User Story**
|
|
|
|
As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] Publish(namespaceID, event) sends to all subscribers of that namespace
|
|
- [ ] Exact subscribers (namespace="orders") receive event
|
|
- [ ] Wildcard subscribers (namespace="order*") receive matching events
|
|
- [ ] Events delivered in-process (no NATS yet)
|
|
- [ ] Buffered channels (100-event buffer) prevent blocking
|
|
- [ ] Events for subscribers with full buffers are dropped non-blocking (no deadlock)
|
|
- [ ] Metrics track publish count, receive count, dropped count
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Command)
|
|
|
|
**Command:** Publish(namespaceID, event)
|
|
|
|
**Invariant:** All subscribers matching namespace receive event
|
|
|
|
**Implementation:**
|
|
- Iterate exact subscribers for namespace
|
|
- Iterate wildcard subscribers matching pattern
|
|
- Deliver to each (non-blocking, buffered)
|
|
- Count drops
|
|
|
|
**Technical Notes**
|
|
|
|
- EventBus in eventbus.go already implements this
|
|
- Ensure buffered channels don't cause memory leaks
|
|
- Metrics important for observability
|
|
|
|
**Test Cases**
|
|
|
|
- Publish to "orders": exact subscriber of "orders" receives
|
|
- Publish to "orders.new": wildcard subscriber of "order*" receives
|
|
- Publish to "payments": subscriber to "orders" does NOT receive
|
|
- Subscriber with full buffer: event dropped (non-blocking)
|
|
- 1000 publishes: metrics accurate
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.2 (Subscribe)
|
|
|
|
---
|
|
|
|
#### Issue 2.2: [Command] Implement Subscribe with optional filter
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P1
|
|
|
|
**Title:** Register subscriber with optional event filter
|
|
|
|
**User Story**
|
|
|
|
As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] Subscribe(namespacePattern) returns <-chan *Event
|
|
- [ ] SubscribeWithFilter(namespacePattern, filter) returns filtered channel
|
|
- [ ] Filter supports EventTypes ([]string) and ActorPattern (string)
|
|
- [ ] Filters applied client-side (subscriber decides)
|
|
- [ ] Wildcard patterns work: "*" matches single token, ">" matches multiple
|
|
- [ ] Subscription channel is buffered (100 events)
|
|
- [ ] Unsubscribe(namespacePattern, ch) removes subscription
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Command)
|
|
|
|
**Command:** Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)
|
|
|
|
**Invariants:**
|
|
- Namespace pattern determines which namespaces
|
|
- Filter determines which events within namespace
|
|
- Both work together (AND logic)
|
|
|
|
**Filter Types:**
|
|
- EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
|
|
- ActorPattern: string (e.g., "order-customer-*")
|
|
|
|
**Technical Notes**
|
|
|
|
- Pattern matching follows NATS conventions
|
|
- Filters are optional (nil filter = all events)
|
|
- Filters are applied client-side on the local bus (see the sketch below); NATS applies subject filtering server-side once events cross nodes
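
A sketch of the client-side filter (AND of both optional criteria); the `EventFilter` field names follow the description above, `strings` is the only import, and the trailing-`*` prefix glob for actor IDs is an assumption.

```go
// EventFilter narrows a subscription within its namespace; zero values mean
// "no restriction" and a nil filter matches every event.
type EventFilter struct {
	EventTypes   []string // e.g. []string{"OrderPlaced", "OrderShipped"}
	ActorPattern string   // e.g. "order-customer-*"
}

func (f *EventFilter) matches(e *Event) bool {
	if f == nil {
		return true
	}
	if len(f.EventTypes) > 0 {
		ok := false
		for _, t := range f.EventTypes {
			if t == e.Type {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	if f.ActorPattern != "" {
		if strings.HasSuffix(f.ActorPattern, "*") {
			return strings.HasPrefix(e.ActorID, strings.TrimSuffix(f.ActorPattern, "*"))
		}
		return f.ActorPattern == e.ActorID
	}
	return true
}
```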
|
|
|
|
**Test Cases**
|
|
|
|
- Subscribe("orders"): exact match only
|
|
- Subscribe("order*"): wildcard match
|
|
- Subscribe("order.*"): NATS-style wildcard
|
|
- SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
|
|
- SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
|
|
- Unsubscribe closes channel
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 1.1 (events structure)
|
|
|
|
---
|
|
|
|
#### Issue 2.3: [Rule] Enforce exact subscription isolation
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus + Namespace Isolation
|
|
**Priority:** P1
|
|
|
|
**Title:** Guarantee exact namespace subscriptions are isolated
|
|
|
|
**User Story**
|
|
|
|
As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] Subscriber to "tenant-a" receives events from "tenant-a" only
|
|
- [ ] Subscriber to "tenant-a" does NOT receive from "tenant-b"
|
|
- [ ] Wildcard subscriber to "tenant*" receives from both
|
|
- [ ] Exact match subscribers are isolated from wildcard
|
|
- [ ] Tests verify isolation with multi-namespace setup
|
|
- [ ] Documentation warns about wildcard security implications
|
|
|
|
**Bounded Context:** Event Bus + Namespace Isolation
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Policy/Invariant)
|
|
|
|
**Invariant:** Exact subscriptions are isolated
|
|
|
|
**Implementation:**
|
|
- exactSubscribers map[namespace][]*subscription
|
|
- Wildcard subscriptions separate collection
|
|
- Publish checks exact first, then wildcard patterns
|
|
|
|
**Security Note:** Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)
|
|
|
|
**Technical Notes**
|
|
|
|
- Enforced at EventBus.Publish level
|
|
- Exact match is simple string equality
|
|
- Wildcard uses MatchNamespacePattern helper
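
A hedged sketch of the pattern matching the notes refer to, covering both the prefix-glob form used in this backlog ("tenant-*", "order*") and NATS-style tokens ("orders.*", ">"); the real MatchNamespacePattern helper may behave differently.

```go
// matchNamespacePattern: a trailing "*" on a dot-free pattern is treated as a
// prefix glob; otherwise tokens are compared NATS-style, where "*" matches
// exactly one dot-separated token and ">" matches one or more trailing tokens.
func matchNamespacePattern(pattern, namespace string) bool {
	if pattern == ">" {
		return true
	}
	if strings.HasSuffix(pattern, "*") && !strings.Contains(pattern, ".") {
		return strings.HasPrefix(namespace, strings.TrimSuffix(pattern, "*"))
	}
	pt := strings.Split(pattern, ".")
	nt := strings.Split(namespace, ".")
	for i, p := range pt {
		if p == ">" {
			return len(nt) > i // must cover at least one remaining token
		}
		if i >= len(nt) || (p != "*" && p != nt[i]) {
			return false
		}
	}
	return len(pt) == len(nt)
}
```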
|
|
|
|
**Test Cases**
|
|
|
|
- Publish to "tenant-a": only "tenant-a" exact subscribers get it
|
|
- Publish to "tenant-b": only "tenant-b" exact subscribers get it
|
|
- Publish to "tenant-a": "tenant*" wildcard subscriber gets it
|
|
- Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.2 (Subscribe)
|
|
|
|
---
|
|
|
|
#### Issue 2.4: [Rule] Document wildcard subscription security
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P1
|
|
|
|
**Title:** Document that wildcard subscriptions bypass isolation
|
|
|
|
**User Story**
|
|
|
|
As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] eventbus.go comments explain wildcard behavior
|
|
- [ ] Security warning in Subscribe godoc
|
|
- [ ] Example showing wildcard usage for logging
|
|
- [ ] Example showing why wildcard is dangerous (if not restricted)
|
|
- [ ] README mentions namespace isolation caveats
|
|
- [ ] Examples show proper patterns (monitoring, auditing)
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** Documentation
|
|
|
|
**Content:**
|
|
- Wildcard subscriptions receive all matching events
|
|
- Use for cross-cutting concerns (logging, monitoring, audit)
|
|
- Restrict access to trusted components
|
|
- Never expose wildcard pattern to untrusted users
|
|
|
|
**Examples:**
|
|
- Monitoring system subscribes to ">"
|
|
- Audit system subscribes to "tenant-*"
|
|
- Application logic uses exact subscriptions only
|
|
|
|
**Technical Notes**
|
|
|
|
- Intentional design; not a bug
|
|
- Different from NATS server-side filtering
|
|
|
|
**Test Cases**
|
|
|
|
- Examples compile
|
|
- Documentation is clear and accurate
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.3 (exact isolation)
|
|
|
|
---
|
|
|
|
#### Issue 2.5: [Event] Publish SubscriptionCreated for tracking
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P2
|
|
|
|
**Title:** Track subscription lifecycle
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] SubscriptionCreated event published on Subscribe
|
|
- [ ] SubscriptionDestroyed event published on Unsubscribe
|
|
- [ ] Event contains: namespacePattern, filterCriteria, timestamp
|
|
- [ ] Metrics increment on subscribe/unsubscribe
|
|
- [ ] SubscriberCount(namespace) returns current count
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Event)
|
|
|
|
**Event:** SubscriptionCreated(namespacePattern, filter, timestamp)
|
|
|
|
**Event:** SubscriptionDestroyed(namespacePattern, timestamp)
|
|
|
|
**Metrics:** Subscriber count per namespace
|
|
|
|
**Technical Notes**
|
|
|
|
- SubscriberCount already in eventbus.go
|
|
- Add events to EventBus.Subscribe and EventBus.Unsubscribe
|
|
- Internal events (infrastructure)
|
|
|
|
**Test Cases**
|
|
|
|
- Subscribe → metrics increment
|
|
- Unsubscribe → metrics decrement
|
|
- SubscriberCount correct
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.2 (Subscribe/Unsubscribe)
|
|
|
|
---
|
|
|
|
#### Issue 2.6: [Event] Publish EventPublished for delivery tracking
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P2
|
|
|
|
**Title:** Record event publication metrics
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] EventPublished event published on Publish
|
|
- [ ] Metrics track: published count, delivered count, dropped count per namespace
|
|
- [ ] Dropped events (full channel) recorded
|
|
- [ ] Application can query metrics via Metrics()
|
|
- [ ] Example: 1000 events published, 995 delivered, 5 dropped
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Event/Metrics)
|
|
|
|
**Event:** EventPublished (infrastructure event)
|
|
|
|
**Metrics:**
|
|
- PublishCount[namespace]
|
|
- DeliveryCount[namespace]
|
|
- DroppedCount[namespace]
|
|
|
|
**Implementation:**
|
|
- RecordPublish(namespace)
|
|
- RecordReceive(namespace)
|
|
- RecordDroppedEvent(namespace)
|
|
|
|
**Technical Notes**
|
|
|
|
- Metrics already in DefaultMetricsCollector
|
|
- RecordDroppedEvent signals subscriber backpressure
|
|
- Can be used to auto-scale subscribers
|
|
|
|
**Test Cases**
|
|
|
|
- Publish 100 events: metrics show 100 published
|
|
- All delivered: metrics show 100 delivered
|
|
- Full subscriber: next event dropped, metrics show 1 dropped
|
|
- Query via bus.Metrics(): values accurate
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.1 (Publish)
|
|
|
|
---
|
|
|
|
#### Issue 2.7: [Read Model] Implement GetSubscriptions query
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P2
|
|
|
|
**Title:** Query active subscriptions for operational visibility
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] GetSubscriptions() returns []SubscriptionInfo
|
|
- [ ] SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
|
|
- [ ] Works for both exact and wildcard subscriptions
|
|
- [ ] Metrics accessible via SubscriberCount(namespace)
|
|
- [ ] Example: "What subscriptions are listening to 'orders'?"
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Query)
|
|
|
|
**Read Model:** SubscriptionRegistry
|
|
|
|
**Data:**
|
|
- Pattern: namespace pattern (e.g., "tenant-*")
|
|
- Filter: optional filter criteria
|
|
- SubscriberID: unique ID for each subscription
|
|
- CreatedAt: timestamp
|
|
|
|
**Implementation:**
|
|
- Track subscriptions in eventbus.go
|
|
- Expose via GetSubscriptions() method
|
|
|
|
**Technical Notes**
|
|
|
|
- Useful for debugging
|
|
- Optional feature; not critical
|
|
|
|
**Test Cases**
|
|
|
|
- Subscribe to "orders": GetSubscriptions shows it
|
|
- Subscribe to "order*": GetSubscriptions shows it
|
|
- Unsubscribe: GetSubscriptions removes it
|
|
- Multiple subscribers: all listed
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.2 (Subscribe)
|
|
|
|
---
|
|
|
|
### Feature Set 2b: Buffering and Backpressure
|
|
|
|
**Capability:** Route and Filter Domain Events (non-blocking delivery)
|
|
|
|
**Description:** Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).
|
|
|
|
**Success Condition:** Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.
|
|
|
|
---
|
|
|
|
#### Issue 2.8: [Rule] Implement non-blocking event delivery
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P1
|
|
|
|
**Title:** Ensure event publication never blocks
|
|
|
|
**User Story**
|
|
|
|
As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] Publish(namespace, event) returns immediately
|
|
- [ ] If subscriber channel full, event dropped (non-blocking)
|
|
- [ ] Dropped events counted in metrics
|
|
- [ ] Buffered channel size is 100 (tunable)
|
|
- [ ] Publisher never waits for subscriber
|
|
- [ ] Metrics alert on high drop rate
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Policy)
|
|
|
|
**Invariant:** Publishers not blocked by slow subscribers
|
|
|
|
**Implementation:**
|
|
- select { case ch <- event: ... default: ... }
|
|
- Count drops in default case
|
|
|
|
**Trade-off:**
|
|
- Pro: Publisher never blocks
|
|
- Con: Events may be lost if subscriber can't keep up
|
|
- Mitigation: Metrics alert on drops; subscriber can increase buffer or retry
|
|
|
|
**Technical Notes**
|
|
|
|
- Already implemented in eventbus.go (deliverToSubscriber)
|
|
- 100-event buffer is reasonable default
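
The policy above in code form, continuing the EventBus sketch from Issue 2.1; names are assumptions and the real deliverToSubscriber in eventbus.go may differ. Subscription channels are assumed to be created with `make(chan *Event, 100)`, matching the default buffer size.

```go
// deliverToSubscriber applies the subscriber's filter, then attempts a
// non-blocking send: a full buffer means the event is counted as dropped,
// never queued and never awaited.
func (b *EventBus) deliverToSubscriber(namespace string, sub *subscription, e *Event) {
	if !sub.filter.matches(e) { // nil filter matches everything
		return
	}
	select {
	case sub.ch <- e:
		b.metrics.RecordReceive(namespace)
	default:
		b.metrics.RecordDroppedEvent(namespace) // buffer full: drop and count
	}
}
```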
|
|
|
|
**Test Cases**
|
|
|
|
- Subscribe, receive 100 events: no drops
|
|
- Publish 101st event immediately: dropped
|
|
- Metrics show drop count
|
|
- Publisher latency < 1ms regardless of subscribers
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.1 (Publish)
|
|
|
|
---
|
|
|
|
#### Issue 2.9: [Documentation] Document EventBus backpressure handling
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Event Bus
|
|
**Priority:** P2
|
|
|
|
**Title:** Explain buffer management and recovery from drops
|
|
|
|
**User Story**
|
|
|
|
As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] Document buffer size (100 events default)
|
|
- [ ] Explain what happens on overflow (event dropped)
|
|
- [ ] Document recovery patterns (subscriber restarts, re-syncs)
|
|
- [ ] Example: Subscriber catches up from JetStream after restart
|
|
- [ ] Metrics to monitor (drop rate)
|
|
- [ ] README section on backpressure
|
|
|
|
**Bounded Context:** Event Bus
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** Documentation
|
|
|
|
**Content:**
|
|
- Buffer size and behavior
|
|
- Drop semantics
|
|
- Recovery patterns
|
|
- Metrics to monitor
|
|
- When to increase buffer size
|
|
|
|
**Examples:**
|
|
- Slow subscriber: increase buffer or fix handler
|
|
- Network latency: events may be dropped
|
|
- Handler panics: subscriber must restart and re-sync
|
|
|
|
**Technical Notes**
|
|
|
|
- Events are lost if dropped; only durable via JetStream
|
|
- Cross-node durability via NATS is addressed in Phase 4 (Namespace & NATS)
|
|
|
|
**Test Cases**
|
|
|
|
- Documentation is clear
|
|
- Examples work
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 2.8 (non-blocking delivery)
|
|
|
|
---
|
|
|
|
## Phase 3: Cluster Coordination
|
|
|
|
### Feature Set 3a: Cluster Topology and Leadership
|
|
|
|
**Capability:** Coordinate Cluster Topology
|
|
|
|
**Description:** Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.
|
|
|
|
**Success Condition:** Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.
|
|
|
|
---
|
|
|
|
#### Issue 3.1: [Command] Implement JoinCluster protocol
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P1
|
|
|
|
**Title:** Enable node discovery via cluster join
|
|
|
|
**User Story**
|
|
|
|
As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] JoinCluster() announces node via NATS
|
|
- [ ] Node info contains: NodeID, Address, Timestamp, Status
|
|
- [ ] Other nodes receive join announcement
|
|
- [ ] Cluster topology updated atomically
|
|
- [ ] Rejoining node detected and updated
|
|
- [ ] Tests verify multi-node discovery
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Command)
|
|
|
|
**Command:** JoinCluster()
|
|
|
|
**Aggregates:** Cluster (group of nodes)
|
|
|
|
**Events:** NodeJoined(nodeID, address, timestamp)
|
|
|
|
**Technical Notes**
|
|
|
|
- NATS subject: "aether.cluster.nodes"
|
|
- NodeDiscovery subscribes to announcements
|
|
- ClusterManager.Start() initiates join
|
|
|
|
**Test Cases**
|
|
|
|
- Single node joins: topology = [node-a]
|
|
- Second node joins: topology = [node-a, node-b]
|
|
- Third node joins: topology = [node-a, node-b, node-c]
|
|
- Node rejoins: updates existing entry
|
|
|
|
**Dependencies**
|
|
|
|
- None (first cluster feature)
|
|
|
|
---
|
|
|
|
#### Issue 3.2: [Command] Implement LeaderElection
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P0
|
|
|
|
**Title:** Elect single leader via NATS-based voting
|
|
|
|
**User Story**
|
|
|
|
As a cluster, I want one node to be elected leader so that it can coordinate shard assignments and rebalancing.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] LeaderElection holds election every HeartbeatInterval (5s)
|
|
- [ ] Nodes announce their own candidacy (no ballot counting; the first announcement wins)
|
|
- [ ] One leader elected per term
|
|
- [ ] Leader holds lease (TTL = 2 * HeartbeatInterval)
|
|
- [ ] All nodes converge on same leader
|
|
- [ ] Lease renewal happens automatically
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Command)
|
|
|
|
**Command:** ElectLeader()
|
|
|
|
**Aggregates:** LeadershipLease (time-bound authority)
|
|
|
|
**Events:** LeaderElected(leaderID, term, leaseExpiration)
|
|
|
|
**Technical Notes**
|
|
|
|
- NATS subject: "aether.cluster.election"
|
|
- Each node publishes heartbeat with NodeID, Timestamp
|
|
- First node to publish becomes leader
|
|
- Lease expires if no heartbeat for TTL
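
A sketch of the lease bookkeeping implied above (field names are assumptions): the most recent heartbeat holds authority until TTL = 2 × HeartbeatInterval elapses.

```go
// LeadershipLease is the time-bound authority described above.
type LeadershipLease struct {
	LeaderID  string
	Term      uint64
	RenewedAt time.Time     // last heartbeat from the current leader
	TTL       time.Duration // 2 * HeartbeatInterval
}

// Expired reports whether the lease has lapsed; after this, no node should
// act as leader until a new election completes.
func (l *LeadershipLease) Expired(now time.Time) bool {
	return now.Sub(l.RenewedAt) > l.TTL
}

// IsLeader is true only for the lease holder while the lease is live.
func (l *LeadershipLease) IsLeader(nodeID string, now time.Time) bool {
	return !l.Expired(now) && l.LeaderID == nodeID
}
```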
|
|
|
|
**Test Cases**
|
|
|
|
- Single node: elected immediately
|
|
- Three nodes: exactly one elected
|
|
- Leader dies: remaining nodes elect new leader within 2*interval
|
|
- Leader comes back: may or may not stay leader
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.1 (node discovery)
|
|
|
|
---
|
|
|
|
#### Issue 3.3: [Rule] Enforce single leader invariant
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P0
|
|
|
|
**Title:** Guarantee exactly one leader at any time
|
|
|
|
**User Story**
|
|
|
|
As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] At most one leader at any time (lease-based)
|
|
- [ ] If leader lease expires, no leader until re-election
|
|
- [ ] All nodes see same leader (or none)
|
|
- [ ] Tests verify invariant under various failure scenarios
|
|
- [ ] Split-brain is bounded by the lease TTL: a stale leader loses authority once its lease expires
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Invariant)
|
|
|
|
**Invariant:** At most one leader (enforced by lease TTL)
|
|
|
|
**Mechanism:**
|
|
- Leader publishes heartbeat every HeartbeatInterval
|
|
- Other nodes trust leader if heartbeat < HeartbeatInterval old
|
|
- If no heartbeat for 2*HeartbeatInterval, lease expired
|
|
- New election begins
|
|
|
|
**Technical Notes**
|
|
|
|
- Lease-based; not consensus-based (simpler)
|
|
- Allows temporary split-brain until lease expires
|
|
- Acceptable for Aether (eventual consistency)
|
|
|
|
**Test Cases**
|
|
|
|
- Simulate leader death: lease expires, new leader elected
|
|
- Simulate network partition: partition may have >1 leader until lease expires
|
|
- Verify no coordination during lease expiration
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.2 (leader election)
|
|
|
|
---
|
|
|
|
#### Issue 3.4: [Event] Publish LeaderElected on election
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P1
|
|
|
|
**Title:** Record leadership election outcomes
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] LeaderElected event published after successful election
|
|
- [ ] Event contains: LeaderID, Term, LeaseExpiration, Timestamp
|
|
- [ ] Metrics increment on election
|
|
- [ ] Helpful for debugging split-brain scenarios
|
|
- [ ] Track election frequency (ideally < 1 per minute)
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Event)
|
|
|
|
**Event:** LeaderElected(leaderID, term, leaseExpiration, timestamp)
|
|
|
|
**Triggered by:** Successful election
|
|
|
|
**Consumers:** Metrics, audit logs
|
|
|
|
**Technical Notes**
|
|
|
|
- Event published locally to all observers
|
|
- Infrastructure event (not domain event)
|
|
|
|
**Test Cases**
|
|
|
|
- Election happens: event published
|
|
- Term increments: event reflects new term
|
|
- Metrics accurate
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.2 (election)
|
|
|
|
---
|
|
|
|
#### Issue 3.5: [Event] Publish LeadershipLost on lease expiration
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P2
|
|
|
|
**Title:** Track leadership transitions
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] LeadershipLost event published when lease expires
|
|
- [ ] Event contains: PreviousLeaderID, Timestamp, Reason
|
|
- [ ] Metrics track leadership transitions
|
|
- [ ] Helpful for debugging cascading failures
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Event)
|
|
|
|
**Event:** LeadershipLost(previousLeaderID, timestamp, reason)
|
|
|
|
**Reason:** "lease_expired", "node_failed", etc.
|
|
|
|
**Technical Notes**
|
|
|
|
- Published when lease TTL expires
|
|
- Useful for observability
|
|
|
|
**Test Cases**
|
|
|
|
- Leader lease expires: LeadershipLost published
|
|
- Metrics show transition
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.2 (election)
|
|
|
|
---
|
|
|
|
#### Issue 3.6: [Read Model] Implement GetClusterTopology query
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P1
|
|
|
|
**Title:** Query current cluster members and status
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] GetNodes() returns map[nodeID]*NodeInfo
|
|
- [ ] NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
|
|
- [ ] Status is: Active, Degraded, Failed
|
|
- [ ] LastSeen is accurate heartbeat timestamp
|
|
- [ ] ShardIDs show shard ownership (filled in Phase 3b)
|
|
- [ ] Example: "node-a is active; node-b failed 30s ago"
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Query)
|
|
|
|
**Read Model:** ClusterTopology
|
|
|
|
**Data:**
|
|
- NodeID → NodeInfo (status, heartbeat, shards)
|
|
- LeaderID (current leader)
|
|
- Term (election term)
|
|
|
|
**Technical Notes**
|
|
|
|
- ClusterManager maintains topology in-memory
|
|
- Update on each heartbeat/announcement
|
|
|
|
**Test Cases**
|
|
|
|
- GetNodes() returns active nodes
|
|
- Status accurate (Active, Failed, etc.)
|
|
- LastSeen updates on heartbeat
|
|
- Rejoining node updates existing entry
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.1 (node discovery)
|
|
|
|
---
|
|
|
|
#### Issue 3.7: [Read Model] Implement GetLeader query
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P0
|
|
|
|
**Title:** Query current leader
|
|
|
|
**User Story**
|
|
|
|
As a client, I want to know who the leader is, so that I can route coordination requests to the right node.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] GetLeader() returns current leader NodeID or ""
|
|
- [ ] IsLeader() returns true if this node is leader
|
|
- [ ] Both consistent with LeaderElection state
|
|
- [ ] Updated immediately on election
|
|
- [ ] Example: "node-b is leader (term 5)"
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Query)
|
|
|
|
**Read Model:** LeadershipRegistry
|
|
|
|
**Data:** CurrentLeader, CurrentTerm, LeaseExpiration
|
|
|
|
**Implementation:**
|
|
- LeaderElection maintains this
|
|
- ClusterManager queries it
|
|
|
|
**Technical Notes**
|
|
|
|
- Critical for routing coordination work
|
|
- Must be consistent across cluster
|
|
|
|
**Test Cases**
|
|
|
|
- No leader: GetLeader returns ""
|
|
- Leader elected: GetLeader returns leader ID
|
|
- IsLeader true on leader, false on others
|
|
- Changes on re-election
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.2 (election)
|
|
|
|
---
|
|
|
|
### Feature Set 3b: Shard Distribution
|
|
|
|
**Capability:** Distribute Actors Across Cluster Nodes
|
|
|
|
**Description:** Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.
|
|
|
|
**Success Condition:** 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.
|
|
|
|
---
|
|
|
|
#### Issue 3.8: [Command] Implement consistent hash ring
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P1
|
|
|
|
**Title:** Distribute shards across nodes with minimal reshuffling
|
|
|
|
**User Story**
|
|
|
|
As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] ConsistentHashRing(numShards=1024) creates ring
|
|
- [ ] GetShard(actorID) returns consistent shard [0, 1024)
|
|
- [ ] AddNode(nodeID) rebalances ~numShards/numNodes shards
|
|
- [ ] RemoveNode(nodeID) rebalances shards evenly
|
|
- [ ] Same actor always maps to same shard
|
|
- [ ] Reshuffling < 40% on node add/remove
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Command)
|
|
|
|
**Command:** AssignShards(nodes)
|
|
|
|
**Aggregates:** ConsistentHashRing (distribution algorithm)
|
|
|
|
**Invariants:**
|
|
- Each shard [0, 1024) assigned to exactly one node
|
|
- ActorID hashes consistently to shard
|
|
- Topology changes minimize reassignment
|
|
|
|
**Technical Notes**
|
|
|
|
- hashring.go already implements this
|
|
- Use crypto/md5 or compatible hash
|
|
- 1024 shards is tunable (P1 default)
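
A sketch of the actor→shard step, assuming the md5-style hash mentioned above and a fixed shard count of 1024; shard→node placement is the ring's job, and only that mapping moves when topology changes.

```go
// GetShard maps an actor ID to one of numShards buckets; with the shard
// count fixed, the same actor always lands on the same shard no matter how
// many nodes are in the cluster.
func GetShard(actorID string, numShards uint32) uint32 {
	sum := md5.Sum([]byte(actorID))                     // crypto/md5, per the note above
	return binary.BigEndian.Uint32(sum[:4]) % numShards // encoding/binary
}
```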
|
|
|
|
**Test Cases**
|
|
|
|
- Single node: all shards assigned to it
|
|
- Two nodes: ~512 shards each
|
|
- Three nodes: ~341 shards each
|
|
- Add fourth node: ~256 shards each (~25% of shards reassigned)
|
|
- Remove node: remaining nodes rebalance evenly
|
|
- Same actor-id always hashes to same shard
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.1 (node discovery)
|
|
|
|
---
|
|
|
|
#### Issue 3.9: [Rule] Enforce single shard owner invariant
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P0
|
|
|
|
**Title:** Guarantee each shard has exactly one owner
|
|
|
|
**User Story**
|
|
|
|
As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] ShardMap tracks shard → nodeID assignment
|
|
- [ ] No shard is unassigned (every shard has owner)
|
|
- [ ] No shard assigned to multiple nodes
|
|
- [ ] Reassignment is atomic (no in-between state)
|
|
- [ ] Tests verify invariant after topology changes
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Invariant)
|
|
|
|
**Invariant:** Each shard [0, 1024) assigned to exactly one active node
|
|
|
|
**Mechanism:**
|
|
- ShardMap[shardID] = [nodeID]
|
|
- Maintained by leader
|
|
- Updated atomically on rebalancing
|
|
|
|
**Technical Notes**
|
|
|
|
- shard.go implements ShardManager
|
|
- Validated after each rebalancing
|
|
|
|
**Test Cases**
|
|
|
|
- After rebalancing: all shards assigned
|
|
- No orphaned shards
|
|
- No multiply-assigned shards
|
|
- Reassignment is atomic
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.8 (consistent hashing)
|
|
|
|
---
|
|
|
|
#### Issue 3.10: [Event] Publish ShardAssigned on assignment
|
|
|
|
**Type:** New Feature
|
|
**Bounded Context:** Cluster Coordination
|
|
**Priority:** P2
|
|
|
|
**Title:** Track shard-to-node assignments
|
|
|
|
**User Story**
|
|
|
|
As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.
|
|
|
|
**Acceptance Criteria**
|
|
|
|
- [ ] ShardAssigned event published after assignment
|
|
- [ ] Event contains: ShardID, NodeID, Timestamp
|
|
- [ ] Metrics track: shards per node, rebalancing frequency
|
|
- [ ] Example: Shard 42 assigned to node-b
|
|
|
|
**Bounded Context:** Cluster Coordination
|
|
|
|
**DDD Implementation Guidance**
|
|
|
|
**Type:** New Feature (Event)
|
|
|
|
**Event:** ShardAssigned(shardID, nodeID, timestamp)
|
|
|
|
**Triggered by:** AssignShards command succeeds
|
|
|
|
**Metrics:** Shards per node, distribution evenness
|
|
|
|
**Technical Notes**
|
|
|
|
- Infrastructure event
|
|
- Useful for monitoring load distribution
|
|
|
|
**Test Cases**
|
|
|
|
- Assignment published on rebalancing
|
|
- Metrics reflect distribution
|
|
|
|
**Dependencies**
|
|
|
|
- Depends on: Issue 3.9 (shard ownership)
|
|
|
|
---
|
|
|
|
#### Issue 3.11: [Read Model] Implement GetShardAssignments query

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1

**Title:** Query shard-to-node mapping

**User Story**

As a client, I want to know which node owns a shard, so that I can route actor requests correctly.

**Acceptance Criteria**

- [ ] GetShardAssignments() returns ShardMap
- [ ] ShardMap[shardID] returns owning nodeID
- [ ] GetShard(actorID) returns shard for actor
- [ ] Routing decision: actorID → shard → nodeID
- [ ] Cached locally; refreshed on each rebalancing

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Query)

**Read Model:** ShardMap

**Data:**
- ShardID → NodeID (primary owner)
- Version (incremented on rebalancing)
- UpdateTime

**Implementation:**
- ClusterManager.GetShardMap()
- Cached; updated on assignment changes

**Technical Notes**

- Critical for routing (see the routing sketch below)
- Must be consistent across cluster
- Version helps detect stale caches

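A minimal Go sketch of the routing decision, assuming shards are derived by hashing the actor ID modulo the shard count and that the read model is a cached `shardID → nodeID` snapshot. The hash choice (FNV-1a) and function names are assumptions for illustration; the ShardMap version field would be used to detect and refresh a stale cache after rebalancing.

```go
// Sketch only: routing an actor request via actorID -> shard -> owning node.
package cluster

import (
	"fmt"
	"hash/fnv"
)

const numShards = 1024 // assumed shard count

// GetShard maps an actor ID to its shard. FNV-1a modulo the shard count is an
// assumption; the real implementation may use a different hash.
func GetShard(actorID string) int {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return int(h.Sum32() % numShards)
}

// RouteActor resolves actorID -> shard -> owning node using a shard map
// snapshot (shardID -> nodeID) obtained from GetShardAssignments.
func RouteActor(actorID string, shardMap map[int]string) (string, error) {
	shard := GetShard(actorID)
	node, ok := shardMap[shard]
	if !ok {
		return "", fmt.Errorf("shard %d has no owner in the cached map", shard)
	}
	return node, nil
}
```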
**Test Cases**

- GetShardAssignments returns current map
- GetShard(actorID) returns consistent shard
- Routing: actor ID → shard → node owner

**Dependencies**

- Depends on: Issue 3.9 (shard ownership)

---
### Feature Set 3c: Failure Detection and Recovery

**Capability:** Recover from Node Failures

**Description:** Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.

**Success Condition:** Node dies → failure detected once the 90s heartbeat timeout expires → shards reassigned → actors replay automatically.

---
#### Issue 3.12: [Command] Implement node health checks

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P1

**Title:** Detect node failures via heartbeat timeout

**User Story**

As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.

**Acceptance Criteria**

- [ ] Each node publishes heartbeat every 30s
- [ ] Nodes without heartbeat for 90s marked as Failed
- [ ] checkNodeHealth() runs every 30s
- [ ] Failed node's status updates atomically
- [ ] Tests verify failure detection timing
- [ ] Failed node can rejoin cluster

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)

**Command:** MarkNodeFailed(nodeID)

**Trigger:** monitorNodes detects missing heartbeat

**Events:** NodeFailed(nodeID, lastSeenTimestamp)

**Technical Notes**

- monitorNodes() loop in manager.go (see the sketch below)
- Check LastSeen timestamp
- Update status if stale (>90s)

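A minimal sketch of the health-check pass, assuming a `NodeInfo` record with `Status` and `LastSeen` fields and a mutex-guarded node registry. Type and method names are illustrative, not the actual `manager.go` types; the IDs returned from a pass would feed `MarkNodeFailed` and the `NodeFailed` event.

```go
// Sketch only: periodic heartbeat-timeout detection.
package cluster

import (
	"sync"
	"time"
)

const (
	heartbeatInterval = 30 * time.Second
	failureTimeout    = 90 * time.Second
)

type NodeStatus string

const (
	StatusActive NodeStatus = "Active"
	StatusFailed NodeStatus = "Failed"
)

type NodeInfo struct {
	ID       string
	Status   NodeStatus
	LastSeen time.Time
}

type Manager struct {
	mu    sync.Mutex
	nodes map[string]*NodeInfo
}

// checkNodeHealth marks any node whose heartbeat is older than the failure
// timeout as Failed, and returns the IDs that changed state on this pass.
func (m *Manager) checkNodeHealth(now time.Time) []string {
	m.mu.Lock()
	defer m.mu.Unlock()

	var failed []string
	for id, n := range m.nodes {
		if n.Status == StatusActive && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = StatusFailed
			failed = append(failed, id)
		}
	}
	return failed
}

// monitorNodes runs the check on the heartbeat interval until stop is closed.
func (m *Manager) monitorNodes(stop <-chan struct{}) {
	t := time.NewTicker(heartbeatInterval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case now := <-t.C:
			_ = m.checkNodeHealth(now) // failed IDs would trigger MarkNodeFailed / NodeFailed
		}
	}
}
```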
**Test Cases**

- Active node: status stays Active
- No heartbeat for 90s: status → Failed
- Rejoin: status → Active
- Failure detected within 90-120s of the last heartbeat (90s timeout plus up to one 30s check interval)

**Dependencies**

- Depends on: Issue 3.1 (node discovery)

---
#### Issue 3.13: [Command] Implement RebalanceShards after node failure

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0

**Title:** Reassign failed node's shards to healthy nodes

**User Story**

As the cluster, I want to reassign a failed node's shards automatically, so that actors are available on new nodes.

**Acceptance Criteria**

- [ ] Leader detects node failure
- [ ] Leader triggers RebalanceShards
- [ ] Failed node's shards reassigned evenly
- [ ] No shard left orphaned
- [ ] ShardMap updated atomically
- [ ] Rebalancing completes within 5 seconds

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Command)

**Command:** RebalanceShards(failedNodeID)

**Aggregates:** ShardMap, ConsistentHashRing

**Events:** RebalanceStarted, ShardMigrated

**Technical Notes**

- Leader only (IsLeader() check)
- Use consistent hashing to assign (see the sketch below)
- Calculate new assignments atomically

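A sketch of the reassignment step follows. For brevity it redistributes the failed node's shards round-robin across the survivors; the real implementation would instead consult the consistent-hash ring from Issue 3.8. Only the leader should run this, and the returned map would be validated (Issue 3.14) and committed atomically (Issue 3.9). All names here are illustrative assumptions.

```go
// Sketch only: recompute ownership after a node failure.
package cluster

import "fmt"

// RebalanceShards returns a new shard map with every shard owned by the
// failed node reassigned to one of the surviving nodes.
func RebalanceShards(current map[int]string, failedNode string, survivors []string) (map[int]string, error) {
	if len(survivors) == 0 {
		return nil, fmt.Errorf("no surviving nodes to take over shards")
	}
	next := make(map[int]string, len(current))
	i := 0
	for shard, owner := range current {
		if owner == failedNode {
			// Round-robin stand-in for the consistent-hash assignment.
			next[shard] = survivors[i%len(survivors)]
			i++
		} else {
			next[shard] = owner
		}
	}
	return next, nil
}
```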
**Test Cases**

- Node-a fails with shards [1, 2, 3]
- Leader reassigns [1, 2, 3] to remaining nodes
- No orphaned shards
- Rebalancing < 5s

**Dependencies**

- Depends on: Issue 3.8 (consistent hashing)
- Depends on: Issue 3.12 (failure detection)

---
#### Issue 3.14: [Rule] Enforce no-orphan invariant

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P0

**Title:** Guarantee all shards have owners after rebalancing

**User Story**

As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.

**Acceptance Criteria**

- [ ] Before rebalancing: verify no orphaned shards
- [ ] After rebalancing: verify all shards assigned
- [ ] Tests fail if invariant violated
- [ ] Rebalancing aborted if invariant would be violated

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)

**Invariant:** All shards in [0, 1024) have owners after any rebalancing

**Check:**
- Count assigned shards
- Verify count = 1024
- Abort the rebalancing if not

**Technical Notes**

- Validate before committing the ShardMap (see the check sketched below)
- On violation: log the error and abort the rebalancing rather than panicking (graceful degradation)

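A minimal sketch of the no-orphan check run before a new shard map is committed. The map shape (`shardID → nodeID`) and the 1024 shard count follow the invariant above; the function name is an illustrative assumption. The caller aborts the rebalancing if an error is returned.

```go
// Sketch only: verify every shard in [0, numShards) has an owner.
package cluster

import "fmt"

const numShards = 1024

// ValidateNoOrphans confirms that every shard has a non-empty owner entry.
func ValidateNoOrphans(shardMap map[int]string) error {
	var orphans []int
	for shard := 0; shard < numShards; shard++ {
		if owner, ok := shardMap[shard]; !ok || owner == "" {
			orphans = append(orphans, shard)
		}
	}
	if len(orphans) > 0 {
		return fmt.Errorf("%d orphaned shards (e.g. shard %d); aborting rebalancing", len(orphans), orphans[0])
	}
	return nil
}
```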
**Test Cases**

- Rebalancing completes: all shards assigned
- Orphaned shard detected: rebalancing rolled back
- Tests verify count = 1024

**Dependencies**

- Depends on: Issue 3.13 (rebalancing)

---
#### Issue 3.15: [Event] Publish NodeFailed on failure detection

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2

**Title:** Record node failure for observability

**User Story**

As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.

**Acceptance Criteria**

- [ ] NodeFailed event published when failure detected
- [ ] Event contains: NodeID, LastSeenTimestamp, AffectedShards
- [ ] Metrics track failure frequency
- [ ] Example: "node-a failed; 341 shards affected"

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)

**Event:** NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)

**Triggered by:** checkNodeHealth marks node failed

**Consumers:** Metrics, alerts, audit logs

**Technical Notes**

- Infrastructure event
- AffectedShards helps assess impact

**Test Cases**

- Node failure detected: event published
- Metrics show affected shard count

**Dependencies**

- Depends on: Issue 3.12 (failure detection)

---
#### Issue 3.16: [Event] Publish ShardMigrated on shard movement

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2

**Title:** Track shard migrations

**User Story**

As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.

**Acceptance Criteria**

- [ ] ShardMigrated event published on each shard movement
- [ ] Event contains: ShardID, FromNodeID, ToNodeID, Status
- [ ] Status: "Started", "InProgress", "Completed", "Failed"
- [ ] Metrics track migration count and duration
- [ ] Example: "Shard 42 migrated from node-a to node-b (2.3s)"

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** New Feature (Event)

**Event:** ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)

**Status:** Started → InProgress → Completed (or Failed on error)

**Consumers:** Metrics, progress tracking

**Technical Notes**

- Published for each shard move
- Helps track rebalancing progress
- Useful for SLO monitoring

**Test Cases**

- Shard moves: event published
- Metrics track duration
- Status transitions correct

**Dependencies**

- Depends on: Issue 3.13 (rebalancing)

---
#### Issue 3.17: [Documentation] Document actor migration and replay

**Type:** New Feature
**Bounded Context:** Cluster Coordination
**Priority:** P2

**Title:** Explain how actors move and recover state

**User Story**

As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.

**Acceptance Criteria**

- [ ] Design doc: cluster/ACTOR_MIGRATION.md
- [ ] Explain shard reassignment process
- [ ] Explain state rebuild via GetEvents + replay
- [ ] Explain snapshot optimization
- [ ] Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
- [ ] Explain out-of-order message handling

**Bounded Context:** Cluster Coordination

**DDD Implementation Guidance**

**Type:** Documentation

**Content:**
- Shard assignment (consistent hashing)
- Actor discovery (routing via shard map)
- State rebuild (replay from JetStream)
- Snapshots (optional optimization)
- In-flight messages (may arrive before replay completes)

**Examples:**
- Manual failover: reassign shards manually
- Auto failover: leader initiates on failure detection

**Technical Notes**

- Complex topic; good documentation prevents bugs (a replay sketch is shown below)

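A sketch of the state-rebuild step the document should explain: on its new node, an actor is reconstructed by folding its stored events, in version order, into an application-supplied apply function. The `Event` shape and `EventStore` interface here are illustrative stand-ins for the Phase 1 types; a snapshot optimization would simply start the fold from the snapshot's state and version instead of the beginning.

```go
// Sketch only: rebuild actor state by replaying its event history.
package cluster

import "context"

type Event struct {
	ActorID string
	Version int64
	Type    string
	Data    []byte
}

type EventStore interface {
	// GetEvents returns the actor's events in version order.
	GetEvents(ctx context.Context, actorID string) ([]Event, error)
}

// ReplayActor folds every stored event into the apply function and returns
// the rebuilt state plus the last applied version.
func ReplayActor[S any](ctx context.Context, store EventStore, actorID string, initial S, apply func(S, Event) S) (S, int64, error) {
	events, err := store.GetEvents(ctx, actorID)
	if err != nil {
		return initial, 0, err
	}
	state := initial
	var version int64
	for _, e := range events {
		state = apply(state, e)
		version = e.Version
	}
	return state, version, nil
}
```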
**Test Cases**

- Documentation is clear
- Examples correct

**Dependencies**

- Depends on: Issue 3.13 (rebalancing)
- Depends on: Phase 1 (event replay)

---
## Phase 4: Namespace Isolation and NATS Event Delivery

### Feature Set 4a: Namespace Storage Isolation

**Capability:** Isolate Logical Domains Using Namespaces

**Description:** Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at the persistence layer.

**Success Condition:** Two stores with namespaces "tenant-a", "tenant-b"; an event saved in "tenant-a" is invisible to "tenant-b" queries.

---
#### Issue 4.1: [Rule] Enforce namespace-based stream naming

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1

**Title:** Use namespace prefixes in JetStream stream names

**User Story**

As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.

**Acceptance Criteria**

- [ ] Namespace "tenant-a" → stream "tenant-a_events"
- [ ] Namespace "tenant-b" → stream "tenant-b_events"
- [ ] Empty namespace → stream "events" (default)
- [ ] JetStreamConfig.Namespace sets prefix
- [ ] NewJetStreamEventStoreWithNamespace convenience function
- [ ] Tests verify stream names have namespace prefix

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Configuration)

**Value Object:** Namespace (string identifier)

**Implementation:**
- JetStreamConfig.Namespace field
- StreamName = namespace + "_events" if namespace set
- StreamName = "events" if namespace empty

**Technical Notes**

- Already partially implemented in jetstream.go
- Ensure safe characters (sanitize spaces, dots, wildcards); see the naming sketch below

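A minimal sketch of the stream-name derivation, assuming spaces, dots, and wildcard characters are replaced with underscores; the exact sanitization in `jetstream.go` may differ, and the function name is illustrative. For example, `StreamNameForNamespace("tenant-a")` yields `"tenant-a_events"` and `StreamNameForNamespace("")` falls back to `"events"`.

```go
// Sketch only: derive a JetStream stream name from a namespace.
package eventstore

import "strings"

// StreamNameForNamespace returns "<namespace>_events", or the default
// "events" stream when the namespace is empty.
func StreamNameForNamespace(namespace string) string {
	if namespace == "" {
		return "events"
	}
	sanitized := strings.Map(func(r rune) rune {
		switch r {
		case ' ', '.', '*', '>':
			return '_' // unsafe characters for stream names (assumed rule)
		default:
			return r
		}
	}, namespace)
	return sanitized + "_events"
}
```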
**Test Cases**

- NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
- NewJetStreamEventStoreWithNamespace(""): creates stream "events"
- Stream name verified

**Dependencies**

- None (orthogonal to other contexts)

---
#### Issue 4.2: [Rule] Enforce storage-level namespace isolation

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P0

**Title:** Prevent cross-namespace data leakage at storage layer

**User Story**

As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.

**Acceptance Criteria**

- [ ] SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
- [ ] GetEvents("tenant-a") queries "tenant-a_events" stream only
- [ ] No possibility of accidental cross-namespace leakage
- [ ] JetStream subject filtering enforces isolation
- [ ] Integration tests verify with multiple namespaces

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)

**Invariant:** Events from namespace X are invisible to namespace Y

**Mechanism:**
- Separate JetStream streams per namespace
- Subject prefixing: "tenant-a.events.actor-123"
- Subscribe filters by subject prefix (see the sketch below)

**Technical Notes**

- jetstream.go: SubscribeToActorEvents uses subject prefix
- Consumer created with subject filter matching namespace

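A small sketch of the subject-prefixing mechanism, following the "tenant-a.events.actor-123" convention above. The helper names are illustrative assumptions; the point is that a consumer created with a namespace-scoped filter can never match subjects carrying another namespace's prefix, which is what enforces isolation at the storage layer.

```go
// Sketch only: per-namespace subjects and the consumer filter that scopes them.
package eventstore

import "fmt"

// EventSubject returns the subject an actor's events are published on within
// a namespace, e.g. "tenant-a.events.actor-123".
func EventSubject(namespace, actorID string) string {
	return fmt.Sprintf("%s.events.%s", namespace, actorID)
}

// NamespaceFilter returns the subject filter a namespace-scoped consumer is
// created with, e.g. "tenant-a.events.>" (NATS multi-token wildcard).
func NamespaceFilter(namespace string) string {
	return fmt.Sprintf("%s.events.>", namespace)
}
```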
**Test Cases**

- SaveEvent to tenant-a: visible in tenant-a queries
- Same event invisible to tenant-b queries
- GetLatestVersion scoped to namespace
- GetEvents scoped to namespace
- Multi-namespace integration test

**Dependencies**

- Depends on: Issue 4.1 (stream naming)

---
#### Issue 4.3: [Documentation] Document namespace design patterns

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P1

**Title:** Provide guidance on namespace naming and use

**User Story**

As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.

**Acceptance Criteria**

- [ ] Design doc: NAMESPACE_DESIGN_PATTERNS.md
- [ ] Pattern 1: "tenant-{id}" (per-customer)
- [ ] Pattern 2: "env.domain" (per-env, per-bounded-context)
- [ ] Pattern 3: "env.domain.customer" (most granular)
- [ ] Examples of each pattern
- [ ] Guidance on choosing granularity
- [ ] Anti-patterns (wildcards, spaces, dots)

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** Documentation

**Content:**
- Multi-tenant patterns
- Granularity decisions
- Namespace naming rules
- Examples
- Anti-patterns
- Performance implications

**Examples:**
- SaaS: "tenant-uuid"
- Microservices: "service.orders"
- Complex: "env.service.tenant"

**Technical Notes**

- No hard restrictions; naming is flexible
- Sanitization (spaces → underscores)

**Test Cases**

- Documentation is clear
- Examples valid

**Dependencies**

- Depends on: Issue 4.1 (stream naming)

---
#### Issue 4.4: [Validation] Add namespace format validation

**Type:** New Feature
**Bounded Context:** Namespace Isolation
**Priority:** P2

**Title:** Validate namespace names to prevent invalid streams

**User Story**

As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.

**Acceptance Criteria**

- [ ] ValidateNamespace(ns string) returns error for invalid names
- [ ] Rejects: "tenant-*", "tenant a", "tenant."
- [ ] Accepts: "tenant-abc", "prod.orders", "tenant_123"
- [ ] Called on NewJetStreamEventStoreWithNamespace
- [ ] Clear error messages
- [ ] Tests verify validation rules

**Bounded Context:** Namespace Isolation

**DDD Implementation Guidance**

**Type:** New Feature (Validation)

**Validation Rules:**
- No wildcards (*, >)
- No spaces
- No leading/trailing dots
- Alphanumeric, hyphens, underscores, dots only

**Implementation:**
- ValidateNamespace regex (see the sketch below)
- Called before stream creation

**Technical Notes**

- Nice-to-have; currently strings are accepted as-is
- Could sanitize instead of rejecting (replace spaces with underscores)

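A possible regex-based implementation of the rules above (letters, digits, hyphens, underscores, dot-separated segments; no wildcards, spaces, or leading/trailing dots). This is a sketch under those assumed rules, not the final validation logic; for example, `ValidateNamespace("prod.orders")` returns nil while `ValidateNamespace("tenant-*")` returns an error.

```go
// Sketch only: namespace format validation.
package eventstore

import (
	"fmt"
	"regexp"
)

// Dot-separated segments of letters, digits, '-' and '_'; no empty segments,
// so leading/trailing dots are rejected along with spaces and wildcards.
var namespacePattern = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace returns a descriptive error for names that would produce
// invalid stream names or subjects.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil // empty namespace falls back to the default "events" stream
	}
	if !namespacePattern.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: use letters, digits, '-', '_' and dot-separated segments; no spaces, wildcards, or leading/trailing dots", ns)
	}
	return nil
}
```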
**Test Cases**

- Valid: "tenant-abc", "prod.orders"
- Invalid: "tenant-*", "tenant a", ".prod"
- Error messages clear

**Dependencies**

- Depends on: Issue 4.1 (stream naming)

---
### Feature Set 4b: Cross-Node Event Delivery via NATS

**Capability:** Deliver Events Across Cluster Nodes

**Description:** Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.

**Success Condition:** Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).

---
#### Issue 4.5: [Command] Implement NATSEventBus wrapper

**Type:** New Feature
**Bounded Context:** Event Bus (with NATS)
**Priority:** P1

**Title:** Extend EventBus with NATS-native pub/sub

**User Story**

As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.

**Acceptance Criteria**

- [ ] NATSEventBus embeds EventBus
- [ ] Publish(namespace, event) sends to local EventBus AND NATS
- [ ] NATS subject: "aether.events.{namespace}"
- [ ] SubscribeWithFilter works across nodes
- [ ] Self-published events not re-delivered (avoid loops)
- [ ] Tests verify cross-node delivery

**Bounded Context:** Event Bus (NATS extension)

**DDD Implementation Guidance**

**Type:** New Feature (Extension)

**Aggregate:** EventBus extended with NATSEventBus

**Commands:** Publish(namespace, event) [same interface, distributed]

**Implementation:**
- NATSEventBus composes EventBus
- Override Publish to also publish to NATS
- Subscribe to NATS subjects matching namespace

**Technical Notes**

- nats_eventbus.go already partially implemented
- NATS subject: "aether.events.orders" for namespace "orders"
- Include sourceNodeID in event to prevent redelivery (see the sketch below)

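The following sketch shows the fan-out and self-loop guard, assuming the `github.com/nats-io/nats.go` client. The `LocalBus` interface, `Event` type, and envelope shape are illustrative stand-ins for the existing EventBus types in `nats_eventbus.go`, not its actual API.

```go
// Sketch only: publish to the local bus and to NATS; drop self-published
// messages on receipt so they are not delivered twice.
package eventbus

import (
	"encoding/json"
	"fmt"

	"github.com/nats-io/nats.go"
)

type Event struct {
	ID   string          `json:"id"`
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

type LocalBus interface {
	Publish(namespace string, event Event)
}

type envelope struct {
	SourceNodeID string `json:"source_node_id"`
	Event        Event  `json:"event"`
}

type NATSEventBus struct {
	local  LocalBus
	nc     *nats.Conn
	nodeID string
}

func subjectFor(namespace string) string {
	return fmt.Sprintf("aether.events.%s", namespace)
}

// Publish delivers to local subscribers immediately and forwards the event to
// the rest of the cluster over NATS, tagged with this node's ID.
func (b *NATSEventBus) Publish(namespace string, event Event) error {
	b.local.Publish(namespace, event)
	data, err := json.Marshal(envelope{SourceNodeID: b.nodeID, Event: event})
	if err != nil {
		return err
	}
	return b.nc.Publish(subjectFor(namespace), data)
}

// listen feeds remote events into the local bus, skipping messages this node
// published itself.
func (b *NATSEventBus) listen(namespace string) (*nats.Subscription, error) {
	return b.nc.Subscribe(subjectFor(namespace), func(msg *nats.Msg) {
		var env envelope
		if err := json.Unmarshal(msg.Data, &env); err != nil {
			return // malformed message; real code would log this
		}
		if env.SourceNodeID == b.nodeID {
			return // self-published; already delivered locally
		}
		b.local.Publish(namespace, env.Event)
	})
}
```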
**Test Cases**

- Publish on node-a: local subscribers on node-a receive
- Same publish: node-b subscribers receive via NATS
- Self-loop prevented: node-a doesn't re-receive own publish
- Multi-node: all nodes converge on same events

**Dependencies**

- Depends on: Issue 2.1 (EventBus.Publish)
- Depends on: Issue 3.1 (cluster setup for multi-node tests)

---
#### Issue 4.6: [Rule] Enforce exactly-once delivery across cluster

**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1

**Title:** Guarantee events delivered to all cluster subscribers

**User Story**

As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.

**Acceptance Criteria**

- [ ] Event published to NATS with JetStream consumer
- [ ] Consumer acknowledges delivery
- [ ] Redelivery on network failure (JetStream handles)
- [ ] No duplicate delivery to same subscriber
- [ ] All nodes see same events in same order

**Bounded Context:** Event Bus (NATS)

**DDD Implementation Guidance**

**Type:** New Feature (Invariant)

**Invariant:** Exactly-once delivery to each subscriber

**Mechanism:**
- JetStream consumer per subscriber group
- Acknowledgment on delivery
- Automatic redelivery on timeout

**Technical Notes**

- JetStream handles durability and ordering
- Consumer name = subscriber ID
- Push consumer model (events pushed to subscriber)

**Test Cases**

- Publish event: all subscribers receive once
- Network failure: redelivery after timeout
- No duplicates on subscriber
- Order preserved across nodes

**Dependencies**

- Depends on: Issue 4.5 (NATSEventBus)

---
#### Issue 4.7: [Event] Publish EventPublished (via NATS)

**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P2

**Title:** Route published events to NATS subjects

**User Story**

As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.

**Acceptance Criteria**

- [ ] EventPublished event published to NATS
- [ ] Subject: "aether.events.{namespace}.published"
- [ ] Message contains: eventID, timestamp, sourceNodeID
- [ ] Metrics track: events published, delivered, dropped
- [ ] Helps identify partition/latency issues

**Bounded Context:** Event Bus (NATS)

**DDD Implementation Guidance**

**Type:** New Feature (Event)

**Event:** EventPublished (infrastructure)

**Subject:** aether.events.{namespace}.published

**Consumers:** Metrics, monitoring

**Technical Notes**

- Published after NATS publish succeeds
- Separate from local EventPublished (for clarity)

**Test Cases**

- Publish event: EventPublished message on NATS
- Metrics count delivery
- Cross-node visibility works

**Dependencies**

- Depends on: Issue 4.5 (NATSEventBus)

---
#### Issue 4.8: [Read Model] Implement cross-node subscription

**Type:** New Feature
**Bounded Context:** Event Bus (NATS)
**Priority:** P1

**Title:** Receive events from other nodes via NATS

**User Story**

As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.

**Acceptance Criteria**

- [ ] NATSEventBus.Subscribe(namespace) receives local + NATS events
- [ ] SubscribeWithFilter works with NATS
- [ ] Events from local node: delivered via local EventBus
- [ ] Events from remote nodes: delivered via NATS consumer
- [ ] Subscriber sees unified stream (no duplication)

**Bounded Context:** Event Bus (NATS)

**DDD Implementation Guidance**

**Type:** New Feature (Query/Subscription)

**Read Model:** UnifiedEventStream (local + remote)

**Implementation:**
- Subscribe creates local channel
- NATSEventBus subscribes to NATS subject
- Both feed into subscriber channel

**Technical Notes**

- Unified view is transparent to the subscriber (see the fan-in sketch below)
- No need to know if an event is local or remote

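A minimal fan-in sketch for the unified stream: local and remote events are merged into the single channel a subscriber consumes. `Event` is a placeholder type and the channel shapes are assumptions; the no-duplication guarantee comes from the self-loop guard in Issue 4.5, which ensures an event only ever arrives on one of the two sources.

```go
// Sketch only: merge local and remote event channels into one subscriber stream.
package eventbus

import "sync"

type Event struct {
	ID   string
	Type string
}

// mergeStreams fans local and remote events into one channel, closing the
// output once both sources are exhausted.
func mergeStreams(local, remote <-chan Event) <-chan Event {
	out := make(chan Event)
	var wg sync.WaitGroup
	forward := func(in <-chan Event) {
		defer wg.Done()
		for e := range in {
			out <- e
		}
	}
	wg.Add(2)
	go forward(local)
	go forward(remote)
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}
```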
**Test Cases**

- Subscribe to namespace: receive local events
- Subscribe to namespace: receive remote events
- Filter works across both sources
- No duplication

**Dependencies**

- Depends on: Issue 4.5 (NATSEventBus)

---
## Summary

This backlog contains **67 executable issues** across **5 bounded contexts** organized into **4 implementation phases**. Each issue:

- Is decomposed using DDD-informed order (commands → rules → events → reads)
- References domain concepts (aggregates, commands, events, value objects)
- Includes acceptance criteria (testable, specific)
- States dependencies (enabling parallel work)
- Is sized to 1-3 days of work

**Recommended Build Order:**

1. **Phase 1** (17 issues): Event Sourcing Foundation - everything depends on this
2. **Phase 2** (9 issues): Local Event Bus - enables observability before clustering
3. **Phase 3** (20 issues): Cluster Coordination - enables distributed deployment
4. **Phase 4** (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery

**Total Scope:** 67 issues at 1-3 days each (conservative estimate: 10-15 dev-weeks for a small team)

---

## Next Steps

1. Create Gitea issues from this backlog
2. Assign to team members
3. Set up dependency tracking in Gitea
4. Use `/spawn-issues` skill to parallelize implementation
5. Iterate on acceptance criteria with domain experts

See `/issue-writing` skill for proper issue formatting in Gitea.