Aether Executable Backlog

Built from: 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition

Date: 2026-01-12


Backlog Overview

This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.

Total Scope:

  • Capabilities: 9 (all complete)
  • Features: 14
  • Issues: 67
  • Contexts: 5
  • Implementation Phases: 4

Build Order (by value and dependencies):

  1. Phase 1: Event Sourcing Foundation (Capabilities 1-3)

    • Issues: 17
    • Enables all other work
  2. Phase 2: Local Event Bus (Capability 8)

    • Issues: 9
    • Enables local pub/sub before clustering
  3. Phase 3: Cluster Coordination (Capabilities 5-7)

    • Issues: 20
    • Enables distributed deployment
  4. Phase 4: Namespace & NATS (Capabilities 4, 9)

    • Issues: 21
    • Enables multi-tenancy and cross-node delivery

Phase 1: Event Sourcing Foundation

Feature Set 1a: Event Storage with Version Conflict Detection

Capability: Store Events Durably with Conflict Detection

Description: Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.

Success Condition: Multiple writers attempt to update the same actor; the first wins, the others see a VersionConflictError with details; all successful writes land in the immutable history.


Issue 1.1: [Command] Implement SaveEvent with monotonic version validation

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely

User Story

As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.

Acceptance Criteria

  • SaveEvent accepts event with Version > current for actor
  • SaveEvent rejects event with Version <= current (returns VersionConflictError)
  • VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
  • First event for new actor must have Version > 0 (typically 1)
  • Version gaps are allowed (1, 3, 5 is valid)
  • Validation happens before persistence (fail-fast)
  • InMemoryEventStore and JetStreamEventStore both implement validation

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Core)

Aggregate: ActorEventStream (implicit; each actor has independent version sequence)

Command: SaveEvent(event)

Validation Rules:

  • If no events exist for actor: version must be > 0
  • If events exist: new version must be > latest version

Success Event: EventStored (published when SaveEvent succeeds)

Error Event: VersionConflict (triggered when version validation fails)

Technical Notes

  • Version validation is the core invariant; everything else depends on it
  • Use GetLatestVersion() to implement validation
  • No database-level locks; optimistic validation only
  • Conflict should fail in <1ms
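
A minimal in-memory sketch of this command follows; the type, field, and constructor names are illustrative assumptions, not the actual Aether implementation (the real definitions live in event.go and the store implementations).

```go
package sketch

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the Aether types referenced in this issue.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict on %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// InMemoryEventStore keeps per-actor event slices plus the latest version.
type InMemoryEventStore struct {
	mu            sync.Mutex
	events        map[string][]*Event
	latestVersion map[string]int64
}

func NewInMemoryEventStore() *InMemoryEventStore {
	return &InMemoryEventStore{
		events:        make(map[string][]*Event),
		latestVersion: make(map[string]int64),
	}
}

// SaveEvent enforces the monotonic-version invariant before appending.
func (s *InMemoryEventStore) SaveEvent(event *Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	current := s.latestVersion[event.ActorID] // zero value 0 when the actor is new
	if event.Version <= current {
		// Fail fast with full context; the caller owns any retry strategy.
		return &VersionConflictError{
			ActorID:          event.ActorID,
			AttemptedVersion: event.Version,
			CurrentVersion:   current,
		}
	}

	s.events[event.ActorID] = append(s.events[event.ActorID], event)
	s.latestVersion[event.ActorID] = event.Version
	return nil
}
```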

Test Cases

  • New actor, version 1: succeeds
  • Same actor, version 2 (after 1): succeeds
  • Same actor, version 2 (after 1, concurrent): second call fails
  • Same actor, version 1 (duplicate): fails
  • Same actor, version 0 or negative: fails
  • 100 concurrent writers: 1 succeeds, 99 fail with VersionConflictError

Dependencies

  • None (foundation)

Issue 1.2: [Rule] Enforce append-only and immutability invariants

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Enforce event immutability and append-only semantics

User Story

As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.

Acceptance Criteria

  • EventStore interface has no Update or Delete methods
  • Events cannot be modified after persistence
  • Replay of same events always produces same state
  • Corrupted events are reported (not silently skipped)
  • JetStream stream configuration prevents deletes (retention policy only)

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Core Invariant)

Aggregate: ActorEventStream

Invariant: Events are immutable; stream is append-only; no modifications to EventStore interface

Implementation:

  • Event struct has no Setters (only getters)
  • SaveEvent is the only public persistence method
  • JetStream streams configured with NoDelete policy

Technical Notes

  • This is enforced at interface level (no Update/Delete in EventStore)
  • JetStream configuration prevents accidental deletes
  • ReplayError allows visibility into corruption without losing good data

Test Cases

  • Attempt to modify Event.Data after creation: compile error (if immutable)
  • Attempt to call UpdateEvent: interface doesn't exist
  • JetStream stream created with correct retention policy
  • ReplayError captured when event unmarshaling fails

Dependencies

  • Depends on: Issue 1.1 (SaveEvent implementation)

Issue 1.3: [Event] Publish EventStored after successful save

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Emit EventStored event for persistence observability

User Story

As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).

Acceptance Criteria

  • EventStored event published after SaveEvent succeeds
  • EventStored contains: EventID, ActorID, Version, Timestamp
  • No EventStored published if SaveEvent fails
  • EventBus receives EventStored in same transaction context
  • Metrics increment for each EventStored

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature

Event: EventStored(eventID, actorID, version, timestamp)

Triggered by: Successful SaveEvent call

Consumers: Metrics collectors, projections, audit systems

Technical Notes

  • EventStored is an internal event (Aether infrastructure)
  • Published to local EventBus (see Phase 2 for cross-node)
  • Allows observability without coupling application code

Test Cases

  • Save event → EventStored published
  • Version conflict → no EventStored published
  • Multiple saves → multiple EventStored events in order

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)
  • Depends on: Phase 2, Issue 2.1 (EventBus.Publish)

Issue 1.4: [Event] Publish VersionConflict error with full context

Type: New Feature Bounded Context: Event Sourcing, Optimistic Concurrency Control Priority: P0

Title: Return detailed version conflict information for retry logic

User Story

As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).

Acceptance Criteria

  • VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
  • Error message is human-readable with all context
  • errors.Is(err, ErrVersionConflict) returns true for the sentinel check
  • errors.As(err, &versionErr) unpacks the error into a VersionConflictError
  • Application can read CurrentVersion to decide retry strategy

Bounded Context: Event Sourcing + OCC

DDD Implementation Guidance

Type: New Feature

Error Type: VersionConflictError (wraps ErrVersionConflict sentinel)

Data: ActorID, AttemptedVersion, CurrentVersion

Use: Application uses this to implement retry strategies

Technical Notes

  • Already implemented in /aether/event.go (VersionConflictError struct)
  • Document standard retry patterns in examples/
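
A caller-side sketch of the sentinel-plus-typed-error pattern this issue describes; only the names ErrVersionConflict and VersionConflictError are taken from the issue, the Unwrap wiring and helper below are illustrative.

```go
package sketch

import (
	"errors"
	"fmt"
	"log"
)

// Stand-ins for the names described in this issue; the real definitions
// live in /aether/event.go.
var ErrVersionConflict = errors.New("version conflict")

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict on %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// Unwrap ties the typed error to the sentinel so errors.Is works.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

// handleSaveError shows the caller-side checks from the acceptance criteria.
func handleSaveError(err error) (shouldRetry bool) {
	if err == nil {
		return false
	}
	if !errors.Is(err, ErrVersionConflict) {
		return false // some other storage failure; don't retry blindly
	}
	var conflict *VersionConflictError
	if errors.As(err, &conflict) {
		// CurrentVersion tells the caller what to rebase onto before retrying.
		log.Printf("lost write race on actor %s: attempted %d, current %d",
			conflict.ActorID, conflict.AttemptedVersion, conflict.CurrentVersion)
	}
	return true
}
```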

Test Cases

  • Conflict with detailed error: ActorID, versions present
  • Application reads CurrentVersion: succeeds
  • errors.Is(err, ErrVersionConflict): true
  • errors.As(err, &versionErr): works
  • Manual test: log the error, see all context

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Issue 1.5: [Read Model] Implement GetLatestVersion query

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Provide efficient version lookup for optimistic locking

User Story

As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.

Acceptance Criteria

  • GetLatestVersion(actorID) returns latest version or 0 if no events
  • Execution time is O(1) or O(log n), not O(n)
  • InMemoryEventStore implements with map lookup
  • JetStreamEventStore caches latest version per actor
  • Cache is invalidated after each SaveEvent
  • Multiple calls for same actor within 1s hit cache
  • Namespace isolation: GetLatestVersion scoped to namespace

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: ActorVersionIndex

Source Events: SaveEvent (updates cache)

Data: ActorID → LatestVersion

Performance: O(1) lookup after SaveEvent

Technical Notes

  • InMemoryEventStore: use map[actorID]int64
  • JetStreamEventStore: query JetStream metadata OR maintain cache
  • Cache invalidation: update after every SaveEvent
  • Thread-safe with RWMutex (read-heavy)
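
A minimal sketch of the read model described in these notes, using an illustrative ActorVersionIndex type rather than the actual store internals.

```go
package sketch

import "sync"

// ActorVersionIndex maps actorID -> latest version, guarded by an RWMutex
// because reads dominate writes.
type ActorVersionIndex struct {
	mu       sync.RWMutex
	versions map[string]int64
}

func NewActorVersionIndex() *ActorVersionIndex {
	return &ActorVersionIndex{versions: make(map[string]int64)}
}

// GetLatestVersion is O(1); an unknown actor reads as version 0.
func (i *ActorVersionIndex) GetLatestVersion(actorID string) int64 {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.versions[actorID]
}

// Record is called after every successful SaveEvent to keep the cache fresh.
func (i *ActorVersionIndex) Record(actorID string, version int64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if version > i.versions[actorID] {
		i.versions[actorID] = version
	}
}
```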

Test Cases

  • New actor: GetLatestVersion returns 0
  • After SaveEvent(version: 1): GetLatestVersion returns 1
  • After SaveEvent(version: 3): GetLatestVersion returns 3
  • Concurrent reads from same actor: all return consistent value
  • Namespace isolation: "tenant-a" and "tenant-b" have independent versions

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Feature Set 1b: State Rebuild from Event History

Capability: Rebuild State from Event History

Description: Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.

Success Condition: GetEvents(actorID, 0) returns all events in order; replaying produces consistent state every time; snapshots reduce replay time from O(n) to O(1).


Issue 1.6: [Command] Implement GetEvents for replay

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Load events from store for state replay

User Story

As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.

Acceptance Criteria

  • GetEvents(actorID, fromVersion) returns []*Event in version order
  • Events are ordered by version (ascending)
  • fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
  • If no events exist, returns empty slice (not error)
  • If actorID has no events >= fromVersion, returns empty slice
  • Namespace isolation: GetEvents scoped to namespace
  • Large result sets don't cause memory issues (stream if >10k events)

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Query)

Command: GetEvents(actorID, fromVersion)

Returns: []*Event ordered by version

Invariant: Order is deterministic (version order always)

Technical Notes

  • InMemoryEventStore: filter and sort by version
  • JetStreamEventStore: query JetStream subject and order results
  • Consider pagination for very large actor histories
  • fromVersion=0 means "start from beginning"
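
Extending the hypothetical in-memory store sketched under Issue 1.1 (with "sort" added to its imports), one possible GetEvents that keeps the inclusive fromVersion and empty-result semantics:

```go
// GetEvents returns the actor's events at or after fromVersion, in ascending
// version order. An unknown actor yields an empty result, not an error.
// Sorting makes version order explicit even if events were appended out of order.
func (s *InMemoryEventStore) GetEvents(actorID string, fromVersion int64) ([]*Event, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	result := make([]*Event, 0)
	for _, e := range s.events[actorID] {
		if e.Version >= fromVersion { // fromVersion is inclusive; 0 means "from the beginning"
			result = append(result, e)
		}
	}
	sort.Slice(result, func(a, b int) bool { return result[a].Version < result[b].Version })
	return result, nil
}
```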

Test Cases

  • GetEvents(actorID, 0) with 5 events: returns all 5 in order
  • GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
  • GetEvents(nonexistent, 0): returns empty slice
  • GetEvents with gap (versions 1, 3, 5): returns only those 3
  • Order is guaranteed (version order, not insertion order)

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Issue 1.7: [Rule] Define and enforce snapshot validity

Type: New Feature Bounded Context: Event Sourcing Priority: P1

Title: Implement snapshot invalidation policy

User Story

As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.

Acceptance Criteria

  • Snapshot valid until Version + MaxVersionGap (default 1000)
  • GetLatestSnapshot returns nil if no snapshot or invalid
  • Application can override MaxVersionGap in config
  • Snapshot timestamp recorded for debugging
  • No automatic cleanup; application calls SaveSnapshot to create
  • Tests confirm snapshot invalidation logic

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Policy)

Aggregate: ActorSnapshot + SnapshotPolicy

Policy: Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap

Implementation:

  • SnapshotStore.GetLatestSnapshot validates before returning
  • If invalid, returns nil; application must replay

Technical Notes

  • This is a safety policy; prevents stale snapshots
  • Application owns decision to create snapshots (no auto-triggering)
  • MaxVersionGap is tunable per deployment
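
A minimal sketch of the validity policy; the ActorSnapshot fields follow Issue 1.10, the function name is illustrative.

```go
package sketch

import "time"

// ActorSnapshot mirrors the value object described in Issue 1.10.
type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

// snapshotIsValid applies the policy above: a snapshot is usable only if it
// is no more than maxVersionGap versions behind the current version.
// E.g. a snapshot at version 10 with maxVersionGap=100 is valid at current
// version 110 but invalid at 111.
func snapshotIsValid(snap *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	if snap == nil {
		return false // no snapshot: the caller replays from version 0
	}
	return currentVersion-snap.Version <= maxVersionGap
}
```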

Test Cases

  • Snapshot at version 10, MaxGap=100, current=50: valid
  • Snapshot at version 10, MaxGap=100, current=111: invalid
  • Snapshot at version 10, MaxGap=100, current=110: valid
  • GetLatestSnapshot returns nil for invalid snapshot

Dependencies

  • Depends on: Issue 1.6 (GetEvents)

Issue 1.8: [Event] Publish SnapshotCreated for observability

Type: New Feature Bounded Context: Event Sourcing Priority: P1

Title: Emit snapshot creation event for lifecycle tracking

User Story

As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.

Acceptance Criteria

  • SnapshotCreated event published after SaveSnapshot succeeds
  • Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
  • Metrics increment for snapshot creation
  • No event if SaveSnapshot fails
  • Example: Snapshot created every 1000 versions

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Event)

Event: SnapshotCreated(actorID, version, timestamp, replayDurationMs)

Triggered by: SaveSnapshot call succeeds

Consumers: Metrics, monitoring dashboards

Technical Notes

  • SnapshotCreated is infrastructure event (like EventStored)
  • ReplayDuration helps identify slow actors that need more frequent snapshots

Test Cases

  • SaveSnapshot succeeds → SnapshotCreated published
  • SaveSnapshot fails → no event published
  • ReplayDuration recorded accurately

Dependencies

  • Depends on: Issue 1.10 (SnapshotStore interface)

Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay

Type: New Feature Bounded Context: Event Sourcing Priority: P1

Title: Handle corrupted events during replay without data loss

User Story

As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.

Acceptance Criteria

  • GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
  • ReplayResult contains: []*Event (good) and []ReplayError (bad)
  • Good events are returned in order despite errors
  • ReplayError contains: SequenceNumber, RawData, UnmarshalError
  • Application decides how to handle corrupted events
  • Metrics track corruption frequency

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Query)

Interface: EventStoreWithErrors extends EventStore

Method: GetEventsWithErrors(actorID, fromVersion) → ReplayResult

Data:

  • ReplayResult.Events: successfully deserialized events
  • ReplayResult.Errors: corruption records
  • ReplayResult.HasErrors(): convenience check

Technical Notes

  • Already defined in event.go (ReplayError, ReplayResult)
  • JetStreamEventStore should implement EventStoreWithErrors
  • Application uses HasErrors() to decide on recovery action

Test Cases

  • All good events: ReplayResult.Events populated, no errors
  • Corrupted event in middle: good events before/after, one error recorded
  • Multiple corruptions: all recorded with context
  • Application can inspect RawData for forensics

Dependencies

  • Depends on: Issue 1.6 (GetEvents)

Issue 1.10: [Interface] Implement SnapshotStore interface

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Define snapshot storage contract

User Story

As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).

Acceptance Criteria

  • SnapshotStore extends EventStore
  • GetLatestSnapshot(actorID) returns ActorSnapshot or nil
  • SaveSnapshot(snapshot) persists snapshot
  • ActorSnapshot contains: ActorID, Version, State, Timestamp
  • Namespace isolation: snapshots scoped to namespace
  • Tests verify interface contract

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Interface)

Interface: SnapshotStore extends EventStore

Methods:

  • GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
  • SaveSnapshot(snapshot) → error

Aggregates: ActorSnapshot (value object)

Technical Notes

  • Already defined in event.go
  • Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
  • Keep snapshots in same store as events (co-located)

Test Cases

  • SaveSnapshot persists; GetLatestSnapshot retrieves it
  • New actor: GetLatestSnapshot returns nil
  • Multiple snapshots: only latest returned
  • Namespace isolation: snapshots from tenant-a don't appear in tenant-b

Dependencies

  • Depends on: Issue 1.1 (SaveEvent + storage foundation)

Feature Set 1c: Optimistic Concurrency Control

Capability: Enable Safe Concurrent Writes

Description: Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.

Success Condition: Two concurrent writers race; one succeeds, the other sees a VersionConflictError; the application retries without locks.


Issue 1.11: [Rule] Enforce fail-fast on version conflict

Type: New Feature Bounded Context: Optimistic Concurrency Control Priority: P0

Title: Fail immediately on version conflict; no auto-retry

User Story

As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).

Acceptance Criteria

  • SaveEvent returns VersionConflictError immediately on mismatch
  • No built-in retry loop in SaveEvent
  • No database-level locks held
  • Application reads VersionConflictError and decides retry
  • Default retry strategy documented (examples/)

Bounded Context: Optimistic Concurrency Control

DDD Implementation Guidance

Type: New Feature (Policy)

Invariant: Conflicts trigger immediate failure; application owns retry

Implementation:

  • SaveEvent: version check, return error if mismatch, done
  • No loop, no backoff, no retries
  • Clean error with context for caller

Technical Notes

  • This is a design choice: fail-fast enables flexible retry strategies
  • Application can choose exponential backoff, jitter, circuit-breaker, etc.

Test Cases

  • SaveEvent(version: 2) when current=2: fails immediately
  • No retry attempted by library
  • Application can retry if desired
  • Example patterns in examples/retry.go

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Issue 1.12: [Documentation] Document concurrent write patterns

Type: New Feature Bounded Context: Optimistic Concurrency Control Priority: P1

Title: Provide retry strategy examples (backoff, circuit-breaker, queue)

User Story

As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.

Acceptance Criteria

  • examples/retry_exponential_backoff.go
  • examples/retry_circuit_breaker.go
  • examples/retry_queue_based.go
  • examples/concurrent_write_test.go showing patterns
  • README mentions OCC patterns
  • Each example is >100 lines with explanation

Bounded Context: Optimistic Concurrency Control

DDD Implementation Guidance

Type: Documentation

Artifacts:

  • examples/retry_exponential_backoff.go
  • examples/retry_circuit_breaker.go
  • examples/retry_queue_based.go
  • examples/concurrent_write_test.go

Content:

  • How to read VersionConflictError
  • When to retry (idempotent operations)
  • When not to retry (non-idempotent)
  • Backoff strategies
  • Monitoring

Technical Notes

  • Real, runnable code (not pseudocode)
  • Show metrics collection
  • Show when to give up
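
A condensed sketch of the exponential-backoff example these docs should ship; saveFn and the sentinel stand-in are assumptions for illustration.

```go
package sketch

import (
	"errors"
	"math/rand"
	"time"
)

// ErrVersionConflict stands in for the library sentinel from Issue 1.4.
var ErrVersionConflict = errors.New("version conflict")

// retryWithBackoff retries an idempotent save on version conflicts only.
// saveFn stands in for the application's "reload state, re-apply command,
// SaveEvent" step.
func retryWithBackoff(saveFn func() error, maxAttempts int) error {
	backoff := 10 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = saveFn(); err == nil {
			return nil
		}
		if !errors.Is(err, ErrVersionConflict) {
			return err // only version conflicts are worth retrying
		}
		// Sleep a random slice of the current backoff (full jitter) so
		// concurrent retriers don't collide on the same schedule.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff)) + 1))
		backoff *= 2
	}
	return err // give up; surface the last conflict to the caller
}
```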

Test Cases

  • Examples compile without error
  • Examples use idempotent operations
  • Test coverage for examples

Dependencies

  • Depends on: Issue 1.11 (fail-fast behavior)

Phase 2: Local Event Bus

Feature Set 2a: Event Routing and Filtering

Capability: Route and Filter Domain Events

Description: Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.

Success Condition: Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.


Issue 2.1: [Command] Implement Publish to local subscribers

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Publish events to local subscribers

User Story

As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.

Acceptance Criteria

  • Publish(namespaceID, event) sends to all subscribers of that namespace
  • Exact subscribers (namespace="orders") receive event
  • Wildcard subscribers (namespace="order*") receive matching events
  • Events delivered in-process (no NATS yet)
  • Buffered channels (100-event buffer) prevent blocking
  • Delivery to a full subscriber drops the event non-blocking (no deadlock)
  • Metrics track publish count, receive count, dropped count

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Command)

Command: Publish(namespaceID, event)

Invariant: All subscribers matching namespace receive event

Implementation:

  • Iterate exact subscribers for namespace
  • Iterate wildcard subscribers matching pattern
  • Deliver to each (non-blocking, buffered)
  • Count drops

Technical Notes

  • EventBus in eventbus.go already implements this
  • Ensure buffered channels don't cause memory leaks
  • Metrics important for observability

Test Cases

  • Publish to "orders": exact subscriber of "orders" receives
  • Publish to "orders.new": wildcard subscriber of "order*" receives
  • Publish to "payments": subscriber to "orders" does NOT receive
  • Subscriber with full buffer: event dropped (non-blocking)
  • 1000 publishes: metrics accurate

Dependencies

  • Depends on: Issue 2.2 (Subscribe)

Issue 2.2: [Command] Implement Subscribe with optional filter

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Register subscriber with optional event filter

User Story

As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.

Acceptance Criteria

  • Subscribe(namespacePattern) returns <-chan *Event
  • SubscribeWithFilter(namespacePattern, filter) returns filtered channel
  • Filter supports EventTypes ([]string) and ActorPattern (string)
  • Filters applied client-side (subscriber decides)
  • Wildcard patterns work: "*" matches single token, ">" matches multiple
  • Subscription channel is buffered (100 events)
  • Unsubscribe(namespacePattern, ch) removes subscription

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Command)

Command: Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)

Invariants:

  • Namespace pattern determines which namespaces
  • Filter determines which events within namespace
  • Both work together (AND logic)

Filter Types:

  • EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
  • ActorPattern: string (e.g., "order-customer-*")

Technical Notes

  • Pattern matching follows NATS conventions
  • Filters are optional (nil filter = all events)
  • Client-side filtering is efficient (NATS does server-side)

Test Cases

  • Subscribe("orders"): exact match only
  • Subscribe("order*"): wildcard match
  • Subscribe("order.*"): NATS-style wildcard
  • SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
  • SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
  • Unsubscribe closes channel

Dependencies

  • Depends on: Issue 1.1 (events structure)

Issue 2.3: [Rule] Enforce exact subscription isolation

Type: New Feature Bounded Context: Event Bus + Namespace Isolation Priority: P1

Title: Guarantee exact namespace subscriptions are isolated

User Story

As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.

Acceptance Criteria

  • Subscriber to "tenant-a" receives events from "tenant-a" only
  • Subscriber to "tenant-a" does NOT receive from "tenant-b"
  • Wildcard subscriber to "tenant*" receives from both
  • Exact-match subscribers remain isolated even when wildcard subscribers exist
  • Tests verify isolation with multi-namespace setup
  • Documentation warns about wildcard security implications

Bounded Context: Event Bus + Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Policy/Invariant)

Invariant: Exact subscriptions are isolated

Implementation:

  • exactSubscribers map[namespace][]*subscription
  • Wildcard subscriptions separate collection
  • Publish checks exact first, then wildcard patterns

Security Note: Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)

Technical Notes

  • Enforced at EventBus.Publish level
  • Exact match is simple string equality
  • Wildcard uses MatchNamespacePattern helper
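
A sketch of NATS-style pattern matching as described in Issue 2.2 ("*" matches one token, ">" matches the rest); the real MatchNamespacePattern helper may differ, e.g. by also accepting prefix globs such as "tenant-*".

```go
package sketch

import "strings"

// matchNamespacePattern compares a namespace against a NATS-style pattern:
// "." separates tokens, "*" matches exactly one token, ">" matches one or
// more trailing tokens.
func matchNamespacePattern(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")

	for i, tok := range p {
		if tok == ">" {
			return len(n) > i // ">" swallows everything that remains
		}
		if i >= len(n) {
			return false
		}
		if tok != "*" && tok != n[i] {
			return false
		}
	}
	return len(p) == len(n) // no ">": token counts must match exactly
}
```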

Test Cases

  • Publish to "tenant-a": only "tenant-a" exact subscribers get it
  • Publish to "tenant-b": only "tenant-b" exact subscribers get it
  • Publish to "tenant-a": "tenant*" wildcard subscriber gets it
  • Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it

Dependencies

  • Depends on: Issue 2.2 (Subscribe)

Issue 2.4: [Rule] Document wildcard subscription security

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Document that wildcard subscriptions bypass isolation

User Story

As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.

Acceptance Criteria

  • eventbus.go comments explain wildcard behavior
  • Security warning in Subscribe godoc
  • Example showing wildcard usage for logging
  • Example showing why wildcard is dangerous (if not restricted)
  • README mentions namespace isolation caveats
  • Examples show proper patterns (monitoring, auditing)

Bounded Context: Event Bus

DDD Implementation Guidance

Type: Documentation

Content:

  • Wildcard subscriptions receive all matching events
  • Use for cross-cutting concerns (logging, monitoring, audit)
  • Restrict access to trusted components
  • Never expose wildcard pattern to untrusted users

Examples:

  • Monitoring system subscribes to ">"
  • Audit system subscribes to "tenant-*"
  • Application logic uses exact subscriptions only

Technical Notes

  • Intentional design; not a bug
  • Different from NATS server-side filtering

Test Cases

  • Examples compile
  • Documentation is clear and accurate

Dependencies

  • Depends on: Issue 2.3 (exact isolation)

Issue 2.5: [Event] Publish SubscriptionCreated for tracking

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Track subscription lifecycle

User Story

As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.

Acceptance Criteria

  • SubscriptionCreated event published on Subscribe
  • SubscriptionDestroyed event published on Unsubscribe
  • Event contains: namespacePattern, filterCriteria, timestamp
  • Metrics increment on subscribe/unsubscribe
  • SubscriberCount(namespace) returns current count

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Event)

Event: SubscriptionCreated(namespacePattern, filter, timestamp)

Event: SubscriptionDestroyed(namespacePattern, timestamp)

Metrics: Subscriber count per namespace

Technical Notes

  • SubscriberCount already in eventbus.go
  • Add events to EventBus.Subscribe and EventBus.Unsubscribe
  • Internal events (infrastructure)

Test Cases

  • Subscribe → metrics increment
  • Unsubscribe → metrics decrement
  • SubscriberCount correct

Dependencies

  • Depends on: Issue 2.2 (Subscribe/Unsubscribe)

Issue 2.6: [Event] Publish EventPublished for delivery tracking

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Record event publication metrics

User Story

As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.

Acceptance Criteria

  • EventPublished event published on Publish
  • Metrics track: published count, delivered count, dropped count per namespace
  • Dropped events (full channel) recorded
  • Application can query metrics via Metrics()
  • Example: 1000 events published, 995 delivered, 5 dropped

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Event/Metrics)

Event: EventPublished (infrastructure event)

Metrics:

  • PublishCount[namespace]
  • DeliveryCount[namespace]
  • DroppedCount[namespace]

Implementation:

  • RecordPublish(namespace)
  • RecordReceive(namespace)
  • RecordDroppedEvent(namespace)

Technical Notes

  • Metrics already in DefaultMetricsCollector
  • RecordDroppedEvent signals subscriber backpressure
  • Can be used to auto-scale subscribers

Test Cases

  • Publish 100 events: metrics show 100 published
  • All delivered: metrics show 100 delivered
  • Full subscriber: next event dropped, metrics show 1 dropped
  • Query via bus.Metrics(): values accurate

Dependencies

  • Depends on: Issue 2.1 (Publish)

Issue 2.7: [Read Model] Implement GetSubscriptions query

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Query active subscriptions for operational visibility

User Story

As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.

Acceptance Criteria

  • GetSubscriptions() returns []SubscriptionInfo
  • SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
  • Works for both exact and wildcard subscriptions
  • Metrics accessible via SubscriberCount(namespace)
  • Example: "What subscriptions are listening to 'orders'?"

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: SubscriptionRegistry

Data:

  • Pattern: namespace pattern (e.g., "tenant-*")
  • Filter: optional filter criteria
  • SubscriberID: unique ID for each subscription
  • CreatedAt: timestamp

Implementation:

  • Track subscriptions in eventbus.go
  • Expose via GetSubscriptions() method

Technical Notes

  • Useful for debugging
  • Optional feature; not critical

Test Cases

  • Subscribe to "orders": GetSubscriptions shows it
  • Subscribe to "order*": GetSubscriptions shows it
  • Unsubscribe: GetSubscriptions removes it
  • Multiple subscribers: all listed

Dependencies

  • Depends on: Issue 2.2 (Subscribe)

Feature Set 2b: Buffering and Backpressure

Capability: Route and Filter Domain Events (non-blocking delivery)

Description: Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).

Success Condition: Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.


Issue 2.8: [Rule] Implement non-blocking event delivery

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Ensure event publication never blocks

User Story

As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.

Acceptance Criteria

  • Publish(namespace, event) returns immediately
  • If subscriber channel full, event dropped (non-blocking)
  • Dropped events counted in metrics
  • Buffered channel size is 100 (tunable)
  • Publisher never waits for subscriber
  • Metrics alert on high drop rate

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Policy)

Invariant: Publishers not blocked by slow subscribers

Implementation:

  • select { case ch <- event: ... default: ... }
  • Count drops in default case

Trade-off:

  • Pro: Publisher never blocks
  • Con: Events may be lost if subscriber can't keep up
  • Mitigation: Metrics alert on drops; subscriber can increase buffer or retry

Technical Notes

  • Already implemented in eventbus.go (deliverToSubscriber)
  • 100-event buffer is reasonable default
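
A minimal sketch of the select/default delivery path named above; type and counter names are illustrative, not the eventbus.go internals.

```go
package sketch

import "sync/atomic"

// Event is a minimal stand-in for the Aether event type.
type Event struct {
	Type string
	Data []byte
}

// deliveryMetrics keeps the counters a metrics collector would track.
type deliveryMetrics struct {
	delivered atomic.Int64
	dropped   atomic.Int64
}

// deliver pushes one event to one subscriber channel without ever blocking
// the publisher: a full buffer means the event is dropped and counted.
func deliver(ch chan<- *Event, event *Event, m *deliveryMetrics) {
	select {
	case ch <- event:
		m.delivered.Add(1)
	default:
		m.dropped.Add(1) // backpressure signal; alert when this grows
	}
}
```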

Test Cases

  • Subscribe, receive 100 events: no drops
  • Publish 101st event immediately: dropped
  • Metrics show drop count
  • Publisher latency < 1ms regardless of subscribers

Dependencies

  • Depends on: Issue 2.1 (Publish)

Issue 2.9: [Documentation] Document EventBus backpressure handling

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Explain buffer management and recovery from drops

User Story

As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.

Acceptance Criteria

  • Document buffer size (100 events default)
  • Explain what happens on overflow (event dropped)
  • Document recovery patterns (subscriber restarts, re-syncs)
  • Example: Subscriber catches up from JetStream after restart
  • Metrics to monitor (drop rate)
  • README section on backpressure

Bounded Context: Event Bus

DDD Implementation Guidance

Type: Documentation

Content:

  • Buffer size and behavior
  • Drop semantics
  • Recovery patterns
  • Metrics to monitor
  • When to increase buffer size

Examples:

  • Slow subscriber: increase buffer or fix handler
  • Network latency: events may be dropped
  • Handler panics: subscriber must restart and re-sync

Technical Notes

  • Events are lost if dropped; only durable via JetStream
  • Phase 4 (NATS event delivery) addresses durability

Test Cases

  • Documentation is clear
  • Examples work

Dependencies

  • Depends on: Issue 2.8 (non-blocking delivery)

Phase 3: Cluster Coordination

Feature Set 3a: Cluster Topology and Leadership

Capability: Coordinate Cluster Topology

Description: Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.

Success Condition: Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.


Issue 3.1: [Command] Implement JoinCluster protocol

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Enable node discovery via cluster join

User Story

As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.

Acceptance Criteria

  • JoinCluster() announces node via NATS
  • Node info contains: NodeID, Address, Timestamp, Status
  • Other nodes receive join announcement
  • Cluster topology updated atomically
  • Rejoining node detected and updated
  • Tests verify multi-node discovery

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: JoinCluster()

Aggregates: Cluster (group of nodes)

Events: NodeJoined(nodeID, address, timestamp)

Technical Notes

  • NATS subject: "aether.cluster.nodes"
  • NodeDiscovery subscribes to announcements
  • ClusterManager.Start() initiates join

Test Cases

  • Single node joins: topology = [node-a]
  • Second node joins: topology = [node-a, node-b]
  • Third node joins: topology = [node-a, node-b, node-c]
  • Node rejoins: updates existing entry

Dependencies

  • None (first cluster feature)

Issue 3.2: [Command] Implement LeaderElection

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Elect single leader via NATS-based voting

User Story

As a cluster, I want one node to be elected leader so that it can coordinate shard assignments and rebalancing.

Acceptance Criteria

  • LeaderElection holds election every HeartbeatInterval (5s)
  • Nodes announce their own candidacy (no vote counting; the first announcement wins)
  • One leader elected per term
  • Leader holds lease (TTL = 2 * HeartbeatInterval)
  • All nodes converge on same leader
  • Lease renewal happens automatically

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: ElectLeader()

Aggregates: LeadershipLease (time-bound authority)

Events: LeaderElected(leaderID, term, leaseExpiration)

Technical Notes

  • NATS subject: "aether.cluster.election"
  • Each node publishes heartbeat with NodeID, Timestamp
  • First node to publish becomes leader
  • Lease expires if no heartbeat for TTL

Test Cases

  • Single node: elected immediately
  • Three nodes: exactly one elected
  • Leader dies: remaining nodes elect new leader within 2*interval
  • Former leader rejoins: it may or may not regain leadership

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.3: [Rule] Enforce single leader invariant

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Guarantee exactly one leader at any time

User Story

As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.

Acceptance Criteria

  • At most one leader at any time (lease-based)
  • If leader lease expires, no leader until re-election
  • All nodes see same leader (or none)
  • Tests verify invariant under various failure scenarios
  • Split-brain window is bounded by the lease TTL (any dual leadership ends when the lease expires)

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: At most one leader (enforced by lease TTL)

Mechanism:

  • Leader publishes heartbeat every HeartbeatInterval
  • Other nodes trust leader if heartbeat < HeartbeatInterval old
  • If no heartbeat for 2*HeartbeatInterval, lease expired
  • New election begins

Technical Notes

  • Lease-based; not consensus-based (simpler)
  • Allows temporary split-brain until lease expires
  • Acceptable for Aether (eventual consistency)
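
A minimal sketch of the lease-expiry check implied by this mechanism; type and field names are illustrative.

```go
package sketch

import "time"

// LeadershipLease is a sketch of the time-bound authority described above.
type LeadershipLease struct {
	LeaderID      string
	Term          int64
	LastHeartbeat time.Time
}

// Expired reports whether the lease has lapsed: no heartbeat for
// 2*heartbeatInterval means the cluster should re-elect. Until then a
// partitioned old leader may still believe it leads (bounded split-brain).
func (l *LeadershipLease) Expired(now time.Time, heartbeatInterval time.Duration) bool {
	return now.Sub(l.LastHeartbeat) > 2*heartbeatInterval
}
```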

Test Cases

  • Simulate leader death: lease expires, new leader elected
  • Simulate network partition: partition may have >1 leader until lease expires
  • Verify no coordination during lease expiration

Dependencies

  • Depends on: Issue 3.2 (leader election)

Issue 3.4: [Event] Publish LeaderElected on election

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Record leadership election outcomes

User Story

As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.

Acceptance Criteria

  • LeaderElected event published after successful election
  • Event contains: LeaderID, Term, LeaseExpiration, Timestamp
  • Metrics increment on election
  • Helpful for debugging split-brain scenarios
  • Track election frequency (ideally < 1 per minute)

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: LeaderElected(leaderID, term, leaseExpiration, timestamp)

Triggered by: Successful election

Consumers: Metrics, audit logs

Technical Notes

  • Event published locally to all observers
  • Infrastructure event (not domain event)

Test Cases

  • Election happens: event published
  • Term increments: event reflects new term
  • Metrics accurate

Dependencies

  • Depends on: Issue 3.2 (election)

Issue 3.5: [Event] Publish LeadershipLost on lease expiration

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Track leadership transitions

User Story

As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.

Acceptance Criteria

  • LeadershipLost event published when lease expires
  • Event contains: PreviousLeaderID, Timestamp, Reason
  • Metrics track leadership transitions
  • Helpful for debugging cascading failures

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: LeadershipLost(previousLeaderID, timestamp, reason)

Reason: "lease_expired", "node_failed", etc.

Technical Notes

  • Published when lease TTL expires
  • Useful for observability

Test Cases

  • Leader lease expires: LeadershipLost published
  • Metrics show transition

Dependencies

  • Depends on: Issue 3.2 (election)

Issue 3.6: [Read Model] Implement GetClusterTopology query

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Query current cluster members and status

User Story

As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.

Acceptance Criteria

  • GetNodes() returns map[nodeID]*NodeInfo
  • NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
  • Status is: Active, Degraded, Failed
  • LastSeen is accurate heartbeat timestamp
  • ShardIDs show shard ownership (filled in Phase 3b)
  • Example: "node-a is active; node-b failed 30s ago"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: ClusterTopology

Data:

  • NodeID → NodeInfo (status, heartbeat, shards)
  • LeaderID (current leader)
  • Term (election term)

Technical Notes

  • ClusterManager maintains topology in-memory
  • Update on each heartbeat/announcement

Test Cases

  • GetNodes() returns active nodes
  • Status accurate (Active, Failed, etc.)
  • LastSeen updates on heartbeat
  • Rejoining node updates existing entry

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.7: [Read Model] Implement GetLeader query

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Query current leader

User Story

As a client, I want to know who the leader is, so that I can route coordination requests to the right node.

Acceptance Criteria

  • GetLeader() returns current leader NodeID or ""
  • IsLeader() returns true if this node is leader
  • Both consistent with LeaderElection state
  • Updated immediately on election
  • Example: "node-b is leader (term 5)"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: LeadershipRegistry

Data: CurrentLeader, CurrentTerm, LeaseExpiration

Implementation:

  • LeaderElection maintains this
  • ClusterManager queries it

Technical Notes

  • Critical for routing coordination work
  • Must be consistent across cluster

Test Cases

  • No leader: GetLeader returns ""
  • Leader elected: GetLeader returns leader ID
  • IsLeader true on leader, false on others
  • Changes on re-election

Dependencies

  • Depends on: Issue 3.2 (election)

Feature Set 3b: Shard Distribution

Capability: Distribute Actors Across Cluster Nodes

Description: Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.

Success Condition: 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.


Issue 3.8: [Command] Implement consistent hash ring

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Distribute shards across nodes with minimal reshuffling

User Story

As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.

Acceptance Criteria

  • ConsistentHashRing(numShards=1024) creates ring
  • GetShard(actorID) returns consistent shard [0, 1024)
  • AddNode(nodeID) rebalances ~numShards/numNodes shards
  • RemoveNode(nodeID) rebalances shards evenly
  • Same actor always maps to same shard
  • Reshuffling < 40% on node add/remove

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: AssignShards(nodes)

Aggregates: ConsistentHashRing (distribution algorithm)

Invariants:

  • Each shard [0, 1024) assigned to exactly one node
  • ActorID hashes consistently to shard
  • Topology changes minimize reassignment

Technical Notes

  • hashring.go already implements this
  • Use crypto/md5 or compatible hash
  • 1024 shards is tunable (P1 default)
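
A sketch of the actor-to-shard step, assuming the 1024-shard default and the md5 hashing mentioned above (names are illustrative); the hash ring then maps each shard to its owning node, so adding a node only moves the shards that node takes over.

```go
package sketch

import (
	"crypto/md5"
	"encoding/binary"
)

const numShards = 1024

// shardFor maps an actor ID to a stable shard in [0, numShards). The hash
// only needs to be deterministic and well distributed; md5 appears here
// because the technical notes mention it, not for any security property.
func shardFor(actorID string) uint32 {
	sum := md5.Sum([]byte(actorID))
	return binary.BigEndian.Uint32(sum[:4]) % numShards
}
```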

Test Cases

  • Single node: all shards assigned to it
  • Two nodes: ~512 shards each
  • Three nodes: ~341 shards each
  • Add fourth node: ~256 shards each (~25% reshuffled)
  • Remove node: remaining nodes rebalance evenly
  • Same actor-id always hashes to same shard

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.9: [Rule] Enforce single shard owner invariant

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Guarantee each shard has exactly one owner

User Story

As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.

Acceptance Criteria

  • ShardMap tracks shard → nodeID assignment
  • No shard is unassigned (every shard has owner)
  • No shard assigned to multiple nodes
  • Reassignment is atomic (no in-between state)
  • Tests verify invariant after topology changes

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: Each shard [0, 1024) assigned to exactly one active node

Mechanism:

  • ShardMap[shardID] = [nodeID]
  • Maintained by leader
  • Updated atomically on rebalancing

Technical Notes

  • shard.go implements ShardManager
  • Validated after each rebalancing

Test Cases

  • After rebalancing: all shards assigned
  • No orphaned shards
  • No multiply-assigned shards
  • Reassignment is atomic

Dependencies

  • Depends on: Issue 3.8 (consistent hashing)

Issue 3.10: [Event] Publish ShardAssigned on assignment

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Track shard-to-node assignments

User Story

As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.

Acceptance Criteria

  • ShardAssigned event published after assignment
  • Event contains: ShardID, NodeID, Timestamp
  • Metrics track: shards per node, rebalancing frequency
  • Example: Shard 42 assigned to node-b

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: ShardAssigned(shardID, nodeID, timestamp)

Triggered by: AssignShards command succeeds

Metrics: Shards per node, distribution evenness

Technical Notes

  • Infrastructure event
  • Useful for monitoring load distribution

Test Cases

  • Assignment published on rebalancing
  • Metrics reflect distribution

Dependencies

  • Depends on: Issue 3.9 (shard ownership)

Issue 3.11: [Read Model] Implement GetShardAssignments query

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Query shard-to-node mapping

User Story

As a client, I want to know which node owns a shard, so that I can route actor requests correctly.

Acceptance Criteria

  • GetShardAssignments() returns ShardMap
  • ShardMap[shardID] returns owning nodeID
  • GetShard(actorID) returns shard for actor
  • Routing decision: actorID → shard → nodeID
  • Cached locally; refreshed on each rebalancing

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: ShardMap

Data:

  • ShardID → NodeID (primary owner)
  • Version (incremented on rebalancing)
  • UpdateTime

Implementation:

  • ClusterManager.GetShardMap()
  • Cached; updated on assignment changes

Technical Notes

  • Critical for routing
  • Must be consistent across cluster
  • Version helps detect stale caches

Test Cases

  • GetShardAssignments returns current map
  • GetShard(actorID) returns consistent shard
  • Routing: actor ID → shard → node owner

Dependencies

  • Depends on: Issue 3.9 (shard ownership)

Feature Set 3c: Failure Detection and Recovery

Capability: Recover from Node Failures

Description: Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.

Success Condition: Node dies → failure detected within 90s → shards reassigned → actors replay automatically.


Issue 3.12: [Command] Implement node health checks

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Detect node failures via heartbeat timeout

User Story

As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.

Acceptance Criteria

  • Each node publishes heartbeat every 30s
  • Nodes without heartbeat for 90s marked as Failed
  • checkNodeHealth() runs every 30s
  • Failed node's status updates atomically
  • Tests verify failure detection timing
  • Failed node can rejoin cluster

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: MarkNodeFailed(nodeID)

Trigger: monitorNodes detects missing heartbeat

Events: NodeFailed(nodeID, lastSeenTimestamp)

Technical Notes

  • monitorNodes() loop in manager.go
  • Check LastSeen timestamp
  • Update status if stale (>90s)
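
A minimal sketch of the health-check loop body; the NodeInfo fields follow Issue 3.6, everything else is an illustrative stand-in for the ClusterManager state.

```go
package sketch

import (
	"sync"
	"time"
)

type NodeStatus string

const (
	NodeActive NodeStatus = "Active"
	NodeFailed NodeStatus = "Failed"
)

type NodeInfo struct {
	ID       string
	Status   NodeStatus
	LastSeen time.Time
}

// checkNodeHealth marks any node silent for longer than failureTimeout
// (90s in the acceptance criteria) as Failed. Run it on a 30s ticker.
func checkNodeHealth(mu *sync.Mutex, nodes map[string]*NodeInfo, now time.Time, failureTimeout time.Duration) {
	mu.Lock()
	defer mu.Unlock()
	for _, n := range nodes {
		if n.Status == NodeActive && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = NodeFailed // a NodeFailed event would be published here
		}
	}
}
```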

Test Cases

  • Active node: status stays Active
  • No heartbeat for 90s: status → Failed
  • Rejoin: status → Active
  • Failure detected within 90-120s (90s timeout plus up to one 30s check interval)

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.13: [Command] Implement RebalanceShards after node failure

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Reassign failed node's shards to healthy nodes

User Story

As the cluster, I want to reassign failed node's shards automatically, so that actors are available on new nodes.

Acceptance Criteria

  • Leader detects node failure
  • Leader triggers RebalanceShards
  • Failed node's shards reassigned evenly
  • No shard left orphaned
  • ShardMap updated atomically
  • Rebalancing completes within 5 seconds

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: RebalanceShards(failedNodeID)

Aggregates: ShardMap, ConsistentHashRing

Events: RebalanceStarted, ShardMigrated

Technical Notes

  • Leader only (IsLeader() check)
  • Use consistent hashing to assign
  • Calculate new assignments atomically

Test Cases

  • Node-a fails with shards [1, 2, 3]
  • Leader reassigns [1, 2, 3] to remaining nodes
  • No orphaned shards
  • Rebalancing < 5s

Dependencies

  • Depends on: Issue 3.8 (consistent hashing)
  • Depends on: Issue 3.12 (failure detection)

Issue 3.14: [Rule] Enforce no-orphan invariant

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Guarantee all shards have owners after rebalancing

User Story

As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.

Acceptance Criteria

  • Before rebalancing: verify no orphaned shards
  • After rebalancing: verify all shards assigned
  • Tests fail if invariant violated
  • Rebalancing aborted if invariant would be violated

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: All shards [0, 1024) have owners after any rebalancing

Check:

  • Count assigned shards
  • Verify = 1024
  • Abort if not

Technical Notes

  • Validate before committing ShardMap
  • Logs errors but doesn't assert (graceful degradation)

Test Cases

  • Rebalancing completes: all shards assigned
  • Orphaned shard detected: rebalancing rolled back
  • Tests verify count = 1024

Dependencies

  • Depends on: Issue 3.13 (rebalancing)

Issue 3.15: [Event] Publish NodeFailed on failure detection

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Record node failure for observability

User Story

As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.

Acceptance Criteria

  • NodeFailed event published when failure detected
  • Event contains: NodeID, LastSeenTimestamp, AffectedShards
  • Metrics track failure frequency
  • Example: "node-a failed; 341 shards affected"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)

Triggered by: checkNodeHealth marks node failed

Consumers: Metrics, alerts, audit logs

Technical Notes

  • Infrastructure event
  • AffectedShards helps assess impact

Test Cases

  • Node failure detected: event published
  • Metrics show affected shard count

Dependencies

  • Depends on: Issue 3.12 (failure detection)

Issue 3.16: [Event] Publish ShardMigrated on shard movement

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Track shard migrations

User Story

As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.

Acceptance Criteria

  • ShardMigrated event published on each shard movement
  • Event contains: ShardID, FromNodeID, ToNodeID, Status
  • Status: "Started", "InProgress", "Completed", "Failed"
  • Metrics track migration count and duration
  • Example: "Shard 42 migrated from node-a to node-b (2.3s)"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)

Status: Started → InProgress → Completed

Consumers: Metrics, progress tracking

Technical Notes

  • Published for each shard move
  • Helps track rebalancing progress
  • Useful for SLO monitoring

Test Cases

  • Shard moves: event published
  • Metrics track duration
  • Status transitions correct

Dependencies

  • Depends on: Issue 3.13 (rebalancing)

Issue 3.17: [Documentation] Document actor migration and replay

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Explain how actors move and recover state

User Story

As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.

Acceptance Criteria

  • Design doc: cluster/ACTOR_MIGRATION.md
  • Explain shard reassignment process
  • Explain state rebuild via GetEvents + replay
  • Explain snapshot optimization
  • Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
  • Explain out-of-order message handling

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: Documentation

Content:

  • Shard assignment (consistent hashing)
  • Actor discovery (routing via shard map)
  • State rebuild (replay from JetStream)
  • Snapshots (optional optimization)
  • In-flight messages (may arrive before replay completes)

Examples:

  • Manual failover: reassign shards manually
  • Auto failover: leader initiates on failure detection

Technical Notes

  • Complex topic; good documentation prevents bugs

Test Cases

  • Documentation is clear
  • Examples correct

Dependencies

  • Depends on: Issue 3.13 (rebalancing)
  • Depends on: Phase 1 (event replay)

Phase 4: Namespace Isolation and NATS Event Delivery

Feature Set 4a: Namespace Storage Isolation

Capability: Isolate Logical Domains Using Namespaces

Description: Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at persistence layer.

Success Condition: Two stores with namespaces "tenant-a", "tenant-b"; event saved in "tenant-a" invisible to "tenant-b" queries.


Issue 4.1: [Rule] Enforce namespace-based stream naming

Type: New Feature Bounded Context: Namespace Isolation Priority: P1

Title: Use namespace prefixes in JetStream stream names

User Story

As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.

Acceptance Criteria

  • Namespace "tenant-a" → stream "tenant-a_events"
  • Namespace "tenant-b" → stream "tenant-b_events"
  • Empty namespace → stream "events" (default)
  • JetStreamConfig.Namespace sets prefix
  • NewJetStreamEventStoreWithNamespace convenience function
  • Tests verify stream names have namespace prefix

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Configuration)

Value Object: Namespace (string identifier)

Implementation:

  • JetStreamConfig.Namespace field
  • StreamName = namespace + "_events" if namespace set
  • StreamName = "events" if namespace empty
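
A small Go sketch of the derivation above; the helper name is an assumption, not an existing Aether function.

```go
// streamName derives the JetStream stream name from a namespace.
func streamName(namespace string) string {
	if namespace == "" {
		return "events" // default stream
	}
	return namespace + "_events" // e.g. "tenant-a" -> "tenant-a_events"
}
```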

Technical Notes

  • Already partially implemented in jetstream.go
  • Ensure the derived stream name contains only safe characters (sanitize spaces, dots, and wildcards, which JetStream does not allow in stream names)

Test Cases

  • NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
  • NewJetStreamEventStoreWithNamespace(""): creates stream "events"
  • Stream name verified

Dependencies

  • None (orthogonal to other contexts)

Issue 4.2: [Rule] Enforce storage-level namespace isolation

Type: New Feature Bounded Context: Namespace Isolation Priority: P0

Title: Prevent cross-namespace data leakage at storage layer

User Story

As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.

Acceptance Criteria

  • SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
  • GetEvents("tenant-a") queries "tenant-a_events" stream only
  • No possibility of accidental cross-namespace leakage
  • JetStream subject filtering enforces isolation
  • Integration tests verify with multiple namespaces

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: Events from namespace X are invisible to namespace Y

Mechanism:

  • Separate JetStream streams per namespace
  • Subject prefixing: "tenant-a.events.actor-123"
  • Subscribe filters by subject prefix
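
A minimal Go sketch of the subject layout and filter, assuming the "{namespace}.events.{actorID}" scheme above; package and helper names are illustrative.

```go
package namespace

import "fmt"

// eventSubject builds the per-actor subject, e.g. "tenant-a.events.actor-123".
func eventSubject(ns, actorID string) string {
	return fmt.Sprintf("%s.events.%s", ns, actorID)
}

// subjectFilter returns the consumer filter for one namespace; a consumer
// created with "tenant-a.events.>" can never match "tenant-b.events.*".
func subjectFilter(ns string) string {
	return ns + ".events.>"
}
```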

Technical Notes

  • jetstream.go: SubscribeToActorEvents uses subject prefix
  • Consumer created with subject filter matching namespace

Test Cases

  • SaveEvent to tenant-a: visible in tenant-a queries
  • Same event invisible to tenant-b queries
  • GetLatestVersion scoped to namespace
  • GetEvents scoped to namespace
  • Multi-namespace integration test

Dependencies

  • Depends on: Issue 4.1 (stream naming)

Issue 4.3: [Documentation] Document namespace design patterns

Type: New Feature Bounded Context: Namespace Isolation Priority: P1

Title: Provide guidance on namespace naming and use

User Story

As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.

Acceptance Criteria

  • Design doc: NAMESPACE_DESIGN_PATTERNS.md
  • Pattern 1: "tenant-{id}" (per-customer)
  • Pattern 2: "env.domain" (per-env, per-bounded-context)
  • Pattern 3: "env.domain.customer" (most granular)
  • Examples of each pattern
  • Guidance on choosing granularity
  • Anti-patterns (wildcards, spaces, leading/trailing dots)

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: Documentation

Content:

  • Multi-tenant patterns
  • Granularity decisions
  • Namespace naming rules
  • Examples
  • Anti-patterns
  • Performance implications

Examples:

  • SaaS: "tenant-uuid"
  • Microservices: "service.orders"
  • Complex: "env.service.tenant"

Technical Notes

  • No hard restrictions; naming is flexible
  • Sanitization (spaces → underscores)

Test Cases

  • Documentation is clear
  • Examples valid

Dependencies

  • Depends on: Issue 4.1 (stream naming)

Issue 4.4: [Validation] Add namespace format validation (P2)

Type: New Feature Bounded Context: Namespace Isolation Priority: P2

Title: Validate namespace names to prevent invalid streams

User Story

As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.

Acceptance Criteria

  • ValidateNamespace(ns string) returns error for invalid names
  • Rejects: "tenant-*", "tenant a", "tenant."
  • Accepts: "tenant-abc", "prod.orders", "tenant_123"
  • Called on NewJetStreamEventStoreWithNamespace
  • Clear error messages
  • Tests verify validation rules

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Validation)

Validation Rules:

  • No wildcards (*, >)
  • No spaces
  • No leading/trailing dots
  • Alphanumeric, hyphens, underscores, dots only

Implementation:

  • ValidateNamespace regex
  • Called before stream creation
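
A possible Go implementation of these rules; the regex and error message are illustrative and could be relaxed toward sanitization instead, as noted below.

```go
package namespace

import (
	"fmt"
	"regexp"
)

// validNamespace: alphanumeric segments (with '-' and '_') joined by single
// dots, which rules out wildcards, spaces, and leading/trailing dots.
var validNamespace = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace rejects names that would produce invalid JetStream
// streams or subjects; an empty namespace falls back to the default stream.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil
	}
	if !validNamespace.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: use letters, digits, '-', '_' and '.' separators only", ns)
	}
	return nil
}
```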

Technical Notes

  • Nice-to-have; currently namespace strings are accepted as-is
  • Could sanitize instead of rejecting (replace spaces with underscores)

Test Cases

  • Valid: "tenant-abc", "prod.orders"
  • Invalid: "tenant-*", "tenant a", ".prod"
  • Error messages clear

Dependencies

  • Depends on: Issue 4.1 (stream naming)

Feature Set 4b: Cross-Node Event Delivery via NATS

Capability: Deliver Events Across Cluster Nodes

Description: Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.

Success Condition: Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).


Issue 4.5: [Command] Implement NATSEventBus wrapper

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1

Title: Extend EventBus with NATS-native pub/sub

User Story

As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.

Acceptance Criteria

  • NATSEventBus embeds EventBus
  • Publish(namespace, event) sends to local EventBus AND NATS
  • NATS subject: "aether.events.{namespace}"
  • SubscribeWithFilter works across nodes
  • Self-published events not re-delivered (avoid loops)
  • Tests verify cross-node delivery

Bounded Context: Event Bus (NATS extension)

DDD Implementation Guidance

Type: New Feature (Extension)

Aggregate: EventBus extended with NATSEventBus

Commands: Publish(namespace, event) [same interface, distributed]

Implementation:

  • NATSEventBus composes EventBus
  • Override Publish to also publish to NATS
  • Subscribe to NATS subjects matching namespace

Technical Notes

  • nats_eventbus.go already partially implemented
  • NATS subject: "aether.events.orders" for namespace "orders"
  • Include sourceNodeID in event to prevent redelivery
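
A minimal sketch of the publish path, assuming the nats.go JetStream client; Event, LocalBus, the package name, and the field names are simplified stand-ins for the existing Aether types.

```go
package aether

import (
	"encoding/json"

	"github.com/nats-io/nats.go"
)

// Simplified stand-ins for the existing local bus and event types.
type Event struct {
	ID           string `json:"id"`
	SourceNodeID string `json:"source_node_id"`
	Data         []byte `json:"data"`
}

type LocalBus interface {
	Publish(namespace string, event Event) error
}

// NATSEventBus wraps the local bus and fans events out over JetStream.
type NATSEventBus struct {
	local  LocalBus
	js     nats.JetStreamContext
	nodeID string
}

func (b *NATSEventBus) Publish(namespace string, event Event) error {
	// Deliver locally first, then fan out to the cluster.
	if err := b.local.Publish(namespace, event); err != nil {
		return err
	}
	// Tag the source node so subscribers can drop this node's own publishes.
	event.SourceNodeID = b.nodeID
	data, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = b.js.Publish("aether.events."+namespace, data)
	return err
}
```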

Test Cases

  • Publish on node-a: local subscribers on node-a receive
  • Same publish: node-b subscribers receive via NATS
  • Self-loop prevented: node-a doesn't re-receive own publish
  • Multi-node: all nodes converge on same events

Dependencies

  • Depends on: Issue 2.1 (EventBus.Publish)
  • Depends on: Issue 3.1 (cluster setup for multi-node tests)

Issue 4.6: [Rule] Enforce exactly-once delivery across cluster

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1

Title: Guarantee events delivered to all cluster subscribers

User Story

As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.

Acceptance Criteria

  • Event published to NATS with JetStream consumer
  • Consumer acknowledges delivery
  • Redelivery on network failure (JetStream handles)
  • No duplicate delivery to same subscriber
  • All nodes see same events in same order

Bounded Context: Event Bus (NATS)

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: Exactly-once delivery to each subscriber

Mechanism:

  • JetStream consumer per subscriber group
  • Acknowledgment on delivery
  • Automatic redelivery on timeout

Technical Notes

  • JetStream handles durability and ordering
  • Consumer name = subscriber ID
  • Push consumer model (events pushed to subscriber)
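
A sketch of the consumer side with nats.go: one durable consumer per subscriber group, manual acks, and redelivery when no ack arrives within the ack wait. The subject, durable name, and handler are illustrative.

```go
package aether

import (
	"time"

	"github.com/nats-io/nats.go"
)

// subscribeDurable attaches a push consumer for one subscriber group.
func subscribeDurable(js nats.JetStreamContext, handle func([]byte) error) (*nats.Subscription, error) {
	return js.Subscribe(
		"aether.events.orders",
		func(msg *nats.Msg) {
			if err := handle(msg.Data); err != nil {
				return // no ack: JetStream redelivers after AckWait
			}
			_ = msg.Ack() // acknowledged messages are not redelivered
		},
		nats.Durable("order-projector"), // one durable consumer per subscriber group
		nats.ManualAck(),
		nats.AckWait(30*time.Second),
	)
}
```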

Test Cases

  • Publish event: all subscribers receive once
  • Network failure: redelivery after timeout
  • No duplicates on subscriber
  • Order preserved across nodes

Dependencies

  • Depends on: Issue 4.5 (NATSEventBus)

Issue 4.7: [Event] Publish EventPublished (via NATS)

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P2

Title: Route published events to NATS subjects

User Story

As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.

Acceptance Criteria

  • EventPublished event published to NATS
  • Subject: "aether.events.{namespace}.published"
  • Message contains: eventID, timestamp, sourceNodeID
  • Metrics track: events published, delivered, dropped
  • Helps identify partition/latency issues

Bounded Context: Event Bus (NATS)

DDD Implementation Guidance

Type: New Feature (Event)

Event: EventPublished (infrastructure)

Subject: aether.events.{namespace}.published

Consumers: Metrics, monitoring
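
An illustrative payload for this infrastructure event, assuming the fields listed in the acceptance criteria; names are not final.

```go
package aether

import "time"

// EventPublishedNotice is sent to "aether.events.{namespace}.published"
// after the underlying event has been published to NATS successfully.
type EventPublishedNotice struct {
	EventID      string    `json:"event_id"`
	Timestamp    time.Time `json:"timestamp"`
	SourceNodeID string    `json:"source_node_id"`
}
```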

Technical Notes

  • Published after NATS publish succeeds
  • Separate from local EventPublished (for clarity)

Test Cases

  • Publish event: EventPublished message on NATS
  • Metrics count delivery
  • Cross-node visibility works

Dependencies

  • Depends on: Issue 4.5 (NATSEventBus)

Issue 4.8: [Read Model] Implement cross-node subscription

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1

Title: Receive events from other nodes via NATS

User Story

As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.

Acceptance Criteria

  • NATSEventBus.Subscribe(namespace) receives local + NATS events
  • SubscribeWithFilter works with NATS
  • Events from local node: delivered via local EventBus
  • Events from remote nodes: delivered via NATS consumer
  • Subscriber sees unified stream (no duplication)

Bounded Context: Event Bus (NATS)

DDD Implementation Guidance

Type: New Feature (Query/Subscription)

Read Model: UnifiedEventStream (local + remote)

Implementation:

  • Subscribe creates local channel
  • NATSEventBus subscribes to NATS subject
  • Both feed into subscriber channel
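
A sketch of the merge, reusing the Event stand-in, imports, and JetStream context from the Issue 4.5 sketch; the local channel is assumed to come from the existing EventBus subscription, and all names are illustrative.

```go
// unifiedSubscribe merges a local subscription with a NATS subscription
// for the same namespace; events this node published itself are skipped
// on the NATS path so subscribers see each event exactly once.
func unifiedSubscribe(js nats.JetStreamContext, nodeID, namespace string, local <-chan Event) (<-chan Event, error) {
	out := make(chan Event, 64)

	// Local events: forwarded straight through.
	go func() {
		for e := range local {
			out <- e
		}
	}()

	// Remote events: decoded from NATS, dropping our own publishes.
	_, err := js.Subscribe("aether.events."+namespace, func(msg *nats.Msg) {
		var e Event
		if err := json.Unmarshal(msg.Data, &e); err != nil {
			return
		}
		if e.SourceNodeID == nodeID {
			return // already delivered via the local channel
		}
		out <- e
	})
	if err != nil {
		return nil, err
	}
	return out, nil
}
```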

Technical Notes

  • Unified view is transparent to subscriber
  • No need to know if event is local or remote

Test Cases

  • Subscribe to namespace: receive local events
  • Subscribe to namespace: receive remote events
  • Filter works across both sources
  • No duplication

Dependencies

  • Depends on: Issue 4.5 (NATSEventBus)

Summary

This backlog contains 67 executable issues across 5 bounded contexts organized into 4 implementation phases. Each issue:

  • Is decomposed using DDD-informed order (commands → rules → events → reads)
  • References domain concepts (aggregates, commands, events, value objects)
  • Includes acceptance criteria (testable, specific)
  • States dependencies (enabling parallel work)
  • Is sized to 1-3 days of work

Recommended Build Order:

  1. Phase 1 (17 issues): Event Sourcing Foundation - everything depends on this
  2. Phase 2 (9 issues): Local Event Bus - enables observability before clustering
  3. Phase 3 (20 issues): Cluster Coordination - enables distributed deployment
  4. Phase 4 (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery

Total Scope: 67 issues sized at 1-3 days each (conservative estimate: 10-15 dev-weeks for a small team)


Next Steps

  1. Create Gitea issues from this backlog
  2. Assign to team members
  3. Set up dependency tracking in Gitea
  4. Use /spawn-issues skill to parallelize implementation
  5. Iterate on acceptance criteria with domain experts

See /issue-writing skill for proper issue formatting in Gitea.