Aether Executable Backlog

Built from: 9 Capabilities, 5 Bounded Contexts, DDD-informed decomposition

Date: 2026-01-12


Backlog Overview

This backlog decomposes Aether's 9 product capabilities into executable features and issues using domain-driven decomposition. Each capability is broken into vertical slices following the decomposition order: Commands → Domain Rules → Events → Read Models → UI/API.

Total Scope:

  • Capabilities: 9 (all complete)
  • Features: 14
  • Issues: 67
  • Contexts: 5
  • Implementation Phases: 4

Build Order (by value and dependencies):

  1. Phase 1: Event Sourcing Foundation (Capabilities 1-3)

    • Issues: 17
    • Enables all other work
  2. Phase 2: Local Event Bus (Capability 8)

    • Issues: 9
    • Enables local pub/sub before clustering
  3. Phase 3: Cluster Coordination (Capabilities 5-7)

    • Issues: 20
    • Enables distributed deployment
  4. Phase 4: Namespace & NATS (Capabilities 4, 9)

    • Issues: 21
    • Enables multi-tenancy and cross-node delivery

Phase 1: Event Sourcing Foundation

Feature Set 1a: Event Storage with Version Conflict Detection

Capability: Store Events Durably with Conflict Detection

Description: Applications can persist domain events with automatic conflict detection, ensuring no lost writes from concurrent writers.

Success Condition: Multiple writers attempt to update the same actor; the first wins, the others see a VersionConflictError with details; all successful writes land in the immutable history.


Issue 1.1: [Command] Implement SaveEvent with monotonic version validation

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: As a developer, I want SaveEvent to validate monotonic versions, so that concurrent writes are detected safely

User Story

As a developer building an event-sourced system, I want SaveEvent to reject any event with version <= current version for that actor, so that I can detect when another writer won a race and handle it appropriately.

Acceptance Criteria

  • SaveEvent accepts event with Version > current for actor
  • SaveEvent rejects event with Version <= current (returns VersionConflictError)
  • VersionConflictError contains ActorID, AttemptedVersion, CurrentVersion
  • First event for new actor must have Version > 0 (typically 1)
  • Version gaps are allowed (1, 3, 5 is valid)
  • Validation happens before persistence (fail-fast)
  • InMemoryEventStore and JetStreamEventStore both implement validation

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Core)

Aggregate: ActorEventStream (implicit; each actor has independent version sequence)

Command: SaveEvent(event)

Validation Rules:

  • If no events exist for actor: version must be > 0
  • If events exist: new version must be > latest version

Success Event: EventStored (published when SaveEvent succeeds)

Error Event: VersionConflict (triggered when version validation fails)

Technical Notes

  • Version validation is the core invariant; everything else depends on it
  • Use GetLatestVersion() to implement validation
  • No database-level locks; optimistic validation only
  • Conflict should fail in <1ms
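
A minimal in-memory sketch of this command follows; the type, field, and constructor names are illustrative assumptions, not the actual Aether implementation (the real definitions live in event.go and the store implementations).

```go
package sketch

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the Aether types referenced in this issue.
type Event struct {
	ActorID string
	Version int64
	Data    []byte
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict on %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// InMemoryEventStore keeps per-actor event slices plus the latest version.
type InMemoryEventStore struct {
	mu            sync.Mutex
	events        map[string][]*Event
	latestVersion map[string]int64
}

func NewInMemoryEventStore() *InMemoryEventStore {
	return &InMemoryEventStore{
		events:        make(map[string][]*Event),
		latestVersion: make(map[string]int64),
	}
}

// SaveEvent enforces the monotonic-version invariant before appending.
func (s *InMemoryEventStore) SaveEvent(event *Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	current := s.latestVersion[event.ActorID] // zero value 0 when the actor is new
	if event.Version <= current {
		// Fail fast with full context; the caller owns any retry strategy.
		return &VersionConflictError{
			ActorID:          event.ActorID,
			AttemptedVersion: event.Version,
			CurrentVersion:   current,
		}
	}

	s.events[event.ActorID] = append(s.events[event.ActorID], event)
	s.latestVersion[event.ActorID] = event.Version
	return nil
}
```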

Test Cases

  • New actor, version 1: succeeds
  • Same actor, version 2 (after 1): succeeds
  • Same actor, version 2 (after 1, concurrent): second call fails
  • Same actor, version 1 (duplicate): fails
  • Same actor, version 0 or negative: fails
  • 100 concurrent writers: 1 succeeds, 99 fail with VersionConflictError

Dependencies

  • None (foundation)

Issue 1.2: [Rule] Enforce append-only and immutability invariants

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Enforce event immutability and append-only semantics

User Story

As a system architect, I need the system to guarantee events are immutable and append-only, so that the event stream is a reliable audit trail and cannot be corrupted by updates.

Acceptance Criteria

  • EventStore interface has no Update or Delete methods
  • Events cannot be modified after persistence
  • Replay of same events always produces same state
  • Corrupted events are reported (not silently skipped)
  • JetStream stream configuration prevents deletes (retention policy only)

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Core Invariant)

Aggregate: ActorEventStream

Invariant: Events are immutable; stream is append-only; no modifications to EventStore interface

Implementation:

  • Event struct has no Setters (only getters)
  • SaveEvent is the only public persistence method
  • JetStream streams configured with NoDelete policy

Technical Notes

  • This is enforced at interface level (no Update/Delete in EventStore)
  • JetStream configuration prevents accidental deletes
  • ReplayError allows visibility into corruption without losing good data

Test Cases

  • Attempt to modify Event.Data after creation: compile error (if immutable)
  • Attempt to call UpdateEvent: interface doesn't exist
  • JetStream stream created with correct retention policy
  • ReplayError captured when event unmarshaling fails

Dependencies

  • Depends on: Issue 1.1 (SaveEvent implementation)

Issue 1.3: [Event] Publish EventStored after successful save

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Emit EventStored event for persistence observability

User Story

As an application component, I want to be notified when an event is successfully persisted, so that I can trigger downstream workflows (caching, metrics, projections).

Acceptance Criteria

  • EventStored event published after SaveEvent succeeds
  • EventStored contains: EventID, ActorID, Version, Timestamp
  • No EventStored published if SaveEvent fails
  • EventBus receives EventStored in same transaction context
  • Metrics increment for each EventStored

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature

Event: EventStored(eventID, actorID, version, timestamp)

Triggered by: Successful SaveEvent call

Consumers: Metrics collectors, projections, audit systems

Technical Notes

  • EventStored is an internal event (Aether infrastructure)
  • Published to local EventBus (see Phase 2 for cross-node)
  • Allows observability without coupling application code

Test Cases

  • Save event → EventStored published
  • Version conflict → no EventStored published
  • Multiple saves → multiple EventStored events in order

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)
  • Depends on: Phase 2, Issue 2.1 (EventBus.Publish)

Issue 1.4: [Event] Publish VersionConflict error with full context

Type: New Feature Bounded Context: Event Sourcing, Optimistic Concurrency Control Priority: P0

Title: Return detailed version conflict information for retry logic

User Story

As an application developer, I want VersionConflictError to include CurrentVersion and ActorID, so that I can implement intelligent retry logic (exponential backoff, circuit-breaker).

Acceptance Criteria

  • VersionConflictError struct contains: ActorID, AttemptedVersion, CurrentVersion
  • Error message is human-readable with all context
  • errors.Is(err, ErrVersionConflict) returns true for the sentinel check
  • errors.As(err, &versionErr) unpacks the error into a VersionConflictError
  • Application can read CurrentVersion to decide retry strategy

Bounded Context: Event Sourcing + OCC

DDD Implementation Guidance

Type: New Feature

Error Type: VersionConflictError (wraps ErrVersionConflict sentinel)

Data: ActorID, AttemptedVersion, CurrentVersion

Use: Application uses this to implement retry strategies

Technical Notes

  • Already implemented in /aether/event.go (VersionConflictError struct)
  • Document standard retry patterns in examples/
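
A caller-side sketch of the sentinel-plus-typed-error pattern this issue describes; only the names ErrVersionConflict and VersionConflictError are taken from the issue, the Unwrap wiring and helper below are illustrative.

```go
package sketch

import (
	"errors"
	"fmt"
	"log"
)

// Stand-ins for the names described in this issue; the real definitions
// live in /aether/event.go.
var ErrVersionConflict = errors.New("version conflict")

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict on %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// Unwrap ties the typed error to the sentinel so errors.Is works.
func (e *VersionConflictError) Unwrap() error { return ErrVersionConflict }

// handleSaveError shows the caller-side checks from the acceptance criteria.
func handleSaveError(err error) (shouldRetry bool) {
	if err == nil {
		return false
	}
	if !errors.Is(err, ErrVersionConflict) {
		return false // some other storage failure; don't retry blindly
	}
	var conflict *VersionConflictError
	if errors.As(err, &conflict) {
		// CurrentVersion tells the caller what to rebase onto before retrying.
		log.Printf("lost write race on actor %s: attempted %d, current %d",
			conflict.ActorID, conflict.AttemptedVersion, conflict.CurrentVersion)
	}
	return true
}
```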

Test Cases

  • Conflict with detailed error: ActorID, versions present
  • Application reads CurrentVersion: succeeds
  • errors.Is(err, ErrVersionConflict): true
  • errors.As(err, &versionErr): works
  • Manual test: log the error, see all context

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Issue 1.5: [Read Model] Implement GetLatestVersion query

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Provide efficient version lookup for optimistic locking

User Story

As an application, I want to efficiently query the latest version for an actor without fetching all events, so that I can implement optimistic locking with minimal overhead.

Acceptance Criteria

  • GetLatestVersion(actorID) returns latest version or 0 if no events
  • Execution time is O(1) or O(log n), not O(n)
  • InMemoryEventStore implements with map lookup
  • JetStreamEventStore caches latest version per actor
  • Cache is invalidated after each SaveEvent
  • Multiple calls for same actor within 1s hit cache
  • Namespace isolation: GetLatestVersion scoped to namespace

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: ActorVersionIndex

Source Events: SaveEvent (updates cache)

Data: ActorID → LatestVersion

Performance: O(1) lookup after SaveEvent

Technical Notes

  • InMemoryEventStore: use map[actorID]int64
  • JetStreamEventStore: query JetStream metadata OR maintain cache
  • Cache invalidation: update after every SaveEvent
  • Thread-safe with RWMutex (read-heavy)
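
A minimal sketch of the read model described in these notes, using an illustrative ActorVersionIndex type rather than the actual store internals.

```go
package sketch

import "sync"

// ActorVersionIndex maps actorID -> latest version, guarded by an RWMutex
// because reads dominate writes.
type ActorVersionIndex struct {
	mu       sync.RWMutex
	versions map[string]int64
}

func NewActorVersionIndex() *ActorVersionIndex {
	return &ActorVersionIndex{versions: make(map[string]int64)}
}

// GetLatestVersion is O(1); an unknown actor reads as version 0.
func (i *ActorVersionIndex) GetLatestVersion(actorID string) int64 {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.versions[actorID]
}

// Record is called after every successful SaveEvent to keep the cache fresh.
func (i *ActorVersionIndex) Record(actorID string, version int64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if version > i.versions[actorID] {
		i.versions[actorID] = version
	}
}
```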

Test Cases

  • New actor: GetLatestVersion returns 0
  • After SaveEvent(version: 1): GetLatestVersion returns 1
  • After SaveEvent(version: 3): GetLatestVersion returns 3
  • Concurrent reads from same actor: all return consistent value
  • Namespace isolation: "tenant-a" and "tenant-b" have independent versions

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Feature Set 1b: State Rebuild from Event History

Capability: Rebuild State from Event History

Description: Applications can reconstruct any actor state by replaying events from a starting version. Snapshots optimize replay for long-lived actors.

Success Condition: GetEvents(actorID, 0) returns all events in order; replaying produces consistent state every time; snapshots reduce replay time from O(n) to O(1).


Issue 1.6: [Command] Implement GetEvents for replay

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Load events from store for state replay

User Story

As a developer, I want to retrieve all events for an actor from a starting version forward, so that I can replay them to reconstruct the actor's state.

Acceptance Criteria

  • GetEvents(actorID, fromVersion) returns []*Event in version order
  • Events are ordered by version (ascending)
  • fromVersion is inclusive (GetEvents(actorID, 5) includes version 5)
  • If no events exist, returns empty slice (not error)
  • If actorID has no events >= fromVersion, returns empty slice
  • Namespace isolation: GetEvents scoped to namespace
  • Large result sets don't cause memory issues (stream if >10k events)

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Query)

Command: GetEvents(actorID, fromVersion)

Returns: []*Event ordered by version

Invariant: Order is deterministic (version order always)

Technical Notes

  • InMemoryEventStore: filter and sort by version
  • JetStreamEventStore: query JetStream subject and order results
  • Consider pagination for very large actor histories
  • fromVersion=0 means "start from beginning"
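
Extending the hypothetical in-memory store sketched under Issue 1.1 (with "sort" added to its imports), one possible GetEvents that keeps the inclusive fromVersion and empty-result semantics:

```go
// GetEvents returns the actor's events at or after fromVersion, in ascending
// version order. An unknown actor yields an empty result, not an error.
// Sorting makes version order explicit even if events were appended out of order.
func (s *InMemoryEventStore) GetEvents(actorID string, fromVersion int64) ([]*Event, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	result := make([]*Event, 0)
	for _, e := range s.events[actorID] {
		if e.Version >= fromVersion { // fromVersion is inclusive; 0 means "from the beginning"
			result = append(result, e)
		}
	}
	sort.Slice(result, func(a, b int) bool { return result[a].Version < result[b].Version })
	return result, nil
}
```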

Test Cases

  • GetEvents(actorID, 0) with 5 events: returns all 5 in order
  • GetEvents(actorID, 3) with 5 events: returns events 3, 4, 5
  • GetEvents(nonexistent, 0): returns empty slice
  • GetEvents with gap (versions 1, 3, 5): returns only those 3
  • Order is guaranteed (version order, not insertion order)

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Issue 1.7: [Rule] Define and enforce snapshot validity

Type: New Feature Bounded Context: Event Sourcing Priority: P1

Title: Implement snapshot invalidation policy

User Story

As an operator, I want snapshots to automatically invalidate after a certain version gap, so that stale snapshots don't become a source of bugs and disk bloat.

Acceptance Criteria

  • Snapshot valid until Version + MaxVersionGap (default 1000)
  • GetLatestSnapshot returns nil if no snapshot or invalid
  • Application can override MaxVersionGap in config
  • Snapshot timestamp recorded for debugging
  • No automatic cleanup; application calls SaveSnapshot to create
  • Tests confirm snapshot invalidation logic

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Policy)

Aggregate: ActorSnapshot + SnapshotPolicy

Policy: Snapshot is valid only if (CurrentVersion - SnapshotVersion) <= MaxVersionGap

Implementation:

  • SnapshotStore.GetLatestSnapshot validates before returning
  • If invalid, returns nil; application must replay

Technical Notes

  • This is a safety policy; prevents stale snapshots
  • Application owns decision to create snapshots (no auto-triggering)
  • MaxVersionGap is tunable per deployment
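
A minimal sketch of the validity policy; the ActorSnapshot fields follow Issue 1.10, the function name is illustrative.

```go
package sketch

import "time"

// ActorSnapshot mirrors the value object described in Issue 1.10.
type ActorSnapshot struct {
	ActorID   string
	Version   int64
	State     []byte
	Timestamp time.Time
}

// snapshotIsValid applies the policy above: a snapshot is usable only if it
// is no more than maxVersionGap versions behind the current version.
// E.g. a snapshot at version 10 with maxVersionGap=100 is valid at current
// version 110 but invalid at 111.
func snapshotIsValid(snap *ActorSnapshot, currentVersion, maxVersionGap int64) bool {
	if snap == nil {
		return false // no snapshot: the caller replays from version 0
	}
	return currentVersion-snap.Version <= maxVersionGap
}
```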

Test Cases

  • Snapshot at version 10, MaxGap=100, current=50: valid
  • Snapshot at version 10, MaxGap=100, current=111: invalid
  • Snapshot at version 10, MaxGap=100, current=110: valid
  • GetLatestSnapshot returns nil for invalid snapshot

Dependencies

  • Depends on: Issue 1.6 (GetEvents)

Issue 1.8: [Event] Publish SnapshotCreated for observability

Type: New Feature Bounded Context: Event Sourcing Priority: P1

Title: Emit snapshot creation event for lifecycle tracking

User Story

As a system operator, I want to be notified when snapshots are created, so that I can monitor snapshot creation rates and catch runaway snapshotting.

Acceptance Criteria

  • SnapshotCreated event published after SaveSnapshot succeeds
  • Event contains: ActorID, Version, SnapshotTimestamp, ReplayDuration
  • Metrics increment for snapshot creation
  • No event if SaveSnapshot fails
  • Example: Snapshot created every 1000 versions

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Event)

Event: SnapshotCreated(actorID, version, timestamp, replayDurationMs)

Triggered by: SaveSnapshot call succeeds

Consumers: Metrics, monitoring dashboards

Technical Notes

  • SnapshotCreated is infrastructure event (like EventStored)
  • ReplayDuration helps identify slow actors that need more frequent snapshots

Test Cases

  • SaveSnapshot succeeds → SnapshotCreated published
  • SaveSnapshot fails → no event published
  • ReplayDuration recorded accurately

Dependencies

  • Depends on: Issue 1.10 (SnapshotStore interface)

Issue 1.9: [Read Model] Implement GetEventsWithErrors for robust replay

Type: New Feature Bounded Context: Event Sourcing Priority: P1

Title: Handle corrupted events during replay without data loss

User Story

As a developer, I want GetEventsWithErrors to return both good events and corruption details, so that I can tolerate partial data corruption and still process clean events.

Acceptance Criteria

  • GetEventsWithErrors(actorID, fromVersion) returns ReplayResult
  • ReplayResult contains: []*Event (good) and []ReplayError (bad)
  • Good events are returned in order despite errors
  • ReplayError contains: SequenceNumber, RawData, UnmarshalError
  • Application decides how to handle corrupted events
  • Metrics track corruption frequency

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Query)

Interface: EventStoreWithErrors extends EventStore

Method: GetEventsWithErrors(actorID, fromVersion) → ReplayResult

Data:

  • ReplayResult.Events: successfully deserialized events
  • ReplayResult.Errors: corruption records
  • ReplayResult.HasErrors(): convenience check

Technical Notes

  • Already defined in event.go (ReplayError, ReplayResult)
  • JetStreamEventStore should implement EventStoreWithErrors
  • Application uses HasErrors() to decide on recovery action

Test Cases

  • All good events: ReplayResult.Events populated, no errors
  • Corrupted event in middle: good events before/after, one error recorded
  • Multiple corruptions: all recorded with context
  • Application can inspect RawData for forensics

Dependencies

  • Depends on: Issue 1.6 (GetEvents)

Issue 1.10: [Interface] Implement SnapshotStore interface

Type: New Feature Bounded Context: Event Sourcing Priority: P0

Title: Define snapshot storage contract

User Story

As a developer, I want a clean interface for snapshot operations, so that I can implement custom snapshot storage (Redis, PostgreSQL, S3).

Acceptance Criteria

  • SnapshotStore extends EventStore
  • GetLatestSnapshot(actorID) returns ActorSnapshot or nil
  • SaveSnapshot(snapshot) persists snapshot
  • ActorSnapshot contains: ActorID, Version, State, Timestamp
  • Namespace isolation: snapshots scoped to namespace
  • Tests verify interface contract

Bounded Context: Event Sourcing

DDD Implementation Guidance

Type: New Feature (Interface)

Interface: SnapshotStore extends EventStore

Methods:

  • GetLatestSnapshot(actorID) → (*ActorSnapshot, error)
  • SaveSnapshot(snapshot) → error

Aggregates: ActorSnapshot (value object)

Technical Notes

  • Already defined in event.go
  • Need implementations: InMemorySnapshotStore, JetStreamSnapshotStore
  • Keep snapshots in same store as events (co-located)

Test Cases

  • SaveSnapshot persists; GetLatestSnapshot retrieves it
  • New actor: GetLatestSnapshot returns nil
  • Multiple snapshots: only latest returned
  • Namespace isolation: snapshots from tenant-a don't appear in tenant-b

Dependencies

  • Depends on: Issue 1.1 (SaveEvent + storage foundation)

Feature Set 1c: Optimistic Concurrency Control

Capability: Enable Safe Concurrent Writes

Description: Multiple writers can update the same actor safely using optimistic locking. Application controls retry strategy.

Success Condition: Two concurrent writers race; one succeeds, the other sees a VersionConflictError; the application retries without locks.


Issue 1.11: [Rule] Enforce fail-fast on version conflict

Type: New Feature Bounded Context: Optimistic Concurrency Control Priority: P0

Title: Fail immediately on version conflict; no auto-retry

User Story

As an application developer, I need SaveEvent to fail fast on conflict without retrying, so that I control my retry strategy (backoff, circuit-break, etc.).

Acceptance Criteria

  • SaveEvent returns VersionConflictError immediately on mismatch
  • No built-in retry loop in SaveEvent
  • No database-level locks held
  • Application reads VersionConflictError and decides retry
  • Default retry strategy documented (examples/)

Bounded Context: Optimistic Concurrency Control

DDD Implementation Guidance

Type: New Feature (Policy)

Invariant: Conflicts trigger immediate failure; application owns retry

Implementation:

  • SaveEvent: version check, return error if mismatch, done
  • No loop, no backoff, no retries
  • Clean error with context for caller

Technical Notes

  • This is a design choice: fail-fast enables flexible retry strategies
  • Application can choose exponential backoff, jitter, circuit-breaker, etc.

Test Cases

  • SaveEvent(version: 2) when current=2: fails immediately
  • No retry attempted by library
  • Application can retry if desired
  • Example patterns in examples/retry.go

Dependencies

  • Depends on: Issue 1.1 (SaveEvent)

Issue 1.12: [Documentation] Document concurrent write patterns

Type: New Feature Bounded Context: Optimistic Concurrency Control Priority: P1

Title: Provide retry strategy examples (backoff, circuit-breaker, queue)

User Story

As a developer using OCC, I want to see working examples of retry strategies, so that I can confidently implement safe concurrent writes in my application.

Acceptance Criteria

  • examples/retry_exponential_backoff.go
  • examples/retry_circuit_breaker.go
  • examples/retry_queue_based.go
  • examples/concurrent_write_test.go showing patterns
  • README mentions OCC patterns
  • Each example is >100 lines with explanation

Bounded Context: Optimistic Concurrency Control

DDD Implementation Guidance

Type: Documentation

Artifacts:

  • examples/retry_exponential_backoff.go
  • examples/retry_circuit_breaker.go
  • examples/retry_queue_based.go
  • examples/concurrent_write_test.go

Content:

  • How to read VersionConflictError
  • When to retry (idempotent operations)
  • When not to retry (non-idempotent)
  • Backoff strategies
  • Monitoring

Technical Notes

  • Real, runnable code (not pseudocode)
  • Show metrics collection
  • Show when to give up
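
A condensed sketch of the exponential-backoff example these docs should ship; saveFn and the sentinel stand-in are assumptions for illustration.

```go
package sketch

import (
	"errors"
	"math/rand"
	"time"
)

// ErrVersionConflict stands in for the library sentinel from Issue 1.4.
var ErrVersionConflict = errors.New("version conflict")

// retryWithBackoff retries an idempotent save on version conflicts only.
// saveFn stands in for the application's "reload state, re-apply command,
// SaveEvent" step.
func retryWithBackoff(saveFn func() error, maxAttempts int) error {
	backoff := 10 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = saveFn(); err == nil {
			return nil
		}
		if !errors.Is(err, ErrVersionConflict) {
			return err // only version conflicts are worth retrying
		}
		// Sleep a random slice of the current backoff (full jitter) so
		// concurrent retriers don't collide on the same schedule.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff)) + 1))
		backoff *= 2
	}
	return err // give up; surface the last conflict to the caller
}
```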

Test Cases

  • Examples compile without error
  • Examples use idempotent operations
  • Test coverage for examples

Dependencies

  • Depends on: Issue 1.11 (fail-fast behavior)

Phase 2: Local Event Bus

Feature Set 2a: Event Routing and Filtering

Capability: Route and Filter Domain Events

Description: Events published to a namespace reach all subscribers of that namespace. Subscribers can filter by event type or actor pattern.

Success Condition: Publish event → exact subscriber receives, wildcard subscriber receives, filtered subscriber receives only if match.


Issue 2.1: [Command] Implement Publish to local subscribers

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Publish events to local subscribers

User Story

As an application component, I want to publish domain events to a namespace, so that all local subscribers are notified without tight coupling.

Acceptance Criteria

  • Publish(namespaceID, event) sends to all subscribers of that namespace
  • Exact subscribers (namespace="orders") receive event
  • Wildcard subscribers (namespace="order*") receive matching events
  • Events delivered in-process (no NATS yet)
  • Buffered channels (100-event buffer) prevent blocking
  • Delivery to a full subscriber drops the event non-blocking (no deadlock)
  • Metrics track publish count, receive count, dropped count

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Command)

Command: Publish(namespaceID, event)

Invariant: All subscribers matching namespace receive event

Implementation:

  • Iterate exact subscribers for namespace
  • Iterate wildcard subscribers matching pattern
  • Deliver to each (non-blocking, buffered)
  • Count drops

Technical Notes

  • EventBus in eventbus.go already implements this
  • Ensure buffered channels don't cause memory leaks
  • Metrics important for observability

Test Cases

  • Publish to "orders": exact subscriber of "orders" receives
  • Publish to "orders.new": wildcard subscriber of "order*" receives
  • Publish to "payments": subscriber to "orders" does NOT receive
  • Subscriber with full buffer: event dropped (non-blocking)
  • 1000 publishes: metrics accurate

Dependencies

  • Depends on: Issue 2.2 (Subscribe)

Issue 2.2: [Command] Implement Subscribe with optional filter

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Register subscriber with optional event filter

User Story

As an application component, I want to subscribe to a namespace pattern with optional event filter, so that I receive only events I care about.

Acceptance Criteria

  • Subscribe(namespacePattern) returns <-chan *Event
  • SubscribeWithFilter(namespacePattern, filter) returns filtered channel
  • Filter supports EventTypes ([]string) and ActorPattern (string)
  • Filters applied client-side (subscriber decides)
  • Wildcard patterns work: "*" matches single token, ">" matches multiple
  • Subscription channel is buffered (100 events)
  • Unsubscribe(namespacePattern, ch) removes subscription

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Command)

Command: Subscribe(namespacePattern), SubscribeWithFilter(namespacePattern, filter)

Invariants:

  • Namespace pattern determines which namespaces
  • Filter determines which events within namespace
  • Both work together (AND logic)

Filter Types:

  • EventTypes: []string (e.g., ["OrderPlaced", "OrderShipped"])
  • ActorPattern: string (e.g., "order-customer-*")

Technical Notes

  • Pattern matching follows NATS conventions
  • Filters are optional (nil filter = all events)
  • Client-side filtering is efficient (NATS does server-side)

Test Cases

  • Subscribe("orders"): exact match only
  • Subscribe("order*"): wildcard match
  • Subscribe("order.*"): NATS-style wildcard
  • SubscribeWithFilter("orders", {EventTypes: ["OrderPlaced"]}): filter works
  • SubscribeWithFilter("orders", {ActorPattern: "order-123"}): actor filter works
  • Unsubscribe closes channel

Dependencies

  • Depends on: Issue 1.1 (events structure)

Issue 2.3: [Rule] Enforce exact subscription isolation

Type: New Feature Bounded Context: Event Bus + Namespace Isolation Priority: P1

Title: Guarantee exact namespace subscriptions are isolated

User Story

As an application owner, I need to guarantee that exact subscribers to namespace "tenant-a" never receive events from "tenant-b", so that I can enforce data isolation at the EventBus level.

Acceptance Criteria

  • Subscriber to "tenant-a" receives events from "tenant-a" only
  • Subscriber to "tenant-a" does NOT receive from "tenant-b"
  • Wildcard subscriber to "tenant*" receives from both
  • Exact-match subscribers remain isolated even when wildcard subscribers exist
  • Tests verify isolation with multi-namespace setup
  • Documentation warns about wildcard security implications

Bounded Context: Event Bus + Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Policy/Invariant)

Invariant: Exact subscriptions are isolated

Implementation:

  • exactSubscribers map[namespace][]*subscription
  • Wildcard subscriptions separate collection
  • Publish checks exact first, then wildcard patterns

Security Note: Wildcard subscriptions bypass isolation intentionally (for logging, monitoring, etc.)

Technical Notes

  • Enforced at EventBus.Publish level
  • Exact match is simple string equality
  • Wildcard uses MatchNamespacePattern helper
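
A sketch of NATS-style pattern matching as described in Issue 2.2 ("*" matches one token, ">" matches the rest); the real MatchNamespacePattern helper may differ, e.g. by also accepting prefix globs such as "tenant-*".

```go
package sketch

import "strings"

// matchNamespacePattern compares a namespace against a NATS-style pattern:
// "." separates tokens, "*" matches exactly one token, ">" matches one or
// more trailing tokens.
func matchNamespacePattern(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")

	for i, tok := range p {
		if tok == ">" {
			return len(n) > i // ">" swallows everything that remains
		}
		if i >= len(n) {
			return false
		}
		if tok != "*" && tok != n[i] {
			return false
		}
	}
	return len(p) == len(n) // no ">": token counts must match exactly
}
```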

Test Cases

  • Publish to "tenant-a": only "tenant-a" exact subscribers get it
  • Publish to "tenant-b": only "tenant-b" exact subscribers get it
  • Publish to "tenant-a": "tenant*" wildcard subscriber gets it
  • Publish to "tenant-a": "tenant-b" exact subscriber does NOT get it

Dependencies

  • Depends on: Issue 2.2 (Subscribe)

Issue 2.4: [Rule] Document wildcard subscription security

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Document that wildcard subscriptions bypass isolation

User Story

As an architect, I need clear documentation that wildcard subscriptions receive events across all namespaces, so that I can make informed security decisions.

Acceptance Criteria

  • eventbus.go comments explain wildcard behavior
  • Security warning in Subscribe godoc
  • Example showing wildcard usage for logging
  • Example showing why wildcard is dangerous (if not restricted)
  • README mentions namespace isolation caveats
  • Examples show proper patterns (monitoring, auditing)

Bounded Context: Event Bus

DDD Implementation Guidance

Type: Documentation

Content:

  • Wildcard subscriptions receive all matching events
  • Use for cross-cutting concerns (logging, monitoring, audit)
  • Restrict access to trusted components
  • Never expose wildcard pattern to untrusted users

Examples:

  • Monitoring system subscribes to ">"
  • Audit system subscribes to "tenant-*"
  • Application logic uses exact subscriptions only

Technical Notes

  • Intentional design; not a bug
  • Different from NATS server-side filtering

Test Cases

  • Examples compile
  • Documentation is clear and accurate

Dependencies

  • Depends on: Issue 2.3 (exact isolation)

Issue 2.5: [Event] Publish SubscriptionCreated for tracking

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Track subscription lifecycle

User Story

As an operator, I want to see when subscriptions are created and destroyed, so that I can monitor subscriber health and debug connection issues.

Acceptance Criteria

  • SubscriptionCreated event published on Subscribe
  • SubscriptionDestroyed event published on Unsubscribe
  • Event contains: namespacePattern, filterCriteria, timestamp
  • Metrics increment on subscribe/unsubscribe
  • SubscriberCount(namespace) returns current count

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Event)

Event: SubscriptionCreated(namespacePattern, filter, timestamp)

Event: SubscriptionDestroyed(namespacePattern, timestamp)

Metrics: Subscriber count per namespace

Technical Notes

  • SubscriberCount already in eventbus.go
  • Add events to EventBus.Subscribe and EventBus.Unsubscribe
  • Internal events (infrastructure)

Test Cases

  • Subscribe → metrics increment
  • Unsubscribe → metrics decrement
  • SubscriberCount correct

Dependencies

  • Depends on: Issue 2.2 (Subscribe/Unsubscribe)

Issue 2.6: [Event] Publish EventPublished for delivery tracking

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Record event publication metrics

User Story

As an operator, I want metrics on events published, delivered, and dropped, so that I can detect bottlenecks and subscriber health issues.

Acceptance Criteria

  • EventPublished event published on Publish
  • Metrics track: published count, delivered count, dropped count per namespace
  • Dropped events (full channel) recorded
  • Application can query metrics via Metrics()
  • Example: 1000 events published, 995 delivered, 5 dropped

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Event/Metrics)

Event: EventPublished (infrastructure event)

Metrics:

  • PublishCount[namespace]
  • DeliveryCount[namespace]
  • DroppedCount[namespace]

Implementation:

  • RecordPublish(namespace)
  • RecordReceive(namespace)
  • RecordDroppedEvent(namespace)

Technical Notes

  • Metrics already in DefaultMetricsCollector
  • RecordDroppedEvent signals subscriber backpressure
  • Can be used to auto-scale subscribers

Test Cases

  • Publish 100 events: metrics show 100 published
  • All delivered: metrics show 100 delivered
  • Full subscriber: next event dropped, metrics show 1 dropped
  • Query via bus.Metrics(): values accurate

Dependencies

  • Depends on: Issue 2.1 (Publish)

Issue 2.7: [Read Model] Implement GetSubscriptions query

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Query active subscriptions for operational visibility

User Story

As an operator, I want to list all active subscriptions, including patterns and filters, so that I can debug event routing and monitor subscriber health.

Acceptance Criteria

  • GetSubscriptions() returns []SubscriptionInfo
  • SubscriptionInfo contains: pattern, filter, subscriberID, createdAt
  • Works for both exact and wildcard subscriptions
  • Metrics accessible via SubscriberCount(namespace)
  • Example: "What subscriptions are listening to 'orders'?"

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: SubscriptionRegistry

Data:

  • Pattern: namespace pattern (e.g., "tenant-*")
  • Filter: optional filter criteria
  • SubscriberID: unique ID for each subscription
  • CreatedAt: timestamp

Implementation:

  • Track subscriptions in eventbus.go
  • Expose via GetSubscriptions() method

Technical Notes

  • Useful for debugging
  • Optional feature; not critical

Test Cases

  • Subscribe to "orders": GetSubscriptions shows it
  • Subscribe to "order*": GetSubscriptions shows it
  • Unsubscribe: GetSubscriptions removes it
  • Multiple subscribers: all listed

Dependencies

  • Depends on: Issue 2.2 (Subscribe)

Feature Set 2b: Buffering and Backpressure

Capability: Route and Filter Domain Events (non-blocking delivery)

Description: Event publication is non-blocking; full subscriber buffers cause events to be dropped (not delayed).

Success Condition: Publish returns immediately; dropped events recorded in metrics; subscriber never blocks publisher.


Issue 2.8: [Rule] Implement non-blocking event delivery

Type: New Feature Bounded Context: Event Bus Priority: P1

Title: Ensure event publication never blocks

User Story

As a publisher, I need events to be delivered non-blocking, so that a slow subscriber doesn't delay my operations.

Acceptance Criteria

  • Publish(namespace, event) returns immediately
  • If subscriber channel full, event dropped (non-blocking)
  • Dropped events counted in metrics
  • Buffered channel size is 100 (tunable)
  • Publisher never waits for subscriber
  • Metrics alert on high drop rate

Bounded Context: Event Bus

DDD Implementation Guidance

Type: New Feature (Policy)

Invariant: Publishers not blocked by slow subscribers

Implementation:

  • select { case ch <- event: ... default: ... }
  • Count drops in default case

Trade-off:

  • Pro: Publisher never blocks
  • Con: Events may be lost if subscriber can't keep up
  • Mitigation: Metrics alert on drops; subscriber can increase buffer or retry

Technical Notes

  • Already implemented in eventbus.go (deliverToSubscriber)
  • 100-event buffer is reasonable default
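
A minimal sketch of the select/default delivery path named above; type and counter names are illustrative, not the eventbus.go internals.

```go
package sketch

import "sync/atomic"

// Event is a minimal stand-in for the Aether event type.
type Event struct {
	Type string
	Data []byte
}

// deliveryMetrics keeps the counters a metrics collector would track.
type deliveryMetrics struct {
	delivered atomic.Int64
	dropped   atomic.Int64
}

// deliver pushes one event to one subscriber channel without ever blocking
// the publisher: a full buffer means the event is dropped and counted.
func deliver(ch chan<- *Event, event *Event, m *deliveryMetrics) {
	select {
	case ch <- event:
		m.delivered.Add(1)
	default:
		m.dropped.Add(1) // backpressure signal; alert when this grows
	}
}
```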

Test Cases

  • Subscribe, receive 100 events: no drops
  • Publish 101st event immediately: dropped
  • Metrics show drop count
  • Publisher latency < 1ms regardless of subscribers

Dependencies

  • Depends on: Issue 2.1 (Publish)

Issue 2.9: [Documentation] Document EventBus backpressure handling

Type: New Feature Bounded Context: Event Bus Priority: P2

Title: Explain buffer management and recovery from drops

User Story

As a developer, I want to understand what happens when event buffers fill up, so that I can design robust event handlers.

Acceptance Criteria

  • Document buffer size (100 events default)
  • Explain what happens on overflow (event dropped)
  • Document recovery patterns (subscriber restarts, re-syncs)
  • Example: Subscriber catches up from JetStream after restart
  • Metrics to monitor (drop rate)
  • README section on backpressure

Bounded Context: Event Bus

DDD Implementation Guidance

Type: Documentation

Content:

  • Buffer size and behavior
  • Drop semantics
  • Recovery patterns
  • Metrics to monitor
  • When to increase buffer size

Examples:

  • Slow subscriber: increase buffer or fix handler
  • Network latency: events may be dropped
  • Handler panics: subscriber must restart and re-sync

Technical Notes

  • Events are lost if dropped; only durable via JetStream
  • Phase 4 (NATS event delivery) addresses durability

Test Cases

  • Documentation is clear
  • Examples work

Dependencies

  • Depends on: Issue 2.8 (non-blocking delivery)

Phase 3: Cluster Coordination

Feature Set 3a: Cluster Topology and Leadership

Capability: Coordinate Cluster Topology

Description: Cluster automatically discovers nodes, elects a leader, and detects failures. One leader holds a time-bound lease.

Success Condition: Three nodes start; one elected leader within 5s; leader's lease renews; lease expiration triggers re-election; failed node detected within 90s.


Issue 3.1: [Command] Implement JoinCluster protocol

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Enable node discovery via cluster join

User Story

As a deployment, I want new nodes to announce themselves and discover peers, so that the cluster topology updates automatically.

Acceptance Criteria

  • JoinCluster() announces node via NATS
  • Node info contains: NodeID, Address, Timestamp, Status
  • Other nodes receive join announcement
  • Cluster topology updated atomically
  • Rejoining node detected and updated
  • Tests verify multi-node discovery

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: JoinCluster()

Aggregates: Cluster (group of nodes)

Events: NodeJoined(nodeID, address, timestamp)

Technical Notes

  • NATS subject: "aether.cluster.nodes"
  • NodeDiscovery subscribes to announcements
  • ClusterManager.Start() initiates join

Test Cases

  • Single node joins: topology = [node-a]
  • Second node joins: topology = [node-a, node-b]
  • Third node joins: topology = [node-a, node-b, node-c]
  • Node rejoins: updates existing entry

Dependencies

  • None (first cluster feature)

Issue 3.2: [Command] Implement LeaderElection

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Elect single leader via NATS-based voting

User Story

As a cluster, I want one node to be elected leader so that it can coordinate shard assignments and rebalancing.

Acceptance Criteria

  • LeaderElection holds election every HeartbeatInterval (5s)
  • Nodes announce their own candidacy (no vote counting; the first announcement wins)
  • One leader elected per term
  • Leader holds lease (TTL = 2 * HeartbeatInterval)
  • All nodes converge on same leader
  • Lease renewal happens automatically

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: ElectLeader()

Aggregates: LeadershipLease (time-bound authority)

Events: LeaderElected(leaderID, term, leaseExpiration)

Technical Notes

  • NATS subject: "aether.cluster.election"
  • Each node publishes heartbeat with NodeID, Timestamp
  • First node to publish becomes leader
  • Lease expires if no heartbeat for TTL

Test Cases

  • Single node: elected immediately
  • Three nodes: exactly one elected
  • Leader dies: remaining nodes elect new leader within 2*interval
  • Former leader rejoins: it may or may not regain leadership

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.3: [Rule] Enforce single leader invariant

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Guarantee exactly one leader at any time

User Story

As a system, I need to ensure only one node is leader, so that coordination operations (shard assignment) are deterministic and don't conflict.

Acceptance Criteria

  • At most one leader at any time (lease-based)
  • If leader lease expires, no leader until re-election
  • All nodes see same leader (or none)
  • Tests verify invariant under various failure scenarios
  • Split-brain window is bounded by the lease TTL (any dual leadership ends when the lease expires)

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: At most one leader (enforced by lease TTL)

Mechanism:

  • Leader publishes heartbeat every HeartbeatInterval
  • Other nodes trust leader if heartbeat < HeartbeatInterval old
  • If no heartbeat for 2*HeartbeatInterval, lease expired
  • New election begins

Technical Notes

  • Lease-based; not consensus-based (simpler)
  • Allows temporary split-brain until lease expires
  • Acceptable for Aether (eventual consistency)
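
A minimal sketch of the lease-expiry check implied by this mechanism; type and field names are illustrative.

```go
package sketch

import "time"

// LeadershipLease is a sketch of the time-bound authority described above.
type LeadershipLease struct {
	LeaderID      string
	Term          int64
	LastHeartbeat time.Time
}

// Expired reports whether the lease has lapsed: no heartbeat for
// 2*heartbeatInterval means the cluster should re-elect. Until then a
// partitioned old leader may still believe it leads (bounded split-brain).
func (l *LeadershipLease) Expired(now time.Time, heartbeatInterval time.Duration) bool {
	return now.Sub(l.LastHeartbeat) > 2*heartbeatInterval
}
```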

Test Cases

  • Simulate leader death: lease expires, new leader elected
  • Simulate network partition: partition may have >1 leader until lease expires
  • Verify no coordination during lease expiration

Dependencies

  • Depends on: Issue 3.2 (leader election)

Issue 3.4: [Event] Publish LeaderElected on election

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Record leadership election outcomes

User Story

As an operator, I want to see when leaders are elected and terms change, so that I can debug leadership issues and monitor election frequency.

Acceptance Criteria

  • LeaderElected event published after successful election
  • Event contains: LeaderID, Term, LeaseExpiration, Timestamp
  • Metrics increment on election
  • Helpful for debugging split-brain scenarios
  • Track election frequency (ideally < 1 per minute)

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: LeaderElected(leaderID, term, leaseExpiration, timestamp)

Triggered by: Successful election

Consumers: Metrics, audit logs

Technical Notes

  • Event published locally to all observers
  • Infrastructure event (not domain event)

Test Cases

  • Election happens: event published
  • Term increments: event reflects new term
  • Metrics accurate

Dependencies

  • Depends on: Issue 3.2 (election)

Issue 3.5: [Event] Publish LeadershipLost on lease expiration

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Track leadership transitions

User Story

As an operator, I want to know when a leader loses its lease, so that I can correlate with rebalancing or failure events.

Acceptance Criteria

  • LeadershipLost event published when lease expires
  • Event contains: PreviousLeaderID, Timestamp, Reason
  • Metrics track leadership transitions
  • Helpful for debugging cascading failures

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: LeadershipLost(previousLeaderID, timestamp, reason)

Reason: "lease_expired", "node_failed", etc.

Technical Notes

  • Published when lease TTL expires
  • Useful for observability

Test Cases

  • Leader lease expires: LeadershipLost published
  • Metrics show transition

Dependencies

  • Depends on: Issue 3.2 (election)

Issue 3.6: [Read Model] Implement GetClusterTopology query

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Query current cluster members and status

User Story

As an operator, I want to see all cluster members, their status, and last heartbeat, so that I can diagnose connectivity issues.

Acceptance Criteria

  • GetNodes() returns map[nodeID]*NodeInfo
  • NodeInfo contains: ID, Address, Status, LastSeen, ShardIDs
  • Status is: Active, Degraded, Failed
  • LastSeen is accurate heartbeat timestamp
  • ShardIDs show shard ownership (filled in Phase 3b)
  • Example: "node-a is active; node-b failed 30s ago"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: ClusterTopology

Data:

  • NodeID → NodeInfo (status, heartbeat, shards)
  • LeaderID (current leader)
  • Term (election term)

Technical Notes

  • ClusterManager maintains topology in-memory
  • Update on each heartbeat/announcement

Test Cases

  • GetNodes() returns active nodes
  • Status accurate (Active, Failed, etc.)
  • LastSeen updates on heartbeat
  • Rejoining node updates existing entry

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.7: [Read Model] Implement GetLeader query

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Query current leader

User Story

As a client, I want to know who the leader is, so that I can route coordination requests to the right node.

Acceptance Criteria

  • GetLeader() returns current leader NodeID or ""
  • IsLeader() returns true if this node is leader
  • Both consistent with LeaderElection state
  • Updated immediately on election
  • Example: "node-b is leader (term 5)"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: LeadershipRegistry

Data: CurrentLeader, CurrentTerm, LeaseExpiration

Implementation:

  • LeaderElection maintains this
  • ClusterManager queries it

Technical Notes

  • Critical for routing coordination work
  • Must be consistent across cluster

Test Cases

  • No leader: GetLeader returns ""
  • Leader elected: GetLeader returns leader ID
  • IsLeader true on leader, false on others
  • Changes on re-election

Dependencies

  • Depends on: Issue 3.2 (election)

Feature Set 3b: Shard Distribution

Capability: Distribute Actors Across Cluster Nodes

Description: Actors hash to shards using consistent hashing. Shards map to nodes. Topology changes minimize reshuffling.

Success Condition: 3 nodes, 100 shards distributed evenly; add node: ~25 shards rebalance; actor routes consistently.


Issue 3.8: [Command] Implement consistent hash ring

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Distribute shards across nodes with minimal reshuffling

User Story

As a cluster coordinator, I want to use consistent hashing to distribute shards, so that adding/removing nodes doesn't require full reshuffling.

Acceptance Criteria

  • ConsistentHashRing(numShards=1024) creates ring
  • GetShard(actorID) returns consistent shard [0, 1024)
  • AddNode(nodeID) rebalances ~numShards/numNodes shards
  • RemoveNode(nodeID) rebalances shards evenly
  • Same actor always maps to same shard
  • Reshuffling < 40% on node add/remove

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: AssignShards(nodes)

Aggregates: ConsistentHashRing (distribution algorithm)

Invariants:

  • Each shard [0, 1024) assigned to exactly one node
  • ActorID hashes consistently to shard
  • Topology changes minimize reassignment

Technical Notes

  • hashring.go already implements this
  • Use crypto/md5 or compatible hash
  • 1024 shards is tunable (P1 default)
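
A sketch of the actor-to-shard step, assuming the 1024-shard default and the md5 hashing mentioned above (names are illustrative); the hash ring then maps each shard to its owning node, so adding a node only moves the shards that node takes over.

```go
package sketch

import (
	"crypto/md5"
	"encoding/binary"
)

const numShards = 1024

// shardFor maps an actor ID to a stable shard in [0, numShards). The hash
// only needs to be deterministic and well distributed; md5 appears here
// because the technical notes mention it, not for any security property.
func shardFor(actorID string) uint32 {
	sum := md5.Sum([]byte(actorID))
	return binary.BigEndian.Uint32(sum[:4]) % numShards
}
```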

Test Cases

  • Single node: all shards assigned to it
  • Two nodes: ~512 shards each
  • Three nodes: ~341 shards each
  • Add fourth node: ~256 shards each (~25% reshuffled)
  • Remove node: remaining nodes rebalance evenly
  • Same actor-id always hashes to same shard

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.9: [Rule] Enforce single shard owner invariant

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Guarantee each shard has exactly one owner

User Story

As the cluster coordinator, I need each shard to have exactly one owner node, so that actor requests route deterministically.

Acceptance Criteria

  • ShardMap tracks shard → nodeID assignment
  • No shard is unassigned (every shard has owner)
  • No shard assigned to multiple nodes
  • Reassignment is atomic (no in-between state)
  • Tests verify invariant after topology changes

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: Each shard [0, 1024) assigned to exactly one active node

Mechanism:

  • ShardMap[shardID] = [nodeID]
  • Maintained by leader
  • Updated atomically on rebalancing

Technical Notes

  • shard.go implements ShardManager
  • Validated after each rebalancing

Test Cases

  • After rebalancing: all shards assigned
  • No orphaned shards
  • No multiply-assigned shards
  • Reassignment is atomic

Dependencies

  • Depends on: Issue 3.8 (consistent hashing)

Issue 3.10: [Event] Publish ShardAssigned on assignment

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Track shard-to-node assignments

User Story

As an operator, I want to see shard assignments, so that I can verify load distribution and debug routing issues.

Acceptance Criteria

  • ShardAssigned event published after assignment
  • Event contains: ShardID, NodeID, Timestamp
  • Metrics track: shards per node, rebalancing frequency
  • Example: Shard 42 assigned to node-b

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: ShardAssigned(shardID, nodeID, timestamp)

Triggered by: AssignShards command succeeds

Metrics: Shards per node, distribution evenness

Technical Notes

  • Infrastructure event
  • Useful for monitoring load distribution

Test Cases

  • Assignment published on rebalancing
  • Metrics reflect distribution

Dependencies

  • Depends on: Issue 3.9 (shard ownership)

Issue 3.11: [Read Model] Implement GetShardAssignments query

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Query shard-to-node mapping

User Story

As a client, I want to know which node owns a shard, so that I can route actor requests correctly.

Acceptance Criteria

  • GetShardAssignments() returns ShardMap
  • ShardMap[shardID] returns owning nodeID
  • GetShard(actorID) returns shard for actor
  • Routing decision: actorID → shard → nodeID
  • Cached locally; refreshed on each rebalancing

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Query)

Read Model: ShardMap

Data:

  • ShardID → NodeID (primary owner)
  • Version (incremented on rebalancing)
  • UpdateTime

Implementation:

  • ClusterManager.GetShardMap()
  • Cached; updated on assignment changes

Technical Notes

  • Critical for routing
  • Must be consistent across cluster
  • Version helps detect stale caches

Test Cases

  • GetShardAssignments returns current map
  • GetShard(actorID) returns consistent shard
  • Routing: actor ID → shard → node owner

Dependencies

  • Depends on: Issue 3.9 (shard ownership)

Feature Set 3c: Failure Detection and Recovery

Capability: Recover from Node Failures

Description: Failed nodes are detected via heartbeat timeout. Their shards are reassigned. Actors replay on new nodes.

Success Condition: Node dies → failure detected within 90s → shards reassigned → actors replay automatically.


Issue 3.12: [Command] Implement node health checks

Type: New Feature Bounded Context: Cluster Coordination Priority: P1

Title: Detect node failures via heartbeat timeout

User Story

As the cluster, I want to detect failed nodes automatically, so that shards can be reassigned and actors moved to healthy nodes.

Acceptance Criteria

  • Each node publishes heartbeat every 30s
  • Nodes without heartbeat for 90s marked as Failed
  • checkNodeHealth() runs every 30s
  • Failed node's status updates atomically
  • Tests verify failure detection timing
  • Failed node can rejoin cluster

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: MarkNodeFailed(nodeID)

Trigger: monitorNodes detects missing heartbeat

Events: NodeFailed(nodeID, lastSeenTimestamp)

Technical Notes

  • monitorNodes() loop in manager.go
  • Check LastSeen timestamp
  • Update status if stale (>90s)
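
A minimal sketch of the health-check loop body; the NodeInfo fields follow Issue 3.6, everything else is an illustrative stand-in for the ClusterManager state.

```go
package sketch

import (
	"sync"
	"time"
)

type NodeStatus string

const (
	NodeActive NodeStatus = "Active"
	NodeFailed NodeStatus = "Failed"
)

type NodeInfo struct {
	ID       string
	Status   NodeStatus
	LastSeen time.Time
}

// checkNodeHealth marks any node silent for longer than failureTimeout
// (90s in the acceptance criteria) as Failed. Run it on a 30s ticker.
func checkNodeHealth(mu *sync.Mutex, nodes map[string]*NodeInfo, now time.Time, failureTimeout time.Duration) {
	mu.Lock()
	defer mu.Unlock()
	for _, n := range nodes {
		if n.Status == NodeActive && now.Sub(n.LastSeen) > failureTimeout {
			n.Status = NodeFailed // a NodeFailed event would be published here
		}
	}
}
```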

Test Cases

  • Active node: status stays Active
  • No heartbeat for 90s: status → Failed
  • Rejoin: status → Active
  • Failure detected within 90-120s (90s timeout plus up to one 30s check interval)

Dependencies

  • Depends on: Issue 3.1 (node discovery)

Issue 3.13: [Command] Implement RebalanceShards after node failure

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Reassign failed node's shards to healthy nodes

User Story

As the cluster, I want to reassign failed node's shards automatically, so that actors are available on new nodes.

Acceptance Criteria

  • Leader detects node failure
  • Leader triggers RebalanceShards
  • Failed node's shards reassigned evenly
  • No shard left orphaned
  • ShardMap updated atomically
  • Rebalancing completes within 5 seconds

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Command)

Command: RebalanceShards(failedNodeID)

Aggregates: ShardMap, ConsistentHashRing

Events: RebalanceStarted, ShardMigrated

Technical Notes

  • Leader only (IsLeader() check)
  • Use consistent hashing to assign
  • Calculate new assignments atomically

Test Cases

  • Node-a fails with shards [1, 2, 3]
  • Leader reassigns [1, 2, 3] to remaining nodes
  • No orphaned shards
  • Rebalancing < 5s

Dependencies

  • Depends on: Issue 3.8 (consistent hashing)
  • Depends on: Issue 3.12 (failure detection)

Issue 3.14: [Rule] Enforce no-orphan invariant

Type: New Feature Bounded Context: Cluster Coordination Priority: P0

Title: Guarantee all shards have owners after rebalancing

User Story

As the cluster, I need all shards to have owners after any topology change, so that no actor is unreachable.

Acceptance Criteria

  • Before rebalancing: verify no orphaned shards
  • After rebalancing: verify all shards assigned
  • Tests fail if invariant violated
  • Rebalancing aborted if invariant would be violated

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: All shards [0, 1024) have owners after any rebalancing

Check:

  • Count assigned shards
  • Verify = 1024
  • Abort if not

Technical Notes

  • Validate before committing ShardMap
  • Logs errors but doesn't assert (graceful degradation)

Test Cases

  • Rebalancing completes: all shards assigned
  • Orphaned shard detected: rebalancing rolled back
  • Tests verify count = 1024

Dependencies

  • Depends on: Issue 3.13 (rebalancing)

Issue 3.15: [Event] Publish NodeFailed on failure detection

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Record node failure for observability

User Story

As an operator, I want to see when nodes fail, so that I can correlate with service degradation and debug issues.

Acceptance Criteria

  • NodeFailed event published when failure detected
  • Event contains: NodeID, LastSeenTimestamp, AffectedShards
  • Metrics track failure frequency
  • Example: "node-a failed; 341 shards affected"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: NodeFailed(nodeID, lastSeenTimestamp, affectedShardIDs)

Triggered by: checkNodeHealth marks node failed

Consumers: Metrics, alerts, audit logs

Technical Notes

  • Infrastructure event
  • AffectedShards helps assess impact

Test Cases

  • Node failure detected: event published
  • Metrics show affected shard count

Dependencies

  • Depends on: Issue 3.12 (failure detection)

Issue 3.16: [Event] Publish ShardMigrated on shard movement

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Track shard migrations

User Story

As an operator, I want to see shard migrations, so that I can track rebalancing progress and debug stuck migrations.

Acceptance Criteria

  • ShardMigrated event published on each shard movement
  • Event contains: ShardID, FromNodeID, ToNodeID, Status
  • Status: "Started", "InProgress", "Completed", "Failed"
  • Metrics track migration count and duration
  • Example: "Shard 42 migrated from node-a to node-b (2.3s)"

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: New Feature (Event)

Event: ShardMigrated(shardID, fromNodeID, toNodeID, status, durationMs)

Status: Started → InProgress → Completed

Consumers: Metrics, progress tracking

Technical Notes

  • Published for each shard move
  • Helps track rebalancing progress
  • Useful for SLO monitoring

Test Cases

  • Shard moves: event published
  • Metrics track duration
  • Status transitions correct

Dependencies

  • Depends on: Issue 3.13 (rebalancing)

Issue 3.17: [Documentation] Document actor migration and replay

Type: New Feature Bounded Context: Cluster Coordination Priority: P2

Title: Explain how actors move and recover state

User Story

As a developer, I want to understand how actors survive node failures, so that I can implement recovery workflows in my application.

Acceptance Criteria

  • Design doc: cluster/ACTOR_MIGRATION.md
  • Explain shard reassignment process
  • Explain state rebuild via GetEvents + replay
  • Explain snapshot optimization
  • Example: Shard 42 moves to new node; 1000-event actor replays in <100ms
  • Explain out-of-order message handling

Bounded Context: Cluster Coordination

DDD Implementation Guidance

Type: Documentation

Content:

  • Shard assignment (consistent hashing)
  • Actor discovery (routing via shard map)
  • State rebuild (replay from JetStream)
  • Snapshots (optional optimization)
  • In-flight messages (may arrive before replay completes)

Examples:

  • Manual failover: reassign shards manually
  • Auto failover: leader initiates on failure detection

Technical Notes

  • Complex topic; good documentation prevents bugs

Test Cases

  • Documentation is clear
  • Examples correct

Dependencies

  • Depends on: Issue 3.13 (rebalancing)
  • Depends on: Phase 1 (event replay)

Phase 4: Namespace Isolation and NATS Event Delivery

Feature Set 4a: Namespace Storage Isolation

Capability: Isolate Logical Domains Using Namespaces

Description: Events in one namespace are completely invisible to another namespace. Storage prefixes enforce isolation at persistence layer.

Success Condition: Two stores with namespaces "tenant-a", "tenant-b"; event saved in "tenant-a" invisible to "tenant-b" queries.


Issue 4.1: [Rule] Enforce namespace-based stream naming

Type: New Feature Bounded Context: Namespace Isolation Priority: P1

Title: Use namespace prefixes in JetStream stream names

User Story

As a system architect, I want events from different namespaces stored in separate JetStream streams, so that I can guarantee no cross-namespace leakage.

Acceptance Criteria

  • Namespace "tenant-a" → stream "tenant-a_events"
  • Namespace "tenant-b" → stream "tenant-b_events"
  • Empty namespace → stream "events" (default)
  • JetStreamConfig.Namespace sets prefix
  • NewJetStreamEventStoreWithNamespace convenience function
  • Tests verify stream names have namespace prefix

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Configuration)

Value Object: Namespace (string identifier)

Implementation:

  • JetStreamConfig.Namespace field
  • StreamName = namespace + "_events" if namespace set
  • StreamName = "events" if namespace empty
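
A small Go sketch of the derivation above; the helper name is an assumption, not an existing Aether function.

```go
// streamName derives the JetStream stream name from a namespace.
func streamName(namespace string) string {
	if namespace == "" {
		return "events" // default stream
	}
	return namespace + "_events" // e.g. "tenant-a" -> "tenant-a_events"
}
```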

Technical Notes

  • Already partially implemented in jetstream.go
  • Ensure the derived stream name contains only safe characters (sanitize spaces, dots, and wildcards, which JetStream does not allow in stream names)

Test Cases

  • NewJetStreamEventStoreWithNamespace("tenant-a"): creates stream "tenant-a_events"
  • NewJetStreamEventStoreWithNamespace(""): creates stream "events"
  • Stream name verified

Dependencies

  • None (orthogonal to other contexts)

Issue 4.2: [Rule] Enforce storage-level namespace isolation

Type: New Feature Bounded Context: Namespace Isolation Priority: P0

Title: Prevent cross-namespace data leakage at storage layer

User Story

As a security-conscious architect, I need events from one namespace to be completely invisible to GetEvents queries on another namespace, so that I can safely deploy multi-tenant systems.

Acceptance Criteria

  • SaveEvent to "tenant-a_events" cannot be read from "tenant-b_events"
  • GetEvents("tenant-a") queries "tenant-a_events" stream only
  • No possibility of accidental cross-namespace leakage
  • JetStream subject filtering enforces isolation
  • Integration tests verify with multiple namespaces

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: Events from namespace X are invisible to namespace Y

Mechanism:

  • Separate JetStream streams per namespace
  • Subject prefixing: "tenant-a.events.actor-123"
  • Subscribe filters by subject prefix
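
A minimal Go sketch of the subject layout and filter, assuming the "{namespace}.events.{actorID}" scheme above; package and helper names are illustrative.

```go
package namespace

import "fmt"

// eventSubject builds the per-actor subject, e.g. "tenant-a.events.actor-123".
func eventSubject(ns, actorID string) string {
	return fmt.Sprintf("%s.events.%s", ns, actorID)
}

// subjectFilter returns the consumer filter for one namespace; a consumer
// created with "tenant-a.events.>" can never match "tenant-b.events.*".
func subjectFilter(ns string) string {
	return ns + ".events.>"
}
```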

Technical Notes

  • jetstream.go: SubscribeToActorEvents uses subject prefix
  • Consumer created with subject filter matching namespace

Test Cases

  • SaveEvent to tenant-a: visible in tenant-a queries
  • Same event invisible to tenant-b queries
  • GetLatestVersion scoped to namespace
  • GetEvents scoped to namespace
  • Multi-namespace integration test

Dependencies

  • Depends on: Issue 4.1 (stream naming)

Issue 4.3: [Documentation] Document namespace design patterns

Type: New Feature Bounded Context: Namespace Isolation Priority: P1

Title: Provide guidance on namespace naming and use

User Story

As an architect, I want namespace design patterns, so that I can choose the right granularity for my multi-tenant system.

Acceptance Criteria

  • Design doc: NAMESPACE_DESIGN_PATTERNS.md
  • Pattern 1: "tenant-{id}" (per-customer)
  • Pattern 2: "env.domain" (per-env, per-bounded-context)
  • Pattern 3: "env.domain.customer" (most granular)
  • Examples of each pattern
  • Guidance on choosing granularity
  • Anti-patterns (wildcards, spaces, leading/trailing dots)

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: Documentation

Content:

  • Multi-tenant patterns
  • Granularity decisions
  • Namespace naming rules
  • Examples
  • Anti-patterns
  • Performance implications

Examples:

  • SaaS: "tenant-uuid"
  • Microservices: "service.orders"
  • Complex: "env.service.tenant"

Technical Notes

  • No hard restrictions; naming is flexible
  • Sanitization (spaces → underscores)

Test Cases

  • Documentation is clear
  • Examples valid

Dependencies

  • Depends on: Issue 4.1 (stream naming)

Issue 4.4: [Validation] Add namespace format validation (P2)

Type: New Feature Bounded Context: Namespace Isolation Priority: P2

Title: Validate namespace names to prevent invalid streams

User Story

As a developer, I want validation that rejects invalid namespace names (wildcards, spaces), so that I avoid silent failures from invalid stream names.

Acceptance Criteria

  • ValidateNamespace(ns string) returns error for invalid names
  • Rejects: "tenant-*", "tenant a", "tenant."
  • Accepts: "tenant-abc", "prod.orders", "tenant_123"
  • Called on NewJetStreamEventStoreWithNamespace
  • Clear error messages
  • Tests verify validation rules

Bounded Context: Namespace Isolation

DDD Implementation Guidance

Type: New Feature (Validation)

Validation Rules:

  • No wildcards (*, >)
  • No spaces
  • No leading/trailing dots
  • Alphanumeric, hyphens, underscores, dots only

Implementation:

  • ValidateNamespace regex
  • Called before stream creation
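
A possible Go implementation of these rules; the regex and error message are illustrative and could be relaxed toward sanitization instead, as noted below.

```go
package namespace

import (
	"fmt"
	"regexp"
)

// validNamespace: alphanumeric segments (with '-' and '_') joined by single
// dots, which rules out wildcards, spaces, and leading/trailing dots.
var validNamespace = regexp.MustCompile(`^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$`)

// ValidateNamespace rejects names that would produce invalid JetStream
// streams or subjects; an empty namespace falls back to the default stream.
func ValidateNamespace(ns string) error {
	if ns == "" {
		return nil
	}
	if !validNamespace.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q: use letters, digits, '-', '_' and '.' separators only", ns)
	}
	return nil
}
```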

Technical Notes

  • Nice-to-have; currently namespace strings are accepted as-is
  • Could sanitize instead of rejecting (replace spaces with underscores)

Test Cases

  • Valid: "tenant-abc", "prod.orders"
  • Invalid: "tenant-*", "tenant a", ".prod"
  • Error messages clear

Dependencies

  • Depends on: Issue 4.1 (stream naming)

Feature Set 4b: Cross-Node Event Delivery via NATS

Capability: Deliver Events Across Cluster Nodes

Description: Events published on one node reach subscribers on other nodes. NATS JetStream provides durability and ordering.

Success Condition: Node-a publishes → node-b subscriber receives (same as local EventBus, but distributed via NATS).


Issue 4.5: [Command] Implement NATSEventBus wrapper

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1

Title: Extend EventBus with NATS-native pub/sub

User Story

As a distributed application, I want events published on any node to reach subscribers on all nodes, so that I can implement cross-node workflows and aggregations.

Acceptance Criteria

  • NATSEventBus embeds EventBus
  • Publish(namespace, event) sends to local EventBus AND NATS
  • NATS subject: "aether.events.{namespace}"
  • SubscribeWithFilter works across nodes
  • Self-published events not re-delivered (avoid loops)
  • Tests verify cross-node delivery

Bounded Context: Event Bus (NATS extension)

DDD Implementation Guidance

Type: New Feature (Extension)

Aggregate: EventBus extended with NATSEventBus

Commands: Publish(namespace, event) [same interface, distributed]

Implementation:

  • NATSEventBus composes EventBus
  • Override Publish to also publish to NATS
  • Subscribe to NATS subjects matching namespace

Technical Notes

  • nats_eventbus.go already partially implemented
  • NATS subject: "aether.events.orders" for namespace "orders"
  • Include sourceNodeID in event to prevent redelivery
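
A minimal sketch of the publish path, assuming the nats.go JetStream client; Event, LocalBus, the package name, and the field names are simplified stand-ins for the existing Aether types.

```go
package aether

import (
	"encoding/json"

	"github.com/nats-io/nats.go"
)

// Simplified stand-ins for the existing local bus and event types.
type Event struct {
	ID           string `json:"id"`
	SourceNodeID string `json:"source_node_id"`
	Data         []byte `json:"data"`
}

type LocalBus interface {
	Publish(namespace string, event Event) error
}

// NATSEventBus wraps the local bus and fans events out over JetStream.
type NATSEventBus struct {
	local  LocalBus
	js     nats.JetStreamContext
	nodeID string
}

func (b *NATSEventBus) Publish(namespace string, event Event) error {
	// Deliver locally first, then fan out to the cluster.
	if err := b.local.Publish(namespace, event); err != nil {
		return err
	}
	// Tag the source node so subscribers can drop this node's own publishes.
	event.SourceNodeID = b.nodeID
	data, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = b.js.Publish("aether.events."+namespace, data)
	return err
}
```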

Test Cases

  • Publish on node-a: local subscribers on node-a receive
  • Same publish: node-b subscribers receive via NATS
  • Self-loop prevented: node-a doesn't re-receive own publish
  • Multi-node: all nodes converge on same events

Dependencies

  • Depends on: Issue 2.1 (EventBus.Publish)
  • Depends on: Issue 3.1 (cluster setup for multi-node tests)

Issue 4.6: [Rule] Enforce exactly-once delivery across cluster

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1

Title: Guarantee events delivered to all cluster subscribers

User Story

As a distributed system, I want each event delivered exactly once to each subscriber group, so that I avoid duplicates and lost events.

Acceptance Criteria

  • Event published to NATS with JetStream consumer
  • Consumer acknowledges delivery
  • Redelivery on network failure (JetStream handles)
  • No duplicate delivery to same subscriber
  • All nodes see same events in same order

Bounded Context: Event Bus (NATS)

DDD Implementation Guidance

Type: New Feature (Invariant)

Invariant: Exactly-once delivery to each subscriber

Mechanism:

  • JetStream consumer per subscriber group
  • Acknowledgment on delivery
  • Automatic redelivery on timeout

Technical Notes

  • JetStream handles durability and ordering
  • Consumer name = subscriber ID
  • Push consumer model (events pushed to subscriber)
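
A sketch of the consumer side with nats.go: one durable consumer per subscriber group, manual acks, and redelivery when no ack arrives within the ack wait. The subject, durable name, and handler are illustrative.

```go
package aether

import (
	"time"

	"github.com/nats-io/nats.go"
)

// subscribeDurable attaches a push consumer for one subscriber group.
func subscribeDurable(js nats.JetStreamContext, handle func([]byte) error) (*nats.Subscription, error) {
	return js.Subscribe(
		"aether.events.orders",
		func(msg *nats.Msg) {
			if err := handle(msg.Data); err != nil {
				return // no ack: JetStream redelivers after AckWait
			}
			_ = msg.Ack() // acknowledged messages are not redelivered
		},
		nats.Durable("order-projector"), // one durable consumer per subscriber group
		nats.ManualAck(),
		nats.AckWait(30*time.Second),
	)
}
```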

Test Cases

  • Publish event: all subscribers receive once
  • Network failure: redelivery after timeout
  • No duplicates on subscriber
  • Order preserved across nodes

Dependencies

  • Depends on: Issue 4.5 (NATSEventBus)

Issue 4.7: [Event] Publish EventPublished (via NATS)

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P2

Title: Route published events to NATS subjects

User Story

As a monitoring system, I want all events published through NATS, so that I can observe cross-node delivery and detect bottlenecks.

Acceptance Criteria

  • EventPublished event published to NATS
  • Subject: "aether.events.{namespace}.published"
  • Message contains: eventID, timestamp, sourceNodeID
  • Metrics track: events published, delivered, dropped
  • Helps identify partition/latency issues

Bounded Context: Event Bus (NATS)

DDD Implementation Guidance

Type: New Feature (Event)

Event: EventPublished (infrastructure)

Subject: aether.events.{namespace}.published

Consumers: Metrics, monitoring
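
An illustrative payload for this infrastructure event, assuming the fields listed in the acceptance criteria; names are not final.

```go
package aether

import "time"

// EventPublishedNotice is sent to "aether.events.{namespace}.published"
// after the underlying event has been published to NATS successfully.
type EventPublishedNotice struct {
	EventID      string    `json:"event_id"`
	Timestamp    time.Time `json:"timestamp"`
	SourceNodeID string    `json:"source_node_id"`
}
```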

Technical Notes

  • Published after NATS publish succeeds
  • Separate from local EventPublished (for clarity)

Test Cases

  • Publish event: EventPublished message on NATS
  • Metrics count delivery
  • Cross-node visibility works

Dependencies

  • Depends on: Issue 4.5 (NATSEventBus)

Issue 4.8: [Read Model] Implement cross-node subscription

Type: New Feature Bounded Context: Event Bus (NATS) Priority: P1

Title: Receive events from other nodes via NATS

User Story

As an application, I want to subscribe to events and receive them from all cluster nodes, so that I can implement distributed workflows.

Acceptance Criteria

  • NATSEventBus.Subscribe(namespace) receives local + NATS events
  • SubscribeWithFilter works with NATS
  • Events from local node: delivered via local EventBus
  • Events from remote nodes: delivered via NATS consumer
  • Subscriber sees unified stream (no duplication)

Bounded Context: Event Bus (NATS)

DDD Implementation Guidance

Type: New Feature (Query/Subscription)

Read Model: UnifiedEventStream (local + remote)

Implementation:

  • Subscribe creates local channel
  • NATSEventBus subscribes to NATS subject
  • Both feed into subscriber channel
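
A sketch of the merge, reusing the Event stand-in, imports, and JetStream context from the Issue 4.5 sketch; the local channel is assumed to come from the existing EventBus subscription, and all names are illustrative.

```go
// unifiedSubscribe merges a local subscription with a NATS subscription
// for the same namespace; events this node published itself are skipped
// on the NATS path so subscribers see each event exactly once.
func unifiedSubscribe(js nats.JetStreamContext, nodeID, namespace string, local <-chan Event) (<-chan Event, error) {
	out := make(chan Event, 64)

	// Local events: forwarded straight through.
	go func() {
		for e := range local {
			out <- e
		}
	}()

	// Remote events: decoded from NATS, dropping our own publishes.
	_, err := js.Subscribe("aether.events."+namespace, func(msg *nats.Msg) {
		var e Event
		if err := json.Unmarshal(msg.Data, &e); err != nil {
			return
		}
		if e.SourceNodeID == nodeID {
			return // already delivered via the local channel
		}
		out <- e
	})
	if err != nil {
		return nil, err
	}
	return out, nil
}
```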

Technical Notes

  • Unified view is transparent to subscriber
  • No need to know if event is local or remote

Test Cases

  • Subscribe to namespace: receive local events
  • Subscribe to namespace: receive remote events
  • Filter works across both sources
  • No duplication

Dependencies

  • Depends on: Issue 4.5 (NATSEventBus)

Summary

This backlog contains 67 executable issues across 5 bounded contexts organized into 4 implementation phases. Each issue:

  • Is decomposed using DDD-informed order (commands → rules → events → reads)
  • References domain concepts (aggregates, commands, events, value objects)
  • Includes acceptance criteria (testable, specific)
  • States dependencies (enabling parallel work)
  • Is sized to 1-3 days of work

Recommended Build Order:

  1. Phase 1 (17 issues): Event Sourcing Foundation - everything depends on this
  2. Phase 2 (9 issues): Local Event Bus - enables observability before clustering
  3. Phase 3 (20 issues): Cluster Coordination - enables distributed deployment
  4. Phase 4 (21 issues): Namespace & NATS - enables multi-tenancy and cross-node delivery

Total Scope: 67 issues sized at 1-3 days each (conservative estimate: 10-15 dev-weeks for a small team)


Next Steps

  1. Create Gitea issues from this backlog
  2. Assign to team members
  3. Set up dependency tracking in Gitea
  4. Use /spawn-issues skill to parallelize implementation
  5. Iterate on acceptance criteria with domain experts

See /issue-writing skill for proper issue formatting in Gitea.