# Bounded Context Map: Aether Distributed Actor System

## Summary

Aether has **five distinct bounded contexts** cut by language boundaries, lifecycle differences, ownership patterns, and scaling needs. The contexts emerge from the problem space: single-node event sourcing, distributed clustering, logical isolation, optimistic concurrency control, and event distribution.

**Key insight:** Each context has its own ubiquitous language (different meanings for similar terms) and its own lifecycle (actors persist forever; leases expire; subscriptions have independent lifetimes). Boundaries are enforced by language/data ownership, not by organizational structure.

---

## Bounded Contexts

### Context 1: Event Sourcing

**Purpose:** Persist events as immutable source of truth; enable state rebuild through replay.

**Core Responsibility:**
- Events are facts (immutable, append-only)
- Versions are monotonically increasing per actor
- Snapshots are optional optimization hints, not required
- Replay reconstructs state from history

**Language (Ubiquitous Language):**
- **Event**: Immutable fact about what happened; identified by ID, type, actor, version
- **Version**: Monotonically increasing sequence number per actor; used for optimistic locking
- **Snapshot**: Point-in-time state capture at a specific version; optional; can always replay
- **ActorID**: Identifier for the entity whose events we're storing; unique within namespace
- **Replay**: Process of reading events from start version, applying each, to rebuild state

**Key Entities (Event-Based, not Object-Based):**
- Event (immutable, versioned)
- ActorSnapshot (optional state cache)
- EventStore interface (multiple implementations)

**Key Events Published:**
- `EventStored` - Event successfully persisted (triggered when SaveEvent succeeds)
- `VersionConflict` - Attempted version <= current; optimistic lock lost (expensive mistake)
- `SnapshotCreated` - State snapshot saved (optional; developers decide when)

**Key Events Consumed:**
- None (this context is a source of truth; others consume from it)

**Interfaces to Other Contexts:**
- **Cluster Coordination**: Cluster leader queries latest versions to assign shards
- **Namespace Isolation**: Stores can be namespaced; queries filtered by namespace
- **Optimistic Concurrency**: Version conflicts trigger retry logic in application
- **Event Bus**: Events stored here are published to bus subscribers

**Lifecycle:**
- Event creation: Triggered by application business logic (domain events)
- Event persistence: Synchronous SaveEvent call (writes to store)
- Event durability: Persists forever (or until retention policy expires in JetStream)
- Snapshot lifecycle: Optional; created by application decision or rebalancing; can be safely discarded (replay recovers)

**Owner:** Developer (application layer) owns writing events; Aether library owns storage

**Current Code Locations:**
- `/aether/event.go` - Event struct, VersionConflictError, ReplayError
- `/aether/store/memory.go` - InMemoryEventStore implementation
- `/aether/store/jetstream.go` - JetStreamEventStore implementation (production)

**Scaling Concerns:**
- Single node: Full replay fast for actors with <100 events; snapshots help >100 events
- Cluster: Events stored in JetStream (durable across nodes); replay happens on failover
- Multi-tenant: Events namespaced; separate streams per namespace avoid cross-contamination

**Alignment with Vision:**
- **Primitives over Frameworks**: EventStore is interface; multiple implementations
- **NATS-Native**: JetStreamEventStore uses JetStream durability
- **Events as Complete History**: Events are source of truth; state is derived
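To make the replay language above concrete, here is a minimal sketch of rebuilding actor state from an event stream. The `EventStore` method signature, the `Event` field names, and the `AccountState` projection are assumptions for illustration, not Aether's confirmed API; only the fold-over-events shape and the optional snapshot shortcut come from this context.

```go
// Hedged sketch: rebuilding actor state by replaying events.
// Method signatures and field names are illustrative assumptions.
package examples

import (
	"context"
	"fmt"
)

// Event mirrors the concepts of this context: an immutable fact with
// an actor ID, a type, a per-actor version, and an opaque payload.
type Event struct {
	ID      string
	ActorID string
	Type    string
	Version int64
	Data    map[string]any
}

// EventStore is the primitive this context exposes; InMemoryEventStore
// and JetStreamEventStore would both satisfy something like it.
type EventStore interface {
	// GetEvents returns all events for an actor starting at fromVersion.
	GetEvents(ctx context.Context, actorID string, fromVersion int64) ([]Event, error)
}

// AccountState is a hypothetical application-side projection.
type AccountState struct {
	Balance int64
	Version int64
}

// Replay folds events into state, starting from an optional snapshot.
func Replay(ctx context.Context, store EventStore, actorID string, snapshot *AccountState) (*AccountState, error) {
	state := &AccountState{}
	fromVersion := int64(0)
	if snapshot != nil { // snapshots are an optimization; replay works without them
		*state = *snapshot
		fromVersion = snapshot.Version + 1
	}
	events, err := store.GetEvents(ctx, actorID, fromVersion)
	if err != nil {
		return nil, fmt.Errorf("replay %s: %w", actorID, err)
	}
	for _, e := range events {
		switch e.Type { // domain meaning lives in EventType strings and Data
		case "Deposited":
			if amt, ok := e.Data["amount"].(int64); ok {
				state.Balance += amt
			}
		case "Withdrawn":
			if amt, ok := e.Data["amount"].(int64); ok {
				state.Balance -= amt
			}
		}
		state.Version = e.Version
	}
	return state, nil
}
```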
**Gaps/Observations:**
- Snapshot strategy is entirely application's responsibility (no built-in triggering)
- Schema evolution for events not discussed (backward compatibility on deserialization)
- Corruption recovery (ReplayError handling) is application's responsibility

**Boundary Rules:**
- Inside: Event persistence, version validation, replay logic
- Outside: Domain logic that generates events, retry policy on conflicts, snapshot triggering
- Cannot cross: No shared models between Event Sourcing and other contexts; translation happens via events

---

### Context 2: Optimistic Concurrency Control

**Purpose:** Detect and signal concurrent write conflicts; let application choose retry strategy.

**Core Responsibility:**
- Protect against lost writes from concurrent writers
- Detect conflicts early (version mismatch)
- Provide detailed error context for retry logic
- Enable at-least-once semantics for idempotent operations

**Language (Ubiquitous Language):**
- **Version**: Sequential number tracking writer's view of current state
- **Conflict**: Condition where attempted version <= current version (another writer won)
- **Optimistic Lock**: Assumption that conflicts are rare; detect when they happen
- **Retry**: Application's response to conflict; reload state and attempt again
- **AttemptedVersion**: Version proposed by current writer
- **CurrentVersion**: Version that actually won the race

**Key Entities:**
- VersionConflictError (detailed error with actor ID, attempted, current versions)
- OptimisticLock pattern (implicit; not a first-class entity)

**Key Events Published:**
- `VersionConflict` - SaveEvent rejected due to version <= current (developer retries)

**Key Events Consumed:**
- None directly; consumes version state from Event Sourcing

**Interfaces to Other Contexts:**
- **Event Sourcing**: Reads latest version; detects conflicts on save
- **Application Logic**: Application handles conflict and decides retry strategy

**Lifecycle:**
- Conflict detection: Synchronous in SaveEvent (fast check: version > current)
- Conflict lifecycle: Temporary; conflict happens then application retries with new version
- Error lifecycle: Returned immediately; application decides next action

**Owner:** Aether library (detects conflicts); Application (implements retry strategy)

**Current Code Locations:**
- `/aether/event.go` - ErrVersionConflict sentinel, VersionConflictError type
- `/aether/store/jetstream.go` - SaveEvent validation (lines checking version)
- `/aether/store/memory.go` - SaveEvent validation

**Scaling Concerns:**
- High contention: If many writers target same actor, conflicts spike; application must implement backoff
- Retry storms: Naive retry (tight loop) causes cascade failures; exponential backoff mitigates
- Metrics: Track conflict rate to detect unexpected contention

**Alignment with Vision:**
- **Primitives over Frameworks**: Aether returns error; application decides what to do
- Does NOT impose retry strategy (that would be a framework opinion)

**Gaps/Observations:**
- No built-in retry mechanism (intentional design choice)
- No conflict metrics in library (application must instrument)
- No guidance on retry backoff strategies in code (documented in PROBLEM_MAP, not in API)

**Boundary Rules:**
- Inside: Detect conflict, validate version > current, return detailed error
- Outside: Retry logic, backoff strategy, exponential delays, giving up after N attempts
- Cannot cross: Each context owns its retry behavior; no global retry handler
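Since retry policy is deliberately left to the application, here is a hedged sketch of what a caller might build on top of the conflict error: detect the conflict, reload the current version, back off exponentially, and give up after N attempts. The `ErrVersionConflict` placeholder and the closure signatures are illustrative stand-ins, not the library's API.

```go
// Hedged sketch: application-side retry with exponential backoff after a
// version conflict. Only the shape — detect, reload, retry, give up —
// comes from this context.
package examples

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrVersionConflict stands in for the library's sentinel error.
var ErrVersionConflict = errors.New("version conflict")

// SaveWithRetry wraps an optimistic write. latestVersion re-reads the
// current version; save attempts the write at latest+1. Both are
// hypothetical application-side closures around the event store.
func SaveWithRetry(
	ctx context.Context,
	latestVersion func(ctx context.Context) (int64, error),
	save func(ctx context.Context, version int64) error,
	maxAttempts int,
) error {
	backoff := 50 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		current, err := latestVersion(ctx)
		if err != nil {
			return err
		}
		if err := save(ctx, current+1); err == nil {
			return nil
		} else if !errors.Is(err, ErrVersionConflict) {
			return err // only conflicts are retried
		}
		// Exponential backoff avoids retry storms under high contention.
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("optimistic save: gave up after %d attempts", maxAttempts)
}
```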
---

### Context 3: Namespace Isolation

**Purpose:** Provide logical data boundaries without an opinionated multi-tenancy framework.

**Core Responsibility:**
- Route events to subscribers matching namespace pattern
- Isolate event stores by namespace prefix
- Support hierarchical namespace naming (e.g., "prod.tenant-abc", "staging.orders")
- Warn about wildcard bypass of isolation (explicit decision)

**Language (Ubiquitous Language):**
- **Namespace**: Logical boundary (tenant, domain, environment, bounded context)
- **Namespace Pattern**: NATS-style wildcard matching: "*" (single token), ">" (multi-token)
- **Isolation**: Guarantee that events in namespace-A cannot be read from namespace-B (except via wildcard)
- **Wildcard Subscription**: Cross-namespace visibility for trusted components (logging, monitoring)
- **Subject**: NATS subject for routing (e.g., "aether.events.{namespace}")

**Key Entities:**
- Namespace (just a string; meaning is application's)
- JetStreamConfig with Namespace field (storage isolation)
- SubscriptionFilter with namespace pattern (matching)
- NATSEventBus subject routing

**Key Events Published:**
- `EventPublished` - Event sent to namespace subscribers (via EventBus.Publish)

**Key Events Consumed:**
- Events from Event Sourcing, filtered by namespace pattern

**Interfaces to Other Contexts:**
- **Event Sourcing**: Stores can be namespaced (prefix in stream name)
- **Event Bus**: Publishes to namespace; subscribers match by pattern
- **Cluster Coordination**: Might use namespaced subscriptions to isolate tenant events

**Lifecycle:**
- Namespace definition: Application decides; typically per-tenant or per-domain
- Namespace creation: Implicit when first store/subscription uses it (no explicit schema)
- Namespace deletion: Not supported; namespaces persist if events exist
- Stream lifetime: JetStream stream "namespace_events" persists until deleted

**Owner:** Application layer (defines namespace boundaries); Library (enforces routing)

**Current Code Locations:**
- `/aether/eventbus.go` - EventBus exact vs wildcard subscriber routing
- `/aether/nats_eventbus.go` - NATSEventBus subject formatting (line 89: `fmt.Sprintf("aether.events.%s", namespacePattern)`)
- `/aether/store/jetstream.go` - JetStreamConfig.Namespace field, stream name sanitization (line 83)
- `/aether/pattern.go` - MatchNamespacePattern, IsWildcardPattern functions

**Scaling Concerns:**
- Single namespace: All events in one stream; scales with event volume
- Multi-namespace: Separate streams per namespace; scales horizontally (add namespaces independently)
- Wildcard subscriptions: Cross-namespace visibility; careful with security (documented warnings)

**Alignment with Vision:**
- **Primitives over Frameworks**: Namespaces are primitives; no opinionated multi-tenancy layer
- Non-goal: "Opinionated multi-tenancy" - this library provides isolation primitives, not tenant management

**Gaps/Observations:**
- Namespace collision: No validation that namespace names are unique (risk: "orders" used by two teams)
- Wildcard security: Extensively documented in code (SECURITY WARNING appears multiple times); good
- No namespace registry or allow-list (application must enforce naming conventions)
- Sanitization of namespace names happens in JetStreamEventStore (spaces → underscores) but not documented

**Boundary Rules:**
- Inside: Namespace pattern matching, subject routing, stream prefixing
- Outside: Defining namespace semantics (tenant, domain, environment), enforcing conventions
- Cannot cross: Events in namespace-A published to namespace-A only (except wildcard subscribers)
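A minimal sketch of the NATS-style matching described above, assuming the semantics "*" = exactly one token and ">" = one or more trailing tokens; the real `MatchNamespacePattern` in `/aether/pattern.go` may handle edge cases differently.

```go
// Hedged sketch: NATS-style namespace pattern matching.
package examples

import "strings"

// MatchPattern reports whether a concrete namespace such as
// "prod.tenant-abc" matches a pattern such as "prod.*" or "prod.>".
func MatchPattern(pattern, namespace string) bool {
	pTokens := strings.Split(pattern, ".")
	nTokens := strings.Split(namespace, ".")
	for i, p := range pTokens {
		switch p {
		case ">": // matches one or more remaining tokens (must be last in pattern)
			return i == len(pTokens)-1 && len(nTokens) >= i+1
		case "*": // matches exactly one token
			if i >= len(nTokens) {
				return false
			}
		default:
			if i >= len(nTokens) || nTokens[i] != p {
				return false
			}
		}
	}
	return len(pTokens) == len(nTokens) // no trailing unmatched tokens
}
```

Under these rules "prod.*" matches "prod.tenant-abc" but not "prod.tenant-abc.orders", while "prod.>" matches both, which is exactly why wildcard subscriptions deserve the security warnings noted above.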
---

### Context 4: Cluster Coordination

**Purpose:** Distribute actors across cluster nodes; elect leader; rebalance on topology changes.

**Core Responsibility:**
- Discover nodes in cluster (NATS-based, no external coordinator)
- Elect one leader using lease-based coordination
- Distribute shards across nodes via consistent hash ring
- Detect node failures and trigger rebalancing
- Provide shard assignment for actor placement

**Language (Ubiquitous Language):**
- **Node**: Physical or logical computer in cluster; has ID, address, capacity, status
- **Leader**: Single node responsible for coordination and rebalancing decisions
- **Term**: Monotonically increasing leadership election round (prevents split-brain)
- **Shard**: Virtual partition (1024 by default); actors hash to shards; shards assigned to nodes
- **Consistent Hash Ring**: Algorithm mapping shards to nodes such that node failures cause minimal rebalancing
- **Rebalancing**: Reassignment of shards when topology changes (node join/fail)
- **ShardMap**: Current state of which shards live on which nodes
- **Heartbeat**: Periodic signal from leader renewing its lease (proves still alive)
- **Lease**: Time window during which leader's authority is valid (TTL-based, not quorum)

**Key Entities:**
- NodeInfo (cluster node details: ID, address, capacity, status)
- ShardMap (shard → nodes mapping; versioned)
- LeadershipLease (leader ID, term, expiration)
- ActorMigration (migration record for actor during rebalancing)

**Key Events Published:**
- `NodeJoined` - New node added to cluster
- `NodeFailed` - Node stopped responding (detected by heartbeat timeout)
- `LeaderElected` - Leader selected (term incremented)
- `LeadershipLost` - Leader lease expired (old leader can no longer coordinate)
- `ShardAssigned` - Leader assigns shard to nodes
- `ShardMigrated` - Shard moved from one node to another (during rebalancing)

**Key Events Consumed:**
- Node topology changes (new nodes, failures) → trigger rebalancing
- Leader election results → shard assignments

**Interfaces to Other Contexts:**
- **Namespace Isolation**: Could use namespaced subscriptions for cluster-internal events
- **Event Sourcing**: Cluster queries latest version to assign shards; failures trigger replay on new node
- **Event Bus**: Cluster messages published to event bus; subscribers on each node act on them

**Lifecycle:**
- Cluster formation: Nodes join; first leader elected
- Leadership duration: Until lease expires (~10 seconds in config)
- Shard assignment: Decided by leader; persists in ShardMap
- Node failure: Detected after heartbeat timeout (~90 seconds implied by lease config)
- Rebalancing: Triggered by topology change; completes when ShardMap versioned and distributed

**Owner:** ClusterManager (coordination); LeaderElection (election); ShardManager (placement)

**Current Code Locations:**
- `/aether/cluster/types.go` - NodeInfo, ShardMap, LeadershipLease, ActorMigration types
- `/aether/cluster/manager.go` - ClusterManager, node discovery, rebalancing loop
- `/aether/cluster/leader.go` - LeaderElection (lease-based using NATS KV)
- `/aether/cluster/hashring.go` - ConsistentHashRing (shard → node mapping)
- `/aether/cluster/shard.go` - ShardManager (actor placement, shard assignment)

**Scaling Concerns:**
- Leader election latency: 10s lease, 3s heartbeat → ~13s to detect failure (tunable)
- Rebalancing overhead: Consistent hash minimizes movements (only affects shards from failed node)
- Shard count: 1024 default; tune based on cluster size and actor count
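To illustrate how actors hash to shards and shards map to nodes with minimal movement, here is a hedged sketch of a consistent hash ring. The hash function, virtual-point scheme, and names (`ShardFor`, `Ring`) are assumptions; Aether's `ConsistentHashRing` in `/aether/cluster/hashring.go` may differ.

```go
// Hedged sketch: actor → shard → node placement via a consistent hash ring.
package examples

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const numShards = 1024 // default shard count in this context

// ShardFor maps an actor ID to a stable shard number.
func ShardFor(actorID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return h.Sum32() % numShards
}

// Ring is a minimal consistent hash ring: each node contributes several
// virtual points; a shard is owned by the first point at or after its hash.
type Ring struct {
	points []uint32
	owners map[uint32]string // point → node ID
}

func NewRing(nodeIDs []string, virtualPerNode int) *Ring {
	r := &Ring{owners: make(map[uint32]string)}
	for _, id := range nodeIDs {
		for v := 0; v < virtualPerNode; v++ {
			h := fnv.New32a()
			fmt.Fprintf(h, "%s#%d", id, v)
			p := h.Sum32()
			r.points = append(r.points, p)
			r.owners[p] = id
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// NodeForShard returns the node that owns a shard; when a node is removed,
// only the shards it owned move to the next point on the ring.
func (r *Ring) NodeForShard(shard uint32) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "shard-%d", shard)
	target := h.Sum32()
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= target })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.points[i]]
}
```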
**Alignment with Vision:**
- **NATS-Native**: Leader election uses NATS KV store (lease-based); cluster discovery via NATS
- **Primitives over Frameworks**: ShardManager and LeaderElection are composable; can swap algorithms

**Gaps/Observations:**
- Rebalancing is triggered but algorithm not fully shown in code excerpt ("would rebalance across N nodes")
- Actor migration during rebalancing: ShardManager has PlacementStrategy interface but sample migration handler not shown
- Split-brain prevention: Lease-based (no concurrent leaders) but old leader could execute stale rebalancing
- No explicit actor state migration during shard rebalancing (where does actor state go during move?)

**Boundary Rules:**
- Inside: Node discovery, leader election, shard assignment, rebalancing decisions
- Outside: Actor state migration (that's Event Sourcing's replay), actual actor message delivery
- Cannot cross: Cluster decisions are made once per cluster (not per namespace or actor)

---

### Context 5: Event Bus (Pub/Sub Distribution)

**Purpose:** Route events from producers to subscribers; support filtering and cross-node propagation.

**Core Responsibility:**
- Local event distribution (in-process subscriptions)
- Cross-node event distribution via NATS
- Filter events by type and actor pattern
- Support exact and wildcard namespace patterns
- Non-blocking delivery (drop event if channel full, don't block publisher)

**Language (Ubiquitous Language):**
- **Publish**: Send event to namespace (synchronous, non-blocking; may drop if subscribers slow)
- **Subscribe**: Register interest in namespace pattern (returns channel)
- **Filter**: Criteria for event delivery (EventTypes list, ActorPattern wildcard)
- **Wildcard Pattern**: "*" (single token), ">" (multi-token) matching
- **Subject**: NATS subject for routing (e.g., "aether.events.{namespace}")
- **Subscriber**: Entity receiving events from channel (has local reference to channel)
- **Deliver**: Attempt to send event to subscriber's channel; non-blocking (may drop)

**Key Entities:**
- EventBroadcaster interface (local or NATS-backed)
- EventBus (in-memory, local subscriptions only)
- NATSEventBus (extends EventBus; adds NATS forwarding)
- SubscriptionFilter (event types + actor pattern)
- filteredSubscription (internal; tracks channel, pattern, filter)

**Key Events Published:**
- `EventPublished` - Event sent via EventBus.Publish (may be delivered to subscribers)

**Key Events Consumed:**
- Events from Event Sourcing context

**Interfaces to Other Contexts:**
- **Event Sourcing**: Reads events to publish; triggered after SaveEvent
- **Namespace Isolation**: Uses namespace pattern for routing
- **Cluster Coordination**: Cluster messages flow through event bus

**Lifecycle:**
- Subscription creation: Caller invokes Subscribe/SubscribeWithFilter; gets channel
- Subscription duration: Lifetime of channel (caller controls)
- Subscription cleanup: Unsubscribe closes channel
- Event delivery: Synchronous Publish → deliver to all matching subscribers
- Dropped events: Non-blocking delivery; full channel = dropped event (metrics recorded)

**Owner:** Library (EventBus implementation); Callers (subscribe/unsubscribe)

**Current Code Locations:**
- `/aether/eventbus.go` - EventBus (local in-process pub/sub)
- `/aether/nats_eventbus.go` - NATSEventBus (NATS-backed cross-node)
- `/aether/pattern.go` - MatchNamespacePattern, SubscriptionFilter matching logic
- Metrics tracking in both implementations
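The non-blocking delivery rule above ("full channel = dropped event") can be sketched as follows; the struct and metric names are illustrative, not Aether's internals.

```go
// Hedged sketch: non-blocking delivery to buffered subscriber channels,
// counting deliveries and drops instead of blocking the publisher.
package examples

import "sync/atomic"

// BusEvent is a placeholder for whatever the bus carries.
type BusEvent struct {
	Namespace string
	Type      string
	ActorID   string
}

type subscriber struct {
	ch     chan BusEvent     // buffered (e.g., 100 elements)
	accept func(BusEvent) bool // EventTypes / ActorPattern filter
}

type localBus struct {
	subscribers []subscriber
	delivered   atomic.Int64
	dropped     atomic.Int64
}

// publish delivers to every matching subscriber without ever blocking the
// publisher: a full channel means the event is dropped for that subscriber.
func (b *localBus) publish(e BusEvent) {
	for _, s := range b.subscribers {
		if s.accept != nil && !s.accept(e) {
			continue // filtered out before delivery
		}
		select {
		case s.ch <- e:
			b.delivered.Add(1)
		default:
			b.dropped.Add(1) // slow subscriber; event is dropped, not queued
		}
	}
}
```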
**Scaling Concerns:**
- Local bus: In-memory channels; scales with subscriber count (no network overhead)
- NATS bus: One NATS subscription per pattern; scales with unique patterns
- Channel buffering: 100-element buffer (configurable); full = dropped events
- Metrics: Track published, delivered, dropped per namespace

**Alignment with Vision:**
- **Primitives over Frameworks**: EventBroadcaster is interface; swappable implementations
- **NATS-Native**: NATSEventBus uses NATS subjects for routing

**Gaps/Observations:**
- Dropped events are silent (metrics recorded but no callback); might surprise subscribers
- Filter matching is string-based (no compile-time safety for event types)
- Two-level filtering: Namespace at NATS level, EventTypes/ActorPattern at application level
- NATSEventBus creates subscription per unique pattern (could be optimized with pattern hierarchy)

**Boundary Rules:**
- Inside: Event routing, filter matching, non-blocking delivery
- Outside: Semantics of events (that's Event Sourcing); decisions on what to do when event received
- Cannot cross: Subscribers are responsible for their channels; publisher doesn't know who consumes

---

## Context Relationships

### Event Sourcing ↔ Event Bus

**Type:** Producer/Consumer (one-to-many)

**Direction:** Event Sourcing produces events; Event Bus distributes them

**Integration:**
- Application saves event to store (SaveEvent)
- Application publishes same event to bus (Publish)
- Subscribers receive event from bus channel
- Events are same object (Event struct)

**Decoupling:**
- Store and bus are independent (application coordinates)
- Bus subscribers don't know about storage
- Replay doesn't trigger bus publish (events already stored)

**Safety:**
- No shared transaction (save and publish are separate)
- Risk: Event saved but publish fails (or vice versa) → bus has stale view
- Mitigation: Application's responsibility to ensure consistency
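A hedged sketch of this save-then-publish coordination. The `Saver` and `Publisher` interfaces are stand-ins for the real `EventStore` and `EventBroadcaster`; the point is the ordering (durable write first) and the absence of a shared transaction.

```go
// Hedged sketch: application-side coordination of store and bus.
package examples

import (
	"context"
	"fmt"
)

type StoredEvent struct {
	ActorID   string
	Type      string
	Version   int64
	Namespace string
}

// Saver and Publisher are illustrative stand-ins for the library's
// EventStore and EventBroadcaster primitives.
type Saver interface {
	SaveEvent(ctx context.Context, e StoredEvent) error
}

type Publisher interface {
	Publish(ctx context.Context, namespace string, e StoredEvent) error
}

// SaveAndPublish coordinates the two contexts; the durable write happens
// before any subscriber can observe the event.
func SaveAndPublish(ctx context.Context, store Saver, bus Publisher, e StoredEvent) error {
	if err := store.SaveEvent(ctx, e); err != nil {
		return fmt.Errorf("save: %w", err) // nothing published; store and bus stay consistent
	}
	if err := bus.Publish(ctx, e.Namespace, e); err != nil {
		// The event is durable but was not announced; the application must
		// decide how to reconcile (re-publish, or let consumers catch up
		// from the store on their own schedule).
		return fmt.Errorf("publish after save: %w", err)
	}
	return nil
}
```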
---

### Event Sourcing → Optimistic Concurrency Control

**Type:** Dependency (nested)

**Direction:** SaveEvent validates version using Optimistic Concurrency

**Integration:**
- SaveEvent calls GetLatestVersion (read current)
- Checks event.Version > currentVersion (optimistic lock)
- Returns VersionConflictError if not

**Decoupling:**
- Optimistic Concurrency is not a separate context; it's logic within Event Sourcing
- Version validation is inline in SaveEvent, not a separate call

**Note:** Initially these seem like separate contexts (different language, different lifecycle). But Version is Event Sourcing's concern; Conflict is just an error condition (not a separate state machine). Optimistic locking is a **pattern**, not a **context**.

---

### Event Sourcing → Namespace Isolation

**Type:** Containment (namespaces contain event streams)

**Direction:** Namespace Isolation scopes Event Sourcing

**Integration:**
- JetStreamEventStore accepts Namespace in config
- Actual stream name becomes "{namespace}_{streamName}"
- GetEvents, GetLatestVersion, SaveEvent are namespace-scoped

**Decoupling:**
- Each namespace has independent version sequences
- No cross-namespace reads in Event Sourcing context
- EventBus.Publish specifies namespace

**Safety:**
- Complete isolation at storage level (different JetStream streams)
- Events from namespace-A cannot appear in namespace-B queries
- Wildcard subscriptions bypass this (documented risk)

---

### Cluster Coordination → Event Sourcing

**Type:** Consumer (reads version state)

**Direction:** Cluster queries Event Sourcing for actor state

**Integration:**
- ClusterManager might query GetLatestVersion to determine if shard can migrate
- Nodes track which actors (shards) are assigned locally
- On failover, new node replays events from store to rebuild state

**Decoupling:**
- Cluster doesn't manage event storage (Event Sourcing owns that)
- Cluster doesn't decide when to snapshot
- Cluster doesn't know about versions (Event Sourcing concept)

---

### Cluster Coordination → Namespace Isolation

**Type:** Orthogonal (can combine, but not required)

**Direction:** Cluster can use namespaced subscriptions; not required

**Integration:**
- Cluster could publish node-join events to namespaced topics (e.g., "cluster.{tenant}")
- Different tenants can have independent clusters (each with own cluster messages)

**Decoupling:**
- Cluster doesn't care about namespace semantics
- Namespace doesn't enforce cluster topology

---

### Event Bus → (All contexts)

**Type:** Cross-cutting concern

**Direction:** Event Bus distributes events from all contexts

**Integration:**
- Event Sourcing publishes to bus after SaveEvent
- Cluster Coordination publishes shard assignments to bus
- Namespace Isolation is a parameter to Publish/Subscribe
- Subscribers receive events and can filter by type/actor

**Decoupling:**
- Bus is asynchronous (events may be lost if no subscribers)
- Subscribers don't block publishers
- No ordering guarantee across namespaces
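As an illustration of this cross-cutting role, a hedged sketch of a trusted monitoring component subscribing across all namespaces with ">". The `Bus` interface and the `SubscribeWithFilter` signature are assumptions; the grounded part is that the wildcard deliberately bypasses namespace isolation, which is why the library documents it as a security-sensitive choice.

```go
// Hedged sketch: a trusted, cross-namespace monitoring subscriber.
package examples

import (
	"context"
	"log"
)

type ObservedEvent struct {
	Namespace string
	Type      string
	ActorID   string
}

type Filter struct {
	EventTypes   []string // empty means all types
	ActorPattern string   // e.g. "order-*" (hypothetical example)
}

// Bus is an illustrative stand-in for the EventBroadcaster primitives.
type Bus interface {
	SubscribeWithFilter(ctx context.Context, namespacePattern string, f Filter) (<-chan ObservedEvent, error)
}

// MonitorAll logs shard assignments and node failures from every namespace.
func MonitorAll(ctx context.Context, bus Bus) error {
	events, err := bus.SubscribeWithFilter(ctx, ">", Filter{
		EventTypes: []string{"ShardAssigned", "NodeFailed"},
	})
	if err != nil {
		return err
	}
	go func() {
		for e := range events { // channel closes on unsubscribe
			log.Printf("[%s] %s actor=%s", e.Namespace, e.Type, e.ActorID)
		}
	}()
	return nil
}
```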
---

## Boundary Rules Summary

### By Language

| Language | Context | Meaning |
|----------|---------|---------|
| **Event** | Event Sourcing | Immutable fact; identified by ID, type, actor, version |
| **Version** | Event Sourcing | Monotonically increasing sequence per actor; also used for optimistic locking |
| **Snapshot** | Event Sourcing | Optional state cache at specific version; always disposable |
| **Node** | Cluster Coordination | Physical computer in cluster; has ID, address, capacity |
| **Leader** | Cluster Coordination | Single node elected for coordination (not per-namespace, not per-actor) |
| **Shard** | Cluster Coordination | Virtual partition for actor placement; 1024 by default |
| **Namespace** | Namespace Isolation | Logical boundary (tenant, domain, context); application-defined meaning |
| **Wildcard** | Both Event Bus & Namespace | "*" (single token) and ">" (multi-token) NATS pattern matching |
| **Subject** | Event Bus | NATS subject for message routing |
| **Conflict** | Optimistic Concurrency | Condition where write failed due to version being stale |
| **Retry** | Optimistic Concurrency | Application's decision to reload and try again |
| **Subscribe** | Event Bus | Register interest in namespace pattern; returns channel |
| **Publish** | Event Bus | Send event to namespace subscribers; non-blocking |

### By Lifecycle

| Entity | Created | Destroyed | Owner | Context |
|--------|---------|-----------|-------|---------|
| Event | SaveEvent | Never (persists forever) | Application writes, Aether stores | Event Sourcing |
| Version | Per-event | With event | Automatic (monotonic) | Event Sourcing |
| Snapshot | Application decision | Application decision | Application | Event Sourcing |
| Node | Join cluster | Explicit leave | Infrastructure | Cluster Coordination |
| Leader | Election completes | Lease expires | Automatic (election) | Cluster Coordination |
| Shard | Created with cluster | With cluster | ClusterManager | Cluster Coordination |
| Namespace | First use | Never (persist) | Application | Namespace Isolation |
| Subscription | Subscribe() call | Unsubscribe() call | Caller | Event Bus |
| Channel | Subscribe() returns | Unsubscribe() closes | Caller | Event Bus |

### By Ownership

| Context | Who Decides | What They Decide |
|---------|-------------|------------------|
| Event Sourcing | Application (developer) | When to save events, event schema, snapshot strategy |
| Optimistic Concurrency | Application | Retry strategy, backoff, giving up |
| Namespace Isolation | Application | Namespace semantics (tenant, domain, env), naming convention |
| Cluster Coordination | ClusterManager & LeaderElection | Node discovery, leader election, shard assignment |
| Event Bus | Application | What to subscribe to, filtering criteria |

### By Scaling Boundary

| Context | Scales By | Limits | Tuning |
|---------|-----------|--------|--------|
| Event Sourcing | Event volume per actor | Replay latency grows with version count | Snapshots help |
| Cluster Coordination | Node count | Leader election latency, rebalancing overhead | Lease TTL, heartbeat interval |
| Namespace Isolation | Namespace count | Stream count, NATS resource usage | Separate JetStream streams |
| Event Bus | Subscriber count | Channel buffering (100 elements) | Queue depth, metrics |

---

## Code vs. Intended: Alignment Analysis

### Intended → Actual: Good Alignment

**Context: Event Sourcing**
- Intended: EventStore interface with multiple implementations
- Actual: InMemoryEventStore (testing) and JetStreamEventStore (production) both exist
- ✓ Good: Matches vision of "primitives over frameworks"

**Context: Optimistic Concurrency**
- Intended: Detect conflicts, return error, let app retry
- Actual: SaveEvent returns VersionConflictError; no built-in retry
- ✓ Good: Aligns with vision of primitives (app owns retry logic)

**Context: Namespace Isolation**
- Intended: Logical boundaries without opinionated multi-tenancy
- Actual: JetStreamConfig.Namespace, EventBus namespace patterns
- ✓ Good: Primitives provided; semantics left to app

**Context: Cluster Coordination**
- Intended: Node discovery, leader election, shard assignment
- Actual: ClusterManager, LeaderElection, ConsistentHashRing all present
- ✓ Good: Primitives implemented

**Context: Event Bus**
- Intended: Local and cross-node pub/sub with filtering
- Actual: EventBus (local) and NATSEventBus (NATS) both present
- ✓ Good: Extensible via interface

### Intended → Actual: Gaps

**Context: Cluster Coordination**
- Intended: Actor migration during shard rebalancing
- Actual: ShardManager has PlacementStrategy; ActorMigration type defined
- Gap: Migration handler logic not shown; where does actor state transition during rebalance?
- Impact: Cluster context is foundational but incomplete; application must implement actor handoff
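The missing handler could be filled application-side along these lines. Everything in this hedged sketch — the `ShardMigrated` payload shape and the `ActorHost` interface — is assumed for illustration; the only grounded part is the handoff itself: stop the actor on the old owner and rebuild it on the new owner by replaying events, because state lives in the event store rather than in the cluster.

```go
// Hedged sketch: an application-side handler for ShardMigrated events.
package examples

import "context"

// ShardMigrated is an assumed shape for the cluster event of that name;
// the real ActorMigration record presumably identifies the affected actors.
type ShardMigrated struct {
	Shard    uint32
	FromNode string
	ToNode   string
	ActorIDs []string
}

// ActorHost is whatever runs actors on this node.
type ActorHost interface {
	StopActor(ctx context.Context, actorID string) error
	StartActorFromHistory(ctx context.Context, actorID string) error // replays from the event store
}

// HandleShardMigrated performs the handoff for one migration on one node.
func HandleShardMigrated(ctx context.Context, self string, m ShardMigrated, host ActorHost) error {
	switch self {
	case m.FromNode:
		// Old owner: stop the actors; their state is already durable as events.
		for _, id := range m.ActorIDs {
			if err := host.StopActor(ctx, id); err != nil {
				return err
			}
		}
	case m.ToNode:
		// New owner: rebuild each actor by replaying its history from the store.
		for _, id := range m.ActorIDs {
			if err := host.StartActorFromHistory(ctx, id); err != nil {
				return err
			}
		}
	}
	return nil
}
```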
**Context: Event Sourcing**
- Intended: Snapshot strategy guidance
- Actual: SnapshotStore interface; SaveSnapshot exists; no built-in strategy
- Gap: No adaptive snapshotting, no time-based snapshotting
- Impact: App must choose snapshot frequency (documented in PROBLEM_MAP, not enforced)

**Context: Namespace Isolation**
- Intended: Warn about wildcard security risks
- Actual: SECURITY WARNING in docstrings (excellent)
- Gap: No namespace registry or allow-list to prevent collisions
- Impact: Risk of two teams using same namespace (e.g., "orders") unintentionally

**Context: Optimistic Concurrency**
- Intended: Guide app on retry strategy
- Actual: Returns VersionConflictError with details
- Gap: No retry helper, no backoff library
- Impact: Each app implements own retry (fine; primitives approach)

---

## Refactoring Backlog (if brownfield)

### No Major Refactoring Required

The code structure already aligns well with intended bounded contexts:
- Event Sourcing lives in `/event.go` and `/store/`
- Cluster lives in `/cluster/`
- Event Bus lives in `/eventbus.go` and `/nats_eventbus.go`
- Pattern matching lives in `/pattern.go`

### Minor Improvements

**Issue 1: Document Actor Migration During Rebalancing**
- Current: ShardManager.AssignShard exists; ActorMigration type defined
- Gap: No example code showing how actor state moves between nodes
- Suggestion: Add sample migration handler in cluster package

**Issue 2: Add Namespace Validation/Registry**
- Current: Namespace is just a string; no collision detection
- Gap: Risk of two teams using same namespace
- Suggestion: Document naming convention (e.g., "env.team.context"); optionally add schema/enum

**Issue 3: Snapshot Strategy Recipes**
- Current: SnapshotStore interface; app responsible for strategy
- Gap: Documentation could provide sample strategies (time-based, count-based, adaptive)
- Suggestion: Add `/examples/snapshot_strategies.go` with reference implementations

**Issue 4: Metrics for Concurrency Context**
- Current: Version conflict detection exists; no metrics
- Gap: Apps can't easily observe conflict rate
- Suggestion: Add conflict metrics to EventStore (or provide hooks)
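The recipes suggested in Issue 3 could look like this hedged sketch: purely application-side policies that decide when to call `SaveSnapshot`. None of these types are Aether API; they simply mirror the time-based, count-based, and adaptive strategies named above.

```go
// Hedged sketch: application-side snapshot policies.
package examples

import "time"

// SnapshotPolicy reports whether a snapshot is worth taking now.
type SnapshotPolicy interface {
	ShouldSnapshot(eventsSinceSnapshot int, lastSnapshot time.Time, lastReplay time.Duration) bool
}

// CountBased snapshots every N events (e.g., every 100 events).
type CountBased struct{ Every int }

func (p CountBased) ShouldSnapshot(events int, _ time.Time, _ time.Duration) bool {
	return events >= p.Every
}

// TimeBased snapshots at most once per interval (e.g., hourly).
type TimeBased struct{ Interval time.Duration }

func (p TimeBased) ShouldSnapshot(_ int, last time.Time, _ time.Duration) bool {
	return time.Since(last) >= p.Interval
}

// Adaptive snapshots when observed replay latency exceeds a threshold.
type Adaptive struct{ MaxReplay time.Duration }

func (p Adaptive) ShouldSnapshot(_ int, _ time.Time, lastReplay time.Duration) bool {
	return lastReplay > p.MaxReplay
}
```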
---

## Recommendations

### For Product Strategy

1. **Confirm Bounded Contexts**: Review this map with the team. Are these five contexts the right cut? Missing any? Too many?
2. **Define Invariants per Context**:
   - Event Sourcing: "Version must be strictly monotonic per actor" ✓ (enforced)
   - Cluster Coordination: "Only one leader can have valid lease at a time" ✓ (lease-based)
   - Namespace Isolation: "Events in namespace-A cannot be queried from namespace-B context" ✓ (separate streams)
   - Optimistic Concurrency: "Conflict detection is synchronous; resolution is async" ✓ (error returned immediately)
   - Event Bus: "Delivery is non-blocking; events may be dropped if subscriber slow" ✓ (metrics track this)
3. **Map Capabilities to Contexts**:
   - "Store events durably" → Event Sourcing context
   - "Detect concurrent writes" → Optimistic Concurrency context
   - "Isolate logical domains" → Namespace Isolation context
   - "Distribute actors across nodes" → Cluster Coordination context
   - "Route events to subscribers" → Event Bus context
4. **Test Boundaries**:
   - Single-node: Event Sourcing + Optimistic Concurrency + Event Bus (no Cluster)
   - Multi-node: Add Cluster Coordination (but cluster decisions don't affect other contexts)
   - Multi-tenant: Add Namespace Isolation (orthogonal to other contexts)

### For Architecture

1. **Complete Cluster Context Documentation**:
   - Show actor migration lifecycle during shard rebalancing
   - Document when state moves (during rebalance, during failover)
   - Provide sample ShardManager implementation
2. **Add Snapshot Strategy Guidance**:
   - Time-based: Snapshot every hour
   - Count-based: Snapshot every 100 events
   - Adaptive: Snapshot when replay latency exceeds threshold
3. **Namespace Isolation Checklist**:
   - Define naming convention (document in README)
   - Add compile-time checks (optional enum for known namespaces)
   - Test multi-tenant isolation (integration test suite)
4. **Concurrency Context Testing**:
   - Add concurrent writer tests to store tests
   - Verify VersionConflictError details are accurate
   - Benchmark conflict detection performance

### For Docs

1. **Add Context Diagram**: Show five contexts as boxes; arrows for relationships
2. **Add Per-Context Glossary**: Define ubiquitous language per context (terms table above)
3. **Add Lifecycle Diagrams**: Show event lifetime, node lifetime, subscription lifetime, shard lifetime
4. **Security Section**: Expand wildcard subscription warnings; document trust model

---

## Anti-Patterns Avoided

### Pattern: "One Big Event Model"
- **Anti-pattern**: Single Event struct used everywhere with union types
- **What we do**: Event is generic; domain language lives in EventType strings and Data map
- **Why**: Primitives approach; library doesn't impose domain model

### Pattern: "Shared Mutable State Across Contexts"
- **Anti-pattern**: ClusterManager directly mutates EventStore data structures
- **What we do**: Contexts communicate via events (if they need to) or via explicit queries
- **Why**: Clean boundaries; each context owns its data

### Pattern: "Automatic Retry for Optimistic Locks"
- **Anti-pattern**: Library retries internally on version conflict
- **What we do**: Return error to caller; caller decides retry strategy
- **Why**: Primitives approach; retry policy is app's concern, not library's

### Pattern: "Opinionated Snapshot Strategy"
- **Anti-pattern**: "Snapshot every 100 events" hardcoded
- **What we do**: SnapshotStore interface; app decides when to snapshot
- **Why**: Different apps have different replay latency requirements

### Pattern: "Wildcard Subscriptions by Default"
- **Anti-pattern**: All subscriptions use ">" by default (receive everything)
- **What we do**: Explicit namespaces; wildcard is optional and warned about
- **Why**: Security-first; isolation is default

---

## Conclusion

Aether's five bounded contexts are **well-aligned** with the problem space and the codebase:

1. **Event Sourcing** - Store events as immutable history; enable replay
2. **Optimistic Concurrency** - Detect conflicts; let app retry
3. **Namespace Isolation** - Logical boundaries without opinionated multi-tenancy
4. **Cluster Coordination** - Distribute actors, elect leader, rebalance on failure
5. **Event Bus** - Route events from producers to subscribers
Each context has:
- Clear **language boundaries** (different terms, different meanings)
- Clear **lifecycle boundaries** (different creation/deletion patterns)
- Clear **ownership** (who decides what within each context)
- Clear **scaling boundaries** (why this context must be separate)

The implementation **matches the vision** of "primitives over frameworks." The library provides composition points (interfaces); applications wire them together.

Next step in product strategy: **Define domain models within each context** (Step 4 of the strategy chain). For now, Aether provides primitives; applications build their domain models on top.