Bounded Context Map: Aether Distributed Actor System
Summary
Aether has five distinct bounded contexts cut by language boundaries, lifecycle differences, ownership patterns, and scaling needs. The contexts emerge from the problem space: single-node event sourcing, distributed clustering, logical isolation, optimistic concurrency control, and event distribution.
Key insight: Each context has its own ubiquitous language (different meanings for similar terms) and its own lifecycle (actors persist forever; leases expire; subscriptions have independent lifetimes). Boundaries are enforced by language/data ownership, not by organizational structure.
Bounded Contexts
Context 1: Event Sourcing
Purpose: Persist events as immutable source of truth; enable state rebuild through replay.
Core Responsibility:
- Events are facts (immutable, append-only)
- Versions are monotonically increasing per actor
- Snapshots are optional optimization hints, not required
- Replay reconstructs state from history
Language (Ubiquitous Language):
- Event: Immutable fact about what happened; identified by ID, type, actor, version
- Version: Monotonically increasing sequence number per actor; used for optimistic locking
- Snapshot: Point-in-time state capture at a specific version; optional; can always replay
- ActorID: Identifier for the entity whose events we're storing; unique within namespace
- Replay: Process of reading events from start version, applying each, to rebuild state
Key Entities (Event-Based, not Object-Based):
- Event (immutable, versioned)
- ActorSnapshot (optional state cache)
- EventStore interface (multiple implementations)
Key Events Published:
- EventStored - Event successfully persisted (triggered when SaveEvent succeeds)
- VersionConflict - Attempted version <= current; optimistic lock lost (expensive mistake)
- SnapshotCreated - State snapshot saved (optional; developers decide when)
Key Events Consumed:
- None (this context is a source of truth; others consume from it)
Interfaces to Other Contexts:
- Cluster Coordination: Cluster leader queries latest versions to assign shards
- Namespace Isolation: Stores can be namespaced; queries filtered by namespace
- Optimistic Concurrency: Version conflicts trigger retry logic in application
- Event Bus: Events stored here are published to bus subscribers
Lifecycle:
- Event creation: Triggered by application business logic (domain events)
- Event persistence: Synchronous SaveEvent call (writes to store)
- Event durability: Persists forever (or until retention policy expires in JetStream)
- Snapshot lifecycle: Optional; created by application decision or rebalancing; can be safely discarded (replay recovers)
Owner: Developer (application layer) owns writing events; Aether library owns storage
Current Code Locations:
- /aether/event.go - Event struct, VersionConflictError, ReplayError
- /aether/store/memory.go - InMemoryEventStore implementation
- /aether/store/jetstream.go - JetStreamEventStore implementation (production)
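To make the terms above concrete, here is a minimal sketch of the shapes those files provide. Field names and method signatures are assumptions for illustration; the real definitions in /aether/event.go and /aether/store/ may differ.

```go
// Illustrative shapes only; real field names and signatures may differ.
package aether

import "context"

// Event is an immutable fact, identified by ID, type, actor, and version.
type Event struct {
	ID      string
	Type    string
	ActorID string
	Version int64
	Data    map[string]any
}

// EventStore is the persistence primitive; the in-memory and JetStream
// stores are two implementations of it.
type EventStore interface {
	// SaveEvent appends an event; it fails with a version conflict when
	// event.Version is not greater than the actor's current version.
	SaveEvent(ctx context.Context, event Event) error
	// GetEvents returns an actor's events from a starting version, in
	// order, so callers can replay them to rebuild state.
	GetEvents(ctx context.Context, actorID string, fromVersion int64) ([]Event, error)
	// GetLatestVersion returns the highest persisted version for an actor.
	GetLatestVersion(ctx context.Context, actorID string) (int64, error)
}
```

Both InMemoryEventStore and JetStreamEventStore would satisfy an interface of roughly this shape, which is what keeps the storage backend swappable.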
Scaling Concerns:
- Single node: Full replay fast for actors with <100 events; snapshots help >100 events
- Cluster: Events stored in JetStream (durable across nodes); replay happens on failover
- Multi-tenant: Events namespaced; separate streams per namespace avoid cross-contamination
Alignment with Vision:
- Primitives over Frameworks: EventStore is interface; multiple implementations
- NATS-Native: JetStreamEventStore uses JetStream durability
- Events as Complete History: Events are source of truth; state is derived
Gaps/Observations:
- Snapshot strategy is entirely application's responsibility (no built-in triggering)
- Schema evolution for events not discussed (backward compatibility on deserialization)
- Corruption recovery (ReplayError handling) is application's responsibility
Boundary Rules:
- Inside: Event persistence, version validation, replay logic
- Outside: Domain logic that generates events, retry policy on conflicts, snapshot triggering
- Cannot cross: No shared models between Event Sourcing and other contexts; translation happens via events
Context 2: Optimistic Concurrency Control
Purpose: Detect and signal concurrent write conflicts; let application choose retry strategy.
Core Responsibility:
- Protect against lost writes from concurrent writers
- Detect conflicts early (version mismatch)
- Provide detailed error context for retry logic
- Enable at-least-once semantics for idempotent operations
Language (Ubiquitous Language):
- Version: Sequential number tracking writer's view of current state
- Conflict: Condition where attempted version <= current version (another writer won)
- Optimistic Lock: Assumption that conflicts are rare; detect when they happen
- Retry: Application's response to conflict; reload state and attempt again
- AttemptedVersion: Version proposed by current writer
- CurrentVersion: Version that actually won the race
Key Entities:
- VersionConflictError (detailed error with actor ID, attempted, current versions)
- OptimisticLock pattern (implicit; not a first-class entity)
Key Events Published:
- VersionConflict - SaveEvent rejected due to version <= current (developer retries)
Key Events Consumed:
- None directly; consumes version state from Event Sourcing
Interfaces to Other Contexts:
- Event Sourcing: Reads latest version; detects conflicts on save
- Application Logic: Application handles conflict and decides retry strategy
Lifecycle:
- Conflict detection: Synchronous in SaveEvent (fast check: version > current)
- Conflict lifecycle: Temporary; conflict happens then application retries with new version
- Error lifecycle: Returned immediately; application decides next action
Owner: Aether library (detects conflicts); Application (implements retry strategy)
Current Code Locations:
- /aether/event.go - ErrVersionConflict sentinel, VersionConflictError type
- /aether/store/jetstream.go - SaveEvent validation (version check)
- /aether/store/memory.go - SaveEvent validation (version check)
Scaling Concerns:
- High contention: If many writers target same actor, conflicts spike; application must implement backoff
- Retry storms: Naive retry (tight loop) causes cascade failures; exponential backoff mitigates
- Metrics: Track conflict rate to detect unexpected contention
Alignment with Vision:
- Primitives over Frameworks: Aether returns error; application decides what to do
- Does NOT impose retry strategy (that would be a framework opinion)
Gaps/Observations:
- No built-in retry mechanism (intentional design choice)
- No conflict metrics in library (application must instrument)
- No guidance on retry backoff strategies in code (documented in PROBLEM_MAP, not in API)
Boundary Rules:
- Inside: Detect conflict, validate version > current, return detailed error
- Outside: Retry logic, backoff strategy, exponential delays, giving up after N attempts
- Cannot cross: Each context owns its retry behavior; no global retry handler
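Because retry policy lives outside this context, an application typically wraps SaveEvent itself. Below is a minimal sketch with exponential backoff, assuming the EventStore/Event shapes from the Event Sourcing sketch and a pointer *VersionConflictError that works with errors.As; exact names and error shape may differ.

```go
// Application-side retry with exponential backoff; not part of the library.
package aether

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// saveWithRetry reloads state via rebuild (which should read the latest
// version and produce the next event) and retries on version conflicts with
// exponential backoff, up to a small cap.
func saveWithRetry(ctx context.Context, store EventStore,
	rebuild func(context.Context) (Event, error)) error {

	backoff := 10 * time.Millisecond
	const maxAttempts = 5

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		event, err := rebuild(ctx)
		if err != nil {
			return err
		}

		err = store.SaveEvent(ctx, event)
		if err == nil {
			return nil
		}

		var conflict *VersionConflictError // assumed error shape
		if !errors.As(err, &conflict) {
			return err // not a conflict: do not retry
		}

		// Another writer won the race; back off, then rebuild and retry.
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d version conflicts", maxAttempts)
}
```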
Context 3: Namespace Isolation
Purpose: Provide logical data boundaries without opinionated multi-tenancy framework.
Core Responsibility:
- Route events to subscribers matching namespace pattern
- Isolate event stores by namespace prefix
- Support hierarchical namespace naming (e.g., "prod.tenant-abc", "staging.orders")
- Warn about wildcard bypass of isolation (explicit decision)
Language (Ubiquitous Language):
- Namespace: Logical boundary (tenant, domain, environment, bounded context)
- Namespace Pattern: NATS-style wildcard matching: "*" (single token), ">" (multi-token)
- Isolation: Guarantee that events in namespace-A cannot be read from namespace-B (except via wildcard)
- Wildcard Subscription: Cross-namespace visibility for trusted components (logging, monitoring)
- Subject: NATS subject for routing (e.g., "aether.events.{namespace}")
Key Entities:
- Namespace (just a string; meaning is application's)
- JetStreamConfig with Namespace field (storage isolation)
- SubscriptionFilter with namespace pattern (matching)
- NATSEventBus subject routing
Key Events Published:
- EventPublished - Event sent to namespace subscribers (via EventBus.Publish)
Key Events Consumed:
- Events from Event Sourcing, filtered by namespace pattern
Interfaces to Other Contexts:
- Event Sourcing: Stores can be namespaced (prefix in stream name)
- Event Bus: Publishes to namespace; subscribers match by pattern
- Cluster Coordination: Might use namespaced subscriptions to isolate tenant events
Lifecycle:
- Namespace definition: Application decides; typically per-tenant or per-domain
- Namespace creation: Implicit when first store/subscription uses it (no explicit schema)
- Namespace deletion: Not supported; namespaces persist if events exist
- Stream lifetime: JetStream stream "namespace_events" persists until deleted
Owner: Application layer (defines namespace boundaries); Library (enforces routing)
Current Code Locations:
- /aether/eventbus.go - EventBus exact vs. wildcard subscriber routing
- /aether/nats_eventbus.go - NATSEventBus subject formatting (line 89: fmt.Sprintf("aether.events.%s", namespacePattern))
- /aether/store/jetstream.go - JetStreamConfig.Namespace field, stream name sanitization (line 83)
- /aether/pattern.go - MatchNamespacePattern, IsWildcardPattern functions
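For illustration, a small re-implementation of the NATS-style matching those pattern functions describe ("*" matches exactly one dot-separated token, ">" matches one or more remaining tokens). This is a sketch, not the library's code; the actual logic in /aether/pattern.go may differ in edge cases.

```go
// Illustrative re-implementation of NATS-style pattern matching.
package aether

import "strings"

// matchNamespacePattern reports whether a namespace matches a pattern where
// "*" matches exactly one dot-separated token and ">" matches the rest.
func matchNamespacePattern(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")

	for i, tok := range p {
		if i >= len(n) {
			return false // pattern is longer than the namespace
		}
		if tok == ">" {
			return true // ">" matches one or more remaining tokens
		}
		if tok != "*" && tok != n[i] {
			return false
		}
	}
	return len(p) == len(n)
}
```

Under these rules, "prod.*" matches "prod.tenant-abc" but not "prod.tenant-abc.orders", while "prod.>" matches both.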
Scaling Concerns:
- Single namespace: All events in one stream; scales with event volume
- Multi-namespace: Separate streams per namespace; scales horizontally (add namespaces independently)
- Wildcard subscriptions: Cross-namespace visibility; careful with security (documented warnings)
Alignment with Vision:
- Primitives over Frameworks: Namespaces are primitives; no opinionated multi-tenancy layer
- Non-goal: "Opinionated multi-tenancy" - this library provides isolation primitives, not tenant management
Gaps/Observations:
- Namespace collision: No validation that namespace names are unique (risk: "orders" used by two teams)
- Wildcard security: Extensively documented in code (SECURITY WARNING appears multiple times); good
- No namespace registry or allow-list (application must enforce naming conventions)
- Sanitization of namespace names happens in JetStreamEventStore (spaces → underscores) but is not documented
Boundary Rules:
- Inside: Namespace pattern matching, subject routing, stream prefixing
- Outside: Defining namespace semantics (tenant, domain, environment), enforcing conventions
- Cannot cross: Events in namespace-A published to namespace-A only (except wildcard subscribers)
Context 4: Cluster Coordination
Purpose: Distribute actors across cluster nodes; elect leader; rebalance on topology changes.
Core Responsibility:
- Discover nodes in cluster (NATS-based, no external coordinator)
- Elect one leader using lease-based coordination
- Distribute shards across nodes via consistent hash ring
- Detect node failures and trigger rebalancing
- Provide shard assignment for actor placement
Language (Ubiquitous Language):
- Node: Physical or logical computer in cluster; has ID, address, capacity, status
- Leader: Single node responsible for coordination and rebalancing decisions
- Term: Monotonically increasing leadership election round (prevents split-brain)
- Shard: Virtual partition (1024 by default); actors hash to shards; shards assigned to nodes
- Consistent Hash Ring: Algorithm mapping shards to nodes such that node failures cause minimal rebalancing
- Rebalancing: Reassignment of shards when topology changes (node join/fail)
- ShardMap: Current state of which shards live on which nodes
- Heartbeat: Periodic signal from leader renewing its lease (proves still alive)
- Lease: Time window during which leader's authority is valid (TTL-based, not quorum)
Key Entities:
- NodeInfo (cluster node details: ID, address, capacity, status)
- ShardMap (shard → nodes mapping; versioned)
- LeadershipLease (leader ID, term, expiration)
- ActorMigration (migration record for actor during rebalancing)
Key Events Published:
- NodeJoined - New node added to cluster
- NodeFailed - Node stopped responding (detected by heartbeat timeout)
- LeaderElected - Leader selected (term incremented)
- LeadershipLost - Leader lease expired (old leader can no longer coordinate)
- ShardAssigned - Leader assigns shard to nodes
- ShardMigrated - Shard moved from one node to another (during rebalancing)
Key Events Consumed:
- Node topology changes (new nodes, failures) → trigger rebalancing
- Leader election results → shard assignments
Interfaces to Other Contexts:
- Namespace Isolation: Could use namespaced subscriptions for cluster-internal events
- Event Sourcing: Cluster queries latest version to assign shards; failures trigger replay on new node
- Event Bus: Cluster messages published to event bus; subscribers on each node act on them
Lifecycle:
- Cluster formation: Nodes join; first leader elected
- Leadership duration: Until lease expires (~10 seconds in config)
- Shard assignment: Decided by leader; persists in ShardMap
- Node failure: Detected after heartbeat timeout (~90 seconds implied by lease config)
- Rebalancing: Triggered by topology change; completes when ShardMap versioned and distributed
Owner: ClusterManager (coordination); LeaderElection (election); ShardManager (placement)
Current Code Locations:
- /aether/cluster/types.go - NodeInfo, ShardMap, LeadershipLease, ActorMigration types
- /aether/cluster/manager.go - ClusterManager, node discovery, rebalancing loop
- /aether/cluster/leader.go - LeaderElection (lease-based using NATS KV)
- /aether/cluster/hashring.go - ConsistentHashRing (shard → node mapping)
- /aether/cluster/shard.go - ShardManager (actor placement, shard assignment)
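A minimal sketch of the actor → shard step the hash ring builds on, assuming an FNV hash and the documented default of 1024 shards; the real ConsistentHashRing may hash differently and additionally maps shards to nodes.

```go
// Illustrative sketch of the actor → shard mapping.
package cluster

import "hash/fnv"

const defaultShardCount = 1024 // documented default

// shardFor maps an actor ID to a stable shard number. Shards, not actors,
// are what the leader assigns to nodes, so a topology change only moves the
// shards that hash to the affected node.
func shardFor(actorID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return h.Sum32() % defaultShardCount
}
```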
Scaling Concerns:
- Leader election latency: 10s lease, 3s heartbeat → ~13s to detect failure (tunable)
- Rebalancing overhead: Consistent hash minimizes movements (only affects shards from failed node)
- Shard count: 1024 default; tune based on cluster size and actor count
Alignment with Vision:
- NATS-Native: Leader election uses NATS KV store (lease-based); cluster discovery via NATS
- Primitives over Frameworks: ShardManager and LeaderElection are composable; can swap algorithms
Gaps/Observations:
- Rebalancing is triggered but algorithm not fully shown in code excerpt ("would rebalance across N nodes")
- Actor migration during rebalancing: ShardManager has PlacementStrategy interface but sample migration handler not shown
- Split-brain prevention: Lease-based (no concurrent leaders) but old leader could execute stale rebalancing
- No explicit actor state migration during shard rebalancing (where does actor state go during move?)
Boundary Rules:
- Inside: Node discovery, leader election, shard assignment, rebalancing decisions
- Outside: Actor state migration (that's Event Sourcing's replay), actual actor message delivery
- Cannot cross: Cluster decisions are made once per cluster (not per namespace or actor)
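To illustrate how the lease and term language above fences out a stale leader, here is a minimal sketch; the field names on LeadershipLease are assumptions, and the real type in /aether/cluster/types.go may differ.

```go
// Illustrative lease/term check; field names are assumed.
package cluster

import "time"

type LeadershipLease struct {
	LeaderID  string
	Term      uint64    // monotonically increasing election round
	ExpiresAt time.Time // lease is only valid until this instant
}

// canCoordinate reports whether a node may act as leader right now: it must
// hold the lease, the lease must not have expired, and its term must be at
// least the newest term this node has observed (fencing against a stale
// leader trying to execute an old rebalancing decision).
func canCoordinate(lease LeadershipLease, nodeID string, observedTerm uint64, now time.Time) bool {
	return lease.LeaderID == nodeID &&
		now.Before(lease.ExpiresAt) &&
		lease.Term >= observedTerm
}
```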
Context 5: Event Bus (Pub/Sub Distribution)
Purpose: Route events from producers to subscribers; support filtering and cross-node propagation.
Core Responsibility:
- Local event distribution (in-process subscriptions)
- Cross-node event distribution via NATS
- Filter events by type and actor pattern
- Support exact and wildcard namespace patterns
- Non-blocking delivery (drop event if channel full, don't block publisher)
Language (Ubiquitous Language):
- Publish: Send event to namespace (synchronous, non-blocking; may drop if subscribers slow)
- Subscribe: Register interest in namespace pattern (returns channel)
- Filter: Criteria for event delivery (EventTypes list, ActorPattern wildcard)
- Wildcard Pattern: "*" (single token), ">" (multi-token) matching
- Subject: NATS subject for routing (e.g., "aether.events.{namespace}")
- Subscriber: Entity receiving events from channel (has local reference to channel)
- Deliver: Attempt to send event to subscriber's channel; non-blocking (may drop)
Key Entities:
- EventBroadcaster interface (local or NATS-backed)
- EventBus (in-memory, local subscriptions only)
- NATSEventBus (extends EventBus; adds NATS forwarding)
- SubscriptionFilter (event types + actor pattern)
- filteredSubscription (internal; tracks channel, pattern, filter)
Key Events Published:
- EventPublished - Event sent via EventBus.Publish (may be delivered to subscribers)
Key Events Consumed:
- Events from Event Sourcing context
Interfaces to Other Contexts:
- Event Sourcing: Reads events to publish; triggered after SaveEvent
- Namespace Isolation: Uses namespace pattern for routing
- Cluster Coordination: Cluster messages flow through event bus
Lifecycle:
- Subscription creation: Caller invokes Subscribe/SubscribeWithFilter; gets channel
- Subscription duration: Lifetime of channel (caller controls)
- Subscription cleanup: Unsubscribe closes channel
- Event delivery: Synchronous Publish → deliver to all matching subscribers
- Dropped events: Non-blocking delivery; full channel = dropped event (metrics recorded)
Owner: Library (EventBus implementation); Callers (subscribe/unsubscribe)
Current Code Locations:
- /aether/eventbus.go - EventBus (local in-process pub/sub)
- /aether/nats_eventbus.go - NATSEventBus (NATS-backed cross-node)
- /aether/pattern.go - MatchNamespacePattern, SubscriptionFilter matching logic
- Metrics tracking in both implementations
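The non-blocking delivery rule is the heart of this context. A minimal sketch of what that looks like per subscriber, assuming the Event shape from the Event Sourcing sketch; the real EventBus also records per-namespace metrics rather than a single counter.

```go
// Illustrative non-blocking delivery; the publisher is never blocked.
package aether

import "sync/atomic"

// deliver attempts to hand an event to one subscriber. If the subscriber's
// buffered channel (100 elements by default) is full, the event is dropped
// and the drop counter incremented instead of blocking Publish.
func deliver(ch chan<- Event, ev Event, dropped *atomic.Uint64) {
	select {
	case ch <- ev:
		// delivered
	default:
		dropped.Add(1) // channel full: drop rather than block the publisher
	}
}
```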
Scaling Concerns:
- Local bus: In-memory channels; scales with subscriber count (no network overhead)
- NATS bus: One NATS subscription per pattern; scales with unique patterns
- Channel buffering: 100-element buffer (configurable); full = dropped events
- Metrics: Track published, delivered, dropped per namespace
Alignment with Vision:
- Primitives over Frameworks: EventBroadcaster is interface; swappable implementations
- NATS-Native: NATSEventBus uses NATS subjects for routing
Gaps/Observations:
- Dropped events are silent (metrics recorded but no callback); might surprise subscribers
- Filter matching is string-based (no compile-time safety for event types)
- Two-level filtering: Namespace at NATS level, EventTypes/ActorPattern at application level
- NATSEventBus creates subscription per unique pattern (could be optimized with pattern hierarchy)
Boundary Rules:
- Inside: Event routing, filter matching, non-blocking delivery
- Outside: Semantics of events (that's Event Sourcing); decisions on what to do when event received
- Cannot cross: Subscribers are responsible for their channels; publisher doesn't know who consumes
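A sketch of the two-level filtering noted above, applied after NATS has already narrowed delivery by namespace subject. The SubscriptionFilter field names are assumptions, and matchNamespacePattern refers to the pattern-matching sketch in the Namespace Isolation section.

```go
// Illustrative application-level filter; field names are assumed.
package aether

// SubscriptionFilter narrows delivery after NATS has matched the namespace.
type SubscriptionFilter struct {
	EventTypes   []string // empty slice means "all event types"
	ActorPattern string   // empty string means "all actors"
}

// matches applies the event-type and actor-pattern filters to one event.
func (f SubscriptionFilter) matches(ev Event) bool {
	if len(f.EventTypes) > 0 {
		found := false
		for _, t := range f.EventTypes {
			if t == ev.Type {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	// Reuses the token-wise wildcard matching from the pattern sketch.
	return f.ActorPattern == "" || matchNamespacePattern(f.ActorPattern, ev.ActorID)
}
```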
Context Relationships
Event Sourcing ↔ Event Bus
Type: Producer/Consumer (one-to-many)
Direction: Event Sourcing produces events; Event Bus distributes them
Integration:
- Application saves event to store (SaveEvent)
- Application publishes same event to bus (Publish)
- Subscribers receive event from bus channel
- Events are same object (Event struct)
Decoupling:
- Store and bus are independent (application coordinates)
- Bus subscribers don't know about storage
- Replay doesn't trigger bus publish (events already stored)
Safety:
- No shared transaction (save and publish are separate)
- Risk: Event saved but publish fails (or vice versa) → bus has stale view
- Mitigation: Application's responsibility to ensure consistency
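A sketch of that application-side coordination: persist first, then publish. The EventBroadcaster signature shown is an assumption; only the save-then-publish ordering and the lack of a shared transaction come from the description above.

```go
// Illustrative coordination of store and bus by the application.
package aether

import "context"

// EventBroadcaster is the pub/sub primitive (signature assumed).
type EventBroadcaster interface {
	Publish(ctx context.Context, namespace string, ev Event) error
}

// saveAndPublish persists the event first (the store is the source of
// truth) and then publishes the same Event value. There is no shared
// transaction: if Publish fails, the event is still durable and subscribers
// can recover it via replay.
func saveAndPublish(ctx context.Context, store EventStore, bus EventBroadcaster,
	namespace string, ev Event) error {
	if err := store.SaveEvent(ctx, ev); err != nil {
		return err
	}
	return bus.Publish(ctx, namespace, ev)
}
```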
Event Sourcing → Optimistic Concurrency Control
Type: Dependency (nested)
Direction: SaveEvent validates version using Optimistic Concurrency
Integration:
- SaveEvent calls GetLatestVersion (read current)
- Checks event.Version > currentVersion (optimistic lock)
- Returns VersionConflictError if not
Decoupling:
- Optimistic Concurrency is not a separate context; it's logic within Event Sourcing
- Version validation is inline in SaveEvent, not a separate call
Note: Initially these seem like separate contexts (different language, different lifecycle). But Version is Event Sourcing's concern; Conflict is just an error condition (not a separate state machine). Optimistic locking is a pattern, not a context.
Event Sourcing → Namespace Isolation
Type: Containment (namespaces contain event streams)
Direction: Namespace Isolation scopes Event Sourcing
Integration:
- JetStreamEventStore accepts Namespace in config
- Actual stream name becomes "{namespace}_{streamName}"
- GetEvents, GetLatestVersion, SaveEvent are namespace-scoped
Decoupling:
- Each namespace has independent version sequences
- No cross-namespace reads in Event Sourcing context
- EventBus.Publish specifies namespace
Safety:
- Complete isolation at storage level (different JetStream streams)
- Events from namespace-A cannot appear in namespace-B queries
- Wildcard subscriptions bypass this (documented risk)
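The naming conventions above can be sketched as two small helpers; the exact sanitization rules in /aether/store/jetstream.go may differ.

```go
// Illustrative naming helpers for namespace-scoped streams and subjects.
package aether

import (
	"fmt"
	"strings"
)

// streamNameFor prefixes the JetStream stream with the namespace so each
// namespace gets an independent stream (and independent version sequences).
func streamNameFor(namespace, streamName string) string {
	sanitized := strings.ReplaceAll(namespace, " ", "_")
	return fmt.Sprintf("%s_%s", sanitized, streamName)
}

// subjectFor builds the NATS subject used to route events for a namespace
// (or namespace pattern, for wildcard subscribers).
func subjectFor(namespacePattern string) string {
	return fmt.Sprintf("aether.events.%s", namespacePattern)
}
```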
Cluster Coordination → Event Sourcing
Type: Consumer (reads version state)
Direction: Cluster queries Event Sourcing for actor state
Integration:
- ClusterManager might query GetLatestVersion to determine if shard can migrate
- Nodes track which actors (shards) are assigned locally
- On failover, new node replays events from store to rebuild state
Decoupling:
- Cluster doesn't manage event storage (Event Sourcing owns that)
- Cluster doesn't decide when to snapshot
- Cluster doesn't know about versions (Event Sourcing concept)
Cluster Coordination → Namespace Isolation
Type: Orthogonal (can combine, but not required)
Direction: Cluster can use namespaced subscriptions; not required
Integration:
- Cluster could publish node-join events to namespaced topics (e.g., "cluster.{tenant}")
- Different tenants can have independent clusters (each with own cluster messages)
Decoupling:
- Cluster doesn't care about namespace semantics
- Namespace doesn't enforce cluster topology
Event Bus → (All contexts)
Type: Cross-cutting concern
Direction: Event Bus distributes events from all contexts
Integration:
- Event Sourcing publishes to bus after SaveEvent
- Cluster Coordination publishes shard assignments to bus
- Namespace Isolation is a parameter to Publish/Subscribe
- Subscribers receive events and can filter by type/actor
Decoupling:
- Bus is asynchronous (events may be lost if no subscribers)
- Subscribers don't block publishers
- No ordering guarantee across namespaces
Boundary Rules Summary
By Language
| Term | Context | Meaning |
|---|---|---|
| Event | Event Sourcing | Immutable fact; identified by ID, type, actor, version |
| Version | Event Sourcing | Monotonically increasing sequence per actor; also used for optimistic locking |
| Snapshot | Event Sourcing | Optional state cache at specific version; always disposable |
| Node | Cluster Coordination | Physical computer in cluster; has ID, address, capacity |
| Leader | Cluster Coordination | Single node elected for coordination (not per-namespace, not per-actor) |
| Shard | Cluster Coordination | Virtual partition for actor placement; 1024 by default |
| Namespace | Namespace Isolation | Logical boundary (tenant, domain, context); application-defined meaning |
| Wildcard | Both Event Bus & Namespace | "*" (single token) and ">" (multi-token) NATS pattern matching |
| Subject | Event Bus | NATS subject for message routing |
| Conflict | Optimistic Concurrency | Condition where write failed due to version being stale |
| Retry | Optimistic Concurrency | Application's decision to reload and try again |
| Subscribe | Event Bus | Register interest in namespace pattern; returns channel |
| Publish | Event Bus | Send event to namespace subscribers; non-blocking |
By Lifecycle
| Entity | Created | Destroyed | Owner | Context |
|---|---|---|---|---|
| Event | SaveEvent | Never (persists forever) | Application writes, Aether stores | Event Sourcing |
| Version | Per-event | With event | Automatic (monotonic) | Event Sourcing |
| Snapshot | Application decision | Application decision | Application | Event Sourcing |
| Node | Join cluster | Explicit leave | Infrastructure | Cluster Coordination |
| Leader | Election completes | Lease expires | Automatic (election) | Cluster Coordination |
| Shard | Created with cluster | With cluster | ClusterManager | Cluster Coordination |
| Namespace | First use | Never (persist) | Application | Namespace Isolation |
| Subscription | Subscribe() call | Unsubscribe() call | Caller | Event Bus |
| Channel | Subscribe() returns | Unsubscribe() closes | Caller | Event Bus |
By Ownership
| Context | Who Decides | What They Decide |
|---|---|---|
| Event Sourcing | Application (developer) | When to save events, event schema, snapshot strategy |
| Optimistic Concurrency | Application | Retry strategy, backoff, giving up |
| Namespace Isolation | Application | Namespace semantics (tenant, domain, env), naming convention |
| Cluster Coordination | ClusterManager & LeaderElection | Node discovery, leader election, shard assignment |
| Event Bus | Application | What to subscribe to, filtering criteria |
By Scaling Boundary
| Context | Scales By | Limits | Tuning |
|---|---|---|---|
| Event Sourcing | Event volume per actor | Replay latency grows with version count | Snapshots help |
| Cluster Coordination | Node count | Leader election latency, rebalancing overhead | Lease TTL, heartbeat interval |
| Namespace Isolation | Namespace count | Stream count, NATS resource usage | Separate JetStream streams |
| Event Bus | Subscriber count | Channel buffering (100 elements) | Queue depth, metrics |
Code vs. Intended: Alignment Analysis
Intended → Actual: Good Alignment
Context: Event Sourcing
- Intended: EventStore interface with multiple implementations
- Actual: InMemoryEventStore (testing) and JetStreamEventStore (production) both exist
- ✓ Good: Matches vision of "primitives over frameworks"
Context: Optimistic Concurrency
- Intended: Detect conflicts, return error, let app retry
- Actual: SaveEvent returns VersionConflictError; no built-in retry
- ✓ Good: Aligns with vision of primitives (app owns retry logic)
Context: Namespace Isolation
- Intended: Logical boundaries without opinionated multi-tenancy
- Actual: JetStreamConfig.Namespace, EventBus namespace patterns
- ✓ Good: Primitives provided; semantics left to app
Context: Cluster Coordination
- Intended: Node discovery, leader election, shard assignment
- Actual: ClusterManager, LeaderElection, ConsistentHashRing all present
- ✓ Good: Primitives implemented
Context: Event Bus
- Intended: Local and cross-node pub/sub with filtering
- Actual: EventBus (local) and NATSEventBus (NATS) both present
- ✓ Good: Extensible via interface
Intended → Actual: Gaps
Context: Cluster Coordination
- Intended: Actor migration during shard rebalancing
- Actual: ShardManager has PlacementStrategy; ActorMigration type defined
- Gap: Migration handler logic not shown; where does actor state transition during rebalance?
- Impact: Cluster context is foundational but incomplete; application must implement actor handoff
Context: Event Sourcing
- Intended: Snapshot strategy guidance
- Actual: SnapshotStore interface; SaveSnapshot exists; no built-in strategy
- Gap: No adaptive snapshotting, no time-based snapshotting
- Impact: App must choose snapshot frequency (documented in PROBLEM_MAP, not enforced)
Context: Namespace Isolation
- Intended: Warn about wildcard security risks
- Actual: SECURITY WARNING in docstrings (excellent)
- Gap: No namespace registry or allow-list to prevent collisions
- Impact: Risk of two teams using same namespace (e.g., "orders") unintentionally
Context: Optimistic Concurrency
- Intended: Guide app on retry strategy
- Actual: Returns VersionConflictError with details
- Gap: No retry helper, no backoff library
- Impact: Each app implements own retry (fine; primitives approach)
Refactoring Backlog (if brownfield)
No Major Refactoring Required
The code structure already aligns well with intended bounded contexts:
- Event Sourcing lives in /event.go and /store/
- Cluster lives in /cluster/
- Event Bus lives in /eventbus.go and /nats_eventbus.go
- Pattern matching lives in /pattern.go
Minor Improvements
Issue 1: Document Actor Migration During Rebalancing
- Current: ShardManager.AssignShard exists; ActorMigration type defined
- Gap: No example code showing how actor state moves between nodes
- Suggestion: Add sample migration handler in cluster package
Issue 2: Add Namespace Validation/Registry
- Current: Namespace is just a string; no collision detection
- Gap: Risk of two teams using same namespace
- Suggestion: Document naming convention (e.g., "env.team.context"); optionally add schema/enum
Issue 3: Snapshot Strategy Recipes
- Current: SnapshotStore interface; app responsible for strategy
- Gap: Documentation could provide sample strategies (time-based, count-based, adaptive)
- Suggestion: Add /examples/snapshot_strategies.go with reference implementations (a count-based sketch follows below)
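A minimal count-based strategy as a reference point for such an example file; the SnapshotStore signature shown is an assumption, since only the interface's existence is documented.

```go
// Count-based snapshot strategy sketch; SnapshotStore signature is assumed.
package examples

import "context"

type SnapshotStore interface {
	SaveSnapshot(ctx context.Context, actorID string, version int64, state []byte) error
}

// maybeSnapshot snapshots once the actor has accumulated `every` events
// since the last snapshot. Snapshots are disposable: replay can always
// rebuild state, so a skipped or lost snapshot only costs replay time.
func maybeSnapshot(ctx context.Context, snaps SnapshotStore, actorID string,
	version, lastSnapshotVersion, every int64, state []byte) error {

	if version-lastSnapshotVersion < every {
		return nil
	}
	return snaps.SaveSnapshot(ctx, actorID, version, state)
}
```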
Issue 4: Metrics for Concurrency Context
- Current: Version conflict detection exists; no metrics
- Gap: Apps can't easily observe conflict rate
- Suggestion: Add conflict metrics to EventStore (or provide hooks)
Recommendations
For Product Strategy
1. Confirm Bounded Contexts: Review this map with the team. Are these five contexts the right cut? Missing any? Too many?
2. Define Invariants per Context:
- Event Sourcing: "Version must be strictly monotonic per actor" ✓ (enforced)
- Cluster Coordination: "Only one leader can have valid lease at a time" ✓ (lease-based)
- Namespace Isolation: "Events in namespace-A cannot be queried from namespace-B context" ✓ (separate streams)
- Optimistic Concurrency: "Conflict detection is synchronous; resolution is async" ✓ (error returned immediately)
- Event Bus: "Delivery is non-blocking; events may be dropped if subscriber slow" ✓ (metrics track this)
3. Map Capabilities to Contexts:
- "Store events durably" → Event Sourcing context
- "Detect concurrent writes" → Optimistic Concurrency context
- "Isolate logical domains" → Namespace Isolation context
- "Distribute actors across nodes" → Cluster Coordination context
- "Route events to subscribers" → Event Bus context
4. Test Boundaries:
- Single-node: Event Sourcing + Optimistic Concurrency + Event Bus (no Cluster)
- Multi-node: Add Cluster Coordination (but cluster decisions don't affect other contexts)
- Multi-tenant: Add Namespace Isolation (orthogonal to other contexts)
For Architecture
1. Complete Cluster Context Documentation:
- Show actor migration lifecycle during shard rebalancing
- Document when state moves (during rebalance, during failover)
- Provide sample ShardManager implementation
2. Add Snapshot Strategy Guidance:
- Time-based: Snapshot every hour
- Count-based: Snapshot every 100 events
- Adaptive: Snapshot when replay latency exceeds threshold
3. Namespace Isolation Checklist:
- Define naming convention (document in README)
- Add compile-time checks (optional enum for known namespaces)
- Test multi-tenant isolation (integration test suite)
4. Concurrency Context Testing:
- Add concurrent writer tests to store tests
- Verify VersionConflictError details are accurate
- Benchmark conflict detection performance
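A sketch of what such a concurrent-writer test could look like. The constructor name, package layout, and error shape are assumptions; only the "attempted version <= current yields VersionConflictError" rule comes from this document.

```go
// Test sketch; constructor name, package layout, and error shape are assumed.
package aether

import (
	"context"
	"errors"
	"testing"
)

func TestConcurrentWritersConflict(t *testing.T) {
	store := NewInMemoryEventStore() // assumed constructor
	ctx := context.Background()

	// Both writers observed version 0 and race to write version 1.
	ev := Event{ActorID: "order-1", Type: "OrderCreated", Version: 1}

	errs := make(chan error, 2)
	for i := 0; i < 2; i++ {
		go func() { errs <- store.SaveEvent(ctx, ev) }()
	}

	conflicts := 0
	for i := 0; i < 2; i++ {
		if err := <-errs; err != nil {
			var vc *VersionConflictError // assumed error shape
			if !errors.As(err, &vc) {
				t.Fatalf("unexpected error: %v", err)
			}
			conflicts++
		}
	}
	if conflicts != 1 {
		t.Fatalf("want exactly one VersionConflictError, got %d", conflicts)
	}
}
```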
For Docs
1. Add Context Diagram: Show the five contexts as boxes; arrows for relationships
2. Add Per-Context Glossary: Define ubiquitous language per context (terms table above)
3. Add Lifecycle Diagrams: Show event lifetime, node lifetime, subscription lifetime, shard lifetime
4. Security Section: Expand wildcard subscription warnings; document trust model
Anti-Patterns Avoided
Pattern: "One Big Event Model"
- Anti-pattern: Single Event struct used everywhere with union types
- What we do: Event is generic; domain language lives in EventType strings and Data map
- Why: Primitives approach; library doesn't impose domain model
Pattern: "Shared Mutable State Across Contexts"
- Anti-pattern: ClusterManager directly mutates EventStore data structures
- What we do: Contexts communicate via events (if they need to) or via explicit queries
- Why: Clean boundaries; each context owns its data
Pattern: "Automatic Retry for Optimistic Locks"
- Anti-pattern: Library retries internally on version conflict
- What we do: Return error to caller; caller decides retry strategy
- Why: Primitives approach; retry policy is app's concern, not library's
Pattern: "Opinionated Snapshot Strategy"
- Anti-pattern: "Snapshot every 100 events" hardcoded
- What we do: SnapshotStore interface; app decides when to snapshot
- Why: Different apps have different replay latency requirements
Pattern: "Wildcard Subscriptions by Default"
- Anti-pattern: All subscriptions use ">" by default (receive everything)
- What we do: Explicit namespaces; wildcard is optional and warned about
- Why: Security-first; isolation is default
Conclusion
Aether's five bounded contexts are well-aligned with the problem space and the codebase:
- Event Sourcing - Store events as immutable history; enable replay
- Optimistic Concurrency - Detect conflicts; let app retry
- Namespace Isolation - Logical boundaries without opinionated multi-tenancy
- Cluster Coordination - Distribute actors, elect leader, rebalance on failure
- Event Bus - Route events from producers to subscribers
Each context has:
- Clear language boundaries (different terms, different meanings)
- Clear lifecycle boundaries (different creation/deletion patterns)
- Clear ownership (who decides what within each context)
- Clear scaling boundaries (why this context must be separate)
The implementation matches the vision of "primitives over frameworks": the library provides composition points (interfaces); applications wire them together.
Next step in product strategy: Define domain models within each context (Step 4 of strategy chain). For now, Aether provides primitives; applications build their domain models on top.