Move product strategy documentation to .product-strategy directory

Organize all product strategy and domain modeling documentation into a
dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# Bounded Context Map: Aether Distributed Actor System
## Summary
Aether has **five distinct bounded contexts** cut by language boundaries, lifecycle differences, ownership patterns, and scaling needs. The contexts emerge from the problem space: single-node event sourcing, distributed clustering, logical isolation, optimistic concurrency control, and event distribution.
**Key insight:** Each context has its own ubiquitous language (different meanings for similar terms) and its own lifecycle (actors persist forever; leases expire; subscriptions have independent lifetimes). Boundaries are enforced by language/data ownership, not by organizational structure.
---
## Bounded Contexts
### Context 1: Event Sourcing
**Purpose:** Persist events as immutable source of truth; enable state rebuild through replay.
**Core Responsibility:**
- Events are facts (immutable, append-only)
- Versions are monotonically increasing per actor
- Snapshots are optional optimization hints, not required
- Replay reconstructs state from history
**Language (Ubiquitous Language):**
- **Event**: Immutable fact about what happened; identified by ID, type, actor, version
- **Version**: Monotonically increasing sequence number per actor; used for optimistic locking
- **Snapshot**: Point-in-time state capture at a specific version; optional; can always replay
- **ActorID**: Identifier for the entity whose events we're storing; unique within namespace
- **Replay**: Process of reading events from start version, applying each, to rebuild state
**Key Entities (Event-Based, not Object-Based):**
- Event (immutable, versioned)
- ActorSnapshot (optional state cache)
- EventStore interface (multiple implementations)
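To ground this vocabulary, here is a minimal sketch of the shapes implied by the terms and entities above. Field and method names are assumptions for illustration only; the actual definitions live in `/aether/event.go` and the store implementations.

```go
package aether

import (
	"context"
	"time"
)

// Event is an immutable fact identified by ID, type, actor, and version.
// Field names here are illustrative, not the library's exact definition.
type Event struct {
	ID        string
	Type      string
	ActorID   string
	Version   int64          // monotonically increasing per actor
	Timestamp time.Time
	Data      map[string]any // domain payload; schema is the application's concern
}

// EventStore is the persistence primitive; InMemoryEventStore and
// JetStreamEventStore would both satisfy an interface of roughly this shape.
type EventStore interface {
	// SaveEvent appends the event iff event.Version > the current version,
	// otherwise it returns a version-conflict error.
	SaveEvent(ctx context.Context, e Event) error
	// GetEvents returns an actor's events from fromVersion onward, in order.
	GetEvents(ctx context.Context, actorID string, fromVersion int64) ([]Event, error)
	// GetLatestVersion returns the highest stored version for the actor (0 if none).
	GetLatestVersion(ctx context.Context, actorID string) (int64, error)
}
```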
**Key Events Published:**
- `EventStored` - Event successfully persisted (triggered when SaveEvent succeeds)
- `VersionConflict` - Attempted version <= current; optimistic lock lost (caller must reload and retry)
- `SnapshotCreated` - State snapshot saved (optional; developers decide when)
**Key Events Consumed:**
- None (this context is a source of truth; others consume from it)
**Interfaces to Other Contexts:**
- **Cluster Coordination**: Cluster leader queries latest versions to assign shards
- **Namespace Isolation**: Stores can be namespaced; queries filtered by namespace
- **Optimistic Concurrency**: Version conflicts trigger retry logic in application
- **Event Bus**: Events stored here are published to bus subscribers
**Lifecycle:**
- Event creation: Triggered by application business logic (domain events)
- Event persistence: Synchronous SaveEvent call (writes to store)
- Event durability: Persists forever (or until retention policy expires in JetStream)
- Snapshot lifecycle: Optional; created by application decision or rebalancing; can be safely discarded (replay recovers)
**Owner:** Developer (application layer) owns writing events; Aether library owns storage
**Current Code Locations:**
- `/aether/event.go` - Event struct, VersionConflictError, ReplayError
- `/aether/store/memory.go` - InMemoryEventStore implementation
- `/aether/store/jetstream.go` - JetStreamEventStore implementation (production)
**Scaling Concerns:**
- Single node: Full replay is fast for actors with fewer than ~100 events; snapshots help beyond that
- Cluster: Events stored in JetStream (durable across nodes); replay happens on failover
- Multi-tenant: Events namespaced; separate streams per namespace avoid cross-contamination
**Alignment with Vision:**
- **Primitives over Frameworks**: EventStore is interface; multiple implementations
- **NATS-Native**: JetStreamEventStore uses JetStream durability
- **Events as Complete History**: Events are source of truth; state is derived
**Gaps/Observations:**
- Snapshot strategy is entirely application's responsibility (no built-in triggering)
- Schema evolution for events not discussed (backward compatibility on deserialization)
- Corruption recovery (ReplayError handling) is application's responsibility
**Boundary Rules:**
- Inside: Event persistence, version validation, replay logic
- Outside: Domain logic that generates events, retry policy on conflicts, snapshot triggering
- Cannot cross: No shared models between Event Sourcing and other contexts; translation happens via events
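The replay boundary above reduces to a fold over stored events. A sketch, assuming the illustrative `Event`/`EventStore` shapes from the earlier block and an application-supplied `apply` function (the domain logic that stays outside this context):

```go
// Rebuild reconstructs actor state by applying every stored event in order.
// The state type and apply function belong to the application; the store only
// supplies history. Returns the rebuilt state and the last applied version.
func Rebuild[S any](ctx context.Context, store EventStore, actorID string,
	initial S, apply func(S, Event) S) (S, int64, error) {

	events, err := store.GetEvents(ctx, actorID, 0)
	if err != nil {
		return initial, 0, err // replay/corruption errors are the application's to handle
	}
	state, version := initial, int64(0)
	for _, e := range events {
		state = apply(state, e)
		version = e.Version
	}
	return state, version, nil
}
```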
---
### Context 2: Optimistic Concurrency Control
**Purpose:** Detect and signal concurrent write conflicts; let application choose retry strategy.
**Core Responsibility:**
- Protect against lost writes from concurrent writers
- Detect conflicts early (version mismatch)
- Provide detailed error context for retry logic
- Enable at-least-once semantics for idempotent operations
**Language (Ubiquitous Language):**
- **Version**: Sequential number tracking writer's view of current state
- **Conflict**: Condition where attempted version <= current version (another writer won)
- **Optimistic Lock**: Assumption that conflicts are rare; detect when they happen
- **Retry**: Application's response to conflict; reload state and attempt again
- **AttemptedVersion**: Version proposed by current writer
- **CurrentVersion**: Version that actually won the race
**Key Entities:**
- VersionConflictError (detailed error with actor ID, attempted, current versions)
- OptimisticLock pattern (implicit; not a first-class entity)
**Key Events Published:**
- `VersionConflict` - SaveEvent rejected due to version <= current (developer retries)
**Key Events Consumed:**
- None directly; consumes version state from Event Sourcing
**Interfaces to Other Contexts:**
- **Event Sourcing**: Reads latest version; detects conflicts on save
- **Application Logic**: Application handles conflict and decides retry strategy
**Lifecycle:**
- Conflict detection: Synchronous in SaveEvent (fast check: version > current)
- Conflict lifecycle: Temporary; conflict happens then application retries with new version
- Error lifecycle: Returned immediately; application decides next action
**Owner:** Aether library (detects conflicts); Application (implements retry strategy)
**Current Code Locations:**
- `/aether/event.go` - ErrVersionConflict sentinel, VersionConflictError type
- `/aether/store/jetstream.go` - SaveEvent validation (lines checking version)
- `/aether/store/memory.go` - SaveEvent validation
**Scaling Concerns:**
- High contention: If many writers target same actor, conflicts spike; application must implement backoff
- Retry storms: Naive retry (tight loop) causes cascade failures; exponential backoff mitigates
- Metrics: Track conflict rate to detect unexpected contention
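A sketch of one possible application-side retry policy (exponential backoff, capped attempts). The import path, the `decide` callback, and the exact store method signatures are assumptions; `errors.Is` presumes the `ErrVersionConflict` sentinel listed under Current Code Locations above wraps correctly.

```go
package app

import (
	"context"
	"errors"
	"fmt"
	"time"

	"example.com/aether" // import path is hypothetical
)

// retrySave is one possible caller-side retry policy: reload the current
// version, let decide() propose the next event, save, and back off
// exponentially on conflict. Aether deliberately ships no such helper.
func retrySave(ctx context.Context, store aether.EventStore, actorID string,
	decide func(currentVersion int64) (aether.Event, error)) error {

	const maxAttempts = 5
	backoff := 50 * time.Millisecond

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		current, err := store.GetLatestVersion(ctx, actorID)
		if err != nil {
			return err
		}
		event, err := decide(current) // rebuild state, propose event at current+1
		if err != nil {
			return err
		}
		if err := store.SaveEvent(ctx, event); err == nil {
			return nil
		} else if !errors.Is(err, aether.ErrVersionConflict) {
			return err // not a conflict: do not retry
		}
		select { // lost the race: wait, then retry against fresh state
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("gave up on %s after %d conflicting attempts", actorID, maxAttempts)
}
```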
**Alignment with Vision:**
- **Primitives over Frameworks**: Aether returns error; application decides what to do
- Does NOT impose retry strategy (that would be a framework opinion)
**Gaps/Observations:**
- No built-in retry mechanism (intentional design choice)
- No conflict metrics in library (application must instrument)
- No guidance on retry backoff strategies in code (documented in PROBLEM_MAP, not in API)
**Boundary Rules:**
- Inside: Detect conflict, validate version > current, return detailed error
- Outside: Retry logic, backoff strategy, exponential delays, giving up after N attempts
- Cannot cross: Each context owns its retry behavior; no global retry handler
---
### Context 3: Namespace Isolation
**Purpose:** Provide logical data boundaries without opinionated multi-tenancy framework.
**Core Responsibility:**
- Route events to subscribers matching namespace pattern
- Isolate event stores by namespace prefix
- Support hierarchical namespace naming (e.g., "prod.tenant-abc", "staging.orders")
- Warn about wildcard bypass of isolation (explicit decision)
**Language (Ubiquitous Language):**
- **Namespace**: Logical boundary (tenant, domain, environment, bounded context)
- **Namespace Pattern**: NATS-style wildcard matching: "*" (single token), ">" (multi-token)
- **Isolation**: Guarantee that events in namespace-A cannot be read from namespace-B (except via wildcard)
- **Wildcard Subscription**: Cross-namespace visibility for trusted components (logging, monitoring)
- **Subject**: NATS subject for routing (e.g., "aether.events.{namespace}")
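The wildcard semantics can be illustrated with a token matcher of roughly this shape; this is an illustration of NATS-style matching, not the actual `MatchNamespacePattern` in `/aether/pattern.go`.

```go
package pattern

import "strings"

// match checks a concrete namespace against a NATS-style pattern:
// "*" matches exactly one dot-delimited token, ">" matches all remaining
// tokens (and must match at least one). Illustrative; edge cases may differ
// from the library's MatchNamespacePattern.
func match(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")
	for i, tok := range p {
		switch {
		case tok == ">":
			return len(n) > i // consumes the rest, e.g. "prod.>" vs "prod.orders.created"
		case i >= len(n):
			return false // pattern has more tokens than the namespace
		case tok == "*" || tok == n[i]:
			continue // single-token wildcard or exact token match
		default:
			return false
		}
	}
	return len(p) == len(n) // no wildcard tail: lengths must agree
}
```

Under these rules, `prod.*` matches `prod.tenant-abc` but not `prod.tenant-abc.orders`, while `prod.>` matches both.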
**Key Entities:**
- Namespace (just a string; meaning is application's)
- JetStreamConfig with Namespace field (storage isolation)
- SubscriptionFilter with namespace pattern (matching)
- NATSEventBus subject routing
**Key Events Published:**
- `EventPublished` - Event sent to namespace subscribers (via EventBus.Publish)
**Key Events Consumed:**
- Events from Event Sourcing, filtered by namespace pattern
**Interfaces to Other Contexts:**
- **Event Sourcing**: Stores can be namespaced (prefix in stream name)
- **Event Bus**: Publishes to namespace; subscribers match by pattern
- **Cluster Coordination**: Might use namespaced subscriptions to isolate tenant events
**Lifecycle:**
- Namespace definition: Application decides; typically per-tenant or per-domain
- Namespace creation: Implicit when first store/subscription uses it (no explicit schema)
- Namespace deletion: Not supported; namespaces persist if events exist
- Stream lifetime: JetStream stream "namespace_events" persists until deleted
**Owner:** Application layer (defines namespace boundaries); Library (enforces routing)
**Current Code Locations:**
- `/aether/eventbus.go` - EventBus exact vs wildcard subscriber routing
- `/aether/nats_eventbus.go` - NATSEventBus subject formatting (line 89: `fmt.Sprintf("aether.events.%s", namespacePattern)`)
- `/aether/store/jetstream.go` - JetStreamConfig.Namespace field, stream name sanitization (line 83)
- `/aether/pattern.go` - MatchNamespacePattern, IsWildcardPattern functions
**Scaling Concerns:**
- Single namespace: All events in one stream; scales with event volume
- Multi-namespace: Separate streams per namespace; scales horizontally (add namespaces independently)
- Wildcard subscriptions: Cross-namespace visibility; careful with security (documented warnings)
**Alignment with Vision:**
- **Primitives over Frameworks**: Namespaces are primitives; no opinionated multi-tenancy layer
- Non-goal: "Opinionated multi-tenancy" - this library provides isolation primitives, not tenant management
**Gaps/Observations:**
- Namespace collision: No validation that namespace names are unique (risk: "orders" used by two teams)
- Wildcard security: Extensively documented in code (SECURITY WARNING appears multiple times); good
- No namespace registry or allow-list (application must enforce naming conventions)
- Sanitization of namespace names happens in JetStreamEventStore (spaces → underscores) but not documented
**Boundary Rules:**
- Inside: Namespace pattern matching, subject routing, stream prefixing
- Outside: Defining namespace semantics (tenant, domain, environment), enforcing conventions
- Cannot cross: Events in namespace-A published to namespace-A only (except wildcard subscribers)
---
### Context 4: Cluster Coordination
**Purpose:** Distribute actors across cluster nodes; elect leader; rebalance on topology changes.
**Core Responsibility:**
- Discover nodes in cluster (NATS-based, no external coordinator)
- Elect one leader using lease-based coordination
- Distribute shards across nodes via consistent hash ring
- Detect node failures and trigger rebalancing
- Provide shard assignment for actor placement
**Language (Ubiquitous Language):**
- **Node**: Physical or logical computer in cluster; has ID, address, capacity, status
- **Leader**: Single node responsible for coordination and rebalancing decisions
- **Term**: Monotonically increasing leadership election round (prevents split-brain)
- **Shard**: Virtual partition (1024 by default); actors hash to shards; shards assigned to nodes
- **Consistent Hash Ring**: Algorithm mapping shards to nodes such that node failures cause minimal rebalancing
- **Rebalancing**: Reassignment of shards when topology changes (node join/fail)
- **ShardMap**: Current state of which shards live on which nodes
- **Heartbeat**: Periodic signal from leader renewing its lease (proves still alive)
- **Lease**: Time window during which leader's authority is valid (TTL-based, not quorum)
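A sketch of how an actor lands on a shard under these terms; the FNV hash and the function shape are illustrative, not necessarily what ConsistentHashRing uses internally.

```go
package cluster

import "hash/fnv"

// shardFor maps an actor ID onto one of numShards virtual partitions
// (1024 by default, per the Shard definition above).
func shardFor(actorID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID)) // fnv's Write never returns an error
	return h.Sum32() % numShards
}
```

The ring then assigns each shard to nodes; because actors hash to stable shards, a node failure only moves the shards that node owned, which is the "minimal rebalancing" property above.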
**Key Entities:**
- NodeInfo (cluster node details: ID, address, capacity, status)
- ShardMap (shard → nodes mapping; versioned)
- LeadershipLease (leader ID, term, expiration)
- ActorMigration (migration record for actor during rebalancing)
**Key Events Published:**
- `NodeJoined` - New node added to cluster
- `NodeFailed` - Node stopped responding (detected by heartbeat timeout)
- `LeaderElected` - Leader selected (term incremented)
- `LeadershipLost` - Leader lease expired (old leader can no longer coordinate)
- `ShardAssigned` - Leader assigns shard to nodes
- `ShardMigrated` - Shard moved from one node to another (during rebalancing)
**Key Events Consumed:**
- Node topology changes (new nodes, failures) → trigger rebalancing
- Leader election results → shard assignments
**Interfaces to Other Contexts:**
- **Namespace Isolation**: Could use namespaced subscriptions for cluster-internal events
- **Event Sourcing**: Cluster queries latest version to assign shards; failures trigger replay on new node
- **Event Bus**: Cluster messages published to event bus; subscribers on each node act on them
**Lifecycle:**
- Cluster formation: Nodes join; first leader elected
- Leadership duration: Until lease expires (~10 seconds in config)
- Shard assignment: Decided by leader; persists in ShardMap
- Node failure: Detected after heartbeat timeout (~90 seconds implied by lease config)
- Rebalancing: Triggered by topology change; completes when ShardMap versioned and distributed
**Owner:** ClusterManager (coordination); LeaderElection (election); ShardManager (placement)
**Current Code Locations:**
- `/aether/cluster/types.go` - NodeInfo, ShardMap, LeadershipLease, ActorMigration types
- `/aether/cluster/manager.go` - ClusterManager, node discovery, rebalancing loop
- `/aether/cluster/leader.go` - LeaderElection (lease-based using NATS KV)
- `/aether/cluster/hashring.go` - ConsistentHashRing (shard → node mapping)
- `/aether/cluster/shard.go` - ShardManager (actor placement, shard assignment)
**Scaling Concerns:**
- Leader election latency: 10s lease, 3s heartbeat → ~13s to detect failure (tunable)
- Rebalancing overhead: Consistent hash minimizes movements (only affects shards from failed node)
- Shard count: 1024 default; tune based on cluster size and actor count
**Alignment with Vision:**
- **NATS-Native**: Leader election uses NATS KV store (lease-based); cluster discovery via NATS
- **Primitives over Frameworks**: ShardManager and LeaderElection are composable; can swap algorithms
**Gaps/Observations:**
- Rebalancing is triggered but algorithm not fully shown in code excerpt ("would rebalance across N nodes")
- Actor migration during rebalancing: ShardManager has PlacementStrategy interface but sample migration handler not shown
- Split-brain prevention: Lease-based (no concurrent leaders) but old leader could execute stale rebalancing
- No explicit actor state migration during shard rebalancing (where does actor state go during move?)
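The stale-rebalancing observation above can be guarded against by checking both lease expiry and term before any coordination action. A minimal sketch with assumed field names (the actual LeadershipLease in `/aether/cluster/types.go` may differ):

```go
package cluster

import "time"

// leaseView mirrors the lease fields described above; names are assumed.
type leaseView struct {
	LeaderID  string
	Term      uint64
	ExpiresAt time.Time
}

// mayCoordinate returns true only while this node holds an unexpired lease at
// the highest term it has observed, so a deposed leader's stale rebalancing
// decisions are rejected locally.
func mayCoordinate(l leaseView, selfID string, highestSeenTerm uint64, now time.Time) bool {
	return l.LeaderID == selfID &&
		l.Term >= highestSeenTerm &&
		now.Before(l.ExpiresAt)
}
```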
**Boundary Rules:**
- Inside: Node discovery, leader election, shard assignment, rebalancing decisions
- Outside: Actor state migration (that's Event Sourcing's replay), actual actor message delivery
- Cannot cross: Cluster decisions are made once per cluster (not per namespace or actor)
---
### Context 5: Event Bus (Pub/Sub Distribution)
**Purpose:** Route events from producers to subscribers; support filtering and cross-node propagation.
**Core Responsibility:**
- Local event distribution (in-process subscriptions)
- Cross-node event distribution via NATS
- Filter events by type and actor pattern
- Support exact and wildcard namespace patterns
- Non-blocking delivery (drop event if channel full, don't block publisher)
**Language (Ubiquitous Language):**
- **Publish**: Send event to namespace subscribers (synchronous call, non-blocking delivery; may drop if subscribers are slow)
- **Subscribe**: Register interest in namespace pattern (returns channel)
- **Filter**: Criteria for event delivery (EventTypes list, ActorPattern wildcard)
- **Wildcard Pattern**: "*" (single token), ">" (multi-token) matching
- **Subject**: NATS subject for routing (e.g., "aether.events.{namespace}")
- **Subscriber**: Entity receiving events from channel (has local reference to channel)
- **Deliver**: Attempt to send event to subscriber's channel; non-blocking (may drop)
**Key Entities:**
- EventBroadcaster interface (local or NATS-backed)
- EventBus (in-memory, local subscriptions only)
- NATSEventBus (extends EventBus; adds NATS forwarding)
- SubscriptionFilter (event types + actor pattern)
- filteredSubscription (internal; tracks channel, pattern, filter)
**Key Events Published:**
- `EventPublished` - Event sent via EventBus.Publish (may be delivered to subscribers)
**Key Events Consumed:**
- Events from Event Sourcing context
**Interfaces to Other Contexts:**
- **Event Sourcing**: Reads events to publish; triggered after SaveEvent
- **Namespace Isolation**: Uses namespace pattern for routing
- **Cluster Coordination**: Cluster messages flow through event bus
**Lifecycle:**
- Subscription creation: Caller invokes Subscribe/SubscribeWithFilter; gets channel
- Subscription duration: Lifetime of channel (caller controls)
- Subscription cleanup: Unsubscribe closes channel
- Event delivery: Synchronous Publish → deliver to all matching subscribers
- Dropped events: Non-blocking delivery; full channel = dropped event (metrics recorded)
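The delivery rules above reduce to a buffered channel and a `select` with a `default` branch. A sketch with illustrative type shapes; the real filter matching lives in `/aether/pattern.go`, and the actor-pattern check is approximated here with `path.Match`.

```go
package eventbus

import (
	"path"
	"slices"
)

// Event and SubscriptionFilter are illustrative shapes for this sketch.
type Event struct {
	Type    string
	ActorID string
}

type SubscriptionFilter struct {
	EventTypes   []string // empty = all types
	ActorPattern string   // empty = all actors
}

// matches applies the two application-level criteria: event-type allow-list
// and actor pattern.
func matches(f SubscriptionFilter, evt Event) bool {
	if len(f.EventTypes) > 0 && !slices.Contains(f.EventTypes, evt.Type) {
		return false
	}
	if f.ActorPattern != "" {
		ok, _ := path.Match(f.ActorPattern, evt.ActorID)
		return ok
	}
	return true
}

// deliver never blocks the publisher: if the subscriber's buffered channel is
// full, the event is dropped and the drop hook (metrics) is invoked instead.
func deliver(ch chan<- Event, evt Event, f SubscriptionFilter, onDrop func()) {
	if !matches(f, evt) {
		return
	}
	select {
	case ch <- evt:
		// delivered; the subscriber drains at its own pace
	default:
		onDrop() // full buffer: drop rather than block Publish
	}
}
```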
**Owner:** Library (EventBus implementation); Callers (subscribe/unsubscribe)
**Current Code Locations:**
- `/aether/eventbus.go` - EventBus (local in-process pub/sub)
- `/aether/nats_eventbus.go` - NATSEventBus (NATS-backed cross-node)
- `/aether/pattern.go` - MatchNamespacePattern, SubscriptionFilter matching logic
- Metrics tracking in both implementations
**Scaling Concerns:**
- Local bus: In-memory channels; scales with subscriber count (no network overhead)
- NATS bus: One NATS subscription per pattern; scales with unique patterns
- Channel buffering: 100-element buffer (configurable); full = dropped events
- Metrics: Track published, delivered, dropped per namespace
**Alignment with Vision:**
- **Primitives over Frameworks**: EventBroadcaster is interface; swappable implementations
- **NATS-Native**: NATSEventBus uses NATS subjects for routing
**Gaps/Observations:**
- Dropped events are silent (metrics recorded but no callback); might surprise subscribers
- Filter matching is string-based (no compile-time safety for event types)
- Two-level filtering: Namespace at NATS level, EventTypes/ActorPattern at application level
- NATSEventBus creates subscription per unique pattern (could be optimized with pattern hierarchy)
**Boundary Rules:**
- Inside: Event routing, filter matching, non-blocking delivery
- Outside: Semantics of events (that's Event Sourcing); decisions on what to do when event received
- Cannot cross: Subscribers are responsible for their channels; publisher doesn't know who consumes
---
## Context Relationships
### Event Sourcing ↔ Event Bus
**Type:** Producer/Consumer (one-to-many)
**Direction:** Event Sourcing produces events; Event Bus distributes them
**Integration:**
- Application saves event to store (SaveEvent)
- Application publishes same event to bus (Publish)
- Subscribers receive event from bus channel
- Events are same object (Event struct)
**Decoupling:**
- Store and bus are independent (application coordinates)
- Bus subscribers don't know about storage
- Replay doesn't trigger bus publish (events already stored)
**Safety:**
- No shared transaction (save and publish are separate)
- Risk: Event saved but publish fails (or vice versa) → bus has stale view
- Mitigation: Application's responsibility to ensure consistency
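A sketch of the application-side coordination described above, reusing the illustrative `Event`/`EventStore` shapes from the Event Sourcing section; the `Publisher` interface and its `Publish` signature are assumptions about the bus API.

```go
// Publisher is the minimal bus capability this sketch needs; EventBus and
// NATSEventBus expose something of roughly this shape.
type Publisher interface {
	Publish(namespace string, e Event) error
}

// saveAndPublish writes the event to the store and then broadcasts the same
// object on the bus. There is no shared transaction: a failed Publish after a
// successful SaveEvent leaves subscribers with a stale view until they
// re-read or replay, exactly the risk noted above.
func saveAndPublish(ctx context.Context, store EventStore, bus Publisher, namespace string, e Event) error {
	if err := store.SaveEvent(ctx, e); err != nil {
		return err // nothing was published, so store and bus stay consistent
	}
	if err := bus.Publish(namespace, e); err != nil {
		// Durable but not broadcast; the application decides whether to retry
		// the publish, log and alert, or rely on subscribers replaying.
		return fmt.Errorf("stored v%d for %s but publish failed: %w", e.Version, e.ActorID, err)
	}
	return nil
}
```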
---
### Event Sourcing → Optimistic Concurrency Control
**Type:** Dependency (nested)
**Direction:** SaveEvent validates version using Optimistic Concurrency
**Integration:**
- SaveEvent calls GetLatestVersion (read current)
- Checks event.Version > currentVersion (optimistic lock)
- Returns VersionConflictError if not
**Decoupling:**
- Optimistic Concurrency is not a separate context; it's logic within Event Sourcing
- Version validation is inline in SaveEvent, not a separate call
**Note:** Initially these seem like separate contexts (different language, different lifecycle). But Version is Event Sourcing's concern; Conflict is just an error condition (not a separate state machine). Optimistic locking is a **pattern**, not a **context**.
---
### Event Sourcing → Namespace Isolation
**Type:** Containment (namespaces contain event streams)
**Direction:** Namespace Isolation scopes Event Sourcing
**Integration:**
- JetStreamEventStore accepts Namespace in config
- Actual stream name becomes "{namespace}_{streamName}"
- GetEvents, GetLatestVersion, SaveEvent are namespace-scoped
**Decoupling:**
- Each namespace has independent version sequences
- No cross-namespace reads in Event Sourcing context
- EventBus.Publish specifies namespace
**Safety:**
- Complete isolation at storage level (different JetStream streams)
- Events from namespace-A cannot appear in namespace-B queries
- Wildcard subscriptions bypass this (documented risk)
---
### Cluster Coordination → Event Sourcing
**Type:** Consumer (reads version state)
**Direction:** Cluster queries Event Sourcing for actor state
**Integration:**
- ClusterManager might query GetLatestVersion to determine if shard can migrate
- Nodes track which actors (shards) are assigned locally
- On failover, new node replays events from store to rebuild state
**Decoupling:**
- Cluster doesn't manage event storage (Event Sourcing owns that)
- Cluster doesn't decide when to snapshot
- Cluster doesn't know about versions (Event Sourcing concept)
---
### Cluster Coordination → Namespace Isolation
**Type:** Orthogonal (can combine, but not required)
**Direction:** Cluster can use namespaced subscriptions; not required
**Integration:**
- Cluster could publish node-join events to namespaced topics (e.g., "cluster.{tenant}")
- Different tenants can have independent clusters (each with own cluster messages)
**Decoupling:**
- Cluster doesn't care about namespace semantics
- Namespace doesn't enforce cluster topology
---
### Event Bus → (All contexts)
**Type:** Cross-cutting concern
**Direction:** Event Bus distributes events from all contexts
**Integration:**
- Event Sourcing publishes to bus after SaveEvent
- Cluster Coordination publishes shard assignments to bus
- Namespace Isolation is a parameter to Publish/Subscribe
- Subscribers receive events and can filter by type/actor
**Decoupling:**
- Bus delivery is fire-and-forget (events are lost if no subscriber is listening)
- Subscribers don't block publishers
- No ordering guarantee across namespaces
---
## Boundary Rules Summary
### By Language
| Term | Context | Meaning |
|----------|---------|---------|
| **Event** | Event Sourcing | Immutable fact; identified by ID, type, actor, version |
| **Version** | Event Sourcing | Monotonically increasing sequence per actor; also used for optimistic locking |
| **Snapshot** | Event Sourcing | Optional state cache at specific version; always disposable |
| **Node** | Cluster Coordination | Physical computer in cluster; has ID, address, capacity |
| **Leader** | Cluster Coordination | Single node elected for coordination (not per-namespace, not per-actor) |
| **Shard** | Cluster Coordination | Virtual partition for actor placement; 1024 by default |
| **Namespace** | Namespace Isolation | Logical boundary (tenant, domain, context); application-defined meaning |
| **Wildcard** | Both Event Bus & Namespace | "*" (single token) and ">" (multi-token) NATS pattern matching |
| **Subject** | Event Bus | NATS subject for message routing |
| **Conflict** | Optimistic Concurrency | Condition where write failed due to version being stale |
| **Retry** | Optimistic Concurrency | Application's decision to reload and try again |
| **Subscribe** | Event Bus | Register interest in namespace pattern; returns channel |
| **Publish** | Event Bus | Send event to namespace subscribers; non-blocking |
### By Lifecycle
| Entity | Created | Destroyed | Owner | Context |
|--------|---------|-----------|-------|---------|
| Event | SaveEvent | Never (persists forever) | Application writes, Aether stores | Event Sourcing |
| Version | Per-event | With event | Automatic (monotonic) | Event Sourcing |
| Snapshot | Application decision | Application decision | Application | Event Sourcing |
| Node | Join cluster | Explicit leave | Infrastructure | Cluster Coordination |
| Leader | Election completes | Lease expires | Automatic (election) | Cluster Coordination |
| Shard | Created with cluster | With cluster | ClusterManager | Cluster Coordination |
| Namespace | First use | Never (persist) | Application | Namespace Isolation |
| Subscription | Subscribe() call | Unsubscribe() call | Caller | Event Bus |
| Channel | Subscribe() returns | Unsubscribe() closes | Caller | Event Bus |
### By Ownership
| Context | Who Decides | What They Decide |
|---------|-------------|------------------|
| Event Sourcing | Application (developer) | When to save events, event schema, snapshot strategy |
| Optimistic Concurrency | Application | Retry strategy, backoff, giving up |
| Namespace Isolation | Application | Namespace semantics (tenant, domain, env), naming convention |
| Cluster Coordination | ClusterManager & LeaderElection | Node discovery, leader election, shard assignment |
| Event Bus | Application | What to subscribe to, filtering criteria |
### By Scaling Boundary
| Context | Scales By | Limits | Tuning |
|---------|-----------|--------|--------|
| Event Sourcing | Event volume per actor | Replay latency grows with version count | Snapshots help |
| Cluster Coordination | Node count | Leader election latency, rebalancing overhead | Lease TTL, heartbeat interval |
| Namespace Isolation | Namespace count | Stream count, NATS resource usage | Separate JetStream streams |
| Event Bus | Subscriber count | Channel buffering (100 elements) | Queue depth, metrics |
---
## Code vs. Intended: Alignment Analysis
### Intended → Actual: Good Alignment
**Context: Event Sourcing**
- Intended: EventStore interface with multiple implementations
- Actual: InMemoryEventStore (testing) and JetStreamEventStore (production) both exist
- ✓ Good: Matches vision of "primitives over frameworks"
**Context: Optimistic Concurrency**
- Intended: Detect conflicts, return error, let app retry
- Actual: SaveEvent returns VersionConflictError; no built-in retry
- ✓ Good: Aligns with vision of primitives (app owns retry logic)
**Context: Namespace Isolation**
- Intended: Logical boundaries without opinionated multi-tenancy
- Actual: JetStreamConfig.Namespace, EventBus namespace patterns
- ✓ Good: Primitives provided; semantics left to app
**Context: Cluster Coordination**
- Intended: Node discovery, leader election, shard assignment
- Actual: ClusterManager, LeaderElection, ConsistentHashRing all present
- ✓ Good: Primitives implemented
**Context: Event Bus**
- Intended: Local and cross-node pub/sub with filtering
- Actual: EventBus (local) and NATSEventBus (NATS) both present
- ✓ Good: Extensible via interface
### Intended → Actual: Gaps
**Context: Cluster Coordination**
- Intended: Actor migration during shard rebalancing
- Actual: ShardManager has PlacementStrategy; ActorMigration type defined
- Gap: Migration handler logic not shown; where does actor state go during a rebalance?
- Impact: Cluster context is foundational but incomplete; application must implement actor handoff
**Context: Event Sourcing**
- Intended: Snapshot strategy guidance
- Actual: SnapshotStore interface; SaveSnapshot exists; no built-in strategy
- Gap: No adaptive snapshotting, no time-based snapshotting
- Impact: App must choose snapshot frequency (documented in PROBLEM_MAP, not enforced)
**Context: Namespace Isolation**
- Intended: Warn about wildcard security risks
- Actual: SECURITY WARNING in docstrings (excellent)
- Gap: No namespace registry or allow-list to prevent collisions
- Impact: Risk of two teams using same namespace (e.g., "orders") unintentionally
**Context: Optimistic Concurrency**
- Intended: Guide app on retry strategy
- Actual: Returns VersionConflictError with details
- Gap: No retry helper, no backoff library
- Impact: Each app implements own retry (fine; primitives approach)
---
## Refactoring Backlog (if brownfield)
### No Major Refactoring Required
The code structure already aligns well with intended bounded contexts:
- Event Sourcing lives in `/event.go` and `/store/`
- Cluster lives in `/cluster/`
- Event Bus lives in `/eventbus.go` and `/nats_eventbus.go`
- Pattern matching lives in `/pattern.go`
### Minor Improvements
**Issue 1: Document Actor Migration During Rebalancing**
- Current: ShardManager.AssignShard exists; ActorMigration type defined
- Gap: No example code showing how actor state moves between nodes
- Suggestion: Add sample migration handler in cluster package
**Issue 2: Add Namespace Validation/Registry**
- Current: Namespace is just a string; no collision detection
- Gap: Risk of two teams using same namespace
- Suggestion: Document naming convention (e.g., "env.team.context"); optionally add schema/enum
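One possible shape for the suggested convention check; the `env.team.context` pattern is taken from the suggestion above, and the validator itself is purely application-side.

```go
package app

import (
	"fmt"
	"regexp"
)

// namespaceRe encodes an "env.team.context" convention: exactly three
// lowercase dot-separated tokens, so wildcard characters can never appear in
// a concrete namespace name.
var namespaceRe = regexp.MustCompile(`^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$`)

// validateNamespace rejects names that break the convention before they reach
// a store or a subscription.
func validateNamespace(ns string) error {
	if !namespaceRe.MatchString(ns) {
		return fmt.Errorf("namespace %q does not match env.team.context", ns)
	}
	return nil
}
```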
**Issue 3: Snapshot Strategy Recipes**
- Current: SnapshotStore interface; app responsible for strategy
- Gap: Documentation could provide sample strategies (time-based, count-based, adaptive)
- Suggestion: Add `/examples/snapshot_strategies.go` with reference implementations
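A sketch of the kind of reference strategy the suggestion asks for: count-based and time-based triggers combined in one policy. Names are chosen for illustration; the real SnapshotStore stays strategy-free.

```go
package app

import "time"

// SnapshotPolicy triggers a snapshot every N events or every interval,
// whichever comes first. An adaptive variant could instead watch replay latency.
type SnapshotPolicy struct {
	EveryNEvents  int64
	EveryInterval time.Duration
}

// ShouldSnapshot is called after applying an event; the application persists
// a snapshot via SnapshotStore when it returns true.
func (p SnapshotPolicy) ShouldSnapshot(version, lastSnapshotVersion int64, lastSnapshotAt, now time.Time) bool {
	if p.EveryNEvents > 0 && version-lastSnapshotVersion >= p.EveryNEvents {
		return true
	}
	if p.EveryInterval > 0 && now.Sub(lastSnapshotAt) >= p.EveryInterval {
		return true
	}
	return false
}
```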
**Issue 4: Metrics for Concurrency Context**
- Current: Version conflict detection exists; no metrics
- Gap: Apps can't easily observe conflict rate
- Suggestion: Add conflict metrics to EventStore (or provide hooks)
---
## Recommendations
### For Product Strategy
1. **Confirm Bounded Contexts**: Review this map with team. Are these five contexts the right cut? Missing any? Too many?
2. **Define Invariants per Context**:
- Event Sourcing: "Version must be strictly monotonic per actor" ✓ (enforced)
- Cluster Coordination: "Only one leader can have valid lease at a time" ✓ (lease-based)
- Namespace Isolation: "Events in namespace-A cannot be queried from namespace-B context" ✓ (separate streams)
- Optimistic Concurrency: "Conflict detection is synchronous; resolution is async" ✓ (error returned immediately)
- Event Bus: "Delivery is non-blocking; events may be dropped if subscriber slow" ✓ (metrics track this)
3. **Map Capabilities to Contexts**:
- "Store events durably" → Event Sourcing context
- "Detect concurrent writes" → Optimistic Concurrency context
- "Isolate logical domains" → Namespace Isolation context
- "Distribute actors across nodes" → Cluster Coordination context
- "Route events to subscribers" → Event Bus context
4. **Test Boundaries**:
- Single-node: Event Sourcing + Optimistic Concurrency + Event Bus (no Cluster)
- Multi-node: Add Cluster Coordination (but cluster decisions don't affect other contexts)
- Multi-tenant: Add Namespace Isolation (orthogonal to other contexts)
### For Architecture
1. **Complete Cluster Context Documentation**:
- Show actor migration lifecycle during shard rebalancing
- Document when state moves (during rebalance, during failover)
- Provide sample ShardManager implementation
2. **Add Snapshot Strategy Guidance**:
- Time-based: Snapshot every hour
- Count-based: Snapshot every 100 events
- Adaptive: Snapshot when replay latency exceeds threshold
3. **Namespace Isolation Checklist**:
- Define naming convention (document in README)
- Add compile-time checks (optional enum for known namespaces)
- Test multi-tenant isolation (integration test suite)
4. **Concurrency Context Testing**:
- Add concurrent writer tests to store tests
- Verify VersionConflictError details are accurate
- Benchmark conflict detection performance
### For Docs
1. **Add Context Diagram**: Show five contexts as boxes; arrows for relationships
2. **Add Per-Context Glossary**: Define ubiquitous language per context (terms table above)
3. **Add Lifecycle Diagrams**: Show event lifetime, node lifetime, subscription lifetime, shard lifetime
4. **Security Section**: Expand wildcard subscription warnings; document trust model
---
## Anti-Patterns Avoided
### Pattern: "One Big Event Model"
- **Anti-pattern**: Single Event struct used everywhere with union types
- **What we do**: Event is generic; domain language lives in EventType strings and Data map
- **Why**: Primitives approach; library doesn't impose domain model
### Pattern: "Shared Mutable State Across Contexts"
- **Anti-pattern**: ClusterManager directly mutates EventStore data structures
- **What we do**: Contexts communicate via events (if they need to) or via explicit queries
- **Why**: Clean boundaries; each context owns its data
### Pattern: "Automatic Retry for Optimistic Locks"
- **Anti-pattern**: Library retries internally on version conflict
- **What we do**: Return error to caller; caller decides retry strategy
- **Why**: Primitives approach; retry policy is app's concern, not library's
### Pattern: "Opinionated Snapshot Strategy"
- **Anti-pattern**: "Snapshot every 100 events" hardcoded
- **What we do**: SnapshotStore interface; app decides when to snapshot
- **Why**: Different apps have different replay latency requirements
### Pattern: "Wildcard Subscriptions by Default"
- **Anti-pattern**: All subscriptions use ">" by default (receive everything)
- **What we do**: Explicit namespaces; wildcard is optional and warned about
- **Why**: Security-first; isolation is default
---
## Conclusion
Aether's five bounded contexts are **well-aligned** with the problem space and the codebase:
1. **Event Sourcing** - Store events as immutable history; enable replay
2. **Optimistic Concurrency** - Detect conflicts; let app retry
3. **Namespace Isolation** - Logical boundaries without opinionated multi-tenancy
4. **Cluster Coordination** - Distribute actors, elect leader, rebalance on failure
5. **Event Bus** - Route events from producers to subscribers
Each context has:
- Clear **language boundaries** (different terms, different meanings)
- Clear **lifecycle boundaries** (different creation/deletion patterns)
- Clear **ownership** (who decides what within each context)
- Clear **scaling boundaries** (why this context must be separate)
The implementation **matches the vision** of "primitives over frameworks": the library provides composition points (interfaces); applications wire them together.
Next step in product strategy: **Define domain models within each context** (Step 4 of strategy chain). For now, Aether provides primitives; applications build their domain models on top.