Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
.product-strategy/BOUNDED_CONTEXT_MAP.md (new file, 751 lines)
# Bounded Context Map: Aether Distributed Actor System

## Summary

Aether has **five distinct bounded contexts** cut by language boundaries, lifecycle differences, ownership patterns, and scaling needs. The contexts emerge from the problem space: single-node event sourcing, distributed clustering, logical isolation, optimistic concurrency control, and event distribution.

**Key insight:** Each context has its own ubiquitous language (different meanings for similar terms) and its own lifecycle (actors persist forever; leases expire; subscriptions have independent lifetimes). Boundaries are enforced by language/data ownership, not by organizational structure.

---

## Bounded Contexts
### Context 1: Event Sourcing

**Purpose:** Persist events as the immutable source of truth; enable state rebuild through replay.

**Core Responsibility:**
- Events are facts (immutable, append-only)
- Versions are monotonically increasing per actor
- Snapshots are optional optimization hints, not required
- Replay reconstructs state from history

**Language (Ubiquitous Language):**
- **Event**: Immutable fact about what happened; identified by ID, type, actor, version
- **Version**: Monotonically increasing sequence number per actor; used for optimistic locking
- **Snapshot**: Point-in-time state capture at a specific version; optional; can always replay
- **ActorID**: Identifier for the entity whose events we're storing; unique within a namespace
- **Replay**: Process of reading events from a start version and applying each in order to rebuild state

**Key Entities (Event-Based, not Object-Based):**
- Event (immutable, versioned)
- ActorSnapshot (optional state cache)
- EventStore interface (multiple implementations; sketched below)

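To ground this language in code, here is a minimal sketch of the shapes this context implies. Only the pieces named in this map (ID, type, actor, version; SaveEvent, GetEvents, GetLatestVersion) come from the map itself; every other field and the exact signatures are assumptions, not Aether's actual API (which lives in `/aether/event.go` and `/aether/store/`).

```go
package aether

import (
	"context"
	"time"
)

// Event is an immutable fact: identified by ID, type, actor, and version.
// Extra fields (Timestamp, Data) are assumptions; the real struct lives in
// /aether/event.go.
type Event struct {
	ID        string
	Type      string
	ActorID   string
	Version   int64
	Timestamp time.Time
	Data      map[string]any
}

// EventStore is the storage primitive this context describes, with multiple
// implementations (in-memory for tests, JetStream for production). The method
// names appear in this map; the exact signatures are assumptions.
type EventStore interface {
	// SaveEvent appends an event and must reject versions <= the current one.
	SaveEvent(ctx context.Context, event *Event) error
	// GetEvents returns an actor's events from fromVersion onward, in order.
	GetEvents(ctx context.Context, actorID string, fromVersion int64) ([]*Event, error)
	// GetLatestVersion returns the highest stored version for an actor (0 if none).
	GetLatestVersion(ctx context.Context, actorID string) (int64, error)
}
```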
**Key Events Published:**
- `EventStored` - Event successfully persisted (triggered when SaveEvent succeeds)
- `VersionConflict` - Attempted version <= current; the optimistic lock was lost and the caller must retry
- `SnapshotCreated` - State snapshot saved (optional; developers decide when)

**Key Events Consumed:**
- None (this context is a source of truth; others consume from it)

**Interfaces to Other Contexts:**
- **Cluster Coordination**: Cluster leader queries latest versions to assign shards
- **Namespace Isolation**: Stores can be namespaced; queries filtered by namespace
- **Optimistic Concurrency**: Version conflicts trigger retry logic in the application
- **Event Bus**: Events stored here are published to bus subscribers

**Lifecycle:**
- Event creation: Triggered by application business logic (domain events)
- Event persistence: Synchronous SaveEvent call (writes to store)
- Event durability: Persists forever (or until a JetStream retention policy expires it)
- Snapshot lifecycle: Optional; created by application decision or rebalancing; can be safely discarded (replay recovers state, as sketched below)

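A hedged sketch of the replay step in this lifecycle, reusing the `Event`/`EventStore` shapes sketched above. `OrderState` and `applyEvent` are hypothetical application-side pieces; Aether never sees them.

```go
package aether

import "context"

// OrderState and applyEvent are hypothetical application-side pieces used only
// to illustrate replay; Aether itself never sees them.
type OrderState struct {
	ItemCount int
}

func applyEvent(s *OrderState, e *Event) {
	switch e.Type {
	case "ItemAdded":
		s.ItemCount++
	case "ItemRemoved":
		s.ItemCount--
	}
}

// RebuildState replays an actor's full history from version 0 and applies each
// event in order. A snapshot, when the application keeps one, only changes the
// starting point; it is never required.
func RebuildState(ctx context.Context, store EventStore, actorID string) (*OrderState, int64, error) {
	state := &OrderState{}
	events, err := store.GetEvents(ctx, actorID, 0)
	if err != nil {
		return nil, 0, err
	}
	var version int64
	for _, e := range events {
		applyEvent(state, e)
		version = e.Version
	}
	return state, version, nil
}
```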
**Owner:** Developer (application layer) owns writing events; the Aether library owns storage

**Current Code Locations:**
- `/aether/event.go` - Event struct, VersionConflictError, ReplayError
- `/aether/store/memory.go` - InMemoryEventStore implementation
- `/aether/store/jetstream.go` - JetStreamEventStore implementation (production)

**Scaling Concerns:**
- Single node: Full replay is fast for actors with <100 events; snapshots help beyond that
- Cluster: Events stored in JetStream (durable across nodes); replay happens on failover
- Multi-tenant: Events are namespaced; separate streams per namespace avoid cross-contamination

**Alignment with Vision:**
- **Primitives over Frameworks**: EventStore is an interface; multiple implementations
- **NATS-Native**: JetStreamEventStore uses JetStream durability
- **Events as Complete History**: Events are the source of truth; state is derived

**Gaps/Observations:**
- Snapshot strategy is entirely the application's responsibility (no built-in triggering)
- Schema evolution for events is not discussed (backward compatibility on deserialization)
- Corruption recovery (ReplayError handling) is the application's responsibility

**Boundary Rules:**
- Inside: Event persistence, version validation, replay logic
- Outside: Domain logic that generates events, retry policy on conflicts, snapshot triggering
- Cannot cross: No shared models between Event Sourcing and other contexts; translation happens via events

---
### Context 2: Optimistic Concurrency Control

**Purpose:** Detect and signal concurrent write conflicts; let the application choose the retry strategy.

**Core Responsibility:**
- Protect against lost writes from concurrent writers
- Detect conflicts early (version mismatch)
- Provide detailed error context for retry logic
- Enable at-least-once semantics for idempotent operations

**Language (Ubiquitous Language):**
- **Version**: Sequential number tracking the writer's view of current state
- **Conflict**: Condition where the attempted version <= current version (another writer won)
- **Optimistic Lock**: Assumption that conflicts are rare; detect them when they happen
- **Retry**: Application's response to a conflict; reload state and attempt again
- **AttemptedVersion**: Version proposed by the current writer
- **CurrentVersion**: Version that actually won the race

**Key Entities:**
- VersionConflictError (detailed error with actor ID, attempted, and current versions)
- OptimisticLock pattern (implicit; not a first-class entity)

**Key Events Published:**
- `VersionConflict` - SaveEvent rejected because version <= current (developer retries)

**Key Events Consumed:**
- None directly; consumes version state from Event Sourcing

**Interfaces to Other Contexts:**
- **Event Sourcing**: Reads latest version; detects conflicts on save
- **Application Logic**: Application handles the conflict and decides the retry strategy

**Lifecycle:**
- Conflict detection: Synchronous in SaveEvent (fast check: version > current)
- Conflict lifecycle: Temporary; a conflict happens, then the application retries with a new version
- Error lifecycle: Returned immediately; the application decides the next action

**Owner:** Aether library (detects conflicts); application (implements retry strategy)

**Current Code Locations:**
- `/aether/event.go` - ErrVersionConflict sentinel, VersionConflictError type
- `/aether/store/jetstream.go` - SaveEvent validation (version checks)
- `/aether/store/memory.go` - SaveEvent validation

**Scaling Concerns:**
- High contention: If many writers target the same actor, conflicts spike; the application must implement backoff
- Retry storms: Naive retry (tight loop) causes cascade failures; exponential backoff mitigates this
- Metrics: Track conflict rate to detect unexpected contention

**Alignment with Vision:**
- **Primitives over Frameworks**: Aether returns an error; the application decides what to do
- Does NOT impose a retry strategy (that would be a framework opinion)

**Gaps/Observations:**
- No built-in retry mechanism (intentional design choice)
- No conflict metrics in the library (the application must instrument)
- No guidance on retry backoff strategies in code (documented in PROBLEM_MAP, not in the API); a sketch follows below

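Since the library deliberately ships no retry helper, the backoff guidance lives with the application. The sketch below shows one possible policy under that assumption; `saveWithRetry` and `buildEvent` are hypothetical, and the local `ErrVersionConflict` only stands in for the real sentinel in `/aether/event.go`.

```go
package aether

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrVersionConflict stands in for the sentinel described in /aether/event.go;
// it is declared here only so the sketch is self-contained.
var ErrVersionConflict = errors.New("aether: version conflict")

// saveWithRetry is one possible application-side retry policy (Aether ships
// none on purpose): reload the latest version, rebuild the event at the next
// version, and back off exponentially between attempts. buildEvent is a
// hypothetical callback supplied by the caller.
func saveWithRetry(ctx context.Context, store EventStore, actorID string,
	buildEvent func(nextVersion int64) *Event) error {

	backoff := 10 * time.Millisecond
	for attempt := 0; attempt < 5; attempt++ {
		current, err := store.GetLatestVersion(ctx, actorID)
		if err != nil {
			return err
		}
		if err := store.SaveEvent(ctx, buildEvent(current+1)); err == nil {
			return nil
		} else if !errors.Is(err, ErrVersionConflict) {
			return err // not a conflict: do not retry
		}
		// Another writer won the race; wait, then retry with a fresh version.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return fmt.Errorf("giving up after repeated version conflicts on actor %s", actorID)
}
```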
**Boundary Rules:**
- Inside: Detect conflict, validate version > current, return detailed error
- Outside: Retry logic, backoff strategy, exponential delays, giving up after N attempts
- Cannot cross: Each context owns its retry behavior; no global retry handler

---
### Context 3: Namespace Isolation

**Purpose:** Provide logical data boundaries without an opinionated multi-tenancy framework.

**Core Responsibility:**
- Route events to subscribers matching a namespace pattern
- Isolate event stores by namespace prefix
- Support hierarchical namespace naming (e.g., "prod.tenant-abc", "staging.orders")
- Warn about wildcard bypass of isolation (an explicit decision)

**Language (Ubiquitous Language):**
- **Namespace**: Logical boundary (tenant, domain, environment, bounded context)
- **Namespace Pattern**: NATS-style wildcard matching: "*" (single token), ">" (multi-token); see the sketch below
- **Isolation**: Guarantee that events in namespace-A cannot be read from namespace-B (except via wildcard)
- **Wildcard Subscription**: Cross-namespace visibility for trusted components (logging, monitoring)
- **Subject**: NATS subject for routing (e.g., "aether.events.{namespace}")

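To make the wildcard rules concrete, here is an illustrative re-implementation of the matching described here. Aether's real logic is `MatchNamespacePattern` in `/aether/pattern.go`; this sketch only mirrors the documented semantics and is not that function.

```go
package aether

import "strings"

// matchNamespace re-implements the documented semantics for illustration:
// tokens are dot-separated, "*" matches exactly one token, ">" matches one or
// more trailing tokens. It is not Aether's MatchNamespacePattern.
//
//	matchNamespace("prod.*", "prod.tenant-abc")        -> true
//	matchNamespace("prod.>", "prod.tenant-abc.orders") -> true
//	matchNamespace("prod.*", "staging.orders")         -> false
func matchNamespace(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")
	for i, tok := range p {
		if i >= len(n) {
			return false
		}
		if tok == ">" {
			return true // consumes this token and everything after it
		}
		if tok == "*" {
			continue // consumes exactly one token
		}
		if tok != n[i] {
			return false
		}
	}
	return len(p) == len(n)
}
```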
**Key Entities:**
- Namespace (just a string; its meaning is the application's)
- JetStreamConfig with Namespace field (storage isolation)
- SubscriptionFilter with namespace pattern (matching)
- NATSEventBus subject routing

**Key Events Published:**
- `EventPublished` - Event sent to namespace subscribers (via EventBus.Publish)

**Key Events Consumed:**
- Events from Event Sourcing, filtered by namespace pattern

**Interfaces to Other Contexts:**
- **Event Sourcing**: Stores can be namespaced (prefix in the stream name)
- **Event Bus**: Publishes to a namespace; subscribers match by pattern
- **Cluster Coordination**: Might use namespaced subscriptions to isolate tenant events

**Lifecycle:**
- Namespace definition: Application decides; typically per-tenant or per-domain
- Namespace creation: Implicit when the first store/subscription uses it (no explicit schema)
- Namespace deletion: Not supported; namespaces persist if events exist
- Stream lifetime: The JetStream stream "namespace_events" persists until deleted

**Owner:** Application layer (defines namespace boundaries); library (enforces routing)

**Current Code Locations:**
- `/aether/eventbus.go` - EventBus exact vs. wildcard subscriber routing
- `/aether/nats_eventbus.go` - NATSEventBus subject formatting (line 89: `fmt.Sprintf("aether.events.%s", namespacePattern)`); see the sketch below
- `/aether/store/jetstream.go` - JetStreamConfig.Namespace field, stream name sanitization (line 83)
- `/aether/pattern.go` - MatchNamespacePattern, IsWildcardPattern functions

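The two concrete routing rules quoted in this context reduce to a few lines. The helper names below are invented for illustration; the subject format is the one cited from `/aether/nats_eventbus.go`, and the namespace-prefixed stream name with space sanitization follows the JetStreamEventStore behavior described in this map.

```go
package aether

import (
	"fmt"
	"strings"
)

// namespaceSubject mirrors the routing rule quoted above from
// /aether/nats_eventbus.go: events for a namespace travel on
// "aether.events.{namespace}".
func namespaceSubject(namespacePattern string) string {
	return fmt.Sprintf("aether.events.%s", namespacePattern)
}

// namespacedStream mirrors the storage-side rule described for
// JetStreamEventStore: the effective stream name is prefixed with the
// namespace, with spaces sanitized to underscores.
func namespacedStream(namespace, streamName string) string {
	return fmt.Sprintf("%s_%s", strings.ReplaceAll(namespace, " ", "_"), streamName)
}
```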
**Scaling Concerns:**
- Single namespace: All events in one stream; scales with event volume
- Multi-namespace: Separate streams per namespace; scales horizontally (add namespaces independently)
- Wildcard subscriptions: Cross-namespace visibility; be careful with security (documented warnings)

**Alignment with Vision:**
- **Primitives over Frameworks**: Namespaces are primitives; no opinionated multi-tenancy layer
- Non-goal: "Opinionated multi-tenancy" - this library provides isolation primitives, not tenant management

**Gaps/Observations:**
- Namespace collision: No validation that namespace names are unique (risk: "orders" used by two teams)
- Wildcard security: Extensively documented in code (the SECURITY WARNING appears multiple times); good
- No namespace registry or allow-list (the application must enforce naming conventions)
- Sanitization of namespace names happens in JetStreamEventStore (spaces → underscores) but is not documented

**Boundary Rules:**
- Inside: Namespace pattern matching, subject routing, stream prefixing
- Outside: Defining namespace semantics (tenant, domain, environment), enforcing conventions
- Cannot cross: Events in namespace-A are published to namespace-A only (except wildcard subscribers)

---
### Context 4: Cluster Coordination

**Purpose:** Distribute actors across cluster nodes; elect a leader; rebalance on topology changes.

**Core Responsibility:**
- Discover nodes in the cluster (NATS-based, no external coordinator)
- Elect one leader using lease-based coordination
- Distribute shards across nodes via a consistent hash ring
- Detect node failures and trigger rebalancing
- Provide shard assignment for actor placement

**Language (Ubiquitous Language):**
- **Node**: Physical or logical computer in the cluster; has ID, address, capacity, status
- **Leader**: Single node responsible for coordination and rebalancing decisions
- **Term**: Monotonically increasing leadership election round (prevents split-brain)
- **Shard**: Virtual partition (1024 by default); actors hash to shards; shards are assigned to nodes (see the sketch below)
- **Consistent Hash Ring**: Algorithm mapping shards to nodes such that node failures cause minimal rebalancing
- **Rebalancing**: Reassignment of shards when topology changes (node join/fail)
- **ShardMap**: Current state of which shards live on which nodes
- **Heartbeat**: Periodic signal from the leader renewing its lease (proves it is still alive)
- **Lease**: Time window during which the leader's authority is valid (TTL-based, not quorum)

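A sketch of the two-step placement this vocabulary implies: actor → shard → node. The real ring lives in `/aether/cluster/hashring.go`; the FNV-1a hash and the function names here are assumptions for illustration only.

```go
package aether

import "hash/fnv"

// DefaultShardCount is the 1024-shard default this map cites.
const DefaultShardCount = 1024

// shardFor hashes an actor ID onto one of the virtual shards. FNV-1a is an
// assumption for illustration; the real mapping lives in
// /aether/cluster/hashring.go.
func shardFor(actorID string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return h.Sum32() % shardCount
}

// nodeFor resolves an actor in two steps, actor -> shard -> node, using a
// shard-to-node table shaped like the ShardMap described above.
func nodeFor(actorID string, shardToNode map[uint32]string) (nodeID string, ok bool) {
	nodeID, ok = shardToNode[shardFor(actorID, DefaultShardCount)]
	return nodeID, ok
}
```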
**Key Entities:**
- NodeInfo (cluster node details: ID, address, capacity, status)
- ShardMap (shard → nodes mapping; versioned)
- LeadershipLease (leader ID, term, expiration)
- ActorMigration (migration record for an actor during rebalancing)

**Key Events Published:**
- `NodeJoined` - New node added to the cluster
- `NodeFailed` - Node stopped responding (detected by heartbeat timeout)
- `LeaderElected` - Leader selected (term incremented)
- `LeadershipLost` - Leader lease expired (the old leader can no longer coordinate)
- `ShardAssigned` - Leader assigns a shard to nodes
- `ShardMigrated` - Shard moved from one node to another (during rebalancing)

**Key Events Consumed:**
- Node topology changes (new nodes, failures) → trigger rebalancing
- Leader election results → shard assignments

**Interfaces to Other Contexts:**
- **Namespace Isolation**: Could use namespaced subscriptions for cluster-internal events
- **Event Sourcing**: Cluster queries the latest version to assign shards; failures trigger replay on the new node
- **Event Bus**: Cluster messages are published to the event bus; subscribers on each node act on them

**Lifecycle:**
- Cluster formation: Nodes join; the first leader is elected
- Leadership duration: Until the lease expires (~10 seconds in the config); renewal is sketched below
- Shard assignment: Decided by the leader; persists in the ShardMap
- Node failure: Detected after heartbeat timeout (~90 seconds implied by the lease config)
- Rebalancing: Triggered by a topology change; completes when the ShardMap is versioned and distributed

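To make the lease rules concrete, here is an in-memory sketch of renewal and takeover. The real implementation does this as a compare-and-set against the NATS KV store in `/aether/cluster/leader.go`; the `renewOrTakeOver` helper and its exact semantics are assumptions that only illustrate the TTL-and-term behavior described here. The `LeadershipLease` fields mirror the entity listed above.

```go
package aether

import "time"

// LeadershipLease mirrors the entity described in this context: leader ID,
// term, and expiry. The real type lives in /aether/cluster/types.go.
type LeadershipLease struct {
	LeaderID string
	Term     uint64
	Expires  time.Time
}

// renewOrTakeOver sketches the lease rule: a sitting leader renews its
// unexpired lease on each heartbeat; any node may claim an expired lease with
// a strictly higher term. The boolean reports whether the caller now leads.
func renewOrTakeOver(current LeadershipLease, nodeID string, ttl time.Duration, now time.Time) (LeadershipLease, bool) {
	switch {
	case current.LeaderID == nodeID && now.Before(current.Expires):
		// Heartbeat: extend our own lease without changing the term.
		return LeadershipLease{LeaderID: nodeID, Term: current.Term, Expires: now.Add(ttl)}, true
	case now.After(current.Expires):
		// Expired lease: take over with a new, higher term (split-brain guard).
		return LeadershipLease{LeaderID: nodeID, Term: current.Term + 1, Expires: now.Add(ttl)}, true
	default:
		// Someone else holds a valid lease: do nothing.
		return current, false
	}
}
```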
**Owner:** ClusterManager (coordination); LeaderElection (election); ShardManager (placement)

**Current Code Locations:**
- `/aether/cluster/types.go` - NodeInfo, ShardMap, LeadershipLease, ActorMigration types
- `/aether/cluster/manager.go` - ClusterManager, node discovery, rebalancing loop
- `/aether/cluster/leader.go` - LeaderElection (lease-based using NATS KV)
- `/aether/cluster/hashring.go` - ConsistentHashRing (shard → node mapping)
- `/aether/cluster/shard.go` - ShardManager (actor placement, shard assignment)

**Scaling Concerns:**
- Leader election latency: 10s lease, 3s heartbeat → ~13s to detect failure (tunable)
- Rebalancing overhead: Consistent hashing minimizes movement (only shards from the failed node are affected)
- Shard count: 1024 by default; tune based on cluster size and actor count

**Alignment with Vision:**
- **NATS-Native**: Leader election uses the NATS KV store (lease-based); cluster discovery via NATS
- **Primitives over Frameworks**: ShardManager and LeaderElection are composable; algorithms can be swapped

**Gaps/Observations:**
- Rebalancing is triggered, but the algorithm is not fully shown in the code excerpt ("would rebalance across N nodes")
- Actor migration during rebalancing: ShardManager has a PlacementStrategy interface, but a sample migration handler is not shown
- Split-brain prevention: Lease-based (no concurrent leaders), but an old leader could execute a stale rebalancing
- No explicit actor state migration during shard rebalancing (where does actor state go during a move?)

**Boundary Rules:**
- Inside: Node discovery, leader election, shard assignment, rebalancing decisions
- Outside: Actor state migration (that's Event Sourcing's replay), actual actor message delivery
- Cannot cross: Cluster decisions are made once per cluster (not per namespace or per actor)

---
### Context 5: Event Bus (Pub/Sub Distribution)

**Purpose:** Route events from producers to subscribers; support filtering and cross-node propagation.

**Core Responsibility:**
- Local event distribution (in-process subscriptions)
- Cross-node event distribution via NATS
- Filter events by type and actor pattern
- Support exact and wildcard namespace patterns
- Non-blocking delivery (drop the event if the channel is full; never block the publisher)

**Language (Ubiquitous Language):**
- **Publish**: Send an event to a namespace (a synchronous call with non-blocking delivery; may drop if subscribers are slow)
- **Subscribe**: Register interest in a namespace pattern (returns a channel)
- **Filter**: Criteria for event delivery (EventTypes list, ActorPattern wildcard)
- **Wildcard Pattern**: "*" (single token), ">" (multi-token) matching
- **Subject**: NATS subject for routing (e.g., "aether.events.{namespace}")
- **Subscriber**: Entity receiving events from a channel (holds a local reference to the channel)
- **Deliver**: Attempt to send an event to the subscriber's channel; non-blocking (may drop), as sketched below

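The non-blocking delivery rule reduces to a `select` with a `default` branch. This sketch reuses the `Event` type sketched earlier; the `deliver` helper and the counter are illustrative names, not Aether's internals.

```go
package aether

import "sync/atomic"

// deliver sketches the non-blocking rule: try to send, and if the subscriber's
// buffered channel (100 elements by default) is full, drop the event and count
// it. The function and counter names are illustrative, not Aether's internals.
func deliver(ch chan<- *Event, e *Event, dropped *atomic.Int64) {
	select {
	case ch <- e:
		// Delivered without ever blocking the publisher.
	default:
		dropped.Add(1) // subscriber too slow: the event is dropped, silently
	}
}
```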
**Key Entities:**
- EventBroadcaster interface (local or NATS-backed)
- EventBus (in-memory, local subscriptions only)
- NATSEventBus (extends EventBus; adds NATS forwarding)
- SubscriptionFilter (event types + actor pattern)
- filteredSubscription (internal; tracks channel, pattern, filter)

**Key Events Published:**
- `EventPublished` - Event sent via EventBus.Publish (may be delivered to subscribers)

**Key Events Consumed:**
- Events from the Event Sourcing context

**Interfaces to Other Contexts:**
- **Event Sourcing**: Reads events to publish; triggered after SaveEvent
- **Namespace Isolation**: Uses the namespace pattern for routing
- **Cluster Coordination**: Cluster messages flow through the event bus

**Lifecycle:**
- Subscription creation: Caller invokes Subscribe/SubscribeWithFilter; gets a channel
- Subscription duration: Lifetime of the channel (the caller controls it)
- Subscription cleanup: Unsubscribe closes the channel
- Event delivery: Synchronous Publish → deliver to all matching subscribers (filter matching is sketched below)
- Dropped events: Non-blocking delivery; a full channel means a dropped event (metrics are recorded)

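A sketch of the application-level filter check this lifecycle describes, mirroring the `SubscriptionFilter` fields named above and reusing the `matchNamespace` sketch from the Namespace Isolation context. The actual matching code is in `/aether/pattern.go`; the helper below only illustrates the semantics, and the exact signature of `SubscribeWithFilter` is not assumed here.

```go
package aether

// SubscriptionFilter mirrors the filter entity above: an allow-list of event
// types plus an actor pattern. The real type lives in the Aether sources.
type SubscriptionFilter struct {
	EventTypes   []string
	ActorPattern string
}

// matchesFilter is the application-level check described in this lifecycle:
// empty EventTypes means "all types", and ActorPattern reuses the same
// NATS-style wildcard rules sketched in the Namespace Isolation context.
func matchesFilter(f SubscriptionFilter, e *Event) bool {
	if len(f.EventTypes) > 0 {
		allowed := false
		for _, t := range f.EventTypes {
			if t == e.Type {
				allowed = true
				break
			}
		}
		if !allowed {
			return false
		}
	}
	return f.ActorPattern == "" || matchNamespace(f.ActorPattern, e.ActorID)
}
```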
**Owner:** Library (EventBus implementation); callers (subscribe/unsubscribe)

**Current Code Locations:**
- `/aether/eventbus.go` - EventBus (local in-process pub/sub)
- `/aether/nats_eventbus.go` - NATSEventBus (NATS-backed, cross-node)
- `/aether/pattern.go` - MatchNamespacePattern, SubscriptionFilter matching logic
- Metrics tracking in both implementations

**Scaling Concerns:**
- Local bus: In-memory channels; scales with subscriber count (no network overhead)
- NATS bus: One NATS subscription per pattern; scales with the number of unique patterns
- Channel buffering: 100-element buffer (configurable); a full buffer means dropped events
- Metrics: Track published, delivered, dropped per namespace

**Alignment with Vision:**
- **Primitives over Frameworks**: EventBroadcaster is an interface; implementations are swappable
- **NATS-Native**: NATSEventBus uses NATS subjects for routing

**Gaps/Observations:**
- Dropped events are silent (metrics are recorded but there is no callback); this might surprise subscribers
- Filter matching is string-based (no compile-time safety for event types)
- Two-level filtering: Namespace at the NATS level, EventTypes/ActorPattern at the application level
- NATSEventBus creates a subscription per unique pattern (could be optimized with a pattern hierarchy)

**Boundary Rules:**
- Inside: Event routing, filter matching, non-blocking delivery
- Outside: Semantics of events (that's Event Sourcing); decisions on what to do when an event is received
- Cannot cross: Subscribers are responsible for their channels; the publisher doesn't know who consumes

---
## Context Relationships

### Event Sourcing ↔ Event Bus

**Type:** Producer/Consumer (one-to-many)

**Direction:** Event Sourcing produces events; Event Bus distributes them

**Integration:**
- Application saves the event to the store (SaveEvent)
- Application publishes the same event to the bus (Publish)
- Subscribers receive the event from the bus channel
- Events are the same object (the Event struct)

**Decoupling:**
- Store and bus are independent (the application coordinates them)
- Bus subscribers don't know about storage
- Replay doesn't trigger a bus publish (the events were already stored)

**Safety:**
- No shared transaction (save and publish are separate)
- Risk: Event saved but publish fails (or vice versa) → the bus has a stale view
- Mitigation: It is the application's responsibility to ensure consistency (see the sketch below)

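A sketch of the save-then-publish coordination described here, reusing the earlier `Event`/`EventStore` sketches. The `EventPublisher` interface is a stand-in for whatever publish capability the application wires in (Aether's `EventBroadcaster` is the real counterpart); its exact signature is an assumption.

```go
package aether

import "context"

// EventPublisher captures only the publish capability used in this sketch;
// its signature is assumed, with Aether's EventBroadcaster as the real
// counterpart.
type EventPublisher interface {
	Publish(ctx context.Context, namespace string, event *Event) error
}

// saveThenPublish shows the coordination described above: two separate calls,
// no shared transaction, so the application owns the consistency story.
func saveThenPublish(ctx context.Context, store EventStore, bus EventPublisher, namespace string, e *Event) error {
	if err := store.SaveEvent(ctx, e); err != nil {
		return err // nothing stored, nothing published
	}
	if err := bus.Publish(ctx, namespace, e); err != nil {
		// The event is durable but subscribers now have a stale view; the
		// caller must decide whether to retry the publish or rely on replay.
		return err
	}
	return nil
}
```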
---

### Event Sourcing → Optimistic Concurrency Control

**Type:** Dependency (nested)

**Direction:** SaveEvent validates the version using Optimistic Concurrency

**Integration:**
- SaveEvent calls GetLatestVersion (read current)
- Checks event.Version > currentVersion (optimistic lock)
- Returns VersionConflictError if not

**Decoupling:**
- Optimistic Concurrency is not a separate context; it's logic within Event Sourcing
- Version validation is inline in SaveEvent, not a separate call

**Note:** Initially these seem like separate contexts (different language, different lifecycle). But Version is Event Sourcing's concern; Conflict is just an error condition (not a separate state machine). Optimistic locking is a **pattern**, not a **context**.

---

### Event Sourcing → Namespace Isolation

**Type:** Containment (namespaces contain event streams)

**Direction:** Namespace Isolation scopes Event Sourcing

**Integration:**
- JetStreamEventStore accepts Namespace in its config
- The actual stream name becomes "{namespace}_{streamName}"
- GetEvents, GetLatestVersion, SaveEvent are namespace-scoped

**Decoupling:**
- Each namespace has independent version sequences
- No cross-namespace reads in the Event Sourcing context
- EventBus.Publish specifies the namespace

**Safety:**
- Complete isolation at the storage level (different JetStream streams)
- Events from namespace-A cannot appear in namespace-B queries
- Wildcard subscriptions bypass this (a documented risk)

---
### Cluster Coordination → Event Sourcing

**Type:** Consumer (reads version state)

**Direction:** Cluster queries Event Sourcing for actor state

**Integration:**
- ClusterManager might query GetLatestVersion to determine whether a shard can migrate
- Nodes track which actors (shards) are assigned locally
- On failover, the new node replays events from the store to rebuild state

**Decoupling:**
- Cluster doesn't manage event storage (Event Sourcing owns that)
- Cluster doesn't decide when to snapshot
- Cluster doesn't know about versions (an Event Sourcing concept)

---

### Cluster Coordination → Namespace Isolation

**Type:** Orthogonal (can combine, but not required)

**Direction:** Cluster can use namespaced subscriptions; not required

**Integration:**
- Cluster could publish node-join events to namespaced topics (e.g., "cluster.{tenant}")
- Different tenants can have independent clusters (each with its own cluster messages)

**Decoupling:**
- Cluster doesn't care about namespace semantics
- Namespace doesn't enforce cluster topology

---

### Event Bus → (All contexts)

**Type:** Cross-cutting concern

**Direction:** Event Bus distributes events from all contexts

**Integration:**
- Event Sourcing publishes to the bus after SaveEvent
- Cluster Coordination publishes shard assignments to the bus
- Namespace Isolation is a parameter to Publish/Subscribe
- Subscribers receive events and can filter by type/actor

**Decoupling:**
- The bus is asynchronous (events may be lost if there are no subscribers)
- Subscribers don't block publishers
- No ordering guarantee across namespaces

---
## Boundary Rules Summary

### By Language

| Language | Context | Meaning |
|----------|---------|---------|
| **Event** | Event Sourcing | Immutable fact; identified by ID, type, actor, version |
| **Version** | Event Sourcing | Monotonically increasing sequence per actor; also used for optimistic locking |
| **Snapshot** | Event Sourcing | Optional state cache at a specific version; always disposable |
| **Node** | Cluster Coordination | Physical computer in the cluster; has ID, address, capacity |
| **Leader** | Cluster Coordination | Single node elected for coordination (not per-namespace, not per-actor) |
| **Shard** | Cluster Coordination | Virtual partition for actor placement; 1024 by default |
| **Namespace** | Namespace Isolation | Logical boundary (tenant, domain, context); application-defined meaning |
| **Wildcard** | Event Bus & Namespace Isolation | "*" (single token) and ">" (multi-token) NATS pattern matching |
| **Subject** | Event Bus | NATS subject for message routing |
| **Conflict** | Optimistic Concurrency | Condition where a write failed because its version was stale |
| **Retry** | Optimistic Concurrency | Application's decision to reload and try again |
| **Subscribe** | Event Bus | Register interest in a namespace pattern; returns a channel |
| **Publish** | Event Bus | Send an event to namespace subscribers; non-blocking |

### By Lifecycle

| Entity | Created | Destroyed | Owner | Context |
|--------|---------|-----------|-------|---------|
| Event | SaveEvent | Never (persists forever) | Application writes, Aether stores | Event Sourcing |
| Version | Per event | With the event | Automatic (monotonic) | Event Sourcing |
| Snapshot | Application decision | Application decision | Application | Event Sourcing |
| Node | Joins the cluster | Explicit leave | Infrastructure | Cluster Coordination |
| Leader | Election completes | Lease expires | Automatic (election) | Cluster Coordination |
| Shard | Created with the cluster | With the cluster | ClusterManager | Cluster Coordination |
| Namespace | First use | Never (persists) | Application | Namespace Isolation |
| Subscription | Subscribe() call | Unsubscribe() call | Caller | Event Bus |
| Channel | Subscribe() returns it | Unsubscribe() closes it | Caller | Event Bus |

### By Ownership

| Context | Who Decides | What They Decide |
|---------|-------------|------------------|
| Event Sourcing | Application (developer) | When to save events, event schema, snapshot strategy |
| Optimistic Concurrency | Application | Retry strategy, backoff, giving up |
| Namespace Isolation | Application | Namespace semantics (tenant, domain, env), naming convention |
| Cluster Coordination | ClusterManager & LeaderElection | Node discovery, leader election, shard assignment |
| Event Bus | Application | What to subscribe to, filtering criteria |

### By Scaling Boundary

| Context | Scales By | Limits | Tuning |
|---------|-----------|--------|--------|
| Event Sourcing | Event volume per actor | Replay latency grows with version count | Snapshots help |
| Cluster Coordination | Node count | Leader election latency, rebalancing overhead | Lease TTL, heartbeat interval |
| Namespace Isolation | Namespace count | Stream count, NATS resource usage | Separate JetStream streams |
| Event Bus | Subscriber count | Channel buffering (100 elements) | Queue depth, metrics |

---
## Code vs. Intended: Alignment Analysis

### Intended → Actual: Good Alignment

**Context: Event Sourcing**
- Intended: EventStore interface with multiple implementations
- Actual: InMemoryEventStore (testing) and JetStreamEventStore (production) both exist
- ✓ Good: Matches the vision of "primitives over frameworks"

**Context: Optimistic Concurrency**
- Intended: Detect conflicts, return an error, let the app retry
- Actual: SaveEvent returns VersionConflictError; no built-in retry
- ✓ Good: Aligns with the vision of primitives (the app owns retry logic)

**Context: Namespace Isolation**
- Intended: Logical boundaries without opinionated multi-tenancy
- Actual: JetStreamConfig.Namespace, EventBus namespace patterns
- ✓ Good: Primitives provided; semantics left to the app

**Context: Cluster Coordination**
- Intended: Node discovery, leader election, shard assignment
- Actual: ClusterManager, LeaderElection, ConsistentHashRing all present
- ✓ Good: Primitives implemented

**Context: Event Bus**
- Intended: Local and cross-node pub/sub with filtering
- Actual: EventBus (local) and NATSEventBus (NATS) both present
- ✓ Good: Extensible via interface

### Intended → Actual: Gaps

**Context: Cluster Coordination**
- Intended: Actor migration during shard rebalancing
- Actual: ShardManager has PlacementStrategy; the ActorMigration type is defined
- Gap: Migration handler logic is not shown; where does actor state go during a rebalance?
- Impact: The cluster context is foundational but incomplete; the application must implement actor handoff

**Context: Event Sourcing**
- Intended: Snapshot strategy guidance
- Actual: SnapshotStore interface; SaveSnapshot exists; no built-in strategy
- Gap: No adaptive snapshotting, no time-based snapshotting
- Impact: The app must choose the snapshot frequency (documented in PROBLEM_MAP, not enforced)

**Context: Namespace Isolation**
- Intended: Warn about wildcard security risks
- Actual: SECURITY WARNING in docstrings (excellent)
- Gap: No namespace registry or allow-list to prevent collisions
- Impact: Risk of two teams using the same namespace (e.g., "orders") unintentionally

**Context: Optimistic Concurrency**
- Intended: Guide the app on retry strategy
- Actual: Returns VersionConflictError with details
- Gap: No retry helper, no backoff library
- Impact: Each app implements its own retry (fine; this is the primitives approach)

---
## Refactoring Backlog (if brownfield)

### No Major Refactoring Required

The code structure already aligns well with the intended bounded contexts:
- Event Sourcing lives in `/event.go` and `/store/`
- Cluster lives in `/cluster/`
- Event Bus lives in `/eventbus.go` and `/nats_eventbus.go`
- Pattern matching lives in `/pattern.go`

### Minor Improvements

**Issue 1: Document Actor Migration During Rebalancing**
- Current: ShardManager.AssignShard exists; the ActorMigration type is defined
- Gap: No example code showing how actor state moves between nodes
- Suggestion: Add a sample migration handler in the cluster package

**Issue 2: Add Namespace Validation/Registry**
- Current: A namespace is just a string; no collision detection
- Gap: Risk of two teams using the same namespace
- Suggestion: Document a naming convention (e.g., "env.team.context"); optionally add a schema/enum

**Issue 3: Snapshot Strategy Recipes**
- Current: SnapshotStore interface; the app is responsible for the strategy
- Gap: Documentation could provide sample strategies (time-based, count-based, adaptive)
- Suggestion: Add `/examples/snapshot_strategies.go` with reference implementations

**Issue 4: Metrics for Concurrency Context**
- Current: Version conflict detection exists; no metrics
- Gap: Apps can't easily observe the conflict rate
- Suggestion: Add conflict metrics to EventStore (or provide hooks); one possible hook shape is sketched below

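One possible shape for such a hook, assuming nothing beyond the `EventStore` interface and the `ErrVersionConflict` sentinel sketched earlier: a decorator the application could wrap around any store. Nothing below exists in Aether today; it only illustrates the suggestion.

```go
package aether

import (
	"context"
	"errors"
	"sync/atomic"
)

// instrumentedStore wraps any EventStore and counts version conflicts without
// touching the library. It reuses the ErrVersionConflict sentinel from the
// retry sketch above; nothing here exists in Aether today.
type instrumentedStore struct {
	EventStore
	conflicts atomic.Int64
}

func (s *instrumentedStore) SaveEvent(ctx context.Context, e *Event) error {
	err := s.EventStore.SaveEvent(ctx, e)
	if errors.Is(err, ErrVersionConflict) {
		s.conflicts.Add(1) // exported via whatever metrics system the app uses
	}
	return err
}
```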
---
## Recommendations

### For Product Strategy

1. **Confirm Bounded Contexts**: Review this map with the team. Are these five contexts the right cut? Missing any? Too many?

2. **Define Invariants per Context**:
   - Event Sourcing: "Version must be strictly monotonic per actor" ✓ (enforced)
   - Cluster Coordination: "Only one leader can hold a valid lease at a time" ✓ (lease-based)
   - Namespace Isolation: "Events in namespace-A cannot be queried from a namespace-B context" ✓ (separate streams)
   - Optimistic Concurrency: "Conflict detection is synchronous; resolution is async" ✓ (error returned immediately)
   - Event Bus: "Delivery is non-blocking; events may be dropped if a subscriber is slow" ✓ (metrics track this)

3. **Map Capabilities to Contexts**:
   - "Store events durably" → Event Sourcing context
   - "Detect concurrent writes" → Optimistic Concurrency context
   - "Isolate logical domains" → Namespace Isolation context
   - "Distribute actors across nodes" → Cluster Coordination context
   - "Route events to subscribers" → Event Bus context

4. **Test Boundaries**:
   - Single-node: Event Sourcing + Optimistic Concurrency + Event Bus (no Cluster)
   - Multi-node: Add Cluster Coordination (but cluster decisions don't affect other contexts)
   - Multi-tenant: Add Namespace Isolation (orthogonal to the other contexts)

### For Architecture

1. **Complete Cluster Context Documentation**:
   - Show the actor migration lifecycle during shard rebalancing
   - Document when state moves (during rebalance, during failover)
   - Provide a sample ShardManager implementation

2. **Add Snapshot Strategy Guidance** (a small sketch follows below):
   - Time-based: Snapshot every hour
   - Count-based: Snapshot every 100 events
   - Adaptive: Snapshot when replay latency exceeds a threshold

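The count-based and adaptive recipes are small enough to show as predicates. These helpers are illustrative only; the thresholds and the wiring to `SaveSnapshot` remain the application's choice.

```go
package aether

import "time"

// shouldSnapshotByCount is the count-based recipe: snapshot once enough events
// have accumulated since the last snapshot. The threshold is the app's choice.
func shouldSnapshotByCount(currentVersion, lastSnapshotVersion, every int64) bool {
	return currentVersion-lastSnapshotVersion >= every
}

// shouldSnapshotAdaptive is the adaptive recipe: snapshot when the last
// measured replay exceeded the latency budget the application set for itself.
func shouldSnapshotAdaptive(lastReplayDuration, budget time.Duration) bool {
	return lastReplayDuration > budget
}
```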
3. **Namespace Isolation Checklist**:
   - Define a naming convention (document it in the README)
   - Add compile-time checks (an optional enum for known namespaces)
   - Test multi-tenant isolation (integration test suite)

4. **Concurrency Context Testing**:
   - Add concurrent-writer tests to the store tests
   - Verify that VersionConflictError details are accurate
   - Benchmark conflict detection performance

### For Docs

1. **Add a Context Diagram**: Show the five contexts as boxes; arrows for relationships

2. **Add a Per-Context Glossary**: Define the ubiquitous language per context (terms table above)

3. **Add Lifecycle Diagrams**: Show event lifetime, node lifetime, subscription lifetime, shard lifetime

4. **Security Section**: Expand the wildcard subscription warnings; document the trust model

---
## Anti-Patterns Avoided

### Pattern: "One Big Event Model"
- **Anti-pattern**: A single Event struct used everywhere with union types
- **What we do**: Event is generic; the domain language lives in EventType strings and the Data map (see the example below)
- **Why**: Primitives approach; the library doesn't impose a domain model

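A hedged example of what "generic event, domain in the data" looks like in practice, reusing the `Event` sketch from the Event Sourcing context. The `OrderPlaced` type string, the fields in `Data`, and the ID scheme are all application-side inventions.

```go
package aether

import (
	"fmt"
	"time"
)

// newOrderPlaced shows the "generic event" approach: the library never sees an
// OrderPlaced type; the domain lives entirely in the Type string and the Data
// map. All names and fields here are application-side inventions.
func newOrderPlaced(orderID string, version int64, totalCents int64) *Event {
	return &Event{
		ID:        fmt.Sprintf("%s-%d", orderID, version),
		Type:      "OrderPlaced",
		ActorID:   orderID,
		Version:   version,
		Timestamp: time.Now(),
		Data: map[string]any{
			"total_cents": totalCents,
			"currency":    "USD",
		},
	}
}
```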
### Pattern: "Shared Mutable State Across Contexts"
- **Anti-pattern**: ClusterManager directly mutates EventStore data structures
- **What we do**: Contexts communicate via events (if they need to) or via explicit queries
- **Why**: Clean boundaries; each context owns its data

### Pattern: "Automatic Retry for Optimistic Locks"
- **Anti-pattern**: The library retries internally on a version conflict
- **What we do**: Return the error to the caller; the caller decides the retry strategy
- **Why**: Primitives approach; retry policy is the app's concern, not the library's

### Pattern: "Opinionated Snapshot Strategy"
- **Anti-pattern**: "Snapshot every 100 events" hardcoded
- **What we do**: SnapshotStore interface; the app decides when to snapshot
- **Why**: Different apps have different replay latency requirements

### Pattern: "Wildcard Subscriptions by Default"
- **Anti-pattern**: All subscriptions use ">" by default (receive everything)
- **What we do**: Explicit namespaces; the wildcard is optional and warned about
- **Why**: Security-first; isolation is the default

---
## Conclusion

Aether's five bounded contexts are **well-aligned** with the problem space and the codebase:

1. **Event Sourcing** - Store events as immutable history; enable replay
2. **Optimistic Concurrency** - Detect conflicts; let the app retry
3. **Namespace Isolation** - Logical boundaries without opinionated multi-tenancy
4. **Cluster Coordination** - Distribute actors, elect a leader, rebalance on failure
5. **Event Bus** - Route events from producers to subscribers

Each context has:
- Clear **language boundaries** (different terms, different meanings)
- Clear **lifecycle boundaries** (different creation/deletion patterns)
- Clear **ownership** (who decides what within each context)
- Clear **scaling boundaries** (why this context must be separate)

The implementation **matches the vision** of "primitives over frameworks": the library provides composition points (interfaces); applications wire them together.

Next step in the product strategy: **Define domain models within each context** (Step 4 of the strategy chain). For now, Aether provides primitives; applications build their domain models on top.