Bounded Context Map: Aether Distributed Actor System

Summary

Aether has five distinct bounded contexts cut by language boundaries, lifecycle differences, ownership patterns, and scaling needs. The contexts emerge from the problem space: single-node event sourcing, distributed clustering, logical isolation, optimistic concurrency control, and event distribution.

Key insight: Each context has its own ubiquitous language (different meanings for similar terms) and its own lifecycle (actors persist forever; leases expire; subscriptions have independent lifetimes). Boundaries are enforced by language/data ownership, not by organizational structure.


Bounded Contexts

Context 1: Event Sourcing

Purpose: Persist events as immutable source of truth; enable state rebuild through replay.

Core Responsibility:

  • Events are facts (immutable, append-only)
  • Versions are monotonically increasing per actor
  • Snapshots are optional optimization hints, not required
  • Replay reconstructs state from history

Language (Ubiquitous Language):

  • Event: Immutable fact about what happened; identified by ID, type, actor, version
  • Version: Monotonically increasing sequence number per actor; used for optimistic locking
  • Snapshot: Point-in-time state capture at a specific version; optional; can always replay
  • ActorID: Identifier for the entity whose events we're storing; unique within namespace
  • Replay: Process of reading events from start version, applying each, to rebuild state
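
To make these terms concrete, here is a minimal sketch of an event and a replay loop. The field names and the apply callback are illustrative assumptions; the actual Event type and replay logic live in /aether/event.go.

```go
// Sketch only: field names and the apply signature are assumptions for
// illustration; the real Event type lives in /aether/event.go.
package main

import "fmt"

type Event struct {
	ID      string
	Type    string
	ActorID string
	Version int64
	Data    map[string]any
}

// Replay rebuilds state by applying every stored event in version order.
// State here is a plain map; a real actor would use its own struct.
func Replay(events []Event, apply func(state map[string]any, e Event)) map[string]any {
	state := map[string]any{}
	for _, e := range events {
		apply(state, e)
	}
	return state
}

func main() {
	events := []Event{
		{ID: "e1", Type: "OrderCreated", ActorID: "order-1", Version: 1, Data: map[string]any{"total": 40}},
		{ID: "e2", Type: "OrderPaid", ActorID: "order-1", Version: 2},
	}
	state := Replay(events, func(s map[string]any, e Event) {
		s["lastType"] = e.Type
		s["version"] = e.Version
	})
	fmt.Println(state) // map[lastType:OrderPaid version:2]
}
```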

Key Entities (Event-Based, not Object-Based):

  • Event (immutable, versioned)
  • ActorSnapshot (optional state cache)
  • EventStore interface (multiple implementations)

Key Events Published:

  • EventStored - Event successfully persisted (triggered when SaveEvent succeeds)
  • VersionConflict - Attempted version <= current; the optimistic lock was lost and the writer must reload and retry
  • SnapshotCreated - State snapshot saved (optional; developers decide when)

Key Events Consumed:

  • None (this context is a source of truth; others consume from it)

Interfaces to Other Contexts:

  • Cluster Coordination: Cluster leader queries latest versions to assign shards
  • Namespace Isolation: Stores can be namespaced; queries filtered by namespace
  • Optimistic Concurrency: Version conflicts trigger retry logic in application
  • Event Bus: Events stored here are published to bus subscribers

Lifecycle:

  • Event creation: Triggered by application business logic (domain events)
  • Event persistence: Synchronous SaveEvent call (writes to store)
  • Event durability: Persists forever (or until retention policy expires in JetStream)
  • Snapshot lifecycle: Optional; created by application decision or rebalancing; can be safely discarded (replay recovers)

Owner: Developer (application layer) owns writing events; Aether library owns storage

Current Code Locations:

  • /aether/event.go - Event struct, VersionConflictError, ReplayError
  • /aether/store/memory.go - InMemoryEventStore implementation
  • /aether/store/jetstream.go - JetStreamEventStore implementation (production)

Scaling Concerns:

  • Single node: Full replay is fast for actors with <100 events; snapshots help beyond 100 events
  • Cluster: Events stored in JetStream (durable across nodes); replay happens on failover
  • Multi-tenant: Events namespaced; separate streams per namespace avoid cross-contamination

Alignment with Vision:

  • Primitives over Frameworks: EventStore is interface; multiple implementations
  • NATS-Native: JetStreamEventStore uses JetStream durability
  • Events as Complete History: Events are source of truth; state is derived

Gaps/Observations:

  • Snapshot strategy is entirely application's responsibility (no built-in triggering)
  • Schema evolution for events not discussed (backward compatibility on deserialization)
  • Corruption recovery (ReplayError handling) is application's responsibility

Boundary Rules:

  • Inside: Event persistence, version validation, replay logic
  • Outside: Domain logic that generates events, retry policy on conflicts, snapshot triggering
  • Cannot cross: No shared models between Event Sourcing and other contexts; translation happens via events

Context 2: Optimistic Concurrency Control

Purpose: Detect and signal concurrent write conflicts; let application choose retry strategy.

Core Responsibility:

  • Protect against lost writes from concurrent writers
  • Detect conflicts early (version mismatch)
  • Provide detailed error context for retry logic
  • Enable at-least-once semantics for idempotent operations

Language (Ubiquitous Language):

  • Version: Sequential number tracking writer's view of current state
  • Conflict: Condition where attempted version <= current version (another writer won)
  • Optimistic Lock: Assumption that conflicts are rare; detect when they happen
  • Retry: Application's response to conflict; reload state and attempt again
  • AttemptedVersion: Version proposed by current writer
  • CurrentVersion: Version that actually won the race

Key Entities:

  • VersionConflictError (detailed error with actor ID, attempted, current versions)
  • OptimisticLock pattern (implicit; not a first-class entity)

Key Events Published:

  • VersionConflict - SaveEvent rejected due to version <= current (developer retries)

Key Events Consumed:

  • None directly; consumes version state from Event Sourcing

Interfaces to Other Contexts:

  • Event Sourcing: Reads latest version; detects conflicts on save
  • Application Logic: Application handles conflict and decides retry strategy

Lifecycle:

  • Conflict detection: Synchronous in SaveEvent (fast check: version > current)
  • Conflict lifecycle: Temporary; conflict happens then application retries with new version
  • Error lifecycle: Returned immediately; application decides next action

Owner: Aether library (detects conflicts); Application (implements retry strategy)

Current Code Locations:

  • /aether/event.go - ErrVersionConflict sentinel, VersionConflictError type
  • /aether/store/jetstream.go - SaveEvent validation (lines checking version)
  • /aether/store/memory.go - SaveEvent validation

Scaling Concerns:

  • High contention: If many writers target same actor, conflicts spike; application must implement backoff
  • Retry storms: Naive retry (tight loop) causes cascade failures; exponential backoff mitigates
  • Metrics: Track conflict rate to detect unexpected contention
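
Because retry policy belongs to the application, a conflict-aware save wrapper with exponential backoff might look like the sketch below. The EventStore interface, Event fields, and VersionConflictError shape are simplified stand-ins based on this document, not the library's exact API.

```go
// Application-side sketch of conflict handling with exponential backoff.
// EventStore, Event, and VersionConflictError are simplified stand-ins for
// the types this document describes; exact signatures are assumptions.
package example

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type Event struct {
	ActorID string
	Version int64
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict on %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

type EventStore interface {
	SaveEvent(ctx context.Context, e Event) error
}

// saveWithRetry retries on conflict with exponential backoff, rebasing the
// event on top of the version that won the race. Anything else is returned
// unchanged; the caller decides whether to give up.
func saveWithRetry(ctx context.Context, store EventStore, e Event, maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := store.SaveEvent(ctx, e)
		if err == nil {
			return nil
		}
		var conflict *VersionConflictError
		if !errors.As(err, &conflict) {
			return err // non-conflict errors are not retried here
		}
		e.Version = conflict.CurrentVersion + 1 // rebase on the winner
		select {
		case <-time.After(backoff):
			backoff *= 2 // avoid retry storms under contention
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d conflicts on %s", maxAttempts, e.ActorID)
}
```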

Alignment with Vision:

  • Primitives over Frameworks: Aether returns error; application decides what to do
  • Does NOT impose retry strategy (that would be a framework opinion)

Gaps/Observations:

  • No built-in retry mechanism (intentional design choice)
  • No conflict metrics in library (application must instrument)
  • No guidance on retry backoff strategies in code (documented in PROBLEM_MAP, not in API)

Boundary Rules:

  • Inside: Detect conflict, validate version > current, return detailed error
  • Outside: Retry logic, backoff strategy, exponential delays, giving up after N attempts
  • Cannot cross: Each context owns its retry behavior; no global retry handler

Context 3: Namespace Isolation

Purpose: Provide logical data boundaries without opinionated multi-tenancy framework.

Core Responsibility:

  • Route events to subscribers matching namespace pattern
  • Isolate event stores by namespace prefix
  • Support hierarchical namespace naming (e.g., "prod.tenant-abc", "staging.orders")
  • Warn about wildcard bypass of isolation (explicit decision)

Language (Ubiquitous Language):

  • Namespace: Logical boundary (tenant, domain, environment, bounded context)
  • Namespace Pattern: NATS-style wildcard matching: "*" (single token), ">" (multi-token)
  • Isolation: Guarantee that events in namespace-A cannot be read from namespace-B (except via wildcard)
  • Wildcard Subscription: Cross-namespace visibility for trusted components (logging, monitoring)
  • Subject: NATS subject for routing (e.g., "aether.events.{namespace}")
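
As a non-authoritative illustration of the wildcard semantics above, a token-based matcher could look like this; the library's own implementation is MatchNamespacePattern in /aether/pattern.go and may differ in detail.

```go
// Illustrative re-implementation of NATS-style token matching for namespaces:
// "*" matches exactly one token, ">" matches one or more trailing tokens.
// The library's own matcher lives in /aether/pattern.go and may differ.
package example

import "strings"

func matchNamespacePattern(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")
	for i, tok := range p {
		switch {
		case tok == ">":
			return i < len(n) // ">" requires at least one remaining token
		case i >= len(n):
			return false // pattern is longer than the namespace
		case tok != "*" && tok != n[i]:
			return false // literal token mismatch
		}
	}
	return len(p) == len(n)
}

// Examples (under the semantics above):
//   matchNamespacePattern("prod.*", "prod.tenant-abc")  == true
//   matchNamespacePattern("prod.>", "prod.orders.emea") == true
//   matchNamespacePattern("prod.*", "staging.orders")   == false
```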

Key Entities:

  • Namespace (just a string; meaning is application's)
  • JetStreamConfig with Namespace field (storage isolation)
  • SubscriptionFilter with namespace pattern (matching)
  • NATSEventBus subject routing

Key Events Published:

  • EventPublished - Event sent to namespace subscribers (via EventBus.Publish)

Key Events Consumed:

  • Events from Event Sourcing, filtered by namespace pattern

Interfaces to Other Contexts:

  • Event Sourcing: Stores can be namespaced (prefix in stream name)
  • Event Bus: Publishes to namespace; subscribers match by pattern
  • Cluster Coordination: Might use namespaced subscriptions to isolate tenant events

Lifecycle:

  • Namespace definition: Application decides; typically per-tenant or per-domain
  • Namespace creation: Implicit when first store/subscription uses it (no explicit schema)
  • Namespace deletion: Not supported; namespaces persist if events exist
  • Stream lifetime: JetStream stream "namespace_events" persists until deleted

Owner: Application layer (defines namespace boundaries); Library (enforces routing)

Current Code Locations:

  • /aether/eventbus.go - EventBus exact vs wildcard subscriber routing
  • /aether/nats_eventbus.go - NATSEventBus subject formatting (line 89: fmt.Sprintf("aether.events.%s", namespacePattern))
  • /aether/store/jetstream.go - JetStreamConfig.Namespace field, stream name sanitization (line 83)
  • /aether/pattern.go - MatchNamespacePattern, IsWildcardPattern functions

Scaling Concerns:

  • Single namespace: All events in one stream; scales with event volume
  • Multi-namespace: Separate streams per namespace; scales horizontally (add namespaces independently)
  • Wildcard subscriptions: Cross-namespace visibility; careful with security (documented warnings)

Alignment with Vision:

  • Primitives over Frameworks: Namespaces are primitives; no opinionated multi-tenancy layer
  • Non-goal: "Opinionated multi-tenancy" - this library provides isolation primitives, not tenant management

Gaps/Observations:

  • Namespace collision: No validation that namespace names are unique (risk: "orders" used by two teams)
  • Wildcard security: Extensively documented in code (SECURITY WARNING appears multiple times); good
  • No namespace registry or allow-list (application must enforce naming conventions)
  • Sanitization of namespace names happens in JetStreamEventStore (spaces → underscores) but not documented

Boundary Rules:

  • Inside: Namespace pattern matching, subject routing, stream prefixing
  • Outside: Defining namespace semantics (tenant, domain, environment), enforcing conventions
  • Cannot cross: Events in namespace-A published to namespace-A only (except wildcard subscribers)

Context 4: Cluster Coordination

Purpose: Distribute actors across cluster nodes; elect leader; rebalance on topology changes.

Core Responsibility:

  • Discover nodes in cluster (NATS-based, no external coordinator)
  • Elect one leader using lease-based coordination
  • Distribute shards across nodes via consistent hash ring
  • Detect node failures and trigger rebalancing
  • Provide shard assignment for actor placement

Language (Ubiquitous Language):

  • Node: Physical or logical computer in cluster; has ID, address, capacity, status
  • Leader: Single node responsible for coordination and rebalancing decisions
  • Term: Monotonically increasing leadership election round (prevents split-brain)
  • Shard: Virtual partition (1024 by default); actors hash to shards; shards assigned to nodes
  • Consistent Hash Ring: Algorithm mapping shards to nodes such that node failures cause minimal rebalancing
  • Rebalancing: Reassignment of shards when topology changes (node join/fail)
  • ShardMap: Current state of which shards live on which nodes
  • Heartbeat: Periodic signal from leader renewing its lease (proves still alive)
  • Lease: Time window during which leader's authority is valid (TTL-based, not quorum)
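
To illustrate the shard vocabulary above, the sketch below hashes an actor ID onto one of 1024 virtual shards and looks up its owner in a simplified shard-to-node map. This is not the library's ConsistentHashRing; it only shows the idea of actors hashing to shards and shards mapping to nodes.

```go
// Illustrative only: an actor ID is hashed onto one of shardCount virtual
// shards, and a shard map tells us which node owns it. The real mapping is
// the ConsistentHashRing in /aether/cluster/hashring.go; the map shape here
// is a simplification of the documented ShardMap.
package example

import "hash/fnv"

const shardCount = 1024 // default documented above

func shardFor(actorID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return h.Sum32() % shardCount
}

// ownerOf resolves the node currently responsible for an actor.
func ownerOf(actorID string, shardToNode map[uint32]string) (nodeID string, ok bool) {
	nodeID, ok = shardToNode[shardFor(actorID)]
	return nodeID, ok
}
```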

Key Entities:

  • NodeInfo (cluster node details: ID, address, capacity, status)
  • ShardMap (shard → nodes mapping; versioned)
  • LeadershipLease (leader ID, term, expiration)
  • ActorMigration (migration record for actor during rebalancing)

Key Events Published:

  • NodeJoined - New node added to cluster
  • NodeFailed - Node stopped responding (detected by heartbeat timeout)
  • LeaderElected - Leader selected (term incremented)
  • LeadershipLost - Leader lease expired (old leader can no longer coordinate)
  • ShardAssigned - Leader assigns shard to nodes
  • ShardMigrated - Shard moved from one node to another (during rebalancing)

Key Events Consumed:

  • Node topology changes (new nodes, failures) → trigger rebalancing
  • Leader election results → shard assignments

Interfaces to Other Contexts:

  • Namespace Isolation: Could use namespaced subscriptions for cluster-internal events
  • Event Sourcing: Cluster queries latest version to assign shards; failures trigger replay on new node
  • Event Bus: Cluster messages published to event bus; subscribers on each node act on them

Lifecycle:

  • Cluster formation: Nodes join; first leader elected
  • Leadership duration: Until lease expires (~10 seconds in config)
  • Shard assignment: Decided by leader; persists in ShardMap
  • Node failure: Detected after heartbeat timeout (~90 seconds implied by lease config)
  • Rebalancing: Triggered by topology change; completes when ShardMap versioned and distributed

Owner: ClusterManager (coordination); LeaderElection (election); ShardManager (placement)

Current Code Locations:

  • /aether/cluster/types.go - NodeInfo, ShardMap, LeadershipLease, ActorMigration types
  • /aether/cluster/manager.go - ClusterManager, node discovery, rebalancing loop
  • /aether/cluster/leader.go - LeaderElection (lease-based using NATS KV)
  • /aether/cluster/hashring.go - ConsistentHashRing (shard → node mapping)
  • /aether/cluster/shard.go - ShardManager (actor placement, shard assignment)

Scaling Concerns:

  • Leader election latency: 10s lease, 3s heartbeat → ~13s to detect failure (tunable)
  • Rebalancing overhead: Consistent hash minimizes movements (only affects shards from failed node)
  • Shard count: 1024 default; tune based on cluster size and actor count
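
As a hedged sketch of the lease mechanics under these timings, a leader's heartbeat loop could be structured as follows; renewLease is an assumed callback standing in for a TTL'd NATS KV update, not the library's API.

```go
// Minimal sketch of lease-based leadership renewal under the documented
// timings (10s lease, 3s heartbeat). renewLease is an assumed callback that
// would extend a TTL'd entry (e.g. in NATS KV); names are illustrative.
package example

import (
	"context"
	"time"
)

func runLeaderHeartbeat(ctx context.Context, term uint64, renewLease func(term uint64) error) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := renewLease(term); err != nil {
				// Renewal failed: treat the lease as lost. A new election
				// (with a higher term) may already be underway elsewhere,
				// so stop acting as leader immediately.
				return
			}
		}
	}
}
```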

Alignment with Vision:

  • NATS-Native: Leader election uses NATS KV store (lease-based); cluster discovery via NATS
  • Primitives over Frameworks: ShardManager and LeaderElection are composable; can swap algorithms

Gaps/Observations:

  • Rebalancing is triggered but algorithm not fully shown in code excerpt ("would rebalance across N nodes")
  • Actor migration during rebalancing: ShardManager has PlacementStrategy interface but sample migration handler not shown
  • Split-brain prevention: Lease-based (no concurrent leaders) but old leader could execute stale rebalancing
  • No explicit actor state migration during shard rebalancing (where does actor state go during move?)

Boundary Rules:

  • Inside: Node discovery, leader election, shard assignment, rebalancing decisions
  • Outside: Actor state migration (that's Event Sourcing's replay), actual actor message delivery
  • Cannot cross: Cluster decisions are made once per cluster (not per namespace or actor)

Context 5: Event Bus (Pub/Sub Distribution)

Purpose: Route events from producers to subscribers; support filtering and cross-node propagation.

Core Responsibility:

  • Local event distribution (in-process subscriptions)
  • Cross-node event distribution via NATS
  • Filter events by type and actor pattern
  • Support exact and wildcard namespace patterns
  • Non-blocking delivery (drop event if channel full, don't block publisher)

Language (Ubiquitous Language):

  • Publish: Send an event to a namespace; the call is synchronous and never blocks on subscribers (events may be dropped if a subscriber is slow)
  • Subscribe: Register interest in namespace pattern (returns channel)
  • Filter: Criteria for event delivery (EventTypes list, ActorPattern wildcard)
  • Wildcard Pattern: "*" (single token), ">" (multi-token) matching
  • Subject: NATS subject for routing (e.g., "aether.events.{namespace}")
  • Subscriber: Entity receiving events from channel (has local reference to channel)
  • Deliver: Attempt to send event to subscriber's channel; non-blocking (may drop)

Key Entities:

  • EventBroadcaster interface (local or NATS-backed)
  • EventBus (in-memory, local subscriptions only)
  • NATSEventBus (extends EventBus; adds NATS forwarding)
  • SubscriptionFilter (event types + actor pattern)
  • filteredSubscription (internal; tracks channel, pattern, filter)

Key Events Published:

  • EventPublished - Event sent via EventBus.Publish (may be delivered to subscribers)

Key Events Consumed:

  • Events from Event Sourcing context

Interfaces to Other Contexts:

  • Event Sourcing: Reads events to publish; triggered after SaveEvent
  • Namespace Isolation: Uses namespace pattern for routing
  • Cluster Coordination: Cluster messages flow through event bus

Lifecycle:

  • Subscription creation: Caller invokes Subscribe/SubscribeWithFilter; gets channel
  • Subscription duration: Lifetime of channel (caller controls)
  • Subscription cleanup: Unsubscribe closes channel
  • Event delivery: Synchronous Publish → deliver to all matching subscribers
  • Dropped events: Non-blocking delivery; full channel = dropped event (metrics recorded)
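
The non-blocking delivery rule above can be sketched as a select with a default branch; Event and the drop counter here are stand-ins, and the authoritative logic lives in /aether/eventbus.go.

```go
// Sketch of the documented non-blocking delivery rule: try to send on the
// subscriber's buffered channel and drop the event (counting a metric) if
// the channel is full. Event and the counter are stand-ins.
package example

import "sync/atomic"

type Event struct{ ID string }

func deliver(ch chan<- Event, e Event, dropped *atomic.Int64) {
	select {
	case ch <- e:
		// delivered; the subscriber drains its channel at its own pace
	default:
		dropped.Add(1) // channel full: publisher never blocks, event is dropped
	}
}
```

With the documented 100-element buffer, a subscriber only starts losing events once it falls 100 events behind the publisher.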

Owner: Library (EventBus implementation); Callers (subscribe/unsubscribe)

Current Code Locations:

  • /aether/eventbus.go - EventBus (local in-process pub/sub)
  • /aether/nats_eventbus.go - NATSEventBus (NATS-backed cross-node)
  • /aether/pattern.go - MatchNamespacePattern, SubscriptionFilter matching logic
  • Metrics tracking in both implementations

Scaling Concerns:

  • Local bus: In-memory channels; scales with subscriber count (no network overhead)
  • NATS bus: One NATS subscription per pattern; scales with unique patterns
  • Channel buffering: 100-element buffer (configurable); full = dropped events
  • Metrics: Track published, delivered, dropped per namespace

Alignment with Vision:

  • Primitives over Frameworks: EventBroadcaster is interface; swappable implementations
  • NATS-Native: NATSEventBus uses NATS subjects for routing

Gaps/Observations:

  • Dropped events are silent (metrics recorded but no callback); might surprise subscribers
  • Filter matching is string-based (no compile-time safety for event types)
  • Two-level filtering: Namespace at NATS level, EventTypes/ActorPattern at application level
  • NATSEventBus creates subscription per unique pattern (could be optimized with pattern hierarchy)

Boundary Rules:

  • Inside: Event routing, filter matching, non-blocking delivery
  • Outside: Semantics of events (that's Event Sourcing); decisions on what to do when event received
  • Cannot cross: Subscribers are responsible for their channels; publisher doesn't know who consumes

Context Relationships

Event Sourcing ↔ Event Bus

Type: Producer/Consumer (one-to-many)

Direction: Event Sourcing produces events; Event Bus distributes them

Integration:

  • Application saves event to store (SaveEvent)
  • Application publishes same event to bus (Publish)
  • Subscribers receive event from bus channel
  • The stored and published events are the same object (the Event struct)

Decoupling:

  • Store and bus are independent (application coordinates)
  • Bus subscribers don't know about storage
  • Replay doesn't trigger bus publish (events already stored)

Safety:

  • No shared transaction (save and publish are separate)
  • Risk: Event saved but publish fails (or vice versa) → bus has stale view
  • Mitigation: Application's responsibility to ensure consistency
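
A minimal application-side sketch of this coordination, with the store and bus reduced to the two calls this document names (exact signatures are assumptions):

```go
// Application-side sketch of the save-then-publish coordination described
// above. The interfaces are simplified stand-ins based on this document.
package example

import (
	"context"
	"fmt"
)

type Event struct {
	ID      string
	ActorID string
	Version int64
}

type EventStore interface {
	SaveEvent(ctx context.Context, e Event) error
}

type EventPublisher interface {
	Publish(namespace string, e Event) error
}

func saveAndPublish(ctx context.Context, store EventStore, bus EventPublisher, namespace string, e Event) error {
	if err := store.SaveEvent(ctx, e); err != nil {
		return err // nothing was published, so the bus holds no stale view
	}
	if err := bus.Publish(namespace, e); err != nil {
		// The event is durable but subscribers missed it. There is no shared
		// transaction, so reconciliation (re-publish, catch-up read from the
		// store) is the application's responsibility, as noted above.
		return fmt.Errorf("event %s saved but not published: %w", e.ID, err)
	}
	return nil
}
```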

Event Sourcing → Optimistic Concurrency Control

Type: Dependency (nested)

Direction: SaveEvent validates version using Optimistic Concurrency

Integration:

  • SaveEvent calls GetLatestVersion (read current)
  • Checks event.Version > currentVersion (optimistic lock)
  • Returns VersionConflictError if not

Decoupling:

  • Optimistic Concurrency is not a separate context; it's logic within Event Sourcing
  • Version validation is inline in SaveEvent, not a separate call

Note: Initially these seem like separate contexts (different language, different lifecycle). But Version is Event Sourcing's concern; Conflict is just an error condition (not a separate state machine). Optimistic locking is a pattern, not a context.
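
The inline check described under Integration can be sketched as follows; the error's field names follow the fields this document lists (actor ID, attempted version, current version) but are otherwise assumptions.

```go
// Sketch of the inline version check; the real validation happens inside
// SaveEvent in the store implementations, and this error type is a stand-in.
package example

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string { return "version conflict" }

func validateVersion(actorID string, attempted, current int64) error {
	if attempted <= current {
		// Another writer won the race; the caller must reload and retry.
		return &VersionConflictError{
			ActorID:          actorID,
			AttemptedVersion: attempted,
			CurrentVersion:   current,
		}
	}
	return nil
}
```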


Event Sourcing → Namespace Isolation

Type: Containment (namespaces contain event streams)

Direction: Namespace Isolation scopes Event Sourcing

Integration:

  • JetStreamEventStore accepts Namespace in config
  • Actual stream name becomes "{namespace}_{streamName}"
  • GetEvents, GetLatestVersion, SaveEvent are namespace-scoped
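
Putting the documented conventions together (stream "{namespace}_{streamName}" with space sanitization, subject "aether.events.{namespace}"), illustrative naming helpers might look like this; the authoritative formatting lives in /aether/store/jetstream.go and /aether/nats_eventbus.go.

```go
// Illustrative naming helpers following the conventions this document cites;
// not the library's own functions.
package example

import (
	"fmt"
	"strings"
)

func streamName(namespace, base string) string {
	sanitized := strings.ReplaceAll(namespace, " ", "_") // documented sanitization
	return fmt.Sprintf("%s_%s", sanitized, base)
}

func eventSubject(namespace string) string {
	return fmt.Sprintf("aether.events.%s", namespace)
}
```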

Decoupling:

  • Each namespace has independent version sequences
  • No cross-namespace reads in Event Sourcing context
  • EventBus.Publish specifies namespace

Safety:

  • Complete isolation at storage level (different JetStream streams)
  • Events from namespace-A cannot appear in namespace-B queries
  • Wildcard subscriptions bypass this (documented risk)

Cluster Coordination → Event Sourcing

Type: Consumer (reads version state)

Direction: Cluster queries Event Sourcing for actor state

Integration:

  • ClusterManager might query GetLatestVersion to determine if shard can migrate
  • Nodes track which actors (shards) are assigned locally
  • On failover, new node replays events from store to rebuild state

Decoupling:

  • Cluster doesn't manage event storage (Event Sourcing owns that)
  • Cluster doesn't decide when to snapshot
  • Cluster doesn't know about versions (Event Sourcing concept)

Cluster Coordination → Namespace Isolation

Type: Orthogonal (can combine, but not required)

Direction: Cluster can use namespaced subscriptions; not required

Integration:

  • Cluster could publish node-join events to namespaced topics (e.g., "cluster.{tenant}")
  • Different tenants can have independent clusters (each with own cluster messages)

Decoupling:

  • Cluster doesn't care about namespace semantics
  • Namespace doesn't enforce cluster topology

Event Bus → (All contexts)

Type: Cross-cutting concern

Direction: Event Bus distributes events from all contexts

Integration:

  • Event Sourcing publishes to bus after SaveEvent
  • Cluster Coordination publishes shard assignments to bus
  • Namespace Isolation is a parameter to Publish/Subscribe
  • Subscribers receive events and can filter by type/actor

Decoupling:

  • Bus delivery is fire-and-forget (events may be lost if there are no subscribers)
  • Subscribers don't block publishers
  • No ordering guarantee across namespaces

Boundary Rules Summary

By Language

| Language | Context | Meaning |
|---|---|---|
| Event | Event Sourcing | Immutable fact; identified by ID, type, actor, version |
| Version | Event Sourcing | Monotonically increasing sequence per actor; also used for optimistic locking |
| Snapshot | Event Sourcing | Optional state cache at a specific version; always disposable |
| Node | Cluster Coordination | Physical computer in cluster; has ID, address, capacity |
| Leader | Cluster Coordination | Single node elected for coordination (not per-namespace, not per-actor) |
| Shard | Cluster Coordination | Virtual partition for actor placement; 1024 by default |
| Namespace | Namespace Isolation | Logical boundary (tenant, domain, context); application-defined meaning |
| Wildcard | Event Bus & Namespace Isolation | "*" (single token) and ">" (multi-token) NATS pattern matching |
| Subject | Event Bus | NATS subject for message routing |
| Conflict | Optimistic Concurrency | Condition where a write failed due to a stale version |
| Retry | Optimistic Concurrency | Application's decision to reload and try again |
| Subscribe | Event Bus | Register interest in a namespace pattern; returns a channel |
| Publish | Event Bus | Send event to namespace subscribers; non-blocking |

By Lifecycle

| Entity | Created | Destroyed | Owner | Context |
|---|---|---|---|---|
| Event | SaveEvent | Never (persists forever) | Application writes, Aether stores | Event Sourcing |
| Version | Per event | With event | Automatic (monotonic) | Event Sourcing |
| Snapshot | Application decision | Application decision | Application | Event Sourcing |
| Node | Joins cluster | Explicit leave | Infrastructure | Cluster Coordination |
| Leader | Election completes | Lease expires | Automatic (election) | Cluster Coordination |
| Shard | Created with cluster | With cluster | ClusterManager | Cluster Coordination |
| Namespace | First use | Never (persists) | Application | Namespace Isolation |
| Subscription | Subscribe() call | Unsubscribe() call | Caller | Event Bus |
| Channel | Subscribe() returns | Unsubscribe() closes | Caller | Event Bus |

By Ownership

| Context | Who Decides | What They Decide |
|---|---|---|
| Event Sourcing | Application (developer) | When to save events, event schema, snapshot strategy |
| Optimistic Concurrency | Application | Retry strategy, backoff, giving up |
| Namespace Isolation | Application | Namespace semantics (tenant, domain, env), naming convention |
| Cluster Coordination | ClusterManager & LeaderElection | Node discovery, leader election, shard assignment |
| Event Bus | Application | What to subscribe to, filtering criteria |

By Scaling Boundary

| Context | Scales By | Limits | Tuning |
|---|---|---|---|
| Event Sourcing | Event volume per actor | Replay latency grows with version count | Snapshots help |
| Cluster Coordination | Node count | Leader election latency, rebalancing overhead | Lease TTL, heartbeat interval |
| Namespace Isolation | Namespace count | Stream count, NATS resource usage | Separate JetStream streams |
| Event Bus | Subscriber count | Channel buffering (100 elements) | Queue depth, metrics |

Code vs. Intended: Alignment Analysis

Intended → Actual: Good Alignment

Context: Event Sourcing

  • Intended: EventStore interface with multiple implementations
  • Actual: InMemoryEventStore (testing) and JetStreamEventStore (production) both exist
  • ✓ Good: Matches vision of "primitives over frameworks"

Context: Optimistic Concurrency

  • Intended: Detect conflicts, return error, let app retry
  • Actual: SaveEvent returns VersionConflictError; no built-in retry
  • ✓ Good: Aligns with vision of primitives (app owns retry logic)

Context: Namespace Isolation

  • Intended: Logical boundaries without opinionated multi-tenancy
  • Actual: JetStreamConfig.Namespace, EventBus namespace patterns
  • ✓ Good: Primitives provided; semantics left to app

Context: Cluster Coordination

  • Intended: Node discovery, leader election, shard assignment
  • Actual: ClusterManager, LeaderElection, ConsistentHashRing all present
  • ✓ Good: Primitives implemented

Context: Event Bus

  • Intended: Local and cross-node pub/sub with filtering
  • Actual: EventBus (local) and NATSEventBus (NATS) both present
  • ✓ Good: Extensible via interface

Intended → Actual: Gaps

Context: Cluster Coordination

  • Intended: Actor migration during shard rebalancing
  • Actual: ShardManager has PlacementStrategy; ActorMigration type defined
  • Gap: Migration handler logic is not shown; where does actor state go during a rebalance?
  • Impact: Cluster context is foundational but incomplete; application must implement actor handoff

Context: Event Sourcing

  • Intended: Snapshot strategy guidance
  • Actual: SnapshotStore interface; SaveSnapshot exists; no built-in strategy
  • Gap: No adaptive snapshotting, no time-based snapshotting
  • Impact: App must choose snapshot frequency (documented in PROBLEM_MAP, not enforced)

Context: Namespace Isolation

  • Intended: Warn about wildcard security risks
  • Actual: SECURITY WARNING in docstrings (excellent)
  • Gap: No namespace registry or allow-list to prevent collisions
  • Impact: Risk of two teams using same namespace (e.g., "orders") unintentionally

Context: Optimistic Concurrency

  • Intended: Guide app on retry strategy
  • Actual: Returns VersionConflictError with details
  • Gap: No retry helper, no backoff library
  • Impact: Each app implements own retry (fine; primitives approach)

Refactoring Backlog (if brownfield)

No Major Refactoring Required

The code structure already aligns well with intended bounded contexts:

  • Event Sourcing lives in /event.go and /store/
  • Cluster lives in /cluster/
  • Event Bus lives in /eventbus.go and /nats_eventbus.go
  • Pattern matching lives in /pattern.go

Minor Improvements

Issue 1: Document Actor Migration During Rebalancing

  • Current: ShardManager.AssignShard exists; ActorMigration type defined
  • Gap: No example code showing how actor state moves between nodes
  • Suggestion: Add sample migration handler in cluster package

Issue 2: Add Namespace Validation/Registry

  • Current: Namespace is just a string; no collision detection
  • Gap: Risk of two teams using same namespace
  • Suggestion: Document naming convention (e.g., "env.team.context"); optionally add schema/enum
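
One hedged sketch of how an application could enforce such a convention before using a namespace (the regular expression and the "env.team.context" rule are illustrative, not library behavior):

```go
// One way an application could enforce the suggested "env.team.context"
// naming convention before using a namespace; illustrative only.
package example

import (
	"fmt"
	"regexp"
)

var namespaceRe = regexp.MustCompile(`^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$`)

func validateNamespace(ns string) error {
	if !namespaceRe.MatchString(ns) {
		return fmt.Errorf("namespace %q does not match env.team.context", ns)
	}
	return nil
}
```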

Issue 3: Snapshot Strategy Recipes

  • Current: SnapshotStore interface; app responsible for strategy
  • Gap: Documentation could provide sample strategies (time-based, count-based, adaptive)
  • Suggestion: Add /examples/snapshot_strategies.go with reference implementations
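
A possible shape for such reference strategies is a small policy predicate; the type, field names, and thresholds below are illustrative only.

```go
// Illustrative snapshot policy combining the count-based and adaptive
// strategies mentioned above; not part of the library.
package example

import "time"

type SnapshotPolicy struct {
	EveryNEvents  int64         // count-based: snapshot every N events
	MaxReplayTime time.Duration // adaptive: snapshot when replay gets too slow
}

func (p SnapshotPolicy) ShouldSnapshot(eventsSinceSnapshot int64, lastReplay time.Duration) bool {
	if p.EveryNEvents > 0 && eventsSinceSnapshot >= p.EveryNEvents {
		return true
	}
	if p.MaxReplayTime > 0 && lastReplay > p.MaxReplayTime {
		return true
	}
	return false
}
```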

Issue 4: Metrics for Concurrency Context

  • Current: Version conflict detection exists; no metrics
  • Gap: Apps can't easily observe conflict rate
  • Suggestion: Add conflict metrics to EventStore (or provide hooks)

Recommendations

For Product Strategy

  1. Confirm Bounded Contexts: Review this map with team. Are these five contexts the right cut? Missing any? Too many?

  2. Define Invariants per Context:

    • Event Sourcing: "Version must be strictly monotonic per actor" ✓ (enforced)
    • Cluster Coordination: "Only one leader can have valid lease at a time" ✓ (lease-based)
    • Namespace Isolation: "Events in namespace-A cannot be queried from namespace-B context" ✓ (separate streams)
    • Optimistic Concurrency: "Conflict detection is synchronous; resolution is async" ✓ (error returned immediately)
    • Event Bus: "Delivery is non-blocking; events may be dropped if subscriber slow" ✓ (metrics track this)
  3. Map Capabilities to Contexts:

    • "Store events durably" → Event Sourcing context
    • "Detect concurrent writes" → Optimistic Concurrency context
    • "Isolate logical domains" → Namespace Isolation context
    • "Distribute actors across nodes" → Cluster Coordination context
    • "Route events to subscribers" → Event Bus context
  4. Test Boundaries:

    • Single-node: Event Sourcing + Optimistic Concurrency + Event Bus (no Cluster)
    • Multi-node: Add Cluster Coordination (but cluster decisions don't affect other contexts)
    • Multi-tenant: Add Namespace Isolation (orthogonal to other contexts)

For Architecture

  1. Complete Cluster Context Documentation:

    • Show actor migration lifecycle during shard rebalancing
    • Document when state moves (during rebalance, during failover)
    • Provide sample ShardManager implementation
  2. Add Snapshot Strategy Guidance:

    • Time-based: Snapshot every hour
    • Count-based: Snapshot every 100 events
    • Adaptive: Snapshot when replay latency exceeds threshold
  3. Namespace Isolation Checklist:

    • Define naming convention (document in README)
    • Add compile-time checks (optional enum for known namespaces)
    • Test multi-tenant isolation (integration test suite)
  4. Concurrency Context Testing:

    • Add concurrent writer tests to store tests
    • Verify VersionConflictError details are accurate
    • Benchmark conflict detection performance

For Docs

  1. Add Context Diagram: Show five contexts as boxes; arrows for relationships

  2. Add Per-Context Glossary: Define ubiquitous language per context (terms table above)

  3. Add Lifecycle Diagrams: Show event lifetime, node lifetime, subscription lifetime, shard lifetime

  4. Security Section: Expand wildcard subscription warnings; document trust model


Anti-Patterns Avoided

Pattern: "One Big Event Model"

  • Anti-pattern: Single Event struct used everywhere with union types
  • What we do: Event is generic; domain language lives in EventType strings and Data map
  • Why: Primitives approach; library doesn't impose domain model

Pattern: "Shared Mutable State Across Contexts"

  • Anti-pattern: ClusterManager directly mutates EventStore data structures
  • What we do: Contexts communicate via events (if they need to) or via explicit queries
  • Why: Clean boundaries; each context owns its data

Pattern: "Automatic Retry for Optimistic Locks"

  • Anti-pattern: Library retries internally on version conflict
  • What we do: Return error to caller; caller decides retry strategy
  • Why: Primitives approach; retry policy is app's concern, not library's

Pattern: "Opinionated Snapshot Strategy"

  • Anti-pattern: "Snapshot every 100 events" hardcoded
  • What we do: SnapshotStore interface; app decides when to snapshot
  • Why: Different apps have different replay latency requirements

Pattern: "Wildcard Subscriptions by Default"

  • Anti-pattern: All subscriptions use ">" by default (receive everything)
  • What we do: Explicit namespaces; wildcard is optional and warned about
  • Why: Security-first; isolation is default

Conclusion

Aether's five bounded contexts are well-aligned with the problem space and the codebase:

  1. Event Sourcing - Store events as immutable history; enable replay
  2. Optimistic Concurrency - Detect conflicts; let app retry
  3. Namespace Isolation - Logical boundaries without opinionated multi-tenancy
  4. Cluster Coordination - Distribute actors, elect leader, rebalance on failure
  5. Event Bus - Route events from producers to subscribers

Each context has:

  • Clear language boundaries (different terms, different meanings)
  • Clear lifecycle boundaries (different creation/deletion patterns)
  • Clear ownership (who decides what within each context)
  • Clear scaling boundaries (why this context must be separate)

The implementation matches the vision of "primitives over frameworks": the library provides composition points (interfaces), and applications wire them together.

Next step in product strategy: Define domain models within each context (Step 4 of strategy chain). For now, Aether provides primitives; applications build their domain models on top.