Bounded Context Map: Aether Distributed Actor System

Summary

Aether has five distinct bounded contexts cut by language boundaries, lifecycle differences, ownership patterns, and scaling needs. The contexts emerge from the problem space: single-node event sourcing, distributed clustering, logical isolation, optimistic concurrency control, and event distribution.

Key insight: Each context has its own ubiquitous language (different meanings for similar terms) and its own lifecycle (actors persist forever; leases expire; subscriptions have independent lifetimes). Boundaries are enforced by language/data ownership, not by organizational structure.


Bounded Contexts

Context 1: Event Sourcing

Purpose: Persist events as immutable source of truth; enable state rebuild through replay.

Core Responsibility:

  • Events are facts (immutable, append-only)
  • Versions are monotonically increasing per actor
  • Snapshots are optional optimization hints, not required
  • Replay reconstructs state from history

Language (Ubiquitous Language):

  • Event: Immutable fact about what happened; identified by ID, type, actor, version
  • Version: Monotonically increasing sequence number per actor; used for optimistic locking
  • Snapshot: Point-in-time state capture at a specific version; optional; can always replay
  • ActorID: Identifier for the entity whose events we're storing; unique within namespace
  • Replay: Process of reading events from start version, applying each, to rebuild state
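
To make these terms concrete, here is a minimal sketch of an event and a replay loop. The field names and the apply callback are illustrative assumptions; the actual Event type and replay logic live in /aether/event.go.

```go
// Sketch only: field names and the apply signature are assumptions for
// illustration; the real Event type lives in /aether/event.go.
package main

import "fmt"

type Event struct {
	ID      string
	Type    string
	ActorID string
	Version int64
	Data    map[string]any
}

// Replay rebuilds state by applying every stored event in version order.
// State here is a plain map; a real actor would use its own struct.
func Replay(events []Event, apply func(state map[string]any, e Event)) map[string]any {
	state := map[string]any{}
	for _, e := range events {
		apply(state, e)
	}
	return state
}

func main() {
	events := []Event{
		{ID: "e1", Type: "OrderCreated", ActorID: "order-1", Version: 1, Data: map[string]any{"total": 40}},
		{ID: "e2", Type: "OrderPaid", ActorID: "order-1", Version: 2},
	}
	state := Replay(events, func(s map[string]any, e Event) {
		s["lastType"] = e.Type
		s["version"] = e.Version
	})
	fmt.Println(state) // map[lastType:OrderPaid version:2]
}
```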

Key Entities (Event-Based, not Object-Based):

  • Event (immutable, versioned)
  • ActorSnapshot (optional state cache)
  • EventStore interface (multiple implementations)

Key Events Published:

  • EventStored - Event successfully persisted (triggered when SaveEvent succeeds)
  • VersionConflict - Attempted version <= current; the optimistic lock was lost and the writer must reload and retry
  • SnapshotCreated - State snapshot saved (optional; developers decide when)

Key Events Consumed:

  • None (this context is a source of truth; others consume from it)

Interfaces to Other Contexts:

  • Cluster Coordination: Cluster leader queries latest versions to assign shards
  • Namespace Isolation: Stores can be namespaced; queries filtered by namespace
  • Optimistic Concurrency: Version conflicts trigger retry logic in application
  • Event Bus: Events stored here are published to bus subscribers

Lifecycle:

  • Event creation: Triggered by application business logic (domain events)
  • Event persistence: Synchronous SaveEvent call (writes to store)
  • Event durability: Persists forever (or until retention policy expires in JetStream)
  • Snapshot lifecycle: Optional; created by application decision or rebalancing; can be safely discarded (replay recovers)

Owner: Developer (application layer) owns writing events; Aether library owns storage

Current Code Locations:

  • /aether/event.go - Event struct, VersionConflictError, ReplayError
  • /aether/store/memory.go - InMemoryEventStore implementation
  • /aether/store/jetstream.go - JetStreamEventStore implementation (production)

Scaling Concerns:

  • Single node: Full replay is fast for actors with <100 events; snapshots help beyond 100 events
  • Cluster: Events stored in JetStream (durable across nodes); replay happens on failover
  • Multi-tenant: Events namespaced; separate streams per namespace avoid cross-contamination

Alignment with Vision:

  • Primitives over Frameworks: EventStore is interface; multiple implementations
  • NATS-Native: JetStreamEventStore uses JetStream durability
  • Events as Complete History: Events are source of truth; state is derived

Gaps/Observations:

  • Snapshot strategy is entirely application's responsibility (no built-in triggering)
  • Schema evolution for events not discussed (backward compatibility on deserialization)
  • Corruption recovery (ReplayError handling) is application's responsibility

Boundary Rules:

  • Inside: Event persistence, version validation, replay logic
  • Outside: Domain logic that generates events, retry policy on conflicts, snapshot triggering
  • Cannot cross: No shared models between Event Sourcing and other contexts; translation happens via events

Context 2: Optimistic Concurrency Control

Purpose: Detect and signal concurrent write conflicts; let application choose retry strategy.

Core Responsibility:

  • Protect against lost writes from concurrent writers
  • Detect conflicts early (version mismatch)
  • Provide detailed error context for retry logic
  • Enable at-least-once semantics for idempotent operations

Language (Ubiquitous Language):

  • Version: Sequential number tracking writer's view of current state
  • Conflict: Condition where attempted version <= current version (another writer won)
  • Optimistic Lock: Assumption that conflicts are rare; detect when they happen
  • Retry: Application's response to conflict; reload state and attempt again
  • AttemptedVersion: Version proposed by current writer
  • CurrentVersion: Version that actually won the race

Key Entities:

  • VersionConflictError (detailed error with actor ID, attempted, current versions)
  • OptimisticLock pattern (implicit; not a first-class entity)

Key Events Published:

  • VersionConflict - SaveEvent rejected due to version <= current (developer retries)

Key Events Consumed:

  • None directly; consumes version state from Event Sourcing

Interfaces to Other Contexts:

  • Event Sourcing: Reads latest version; detects conflicts on save
  • Application Logic: Application handles conflict and decides retry strategy

Lifecycle:

  • Conflict detection: Synchronous in SaveEvent (fast check: version > current)
  • Conflict lifecycle: Temporary; conflict happens then application retries with new version
  • Error lifecycle: Returned immediately; application decides next action

Owner: Aether library (detects conflicts); Application (implements retry strategy)

Current Code Locations:

  • /aether/event.go - ErrVersionConflict sentinel, VersionConflictError type
  • /aether/store/jetstream.go - SaveEvent validation (lines checking version)
  • /aether/store/memory.go - SaveEvent validation

Scaling Concerns:

  • High contention: If many writers target same actor, conflicts spike; application must implement backoff
  • Retry storms: Naive retry (tight loop) causes cascade failures; exponential backoff mitigates
  • Metrics: Track conflict rate to detect unexpected contention
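
Because retry policy belongs to the application, a conflict-aware save wrapper with exponential backoff might look like the sketch below. The EventStore interface, Event fields, and VersionConflictError shape are simplified stand-ins based on this document, not the library's exact API.

```go
// Application-side sketch of conflict handling with exponential backoff.
// EventStore, Event, and VersionConflictError are simplified stand-ins for
// the types this document describes; exact signatures are assumptions.
package example

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type Event struct {
	ActorID string
	Version int64
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict on %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

type EventStore interface {
	SaveEvent(ctx context.Context, e Event) error
}

// saveWithRetry retries on conflict with exponential backoff, rebasing the
// event on top of the version that won the race. Anything else is returned
// unchanged; the caller decides whether to give up.
func saveWithRetry(ctx context.Context, store EventStore, e Event, maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := store.SaveEvent(ctx, e)
		if err == nil {
			return nil
		}
		var conflict *VersionConflictError
		if !errors.As(err, &conflict) {
			return err // non-conflict errors are not retried here
		}
		e.Version = conflict.CurrentVersion + 1 // rebase on the winner
		select {
		case <-time.After(backoff):
			backoff *= 2 // avoid retry storms under contention
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d conflicts on %s", maxAttempts, e.ActorID)
}
```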

Alignment with Vision:

  • Primitives over Frameworks: Aether returns error; application decides what to do
  • Does NOT impose retry strategy (that would be a framework opinion)

Gaps/Observations:

  • No built-in retry mechanism (intentional design choice)
  • No conflict metrics in library (application must instrument)
  • No guidance on retry backoff strategies in code (documented in PROBLEM_MAP, not in API)

Boundary Rules:

  • Inside: Detect conflict, validate version > current, return detailed error
  • Outside: Retry logic, backoff strategy, exponential delays, giving up after N attempts
  • Cannot cross: Each context owns its retry behavior; no global retry handler

Context 3: Namespace Isolation

Purpose: Provide logical data boundaries without opinionated multi-tenancy framework.

Core Responsibility:

  • Route events to subscribers matching namespace pattern
  • Isolate event stores by namespace prefix
  • Support hierarchical namespace naming (e.g., "prod.tenant-abc", "staging.orders")
  • Warn about wildcard bypass of isolation (explicit decision)

Language (Ubiquitous Language):

  • Namespace: Logical boundary (tenant, domain, environment, bounded context)
  • Namespace Pattern: NATS-style wildcard matching: "*" (single token), ">" (multi-token)
  • Isolation: Guarantee that events in namespace-A cannot be read from namespace-B (except via wildcard)
  • Wildcard Subscription: Cross-namespace visibility for trusted components (logging, monitoring)
  • Subject: NATS subject for routing (e.g., "aether.events.{namespace}")
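
As a non-authoritative illustration of the wildcard semantics above, a token-based matcher could look like this; the library's own implementation is MatchNamespacePattern in /aether/pattern.go and may differ in detail.

```go
// Illustrative re-implementation of NATS-style token matching for namespaces:
// "*" matches exactly one token, ">" matches one or more trailing tokens.
// The library's own matcher lives in /aether/pattern.go and may differ.
package example

import "strings"

func matchNamespacePattern(pattern, namespace string) bool {
	p := strings.Split(pattern, ".")
	n := strings.Split(namespace, ".")
	for i, tok := range p {
		switch {
		case tok == ">":
			return i < len(n) // ">" requires at least one remaining token
		case i >= len(n):
			return false // pattern is longer than the namespace
		case tok != "*" && tok != n[i]:
			return false // literal token mismatch
		}
	}
	return len(p) == len(n)
}

// Examples (under the semantics above):
//   matchNamespacePattern("prod.*", "prod.tenant-abc")  == true
//   matchNamespacePattern("prod.>", "prod.orders.emea") == true
//   matchNamespacePattern("prod.*", "staging.orders")   == false
```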

Key Entities:

  • Namespace (just a string; meaning is application's)
  • JetStreamConfig with Namespace field (storage isolation)
  • SubscriptionFilter with namespace pattern (matching)
  • NATSEventBus subject routing

Key Events Published:

  • EventPublished - Event sent to namespace subscribers (via EventBus.Publish)

Key Events Consumed:

  • Events from Event Sourcing, filtered by namespace pattern

Interfaces to Other Contexts:

  • Event Sourcing: Stores can be namespaced (prefix in stream name)
  • Event Bus: Publishes to namespace; subscribers match by pattern
  • Cluster Coordination: Might use namespaced subscriptions to isolate tenant events

Lifecycle:

  • Namespace definition: Application decides; typically per-tenant or per-domain
  • Namespace creation: Implicit when first store/subscription uses it (no explicit schema)
  • Namespace deletion: Not supported; namespaces persist if events exist
  • Stream lifetime: JetStream stream "namespace_events" persists until deleted

Owner: Application layer (defines namespace boundaries); Library (enforces routing)

Current Code Locations:

  • /aether/eventbus.go - EventBus exact vs wildcard subscriber routing
  • /aether/nats_eventbus.go - NATSEventBus subject formatting (line 89: fmt.Sprintf("aether.events.%s", namespacePattern))
  • /aether/store/jetstream.go - JetStreamConfig.Namespace field, stream name sanitization (line 83)
  • /aether/pattern.go - MatchNamespacePattern, IsWildcardPattern functions

Scaling Concerns:

  • Single namespace: All events in one stream; scales with event volume
  • Multi-namespace: Separate streams per namespace; scales horizontally (add namespaces independently)
  • Wildcard subscriptions: Cross-namespace visibility; careful with security (documented warnings)

Alignment with Vision:

  • Primitives over Frameworks: Namespaces are primitives; no opinionated multi-tenancy layer
  • Non-goal: "Opinionated multi-tenancy" - this library provides isolation primitives, not tenant management

Gaps/Observations:

  • Namespace collision: No validation that namespace names are unique (risk: "orders" used by two teams)
  • Wildcard security: Extensively documented in code (SECURITY WARNING appears multiple times); good
  • No namespace registry or allow-list (application must enforce naming conventions)
  • Sanitization of namespace names happens in JetStreamEventStore (spaces → underscores) but not documented

Boundary Rules:

  • Inside: Namespace pattern matching, subject routing, stream prefixing
  • Outside: Defining namespace semantics (tenant, domain, environment), enforcing conventions
  • Cannot cross: Events in namespace-A published to namespace-A only (except wildcard subscribers)

Context 4: Cluster Coordination

Purpose: Distribute actors across cluster nodes; elect leader; rebalance on topology changes.

Core Responsibility:

  • Discover nodes in cluster (NATS-based, no external coordinator)
  • Elect one leader using lease-based coordination
  • Distribute shards across nodes via consistent hash ring
  • Detect node failures and trigger rebalancing
  • Provide shard assignment for actor placement

Language (Ubiquitous Language):

  • Node: Physical or logical computer in cluster; has ID, address, capacity, status
  • Leader: Single node responsible for coordination and rebalancing decisions
  • Term: Monotonically increasing leadership election round (prevents split-brain)
  • Shard: Virtual partition (1024 by default); actors hash to shards; shards assigned to nodes
  • Consistent Hash Ring: Algorithm mapping shards to nodes such that node failures cause minimal rebalancing
  • Rebalancing: Reassignment of shards when topology changes (node join/fail)
  • ShardMap: Current state of which shards live on which nodes
  • Heartbeat: Periodic signal from leader renewing its lease (proves still alive)
  • Lease: Time window during which leader's authority is valid (TTL-based, not quorum)
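
To illustrate the shard vocabulary above, the sketch below hashes an actor ID onto one of 1024 virtual shards and looks up its owner in a simplified shard-to-node map. This is not the library's ConsistentHashRing; it only shows the idea of actors hashing to shards and shards mapping to nodes.

```go
// Illustrative only: an actor ID is hashed onto one of shardCount virtual
// shards, and a shard map tells us which node owns it. The real mapping is
// the ConsistentHashRing in /aether/cluster/hashring.go; the map shape here
// is a simplification of the documented ShardMap.
package example

import "hash/fnv"

const shardCount = 1024 // default documented above

func shardFor(actorID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return h.Sum32() % shardCount
}

// ownerOf resolves the node currently responsible for an actor.
func ownerOf(actorID string, shardToNode map[uint32]string) (nodeID string, ok bool) {
	nodeID, ok = shardToNode[shardFor(actorID)]
	return nodeID, ok
}
```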

Key Entities:

  • NodeInfo (cluster node details: ID, address, capacity, status)
  • ShardMap (shard → nodes mapping; versioned)
  • LeadershipLease (leader ID, term, expiration)
  • ActorMigration (migration record for actor during rebalancing)

Key Events Published:

  • NodeJoined - New node added to cluster
  • NodeFailed - Node stopped responding (detected by heartbeat timeout)
  • LeaderElected - Leader selected (term incremented)
  • LeadershipLost - Leader lease expired (old leader can no longer coordinate)
  • ShardAssigned - Leader assigns shard to nodes
  • ShardMigrated - Shard moved from one node to another (during rebalancing)

Key Events Consumed:

  • Node topology changes (new nodes, failures) → trigger rebalancing
  • Leader election results → shard assignments

Interfaces to Other Contexts:

  • Namespace Isolation: Could use namespaced subscriptions for cluster-internal events
  • Event Sourcing: Cluster queries latest version to assign shards; failures trigger replay on new node
  • Event Bus: Cluster messages published to event bus; subscribers on each node act on them

Lifecycle:

  • Cluster formation: Nodes join; first leader elected
  • Leadership duration: Until lease expires (~10 seconds in config)
  • Shard assignment: Decided by leader; persists in ShardMap
  • Node failure: Detected after heartbeat timeout (~90 seconds implied by lease config)
  • Rebalancing: Triggered by topology change; completes when ShardMap versioned and distributed

Owner: ClusterManager (coordination); LeaderElection (election); ShardManager (placement)

Current Code Locations:

  • /aether/cluster/types.go - NodeInfo, ShardMap, LeadershipLease, ActorMigration types
  • /aether/cluster/manager.go - ClusterManager, node discovery, rebalancing loop
  • /aether/cluster/leader.go - LeaderElection (lease-based using NATS KV)
  • /aether/cluster/hashring.go - ConsistentHashRing (shard → node mapping)
  • /aether/cluster/shard.go - ShardManager (actor placement, shard assignment)

Scaling Concerns:

  • Leader election latency: 10s lease, 3s heartbeat → ~13s to detect failure (tunable)
  • Rebalancing overhead: Consistent hash minimizes movements (only affects shards from failed node)
  • Shard count: 1024 default; tune based on cluster size and actor count
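
As a hedged sketch of the lease mechanics under these timings, a leader's heartbeat loop could be structured as follows; renewLease is an assumed callback standing in for a TTL'd NATS KV update, not the library's API.

```go
// Minimal sketch of lease-based leadership renewal under the documented
// timings (10s lease, 3s heartbeat). renewLease is an assumed callback that
// would extend a TTL'd entry (e.g. in NATS KV); names are illustrative.
package example

import (
	"context"
	"time"
)

func runLeaderHeartbeat(ctx context.Context, term uint64, renewLease func(term uint64) error) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := renewLease(term); err != nil {
				// Renewal failed: treat the lease as lost. A new election
				// (with a higher term) may already be underway elsewhere,
				// so stop acting as leader immediately.
				return
			}
		}
	}
}
```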

Alignment with Vision:

  • NATS-Native: Leader election uses NATS KV store (lease-based); cluster discovery via NATS
  • Primitives over Frameworks: ShardManager and LeaderElection are composable; can swap algorithms

Gaps/Observations:

  • Rebalancing is triggered but algorithm not fully shown in code excerpt ("would rebalance across N nodes")
  • Actor migration during rebalancing: ShardManager has PlacementStrategy interface but sample migration handler not shown
  • Split-brain prevention: Lease-based (no concurrent leaders) but old leader could execute stale rebalancing
  • No explicit actor state migration during shard rebalancing (where does actor state go during move?)

Boundary Rules:

  • Inside: Node discovery, leader election, shard assignment, rebalancing decisions
  • Outside: Actor state migration (that's Event Sourcing's replay), actual actor message delivery
  • Cannot cross: Cluster decisions are made once per cluster (not per namespace or actor)

Context 5: Event Bus (Pub/Sub Distribution)

Purpose: Route events from producers to subscribers; support filtering and cross-node propagation.

Core Responsibility:

  • Local event distribution (in-process subscriptions)
  • Cross-node event distribution via NATS
  • Filter events by type and actor pattern
  • Support exact and wildcard namespace patterns
  • Non-blocking delivery (drop event if channel full, don't block publisher)

Language (Ubiquitous Language):

  • Publish: Send an event to a namespace; the call is synchronous and never blocks on subscribers (events may be dropped if a subscriber is slow)
  • Subscribe: Register interest in namespace pattern (returns channel)
  • Filter: Criteria for event delivery (EventTypes list, ActorPattern wildcard)
  • Wildcard Pattern: "*" (single token), ">" (multi-token) matching
  • Subject: NATS subject for routing (e.g., "aether.events.{namespace}")
  • Subscriber: Entity receiving events from channel (has local reference to channel)
  • Deliver: Attempt to send event to subscriber's channel; non-blocking (may drop)

Key Entities:

  • EventBroadcaster interface (local or NATS-backed)
  • EventBus (in-memory, local subscriptions only)
  • NATSEventBus (extends EventBus; adds NATS forwarding)
  • SubscriptionFilter (event types + actor pattern)
  • filteredSubscription (internal; tracks channel, pattern, filter)

Key Events Published:

  • EventPublished - Event sent via EventBus.Publish (may be delivered to subscribers)

Key Events Consumed:

  • Events from Event Sourcing context

Interfaces to Other Contexts:

  • Event Sourcing: Reads events to publish; triggered after SaveEvent
  • Namespace Isolation: Uses namespace pattern for routing
  • Cluster Coordination: Cluster messages flow through event bus

Lifecycle:

  • Subscription creation: Caller invokes Subscribe/SubscribeWithFilter; gets channel
  • Subscription duration: Lifetime of channel (caller controls)
  • Subscription cleanup: Unsubscribe closes channel
  • Event delivery: Synchronous Publish → deliver to all matching subscribers
  • Dropped events: Non-blocking delivery; full channel = dropped event (metrics recorded)
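
The non-blocking delivery rule above can be sketched as a select with a default branch; Event and the drop counter here are stand-ins, and the authoritative logic lives in /aether/eventbus.go.

```go
// Sketch of the documented non-blocking delivery rule: try to send on the
// subscriber's buffered channel and drop the event (counting a metric) if
// the channel is full. Event and the counter are stand-ins.
package example

import "sync/atomic"

type Event struct{ ID string }

func deliver(ch chan<- Event, e Event, dropped *atomic.Int64) {
	select {
	case ch <- e:
		// delivered; the subscriber drains its channel at its own pace
	default:
		dropped.Add(1) // channel full: publisher never blocks, event is dropped
	}
}
```

With the documented 100-element buffer, a subscriber only starts losing events once it falls 100 events behind the publisher.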

Owner: Library (EventBus implementation); Callers (subscribe/unsubscribe)

Current Code Locations:

  • /aether/eventbus.go - EventBus (local in-process pub/sub)
  • /aether/nats_eventbus.go - NATSEventBus (NATS-backed cross-node)
  • /aether/pattern.go - MatchNamespacePattern, SubscriptionFilter matching logic
  • Metrics tracking in both implementations

Scaling Concerns:

  • Local bus: In-memory channels; scales with subscriber count (no network overhead)
  • NATS bus: One NATS subscription per pattern; scales with unique patterns
  • Channel buffering: 100-element buffer (configurable); full = dropped events
  • Metrics: Track published, delivered, dropped per namespace

Alignment with Vision:

  • Primitives over Frameworks: EventBroadcaster is interface; swappable implementations
  • NATS-Native: NATSEventBus uses NATS subjects for routing

Gaps/Observations:

  • Dropped events are silent (metrics recorded but no callback); might surprise subscribers
  • Filter matching is string-based (no compile-time safety for event types)
  • Two-level filtering: Namespace at NATS level, EventTypes/ActorPattern at application level
  • NATSEventBus creates subscription per unique pattern (could be optimized with pattern hierarchy)

Boundary Rules:

  • Inside: Event routing, filter matching, non-blocking delivery
  • Outside: Semantics of events (that's Event Sourcing); decisions on what to do when event received
  • Cannot cross: Subscribers are responsible for their channels; publisher doesn't know who consumes

Context Relationships

Event Sourcing ↔ Event Bus

Type: Producer/Consumer (one-to-many)

Direction: Event Sourcing produces events; Event Bus distributes them

Integration:

  • Application saves event to store (SaveEvent)
  • Application publishes same event to bus (Publish)
  • Subscribers receive event from bus channel
  • The stored and published events are the same object (the Event struct)

Decoupling:

  • Store and bus are independent (application coordinates)
  • Bus subscribers don't know about storage
  • Replay doesn't trigger bus publish (events already stored)

Safety:

  • No shared transaction (save and publish are separate)
  • Risk: Event saved but publish fails (or vice versa) → bus has stale view
  • Mitigation: Application's responsibility to ensure consistency
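
A minimal application-side sketch of this coordination, with the store and bus reduced to the two calls this document names (exact signatures are assumptions):

```go
// Application-side sketch of the save-then-publish coordination described
// above. The interfaces are simplified stand-ins based on this document.
package example

import (
	"context"
	"fmt"
)

type Event struct {
	ID      string
	ActorID string
	Version int64
}

type EventStore interface {
	SaveEvent(ctx context.Context, e Event) error
}

type EventPublisher interface {
	Publish(namespace string, e Event) error
}

func saveAndPublish(ctx context.Context, store EventStore, bus EventPublisher, namespace string, e Event) error {
	if err := store.SaveEvent(ctx, e); err != nil {
		return err // nothing was published, so the bus holds no stale view
	}
	if err := bus.Publish(namespace, e); err != nil {
		// The event is durable but subscribers missed it. There is no shared
		// transaction, so reconciliation (re-publish, catch-up read from the
		// store) is the application's responsibility, as noted above.
		return fmt.Errorf("event %s saved but not published: %w", e.ID, err)
	}
	return nil
}
```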

Event Sourcing → Optimistic Concurrency Control

Type: Dependency (nested)

Direction: SaveEvent validates version using Optimistic Concurrency

Integration:

  • SaveEvent calls GetLatestVersion (read current)
  • Checks event.Version > currentVersion (optimistic lock)
  • Returns VersionConflictError if not

Decoupling:

  • Optimistic Concurrency is not a separate context; it's logic within Event Sourcing
  • Version validation is inline in SaveEvent, not a separate call

Note: Initially these seem like separate contexts (different language, different lifecycle). But Version is Event Sourcing's concern; Conflict is just an error condition (not a separate state machine). Optimistic locking is a pattern, not a context.
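
The inline check described under Integration can be sketched as follows; the error's field names follow the fields this document lists (actor ID, attempted version, current version) but are otherwise assumptions.

```go
// Sketch of the inline version check; the real validation happens inside
// SaveEvent in the store implementations, and this error type is a stand-in.
package example

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string { return "version conflict" }

func validateVersion(actorID string, attempted, current int64) error {
	if attempted <= current {
		// Another writer won the race; the caller must reload and retry.
		return &VersionConflictError{
			ActorID:          actorID,
			AttemptedVersion: attempted,
			CurrentVersion:   current,
		}
	}
	return nil
}
```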


Event Sourcing → Namespace Isolation

Type: Containment (namespaces contain event streams)

Direction: Namespace Isolation scopes Event Sourcing

Integration:

  • JetStreamEventStore accepts Namespace in config
  • Actual stream name becomes "{namespace}_{streamName}"
  • GetEvents, GetLatestVersion, SaveEvent are namespace-scoped
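
Putting the documented conventions together (stream "{namespace}_{streamName}" with space sanitization, subject "aether.events.{namespace}"), illustrative naming helpers might look like this; the authoritative formatting lives in /aether/store/jetstream.go and /aether/nats_eventbus.go.

```go
// Illustrative naming helpers following the conventions this document cites;
// not the library's own functions.
package example

import (
	"fmt"
	"strings"
)

func streamName(namespace, base string) string {
	sanitized := strings.ReplaceAll(namespace, " ", "_") // documented sanitization
	return fmt.Sprintf("%s_%s", sanitized, base)
}

func eventSubject(namespace string) string {
	return fmt.Sprintf("aether.events.%s", namespace)
}
```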

Decoupling:

  • Each namespace has independent version sequences
  • No cross-namespace reads in Event Sourcing context
  • EventBus.Publish specifies namespace

Safety:

  • Complete isolation at storage level (different JetStream streams)
  • Events from namespace-A cannot appear in namespace-B queries
  • Wildcard subscriptions bypass this (documented risk)

Cluster Coordination → Event Sourcing

Type: Consumer (reads version state)

Direction: Cluster queries Event Sourcing for actor state

Integration:

  • ClusterManager might query GetLatestVersion to determine if shard can migrate
  • Nodes track which actors (shards) are assigned locally
  • On failover, new node replays events from store to rebuild state

Decoupling:

  • Cluster doesn't manage event storage (Event Sourcing owns that)
  • Cluster doesn't decide when to snapshot
  • Cluster doesn't know about versions (Event Sourcing concept)

Cluster Coordination → Namespace Isolation

Type: Orthogonal (can combine, but not required)

Direction: Cluster can use namespaced subscriptions; not required

Integration:

  • Cluster could publish node-join events to namespaced topics (e.g., "cluster.{tenant}")
  • Different tenants can have independent clusters (each with own cluster messages)

Decoupling:

  • Cluster doesn't care about namespace semantics
  • Namespace doesn't enforce cluster topology

Event Bus → (All contexts)

Type: Cross-cutting concern

Direction: Event Bus distributes events from all contexts

Integration:

  • Event Sourcing publishes to bus after SaveEvent
  • Cluster Coordination publishes shard assignments to bus
  • Namespace Isolation is a parameter to Publish/Subscribe
  • Subscribers receive events and can filter by type/actor

Decoupling:

  • Bus delivery is fire-and-forget (events may be lost if there are no subscribers)
  • Subscribers don't block publishers
  • No ordering guarantee across namespaces

Boundary Rules Summary

By Language

| Language | Context | Meaning |
|---|---|---|
| Event | Event Sourcing | Immutable fact; identified by ID, type, actor, version |
| Version | Event Sourcing | Monotonically increasing sequence per actor; also used for optimistic locking |
| Snapshot | Event Sourcing | Optional state cache at a specific version; always disposable |
| Node | Cluster Coordination | Physical computer in cluster; has ID, address, capacity |
| Leader | Cluster Coordination | Single node elected for coordination (not per-namespace, not per-actor) |
| Shard | Cluster Coordination | Virtual partition for actor placement; 1024 by default |
| Namespace | Namespace Isolation | Logical boundary (tenant, domain, context); application-defined meaning |
| Wildcard | Event Bus & Namespace Isolation | "*" (single token) and ">" (multi-token) NATS pattern matching |
| Subject | Event Bus | NATS subject for message routing |
| Conflict | Optimistic Concurrency | Condition where a write failed due to a stale version |
| Retry | Optimistic Concurrency | Application's decision to reload and try again |
| Subscribe | Event Bus | Register interest in a namespace pattern; returns a channel |
| Publish | Event Bus | Send event to namespace subscribers; non-blocking |

By Lifecycle

| Entity | Created | Destroyed | Owner | Context |
|---|---|---|---|---|
| Event | SaveEvent | Never (persists forever) | Application writes, Aether stores | Event Sourcing |
| Version | Per event | With event | Automatic (monotonic) | Event Sourcing |
| Snapshot | Application decision | Application decision | Application | Event Sourcing |
| Node | Joins cluster | Explicit leave | Infrastructure | Cluster Coordination |
| Leader | Election completes | Lease expires | Automatic (election) | Cluster Coordination |
| Shard | Created with cluster | With cluster | ClusterManager | Cluster Coordination |
| Namespace | First use | Never (persists) | Application | Namespace Isolation |
| Subscription | Subscribe() call | Unsubscribe() call | Caller | Event Bus |
| Channel | Subscribe() returns | Unsubscribe() closes | Caller | Event Bus |

By Ownership

| Context | Who Decides | What They Decide |
|---|---|---|
| Event Sourcing | Application (developer) | When to save events, event schema, snapshot strategy |
| Optimistic Concurrency | Application | Retry strategy, backoff, giving up |
| Namespace Isolation | Application | Namespace semantics (tenant, domain, env), naming convention |
| Cluster Coordination | ClusterManager & LeaderElection | Node discovery, leader election, shard assignment |
| Event Bus | Application | What to subscribe to, filtering criteria |

By Scaling Boundary

| Context | Scales By | Limits | Tuning |
|---|---|---|---|
| Event Sourcing | Event volume per actor | Replay latency grows with version count | Snapshots help |
| Cluster Coordination | Node count | Leader election latency, rebalancing overhead | Lease TTL, heartbeat interval |
| Namespace Isolation | Namespace count | Stream count, NATS resource usage | Separate JetStream streams |
| Event Bus | Subscriber count | Channel buffering (100 elements) | Queue depth, metrics |

Code vs. Intended: Alignment Analysis

Intended → Actual: Good Alignment

Context: Event Sourcing

  • Intended: EventStore interface with multiple implementations
  • Actual: InMemoryEventStore (testing) and JetStreamEventStore (production) both exist
  • ✓ Good: Matches vision of "primitives over frameworks"

Context: Optimistic Concurrency

  • Intended: Detect conflicts, return error, let app retry
  • Actual: SaveEvent returns VersionConflictError; no built-in retry
  • ✓ Good: Aligns with vision of primitives (app owns retry logic)

Context: Namespace Isolation

  • Intended: Logical boundaries without opinionated multi-tenancy
  • Actual: JetStreamConfig.Namespace, EventBus namespace patterns
  • ✓ Good: Primitives provided; semantics left to app

Context: Cluster Coordination

  • Intended: Node discovery, leader election, shard assignment
  • Actual: ClusterManager, LeaderElection, ConsistentHashRing all present
  • ✓ Good: Primitives implemented

Context: Event Bus

  • Intended: Local and cross-node pub/sub with filtering
  • Actual: EventBus (local) and NATSEventBus (NATS) both present
  • ✓ Good: Extensible via interface

Intended → Actual: Gaps

Context: Cluster Coordination

  • Intended: Actor migration during shard rebalancing
  • Actual: ShardManager has PlacementStrategy; ActorMigration type defined
  • Gap: Migration handler logic is not shown; where does actor state go during a rebalance?
  • Impact: Cluster context is foundational but incomplete; application must implement actor handoff

Context: Event Sourcing

  • Intended: Snapshot strategy guidance
  • Actual: SnapshotStore interface; SaveSnapshot exists; no built-in strategy
  • Gap: No adaptive snapshotting, no time-based snapshotting
  • Impact: App must choose snapshot frequency (documented in PROBLEM_MAP, not enforced)

Context: Namespace Isolation

  • Intended: Warn about wildcard security risks
  • Actual: SECURITY WARNING in docstrings (excellent)
  • Gap: No namespace registry or allow-list to prevent collisions
  • Impact: Risk of two teams using same namespace (e.g., "orders") unintentionally

Context: Optimistic Concurrency

  • Intended: Guide app on retry strategy
  • Actual: Returns VersionConflictError with details
  • Gap: No retry helper, no backoff library
  • Impact: Each app implements own retry (fine; primitives approach)

Refactoring Backlog (if brownfield)

No Major Refactoring Required

The code structure already aligns well with intended bounded contexts:

  • Event Sourcing lives in /event.go and /store/
  • Cluster lives in /cluster/
  • Event Bus lives in /eventbus.go and /nats_eventbus.go
  • Pattern matching lives in /pattern.go

Minor Improvements

Issue 1: Document Actor Migration During Rebalancing

  • Current: ShardManager.AssignShard exists; ActorMigration type defined
  • Gap: No example code showing how actor state moves between nodes
  • Suggestion: Add sample migration handler in cluster package

Issue 2: Add Namespace Validation/Registry

  • Current: Namespace is just a string; no collision detection
  • Gap: Risk of two teams using same namespace
  • Suggestion: Document naming convention (e.g., "env.team.context"); optionally add schema/enum
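
One hedged sketch of how an application could enforce such a convention before using a namespace (the regular expression and the "env.team.context" rule are illustrative, not library behavior):

```go
// One way an application could enforce the suggested "env.team.context"
// naming convention before using a namespace; illustrative only.
package example

import (
	"fmt"
	"regexp"
)

var namespaceRe = regexp.MustCompile(`^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$`)

func validateNamespace(ns string) error {
	if !namespaceRe.MatchString(ns) {
		return fmt.Errorf("namespace %q does not match env.team.context", ns)
	}
	return nil
}
```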

Issue 3: Snapshot Strategy Recipes

  • Current: SnapshotStore interface; app responsible for strategy
  • Gap: Documentation could provide sample strategies (time-based, count-based, adaptive)
  • Suggestion: Add /examples/snapshot_strategies.go with reference implementations
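
A possible shape for such reference strategies is a small policy predicate; the type, field names, and thresholds below are illustrative only.

```go
// Illustrative snapshot policy combining the count-based and adaptive
// strategies mentioned above; not part of the library.
package example

import "time"

type SnapshotPolicy struct {
	EveryNEvents  int64         // count-based: snapshot every N events
	MaxReplayTime time.Duration // adaptive: snapshot when replay gets too slow
}

func (p SnapshotPolicy) ShouldSnapshot(eventsSinceSnapshot int64, lastReplay time.Duration) bool {
	if p.EveryNEvents > 0 && eventsSinceSnapshot >= p.EveryNEvents {
		return true
	}
	if p.MaxReplayTime > 0 && lastReplay > p.MaxReplayTime {
		return true
	}
	return false
}
```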

Issue 4: Metrics for Concurrency Context

  • Current: Version conflict detection exists; no metrics
  • Gap: Apps can't easily observe conflict rate
  • Suggestion: Add conflict metrics to EventStore (or provide hooks)

Recommendations

For Product Strategy

  1. Confirm Bounded Contexts: Review this map with team. Are these five contexts the right cut? Missing any? Too many?

  2. Define Invariants per Context:

    • Event Sourcing: "Version must be strictly monotonic per actor" ✓ (enforced)
    • Cluster Coordination: "Only one leader can have valid lease at a time" ✓ (lease-based)
    • Namespace Isolation: "Events in namespace-A cannot be queried from namespace-B context" ✓ (separate streams)
    • Optimistic Concurrency: "Conflict detection is synchronous; resolution is async" ✓ (error returned immediately)
    • Event Bus: "Delivery is non-blocking; events may be dropped if subscriber slow" ✓ (metrics track this)
  3. Map Capabilities to Contexts:

    • "Store events durably" → Event Sourcing context
    • "Detect concurrent writes" → Optimistic Concurrency context
    • "Isolate logical domains" → Namespace Isolation context
    • "Distribute actors across nodes" → Cluster Coordination context
    • "Route events to subscribers" → Event Bus context
  4. Test Boundaries:

    • Single-node: Event Sourcing + Optimistic Concurrency + Event Bus (no Cluster)
    • Multi-node: Add Cluster Coordination (but cluster decisions don't affect other contexts)
    • Multi-tenant: Add Namespace Isolation (orthogonal to other contexts)

For Architecture

  1. Complete Cluster Context Documentation:

    • Show actor migration lifecycle during shard rebalancing
    • Document when state moves (during rebalance, during failover)
    • Provide sample ShardManager implementation
  2. Add Snapshot Strategy Guidance:

    • Time-based: Snapshot every hour
    • Count-based: Snapshot every 100 events
    • Adaptive: Snapshot when replay latency exceeds threshold
  3. Namespace Isolation Checklist:

    • Define naming convention (document in README)
    • Add compile-time checks (optional enum for known namespaces)
    • Test multi-tenant isolation (integration test suite)
  4. Concurrency Context Testing:

    • Add concurrent writer tests to store tests
    • Verify VersionConflictError details are accurate
    • Benchmark conflict detection performance

For Docs

  1. Add Context Diagram: Show five contexts as boxes; arrows for relationships

  2. Add Per-Context Glossary: Define ubiquitous language per context (terms table above)

  3. Add Lifecycle Diagrams: Show event lifetime, node lifetime, subscription lifetime, shard lifetime

  4. Security Section: Expand wildcard subscription warnings; document trust model


Anti-Patterns Avoided

Pattern: "One Big Event Model"

  • Anti-pattern: Single Event struct used everywhere with union types
  • What we do: Event is generic; domain language lives in EventType strings and Data map
  • Why: Primitives approach; library doesn't impose domain model

Pattern: "Shared Mutable State Across Contexts"

  • Anti-pattern: ClusterManager directly mutates EventStore data structures
  • What we do: Contexts communicate via events (if they need to) or via explicit queries
  • Why: Clean boundaries; each context owns its data

Pattern: "Automatic Retry for Optimistic Locks"

  • Anti-pattern: Library retries internally on version conflict
  • What we do: Return error to caller; caller decides retry strategy
  • Why: Primitives approach; retry policy is app's concern, not library's

Pattern: "Opinionated Snapshot Strategy"

  • Anti-pattern: "Snapshot every 100 events" hardcoded
  • What we do: SnapshotStore interface; app decides when to snapshot
  • Why: Different apps have different replay latency requirements

Pattern: "Wildcard Subscriptions by Default"

  • Anti-pattern: All subscriptions use ">" by default (receive everything)
  • What we do: Explicit namespaces; wildcard is optional and warned about
  • Why: Security-first; isolation is default

Conclusion

Aether's five bounded contexts are well-aligned with the problem space and the codebase:

  1. Event Sourcing - Store events as immutable history; enable replay
  2. Optimistic Concurrency - Detect conflicts; let app retry
  3. Namespace Isolation - Logical boundaries without opinionated multi-tenancy
  4. Cluster Coordination - Distribute actors, elect leader, rebalance on failure
  5. Event Bus - Route events from producers to subscribers

Each context has:

  • Clear language boundaries (different terms, different meanings)
  • Clear lifecycle boundaries (different creation/deletion patterns)
  • Clear ownership (who decides what within each context)
  • Clear scaling boundaries (why this context must be separate)

The implementation matches the vision of "primitives over frameworks": the library provides composition points (interfaces), and applications wire them together.

Next step in product strategy: Define domain models within each context (Step 4 of strategy chain). For now, Aether provides primitives; applications build their domain models on top.