Problem Map: Aether Distributed Actor System
Summary
Aether solves the problem of building distributed, event-sourced systems in Go without heavyweight frameworks or reinventing infrastructure. The core tension is providing composable primitives (Event, EventStore, clustering) that work together seamlessly while maintaining organizational values: auditability, business language in code, independent evolution, and explicit intent.
The problem space is defined by four distinct developer journeys: single-node development (testing/iteration), scaling to distributed clusters, isolating multi-tenant/multi-context data, and managing concurrent writes through optimistic locking.
Developer User Journeys
Journey 1: Single-Node Event-Sourced System (Testing & Iteration)
Job to be done: "Quickly build and test event-sourced domain logic without distributed complexity"
Steps:
1. Developer starts new bounded context
- Outcome: Empty event store configured
- Pain: Must choose between in-memory (loses data) and production store (overkill for iteration)
- Design: InMemoryEventStore provides fast iteration; no schema migration burden
2. Developer writes first event class
- Outcome: Event type defined with domain language (e.g., "OrderPlaced", not "order_v1")
- Pain: Event types are strings, easy to typo; no compile-time safety
- Design: Event struct accepts EventType as string; metadata provides correlation/causation tracking
3. Developer emits and replays events
- Outcome: State rebuilt from event history
- Pain: Replay can be slow if events accumulate; need to know when snapshots help
- Design: SnapshotStore interface separates snapshot logic from event storage
4. Developer runs integration test
- Outcome: Test validates domain behavior without NATS
- Pain: InMemoryEventStore is fast but tests don't catch distributed issues
- Design: EventStore interface allows swapping implementations; tests use memory
Events in this journey:
- EventStoreInitialized - Developer created store (InMemory selected)
- EventClassDefined - Domain event type created (OrderPlaced)
- EventStored - Event persisted to store
- ReplayStarted - Developer replays events to rebuild state
- SnapshotConsidered - Developer evaluates snapshot vs full replay cost
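A minimal sketch of this journey, assuming an InMemoryEventStore constructor and illustrative Event field names (the actual API may differ):

// Sketch only: constructor and field names are assumptions, not the confirmed API.
store := NewInMemoryEventStore()

evt := Event{
    EventType: "OrderPlaced", // domain language, not "order_v1"
    ActorID:   "order-123",
    Version:   1,
    Data:      map[string]interface{}{"total": 4999},
}
if err := store.SaveEvent(evt); err != nil {
    log.Fatal(err)
}

// Rebuild state by replaying the actor's history.
events, _ := store.GetEvents("order-123", 0)
for _, e := range events {
    applyToState(e) // hypothetical projection function
}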
Journey 2: Scaling from Single Node to Distributed Cluster
Job to be done: "Move proven domain logic to production without rewriting; handle shard assignment and leader coordination"
Steps:
1. Developer switches EventStore to JetStream
- Outcome: Events now persisted in NATS cluster; available across nodes
- Pain: Chose JetStream (production) but now depends on NATS uptime
- Design: JetStreamEventStore implements same EventStore interface; namespace isolation available
- Event: EventStoreUpgraded - Switched from memory to JetStream
2. Developer connects nodes to cluster
- Outcome: Nodes discover each other via NATS
- Pain: Must bootstrap cluster; leader election hasn't started yet
- Design: ClusterManager handles node discovery and topology
- Event: NodeJoined - New node joined cluster with address/capacity/metadata
3. Developer enables leader election
- Outcome: One node elected leader; can coordinate shard assignments
- Pain: If leader crashes, new election takes time; old leader might cause split-brain
- Design: LeaderElection uses NATS KV store with lease-based coordination (TTL + renewal)
- Event: LeaderElected - Leader chosen; term incremented
4. Developer assigns shards to nodes
- Outcome: Consistent hash ring distributes shards across nodes
- Pain: Initial shard assignment is manual; rebalancing after node failure is complex
- Design: ConsistentHashRing handles placement; ShardManager routes actors to shards
- Event: ShardsAssigned - Shards allocated to nodes via consistent hash
5. Developer tests failover scenario
- Outcome: Node crashes; system continues; other nodes take over shards
- Pain: How do migrated actors recover state? Where is state during migration?
- Design: Events are in JetStream (durable); snapshots help fast recovery
- Event: NodeFailed - Node marked failed; shards need reassignment
Events in this journey:
- EventStoreUpgraded - Switched from memory to JetStream
- NodeJoined - Node added to cluster
- NodeDiscovered - New node found via NATS
- LeaderElected - Leader selected after election
- LeaderHeartbeat - Leader renews lease (periodic)
- ShardAssigned - Actor assigned to shard
- ShardRebalanceRequested - Leader initiates rebalancing
- NodeFailed - Node stopped responding
- ShardMigrated - Shard moved from one node to another
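Because both stores satisfy the EventStore interface, the switch in step 1 of this journey is a wiring change rather than a domain change. A hedged sketch (constructor and service names are illustrative):

// Sketch; uses github.com/nats-io/nats.go for the JetStream context.
var store EventStore

if os.Getenv("ENV") == "production" {
    nc, _ := nats.Connect(os.Getenv("NATS_URL"))
    js, _ := nc.JetStream()
    store = NewJetStreamEventStore(js) // durable, visible to every node (assumed constructor)
} else {
    store = NewInMemoryEventStore() // fast iteration; data lost on restart
}

orderService := NewOrderService(store) // hypothetical domain service; unchanged either way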
Journey 3: Multi-Tenant System with Namespace Isolation
Job to be done: "Isolate tenant data logically without complex multi-tenancy framework; ensure queries see only their data"
Steps:
1. Developer decides on namespace boundary
- Outcome: Defines namespace as tenant ID or domain boundary (e.g., "tenant-abc", "prod.orders")
- Pain: Must understand NATS subject naming conventions; unsure about collision risks
- Design: Namespace is arbitrary string; uses dot-separated tokens for hierarchical patterns
- Event: NamespaceDefined - Namespace selected (tenant-abc)
2. Developer creates namespaced EventStore
- Outcome: Events for this namespace stored in separate JetStream stream (e.g., "tenant-abc_events")
- Pain: Must remember to use correct namespace everywhere; easy to cross-contaminate
- Design: JetStreamEventStoreWithNamespace enforces namespace in stream name
- Event: NamespacedStoreCreated - Store created with namespace prefix
3. Developer publishes to namespaced bus
- Outcome: EventBus.Publish("tenant-abc", event) routes to subscribers of "tenant-abc"
- Pain: Wildcard subscriptions bypass isolation (prod.* receives prod.orders and prod.users)
- Design: MatchNamespacePattern enforces NATS wildcard rules; documentation warns about the security implications of wildcards
- Event: EventPublished - Event sent to namespace
4. Developer creates filtered subscription
- Outcome: Subscriber can filter by event type or actor pattern
- Pain: Filters are client-side after full subscription; what if payload is sensitive?
- Design: EventBus filters after receiving; NATSEventBus uses NATS subject patterns for efficiency
- Event: SubscriberFiltered - Subscriber created with filter criteria
5. Developer validates isolation
- Outcome: Test confirms tenant-abc cannot see tenant-def events
- Pain: Must test at multiple levels (store, bus, query models); still no compile-time guarantee
- Design: Integration tests verify namespace boundaries
- Event: IsolationVerified - Test confirmed namespace separation
Events in this journey:
- NamespaceDefined - Namespace boundary established
- NamespacedStoreCreated - Store created for namespace
- EventPublished - Event sent to namespace
- SubscriptionCreated - Subscriber registered for namespace
- FilterApplied - Subscription filter configured
- IsolationBreached - Test detected cross-namespace data leak (anti-pattern)
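A sketch of namespace-scoped publish/subscribe from this journey; Publish(namespace, event) follows the steps above, while the constructor and Subscribe signature are assumptions:

ns := "tenant-" + tenantID

store := NewJetStreamEventStoreWithNamespace(js, ns) // per-tenant stream, e.g. "tenant-abc_events" (assumed constructor)
_ = store.SaveEvent(evt)

eventBus.Publish(ns, evt)    // routed only to subscribers of this namespace
ch := eventBus.Subscribe(ns) // no wildcards in tenant-facing code
for e := range ch {
    handle(e) // hypothetical handler
}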
Journey 4: Optimistic Concurrency Control for Concurrent Writes
Job to be done: "Handle multiple concurrent writes to same actor without corruption; fail fast on conflicts"
Steps:
1. Developer loads actor state
- Outcome: Reads current version from store
- Pain: Version is snapshot of one moment; concurrent writer might be ahead
- Design: GetLatestVersion returns current version; developer must store it
- Event: VersionRead - Latest version fetched for actor
2. Developer modifies actor state
- Outcome: Developer applies domain logic to old state
- Pain: By the time they're ready to write, another writer may have succeeded
- Design: Developer computes new event with currentVersion + 1
- Event: EventCreated - New event generated with version
3. Developer attempts to save event
- Outcome: SaveEvent validates version > current; succeeds if true, fails if not
- Pain: Version conflict error requires retry logic; easy to drop writes
- Design: ErrVersionConflict is sentinel; VersionConflictError provides details
- Event: SaveAttempted - Event save started
4. Conflict occurs; developer retries
- Outcome: Second writer succeeded first; first writer gets conflict
- Pain: Developer must decide: retry immediately? backoff? give up?
- Design: Error includes currentVersion; developer can reload and retry
- Event: VersionConflict - Save failed due to version mismatch (irreversible decision: conflict happened)
5. Developer implements retry strategy
- Outcome: Loop: GetLatestVersion -> apply logic -> SaveEvent (repeat until no conflict)
- Pain: Risk of livelock if both writers keep retrying; no built-in retry
- Design: Aether provides primitives; application implements retry policy
- Event: EventSaved - Event persisted successfully
Events in this journey:
- VersionRead - Latest version fetched
- EventCreated - New event generated
- SaveAttempted - Save operation initiated
- VersionConflict - Save rejected due to version <= current (expensive mistake)
- EventSaved - Event persisted after successful save
- RetryInitiated - Conflict detected; retry loop started
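A sketch of surfacing conflict details to the caller rather than retrying blindly; VersionConflictError and its fields follow the description in this journey, though the exact shapes may differ:

newEvent.Version = currentVersion + 1
err := store.SaveEvent(newEvent)

var conflict *VersionConflictError
if errors.As(err, &conflict) {
    log.Printf("write lost the race on %s: attempted v%d, store at v%d",
        conflict.ActorID, conflict.AttemptedVersion, conflict.CurrentVersion)
    // Reload the latest version, re-apply domain logic, and retry
    // (see the backoff pattern under Risk 1).
}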
Business Event Timeline
Key insight: Events are facts that happened, not data structures. Events are immutable, ordered, and represent decisions made or state changes that occurred.
Event Sourcing Layer Events
EventStored
- Trigger: Developer calls SaveEvent(event) with valid version > current
- Change: Event appended to actor's event stream in store
- Interested parties: Replay logic, event bus subscribers, audit trail
- Data: event ID, actor ID, event type, version, data payload, timestamp, correlation/causation IDs
VersionConflict (irreversible - conflict already happened; causes costly retry)
- Trigger: Developer calls SaveEvent with version <= current latest version
- Change: Event rejected; write fails; optimistic lock lost
- Interested parties: Developer (must retry), monitoring system (tracks contention)
- Data: actor ID, attempted version, current version, time of conflict
SnapshotCreated
- Trigger: Developer/operator decides to snapshot state at version N
- Change: State snapshot saved alongside event stream
- Interested parties: Replay logic (can start from snapshot), query models
- Data: actor ID, version number, state data, timestamp
Namespace & Isolation Events
NamespaceCreated (reversible - can delete namespace if isolated)
- Trigger: Developer defines new tenant/domain boundary
- Change: Namespace registered; can be published to and subscribed from
- Interested parties: EventBus, EventStore with namespace prefix
- Data: namespace name, context/purpose, creation timestamp
NamespacedStoreInitialized
- Trigger: Developer creates JetStreamEventStore with namespace prefix
- Change: NATS stream created with namespace-prefixed name (e.g., "tenant-abc_events")
- Interested parties: EventStore queries, JetStream durability
- Data: namespace name, stream configuration, retention policy
EventPublished (reversible - event is published but not stored until SaveEvent)
- Trigger: Developer calls EventBus.Publish(namespace, event)
- Change: Event distributed to subscribers matching namespace pattern
- Interested parties: EventBus subscribers, wildcard subscribers
- Data: namespace, event ID, event type, subscriber count
Clustering & Leadership Events
NodeJoined (reversible - node can leave)
- Trigger: New node connects to NATS and starts ClusterManager
- Change: Node added to cluster view; consistent hash ring updated
- Interested parties: Leader election, shard distribution, health monitors
- Data: node ID, address, port, capacity, metadata, timestamp
LeaderElected (irreversible - past elections cannot be undone; new term starts)
- Trigger: Leader election round completes; one node wins
- Change: Winner creates lease in NATS KV store; becomes leader for this term
- Interested parties: Shard rebalancing, cluster coordination
- Data: leader ID, term number, lease expiration, timestamp
LeadershipLost (irreversible - loss of leadership is a fact)
- Trigger: Leader's lease expires; renewal fails; new election started
- Change: Leader status cleared; other nodes initiate new election
- Interested parties: Rebalancing pauses; coordination waits for new leader
- Data: old leader ID, term number, loss time, reason (timeout/explicit resign)
ShardAssigned (reversible at cluster level - can rebalance later)
- Trigger: Leader's consistent hash ring determines shard ownership
- Change: Shard mapped to node(s); actors hash to shards; traffic routes accordingly
- Interested parties: Actor placement, routing, shard managers
- Data: shard ID, node ID list (primary + replicas), assignment timestamp
NodeFailed (irreversible - failure is a fact; rebalancing response is new event)
- Trigger: Node health check fails; no heartbeat for >90 seconds
- Change: Node marked as failed; shards reassigned; actors may migrate
- Interested parties: Rebalancing, failover, monitoring, alerting
- Data: node ID, failure timestamp, last seen, shard list affected
ShardMigrated (irreversible - migration is a committed fact)
- Trigger: Rebalancing decided to move shard S from Node A to Node B
- Change: Actors in shard begin migrating; state copied; traffic switches
- Interested parties: Source node, destination node, actor placement
- Data: shard ID, from node, to node, actor list, migration status, timestamp
Concurrency Control Events
OptimisticLockAttempted
- Trigger: Developer calls SaveEvent with version = currentVersion + 1
- Change: Validation checks if version is strictly greater
- Interested parties: Event store, metrics (lock contention tracking)
- Data: actor ID, attempted version, current version before check
WriteSucceeded (irreversible - write to event store is committed)
- Trigger: SaveEvent validation passed; event appended to store
- Change: Event now part of durable record; cannot be undone
- Interested parties: Audit, replay, other writers (they will see conflict on next attempt)
- Data: event ID, actor ID, version, write timestamp
WriteRetried (reversible - retry is a tactical decision, not business fact)
- Trigger: OptimisticLock conflict; developer reloads and tries again
- Change: New attempt with higher version number
- Interested parties: Metrics (retry counts), developer (backoff strategy)
- Data: actor ID, retry attempt number, original conflict timestamp
Decision Points & Trade-Offs
Decision 1: Which EventStore to Use?
Context: Developer choosing between in-memory and JetStream
Type: Reversible (can swap store implementations)
Options:
- InMemoryEventStore: Fast iteration; no external dependency; loses data on restart
- JetStreamEventStore: Durable; scales across nodes; requires NATS cluster
Stakes:
- Wrong choice: Testing against memory then discovering issues in production, or slowing down iteration with JetStream overhead
- Cost of wrong choice: Medium (change is possible but requires refactoring downstream code)
Info needed:
- Is this for testing/iteration or production?
- How much data will accumulate?
- Is failover/replication required?
Decision rule (from vision):
- Testing/CI: Use InMemory
- Production: Use JetStream (NATS-native)
- Development: Start with InMemory; switch to JetStream when integrating with cluster
Decision 2: Snapshot Strategy
Context: Developer deciding when to snapshot actor state
Type: Reversible (snapshots are optional; can rebuild from events anytime)
Options:
- No snapshots: Always replay from event 1 (simple; slow for high-version actors)
- Periodic snapshots: Snapshot every N events or every T time (balance complexity/speed)
- On-demand snapshots: Snapshot when version exceeds threshold (react to actual usage)
Stakes:
- Wrong choice: Slow actor startup (many events to replay) or storage waste (too many snapshots)
- Cost: Low (snapshots are hints; can always replay)
Info needed:
- How many events does this actor accumulate?
- How often do we need to rebuild state?
- What's the latency requirement for actor startup?
Decision rule:
- Actors with <100 events: Skip snapshots; replay is fast
- Actors with 100-1000 events: Snapshot every 100 events or daily
- Actors with >1000 events: Snapshot every 50 events or implement adaptive snapshotting
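A sketch of the middle option, snapshotting every N events; SaveSnapshot and the threshold are assumptions, since Aether leaves the policy to the application:

const snapshotEvery = 100 // tune per the decision rule above

// maybeSnapshot is a hypothetical application-side policy hook called after each save.
func maybeSnapshot(store SnapshotStore, actorID string, version int64, state []byte) error {
    if version%snapshotEvery != 0 {
        return nil // not at a snapshot boundary yet
    }
    return store.SaveSnapshot(actorID, version, state) // assumed method name
}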
Decision 3: Namespace Boundaries
Context: Developer deciding logical isolation boundaries
Type: Reversible (namespaces can be reorganized; events are namespace-scoped)
Options:
- Tenant per namespace: "tenant-123", "tenant-456" (simple multi-tenancy)
- Domain per namespace: "orders", "payments", "users" (bounded context pattern)
- Hierarchical namespaces: "prod.orders", "staging.orders" (environment + domain)
- Global namespace: Single namespace for entire system (simplest; no isolation)
Stakes:
- Wrong choice: Cross-contamination (tenant sees other tenant's data), or over-isolated (complex coordination)
- Cost: Medium (changing boundaries requires data migration)
Info needed:
- What's the isolation requirement? (regulatory, security, operational)
- Do different domains need independent scaling?
- How many isolated scopes exist? (2 tenants vs 1000 tenants vs infinite)
Decision rule:
- Multi-tenant SaaS: Use "tenant-{id}" namespace per customer
- Microservices: Use "domain" namespace per bounded context
- Multi-environment: Use "env.domain" namespace (e.g., "prod.orders")
Security implication: Wildcard subscriptions (prod.*) bypass isolation; only trusted components should use them.
Decision 4: Concurrent Write Conflict Handling
Context: Developer handling version conflicts from optimistic locking
Type: Irreversible (the conflict happened; must decide retry strategy now)
Options:
- Fail immediately: Return error to caller; let application decide retry (simple; caller handles complexity)
- Automatic retry with backoff: Library retries internally; hides complexity; risk of cascade failures
- Merge conflicts: Attempt to merge conflicting changes (domain-specific; risky if wrong logic)
- Abort and alert: Fail loudly; signal that concurrent writes are happening; investigate
Stakes:
- Wrong choice: Lost writes (fail immediately without alerting), cascade failures (retry forever), or silent merges (corrupted data)
- Cost: High (affects data integrity; bugs compound over time)
Info needed:
- How frequent are conflicts expected? (rare = fail fast; common = retry needed)
- What's the business impact of a lost write?
- Can the application safely retry? (idempotent commands)
Decision rule (from Aether design):
- Aether provides primitives; application implements retry logic
- Return VersionConflictError to caller
- Caller decides: retry, fail, alert, exponential backoff
- Idiom: Loop with version reload on conflict (at-least-once semantics)
Decision 5: Leader Election Tolerance
Context: Developer deploying cluster and concerned about leader failures
Type: Irreversible (election results are committed facts)
Options:
- Fast election (short lease TTL): Leader changeover in seconds; risk of split-brain if network partitions
- Stable election (long lease TTL): Leader stable; slow to detect failure; risk of stalled cluster if leader hangs
- Quorum-based: Multiple nodes vote; requires odd number of nodes; safe but complex
Stakes:
- Wrong choice: Either frequent leader flapping (cascading rebalancing) or slow failure detection (cluster stalled)
- Cost: High (affects availability; cascading failures)
Info needed:
- How critical is leadership stability? (frequent rebalancing is expensive)
- What's the acceptable MTTR (mean time to recovery) from leader failure?
- Is split-brain acceptable? (multiple leaders claiming leadership)
Decision rule (from code):
- Aether uses lease-based election: 10s lease, 3s heartbeat, 2s election timeout
- Suitable for: Relatively stable networks; single-region deployments
- Not suitable for: WAN with frequent partitions; requires custom implementation
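A sketch of how a lease of this shape can be held over the NATS KV store (bucket and key names are illustrative; Aether's LeaderElection internals may differ):

// Uses github.com/nats-io/nats.go; js is a JetStreamContext.
kv, _ := js.CreateKeyValue(&nats.KeyValueConfig{
    Bucket: "aether-leader",
    TTL:    10 * time.Second, // lease: the key disappears if not renewed
})

rev, err := kv.Create("leader", []byte(nodeID)) // fails if another node already holds the lease
if err != nil {
    return // follower: watch the key and retry after the election timeout
}

for range time.Tick(3 * time.Second) { // heartbeat: renew well inside the TTL
    if rev, err = kv.Update("leader", []byte(nodeID), rev); err != nil {
        return // lease lost (e.g. partition); step down and re-enter election
    }
}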
Decision 6: Shard Rebalancing Policy
Context: After node failure, who moves shards and when?
Type: Reversible (rebalancing can be undone if wrong; it is a tactical response)
Options:
- Immediate rebalancing: After node failure, immediately reassign shards (fast; heavy load on new node)
- Delayed rebalancing: Wait for grace period; rebalance only if node doesn't recover (stable; but leaves shards on dead node temporarily)
- Manual rebalancing: Operator initiates rebalancing explicitly (safe; slow)
- Adaptive rebalancing: Rebalance based on load/health metrics (complex; optimized)
Stakes:
- Wrong choice: Cascading failures (overload remaining nodes), or stalled shards (no home)
- Cost: Medium (rebalancing is expensive but not data-loss critical)
Info needed:
- How stable is the infrastructure? (frequent failures = gradual rebalancing needed)
- What's peak load on single node? (can it absorb sudden redistribution)
- How critical are latencies during rebalancing?
Decision rule (from code):
- Aether triggers rebalancing when leader detects node topology changes
- Simple algorithm: Redistribute shards across active nodes using consistent hash
- Application can implement custom rebalancing policies
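A minimal sketch of consistent-hash placement (one point per node, no virtual nodes); Aether's ConsistentHashRing will differ in detail, but the redistribution property is the same: after NodeFailed, rerunning the lookup with the surviving node list moves only the failed node's shards.

// ownerFor returns the node that owns a shard: the first node clockwise from the
// shard's hash on the ring, wrapping to the lowest-hashed node if necessary.
// Imports: hash/fnv, math.
func ownerFor(shardID string, nodes []string) string {
    shardHash := hash32(shardID)

    owner, ownerHash := "", uint32(math.MaxUint32)
    wrap, wrapHash := "", uint32(math.MaxUint32)
    for _, n := range nodes {
        h := hash32(n)
        if h >= shardHash && h <= ownerHash {
            owner, ownerHash = n, h
        }
        if h <= wrapHash {
            wrap, wrapHash = n, h
        }
    }
    if owner != "" {
        return owner
    }
    return wrap // no node at or past the shard's hash: wrap around the ring
}

func hash32(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}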
Risk Areas & Expensive Mistakes
Risk 1: Version Conflict Cascade (High Impact, High Likelihood)
Risk: Multiple writers simultaneously attempting to write to same actor
Consequences:
- Some writes fail with VersionConflict
- Developers must implement retry logic
- If retry is naive (immediate loop), can cause high CPU, high latency
- If no retry at all, silent data loss (events dropped)
Detection:
- Metrics: Track conflict rate; spike indicates contention
- Logs: VersionConflictError includes current version; easy to debug
- Tests: Concurrent writer tests expose retry logic bugs
Mitigation:
- Design domain model to minimize concurrent writes (lock at actor level)
- Implement exponential backoff on retries
- Set maximum retry limit (circuit breaker)
- Document that Aether provides primitives; retry is application's responsibility
- Consider redesign if conflict rate >5% of writes
Code pattern to enforce:
// Correct: Retry with exponential backoff
for attempt := 0; attempt < maxRetries; attempt++ {
    version, err := store.GetLatestVersion(actorID)
    if err != nil {
        return err
    }
    event.Version = version + 1
    if err := store.SaveEvent(event); err == nil {
        break // Success
    } else if !errors.Is(err, ErrVersionConflict) {
        return err // only version conflicts are worth retrying
    }
    // Conflict: back off (1ms, 2ms, 4ms, ...) then reload and retry
    time.Sleep(time.Duration(math.Pow(2, float64(attempt))) * time.Millisecond)
}
// Anti-pattern: Tight loop (DON'T DO THIS)
for store.SaveEvent(event) != nil {
    // Spin forever if conflict persists
}
Risk 2: Namespace Isolation Breach (High Impact, Medium Likelihood)
Risk: Wildcard subscriptions or misconfigured stores leak data across namespaces
Consequences:
- Tenant A sees events from Tenant B
- Regulatory breach (GDPR, HIPAA, etc.)
- Silent data leak (no error; just wrong data)
- Hard to detect (requires integration tests with multiple tenants)
Examples of mistakes:
- Using ">" wildcard in multi-tenant system (receives all namespaces)
- Creating single JetStream stream for all tenants (namespace prefix ignored)
- Forgetting to pass namespace to EventBus.Publish() (goes to empty namespace)
Detection:
- Integration tests: Multi-tenant test scenario; verify isolation
- Audit: Log all wildcard subscriptions; require approval
- Schema: Enforce namespace in struct; compile-time checks weak (strings)
Mitigation:
- Always pass namespace explicitly: Publish(namespace, event)
- Code review: Flag any wildcard patterns ("*" or ">") in production code
- Documentation: Warn that wildcard bypasses isolation; document when it's safe
- Tests: Write integration tests for each supported isolation boundary
- Monitoring: Alert if unexpected namespaces appear in logs
Code smell:
// Risky: Wildcard subscription in multi-tenant system
ch := eventBus.Subscribe(">") // Receives ALL namespaces!
// Safe: Explicit namespace only
ch := eventBus.Subscribe("tenant-" + tenantID)
// Safe: Wildcard in trusted system component only (document why)
ch := eventBus.Subscribe("prod.>") // Only admin monitoring subscribes
Risk 3: Leader Election Livelock (Medium Impact, Low Likelihood)
Risk: Leader failure during rebalancing; new leader starts rebalancing; old leader comes back and conflicts
Consequences:
- Shards assigned to multiple nodes (split-brain)
- Actors migrated multiple times (cascading failures)
- Cluster unstable; rebalancing never completes
Trigger:
- Network partition: Old leader isolated but still thinks it's leader
- Slow leader: Lease expires; new leader elected; old leader comes back online and reasserts leadership
Detection:
- Metrics: Track leadership changes; spike indicates instability
- Logs: "Cluster leadership changed to X" happens frequently (>once per minute)
- Monitoring: Alert on leadership thrashing
Mitigation:
- LeaderElection uses lease-based coordination in NATS KV; cannot have two concurrent leaders
- But old leader might still be executing rebalancing when new leader elected
- Add generation/term numbers to shard assignments (only newer term accepted)
- Document that rebalancing is not atomic; intermediate states possible
- Operator can force shard assignment in extreme cases
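One way to make "only newer term accepted" concrete; the type and field names below are illustrative, not Aether's actual API:

// Tag shard assignments with the leader's term and drop anything stale.
type ShardAssignment struct {
    ShardID string
    NodeID  string
    Term    uint64 // election term of the leader that issued this assignment
}

type shardTable struct {
    mu          sync.Mutex
    currentTerm uint64
    owners      map[string]string // shardID -> nodeID
}

func (t *shardTable) Apply(a ShardAssignment) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if a.Term < t.currentTerm {
        return false // issued by a deposed leader; ignore
    }
    t.currentTerm = a.Term
    t.owners[a.ShardID] = a.NodeID
    return true
}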
Risk 4: Event Store Corruption from Bad Unmarshaling (Medium Impact, Low Likelihood)
Risk: Corrupted event in JetStream; cannot unmarshal; replay fails
Consequences:
- Actor cannot be replayed from point of corruption
- Entire actor's state is stuck
- Snapshot helps (if available); otherwise, manual recovery needed
Examples:
- Event stored with wrong schema version; cannot parse in new code
- Binary/JSON corruption in JetStream storage
- Application bug: Stores invalid data in event.Data map
Detection:
- Replay errors: ReplayError captures sequence number and raw bytes
- EventStoreWithErrors interface: Caller can inspect errors during replay
- Metrics: Track unmarshaling errors per actor
Mitigation:
- Design events for schema evolution: Add new fields as optional; keep old fields
- Provide data migration tool: Rewrite corrupted events to clean state
- Test: Corrupt events intentionally; verify error handling
- Snapshot frequently: Limits impact of corruption to recent events only
- JetStreamEventStore.GetEventsWithErrors() returns ReplayResult with Errors field
Code pattern:
// Good: Handle replay errors
result, err := store.GetEventsWithErrors(actorID, 0)
if err != nil {
    return err
}
for _, replayErr := range result.Errors {
    log.Printf("Corrupted event at seq %d: %v", replayErr.SequenceNumber, replayErr.Err)
    // Decide: skip? alert? pause replay?
}
Risk 5: Snapshot Staleness During Failover (Medium Impact, Medium Likelihood)
Risk: Node A crashes; actor migrated to Node B; Node B replays from stale snapshot
Consequences:
- Lost events between snapshot and crash
- State on Node B is older than what client expects
- Client sees state go backward (temporal anomaly)
Trigger:
- Snapshot taken at version 100
- New events created (versions 101-105)
- Node crashes before migration completes
- New node starts with snapshot at version 100; events 101-105 may be lost or replayed slowly
Detection:
- Version inconsistencies: Client sees actor version decrease
- Logs: "Loaded snapshot at version 100, expected 105"
- Metrics: Track snapshot age (time since last event)
Mitigation:
- Snapshot is a hint, not a guarantee
- Always replay events from snapshot version + 1
- Test: Crash node during rebalancing; verify no data loss
- Operational: Monitor snapshot freshness; alert if outdated
- Design: For critical actors, skip snapshots; always replay (safe but slow)
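A sketch of "snapshot is a hint": restore it when present, then always replay the events after it (method names follow the interfaces described earlier; the helpers are hypothetical):

state := newOrderState() // hypothetical empty projection
fromVersion := int64(0)

if snap, err := store.GetLatestSnapshot(actorID); err == nil && snap != nil {
    state = restoreFromSnapshot(snap) // hypothetical helper
    fromVersion = snap.Version
}

// Never serve the snapshot alone; events after it are the source of truth.
events, err := store.GetEvents(actorID, fromVersion+1)
if err != nil {
    return err
}
for _, e := range events {
    state.Apply(e)
}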
Risk 6: Namespace Name Collision in Hierarchical Naming (Low Impact, Low Likelihood)
Risk: Two separate logical domains accidentally use same namespace name
Consequences:
- Events cross-contaminate
- Subtle data corruption (events from domain A in domain B's stream)
- Very hard to detect (seems like normal operation)
Trigger:
- Dev: namespace = "orders"
- Ops: namespace = "orders" (different meaning!)
- Events published to same stream; subscribers confused
Detection:
- Naming convention: Enforce "env.team.domain" pattern
- Code review: Flag any hardcoded namespace strings
- Tests: Validate namespace against allow-list
Mitigation:
- Document namespace naming conventions in team wiki
- Use enum or constant for namespaces (compile-time checks)
- Enforce hierarchical naming: "prod.checkout.orders", not just "orders"
- Monitoring: Alert if new namespaces appear
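A sketch of the constants approach; the Namespace, EventBus, and Event shapes here are illustrative:

type Namespace string

const (
    NamespaceProdCheckoutOrders   Namespace = "prod.checkout.orders"
    NamespaceProdCheckoutPayments Namespace = "prod.checkout.payments"
)

// publishTo funnels every publish through a vetted constant, so a stray
// hardcoded namespace string stands out in code review.
func publishTo(bus *EventBus, ns Namespace, e Event) {
    bus.Publish(string(ns), e)
}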
Code Analysis: Intended vs Actual Implementation
Observation 1: Version Conflict Handling is Correctly Asymmetric
Intended: Optimistic locking with explicit error handling; application implements retry
Actual:
- EventStore.SaveEvent returns VersionConflictError (wraps ErrVersionConflict sentinel)
- Code provides detailed error: ActorID, AttemptedVersion, CurrentVersion
- No built-in retry logic (correct; encourages explicit retry at application level)
Alignment: GOOD - Implementation matches intent
Observation 2: Namespace Isolation is Primitive, Not Framework
Intended: Provide namespace primitives; let application layer handle multi-tenancy
Actual:
- EventBus.Publish(namespace, event) accepts arbitrary string
- MatchNamespacePattern supports NATS wildcards ("*", ">")
- JetStreamEventStore with namespace prefix creates separate streams
- NATSEventBus passes namespace as subject suffix: "aether.events.{namespace}"
Alignment: GOOD - No opinionated tenant management; just primitives
Gap: Namespace collision risk is real (see Risk 6); naming convention docs would help
Observation 3: Snapshot Strategy is Optional, Not Required
Intended: Snapshots should be purely performance optimization; events are source of truth
Actual:
- SnapshotStore extends EventStore interface
- GetLatestSnapshot can return nil (no snapshot exists)
- Replay logic can ignore snapshots and always replay from event 1
- Application chooses snapshot strategy
Alignment: GOOD - Snapshot is truly optional
Gap: No built-in snapshot strategy (periodic, adaptive); documentation could provide recipes
Observation 4: Cluster Management Exists but is Foundational, Not Complete
Intended: Provide node discovery, leader election, shard distribution primitives
Actual:
- ClusterManager coordinates topology
- LeaderElection uses NATS KV for lease-based coordination
- ConsistentHashRing distributes shards
- ShardManager (interface VMRegistry) connects VMs to shards
Alignment: GOOD - Primitives are in place
Gaps identified:
- Actor migration during rebalancing: ShardManager interface exists but no migration handler shown. Where do actors move their state during failover?
- Rebalancing algorithm: Code shows trigger points but not the actual rebalancing logic ("would rebalance across N nodes")
- Split-brain prevention: Lease-based election prevents two concurrent leaders, but old leader might still execute rebalancing during transition
Recommendation: Document the rebalancing lifecycle explicitly; show sample actor migration code
Observation 5: Event Bus Filtering is Multi-Level
Intended: Namespace patterns at NATS level; event type and actor filtering at application level
Actual:
- EventBus: In-memory subscriptions with local filtering
- NATSEventBus: Extends EventBus; adds NATS subject subscriptions
- SubscriptionFilter: EventTypes (list) + ActorPattern (wildcard string)
- Filter applied after receiving (client-side)
Alignment: GOOD - Two-level filtering is efficient (network filters namespaces; client filters details)
Security note: NATSEventBus wildcard patterns documented with security warnings
Observation 6: Correlation & Causation Metadata is Built In
Intended: Track request flow across events for auditability
Actual:
- Event.Metadata map with standard keys: CorrelationID, CausationID, UserID, TraceID, SpanID
- Helper methods: SetMetadata, GetMetadata, SetCorrelationID, GetCorrelationID
- WithMetadataFrom copies metadata from source event (chain causation)
Alignment: GOOD - Supports auditability principle from manifesto
Observation: Metadata is optional; not enforced. Could add validation to require correlation ID in production
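A sketch of causation chaining with the helpers listed above; the constructor and exact signatures are assumptions, and WithMetadataFrom may return a copy rather than mutate:

order := NewEvent("OrderPlaced", actorID, orderData) // hypothetical constructor
order.SetCorrelationID(requestID)                    // ties the event to the incoming request

payment := NewEvent("PaymentRequested", actorID, paymentData)
payment.WithMetadataFrom(order) // carries CorrelationID forward; CausationID points at the cause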
Recommendations
For Product Strategy (Next Steps)
1. Create Bounded Context Map
- Map intents: EventStore context, Namespace context, Cluster context, Concurrency context
- Identify where each developer journey crosses boundaries
- Define context boundaries for brownfield code
2. Document Failure Scenarios
- Create scenario: "Node fails during rebalancing; what state is consistent?"
- Show event trace for each failure mode
- Provide recovery procedures
3. Define Capabilities
- "Store events durably with conflict detection"
- "Isolate logical domains using namespaces"
- "Distribute actors across cluster nodes"
- "Elect coordinator and coordinate rebalancing"
4. Build Integration Test Suite
- Single node: Event storage, snapshots, replay
- Two node cluster: Node failure, shard migration, failover
- Multi-tenant: Namespace isolation, cross-contamination detection
- Concurrency: Version conflicts, concurrent writers, retry logic
For Architecture (Implementation Gaps)
1. Actor Migration Strategy
- Define how actors move state during shard rebalancing
- Show whether events follow actor, or actor replays from new location
- Provide sample migration handler code
2. Namespace Naming Convention
- Document "env.domain" pattern
- Provide namespace registry or allow-list validation
- Add compile-time checks (enums, not strings)
3. Rebalancing Lifecycle
- Document full state machine: NodeFailed → RebalanceRequested → ShardMigrated → Completed
- Specify atomic boundaries (what's guaranteed, what's eventual)
- Provide sample operator commands
4. Snapshot Strategy Recipes
- Document when to snapshot (event count, time-based, adaptive)
- Provide sample snapshot implementation
- Show cost/benefit trade-offs
For Risk Mitigation
1. Add Validation Layer
- Enforce namespace format
- Validate event version strictly
- Check for required metadata (correlation ID, user ID)
2. Observability Hooks
- Metrics: conflict rate, rebalancing latency, namespace usage
- Logs: Every significant event with structured fields
- Tracing: Correlation ID propagation for request flows
3. Safety Documentation
- Pinpoint which wildcard patterns are safe (document only trusted uses)
- Version conflict handling recipes (backoff, circuit breaker)
- Multi-tenant isolation verification checklist
Summary: Problem Space Captured
Aether solves the problem of distributed event sourcing for Go without frameworks by providing composable primitives aligned with organizational values. The problem space has four developer journeys, each with decision points and risks:
| Journey | Core Decision | Risk Area | Mitigation |
|---|---|---|---|
| Single Node | InMemory vs JetStream | Choice overload | Start with memory; docs guide migration |
| Distributed | Snapshot strategy | Stale snapshots | Always replay from snapshot+1; test failover |
| Multi-tenant | Namespace boundaries | Isolation breach | Wildcard warnings; integration tests |
| Concurrency | Retry strategy | Lost writes | Return error; docs show retry patterns |
The vision (primitives over frameworks) is well-executed in the code. Gaps are in documentation of failure modes, actor migration strategy, and namespace conventions. Next phase should map bounded contexts and define domain invariants (Step 3 of product-strategy chain).