Problem Map: Aether Distributed Actor System
Summary
Aether solves the problem of building distributed, event-sourced systems in Go without heavyweight frameworks or reinventing infrastructure. The core tension is providing composable primitives (Event, EventStore, clustering) that work together seamlessly while maintaining organizational values: auditability, business language in code, independent evolution, and explicit intent.
The problem space is defined by four distinct developer journeys: single-node development (testing/iteration), scaling to distributed clusters, isolating multi-tenant/multi-context data, and managing concurrent writes through optimistic locking.
Developer User Journeys
Journey 1: Single-Node Event-Sourced System (Testing & Iteration)
Job to be done: "Quickly build and test event-sourced domain logic without distributed complexity"
Steps:
1. Developer starts new bounded context
- Outcome: Empty event store configured
- Pain: Must choose between in-memory (loses data) and production store (overkill for iteration)
- Design: InMemoryEventStore provides fast iteration; no schema migration burden
2. Developer writes first event class
- Outcome: Event type defined with domain language (e.g., "OrderPlaced", not "order_v1")
- Pain: Event types are strings, easy to typo; no compile-time safety
- Design: Event struct accepts EventType as string; metadata provides correlation/causation tracking
3. Developer emits and replays events
- Outcome: State rebuilt from event history
- Pain: Replay can be slow if events accumulate; need to know when snapshots help
- Design: SnapshotStore interface separates snapshot logic from event storage
4. Developer runs integration test
- Outcome: Test validates domain behavior without NATS
- Pain: InMemoryEventStore is fast but tests don't catch distributed issues
- Design: EventStore interface allows swapping implementations; tests use memory
Events in this journey:
- EventStoreInitialized - Developer created store (InMemory selected)
- EventClassDefined - Domain event type created (OrderPlaced)
- EventStored - Event persisted to store
- ReplayStarted - Developer replays events to rebuild state
- SnapshotConsidered - Developer evaluates snapshot vs full replay cost
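A minimal sketch of this journey, assuming an InMemoryEventStore constructor and illustrative Event field names (the actual API may differ):

// Sketch only: constructor and field names are assumptions, not the confirmed API.
store := NewInMemoryEventStore()

evt := Event{
    EventType: "OrderPlaced", // domain language, not "order_v1"
    ActorID:   "order-123",
    Version:   1,
    Data:      map[string]interface{}{"total": 4999},
}
if err := store.SaveEvent(evt); err != nil {
    log.Fatal(err)
}

// Rebuild state by replaying the actor's history.
events, _ := store.GetEvents("order-123", 0)
for _, e := range events {
    applyToState(e) // hypothetical projection function
}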
Journey 2: Scaling from Single Node to Distributed Cluster
Job to be done: "Move proven domain logic to production without rewriting; handle shard assignment and leader coordination"
Steps:
1. Developer switches EventStore to JetStream
- Outcome: Events now persisted in NATS cluster; available across nodes
- Pain: Chose JetStream (production) but now depends on NATS uptime
- Design: JetStreamEventStore implements same EventStore interface; namespace isolation available
- Event: EventStoreUpgraded - Switched from memory to JetStream
2. Developer connects nodes to cluster
- Outcome: Nodes discover each other via NATS
- Pain: Must bootstrap cluster; leader election hasn't started yet
- Design: ClusterManager handles node discovery and topology
- Event: NodeJoined - New node joined cluster with address/capacity/metadata
3. Developer enables leader election
- Outcome: One node elected leader; can coordinate shard assignments
- Pain: If leader crashes, new election takes time; old leader might cause split-brain
- Design: LeaderElection uses NATS KV store with lease-based coordination (TTL + renewal)
- Event: LeaderElected - Leader chosen; term incremented
4. Developer assigns shards to nodes
- Outcome: Consistent hash ring distributes shards across nodes
- Pain: Initial shard assignment is manual; rebalancing after node failure is complex
- Design: ConsistentHashRing handles placement; ShardManager routes actors to shards
- Event: ShardsAssigned - Shards allocated to nodes via consistent hash
5. Developer tests failover scenario
- Outcome: Node crashes; system continues; other nodes take over shards
- Pain: How do migrated actors recover state? Where is state during migration?
- Design: Events are in JetStream (durable); snapshots help fast recovery
- Event: NodeFailed - Node marked failed; shards need reassignment
Events in this journey:
- EventStoreUpgraded - Switched from memory to JetStream
- NodeJoined - Node added to cluster
- NodeDiscovered - New node found via NATS
- LeaderElected - Leader selected after election
- LeaderHeartbeat - Leader renews lease (periodic)
- ShardAssigned - Actor assigned to shard
- ShardRebalanceRequested - Leader initiates rebalancing
- NodeFailed - Node stopped responding
- ShardMigrated - Shard moved from one node to another
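Because both stores satisfy the EventStore interface, the switch in step 1 of this journey is a wiring change rather than a domain change. A hedged sketch (constructor and service names are illustrative):

// Sketch; uses github.com/nats-io/nats.go for the JetStream context.
var store EventStore

if os.Getenv("ENV") == "production" {
    nc, _ := nats.Connect(os.Getenv("NATS_URL"))
    js, _ := nc.JetStream()
    store = NewJetStreamEventStore(js) // durable, visible to every node (assumed constructor)
} else {
    store = NewInMemoryEventStore() // fast iteration; data lost on restart
}

orderService := NewOrderService(store) // hypothetical domain service; unchanged either way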
Journey 3: Multi-Tenant System with Namespace Isolation
Job to be done: "Isolate tenant data logically without complex multi-tenancy framework; ensure queries see only their data"
Steps:
1. Developer decides on namespace boundary
- Outcome: Defines namespace as tenant ID or domain boundary (e.g., "tenant-abc", "prod.orders")
- Pain: Must understand NATS subject naming conventions; unsure about collision risks
- Design: Namespace is arbitrary string; uses dot-separated tokens for hierarchical patterns
- Event: NamespaceDefined - Namespace selected (tenant-abc)
2. Developer creates namespaced EventStore
- Outcome: Events for this namespace stored in separate JetStream stream (e.g., "tenant-abc_events")
- Pain: Must remember to use correct namespace everywhere; easy to cross-contaminate
- Design: JetStreamEventStoreWithNamespace enforces namespace in stream name
- Event: NamespacedStoreCreated - Store created with namespace prefix
3. Developer publishes to namespaced bus
- Outcome: EventBus.Publish("tenant-abc", event) routes to subscribers of "tenant-abc"
- Pain: Wildcard subscriptions bypass isolation (prod.* receives prod.orders and prod.users)
- Design: MatchNamespacePattern enforces NATS wildcard rules; documentation warns about the security implications of wildcards
- Event: EventPublished - Event sent to namespace
4. Developer creates filtered subscription
- Outcome: Subscriber can filter by event type or actor pattern
- Pain: Filters are client-side after full subscription; what if payload is sensitive?
- Design: EventBus filters after receiving; NATSEventBus uses NATS subject patterns for efficiency
- Event: SubscriberFiltered - Subscriber created with filter criteria
5. Developer validates isolation
- Outcome: Test confirms tenant-abc cannot see tenant-def events
- Pain: Must test at multiple levels (store, bus, query models); still no compile-time guarantee
- Design: Integration tests verify namespace boundaries
- Event: IsolationVerified - Test confirmed namespace separation
Events in this journey:
- NamespaceDefined - Namespace boundary established
- NamespacedStoreCreated - Store created for namespace
- EventPublished - Event sent to namespace
- SubscriptionCreated - Subscriber registered for namespace
- FilterApplied - Subscription filter configured
- IsolationBreached - Test detected cross-namespace data leak (anti-pattern)
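A sketch of namespace-scoped publish/subscribe from this journey; Publish(namespace, event) follows the steps above, while the constructor and Subscribe signature are assumptions:

ns := "tenant-" + tenantID

store := NewJetStreamEventStoreWithNamespace(js, ns) // per-tenant stream, e.g. "tenant-abc_events" (assumed constructor)
_ = store.SaveEvent(evt)

eventBus.Publish(ns, evt)    // routed only to subscribers of this namespace
ch := eventBus.Subscribe(ns) // no wildcards in tenant-facing code
for e := range ch {
    handle(e) // hypothetical handler
}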
Journey 4: Optimistic Concurrency Control for Concurrent Writes
Job to be done: "Handle multiple concurrent writes to same actor without corruption; fail fast on conflicts"
Steps:
1. Developer loads actor state
- Outcome: Reads current version from store
- Pain: Version is snapshot of one moment; concurrent writer might be ahead
- Design: GetLatestVersion returns current version; developer must store it
- Event: VersionRead - Latest version fetched for actor
2. Developer modifies actor state
- Outcome: Developer applies domain logic to old state
- Pain: By the time they're ready to write, another writer may have succeeded
- Design: Developer computes new event with currentVersion + 1
- Event: EventCreated - New event generated with version
3. Developer attempts to save event
- Outcome: SaveEvent validates version > current; succeeds if true, fails if not
- Pain: Version conflict error requires retry logic; easy to drop writes
- Design: ErrVersionConflict is sentinel; VersionConflictError provides details
- Event: SaveAttempted - Event save started
4. Conflict occurs; developer retries
- Outcome: Second writer succeeded first; first writer gets conflict
- Pain: Developer must decide: retry immediately? backoff? give up?
- Design: Error includes currentVersion; developer can reload and retry
- Event: VersionConflict - Save failed due to version mismatch (irreversible decision: conflict happened)
5. Developer implements retry strategy
- Outcome: Loop: GetLatestVersion -> apply logic -> SaveEvent (repeat until no conflict)
- Pain: Risk of livelock if both writers keep retrying; no built-in retry
- Design: Aether provides primitives; application implements retry policy
- Event: EventSaved - Event persisted successfully
Events in this journey:
- VersionRead - Latest version fetched
- EventCreated - New event generated
- SaveAttempted - Save operation initiated
- VersionConflict - Save rejected due to version <= current (expensive mistake)
- EventSaved - Event persisted after successful save
- RetryInitiated - Conflict detected; retry loop started
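A sketch of surfacing conflict details to the caller rather than retrying blindly; VersionConflictError and its fields follow the description in this journey, though the exact shapes may differ:

newEvent.Version = currentVersion + 1
err := store.SaveEvent(newEvent)

var conflict *VersionConflictError
if errors.As(err, &conflict) {
    log.Printf("write lost the race on %s: attempted v%d, store at v%d",
        conflict.ActorID, conflict.AttemptedVersion, conflict.CurrentVersion)
    // Reload the latest version, re-apply domain logic, and retry
    // (see the backoff pattern under Risk 1).
}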
Business Event Timeline
Key insight: Events are facts that happened, not data structures. Events are immutable, ordered, and represent decisions made or state changes that occurred.
Event Sourcing Layer Events
EventStored
- Trigger: Developer calls SaveEvent(event) with valid version > current
- Change: Event appended to actor's event stream in store
- Interested parties: Replay logic, event bus subscribers, audit trail
- Data: event ID, actor ID, event type, version, data payload, timestamp, correlation/causation IDs
VersionConflict (irreversible - conflict already happened; causes costly retry)
- Trigger: Developer calls SaveEvent with version <= current latest version
- Change: Event rejected; write fails; optimistic lock lost
- Interested parties: Developer (must retry), monitoring system (tracks contention)
- Data: actor ID, attempted version, current version, time of conflict
SnapshotCreated
- Trigger: Developer/operator decides to snapshot state at version N
- Change: State snapshot saved alongside event stream
- Interested parties: Replay logic (can start from snapshot), query models
- Data: actor ID, version number, state data, timestamp
Namespace & Isolation Events
NamespaceCreated (reversible - can delete namespace if isolated)
- Trigger: Developer defines new tenant/domain boundary
- Change: Namespace registered; can be published to and subscribed from
- Interested parties: EventBus, EventStore with namespace prefix
- Data: namespace name, context/purpose, creation timestamp
NamespacedStoreInitialized
- Trigger: Developer creates JetStreamEventStore with namespace prefix
- Change: NATS stream created with namespace-prefixed name (e.g., "tenant-abc_events")
- Interested parties: EventStore queries, JetStream durability
- Data: namespace name, stream configuration, retention policy
EventPublished (reversible - event is published but not stored until SaveEvent)
- Trigger: Developer calls EventBus.Publish(namespace, event)
- Change: Event distributed to subscribers matching namespace pattern
- Interested parties: EventBus subscribers, wildcard subscribers
- Data: namespace, event ID, event type, subscriber count
Clustering & Leadership Events
NodeJoined (reversible - node can leave)
- Trigger: New node connects to NATS and starts ClusterManager
- Change: Node added to cluster view; consistent hash ring updated
- Interested parties: Leader election, shard distribution, health monitors
- Data: node ID, address, port, capacity, metadata, timestamp
LeaderElected (irreversible - past elections cannot be undone; new term starts)
- Trigger: Leader election round completes; one node wins
- Change: Winner creates lease in NATS KV store; becomes leader for this term
- Interested parties: Shard rebalancing, cluster coordination
- Data: leader ID, term number, lease expiration, timestamp
LeadershipLost (irreversible - loss of leadership is a fact)
- Trigger: Leader's lease expires; renewal fails; new election started
- Change: Leader status cleared; other nodes initiate new election
- Interested parties: Rebalancing pauses; coordination waits for new leader
- Data: old leader ID, term number, loss time, reason (timeout/explicit resign)
ShardAssigned (reversible at cluster level - can rebalance later)
- Trigger: Leader's consistent hash ring determines shard ownership
- Change: Shard mapped to node(s); actors hash to shards; traffic routes accordingly
- Interested parties: Actor placement, routing, shard managers
- Data: shard ID, node ID list (primary + replicas), assignment timestamp
NodeFailed (irreversible - failure is a fact; rebalancing response is new event)
- Trigger: Node health check fails; no heartbeat for >90 seconds
- Change: Node marked as failed; shards reassigned; actors may migrate
- Interested parties: Rebalancing, failover, monitoring, alerting
- Data: node ID, failure timestamp, last seen, shard list affected
ShardMigrated (irreversible - migration is a committed fact)
- Trigger: Rebalancing decided to move shard S from Node A to Node B
- Change: Actors in shard begin migrating; state copied; traffic switches
- Interested parties: Source node, destination node, actor placement
- Data: shard ID, from node, to node, actor list, migration status, timestamp
Concurrency Control Events
OptimisticLockAttempted
- Trigger: Developer calls SaveEvent with version = currentVersion + 1
- Change: Validation checks if version is strictly greater
- Interested parties: Event store, metrics (lock contention tracking)
- Data: actor ID, attempted version, current version before check
WriteSucceeded (irreversible - write to event store is committed)
- Trigger: SaveEvent validation passed; event appended to store
- Change: Event now part of durable record; cannot be undone
- Interested parties: Audit, replay, other writers (they will see conflict on next attempt)
- Data: event ID, actor ID, version, write timestamp
WriteRetried (reversible - retry is a tactical decision, not business fact)
- Trigger: OptimisticLock conflict; developer reloads and tries again
- Change: New attempt with higher version number
- Interested parties: Metrics (retry counts), developer (backoff strategy)
- Data: actor ID, retry attempt number, original conflict timestamp
Decision Points & Trade-Offs
Decision 1: Which EventStore to Use?
Context: Developer choosing between in-memory and JetStream
Type: Reversible (can swap store implementations)
Options:
- InMemoryEventStore: Fast iteration; no external dependency; loses data on restart
- JetStreamEventStore: Durable; scales across nodes; requires NATS cluster
Stakes:
- Wrong choice: Testing against memory then discovering issues in production, or slowing down iteration with JetStream overhead
- Cost of wrong choice: Medium (change is possible but requires refactoring downstream code)
Info needed:
- Is this for testing/iteration or production?
- How much data will accumulate?
- Is failover/replication required?
Decision rule (from vision):
- Testing/CI: Use InMemory
- Production: Use JetStream (NATS-native)
- Development: Start with InMemory; switch to JetStream when integrating with cluster
Decision 2: Snapshot Strategy
Context: Developer deciding when to snapshot actor state
Type: Reversible (snapshots are optional; can rebuild from events anytime)
Options:
- No snapshots: Always replay from event 1 (simple; slow for high-version actors)
- Periodic snapshots: Snapshot every N events or every T time (balance complexity/speed)
- On-demand snapshots: Snapshot when version exceeds threshold (react to actual usage)
Stakes:
- Wrong choice: Slow actor startup (many events to replay) or storage waste (too many snapshots)
- Cost: Low (snapshots are hints; can always replay)
Info needed:
- How many events does this actor accumulate?
- How often do we need to rebuild state?
- What's the latency requirement for actor startup?
Decision rule:
- Actors with <100 events: Skip snapshots; replay is fast
- Actors with 100-1000 events: Snapshot every 100 events or daily
- Actors with >1000 events: Snapshot every 50 events or implement adaptive snapshotting
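A sketch of the middle option, snapshotting every N events; SaveSnapshot and the threshold are assumptions, since Aether leaves the policy to the application:

const snapshotEvery = 100 // tune per the decision rule above

// maybeSnapshot is a hypothetical application-side policy hook called after each save.
func maybeSnapshot(store SnapshotStore, actorID string, version int64, state []byte) error {
    if version%snapshotEvery != 0 {
        return nil // not at a snapshot boundary yet
    }
    return store.SaveSnapshot(actorID, version, state) // assumed method name
}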
Decision 3: Namespace Boundaries
Context: Developer deciding logical isolation boundaries
Type: Reversible (namespaces can be reorganized; events are namespace-scoped)
Options:
- Tenant per namespace: "tenant-123", "tenant-456" (simple multi-tenancy)
- Domain per namespace: "orders", "payments", "users" (bounded context pattern)
- Hierarchical namespaces: "prod.orders", "staging.orders" (environment + domain)
- Global namespace: Single namespace for entire system (simplest; no isolation)
Stakes:
- Wrong choice: Cross-contamination (tenant sees other tenant's data), or over-isolated (complex coordination)
- Cost: Medium (changing boundaries requires data migration)
Info needed:
- What's the isolation requirement? (regulatory, security, operational)
- Do different domains need independent scaling?
- How many isolated scopes exist? (2 tenants vs 1000 tenants vs infinite)
Decision rule:
- Multi-tenant SaaS: Use "tenant-{id}" namespace per customer
- Microservices: Use "domain" namespace per bounded context
- Multi-environment: Use "env.domain" namespace (e.g., "prod.orders")
Security implication: Wildcard subscriptions (prod.*) bypass isolation; only trusted components should use them.
Decision 4: Concurrent Write Conflict Handling
Context: Developer handling version conflicts from optimistic locking
Type: Irreversible (the conflict happened; must decide retry strategy now)
Options:
- Fail immediately: Return error to caller; let application decide retry (simple; caller handles complexity)
- Automatic retry with backoff: Library retries internally; hides complexity; risk of cascade failures
- Merge conflicts: Attempt to merge conflicting changes (domain-specific; risky if wrong logic)
- Abort and alert: Fail loudly; signal that concurrent writes are happening; investigate
Stakes:
- Wrong choice: Lost writes (fail immediately without alerting), cascade failures (retry forever), or silent merges (corrupted data)
- Cost: High (affects data integrity; bugs compound over time)
Info needed:
- How frequent are conflicts expected? (rare = fail fast; common = retry needed)
- What's the business impact of a lost write?
- Can the application safely retry? (idempotent commands)
Decision rule (from Aether design):
- Aether provides primitives; application implements retry logic
- Return VersionConflictError to caller
- Caller decides: retry, fail, alert, exponential backoff
- Idiom: Loop with version reload on conflict (at-least-once semantics)
Decision 5: Leader Election Tolerance
Context: Developer deploying cluster and concerned about leader failures
Type: Irreversible (election results are committed facts)
Options:
- Fast election (short lease TTL): Leader changeover in seconds; risk of split-brain if network partitions
- Stable election (long lease TTL): Leader stable; slow to detect failure; risk of stalled cluster if leader hangs
- Quorum-based: Multiple nodes vote; requires odd number of nodes; safe but complex
Stakes:
- Wrong choice: Either frequent leader flapping (cascading rebalancing) or slow failure detection (cluster stalled)
- Cost: High (affects availability; cascading failures)
Info needed:
- How critical is leadership stability? (frequent rebalancing is expensive)
- What's the acceptable MTTR (mean time to recovery) from leader failure?
- Is split-brain acceptable? (multiple leaders claiming leadership)
Decision rule (from code):
- Aether uses lease-based election: 10s lease, 3s heartbeat, 2s election timeout
- Suitable for: Relatively stable networks; single-region deployments
- Not suitable for: WAN with frequent partitions; requires custom implementation
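A sketch of how a lease of this shape can be held over the NATS KV store (bucket and key names are illustrative; Aether's LeaderElection internals may differ):

// Uses github.com/nats-io/nats.go; js is a JetStreamContext.
kv, _ := js.CreateKeyValue(&nats.KeyValueConfig{
    Bucket: "aether-leader",
    TTL:    10 * time.Second, // lease: the key disappears if not renewed
})

rev, err := kv.Create("leader", []byte(nodeID)) // fails if another node already holds the lease
if err != nil {
    return // follower: watch the key and retry after the election timeout
}

for range time.Tick(3 * time.Second) { // heartbeat: renew well inside the TTL
    if rev, err = kv.Update("leader", []byte(nodeID), rev); err != nil {
        return // lease lost (e.g. partition); step down and re-enter election
    }
}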
Decision 6: Shard Rebalancing Policy
Context: After node failure, who moves shards and when?
Type: Reversible (rebalancing can be undone if wrong; it is a tactical response)
Options:
- Immediate rebalancing: After node failure, immediately reassign shards (fast; heavy load on new node)
- Delayed rebalancing: Wait for grace period; rebalance only if node doesn't recover (stable; but leaves shards on dead node temporarily)
- Manual rebalancing: Operator initiates rebalancing explicitly (safe; slow)
- Adaptive rebalancing: Rebalance based on load/health metrics (complex; optimized)
Stakes:
- Wrong choice: Cascading failures (overload remaining nodes), or stalled shards (no home)
- Cost: Medium (rebalancing is expensive but not data-loss critical)
Info needed:
- How stable is the infrastructure? (frequent failures = gradual rebalancing needed)
- What's peak load on single node? (can it absorb sudden redistribution)
- How critical are latencies during rebalancing?
Decision rule (from code):
- Aether triggers rebalancing when leader detects node topology changes
- Simple algorithm: Redistribute shards across active nodes using consistent hash
- Application can implement custom rebalancing policies
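A minimal sketch of consistent-hash placement (one point per node, no virtual nodes); Aether's ConsistentHashRing will differ in detail, but the redistribution property is the same: after NodeFailed, rerunning the lookup with the surviving node list moves only the failed node's shards.

// ownerFor returns the node that owns a shard: the first node clockwise from the
// shard's hash on the ring, wrapping to the lowest-hashed node if necessary.
// Imports: hash/fnv, math.
func ownerFor(shardID string, nodes []string) string {
    shardHash := hash32(shardID)

    owner, ownerHash := "", uint32(math.MaxUint32)
    wrap, wrapHash := "", uint32(math.MaxUint32)
    for _, n := range nodes {
        h := hash32(n)
        if h >= shardHash && h <= ownerHash {
            owner, ownerHash = n, h
        }
        if h <= wrapHash {
            wrap, wrapHash = n, h
        }
    }
    if owner != "" {
        return owner
    }
    return wrap // no node at or past the shard's hash: wrap around the ring
}

func hash32(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}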
Risk Areas & Expensive Mistakes
Risk 1: Version Conflict Cascade (High Impact, High Likelihood)
Risk: Multiple writers simultaneously attempting to write to same actor
Consequences:
- Some writes fail with VersionConflict
- Developers must implement retry logic
- If retry is naive (immediate loop), can cause high CPU, high latency
- If no retry at all, silent data loss (events dropped)
Detection:
- Metrics: Track conflict rate; spike indicates contention
- Logs: VersionConflictError includes current version; easy to debug
- Tests: Concurrent writer tests expose retry logic bugs
Mitigation:
- Design domain model to minimize concurrent writes (lock at actor level)
- Implement exponential backoff on retries
- Set maximum retry limit (circuit breaker)
- Document that Aether provides primitives; retry is application's responsibility
- Consider redesign if conflict rate >5% of writes
Code pattern to enforce:
// Correct: Retry with exponential backoff
for attempt := 0; attempt < maxRetries; attempt++ {
    version, err := store.GetLatestVersion(actorID)
    if err != nil {
        return err
    }
    event.Version = version + 1
    if err := store.SaveEvent(event); err == nil {
        break // Success
    } else if !errors.Is(err, ErrVersionConflict) {
        return err // only version conflicts are worth retrying
    }
    // Conflict: back off (1ms, 2ms, 4ms, ...) then reload and retry
    time.Sleep(time.Duration(math.Pow(2, float64(attempt))) * time.Millisecond)
}
// Anti-pattern: Tight loop (DON'T DO THIS)
for store.SaveEvent(event) != nil {
    // Spin forever if conflict persists
}
Risk 2: Namespace Isolation Breach (High Impact, Medium Likelihood)
Risk: Wildcard subscriptions or misconfigured stores leak data across namespaces
Consequences:
- Tenant A sees events from Tenant B
- Regulatory breach (GDPR, HIPAA, etc.)
- Silent data leak (no error; just wrong data)
- Hard to detect (requires integration tests with multiple tenants)
Examples of mistakes:
- Using ">" wildcard in multi-tenant system (receives all namespaces)
- Creating single JetStream stream for all tenants (namespace prefix ignored)
- Forgetting to pass namespace to EventBus.Publish() (goes to empty namespace)
Detection:
- Integration tests: Multi-tenant test scenario; verify isolation
- Audit: Log all wildcard subscriptions; require approval
- Schema: Enforce namespace in struct; compile-time checks weak (strings)
Mitigation:
- Always pass namespace explicitly: Publish(namespace, event)
- Code review: Flag any wildcard patterns ("*" or ">") in production code
- Documentation: Warn that wildcard bypasses isolation; document when it's safe
- Tests: Write integration tests for each supported isolation boundary
- Monitoring: Alert if unexpected namespaces appear in logs
Code smell:
// Risky: Wildcard subscription in multi-tenant system
ch := eventBus.Subscribe(">") // Receives ALL namespaces!
// Safe: Explicit namespace only
ch := eventBus.Subscribe("tenant-" + tenantID)
// Safe: Wildcard in trusted system component only (document why)
ch := eventBus.Subscribe("prod.>") // Only admin monitoring subscribes
Risk 3: Leader Election Livelock (Medium Impact, Low Likelihood)
Risk: Leader failure during rebalancing; new leader starts rebalancing; old leader comes back and conflicts
Consequences:
- Shards assigned to multiple nodes (split-brain)
- Actors migrated multiple times (cascading failures)
- Cluster unstable; rebalancing never completes
Trigger:
- Network partition: Old leader isolated but still thinks it's leader
- Slow leader: Lease expires; new leader elected; old leader comes back online and reasserts leadership
Detection:
- Metrics: Track leadership changes; spike indicates instability
- Logs: "Cluster leadership changed to X" happens frequently (>once per minute)
- Monitoring: Alert on leadership thrashing
Mitigation:
- LeaderElection uses lease-based coordination in NATS KV; cannot have two concurrent leaders
- But old leader might still be executing rebalancing when new leader elected
- Add generation/term numbers to shard assignments (only newer term accepted)
- Document that rebalancing is not atomic; intermediate states possible
- Operator can force shard assignment in extreme cases
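One way to make "only newer term accepted" concrete; the type and field names below are illustrative, not Aether's actual API:

// Tag shard assignments with the leader's term and drop anything stale.
type ShardAssignment struct {
    ShardID string
    NodeID  string
    Term    uint64 // election term of the leader that issued this assignment
}

type shardTable struct {
    mu          sync.Mutex
    currentTerm uint64
    owners      map[string]string // shardID -> nodeID
}

func (t *shardTable) Apply(a ShardAssignment) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if a.Term < t.currentTerm {
        return false // issued by a deposed leader; ignore
    }
    t.currentTerm = a.Term
    t.owners[a.ShardID] = a.NodeID
    return true
}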
Risk 4: Event Store Corruption from Bad Unmarshaling (Medium Impact, Low Likelihood)
Risk: Corrupted event in JetStream; cannot unmarshal; replay fails
Consequences:
- Actor cannot be replayed from point of corruption
- Entire actor's state is stuck
- Snapshot helps (if available); otherwise, manual recovery needed
Examples:
- Event stored with wrong schema version; cannot parse in new code
- Binary/JSON corruption in JetStream storage
- Application bug: Stores invalid data in event.Data map
Detection:
- Replay errors: ReplayError captures sequence number and raw bytes
- EventStoreWithErrors interface: Caller can inspect errors during replay
- Metrics: Track unmarshaling errors per actor
Mitigation:
- Design events for schema evolution: Add new fields as optional; keep old fields
- Provide data migration tool: Rewrite corrupted events to clean state
- Test: Corrupt events intentionally; verify error handling
- Snapshot frequently: Limits impact of corruption to recent events only
- JetStreamEventStore.GetEventsWithErrors() returns ReplayResult with Errors field
Code pattern:
// Good: Handle replay errors
result, err := store.GetEventsWithErrors(actorID, 0)
if err != nil {
    return err
}
for _, replayErr := range result.Errors {
    log.Printf("Corrupted event at seq %d: %v", replayErr.SequenceNumber, replayErr.Err)
    // Decide: skip? alert? pause replay?
}
Risk 5: Snapshot Staleness During Failover (Medium Impact, Medium Likelihood)
Risk: Node A crashes; actor migrated to Node B; Node B replays from stale snapshot
Consequences:
- Lost events between snapshot and crash
- State on Node B is older than what client expects
- Client sees state go backward (temporal anomaly)
Trigger:
- Snapshot taken at version 100
- New events created (versions 101-105)
- Node crashes before migration completes
- New node starts with snapshot at version 100; events 101-105 may be lost or replayed slowly
Detection:
- Version inconsistencies: Client sees actor version decrease
- Logs: "Loaded snapshot at version 100, expected 105"
- Metrics: Track snapshot age (time since last event)
Mitigation:
- Snapshot is a hint, not a guarantee
- Always replay events from snapshot version + 1
- Test: Crash node during rebalancing; verify no data loss
- Operational: Monitor snapshot freshness; alert if outdated
- Design: For critical actors, skip snapshots; always replay (safe but slow)
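A sketch of "snapshot is a hint": restore it when present, then always replay the events after it (method names follow the interfaces described earlier; the helpers are hypothetical):

state := newOrderState() // hypothetical empty projection
fromVersion := int64(0)

if snap, err := store.GetLatestSnapshot(actorID); err == nil && snap != nil {
    state = restoreFromSnapshot(snap) // hypothetical helper
    fromVersion = snap.Version
}

// Never serve the snapshot alone; events after it are the source of truth.
events, err := store.GetEvents(actorID, fromVersion+1)
if err != nil {
    return err
}
for _, e := range events {
    state.Apply(e)
}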
Risk 6: Namespace Name Collision in Hierarchical Naming (Low Impact, Low Likelihood)
Risk: Two separate logical domains accidentally use same namespace name
Consequences:
- Events cross-contaminate
- Subtle data corruption (events from domain A in domain B's stream)
- Very hard to detect (seems like normal operation)
Trigger:
- Dev: namespace = "orders"
- Ops: namespace = "orders" (different meaning!)
- Events published to same stream; subscribers confused
Detection:
- Naming convention: Enforce "env.team.domain" pattern
- Code review: Flag any hardcoded namespace strings
- Tests: Validate namespace against allow-list
Mitigation:
- Document namespace naming conventions in team wiki
- Use enum or constant for namespaces (compile-time checks)
- Enforce hierarchical naming: "prod.checkout.orders", not just "orders"
- Monitoring: Alert if new namespaces appear
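A sketch of the constants approach; the Namespace, EventBus, and Event shapes here are illustrative:

type Namespace string

const (
    NamespaceProdCheckoutOrders   Namespace = "prod.checkout.orders"
    NamespaceProdCheckoutPayments Namespace = "prod.checkout.payments"
)

// publishTo funnels every publish through a vetted constant, so a stray
// hardcoded namespace string stands out in code review.
func publishTo(bus *EventBus, ns Namespace, e Event) {
    bus.Publish(string(ns), e)
}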
Code Analysis: Intended vs Actual Implementation
Observation 1: Version Conflict Handling is Correctly Asymmetric
Intended: Optimistic locking with explicit error handling; application implements retry
Actual:
- EventStore.SaveEvent returns VersionConflictError (wraps ErrVersionConflict sentinel)
- Code provides detailed error: ActorID, AttemptedVersion, CurrentVersion
- No built-in retry logic (correct; encourages explicit retry at application level)
Alignment: GOOD - Implementation matches intent
Observation 2: Namespace Isolation is Primitive, Not Framework
Intended: Provide namespace primitives; let application layer handle multi-tenancy
Actual:
- EventBus.Publish(namespace, event) accepts arbitrary string
- MatchNamespacePattern supports NATS wildcards ("*", ">")
- JetStreamEventStore with namespace prefix creates separate streams
- NATSEventBus passes namespace as subject suffix: "aether.events.{namespace}"
Alignment: GOOD - No opinionated tenant management; just primitives
Gap: Namespace collision risk is real (see Risk 6); naming convention docs would help
Observation 3: Snapshot Strategy is Optional, Not Required
Intended: Snapshots should be purely performance optimization; events are source of truth
Actual:
- SnapshotStore extends EventStore interface
- GetLatestSnapshot can return nil (no snapshot exists)
- Replay logic can ignore snapshots and always replay from event 1
- Application chooses snapshot strategy
Alignment: GOOD - Snapshot is truly optional
Gap: No built-in snapshot strategy (periodic, adaptive); documentation could provide recipes
Observation 4: Cluster Management Exists but is Foundational, Not Complete
Intended: Provide node discovery, leader election, shard distribution primitives
Actual:
- ClusterManager coordinates topology
- LeaderElection uses NATS KV for lease-based coordination
- ConsistentHashRing distributes shards
- ShardManager (interface VMRegistry) connects VMs to shards
Alignment: GOOD - Primitives are in place
Gaps identified:
- Actor migration during rebalancing: ShardManager interface exists but no migration handler shown. Where do actors move their state during failover?
- Rebalancing algorithm: Code shows trigger points but not the actual rebalancing logic ("would rebalance across N nodes")
- Split-brain prevention: Lease-based election prevents two concurrent leaders, but old leader might still execute rebalancing during transition
Recommendation: Document the rebalancing lifecycle explicitly; show sample actor migration code
Observation 5: Event Bus Filtering is Multi-Level
Intended: Namespace patterns at NATS level; event type and actor filtering at application level
Actual:
- EventBus: In-memory subscriptions with local filtering
- NATSEventBus: Extends EventBus; adds NATS subject subscriptions
- SubscriptionFilter: EventTypes (list) + ActorPattern (wildcard string)
- Filter applied after receiving (client-side)
Alignment: GOOD - Two-level filtering is efficient (network filters namespaces; client filters details)
Security note: NATSEventBus wildcard patterns documented with security warnings
Observation 6: Correlation & Causation Metadata is Built In
Intended: Track request flow across events for auditability
Actual:
- Event.Metadata map with standard keys: CorrelationID, CausationID, UserID, TraceID, SpanID
- Helper methods: SetMetadata, GetMetadata, SetCorrelationID, GetCorrelationID
- WithMetadataFrom copies metadata from source event (chain causation)
Alignment: GOOD - Supports auditability principle from manifesto
Observation: Metadata is optional; not enforced. Could add validation to require correlation ID in production
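A sketch of causation chaining with the helpers listed above; the constructor and exact signatures are assumptions, and WithMetadataFrom may return a copy rather than mutate:

order := NewEvent("OrderPlaced", actorID, orderData) // hypothetical constructor
order.SetCorrelationID(requestID)                    // ties the event to the incoming request

payment := NewEvent("PaymentRequested", actorID, paymentData)
payment.WithMetadataFrom(order) // carries CorrelationID forward; CausationID points at the cause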
Recommendations
For Product Strategy (Next Steps)
1. Create Bounded Context Map
- Map intents: EventStore context, Namespace context, Cluster context, Concurrency context
- Identify where each developer journey crosses boundaries
- Define context boundaries for brownfield code
2. Document Failure Scenarios
- Create scenario: "Node fails during rebalancing; what state is consistent?"
- Show event trace for each failure mode
- Provide recovery procedures
3. Define Capabilities
- "Store events durably with conflict detection"
- "Isolate logical domains using namespaces"
- "Distribute actors across cluster nodes"
- "Elect coordinator and coordinate rebalancing"
4. Build Integration Test Suite
- Single node: Event storage, snapshots, replay
- Two node cluster: Node failure, shard migration, failover
- Multi-tenant: Namespace isolation, cross-contamination detection
- Concurrency: Version conflicts, concurrent writers, retry logic
For Architecture (Implementation Gaps)
1. Actor Migration Strategy
- Define how actors move state during shard rebalancing
- Show whether events follow actor, or actor replays from new location
- Provide sample migration handler code
2. Namespace Naming Convention
- Document "env.domain" pattern
- Provide namespace registry or allow-list validation
- Add compile-time checks (enums, not strings)
3. Rebalancing Lifecycle
- Document full state machine: NodeFailed → RebalanceRequested → ShardMigrated → Completed
- Specify atomic boundaries (what's guaranteed, what's eventual)
- Provide sample operator commands
4. Snapshot Strategy Recipes
- Document when to snapshot (event count, time-based, adaptive)
- Provide sample snapshot implementation
- Show cost/benefit trade-offs
For Risk Mitigation
1. Add Validation Layer
- Enforce namespace format
- Validate event version strictly
- Check for required metadata (correlation ID, user ID)
2. Observability Hooks
- Metrics: conflict rate, rebalancing latency, namespace usage
- Logs: Every significant event with structured fields
- Tracing: Correlation ID propagation for request flows
3. Safety Documentation
- Pinpoint which wildcard patterns are safe (document only trusted uses)
- Version conflict handling recipes (backoff, circuit breaker)
- Multi-tenant isolation verification checklist
Summary: Problem Space Captured
Aether solves the problem of distributed event sourcing for Go without frameworks by providing composable primitives aligned with organizational values. The problem space has four developer journeys, each with decision points and risks:
| Journey | Core Decision | Risk Area | Mitigation |
|---|---|---|---|
| Single Node | InMemory vs JetStream | Choice overload | Start with memory; docs guide migration |
| Distributed | Snapshot strategy | Stale snapshots | Always replay from snapshot+1; test failover |
| Multi-tenant | Namespace boundaries | Isolation breach | Wildcard warnings; integration tests |
| Concurrency | Retry strategy | Lost writes | Return error; docs show retry patterns |
The vision (primitives over frameworks) is well-executed in the code. Gaps are in documentation of failure modes, actor migration strategy, and namespace conventions. Next phase should map bounded contexts and define domain invariants (Step 3 of product-strategy chain).