Domain Model: Cluster Coordination
Summary
The Cluster Coordination context manages the distributed topology of actor nodes in an Aether cluster. Its core responsibility is to maintain consistency invariants: exactly one leader per term, all active shards assigned to at least one node, and no orphaned shards. It coordinates node discovery (via NATS heartbeats), leader election (lease-based), shard assignment (via consistent hashing), and rebalancing (when topology changes). The context enforces that only the leader can initiate rebalancing, and that node failures trigger shard reassignment to prevent actor orphaning.
Key insight: Cluster Coordination is not actor placement or routing (that's the application's responsibility via ShardManager). It owns the topology and leadership, enabling routing decisions by publishing shard assignments.
Invariants
These are the business rules that must never be violated:
Invariant 1: Single Leader Per Term
- Rule: At any point in time, at most one node is the leader for the current leadership term.
- Scope: LeadershipLease aggregate
- Why: Multiple leaders (split-brain) lead to conflicting rebalancing decisions and inconsistent shard assignments.
- Enforcement: LeaderElection enforces via NATS KV atomic operations (create/update with revision). Only one node can atomically claim the "leader" key.
Invariant 2: All Active Shards Have Owner(s)
- Rule: Every shard ID in [0, ShardCount) must be assigned to at least one active node if the cluster is healthy.
- Scope: ShardAssignment aggregate
- Why: Unassigned shards mean actors on those shards have no home; messages will orphan.
- Enforcement: LeaderElection enforces (only leader can assign). ClusterManager validates before applying assignments.
Invariant 3: Assigned Shards Exist on Healthy Nodes Only
- Rule: A shard assignment to node N is only valid if N is in NodeStatusActive.
- Scope: ShardAssignment + Cluster aggregates (coupled)
- Why: Assigning shards to failed nodes means actors can't execute.
- Enforcement: When node fails (NodeStatusFailed), leader rebalances shards off that node. handleNodeUpdate marks nodes failed after 90s heartbeat miss.
Invariant 4: Shard Assignments Stable During Leadership Lease
- Rule: Shard assignments only change in response to LeaderElected or NodeFailed; they don't arbitrarily shift during a stable leadership term.
- Scope: ShardAssignment + LeadershipLease (coupled)
- Why: Frequent rebalancing causes thrashing and actor migration overhead.
- Enforcement: rebalanceLoop (every 5 min) only runs if leader; triggerShardRebalancing only called on node changes (NodeJoined/Left/Failed).
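A minimal sketch of that periodic guard, assuming a ClusterManager with an IsLeader() check and the triggerShardRebalancing(reason) helper named above (the exact signatures in the codebase may differ):
```go
// Sketch only: periodic safety-net rebalancing that is a no-op on followers,
// so assignments stay stable during a healthy leadership term.
func (cm *ClusterManager) rebalanceLoop(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Invariant 4: only the leader may change assignments, and only as a
			// safety net between explicit topology-change triggers.
			if !cm.IsLeader() {
				continue
			}
			cm.triggerShardRebalancing("periodic rebalance check")
		}
	}
}
```
Topology changes (NodeJoined/NodeLeft/NodeFailed) call triggerShardRebalancing directly; the loop only catches anything those triggers missed.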
Invariant 5: Leader Is an Active Node
- Rule: If LeaderID is set, the node with that ID must exist in Cluster.nodes with status=Active.
- Scope: Cluster + LeadershipLease (coupled)
- Why: A failed leader cannot coordinate cluster decisions.
- Enforcement: handleNodeUpdate marks nodes failed after timeout; leader renewal fails if node is marked failed. Split-brain risk: partition could allow multiple leaders, but lease expiration + atomic update mitigates.
Aggregates
Aggregate 1: Cluster (Root)
Invariants enforced:
- Invariant 2: All active shards have owners
- Invariant 3: Shards assigned only to healthy nodes
- Invariant 4: Shard assignments stable during leadership lease
- Invariant 5: Leader is an active node
Entities:
- Cluster (root): Represents the distributed topology and orchestrates rebalancing
- nodes: Map[NodeID → NodeInfo] - all known nodes, their status, load, capacity
- shardMap: ShardMap - current shard-to-node assignments
- hashRing: ConsistentHashRing - used to compute which node owns which shard
- currentLeaderID: String - who is leading this term
- term: uint64 - leadership term counter
Value Objects:
- NodeInfo: ID, Address, Port, Status, Capacity, Load, LastSeen, Metadata, VMCount, ShardIDs - Represents a physical node in the cluster; immutable after creation, mutated only via NodeUpdate commands
- ShardMap: Version, Shards (map[ShardID → []NodeID]), Nodes (map[NodeID → NodeInfo]), UpdateTime - Snapshot of the current shard topology; immutable (replaced, not mutated)
- NodeStatus: Enum (Active, Draining, Failed) - Indicates the health state of a node
Lifecycle:
- Created when: ClusterManager is instantiated (Cluster exists as singleton during runtime)
- Destroyed when: Cluster shuts down or node is permanently removed
- Transitions:
- NodeJoined → add node to nodes, add to hashRing, trigger rebalance (if leader)
- NodeLeft → remove node from nodes, remove from hashRing, trigger rebalance (if leader)
- NodeFailed (detected) → mark node as failed, trigger rebalance (if leader)
- LeaderElected → update currentLeaderID, may trigger rebalance
- ShardAssigned → update shardMap, increment version
Behavior Methods (not just getters/setters):
- addNode(nodeInfo) → NodeJoined event + may trigger rebalance
- removeNode(nodeID) → NodeLeft event + trigger rebalance
- markNodeFailed(nodeID) → NodeFailed event + trigger rebalance
- assignShards(shardMap) → ShardAssigned event (leader only)
- rebalanceTopology() → ShardMigrated events (leader only)
Aggregate 2: LeadershipLease (Root)
Invariants enforced:
- Invariant 1: Single leader per term
- Invariant 5: Leader is an active node
Entities:
- LeadershipLease (root): Represents the current leadership claim
- leaderID: String - which node holds the lease
- term: uint64 - monotonically increasing term number
- expiresAt: Timestamp - when this lease expires (now + LeaderLeaseTimeout)
- startedAt: Timestamp - when the leader was elected
Value Objects:
- None (all properties immutable; lease is replaced, not mutated)
Lifecycle:
- Created when: A node wins election and creates the "leader" key in NATS KV
- Destroyed when: Lease expires and is not renewed, or leader resigns
- Transitions:
- TryBecomeLeader → attempt atomic create; if fails, maybe claim expired lease
- RenewLease (every 3s) → atomically update expiresAt to now + 10s
- LeaseExpired (detected) → remove from KV, allow new election
- NodeFailed (detected) → if failed node is leader, expiration will trigger new election
Behavior Methods:
- tryAcquire(nodeID) → LeaderElected event (if it succeeds)
- renewLease(nodeID) → LeadershipRenewed event (internal, not exposed as command)
- isExpired() → Boolean
- isLeader(nodeID) → Boolean
Invariant enforcement mechanism:
- Atomic operations in NATS KV: Only one node can successfully create "leader" key (or update with correct revision), ensuring single leader per term.
- Lease expiration: If leader crashes without renewing, lease expires after 10s, allowing another node to claim it.
- Revision-based updates: Update to lease must include correct revision (optimistic concurrency control), preventing stale leader from renewing.
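A sketch of the acquire path against NATS KV, assuming illustrative bucket/key names and a simplified lease payload (the real LeaderElection code may encode the lease differently):
```go
package cluster

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

type leasePayload struct {
	LeaderID  string    `json:"leader_id"`
	Term      uint64    `json:"term"`
	ExpiresAt time.Time `json:"expires_at"`
}

// tryAcquire succeeds for at most one node per term: Create fails if the
// "leader" key already exists, and Update is a compare-and-swap on the
// revision observed when the (expired) lease was read.
func tryAcquire(kv nats.KeyValue, nodeID string, leaseTimeout time.Duration) (bool, error) {
	now := time.Now()
	claim := func(term uint64) []byte {
		b, _ := json.Marshal(leasePayload{LeaderID: nodeID, Term: term, ExpiresAt: now.Add(leaseTimeout)})
		return b
	}

	// First election for this bucket: atomic create.
	if _, err := kv.Create("leader", claim(1)); err == nil {
		return true, nil
	}

	// Key exists: read the current lease and only contest it if it has expired.
	entry, err := kv.Get("leader")
	if err != nil {
		return false, err
	}
	var current leasePayload
	if err := json.Unmarshal(entry.Value(), &current); err != nil {
		return false, err
	}
	if now.Before(current.ExpiresAt) {
		return false, nil // a live leader still holds the lease
	}

	// Claim the expired lease; if another node updated first, the revision
	// check fails and this node simply stays a follower (Invariant 1 holds).
	if _, err := kv.Update("leader", claim(current.Term+1), entry.Revision()); err != nil {
		return false, nil
	}
	return true, nil
}
```
Renewal follows the same pattern: the current leader calls Update with the revision from its last successful write, so a stale leader whose key has been replaced cannot renew.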
Aggregate 3: ShardAssignment (Root)
Invariants enforced:
- Invariant 2: All active shards have owners
- Invariant 3: Shards assigned only to healthy nodes
Entities:
- ShardAssignment (root): Maps shards to their owning nodes
- version: uint64 - incremented on each change; enables version comparison for replication
- assignments: Map[ShardID → []NodeID] - shard to primary + replica nodes
- nodes: Map[NodeID → NodeInfo] - snapshot of active nodes at assignment time
- updateTime: Timestamp
Value Objects:
- None (structure is just data; immutability via replacement)
Lifecycle:
- Created when: Cluster initializes (empty assignments)
- Updated when: Leader calls rebalanceTopology() → new ShardAssignment created (old one replaced)
- Destroyed when: Cluster shuts down
Behavior Methods:
- assignShard(shardID, nodeList) → validates that all nodes in nodeList are active
- rebalanceFromTopology(topology, strategy) → calls the strategy to compute new assignments
- validateAssignments() → checks all shards assigned, all owners healthy
- getAssignmentsForNode(nodeID) → []ShardID
Validation Rules:
- All nodes in assignment must be in nodes map with status=Active
- All shard IDs in [0, ShardCount) must appear in assignments (no orphans)
- Replication factor respected (each shard has 1..ReplicationFactor owners)
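These rules translate into a straightforward validation pass. The sketch below assumes Go map shapes for the assignment table and the NodeInfo/NodeStatusActive types described under Value Objects; shardCount and replicationFactor would come from cluster configuration (requires the fmt import):
```go
// Sketch of the validation rules above; not the existing ValidateAssignments.
func validateAssignments(
	assignments map[int][]string,
	nodes map[string]*NodeInfo,
	shardCount, replicationFactor int,
) error {
	for shardID := 0; shardID < shardCount; shardID++ {
		owners, ok := assignments[shardID]
		if !ok || len(owners) == 0 {
			return fmt.Errorf("shard %d has no owner (violates Invariant 2)", shardID)
		}
		if len(owners) > replicationFactor {
			return fmt.Errorf("shard %d has %d owners, replication factor is %d",
				shardID, len(owners), replicationFactor)
		}
		for _, nodeID := range owners {
			node, exists := nodes[nodeID]
			if !exists || node.Status != NodeStatusActive {
				return fmt.Errorf("shard %d assigned to missing or unhealthy node %s (violates Invariant 3)",
					shardID, nodeID)
			}
		}
	}
	return nil
}
```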
Commands
Commands represent user or system intents to change the cluster state. Only aggregates handle commands.
Command 1: JoinCluster
- Aggregate: Cluster
- Actor: Node joining (or discovery service announcing)
- Input: nodeID, address, port, capacity, metadata
- Validates:
- nodeID is not empty
- capacity > 0
- address is reachable (optional)
- Invariants enforced: Invariant 2 (rebalance if needed)
- Success: NodeJoined event published
- Failure: DuplicateNodeError (node already in cluster), ValidationError
Command 2: ElectLeader
- Aggregate: LeadershipLease
- Actor: Node attempting election (triggered periodically)
- Input: nodeID, currentTerm
- Validates:
- nodeID matches a current cluster member (in Cluster.nodes, status=Active)
- Can attempt if no current leader OR lease is expired
- Invariants enforced: Invariant 1, 5
- Success: LeaderElected event published (if atomic create succeeds); LeadershipRenewed (if claim expired lease)
- Failure: LeaderElectionFailed (atomic operation lost), NodeNotHealthy
Command 3: RenewLeadership
- Aggregate: LeadershipLease
- Actor: Current leader (triggered every 3s)
- Input: nodeID, currentTerm
- Validates:
- nodeID is current leader
- term matches current term
- node status is Active (else fail and lose leadership)
- Invariants enforced: Invariant 1, 5
- Success: LeadershipRenewed (internal event, triggers heartbeat log entry)
- Failure: LeadershipLost (node is no longer healthy or lost atomic update race)
Command 4: MarkNodeFailed
- Aggregate: Cluster
- Actor: System (monitoring service) or leader (if heartbeat misses)
- Input: nodeID, reason
- Validates:
- nodeID exists in cluster
- node is currently Active (don't re-fail already-failed nodes)
- Invariants enforced: Invariant 2, 3, 5 (rebalance to move shards off failed node)
- Success: NodeFailed event published; RebalanceTriggered (if leader)
- Failure: NodeNotFound, NodeAlreadyFailed
Command 5: AssignShards
- Aggregate: ShardAssignment (+ reads Cluster topology)
- Actor: Leader (only leader can assign)
- Input: nodeID (must be leader), newAssignments (Map[ShardID → []NodeID])
- Validates:
- nodeID is current leader
- all nodes in assignments are Active
- all shards in [0, ShardCount) are covered
- replication factor respected
- Invariants enforced: Invariant 2, 3 (assignment only valid if all nodes healthy)
- Success: ShardAssigned event published with new ShardMap
- Failure: NotLeader, InvalidAssignment (node not found), UnhealthyNode, IncompleteAssignment (missing shards)
Command 6: RebalanceShards
- Aggregate: Cluster (orchestrates) + ShardAssignment (executes)
- Actor: Leader (triggered by node changes or periodic check)
- Input: nodeID (must be leader), strategy (optional placement strategy)
- Validates:
- nodeID is current leader
- cluster has active nodes
- Invariants enforced: Invariant 2 (all shards still assigned), Invariant 3 (only to healthy nodes)
- Success: RebalancingCompleted event; zero or more ShardMigrated events (one per shard moved)
- Failure: NotLeader, NoActiveNodes, RebalancingFailed (unexpected topology change mid-rebalance)
Events
Events represent facts that happened. They are published after successful command execution.
Event 1: NodeJoined
- Triggered by: JoinCluster command
- Aggregate: Cluster
- Data: nodeID, address, port, capacity, metadata, timestamp
- Consumed by:
- Cluster (adds node to ring)
- Policies (RebalancingTriggerPolicy)
- Semantics: A new node entered the cluster and is ready to host actors
- Immutability: Once published, never changes
Event 2: NodeDiscovered
- Triggered by: NodeDiscovery announces node via NATS pub/sub (implicit)
- Aggregate: Cluster (discovery feeds into cluster topology)
- Data: nodeID, nodeInfo, timestamp
- Consumed by: Cluster topology sync
- Semantics: Node became visible to the cluster; may be new or rediscovered after network partition
- Note: Implicit event; not explicitly commanded, but captured in domain language
Event 3: LeaderElected
- Triggered by: ElectLeader command (atomic KV create succeeds) or ReclaimExpiredLease
- Aggregate: LeadershipLease
- Data: leaderID, term, expiresAt, startedAt, timestamp
- Consumed by:
- Cluster (updates currentLeaderID)
- Policies (LeaderElectionCompletePolicy)
- Semantics: A node has acquired leadership for the given term
- Guarantee: At most one node can succeed in creating this event per term
Event 4: LeadershipLost
- Triggered by: Lease expires (detected by monitorLeadership watcher) or RenewLeadership fails
- Aggregate: LeadershipLease
- Data: leaderID, term, reason (LeaseExpired, FailedToRenew, NodeFailed), timestamp
- Consumed by:
- Cluster (clears currentLeaderID)
- Policies (trigger new election)
- Semantics: The leader is no longer valid and coordination authority is vacant
- Trigger: No renewal received for 10s, or atomic update fails
Event 5: LeadershipRenewed
- Triggered by: RenewLeadership command (succeeds every 3s)
- Aggregate: LeadershipLease
- Data: leaderID, term, expiresAt, timestamp
- Consumed by: Internal use (heartbeat signal); not published to other contexts
- Semantics: Leader is alive and ready to coordinate
- Frequency: Every 3s per leader
Event 6: ShardAssigned
- Triggered by: AssignShards command or RebalanceShards command
- Aggregate: ShardAssignment
- Data: shardID, nodeIDs (primary + replicas), version, timestamp
- Consumed by:
- ShardManager (updates routing)
- Policies (ShardOwnershipPolicy)
- Other contexts (if they subscribe to shard topology changes)
- Semantics: Shard N is now owned by these nodes (primary first)
- Bulk event: Often published multiple times in one rebalance operation
Event 7: NodeFailed
- Triggered by: MarkNodeFailed command
- Aggregate: Cluster
- Data: nodeID, reason (HeartbeatTimeout, AdminMarked, etc.), timestamp
- Consumed by:
- Cluster (removes from active pool)
- Policies (RebalancingTriggerPolicy, actor migration)
- Other contexts (may need to relocate actors)
- Semantics: Node is unresponsive and should be treated as offline
- Detection: heartbeat miss after 90s, or explicit admin action
Event 8: NodeLeft
- Triggered by: Node gracefully shuts down (announceNode(NodeLeft)) or MarkNodeFailed (for draining)
- Aggregate: Cluster
- Data: nodeID, reason (GracefulShutdown, AdminRemoved, etc.), timestamp
- Consumed by: Policies (same as NodeFailed, triggers rebalance)
- Semantics: Node is intentionally leaving and will not rejoin
- Difference from NodeFailed: Intent signal; failed nodes might rejoin after network partition heals
Event 9: ShardMigrated
- Triggered by: RebalanceShards command (one event per shard reassigned)
- Aggregate: Cluster
- Data: shardID, fromNodes (old owners), toNodes (new owners), timestamp
- Consumed by:
- Local runtime (via ShardManager; triggers actor migration)
- Other contexts (if they track actor locations)
- Semantics: A shard's ownership changed; actors on that shard may need to migrate
- Migration strategy: Application owns how to move actors (via ActorMigration); cluster just signals the change
Event 10: RebalancingTriggered
- Triggered by: RebalanceShards command (start)
- Aggregate: Cluster
- Data: leaderID, reason (NodeJoined, NodeFailed, Manual), timestamp
- Consumed by: Monitoring/debugging
- Semantics: Leader has initiated a rebalancing cycle
- Note: Informational; subsequent ShardMigrated events describe the actual changes
Event 11: RebalancingCompleted
- Triggered by: RebalanceShards command (finish)
- Aggregate: Cluster
- Data: leaderID, completedAt, migrationsCount, timestamp
- Consumed by: Monitoring/debugging, other contexts may wait for this before proceeding
- Semantics: All shard migrations have been assigned; this does not mean the actors themselves have finished migrating
- Note: ShardMigrated is the signal to move actors; this is the coordination signal
Policies
Policies are automated reactions to events. They connect events to commands across aggregates and contexts.
Policy 1: Single Leader Policy
- Trigger: When LeadershipLost event
- Action: Any node can attempt ElectLeader command
- Context: Only one will succeed due to atomic NATS KV operation
- Rationale: Ensure leadership is re-established quickly after vacancy
- Implementation: electionLoop in LeaderElection runs every 2s, calls tryBecomeLeader if not leader
Policy 2: Lease Renewal Policy
- Trigger: Periodic timer (every 3s)
- Action: If IsLeader, execute RenewLeadership command
- Context: Heartbeat mechanism to prove leader is alive
- Rationale: Detect leader failure via lease expiration after 10s inactivity
- Implementation: leaseRenewalLoop in LeaderElection; failure triggers loseLeadership()
Policy 3: Lease Expiration Policy
- Trigger: When LeadershipLease.expiresAt < now (detected by monitorLeadership watcher)
- Action: Clear currentLeader, publish LeadershipLost, trigger SingleLeaderPolicy
- Context: Automatic failover when leader stops renewing
- Rationale: Prevent stale leaders from coordinating during network partitions
- Implementation: monitorLeadership watches "leader" KV key; if deleted or expired, calls handleLeadershipUpdate
Policy 4: Node Heartbeat Policy
- Trigger: Periodic timer (every 30s) - NodeDiscovery announces
- Action: Publish node status via NATS "aether.discovery" subject
- Context: Membership discovery; all nodes broadcast presence
- Rationale: Other nodes learn topology via heartbeats; leader detects failures via absence
- Implementation: NodeDiscovery.Start() runs heartbeat ticker
Policy 5: Node Failure Detection Policy
- Trigger: When NodeUpdate received with LastSeen > 90s ago
- Action: Mark node as NodeStatusFailed; if leader, trigger RebalanceShards
- Context: Eventual failure detection (passive, via heartbeat miss)
- Rationale: Failed nodes may still hold shard assignments; rebalance moves shards to healthy nodes
- Implementation: handleNodeUpdate checks LastSeen and marks nodes failed; checkNodeHealth periodic check
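A sketch of the passive check, assuming a mutex-guarded nodes map on ClusterManager and a markNodeFailed helper that publishes NodeFailed and, on the leader, triggers rebalancing (field names follow the NodeInfo value object):
```go
// Sketch only: 90s = 3 missed 30s heartbeats, per the design decisions below.
const heartbeatTimeout = 90 * time.Second

func (cm *ClusterManager) checkNodeHealth(now time.Time) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	for nodeID, node := range cm.nodes {
		if node.Status != NodeStatusActive {
			continue // don't re-fail nodes that are already failed or draining
		}
		if now.Sub(node.LastSeen) > heartbeatTimeout {
			// markNodeFailed flips the status, publishes NodeFailed, and
			// triggers rebalancing if this node is the leader.
			cm.markNodeFailed(nodeID, "heartbeat timeout")
		}
	}
}
```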
Policy 6: Shard Rebalancing Trigger Policy
- Trigger: When NodeJoined, NodeLeft, or NodeFailed event
- Action: If leader, execute RebalanceShards command
- Context: Topology change → redistribute actors
- Rationale: New node should get load; failed node's shards must be reassigned
- Implementation: handleNodeUpdate calls triggerShardRebalancing if leader
Policy 7: Shard Ownership Enforcement Policy
- Trigger: When ShardAssigned event
- Action: Update local ShardMap; nodes use this for actor routing
- Context: All nodes must agree on shard ownership for routing consistency
- Rationale: Single source of truth (published by leader) prevents routing conflicts
- Implementation: ClusterManager receives ShardAssigned via NATS; updates shardMap
Policy 8: Shard Coverage Policy
- Trigger: Periodic check (every 5 min) or after NodeFailed
- Action: Validate all shards in [0, ShardCount) are assigned; if any missing, trigger RebalanceShards
- Context: Safety check to prevent shard orphaning
- Rationale: Ensure no actor can be born on an unassigned shard
- Implementation: rebalanceLoop calls triggerShardRebalancing with reason "periodic rebalance check"
Policy 9: Leader-Only Rebalancing Policy
- Trigger: RebalanceShards command
- Action: Validate nodeID is currentLeader before executing
- Context: Only leader can initiate topology changes
- Rationale: Prevent cascading rebalancing from multiple nodes; single coordinator
- Implementation: triggerShardRebalancing checks IsLeader() at start
Policy 10: Graceful Shutdown Policy
- Trigger: NodeDiscovery.Stop() called
- Action: Publish NodeLeft event
- Context: Signal that this node is intentionally leaving
- Rationale: Other nodes should rebalance shards away from this node; different from failure
- Implementation: Stop() calls announceNode(NodeLeft) before shutting down
Read Models
Read models project state for queries. They have no invariants and can be eventually consistent.
Read Model 1: GetClusterTopology
- Purpose: What nodes are currently in the cluster?
- Data:
- nodes: []NodeInfo (filtered to status=Active only)
- timestamp: When the snapshot was taken
- Source: Cluster.nodes, filtered by status != Failed
- Updated: After NodeJoined, NodeLeft, NodeFailed events
- Queryable by: nodeID, status, capacity, load
- Eventual consistency: Replica nodes lag leader by a few heartbeats
Read Model 2: GetLeader
- Purpose: Who is the current leader?
- Data:
- leaderID: Current leader node ID, or null if no leader
- term: Leadership term number
- expiresAt: When the current leadership lease expires
- confidence: "high" (just renewed), "medium" (recent), "low" (about to expire)
- Source: LeadershipLease
- Updated: After LeaderElected, LeadershipRenewed, LeadershipLost events
- Queryable by: leaderID, term, expiration time
- Eventual consistency: Non-leader nodes lag by up to 10s (lease timeout)
Read Model 3: GetShardAssignments
- Purpose: Where does each shard live?
- Data:
- shardID: Shard number
- primaryNode: Node ID (shardMap.Shards[shardID][0])
- replicaNodes: []NodeID (shardMap.Shards[shardID][1:])
- version: ShardMap version (for optimistic concurrency)
- Source: Cluster.shardMap
- Updated: After ShardAssigned, ShardMigrated events
- Queryable by: shardID, nodeID (which shards does node own?)
- Eventual consistency: Replicas lag leader by one NATS publish; consistent within a term
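A sketch of the node-centric query ("which shards does this node own?") over a ShardMap snapshot (the value object described below), assuming Go map shapes for Shards and the sort import:
```go
// Sketch only: projects a ShardMap snapshot into the list of shards a node owns.
func assignmentsForNode(sm *ShardMap, nodeID string) []int {
	var owned []int
	for shardID, owners := range sm.Shards {
		for _, owner := range owners {
			if owner == nodeID {
				owned = append(owned, shardID)
				break
			}
		}
	}
	sort.Ints(owned) // deterministic order for callers and tests
	return owned
}
```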
Read Model 4: GetNodeHealth
- Purpose: Is a given node healthy?
- Data:
- nodeID: Node identifier
- status: Active | Draining | Failed
- lastSeen: Last heartbeat timestamp
- downForSeconds: (now - lastSeen)
- Source: Cluster.nodes[nodeID]
- Updated: After NodeJoined, NodeUpdated, NodeFailed events
- Queryable by: nodeID, status threshold (e.g., "give me all failed nodes")
- Eventual consistency: Non-leader nodes lag by 30s (heartbeat interval)
Read Model 5: GetRebalancingStatus
- Purpose: Is rebalancing in progress? How many shards moved?
- Data:
- isRebalancing: Boolean
- startedAt: Timestamp
- reason: "node_joined" | "node_failed" | "periodic" | "manual"
- completedCount: Number of shards finished
- totalCount: Total shards to move
- Source: RebalancingTriggered, ShardMigrated, RebalancingCompleted events
- Updated: On rebalancing events
- Queryable by: current status, started within N seconds
- Eventual consistency: Replicas lag by one NATS publish
Value Objects
Value Object 1: NodeInfo
Represents a physical node in the cluster.
Fields:
- ID: string - unique identifier
- Address: string - IP or hostname
- Port: int - NATS port
- Status: NodeStatus enum (Active, Draining, Failed)
- Capacity: float64 - max load capacity
- Load: float64 - current load
- LastSeen: time.Time - last heartbeat
- Timestamp: time.Time - when created/updated
- Metadata: map[string]string - arbitrary tags (region, version, etc.)
- IsLeader: bool - is this node the leader?
- VMCount: int - how many actors on this node
- ShardIDs: []int - which shards are assigned
Equality: Two NodeInfos are equal if all fields match (identity-based for clustering purposes, but immutable)
Validation:
- ID non-empty
- Capacity > 0
- Status in {Active, Draining, Failed}
- Port in valid range [1, 65535]
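The rules above as a hypothetical Validate method; the status constant names are assumed to follow the NodeStatusActive/Draining/Failed pattern used elsewhere in this document (requires the errors and fmt imports):
```go
// Sketch only: value-object validation for NodeInfo.
func (n NodeInfo) Validate() error {
	if n.ID == "" {
		return errors.New("node ID must not be empty")
	}
	if n.Capacity <= 0 {
		return errors.New("capacity must be positive")
	}
	if n.Port < 1 || n.Port > 65535 {
		return fmt.Errorf("port %d outside [1, 65535]", n.Port)
	}
	switch n.Status {
	case NodeStatusActive, NodeStatusDraining, NodeStatusFailed:
		return nil
	default:
		return fmt.Errorf("unknown node status %v", n.Status)
	}
}
```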
Value Object 2: ShardMap
Represents the current shard-to-node assignment snapshot.
Fields:
- Version: uint64 - incremented on each change; used for optimistic concurrency
- Shards: Map[ShardID → []NodeID] - shard to [primary, replica1, replica2, ...]
- Nodes: Map[NodeID → NodeInfo] - snapshot of nodes known at assignment time
- UpdateTime: time.Time - when created
Equality: Two ShardMaps are equal if Version and Shards are equal (Nodes is metadata)
Validation:
- All shard IDs in [0, ShardCount)
- All node IDs in Shards exist in Nodes
- All nodes in Nodes have status=Active
- Replication factor respected (1 ≤ len(Shards[sid]) ≤ ReplicationFactor)
Immutability: ShardMap is never mutated; rebalancing creates a new ShardMap
Value Object 3: LeadershipLease
Represents a leader's claim on coordination authority.
Fields:
- LeaderID: string - node ID holding the lease
- Term: uint64 - monotonically increasing term number
- ExpiresAt: time.Time - when the lease is no longer valid
- StartedAt: time.Time - when the leader was elected
Equality: Two leases are equal if LeaderID, Term, and ExpiresAt match
Validation:
- LeaderID non-empty
- Term ≥ 0
- ExpiresAt > StartedAt
- ExpiresAt - StartedAt == LeaderLeaseTimeout
Lifecycle:
- Created: node wins election
- Renewed: every 3s, ExpiresAt extended
- Expired: if ExpiresAt < now and not renewed
- Replaced: next term when new leader elected
Value Object 4: Term
Represents a leadership term (could be extracted for clarity).
Fields:
- Number: uint64 - term counter
Semantics: Monotonically increasing; each new leader gets a higher term. Used to detect stale messages.
Code Analysis
Intended vs Actual: ClusterManager
Intended (from Domain Model):
- Root aggregate owning Cluster topology
- Enforces invariants: shard coverage, healthy node assignments, rebalancing triggers
- Commands: JoinCluster, MarkNodeFailed, RebalanceShards
- Events: NodeJoined, NodeFailed, ShardAssigned, ShardMigrated
Actual (from /cluster/manager.go):
- Partially aggregate-like: owns nodes, shardMap, hashRing
- Lacks explicit command methods: has handleClusterMessage() but not named commands like JoinCluster()
- Lacks explicit event publishing: updates state but doesn't publish domain events
- Invariant enforcement scattered: node failure detection lives in handleNodeUpdate(), with no central validation
- Missing behavior: shard assignment logic lives in ShardManager, not in the Cluster aggregate
Misalignment:
- Anemic aggregate: ClusterManager reads/writes state but doesn't enforce invariants or publish events
- Responsibility split: Cluster topology (Manager) vs shard assignment (ShardManager) vs leadership (LeaderElection) are not unified under one aggregate root
- No explicit commands: Node updates handled via generic message dispatcher, not domain-language commands
- No event sourcing: State changes don't produce events
Gaps:
- No JoinCluster command handler
- No MarkNodeFailed command handler (only handleNodeUpdate which detects failures)
- No explicit ShardAssigned/ShardMigrated events
- Rebalancing triggers exist (triggerShardRebalancing) but not as domain commands
Intended vs Actual: LeaderElection
Intended (from Domain Model):
- Root aggregate owning LeadershipLease invariant (single leader per term)
- Commands: ElectLeader, RenewLeadership
- Events: LeaderElected, LeadershipLost, LeadershipRenewed
Actual (from /cluster/leader.go):
- Correctly implements lease-based election with NATS KV
- Enforces single leader via atomic operations (create, update with revision)
- Has implicit command pattern (tryBecomeLeader, renewLease, resignLeadership)
- Has callbacks for leadership change, but no explicit event publishing
Alignment:
- Atomic operations correctly enforce Invariant 1 (single leader)
- Lease renewal every 3s enforces lease validity
- Lease expiration detected via watcher
- Leadership transitions (elected, lost) well-modeled
Gaps:
- Events not explicitly published; callbacks used instead
- No event sourcing (events should be recorded in event store, not just callbacks)
- No term-based validation (could reject stale messages with old term)
- Could be more explicit about LeaderElected event vs just callback
Intended vs Actual: ConsistentHashRing
Intended (from Domain Model):
- Used by ShardAssignment to compute which node owns which shard
- Policy: shards assigned via consistent hashing
- Minimizes reshuffling on node join/leave
Actual (from /cluster/hashring.go):
- Correctly implements consistent hash ring with virtual nodes
- AddNode/RemoveNode operations are clean
- GetNode(key) returns responsible node; used for actor placement
Alignment:
- Good separation of concerns (ring is utility, not aggregate)
- Virtual nodes (150 per node) reduce reshuffling on node change
- Immutable ring structure (recreated on changes)
Gaps:
- Not actively used by ShardAssignment (ShardManager has own hash logic)
- Could be used by RebalanceShards policy to compute initial assignments
- Currently more of a utility than a policy
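For illustration, a self-contained ring with virtual nodes showing why a join or leave only reshuffles roughly 1/N of the keys; the actual /cluster/hashring.go may differ in hash function, naming, and API:
```go
package cluster

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 150 // replicas per physical node, as noted above

// HashRing is an immutable snapshot: it is rebuilt when membership changes.
type HashRing struct {
	points []uint32          // sorted positions on the ring
	owner  map[uint32]string // ring position → node ID
}

func NewHashRing(nodeIDs []string) *HashRing {
	r := &HashRing{owner: make(map[uint32]string)}
	for _, id := range nodeIDs {
		for v := 0; v < virtualNodes; v++ {
			h := hashKey(fmt.Sprintf("%s#%d", id, v))
			r.points = append(r.points, h)
			r.owner[h] = id
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// GetNode returns the node owning the first ring position at or after the
// key's hash, wrapping around to the start of the ring.
func (r *HashRing) GetNode(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(j int) bool { return r.points[j] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func hashKey(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}
```
Because each node contributes 150 positions, removing a node only hands its arcs to the neighboring positions; keys mapped to other nodes are untouched.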
Intended vs Actual: ShardManager
Intended (from Domain Model):
- ShardAssignment aggregate managing shard-to-node mappings
- Commands: AssignShard, RebalanceShards (via PlacementStrategy)
- Enforces invariants: all shards assigned, only to healthy nodes
- Emits ShardAssigned events
Actual (from /cluster/shard.go):
- Owns ShardMap, but like ClusterManager, is more of a data holder than aggregate
- Has methods: AssignShard, RebalanceShards (delegates to PlacementStrategy)
- Lacks invariant validation (doesn't check if nodes are healthy)
- Lacks event publishing
Alignment:
- PlacementStrategy pattern allows different algorithms (good design)
- ConsistentHashPlacement exists but is stubbed
Gaps:
- ShardManager.RebalanceShards not integrated with ClusterManager's decision to rebalance
- No event publishing on shard changes
- Invariant validation needed: validate nodes in assignments are healthy
Intended vs Actual: NodeDiscovery
Intended (from Domain Model):
- Detects nodes via NATS heartbeats
- Publishes NodeJoined, NodeUpdated, NodeLeft events via announceNode
- Triggers policies (node failure detection, rebalancing)
Actual (from /cluster/discovery.go):
- Heartbeats every 30s via announceNode
- Subscribes to "aether.discovery" channel
- Publishes NodeUpdate messages, not domain events
Alignment:
- Heartbeat mechanism good; detected failure via 90s timeout in ClusterManager
- Message-based communication works for event bus
Gaps:
- NodeUpdate is not a domain event; should publish NodeJoined, NodeUpdated, NodeLeft as explicit events
- Could be clearer about lifecycle: Start announces NodeJoined, Stop announces NodeLeft
Intended vs Actual: DistributedVM
Intended (from Domain Model):
- Orchestrates all cluster components (discovery, election, coordination, sharding)
- Not itself an aggregate; more of a façade/orchestrator
Actual (from /cluster/distributed.go):
- Correctly orchestrates: discovery + cluster manager + sharding + local runtime
- DistributedVMRegistry provides VMRegistry interface to ClusterManager
- Good separation: doesn't force topology decisions on runtime
Alignment:
- Architecture clean; each component has clear responsibility
- Decoupling via interfaces (Runtime, VirtualMachine, VMProvider) is good
Gaps:
- No explicit orchestration logic: the Start method appears incomplete (only the first 100 lines were reviewed)
- Could coordinate startup order more explicitly
Refactoring Backlog
Refactoring 1: Extract Cluster Aggregate from ClusterManager
Current: ClusterManager is anemic; it only stores state
Target: ClusterManager becomes a true aggregate root enforcing invariants
Steps:
- Add explicit command methods to ClusterManager:
  - JoinCluster(nodeInfo NodeInfo) error
  - MarkNodeFailed(nodeID string) error
  - AssignShards(shardMap ShardMap) error
  - RebalanceTopology(reason string) error
- Each command:
- Validates preconditions
- Calls aggregate behavior (private methods)
- Publishes events
- Returns result
- Add event publishing:
- Create EventPublisher interface in ClusterManager
- Publish NodeJoined, NodeFailed, ShardAssigned, ShardMigrated events
- Events captured in event store (optional, or via NATS pub/sub)
Impact: Medium - changes ClusterManager interface but not external APIs yet
Priority: High - unblocks event-driven integration with other contexts
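A sketch of the target shape for one such command, assuming an illustrative EventPublisher abstraction, NATS subject name, and ClusterManager fields (mu, nodes, hashRing, events); none of these names are confirmed by the current code:
```go
// Sketch only: the command validates, mutates aggregate state, and publishes
// a domain event; the rebalancing trigger policy reacts to the event.
type EventPublisher interface {
	Publish(subject string, event any) error
}

type NodeJoined struct {
	NodeID    string            `json:"node_id"`
	Address   string            `json:"address"`
	Port      int               `json:"port"`
	Capacity  float64           `json:"capacity"`
	Metadata  map[string]string `json:"metadata"`
	Timestamp time.Time         `json:"timestamp"`
}

func (cm *ClusterManager) JoinCluster(info NodeInfo) error {
	// Validate preconditions (Command 1: JoinCluster).
	if info.ID == "" {
		return errors.New("node ID must not be empty")
	}
	if info.Capacity <= 0 {
		return errors.New("capacity must be positive")
	}

	cm.mu.Lock()
	if _, exists := cm.nodes[info.ID]; exists {
		cm.mu.Unlock()
		return fmt.Errorf("duplicate node: %s", info.ID)
	}
	cm.nodes[info.ID] = &info
	cm.hashRing.AddNode(info.ID)
	cm.mu.Unlock()

	// Publish the domain event before acting on it locally.
	evt := NodeJoined{NodeID: info.ID, Address: info.Address, Port: info.Port,
		Capacity: info.Capacity, Metadata: info.Metadata, Timestamp: time.Now()}
	if err := cm.events.Publish("aether.cluster.node_joined", evt); err != nil {
		return err
	}
	if cm.IsLeader() {
		cm.triggerShardRebalancing("node joined")
	}
	return nil
}
```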
Refactoring 2: Extract ShardAssignment Commands from RebalanceShards
Current: ShardManager.RebalanceShards delegates to PlacementStrategy; no validation of healthy nodes
Target: ShardAssignment commands validate invariants
Steps:
- Add to ShardManager:
  - AssignShards(assignments map[int][]string, nodes map[string]*NodeInfo) error
    - Validates: all nodes exist and are Active
    - Validates: all shards in [0, ShardCount) are assigned
    - Validates: replication factor respected
  - ValidateAssignments() error
- Move shard validation from coordinator to ShardManager
- Publish ShardAssigned events on successful assignment
- Update ClusterManager to call ShardManager.AssignShards instead of directly mutating ShardMap
Impact: Medium - clarifies the shard aggregate, adds validation
Priority: High - prevents invalid shard assignments
Refactoring 3: Publish Domain Events from LeaderElection
Current: LeaderElection uses callbacks; no event sourcing
Target: Explicit event publishing for leader changes
Steps:
- Add EventPublisher interface to LeaderElection
- In becomeLeader: publish LeaderElected event
- In loseLeadership: publish LeadershipLost event
- Optional: publish LeadershipRenewed on each renewal (for audit trail)
- Events include: leaderID, term, expiresAt, timestamp
- Consumers subscribe via NATS and react (no longer callbacks)
Impact: Medium - changes LeaderElection interface
Priority: Medium - improves observability and enables event sourcing
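A sketch of step 2, assuming an events publisher field on LeaderElection, a setLeader state helper, and an illustrative subject name (all hypothetical); the existing callback is kept during the transition:
```go
// Sketch only: publish LeaderElected as a domain event alongside the callback.
type LeaderElected struct {
	LeaderID  string    `json:"leader_id"`
	Term      uint64    `json:"term"`
	ExpiresAt time.Time `json:"expires_at"`
	Timestamp time.Time `json:"timestamp"`
}

func (le *LeaderElection) becomeLeader(term uint64, expiresAt time.Time) {
	le.setLeader(true) // assumed existing state transition

	evt := LeaderElected{LeaderID: le.nodeID, Term: term, ExpiresAt: expiresAt, Timestamp: time.Now()}
	if err := le.events.Publish("aether.cluster.leader_elected", evt); err != nil {
		// Publishing is best-effort: leadership is already secured by the
		// atomic KV write, so log and continue.
		le.logger.Printf("failed to publish LeaderElected: %v", err)
	}

	if le.onBecomeLeader != nil {
		le.onBecomeLeader(term) // keep the existing callback during migration
	}
}
```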
Refactoring 4: Unify Node Failure Detection and Rebalancing
Current: Node failure detected in handleNodeUpdate (90s timeout) plus periodic checkNodeHealth; rebalancing triggers spread across multiple methods
Target: Explicit MarkNodeFailed command, single rebalancing trigger
Steps:
- Create explicit MarkNodeFailed command handler
- Move node failure detection logic to ClusterManager.markNodeFailed()
- Consolidate node failure checks (remove duplicate in checkNodeHealth)
- Trigger rebalancing only from MarkNodeFailed, not scattered
- Add RebalancingTriggered event before starting rebalance
Impact: Low - refactoring existing logic, not new behavior
Priority: Medium - improves clarity
Refactoring 5: Implement PlacementStrategy for Rebalancing
Current: ConsistentHashPlacement.RebalanceShards is stubbed
Target: Real rebalancing logic using consistent hashing
Steps:
- Implement ConsistentHashPlacement.RebalanceShards:
- Input: current ShardMap, updated nodes (may have added/removed)
- Output: new ShardMap with shards redistributed via consistent hash
- Minimize movement: use virtual nodes to keep most shards in place
- Add RebalancingStrategy interface if other strategies needed (e.g., load-aware)
- Test: verify adding/removing node only reshuffles ~1/N shards
Impact: Medium - core rebalancing logic, affects all topology changes
Priority: High - currently rebalancing doesn't actually redistribute
Refactoring 6: Add Node Health Check Endpoint
Current: No way to query node health directly
Target: Read model for GetNodeHealth
Steps:
- Add method to ClusterManager:
  - GetNodeHealth(nodeID string) NodeHealthStatus
- Return: status, lastSeen, downForSeconds
- Expose via NATS request/reply (if distributed query needed)
- Test: verify timeout logic
Impact: Low - new query method, no state changes
Priority: Low - nice to have for monitoring
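A sketch of the query, with NodeHealthStatus as an illustrative read-model struct (not an existing type) and the mutex/nodes fields on ClusterManager assumed:
```go
// Sketch only: read model projecting node health from the in-memory topology.
type NodeHealthStatus struct {
	NodeID         string
	Status         NodeStatus
	LastSeen       time.Time
	DownForSeconds float64
}

func (cm *ClusterManager) GetNodeHealth(nodeID string) (NodeHealthStatus, error) {
	cm.mu.RLock()
	defer cm.mu.RUnlock()

	node, ok := cm.nodes[nodeID]
	if !ok {
		return NodeHealthStatus{}, fmt.Errorf("node %s not found", nodeID)
	}
	return NodeHealthStatus{
		NodeID:         node.ID,
		Status:         node.Status,
		LastSeen:       node.LastSeen,
		DownForSeconds: time.Since(node.LastSeen).Seconds(),
	}, nil
}
```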
Refactoring 7: Add Shard Migration Tracking
Current: ShardMigrated event is published, but migration progress is not tracked
Target: ActorMigration status tracking and completion callback
Steps:
- Add MigrationTracker in cluster package
- On ShardMigrated event: create migration record (pending)
- Application reports migration progress (in_progress, completed, failed)
- On completion: remove from tracker
- Rebalancing can wait for migrations to complete before declaring rebalance done
Impact: High - affects how rebalancing coordinates with the application
Priority: Medium - improves robustness (don't rebalance while migrations are in flight)
Testing Strategy
Unit Tests
LeaderElection invariant tests:
- Only one node can successfully create "leader" key → test atomic create succeeds once, fails second time
- Lease expiration triggers new election → create expired lease, verify election succeeds
- Lease renewal extends expiry → create lease, renew, verify new expiry is ~10s from now
- Stale leader can't renew → mark node failed, verify renewal fails
Cluster topology invariant tests:
- NodeJoined adds to hashRing → call addNode, verify GetNode routes consistently
- NodeFailed triggers rebalance → call markNodeFailed, verify rebalance triggered
- Shard coverage validated → rebalance with 100 nodes, 1024 shards, verify all shards assigned
- Only healthy nodes get shards → assign to failed node, verify rejected
ShardManager invariant tests:
- AssignShards validates node health → assign to failed node, verify error
- RebalanceShards covers all shards → simulate topology change, verify no orphans
- Virtual nodes minimize reshuffling → add node, verify < 1/N shards move
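A sketch of the reshuffling test, written against the illustrative HashRing shown earlier in this document (the real ring would work the same way given an equivalent constructor and GetNode method):
```go
package cluster

import (
	"fmt"
	"testing"
)

func TestAddNodeReshufflesFewShards(t *testing.T) {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	before := NewHashRing(nodes)
	after := NewHashRing(append(nodes, "node-6"))

	const shardCount = 1024
	moved := 0
	for shard := 0; shard < shardCount; shard++ {
		key := fmt.Sprintf("shard-%d", shard)
		if before.GetNode(key) != after.GetNode(key) {
			moved++
		}
	}

	// With consistent hashing, adding a 6th node should move roughly 1/6 of
	// the shards; allow generous slack for hash variance.
	if moved > shardCount/3 {
		t.Fatalf("expected roughly %d shards to move, got %d", shardCount/6, moved)
	}
}
```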
Integration Tests
Single leader election:
- Create 3 cluster nodes
- Verify exactly one becomes leader
- Stop leader
- Verify new leader elected within 10s
- Test: leadership term increments
Node failure and recovery:
- Create 5-node cluster with 100 shards
- Mark node-2 failed
- Verify shards reassigned from node-2 to others
- Verify node-3 doesn't become unreasonably overloaded
- Restart node-2
- Verify shards rebalanced back
Graceful shutdown:
- Create 3-node cluster
- Gracefully stop node-1 (announces NodeLeft)
- Verify no 90s timeout; rebalancing happens immediately
- Compare to failure case (90s delay)
Split-brain recovery:
- Create 3-node cluster: [A(leader), B, C]
- Partition network: A isolated, B+C connected
- Verify A loses leadership after 10s
- Verify B or C becomes leader
- Heal partition
- Verify single leader, no conflicts (A didn't try to be leader again)
Rebalancing under load:
- Create 5-node cluster, 100 shards, with actors running
- Add node-6
- Verify actors migrated off other nodes to node-6
- No actors are orphaned (all still reachable)
- Measure: reshuffled < 1/5 of shards
Chaos Testing
- Leader failure mid-rebalance → verify rebalancing resumed by new leader
- Network partition (leader isolated) → verify quorum (or lease) ensures no split-brain
- Cascading failures → 5 nodes, fail 3 at once, verify cluster stabilizes
- High churn → nodes join/leave rapidly, verify topology converges
Boundary Conditions and Limitations
Design Decisions
Why lease-based election instead of Raft?
- Simpler to implement and reason about
- Detect failure in 10s (acceptable for coordination)
- Risk: split-brain if a network partition persists > 10s and both partitions have nodes (mitigation: lease renewal goes through NATS KV, and only the partition that can still reach NATS can renew or claim the lease)
Why leader-only rebalancing?
- Prevent cascading rebalancing decisions
- Single source of truth (leader decides topology)
- Risk: leader bottleneck if rebalancing is expensive (mitigation: leader can delegate to algorithms, not compute itself)
Why consistent hashing instead of load-balancing?
- Minimize shard movement on topology change (good for actor locality)
- Deterministic without central state (nodes can independently compute assignments)
- Risk: load imbalance if actors heavily skewed (mitigation: application can use custom PlacementStrategy)
Why 90s failure detection timeout?
- 3 heartbeats missed (30s * 3) before declaring failure
- Allows for some network jitter without false positives
- Risk: slow failure detection (mitigation: application can force MarkNodeFailed if it detects failure faster)
Assumptions
- NATS cluster is available: If NATS is down, cluster can't communicate (no failover without NATS)
- Clocks are reasonably synchronized: Lease expiration depends on wall clock; major clock skew can break election
- Network partitions are rare: Split-brain only possible if partition > 10s and leader isolated
- Rebalancing is not time-critical: 5-min periodic check is default; no SLA on shard assignment latency
Known Gaps
- No quorum-based election: Single leader with lease; could add quorum for stronger consistency (Raft-like)
- No actor migration semantics: Who actually moves actors? Cluster signals ShardMigrated, but application must handle
- No topology versioning: ShardMap has version, but no way to detect if a node has an outdated topology
- No leader handoff during rebalancing: If leader fails mid-rebalance, new leader might redo already-started migrations
- No split-brain detection: Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but cluster doesn't enforce it)
Alignment with Product Vision
Primitives Over Frameworks:
- Cluster Coordination provides primitives (leader election, shard assignment), not a complete framework
- Application owns actor migration strategy (via ShardManager PlacementStrategy)
- Application owns failure response (can custom-implement node monitoring)
NATS-Native:
- Leader election uses NATS KV for atomic operations
- Node discovery uses NATS pub/sub for heartbeats
- Shard topology can be published via NATS events
Event-Sourced:
- All topology changes produce events (NodeJoined, NodeFailed, ShardAssigned, ShardMigrated)
- Events enable audit trail and replay (who owns which shard when?)
Resource Conscious:
- Minimal overhead: consistent hashing avoids per-node state explosion
- Lease-based election lighter than Raft (no log replication)
- Virtual nodes (150) on modest hardware
References
- Lease-based election: Inspired by Chubby, Google's lock service
- Consistent hashing: Karger et al., "Consistent Hashing and Random Trees"
- Virtual nodes: Reduces reshuffling on topology change (Dynamo, Cassandra pattern)
- NATS KV: Used for atomicity; alternatives: etcd, Consul (but less NATS-native)