# Cluster Coordination: Architecture Reference ## High-Level Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ Aether Cluster Runtime │ ├─────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │ DistributedVM (Orchestrator - not an aggregate) │ │ │ │ ├─ Local Runtime (executes actors) │ │ │ │ ├─ NodeDiscovery (heartbeat → cluster awareness) │ │ │ │ ├─ ClusterManager (Cluster aggregate root) │ │ │ │ │ ├─ nodes: Map[ID → NodeInfo] │ │ │ │ │ ├─ shardMap: ShardMap (current assignments) │ │ │ │ │ ├─ hashRing: ConsistentHashRing (util) │ │ │ │ │ └─ election: LeaderElection │ │ │ │ └─ ShardManager (ShardAssignment aggregate) │ │ │ │ ├─ shardCount: int │ │ │ │ ├─ shardMap: ShardMap │ │ │ │ └─ placement: PlacementStrategy │ │ │ └──────────────────────────────────────────────────────┘ │ │ │ NATS │ │ ▼ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │ NATS Cluster │ │ │ │ ├─ Subject: aether.discovery (heartbeats) │ │ │ │ ├─ Subject: aether.cluster.* (messages) │ │ │ │ └─ KeyValue: aether-leader-election (lease) │ │ │ └──────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────┘ ``` --- ## Aggregate Boundaries ### Aggregate 1: Cluster (Root) Owns node topology, shard assignments, and rebalancing decisions. ``` Cluster Aggregate ├─ Entities │ └─ Cluster (root) │ ├─ nodes: Map[NodeID → NodeInfo] │ ├─ shardMap: ShardMap │ ├─ hashRing: ConsistentHashRing │ └─ currentLeaderID: string │ ├─ Commands │ ├─ JoinCluster(nodeInfo) │ ├─ MarkNodeFailed(nodeID) │ ├─ AssignShards(shardMap) │ └─ RebalanceShards(reason) │ ├─ Events │ ├─ NodeJoined │ ├─ NodeFailed │ ├─ NodeLeft │ ├─ ShardAssigned │ ├─ ShardMigrated │ └─ RebalancingTriggered │ ├─ Invariants Enforced │ ├─ I2: All active shards have owners │ ├─ I3: Shards only on healthy nodes │ └─ I4: Assignments stable during lease │ └─ Code Location: ClusterManager (cluster/manager.go) ``` ### Aggregate 2: LeadershipLease (Root) Owns leadership claim and ensures single leader per term. ``` LeadershipLease Aggregate ├─ Entities │ └─ LeadershipLease (root) │ ├─ leaderID: string │ ├─ term: uint64 │ ├─ expiresAt: time.Time │ └─ startedAt: time.Time │ ├─ Commands │ ├─ ElectLeader(nodeID) │ └─ RenewLeadership(nodeID) │ ├─ Events │ ├─ LeaderElected │ ├─ LeadershipRenewed │ └─ LeadershipLost │ ├─ Invariants Enforced │ ├─ I1: Single leader per term │ └─ I5: Leader is active node │ └─ Code Location: LeaderElection (cluster/leader.go) ``` ### Aggregate 3: ShardAssignment (Root) Owns shard-to-node mappings and validates assignments. ``` ShardAssignment Aggregate ├─ Entities │ └─ ShardAssignment (root) │ ├─ version: uint64 │ ├─ assignments: Map[ShardID → []NodeID] │ ├─ nodes: Map[NodeID → NodeInfo] │ └─ updateTime: time.Time │ ├─ Commands │ ├─ AssignShard(shardID, nodeList) │ └─ RebalanceFromTopology(nodes) │ ├─ Events │ ├─ ShardAssigned │ └─ ShardMigrated │ ├─ Invariants Enforced │ ├─ I2: All shards assigned │ └─ I3: Only healthy nodes │ └─ Code Location: ShardManager (cluster/shard.go) ``` --- ## Command Flow Diagrams ### Scenario 1: Node Joins Cluster ``` ┌─────────┐ NodeJoined ┌──────────────────┐ │New Node │─────────────▶│ClusterManager │ └─────────┘ │.JoinCluster() │ └────────┬─────────┘ │ ┌─────────────┼─────────────┐ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ │Validate │ │Update │ │Publish │ │ID unique │ │topology │ │NodeJoined│ │Capacity>0 hashRing │ │event │ └────┬─────┘ └──────────┘ └──────────┘ │ ┌────▼────────────────────────┐ │Is this node leader? │ │If yes: trigger rebalance │ └─────────────────────────────┘ │ ┌───────────┴───────────┐ ▼ ▼ ┌──────────────────┐ ┌──────────────────┐ │RebalanceShards │ │(nothing) │ │ command │ │ │ └──────────────────┘ └──────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ConsistentHashPlacement │ │ .RebalanceShards() │ │ (compute new assignments) │ └────────────┬─────────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ShardManager.AssignShards() │ │ (validate & apply) │ └────────────┬─────────────────────┘ │ ┌────┴──────────────────────┐ ▼ ▼ ┌──────────────┐ ┌─────────────────┐ │For each │ │Publish │ │shard moved │ │ShardMigrated │ │ │ │event per shard │ └──────────────┘ └─────────────────┘ ``` ### Scenario 2: Node Failure Detected ``` ┌──────────────────────┐ │Heartbeat timeout │ │(LastSeen > 90s) │ └──────────┬───────────┘ │ ▼ ┌──────────────────────────────┐ │ClusterManager │ │.MarkNodeFailed() │ │ ├─ Mark status=Failed │ │ ├─ Remove from hashRing │ │ └─ Publish NodeFailed event │ └────────────┬─────────────────┘ │ ┌────────▼────────────┐ │Is this node leader? │ └────────┬────────────┘ │ ┌────────┴─────────────────┐ │ YES │ NO ▼ ▼ ┌──────────────┐ ┌──────────────────┐ │Trigger │ │(nothing) │ │Rebalance │ │ │ └──────────────┘ └──────────────────┘ │ └─▶ [Same as Scenario 1 from RebalanceShards] ``` ### Scenario 3: Leader Election (Implicit) ``` ┌─────────────────────────────────────┐ │All nodes: electionLoop runs every 2s│ └────────────┬────────────────────────┘ │ ┌────────▼────────────┐ │Am I leader? │ └────────┬────────────┘ │ ┌────────┴──────────────────────────┐ │ YES │ NO ▼ ▼ ┌──────────────┐ ┌─────────────────────────┐ │Do nothing │ │Should try election? │ │(already │ │ ├─ No leader exists │ │leading) │ │ ├─ Lease expired │ └──────────────┘ │ └─ (other conditions) │ └────────┬────────────────┘ │ ┌─────────▼──────────┐ │try AtomicCreate │ │"leader" key in KV │ └────────┬───────────┘ │ ┌─────────────┴──────────────┐ │ SUCCESS │ FAILED ▼ ▼ ┌──────────────────┐ ┌──────────────────┐ │Became Leader! │ │Try claim expired │ │Publish │ │lease; if success,│ │LeaderElected │ │become leader │ └──────────────────┘ │Else: stay on │ │ │bench │ ▼ └──────────────────┘ ┌──────────────────┐ │Start lease │ │renewal loop │ │(every 3s) │ └──────────────────┘ ``` --- ## Decision Trees ### Decision 1: Is Node Healthy? ``` Query: Is Node X healthy? ┌─────────────────────────────────────┐ │Get node status from Cluster.nodes │ └────────────┬────────────────────────┘ │ ┌────────▼────────────────┐ │Check node.Status field │ └────────┬────────────────┘ │ ┌────────┴───────────────┬─────────────────────┬──────────────┐ │ │ │ │ ▼ ▼ ▼ ▼ ┌────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────┐ │Active │ │Draining │ │Failed │ │Unknown │ ├────────┤ ├────────────┤ ├──────────────┤ ├─────────┤ │✓Healthy│ │⚠ Draining │ │✗ Unhealthy │ │✗ Error │ │Can host│ │Should not │ │Don't use for │ │ │ │shards │ │get new │ │sharding │ │ │ └────────┘ │shards, but │ │Delete shards │ └─────────┘ │existing OK │ │ASAP │ └────────────┘ └──────────────┘ ``` ### Decision 2: Should This Node Rebalance Shards? ``` Command: RebalanceShards(nodeID, reason) ┌──────────────────────────────────┐ │Is nodeID the current leader? │ └────────┬─────────────────────────┘ │ ┌────┴──────────────────┐ │ YES │ NO ▼ ▼ ┌────────────┐ ┌──────────────────────┐ │Continue │ │REJECT: NotLeader │ │rebalancing │ │ │ └─────┬──────┘ │Only leader can │ │ │initiate rebalancing │ │ └──────────────────────┘ │ ▼ ┌─────────────────────────────────────┐ │Are there active nodes? │ └────────┬────────────────────────────┘ │ ┌────┴──────────────────┐ │ YES │ NO ▼ ▼ ┌────────────┐ ┌──────────────────────┐ │Continue │ │REJECT: NoActiveNodes │ │rebalancing │ │ │ └─────┬──────┘ │Can't assign shards │ │ │with no healthy nodes │ │ └──────────────────────┘ │ ▼ ┌──────────────────────────────────────────┐ │Call PlacementStrategy.RebalanceShards() │ │ (compute new assignments) │ └────────┬─────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────┐ │Call ShardManager.AssignShards() │ │ (validate & apply new assignments) │ └────────┬─────────────────────────────────┘ │ ┌────┴──────────────────┐ │ SUCCESS │ FAILURE ▼ ▼ ┌────────────┐ ┌──────────────────────┐ │Publish │ │Publish │ │Shard │ │RebalancingFailed │ │Migrated │ │event │ │events │ │ │ │ │ │Log error, backoff │ │Publish │ │try again in 5 min │ │Rebalancing │ │ │ │Completed │ │ │ └────────────┘ └──────────────────────┘ ``` ### Decision 3: Can We Assign This Shard to This Node? ``` Command: AssignShard(shardID, nodeID) ┌──────────────────────────────────┐ │Is nodeID in Cluster.nodes? │ └────────┬─────────────────────────┘ │ ┌────┴──────────────────┐ │ YES │ NO ▼ ▼ ┌────────────┐ ┌──────────────────────┐ │Continue │ │REJECT: NodeNotFound │ │assignment │ │ │ └─────┬──────┘ │Can't assign shard │ │ │to non-existent node │ │ └──────────────────────┘ │ ▼ ┌──────────────────────────────────────┐ │Check node.Status │ └────────┬─────────────────────────────┘ │ ┌────┴──────────────────┐ │Active or Draining │ Failed ▼ ▼ ┌────────────┐ ┌──────────────────────┐ │Continue │ │REJECT: UnhealthyNode │ │assignment │ │ │ └─────┬──────┘ │Can't assign to │ │ │failed node; it can't │ │ │execute shards │ │ └──────────────────────┘ │ ▼ ┌──────────────────────────────────────┐ │Check replication factor │ │ (existing nodes < replication limit?)│ └────────┬─────────────────────────────┘ │ ┌────┴──────────────────┐ │ YES │ NO ▼ ▼ ┌────────────┐ ┌──────────────────────┐ │ACCEPT │ │REJECT: TooManyReplicas │Add node to │ │ │ │shard's │ │Already have max │ │replica │ │replicas for shard │ │list │ │ │ └────────────┘ └──────────────────────┘ ``` --- ## State Transitions ### Cluster State Machine ``` ┌────────────────┐ │ Initializing │ │ (no nodes) │ └────────┬───────┘ │ NodeJoined ▼ ┌────────────────┐ │ Single Node │ │ (one node only)│ └────────┬───────┘ │ NodeJoined ▼ ┌────────────────────────────────────────────┐ │ Multi-Node Cluster │ │ ├─ Stable (healthy nodes, shards assigned) │ │ ├─ Rebalancing (shards moving) │ │ └─ Degraded (failed node waiting for heal) │ └────────────┬───────────────────────────────┘ │ (All nodes left or failed) ▼ ┌────────────────┐ │ No Nodes │ │ (cluster dead) │ └────────────────┘ ``` ### Node State Machine (per node) ``` ┌────────────────┐ │ Discovered │ │ (new heartbeat)│ └────────┬───────┘ │ JoinCluster ▼ ┌────────────────┐ │ Active │ │ (healthy, │ │ can host │ │ shards) │ └────────┬───────┘ │ ┌────────┴──────────────┬─────────────────┐ │ │ │ │ (graceful) │ (heartbeat miss)│ ▼ ▼ ▼ ┌────────────┐ ┌────────────────┐ ┌────────────────┐ │ Draining │ │ Failed │ │ Failed │ │ (stop new │ │ (timeout:90s) │ │ (detected) │ │ shards, │ │ │ │ (admin/health) │ │ preserve │ │ Rebalance │ │ │ │ existing) │ │ shards ASAP │ │ Rebalance │ │ │ │ │ │ shards ASAP │ │ │ │Recover? │ │ │ │ │ │ ├─ Yes: │ │Recover? │ │ │ │ │ → Active │ │ ├─ Yes: │ │ │ │ └─ No: │ │ │ → Active │ │ │ │ → Deleted │ │ └─ No: │ │ │ │ │ │ → Deleted │ │ │ └────────────────┘ └────────────────┘ └────┬───────┘ │ Removed ▼ ┌────────────────┐ │ Deleted │ │ (left cluster) │ └────────────────┘ ``` ### Leadership State Machine (per node) ``` ┌──────────────────┐ │ Not a Leader │ │ (waiting) │ └────────┬─────────┘ │ Try Election (every 2s) │ Atomic create "leader" succeeds ▼ ┌──────────────────┐ │ Candidate │ │ (won election) │ └────────┬─────────┘ │ Start lease renewal loop ▼ ┌──────────────────┐ │ Leader │ │ (holding lease) │ └────────┬─────────┘ │ ┌──────┴───────────┬──────────────────────┐ │ │ │ │ Renew lease │ Lease expires │ Graceful │ (every 3s) │ (90s timeout) │ shutdown │ ✓ Success │ ✗ Failure │ ▼ ▼ ▼ [stays] ┌──────────────────┐ ┌──────────────────┐ │ Lost Leadership │ │ Lost Leadership │ │ (lease expired) │ │ (graceful) │ └────────┬─────────┘ └────────┬─────────┘ │ │ └──────────┬───────────┘ │ ▼ ┌──────────────────┐ │ Not a Leader │ │ (back to bench) │ └──────────────────┘ │ └─▶ [Back to top] ``` --- ## Concurrency Model ### Thread Safety All aggregates use `sync.RWMutex` for thread safety: ```go type ClusterManager struct { mutex sync.RWMutex // Protects access to: nodes map[string]*NodeInfo // - nodes shardMap *ShardMap // - shardMap // ... } // Read operation (multiple goroutines) func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo { cm.mutex.RLock() // Shared lock defer cm.mutex.RUnlock() // ... } // Write operation (exclusive) func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error { cm.mutex.Lock() // Exclusive lock defer cm.mutex.Unlock() // ... (only one writer at a time) } ``` ### Background Goroutines ``` ┌─────────────────────────────────────────────────┐ │ DistributedVM.Start() │ ├─────────────────────────────────────────────────┤ │ │ │ ┌──────────────────────────────────────────┐ │ │ │ ClusterManager.Start() │ │ │ │ ├─ go election.Start() │ │ │ │ │ ├─ go electionLoop() [ticker: 2s] │ │ │ │ │ ├─ go leaseRenewalLoop() [ticker: 3s]│ │ │ │ │ └─ go monitorLeadership() [watcher] │ │ │ │ │ │ │ │ │ ├─ go monitorNodes() [ticker: 30s] │ │ │ │ └─ go rebalanceLoop() [ticker: 5m] │ │ │ └──────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────┐ │ │ │ NodeDiscovery.Start() │ │ │ │ ├─ go announceNode() [ticker: 30s] │ │ │ │ └─ Subscribe to "aether.discovery" │ │ │ └──────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────┐ │ │ │ NATS subscriptions │ │ │ │ ├─ "aether.cluster.*" → messages │ │ │ │ ├─ "aether.discovery" → node updates │ │ │ │ └─ "aether-leader-election" → KV watch │ │ │ └──────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────┘ ``` --- ## Event Sequences ### Event: Node Join, Rebalance, Shard Assignment ``` Timeline: T=0s ├─ Node-4 joins cluster ├─ NodeDiscovery announces NodeJoined via NATS └─ ClusterManager receives and processes T=0.1s ├─ ClusterManager.JoinCluster() executes ├─ Updates nodes map, hashRing ├─ Publishes NodeJoined event └─ If leader: triggers rebalancing T=0.5s (if leader) ├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing) ├─ PlacementStrategy.RebalanceShards() computes new assignments └─ ShardManager.AssignShards() applies new assignments T=1.0s ├─ Publishes ShardMigrated events (one per shard moved) ├─ All nodes subscribe to these events ├─ Each node routing table updated └─ Actors aware of new shard locations T=1.5s onwards ├─ Actors on moved shards migrated (application layer) ├─ Actor Runtime subscribes to ShardMigrated ├─ Triggers actor migration via ActorMigration └─ Eventually: rebalancing complete T=5m ├─ Periodic rebalance check (5m timer) ├─ If no changes: no-op └─ If imbalance detected: trigger rebalance again ``` ### Event: Node Failure Detection and Recovery ``` Timeline: T=0s ├─ Node-2 healthy, last heartbeat received └─ Node-2.LastSeen = now T=30s ├─ Node-2 healthcheck runs (every 30s timer) ├─ Publishes heartbeat └─ Node-2.LastSeen updated T=60s ├─ (Node-2 still healthy) └─ Heartbeat received, LastSeen updated T=65s ├─ Node-2 CRASH (network failure or process crash) ├─ No more heartbeats sent └─ Node-2.LastSeen = 60s T=90s (timeout) ├─ ClusterManager.checkNodeHealth() detects timeout ├─ now - LastSeen > 90s → mark node Failed ├─ ClusterManager.MarkNodeFailed() executes ├─ Publishes NodeFailed event ├─ If leader: triggers rebalancing └─ (If not leader: waits for leader to rebalance) T=91s (if leader) ├─ RebalanceShards triggered ├─ PlacementStrategy computes new topology without Node-2 ├─ ShardManager.AssignShards() reassigns shards └─ Publishes ShardMigrated events T=92s onwards ├─ Actors migrated from Node-2 to healthy nodes └─ No actor loss (assuming replication or migration succeeded) T=120s (Node-2 recovery) ├─ Node-2 process restarts ├─ NodeDiscovery announces NodeJoined again ├─ Status: Active └─ (Back to Node Join sequence if leader decides) ``` --- ## Configuration & Tuning ### Key Parameters ```go // From cluster/types.go const ( // LeaderLeaseTimeout: how long before leader must renew LeaderLeaseTimeout = 10 * time.Second // HeartbeatInterval: how often leader renews HeartbeatInterval = 3 * time.Second // ElectionTimeout: how often nodes try election ElectionTimeout = 2 * time.Second // Node failure detection (in manager.go) nodeFailureTimeout = 90 * time.Second ) // From cluster/types.go const ( // DefaultNumShards: total shards in cluster DefaultNumShards = 1024 // DefaultVirtualNodes: per-node virtual replicas for distribution DefaultVirtualNodes = 150 ) ``` ### Tuning Guide | Parameter | Current | Rationale | Trade-off | |-----------|---------|-----------|-----------| | LeaderLeaseTimeout | 10s | Fast failure detection | May cause thrashing in high-latency networks | | HeartbeatInterval | 3s | Leader alive signal every 3s | Overhead 3x per 9s window | | ElectionTimeout | 2s | Retry elections frequently | CPU cost, but quick recovery | | NodeFailureTimeout | 90s | 3x heartbeat interval | Tolerance for temp network issues | | DefaultNumShards | 1024 | Good granularity for large clusters | More shards = more metadata | | DefaultVirtualNodes | 150 | Balance between distribution and overhead | Lower = worse distribution, higher = more ring operations | --- ## Failure Scenarios & Recovery ### Scenario A: Single Node Fails ``` Before: [A (leader), B, C, D] with 1024 shards ├─ A: 256 shards (+ leader) ├─ B: 256 shards ├─ C: 256 shards └─ D: 256 shards B crashes (no recovery) → waits 90s → marked Failed After Rebalance: [A (leader), C, D] with 1024 shards ├─ A: 341 shards (+ leader) ├─ C: 341 shards └─ D: 342 shards ``` **Recovery:** Reshuffled ~1/3 shards (consistent hashing + virtual nodes minimizes this) --- ### Scenario B: Leader Fails ``` Before: [A (leader), B, C, D] A crashes → waits 90s → marked Failed no lease renewal → lease expires after 10s B, C, or D wins election → new leader → triggers rebalance → reshuffles A's shards After: [B (leader), C, D] ``` **Recovery:** New leader elected within 10s; rebalancing within 100s; no loss if replicas present --- ### Scenario C: Network Partition ``` Before: [A (leader), B, C, D] Partition: {A} isolated | {B, C, D} connected At T=10s (lease expires): ├─ A: can't reach NATS, can't renew → loses leadership ├─ B, C, D: A's lease expired, one wins election └─ New leader coordinates rebalance Risk: If A can reach NATS (just isolated from app), might try to renew but atomic update fails because term mismatch Result: Single leader maintained; no split-brain ``` --- ## Monitoring & Observability ### Key Metrics to Track ``` # Cluster Topology gauge: cluster.nodes.count [active|draining|failed] gauge: cluster.shards.assigned [0, 1024] gauge: cluster.shards.orphaned [0, 1024] # Leadership gauge: cluster.leader.is_leader [0, 1] gauge: cluster.leader.term [0, ∞] gauge: cluster.leader.lease_expires_in_seconds [0, 10] # Rebalancing counter: cluster.rebalancing.triggered [reason] gauge: cluster.rebalancing.active [0, 1] counter: cluster.rebalancing.completed [shards_moved] counter: cluster.rebalancing.failed [reason] # Node Health gauge: cluster.node.heartbeat_latency_ms [per node] gauge: cluster.node.load [per node] gauge: cluster.node.vm_count [per node] counter: cluster.node.failures [reason] ``` ### Alerts ``` - Leader heartbeat missing > 5s → election may be stuck - Rebalancing in progress > 5min → something wrong - Orphaned shards > 0 → invariant violation - Node failure > 50% of cluster → investigate ``` --- ## References - [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Full domain model - [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation roadmap - [manager.go](./manager.go) - ClusterManager implementation - [leader.go](./leader.go) - LeaderElection implementation - [shard.go](./shard.go) - ShardManager implementation - [discovery.go](./discovery.go) - NodeDiscovery implementation - [distributed.go](./distributed.go) - DistributedVM orchestrator