# Cluster Coordination: Architecture Reference

## High-Level Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    Aether Cluster Runtime                     │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ DistributedVM (Orchestrator - not an aggregate)      │    │
│  │  ├─ Local Runtime (executes actors)                  │    │
│  │  ├─ NodeDiscovery (heartbeat → cluster awareness)    │    │
│  │  ├─ ClusterManager (Cluster aggregate root)          │    │
│  │  │   ├─ nodes: Map[ID → NodeInfo]                    │    │
│  │  │   ├─ shardMap: ShardMap (current assignments)     │    │
│  │  │   ├─ hashRing: ConsistentHashRing (util)          │    │
│  │  │   └─ election: LeaderElection                     │    │
│  │  └─ ShardManager (ShardAssignment aggregate)         │    │
│  │      ├─ shardCount: int                              │    │
│  │      ├─ shardMap: ShardMap                           │    │
│  │      └─ placement: PlacementStrategy                 │    │
│  └──────────────────────────────────────────────────────┘    │
│                          │ NATS                               │
│                          ▼                                    │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ NATS Cluster                                          │    │
│  │  ├─ Subject: aether.discovery (heartbeats)            │    │
│  │  ├─ Subject: aether.cluster.* (messages)              │    │
│  │  └─ KeyValue: aether-leader-election (lease)          │    │
│  └──────────────────────────────────────────────────────┘    │
│                                                               │
└──────────────────────────────────────────────────────────────┘
```
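
To make the composition concrete, the sketch below shows one plausible way the orchestrator wires these collaborators together. It is illustrative only, not the actual `distributed.go` code; the constructor shape, the `nc` field, and the import set are assumptions, and the project's own types (`Runtime`, `NodeDiscovery`, `ClusterManager`) are referenced rather than defined.

```go
// Illustrative sketch only; mirrors the diagram, not the real distributed.go.
type DistributedVM struct {
    runtime   *Runtime        // local actor execution
    discovery *NodeDiscovery  // heartbeats → cluster awareness
    cluster   *ClusterManager // Cluster aggregate root (owns ShardManager, LeaderElection)
    nc        *nats.Conn      // shared NATS connection (assumed field name)
}

// Start brings up discovery first so the ClusterManager sees heartbeats,
// then starts the election, node-monitoring, and rebalancing loops.
func (vm *DistributedVM) Start(ctx context.Context) error {
    if err := vm.discovery.Start(ctx); err != nil {
        return err
    }
    return vm.cluster.Start(ctx)
}
```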

---

## Aggregate Boundaries

### Aggregate 1: Cluster (Root)

Owns node topology, shard assignments, and rebalancing decisions.

```
Cluster Aggregate
├─ Entities
│   └─ Cluster (root)
│       ├─ nodes: Map[NodeID → NodeInfo]
│       ├─ shardMap: ShardMap
│       ├─ hashRing: ConsistentHashRing
│       └─ currentLeaderID: string
│
├─ Commands
│   ├─ JoinCluster(nodeInfo)
│   ├─ MarkNodeFailed(nodeID)
│   ├─ AssignShards(shardMap)
│   └─ RebalanceShards(reason)
│
├─ Events
│   ├─ NodeJoined
│   ├─ NodeFailed
│   ├─ NodeLeft
│   ├─ ShardAssigned
│   ├─ ShardMigrated
│   └─ RebalancingTriggered
│
├─ Invariants Enforced
│   ├─ I2: All active shards have owners
│   ├─ I3: Shards only on healthy nodes
│   └─ I4: Assignments stable during lease
│
└─ Code Location: ClusterManager (cluster/manager.go)
```
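
A minimal Go sketch of this aggregate's shape, assuming the entity fields and event names listed above. It is not the actual `ClusterManager` code: the event payloads, status constant, and `hashRing.RemoveNode` helper are assumptions, and package/import clauses plus the project's own types (`NodeInfo`, `ShardMap`, `ConsistentHashRing`) are omitted.

```go
// Sketch of the Cluster aggregate root's state (names from the table above).
type Cluster struct {
    mu              sync.RWMutex
    nodes           map[string]*NodeInfo // NodeID → NodeInfo
    shardMap        *ShardMap
    hashRing        *ConsistentHashRing
    currentLeaderID string
}

// Events emitted by the aggregate (payload fields are assumptions).
type NodeJoined struct{ Node NodeInfo }
type NodeFailed struct{ NodeID string }

// Commands mutate state under the write lock and return events to publish.
func (c *Cluster) MarkNodeFailed(nodeID string) ([]any, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    n, ok := c.nodes[nodeID]
    if !ok {
        return nil, fmt.Errorf("unknown node %q", nodeID)
    }
    n.Status = NodeStatusFailed   // I3: failed nodes may not own shards
    c.hashRing.RemoveNode(nodeID) // keep the ring consistent with topology
    return []any{NodeFailed{NodeID: nodeID}}, nil
}
```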

### Aggregate 2: LeadershipLease (Root)

Owns leadership claim and ensures single leader per term.

```
LeadershipLease Aggregate
├─ Entities
│   └─ LeadershipLease (root)
│       ├─ leaderID: string
│       ├─ term: uint64
│       ├─ expiresAt: time.Time
│       └─ startedAt: time.Time
│
├─ Commands
│   ├─ ElectLeader(nodeID)
│   └─ RenewLeadership(nodeID)
│
├─ Events
│   ├─ LeaderElected
│   ├─ LeadershipRenewed
│   └─ LeadershipLost
│
├─ Invariants Enforced
│   ├─ I1: Single leader per term
│   └─ I5: Leader is active node
│
└─ Code Location: LeaderElection (cluster/leader.go)
```
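
As an illustration of the lease entity, here is a hedged sketch of the lease value and two helpers. Field names follow the entity above; the JSON tags and helper methods are assumptions, not the actual `leader.go` code.

```go
// Sketch of the lease record stored under the "leader" key.
type LeadershipLease struct {
    LeaderID  string    `json:"leader_id"`
    Term      uint64    `json:"term"`
    StartedAt time.Time `json:"started_at"`
    ExpiresAt time.Time `json:"expires_at"`
}

// Expired reports whether the lease is no longer valid (I1: a new leader may
// only be elected once the previous term's lease has expired).
func (l LeadershipLease) Expired(now time.Time) bool {
    return now.After(l.ExpiresAt)
}

// Renew extends the lease by the configured timeout without changing the term.
func (l LeadershipLease) Renew(now time.Time, leaseTimeout time.Duration) LeadershipLease {
    l.ExpiresAt = now.Add(leaseTimeout)
    return l
}
```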

### Aggregate 3: ShardAssignment (Root)

Owns shard-to-node mappings and validates assignments.

```
ShardAssignment Aggregate
├─ Entities
│   └─ ShardAssignment (root)
│       ├─ version: uint64
│       ├─ assignments: Map[ShardID → []NodeID]
│       ├─ nodes: Map[NodeID → NodeInfo]
│       └─ updateTime: time.Time
│
├─ Commands
│   ├─ AssignShard(shardID, nodeList)
│   └─ RebalanceFromTopology(nodes)
│
├─ Events
│   ├─ ShardAssigned
│   └─ ShardMigrated
│
├─ Invariants Enforced
│   ├─ I2: All shards assigned
│   └─ I3: Only healthy nodes
│
└─ Code Location: ShardManager (cluster/shard.go)
```

---

## Command Flow Diagrams

### Scenario 1: Node Joins Cluster

```
┌─────────┐  NodeJoined  ┌──────────────────┐
│New Node │─────────────▶│ClusterManager    │
└─────────┘              │.JoinCluster()    │
                         └────────┬─────────┘
                                  │
                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
              ┌──────────┐  ┌──────────┐  ┌──────────┐
              │Validate  │  │Update    │  │Publish   │
              │ID unique │  │topology  │  │NodeJoined│
              │Capacity>0│  │hashRing  │  │event     │
              └────┬─────┘  └──────────┘  └──────────┘
                   │
              ┌────▼────────────────────────┐
              │Is this node leader?         │
              │If yes: trigger rebalance    │
              └─────────────┬───────────────┘
                            │
                ┌───────────┴───────────┐
                ▼                       ▼
       ┌──────────────────┐    ┌──────────────────┐
       │RebalanceShards   │    │(nothing)         │
       │ command          │    │                  │
       └────────┬─────────┘    └──────────────────┘
                │
                ▼
       ┌──────────────────────────────────┐
       │ConsistentHashPlacement           │
       │  .RebalanceShards()              │
       │  (compute new assignments)       │
       └────────────┬─────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────┐
       │ShardManager.AssignShards()       │
       │  (validate & apply)              │
       └────────────┬─────────────────────┘
                    │
           ┌────────┴──────────────────┐
           ▼                           ▼
    ┌──────────────┐          ┌─────────────────┐
    │For each      │          │Publish          │
    │shard moved   │          │ShardMigrated    │
    │              │          │event per shard  │
    └──────────────┘          └─────────────────┘
```
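
The join path summarizes to validate → update topology → publish → maybe rebalance. The sketch below is a hedged outline of that flow, not the real `JoinCluster` implementation; helper names such as `publish` and `triggerShardRebalancing` follow the diagrams, but their signatures, the capacity field, and the status constant are assumptions.

```go
func (cm *ClusterManager) JoinCluster(node *NodeInfo) error {
    cm.mutex.Lock()
    // Validate: unique ID and positive capacity.
    if _, exists := cm.nodes[node.ID]; exists {
        cm.mutex.Unlock()
        return fmt.Errorf("node %q already joined", node.ID)
    }
    if node.Capacity <= 0 {
        cm.mutex.Unlock()
        return fmt.Errorf("node %q has no capacity", node.ID)
    }
    // Update topology and hash ring.
    node.Status = NodeStatusActive
    cm.nodes[node.ID] = node
    cm.hashRing.AddNode(node.ID)
    isLeader := cm.election.IsLeader()
    cm.mutex.Unlock()

    cm.publish(NodeJoined{Node: *node}) // announce to the rest of the cluster

    // Only the leader recomputes shard placement (see Decision 2).
    if isLeader {
        cm.triggerShardRebalancing("node joined")
    }
    return nil
}
```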

### Scenario 2: Node Failure Detected

```
┌──────────────────────┐
│Heartbeat timeout     │
│(LastSeen > 90s)      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────────────┐
│ClusterManager                │
│.MarkNodeFailed()             │
│ ├─ Mark status=Failed        │
│ ├─ Remove from hashRing      │
│ └─ Publish NodeFailed event  │
└────────────┬─────────────────┘
             │
    ┌────────▼────────────┐
    │Is this node leader? │
    └────────┬────────────┘
             │
    ┌────────┴─────────────────┐
    │ YES                      │ NO
    ▼                          ▼
┌──────────────┐      ┌──────────────────┐
│Trigger       │      │(nothing)         │
│Rebalance     │      │                  │
└──────┬───────┘      └──────────────────┘
       │
       └─▶ [Same as Scenario 1 from RebalanceShards]
```
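
A hedged sketch of the failure-detection loop behind this scenario (the `monitorNodes` ticker from the concurrency section): nodes whose last heartbeat is older than the 90s timeout are marked failed. The field and method names are assumptions about the real code.

```go
// monitorNodes runs on a 30s ticker and applies the 90s failure timeout.
func (cm *ClusterManager) monitorNodes(ctx context.Context) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case now := <-ticker.C:
            cm.mutex.RLock()
            var stale []string
            for id, n := range cm.nodes {
                if n.Status == NodeStatusActive && now.Sub(n.LastSeen) > 90*time.Second {
                    stale = append(stale, id)
                }
            }
            cm.mutex.RUnlock()
            for _, id := range stale {
                // Publishes NodeFailed; the leader then triggers rebalancing.
                _ = cm.MarkNodeFailed(id)
            }
        }
    }
}
```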

### Scenario 3: Leader Election (Implicit)

```
┌─────────────────────────────────────┐
│All nodes: electionLoop runs every 2s│
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────┐
    │Am I leader?         │
    └────────┬────────────┘
             │
    ┌────────┴──────────────────────────┐
    │ YES                               │ NO
    ▼                                   ▼
┌──────────────┐          ┌─────────────────────────┐
│Do nothing    │          │Should try election?     │
│(already      │          │ ├─ No leader exists     │
│leading)      │          │ ├─ Lease expired        │
└──────────────┘          │ └─ (other conditions)   │
                          └────────┬────────────────┘
                                   │
                         ┌─────────▼──────────┐
                         │try AtomicCreate    │
                         │"leader" key in KV  │
                         └────────┬───────────┘
                                  │
                    ┌─────────────┴──────────────┐
                    │ SUCCESS                    │ FAILED
                    ▼                            ▼
          ┌──────────────────┐         ┌──────────────────┐
          │Became Leader!    │         │Try claim expired │
          │Publish           │         │lease; if success,│
          │LeaderElected     │         │become leader     │
          └────────┬─────────┘         │Else: stay on     │
                   │                   │bench             │
                   ▼                   └──────────────────┘
          ┌──────────────────┐
          │Start lease       │
          │renewal loop      │
          │(every 3s)        │
          └──────────────────┘
```
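
The "AtomicCreate" step maps naturally onto a create-if-absent operation in the NATS JetStream key-value bucket named in the architecture diagram (`aether-leader-election`). The sketch below shows one plausible shape of that attempt using the `nats.go` KeyValue API; the term derivation, the lease payload, and the error handling are simplified and partly assumed.

```go
// tryBecomeLeader attempts the atomic create of the "leader" key.
// KeyValue.Create fails if the key already exists, so at most one node wins.
func tryBecomeLeader(kv nats.KeyValue, nodeID string, leaseTimeout time.Duration) (bool, error) {
    lease := LeadershipLease{
        LeaderID:  nodeID,
        Term:      1, // real code would derive the term from the previous lease
        StartedAt: time.Now(),
        ExpiresAt: time.Now().Add(leaseTimeout),
    }
    data, err := json.Marshal(lease)
    if err != nil {
        return false, err
    }
    if _, err := kv.Create("leader", data); err != nil {
        return false, nil // another node holds the key; not an error for us
    }
    return true, nil
}
```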

---

## Decision Trees

### Decision 1: Is Node Healthy?

```
Query: Is Node X healthy?

┌─────────────────────────────────────┐
│Get node status from Cluster.nodes   │
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────────┐
    │Check node.Status field  │
    └────────┬────────────────┘
             │
    ┌────────┴───────┬─────────────────────┬──────────────┐
    │                │                     │              │
    ▼                ▼                     ▼              ▼
┌────────┐    ┌────────────┐      ┌──────────────┐   ┌─────────┐
│Active  │    │Draining    │      │Failed        │   │Unknown  │
├────────┤    ├────────────┤      ├──────────────┤   ├─────────┤
│✓Healthy│    │⚠ Draining  │      │✗ Unhealthy   │   │✗ Error  │
│Can host│    │Should not  │      │Don't use for │   │         │
│shards  │    │get new     │      │sharding      │   │         │
└────────┘    │shards, but │      │Delete shards │   └─────────┘
              │existing OK │      │ASAP          │
              └────────────┘      └──────────────┘
```
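
This decision collapses to a small helper over the node status enum. The sketch below is illustrative only; the constant names mirror the states in the diagram but are assumptions about the actual `cluster` package.

```go
type NodeStatus int

const (
    NodeStatusActive NodeStatus = iota
    NodeStatusDraining
    NodeStatusFailed
)

// CanHostNewShards answers Decision 1 for shard placement:
// only Active nodes accept new shards.
func CanHostNewShards(s NodeStatus) bool {
    return s == NodeStatusActive
}

// KeepsExistingShards reports whether shards already on the node may stay;
// Draining nodes keep what they have but receive nothing new.
func KeepsExistingShards(s NodeStatus) bool {
    return s == NodeStatusActive || s == NodeStatusDraining
}
```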

### Decision 2: Should This Node Rebalance Shards?

```
Command: RebalanceShards(nodeID, reason)

┌──────────────────────────────────┐
│Is nodeID the current leader?     │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NotLeader     │
│rebalancing │     │                      │
└─────┬──────┘     │Only leader can       │
      │            │initiate rebalancing  │
      │            └──────────────────────┘
      ▼
┌─────────────────────────────────────┐
│Are there active nodes?              │
└────────┬────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NoActiveNodes │
│rebalancing │     │                      │
└─────┬──────┘     │Can't assign shards   │
      │            │with no healthy nodes │
      │            └──────────────────────┘
      ▼
┌──────────────────────────────────────────┐
│Call PlacementStrategy.RebalanceShards()  │
│  (compute new assignments)               │
└────────┬─────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│Call ShardManager.AssignShards()          │
│  (validate & apply new assignments)      │
└────────┬─────────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ SUCCESS               │ FAILURE
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Publish     │     │Publish               │
│Shard       │     │RebalancingFailed     │
│Migrated    │     │event                 │
│events      │     │                      │
│            │     │Log error, backoff,   │
│Publish     │     │try again in 5 min    │
│Rebalancing │     │                      │
│Completed   │     │                      │
└────────────┘     └──────────────────────┘
```
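
In code form, this decision tree is a pair of guard clauses before delegating to the placement strategy. A hedged sketch follows, assuming helper names (`activeNodes`, `publish`, `shardManager.Current`) and error variables that the document does not itself name.

```go
var (
    ErrNotLeader     = errors.New("only the leader can initiate rebalancing")
    ErrNoActiveNodes = errors.New("cannot assign shards with no healthy nodes")
)

func (cm *ClusterManager) RebalanceShards(reason string) error {
    if !cm.election.IsLeader() {
        return ErrNotLeader
    }
    active := cm.activeNodes() // snapshot of Active nodes taken under RLock
    if len(active) == 0 {
        return ErrNoActiveNodes
    }
    // Compute the new placement, then validate & apply it.
    newMap, err := cm.placement.RebalanceShards(cm.shardManager.Current(), active)
    if err == nil {
        err = cm.shardManager.AssignShards(newMap)
    }
    if err != nil {
        cm.publish(RebalancingFailed{Reason: err.Error()}) // caller logs and backs off
        return err
    }
    cm.publish(RebalancingCompleted{Reason: reason})
    return nil
}
```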

### Decision 3: Can We Assign This Shard to This Node?

```
Command: AssignShard(shardID, nodeID)

┌──────────────────────────────────┐
│Is nodeID in Cluster.nodes?       │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NodeNotFound  │
│assignment  │     │                      │
└─────┬──────┘     │Can't assign shard    │
      │            │to non-existent node  │
      │            └──────────────────────┘
      ▼
┌──────────────────────────────────────┐
│Check node.Status                     │
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │Active or Draining     │ Failed
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: UnhealthyNode │
│assignment  │     │                      │
└─────┬──────┘     │Can't assign to       │
      │            │failed node; it can't │
      │            │execute shards        │
      │            └──────────────────────┘
      ▼
┌──────────────────────────────────────┐
│Check replication factor              │
│ (existing nodes < replication limit?)│
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌───────────────────────┐
│ACCEPT      │     │REJECT: TooManyReplicas│
│Add node to │     │                       │
│shard's     │     │Already have max       │
│replica     │     │replicas for shard     │
│list        │     │                       │
└────────────┘     └───────────────────────┘
```
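
The same checks, expressed as a hedged validation sketch on the ShardAssignment aggregate; the replication-factor parameter and the exact error wording are assumptions rather than the real `shard.go` code.

```go
func (sa *ShardAssignment) AssignShard(shardID int, nodeID string, replicationFactor int) error {
    node, ok := sa.nodes[nodeID]
    if !ok {
        return fmt.Errorf("node not found: %s", nodeID)
    }
    if node.Status == NodeStatusFailed {
        return fmt.Errorf("unhealthy node: %s cannot execute shards", nodeID)
    }
    replicas := sa.assignments[shardID]
    if len(replicas) >= replicationFactor {
        return fmt.Errorf("shard %d already has %d replicas", shardID, len(replicas))
    }
    sa.assignments[shardID] = append(replicas, nodeID)
    sa.version++ // bump version so consumers of ShardMigrated can detect staleness
    sa.updateTime = time.Now()
    return nil
}
```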

---

## State Transitions

### Cluster State Machine

```
┌────────────────┐
│  Initializing  │
│  (no nodes)    │
└────────┬───────┘
         │ NodeJoined
         ▼
┌────────────────┐
│  Single Node   │
│ (one node only)│
└────────┬───────┘
         │ NodeJoined
         ▼
┌────────────────────────────────────────────┐
│ Multi-Node Cluster                         │
│ ├─ Stable (healthy nodes, shards assigned) │
│ ├─ Rebalancing (shards moving)             │
│ └─ Degraded (failed node waiting for heal) │
└────────────┬───────────────────────────────┘
             │ (All nodes left or failed)
             ▼
┌────────────────┐
│   No Nodes     │
│ (cluster dead) │
└────────────────┘
```

### Node State Machine (per node)

```
┌────────────────┐
│  Discovered    │
│ (new heartbeat)│
└────────┬───────┘
         │ JoinCluster
         ▼
┌────────────────┐
│  Active        │
│  (healthy,     │
│   can host     │
│   shards)      │
└────────┬───────┘
         │
    ┌────┴──────────────────┬──────────────────────┐
    │                       │                      │
    │ (graceful)            │ (heartbeat miss)     │ (admin/health check)
    ▼                       ▼                      ▼
┌────────────┐     ┌────────────────┐     ┌────────────────┐
│ Draining   │     │ Failed         │     │ Failed         │
│ (stop new  │     │ (timeout: 90s) │     │ (detected)     │
│  shards,   │     │                │     │                │
│  preserve  │     │ Rebalance      │     │ Rebalance      │
│  existing) │     │ shards ASAP    │     │ shards ASAP    │
│            │     │                │     │                │
│            │     │ Recover?       │     │ Recover?       │
│            │     │  ├─ Yes:       │     │  ├─ Yes:       │
│            │     │  │   → Active  │     │  │   → Active  │
│            │     │  └─ No:        │     │  └─ No:        │
│            │     │      → Deleted │     │      → Deleted │
└────┬───────┘     └────────────────┘     └────────────────┘
     │ Removed
     ▼
┌────────────────┐
│  Deleted       │
│ (left cluster) │
└────────────────┘
```
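
One compact way to enforce this state machine is a whitelist of legal transitions. The sketch below is an assumption about how that could look, not existing code; it also assumes `NodeStatusDiscovered` and `NodeStatusDeleted` constants alongside the ones sketched under Decision 1.

```go
// legalTransitions whitelists the edges drawn in the state machine above.
var legalTransitions = map[NodeStatus][]NodeStatus{
    NodeStatusDiscovered: {NodeStatusActive},
    NodeStatusActive:     {NodeStatusDraining, NodeStatusFailed},
    NodeStatusDraining:   {NodeStatusDeleted},
    NodeStatusFailed:     {NodeStatusActive, NodeStatusDeleted}, // recover or remove
}

func canTransition(from, to NodeStatus) bool {
    for _, next := range legalTransitions[from] {
        if next == to {
            return true
        }
    }
    return false
}
```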

### Leadership State Machine (per node)

```
┌──────────────────┐
│  Not a Leader    │
│  (waiting)       │
└────────┬─────────┘
         │ Try Election (every 2s)
         │ Atomic create "leader" succeeds
         ▼
┌──────────────────┐
│  Candidate       │
│  (won election)  │
└────────┬─────────┘
         │ Start lease renewal loop
         ▼
┌──────────────────┐
│  Leader          │
│  (holding lease) │
└────────┬─────────┘
         │
  ┌──────┴───────────┬──────────────────────┐
  │                  │                      │
  │ Renew lease      │ Lease expires        │ Graceful
  │ (every 3s)       │ (10s timeout)        │ shutdown
  │ ✓ Success        │ ✗ Failure            │
  ▼                  ▼                      ▼
[stays]    ┌──────────────────┐   ┌──────────────────┐
           │ Lost Leadership  │   │ Lost Leadership  │
           │ (lease expired)  │   │ (graceful)       │
           └────────┬─────────┘   └────────┬─────────┘
                    │                      │
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────┐
                    │  Not a Leader    │
                    │ (back to bench)  │
                    └────────┬─────────┘
                             │
                             └─▶ [Back to top]
```

---

## Concurrency Model

### Thread Safety

All aggregates use `sync.RWMutex` for thread safety:

```go
type ClusterManager struct {
    mutex    sync.RWMutex         // Protects access to:
    nodes    map[string]*NodeInfo //  - nodes
    shardMap *ShardMap            //  - shardMap
    // ...
}

// Read operation (multiple goroutines)
func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo {
    cm.mutex.RLock() // Shared lock
    defer cm.mutex.RUnlock()
    // ...
}

// Write operation (exclusive)
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
    cm.mutex.Lock() // Exclusive lock
    defer cm.mutex.Unlock()
    // ... (only one writer at a time)
}
```

### Background Goroutines

```
┌──────────────────────────────────────────────────┐
│ DistributedVM.Start()                            │
├──────────────────────────────────────────────────┤
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │ ClusterManager.Start()                    │   │
│  │  ├─ go election.Start()                   │   │
│  │  │   ├─ go electionLoop()     [ticker: 2s]│   │
│  │  │   ├─ go leaseRenewalLoop() [ticker: 3s]│   │
│  │  │   └─ go monitorLeadership()  [watcher] │   │
│  │  │                                        │   │
│  │  ├─ go monitorNodes()        [ticker: 30s]│   │
│  │  └─ go rebalanceLoop()        [ticker: 5m]│   │
│  └───────────────────────────────────────────┘   │
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │ NodeDiscovery.Start()                     │   │
│  │  ├─ go announceNode()        [ticker: 30s]│   │
│  │  └─ Subscribe to "aether.discovery"       │   │
│  └───────────────────────────────────────────┘   │
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │ NATS subscriptions                        │   │
│  │  ├─ "aether.cluster.*" → messages         │   │
│  │  ├─ "aether.discovery" → node updates     │   │
│  │  └─ "aether-leader-election" → KV watch   │   │
│  └───────────────────────────────────────────┘   │
│                                                  │
└──────────────────────────────────────────────────┘
```
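
The loops above follow a common ticker-plus-context pattern. Below is a hedged sketch of how `ClusterManager.Start` might launch them; the method names match the diagram, while the context plumbing and the `loop` helper are assumptions.

```go
func (cm *ClusterManager) Start(ctx context.Context) error {
    // Election, lease renewal, and the KV watcher live inside election.Start.
    if err := cm.election.Start(ctx); err != nil {
        return err
    }
    go cm.loop(ctx, 30*time.Second, cm.checkNodeHealth)  // monitorNodes
    go cm.loop(ctx, 5*time.Minute, cm.periodicRebalance) // rebalanceLoop
    return nil
}

// loop is a tiny helper: run fn on every tick until the context is cancelled.
func (cm *ClusterManager) loop(ctx context.Context, every time.Duration, fn func()) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            fn()
        }
    }
}
```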

---

## Event Sequences

### Event: Node Join, Rebalance, Shard Assignment

```
Timeline:

T=0s
├─ Node-4 joins cluster
├─ NodeDiscovery announces NodeJoined via NATS
└─ ClusterManager receives and processes

T=0.1s
├─ ClusterManager.JoinCluster() executes
├─ Updates nodes map, hashRing
├─ Publishes NodeJoined event
└─ If leader: triggers rebalancing

T=0.5s (if leader)
├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing)
├─ PlacementStrategy.RebalanceShards() computes new assignments
└─ ShardManager.AssignShards() applies new assignments

T=1.0s
├─ Publishes ShardMigrated events (one per shard moved)
├─ All nodes subscribe to these events
├─ Each node's routing table is updated
└─ Actors aware of new shard locations

T=1.5s onwards
├─ Actors on moved shards migrated (application layer)
├─ Actor Runtime subscribes to ShardMigrated
├─ Triggers actor migration via ActorMigration
└─ Eventually: rebalancing complete

T=5m
├─ Periodic rebalance check (5m timer)
├─ If no changes: no-op
└─ If imbalance detected: trigger rebalance again
```
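
For the routing-table update at T=1.0s, each node needs a subscriber on the cluster subject that reacts to ShardMigrated. Here is a hedged sketch using a plain NATS subscription; the concrete subject name, payload shape, and `RoutingTable` type are assumptions (the document only specifies the `aether.cluster.*` namespace).

```go
// subscribeShardMigrations updates the local routing table as shards move.
func subscribeShardMigrations(nc *nats.Conn, routes *RoutingTable) (*nats.Subscription, error) {
    return nc.Subscribe("aether.cluster.shard.migrated", func(msg *nats.Msg) {
        var ev struct {
            ShardID  int    `json:"shard_id"`
            FromNode string `json:"from_node"`
            ToNode   string `json:"to_node"`
            Version  uint64 `json:"version"`
        }
        if err := json.Unmarshal(msg.Data, &ev); err != nil {
            return // ignore malformed events
        }
        // Ignore stale events from older shard-map versions.
        routes.UpdateIfNewer(ev.ShardID, ev.ToNode, ev.Version)
    })
}
```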

### Event: Node Failure Detection and Recovery

```
Timeline:

T=0s
├─ Node-2 healthy, last heartbeat received
└─ Node-2.LastSeen = now

T=30s
├─ Node-2 healthcheck runs (every 30s timer)
├─ Publishes heartbeat
└─ Node-2.LastSeen updated

T=60s
├─ (Node-2 still healthy)
└─ Heartbeat received, LastSeen updated

T=65s
├─ Node-2 CRASH (network failure or process crash)
├─ No more heartbeats sent
└─ Node-2.LastSeen stays at 60s

T=150s (timeout reached)
├─ ClusterManager.checkNodeHealth() detects timeout
├─ now - LastSeen > 90s → mark node Failed
├─ ClusterManager.MarkNodeFailed() executes
├─ Publishes NodeFailed event
├─ If leader: triggers rebalancing
└─ (If not leader: waits for leader to rebalance)

T=151s (if leader)
├─ RebalanceShards triggered
├─ PlacementStrategy computes new topology without Node-2
├─ ShardManager.AssignShards() reassigns shards
└─ Publishes ShardMigrated events

T=152s onwards
├─ Actors migrated from Node-2 to healthy nodes
└─ No actor loss (assuming replication or migration succeeded)

T=180s (Node-2 recovery)
├─ Node-2 process restarts
├─ NodeDiscovery announces NodeJoined again
├─ Status: Active
└─ (Back to the Node Join sequence; leader rebalances again)
```

---

## Configuration & Tuning

### Key Parameters

```go
// From cluster/types.go
const (
    // LeaderLeaseTimeout: how long before leader must renew
    LeaderLeaseTimeout = 10 * time.Second

    // HeartbeatInterval: how often leader renews
    HeartbeatInterval = 3 * time.Second

    // ElectionTimeout: how often nodes try election
    ElectionTimeout = 2 * time.Second

    // Node failure detection (in manager.go)
    nodeFailureTimeout = 90 * time.Second
)

// From cluster/types.go
const (
    // DefaultNumShards: total shards in cluster
    DefaultNumShards = 1024

    // DefaultVirtualNodes: per-node virtual replicas for distribution
    DefaultVirtualNodes = 150
)
```

### Tuning Guide

| Parameter | Current | Rationale | Trade-off |
|-----------|---------|-----------|-----------|
| LeaderLeaseTimeout | 10s | Fast failure detection | May cause thrashing in high-latency networks |
| HeartbeatInterval | 3s | Leader alive signal every 3s | ~3 renewal writes per 10s lease window |
| ElectionTimeout | 2s | Retry elections frequently | CPU cost, but quick recovery |
| NodeFailureTimeout | 90s | 3x the 30s node announce interval | Tolerance for temporary network issues |
| DefaultNumShards | 1024 | Good granularity for large clusters | More shards = more metadata |
| DefaultVirtualNodes | 150 | Balance between distribution and overhead | Lower = worse distribution, higher = more ring operations |

---

## Failure Scenarios & Recovery

### Scenario A: Single Node Fails

```
Before: [A (leader), B, C, D] with 1024 shards
├─ A: 256 shards (+ leader)
├─ B: 256 shards
├─ C: 256 shards
└─ D: 256 shards

B crashes (no recovery) → waits 90s → marked Failed

After Rebalance:
[A (leader), C, D] with 1024 shards
├─ A: 341 shards (+ leader)
├─ C: 341 shards
└─ D: 342 shards
```

**Recovery:** Only the failed node's ~256 shards (about 1/4 of the total) are reshuffled; consistent hashing with virtual nodes keeps the movement close to that minimum

---

### Scenario B: Leader Fails

```
Before: [A (leader), B, C, D]

A crashes → no lease renewal → lease expires after 10s
          → no heartbeats → marked Failed after 90s

B, C, or D wins election → new leader
                         → triggers rebalance
                         → reshuffles A's shards

After: [B (leader), C, D]
```

**Recovery:** New leader elected within ~10s; shards rebalanced within ~100s (after the 90s failure timeout); no loss if replicas are present

---

### Scenario C: Network Partition

```
Before: [A (leader), B, C, D]
Partition: {A} isolated | {B, C, D} connected

At T=10s (lease expires):
├─ A: can't reach NATS, can't renew → loses leadership
├─ B, C, D: A's lease expired, one wins election
└─ New leader coordinates rebalance

Risk: If A can still reach NATS (only isolated from the app), it might try to
      renew, but the atomic update fails because of a term mismatch

Result: Single leader maintained; no split-brain
```
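
The split-brain guard in this scenario relies on the renewal being conditional: a stale leader's write must fail if anyone else has touched the key since it last saw it. With the JetStream KV bucket used for the lease, that maps onto an update carrying the last known revision. The sketch below is illustrative, with simplified error handling and the `LeadershipLease` shape assumed from the earlier sketch.

```go
// renewLease only succeeds if the "leader" key is still at the revision we
// last observed, so a leader stranded on the wrong side of a partition cannot
// overwrite a newer lease written by the node that won the next election.
func renewLease(kv nats.KeyValue, lease LeadershipLease, lastRevision uint64,
    leaseTimeout time.Duration) (uint64, error) {

    lease.ExpiresAt = time.Now().Add(leaseTimeout)
    data, err := json.Marshal(lease)
    if err != nil {
        return 0, err
    }
    newRev, err := kv.Update("leader", data, lastRevision)
    if err != nil {
        return 0, err // revision mismatch: we are no longer the leader
    }
    return newRev, nil
}
```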

---

## Monitoring & Observability

### Key Metrics to Track

```
# Cluster Topology
gauge:   cluster.nodes.count                      [active|draining|failed]
gauge:   cluster.shards.assigned                  [0, 1024]
gauge:   cluster.shards.orphaned                  [0, 1024]

# Leadership
gauge:   cluster.leader.is_leader                 [0, 1]
gauge:   cluster.leader.term                      [0, ∞]
gauge:   cluster.leader.lease_expires_in_seconds  [0, 10]

# Rebalancing
counter: cluster.rebalancing.triggered            [reason]
gauge:   cluster.rebalancing.active               [0, 1]
counter: cluster.rebalancing.completed            [shards_moved]
counter: cluster.rebalancing.failed               [reason]

# Node Health
gauge:   cluster.node.heartbeat_latency_ms        [per node]
gauge:   cluster.node.load                        [per node]
gauge:   cluster.node.vm_count                    [per node]
counter: cluster.node.failures                    [reason]
```

### Alerts

```
- Leader heartbeat missing > 5s     → election may be stuck
- Rebalancing in progress > 5 min   → likely stuck or repeatedly failing
- Orphaned shards > 0               → invariant violation
- Node failures > 50% of cluster    → investigate infrastructure
```

---

## References

- [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Full domain model
- [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation roadmap
- [manager.go](./manager.go) - ClusterManager implementation
- [leader.go](./leader.go) - LeaderElection implementation
- [shard.go](./shard.go) - ShardManager implementation
- [discovery.go](./discovery.go) - NodeDiscovery implementation
- [distributed.go](./distributed.go) - DistributedVM orchestrator
|