Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added file: .product-strategy/cluster/ARCHITECTURE.md (833 lines)

# Cluster Coordination: Architecture Reference

## High-Level Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    Aether Cluster Runtime                     │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ DistributedVM (Orchestrator - not an aggregate)      │    │
│  │  ├─ Local Runtime (executes actors)                  │    │
│  │  ├─ NodeDiscovery (heartbeat → cluster awareness)    │    │
│  │  ├─ ClusterManager (Cluster aggregate root)          │    │
│  │  │   ├─ nodes: Map[ID → NodeInfo]                    │    │
│  │  │   ├─ shardMap: ShardMap (current assignments)     │    │
│  │  │   ├─ hashRing: ConsistentHashRing (util)          │    │
│  │  │   └─ election: LeaderElection                     │    │
│  │  └─ ShardManager (ShardAssignment aggregate)         │    │
│  │      ├─ shardCount: int                              │    │
│  │      ├─ shardMap: ShardMap                           │    │
│  │      └─ placement: PlacementStrategy                 │    │
│  └──────────────────────────────────────────────────────┘    │
│                          │ NATS                               │
│                          ▼                                    │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ NATS Cluster                                         │    │
│  │  ├─ Subject: aether.discovery (heartbeats)           │    │
│  │  ├─ Subject: aether.cluster.* (messages)             │    │
│  │  └─ KeyValue: aether-leader-election (lease)         │    │
│  └──────────────────────────────────────────────────────┘    │
│                                                               │
└──────────────────────────────────────────────────────────────┘
```

---

## Aggregate Boundaries

### Aggregate 1: Cluster (Root)
Owns node topology, shard assignments, and rebalancing decisions.

```
Cluster Aggregate
├─ Entities
│   └─ Cluster (root)
│       ├─ nodes: Map[NodeID → NodeInfo]
│       ├─ shardMap: ShardMap
│       ├─ hashRing: ConsistentHashRing
│       └─ currentLeaderID: string
│
├─ Commands
│   ├─ JoinCluster(nodeInfo)
│   ├─ MarkNodeFailed(nodeID)
│   ├─ AssignShards(shardMap)
│   └─ RebalanceShards(reason)
│
├─ Events
│   ├─ NodeJoined
│   ├─ NodeFailed
│   ├─ NodeLeft
│   ├─ ShardAssigned
│   ├─ ShardMigrated
│   └─ RebalancingTriggered
│
├─ Invariants Enforced
│   ├─ I2: All active shards have owners
│   ├─ I3: Shards only on healthy nodes
│   └─ I4: Assignments stable during lease
│
└─ Code Location: ClusterManager (cluster/manager.go)
```
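
As a rough illustration of this aggregate's shape, the Go sketch below models the root entity and its core state. It is a sketch only: the `NodeStatus` constants and the exact field types are assumptions drawn from this document, not the real `cluster` package API.

```go
// Illustrative sketch of the Cluster aggregate root; not the actual package API.
package cluster

import (
	"sync"
	"time"
)

// NodeStatus mirrors the node states used throughout this document
// (Active, Draining, Failed).
type NodeStatus int

const (
	NodeActive NodeStatus = iota
	NodeDraining
	NodeFailed
)

// NodeInfo describes one member of the cluster topology.
type NodeInfo struct {
	ID       string
	Status   NodeStatus
	Capacity int
	LastSeen time.Time
}

// Cluster is the aggregate root: it owns topology, shard assignments,
// and rebalancing decisions, guarded by a single mutex (see Concurrency Model).
type Cluster struct {
	mu              sync.RWMutex
	nodes           map[string]*NodeInfo
	currentLeaderID string
}
```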

### Aggregate 2: LeadershipLease (Root)
Owns leadership claim and ensures single leader per term.

```
LeadershipLease Aggregate
├─ Entities
│   └─ LeadershipLease (root)
│       ├─ leaderID: string
│       ├─ term: uint64
│       ├─ expiresAt: time.Time
│       └─ startedAt: time.Time
│
├─ Commands
│   ├─ ElectLeader(nodeID)
│   └─ RenewLeadership(nodeID)
│
├─ Events
│   ├─ LeaderElected
│   ├─ LeadershipRenewed
│   └─ LeadershipLost
│
├─ Invariants Enforced
│   ├─ I1: Single leader per term
│   └─ I5: Leader is active node
│
└─ Code Location: LeaderElection (cluster/leader.go)
```
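
A minimal sketch of how the lease entity can enforce I1 (single leader per term): only the current holder may renew, and a new term can only start once the previous lease has lapsed. The method names are illustrative assumptions.

```go
// Sketch of the LeadershipLease aggregate; field names follow the listing above.
package cluster

import "time"

type LeadershipLease struct {
	LeaderID  string
	Term      uint64
	StartedAt time.Time
	ExpiresAt time.Time
}

// Expired reports whether the lease can be claimed by another node.
func (l *LeadershipLease) Expired(now time.Time) bool {
	return now.After(l.ExpiresAt)
}

// Renew extends the lease, but only for the current holder of an unexpired lease.
func (l *LeadershipLease) Renew(nodeID string, now time.Time, ttl time.Duration) bool {
	if nodeID != l.LeaderID || l.Expired(now) {
		return false
	}
	l.ExpiresAt = now.Add(ttl)
	return true
}
```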

### Aggregate 3: ShardAssignment (Root)
Owns shard-to-node mappings and validates assignments.

```
ShardAssignment Aggregate
├─ Entities
│   └─ ShardAssignment (root)
│       ├─ version: uint64
│       ├─ assignments: Map[ShardID → []NodeID]
│       ├─ nodes: Map[NodeID → NodeInfo]
│       └─ updateTime: time.Time
│
├─ Commands
│   ├─ AssignShard(shardID, nodeList)
│   └─ RebalanceFromTopology(nodes)
│
├─ Events
│   ├─ ShardAssigned
│   └─ ShardMigrated
│
├─ Invariants Enforced
│   ├─ I2: All shards assigned
│   └─ I3: Only healthy nodes
│
└─ Code Location: ShardManager (cluster/shard.go)
```
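
A sketch of the versioned shard map at the heart of this aggregate. The `ShardID`/`NodeID` underlying types are assumptions; the point is that every applied change bumps `Version` so consumers can detect stale assignments.

```go
// Sketch of the ShardAssignment root entity; not the real shard.go types.
package cluster

import "time"

type (
	ShardID uint32
	NodeID  string
)

type ShardAssignment struct {
	Version     uint64
	Assignments map[ShardID][]NodeID
	UpdateTime  time.Time
}

// Apply replaces the shard map and bumps the version so stale maps can be rejected.
func (sa *ShardAssignment) Apply(next map[ShardID][]NodeID, now time.Time) {
	sa.Assignments = next
	sa.Version++
	sa.UpdateTime = now
}
```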

---

## Command Flow Diagrams

### Scenario 1: Node Joins Cluster

```
┌─────────┐  NodeJoined   ┌──────────────────┐
│New Node │──────────────▶│ClusterManager    │
└─────────┘               │.JoinCluster()    │
                          └────────┬─────────┘
                                   │
                     ┌─────────────┼─────────────┐
                     ▼             ▼             ▼
               ┌──────────┐  ┌──────────┐  ┌──────────┐
               │Validate  │  │Update    │  │Publish   │
               │ID unique │  │topology  │  │NodeJoined│
               │Capacity>0│  │hashRing  │  │event     │
               └────┬─────┘  └──────────┘  └──────────┘
                    │
               ┌────▼────────────────────────┐
               │Is this node leader?         │
               │If yes: trigger rebalance    │
               └─────────────────────────────┘
                           │
               ┌───────────┴───────────┐
               ▼                       ▼
       ┌──────────────────┐    ┌──────────────────┐
       │RebalanceShards   │    │(nothing)         │
       │ command          │    │                  │
       └──────────────────┘    └──────────────────┘
               │
               ▼
       ┌──────────────────────────────────┐
       │ConsistentHashPlacement           │
       │  .RebalanceShards()              │
       │  (compute new assignments)       │
       └────────────┬─────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────┐
       │ShardManager.AssignShards()       │
       │  (validate & apply)              │
       └────────────┬─────────────────────┘
                    │
             ┌──────┴────────────────────┐
             ▼                           ▼
       ┌──────────────┐        ┌─────────────────┐
       │For each      │        │Publish          │
       │shard moved   │        │ShardMigrated    │
       │              │        │event per shard  │
       └──────────────┘        └─────────────────┘
```
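
The sketch below walks the same flow in Go: validate, update topology, publish, and (leader only) trigger a rebalance. It builds on the `NodeInfo` sketch earlier in this document; the helper names (`hashRing.AddNode`, `publish`, `isLeader`) and the concrete NATS subject are assumptions, not the real `ClusterManager` methods.

```go
// Sketch of the join flow; assumes hypothetical helpers on ClusterManager.
func (cm *ClusterManager) JoinCluster(info *NodeInfo) error {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()

	// Validate: unique ID and positive capacity.
	if _, exists := cm.nodes[info.ID]; exists {
		return fmt.Errorf("node %s already joined", info.ID)
	}
	if info.Capacity <= 0 {
		return fmt.Errorf("node %s has no capacity", info.ID)
	}

	// Update topology and hash ring.
	cm.nodes[info.ID] = info
	cm.hashRing.AddNode(info.ID) // assumed ConsistentHashRing helper

	// Publish NodeJoined so other members learn about the new node.
	cm.publish("aether.cluster.node.joined", info) // subject name is illustrative

	// Only the leader reshuffles shards.
	if cm.isLeader() {
		go cm.triggerShardRebalancing("node joined")
	}
	return nil
}
```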

### Scenario 2: Node Failure Detected

```
┌──────────────────────┐
│Heartbeat timeout     │
│(LastSeen > 90s ago)  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────────────┐
│ClusterManager                │
│.MarkNodeFailed()             │
│ ├─ Mark status=Failed        │
│ ├─ Remove from hashRing      │
│ └─ Publish NodeFailed event  │
└────────────┬─────────────────┘
             │
    ┌────────▼────────────┐
    │Is this node leader? │
    └────────┬────────────┘
             │
    ┌────────┴─────────────────┐
    │ YES                      │ NO
    ▼                          ▼
┌──────────────┐      ┌──────────────────┐
│Trigger       │      │(nothing)         │
│Rebalance     │      │                  │
└──────────────┘      └──────────────────┘
    │
    └─▶ [Same as Scenario 1 from RebalanceShards]
```
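
A minimal sketch of the detection check that feeds this flow. `checkNodeHealth` and `nodeFailureTimeout` are named in this document; the loop body and the `markNodeFailedLocked` helper are assumptions.

```go
// Sketch of heartbeat-timeout detection; runs from the monitorNodes ticker.
func (cm *ClusterManager) checkNodeHealth() {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()

	now := time.Now()
	for id, node := range cm.nodes {
		if node.Status == NodeFailed {
			continue
		}
		// A node is considered failed once no heartbeat has been seen
		// for nodeFailureTimeout (90s).
		if now.Sub(node.LastSeen) > nodeFailureTimeout {
			// Hypothetical lock-held variant: marks Failed, updates the
			// hash ring, and publishes the NodeFailed event.
			cm.markNodeFailedLocked(id)
		}
	}
}
```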

### Scenario 3: Leader Election (Implicit)

```
┌─────────────────────────────────────┐
│All nodes: electionLoop runs every 2s│
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────┐
    │Am I leader?         │
    └────────┬────────────┘
             │
    ┌────────┴──────────────────────────┐
    │ YES                               │ NO
    ▼                                   ▼
┌──────────────┐        ┌─────────────────────────┐
│Do nothing    │        │Should try election?     │
│(already      │        │ ├─ No leader exists     │
│leading)      │        │ ├─ Lease expired        │
└──────────────┘        │ └─ (other conditions)   │
                        └────────┬────────────────┘
                                 │
                       ┌─────────▼──────────┐
                       │try AtomicCreate    │
                       │"leader" key in KV  │
                       └────────┬───────────┘
                                │
                  ┌─────────────┴──────────────┐
                  │ SUCCESS                    │ FAILED
                  ▼                            ▼
        ┌──────────────────┐        ┌──────────────────┐
        │Became Leader!    │        │Try claim expired │
        │Publish           │        │lease; if success,│
        │LeaderElected     │        │become leader     │
        └──────────────────┘        │Else: stay on     │
                  │                 │bench             │
                  ▼                 └──────────────────┘
        ┌──────────────────┐
        │Start lease       │
        │renewal loop      │
        │(every 3s)        │
        └──────────────────┘
```
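
A sketch of one election attempt. To avoid guessing the exact NATS KeyValue client API, the bucket is hidden behind a tiny interface whose `Create` fails if the key already exists (the "AtomicCreate" in the diagram); the `LeaderElection` fields used here are assumptions.

```go
// leaderKV abstracts the aether-leader-election KV bucket for this sketch.
type leaderKV interface {
	// Create succeeds only if the key does not exist yet (atomic claim).
	Create(key string, value []byte) error
}

// tryBecomeLeader is a hypothetical single election attempt.
func (le *LeaderElection) tryBecomeLeader(kv leaderKV) bool {
	// The stored value only needs to identify the holder; the real payload
	// (term, expiry) is glossed over here.
	if err := kv.Create("leader", []byte(le.nodeID)); err != nil {
		// Key already exists: someone else holds (or held) the lease.
		// The real loop would next try to claim an expired lease.
		return false
	}
	le.term++
	// On success: publish LeaderElected and start the 3s renewal loop.
	return true
}
```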

---

## Decision Trees

### Decision 1: Is Node Healthy?

```
Query: Is Node X healthy?

┌─────────────────────────────────────┐
│Get node status from Cluster.nodes   │
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────────┐
    │Check node.Status field  │
    └────────┬────────────────┘
             │
    ┌────────┴───────────────┬─────────────────────┬──────────────┐
    │                        │                     │              │
    ▼                        ▼                     ▼              ▼
┌────────┐          ┌────────────┐        ┌──────────────┐   ┌─────────┐
│Active  │          │Draining    │        │Failed        │   │Unknown  │
├────────┤          ├────────────┤        ├──────────────┤   ├─────────┤
│✓Healthy│          │⚠ Draining  │        │✗ Unhealthy   │   │✗ Error  │
│Can host│          │Should not  │        │Don't use for │   │         │
│shards  │          │get new     │        │sharding      │   │         │
└────────┘          │shards, but │        │Delete shards │   └─────────┘
                    │existing OK │        │ASAP          │
                    └────────────┘        └──────────────┘
```
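
The same decision expressed as predicates on the `NodeStatus` constants from the earlier sketch; the method names are illustrative, not the package's real API.

```go
// IsHealthy: only Active nodes count as healthy.
func (s NodeStatus) IsHealthy() bool {
	return s == NodeActive
}

// CanHostNewShards: Active may take new shards; Draining and Failed may not.
func (s NodeStatus) CanHostNewShards() bool {
	return s == NodeActive
}

// CanKeepExistingShards: Draining nodes keep what they have; Failed nodes keep nothing.
func (s NodeStatus) CanKeepExistingShards() bool {
	return s == NodeActive || s == NodeDraining
}
```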

### Decision 2: Should This Node Rebalance Shards?

```
Command: RebalanceShards(nodeID, reason)

┌──────────────────────────────────┐
│Is nodeID the current leader?     │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NotLeader     │
│rebalancing │     │                      │
└─────┬──────┘     │Only leader can       │
      │            │initiate rebalancing  │
      │            └──────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│Are there active nodes?              │
└────────┬────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NoActiveNodes │
│rebalancing │     │                      │
└─────┬──────┘     │Can't assign shards   │
      │            │with no healthy nodes │
      │            └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────────┐
│Call PlacementStrategy.RebalanceShards()  │
│  (compute new assignments)               │
└────────┬─────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│Call ShardManager.AssignShards()          │
│  (validate & apply new assignments)      │
└────────┬─────────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ SUCCESS               │ FAILURE
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Publish     │     │Publish               │
│Shard       │     │RebalancingFailed     │
│Migrated    │     │event                 │
│events      │     │                      │
│            │     │Log error, backoff    │
│Publish     │     │try again in 5 min    │
│Rebalancing │     │                      │
│Completed   │     │                      │
└────────────┘     └──────────────────────┘
```
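
A sketch of the two guard checks at the top of this tree. The sentinel errors, the `activeNodesLocked` helper, and the exact wiring between `ClusterManager`, `PlacementStrategy`, and `ShardManager` are assumptions for illustration.

```go
var (
	errNotLeader     = errors.New("rebalance rejected: not the leader")
	errNoActiveNodes = errors.New("rebalance rejected: no active nodes")
)

// RebalanceShards sketch: guard, compute, apply, publish.
func (cm *ClusterManager) RebalanceShards(nodeID, reason string) error {
	cm.mutex.RLock()
	leader := cm.currentLeaderID
	active := cm.activeNodesLocked() // hypothetical: nodes with Status == Active
	cm.mutex.RUnlock()

	if nodeID != leader {
		return errNotLeader
	}
	if len(active) == 0 {
		return errNoActiveNodes
	}

	// Compute new assignments, then validate & apply them.
	newMap := cm.shardManager.placement.RebalanceShards(active)
	if err := cm.shardManager.AssignShards(newMap); err != nil {
		cm.publish("aether.cluster.rebalancing.failed", reason) // subject illustrative
		return err
	}
	cm.publish("aether.cluster.rebalancing.completed", reason)
	return nil
}
```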

### Decision 3: Can We Assign This Shard to This Node?

```
Command: AssignShard(shardID, nodeID)

┌──────────────────────────────────┐
│Is nodeID in Cluster.nodes?       │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NodeNotFound  │
│assignment  │     │                      │
└─────┬──────┘     │Can't assign shard    │
      │            │to non-existent node  │
      │            └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────┐
│Check node.Status                     │
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ Active or Draining    │ Failed
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: UnhealthyNode │
│assignment  │     │                      │
└─────┬──────┘     │Can't assign to       │
      │            │failed node; it can't │
      │            │execute shards        │
      │            └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────┐
│Check replication factor              │
│ (existing nodes < replication limit?)│
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌───────────────────────┐
│ACCEPT      │     │REJECT: TooManyReplicas│
│Add node to │     │                       │
│shard's     │     │Already have max       │
│replica     │     │replicas for shard     │
│list        │     │                       │
└────────────┘     └───────────────────────┘
```
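
The same checks as a Go sketch on the `ShardAssignment` type from the earlier sketch, using the standard `errors` package. The `replicationFactor` parameter and error strings are assumptions.

```go
// AssignShard sketch: reject missing nodes, failed nodes, and over-replication.
func (sa *ShardAssignment) AssignShard(shard ShardID, node *NodeInfo, replicationFactor int) error {
	if node == nil {
		return errors.New("reject: NodeNotFound")
	}
	if node.Status == NodeFailed {
		return errors.New("reject: UnhealthyNode")
	}
	replicas := sa.Assignments[shard]
	if len(replicas) >= replicationFactor {
		return errors.New("reject: TooManyReplicas")
	}
	sa.Assignments[shard] = append(replicas, NodeID(node.ID))
	sa.Version++
	return nil
}
```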

---

## State Transitions

### Cluster State Machine

```
┌────────────────┐
│ Initializing   │
│ (no nodes)     │
└────────┬───────┘
         │ NodeJoined
         ▼
┌────────────────┐
│ Single Node    │
│ (one node only)│
└────────┬───────┘
         │ NodeJoined
         ▼
┌────────────────────────────────────────────┐
│ Multi-Node Cluster                         │
│ ├─ Stable (healthy nodes, shards assigned) │
│ ├─ Rebalancing (shards moving)             │
│ └─ Degraded (failed node waiting for heal) │
└────────────┬───────────────────────────────┘
             │ (All nodes left or failed)
             ▼
┌────────────────┐
│ No Nodes       │
│ (cluster dead) │
└────────────────┘
```

### Node State Machine (per node)

```
┌────────────────┐
│ Discovered     │
│ (new heartbeat)│
└────────┬───────┘
         │ JoinCluster
         ▼
┌────────────────┐
│ Active         │
│ (healthy,      │
│  can host      │
│  shards)       │
└────────┬───────┘
         │
    ┌────┴──────────────────┬─────────────────┐
    │                       │                 │
    │ (graceful)            │ (heartbeat miss)│
    ▼                       ▼                 ▼
┌────────────┐    ┌────────────────┐  ┌────────────────┐
│ Draining   │    │ Failed         │  │ Failed         │
│ (stop new  │    │ (timeout: 90s) │  │ (detected)     │
│  shards,   │    │                │  │ (admin/health) │
│  preserve  │    │ Rebalance      │  │                │
│  existing) │    │ shards ASAP    │  │ Rebalance      │
│            │    │                │  │ shards ASAP    │
│            │    │ Recover?       │  │                │
│            │    │  ├─ Yes:       │  │ Recover?       │
│            │    │  │   → Active  │  │  ├─ Yes:       │
│            │    │  └─ No:        │  │  │   → Active  │
│            │    │      → Deleted │  │  └─ No:        │
│            │    │                │  │      → Deleted │
│            │    └────────────────┘  └────────────────┘
└────┬───────┘
     │ Removed
     ▼
┌────────────────┐
│ Deleted        │
│ (left cluster) │
└────────────────┘
```
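
One compact way to encode this state machine is a transition table, sketched below with its own string-typed states; whether the real code models Discovered and Deleted as distinct statuses is an assumption.

```go
// Sketch: node lifecycle as a transition table.
type nodeState string

const (
	stateDiscovered nodeState = "discovered"
	stateActive     nodeState = "active"
	stateDraining   nodeState = "draining"
	stateFailed     nodeState = "failed"
	stateDeleted    nodeState = "deleted"
)

var validNodeTransitions = map[nodeState][]nodeState{
	stateDiscovered: {stateActive},
	stateActive:     {stateDraining, stateFailed},
	stateDraining:   {stateDeleted},
	stateFailed:     {stateActive, stateDeleted}, // recover or remove
}

// canTransition reports whether the state machine above allows from → to.
func canTransition(from, to nodeState) bool {
	for _, next := range validNodeTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```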

### Leadership State Machine (per node)

```
┌──────────────────┐
│ Not a Leader     │
│ (waiting)        │
└────────┬─────────┘
         │ Try Election (every 2s)
         │ Atomic create "leader" succeeds
         ▼
┌──────────────────┐
│ Candidate        │
│ (won election)   │
└────────┬─────────┘
         │ Start lease renewal loop
         ▼
┌──────────────────┐
│ Leader           │
│ (holding lease)  │
└────────┬─────────┘
         │
  ┌──────┴───────────┬──────────────────────┐
  │                  │                      │
  │ Renew lease      │ Lease expires        │ Graceful
  │ (every 3s)       │ (10s timeout)        │ shutdown
  │ ✓ Success        │ ✗ Failure            │
  ▼                  ▼                      ▼
[stays]    ┌──────────────────┐   ┌──────────────────┐
           │ Lost Leadership  │   │ Lost Leadership  │
           │ (lease expired)  │   │ (graceful)       │
           └────────┬─────────┘   └────────┬─────────┘
                    │                      │
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────┐
                    │ Not a Leader     │
                    │ (back to bench)  │
                    └──────────────────┘
                               │
                               └─▶ [Back to top]
```

---

## Concurrency Model

### Thread Safety

All aggregates use `sync.RWMutex` for thread safety:

```go
type ClusterManager struct {
    mutex    sync.RWMutex         // Protects access to:
    nodes    map[string]*NodeInfo // - nodes
    shardMap *ShardMap            // - shardMap
    // ...
}

// Read operation (multiple goroutines)
func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo {
    cm.mutex.RLock() // Shared lock
    defer cm.mutex.RUnlock()
    // ...
}

// Write operation (exclusive)
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
    cm.mutex.Lock() // Exclusive lock
    defer cm.mutex.Unlock()
    // ... (only one writer at a time)
}
```

### Background Goroutines

```
┌─────────────────────────────────────────────────┐
│ DistributedVM.Start()                           │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │ ClusterManager.Start()                   │   │
│  │  ├─ go election.Start()                  │   │
│  │  │   ├─ go electionLoop()      [2s]      │   │
│  │  │   ├─ go leaseRenewalLoop()  [3s]      │   │
│  │  │   └─ go monitorLeadership() [watcher] │   │
│  │  │                                       │   │
│  │  ├─ go monitorNodes()   [ticker: 30s]    │   │
│  │  └─ go rebalanceLoop()  [ticker: 5m]     │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │ NodeDiscovery.Start()                    │   │
│  │  ├─ go announceNode()   [ticker: 30s]    │   │
│  │  └─ Subscribe to "aether.discovery"      │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │ NATS subscriptions                       │   │
│  │  ├─ "aether.cluster.*" → messages        │   │
│  │  ├─ "aether.discovery" → node updates    │   │
│  │  └─ "aether-leader-election" → KV watch  │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
└─────────────────────────────────────────────────┘
```
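
A sketch of how these ticker-driven loops can be wired up with a context for clean shutdown. The interval constants match the diagram; the `runEvery` helper and the per-tick method names are assumptions.

```go
// Start launches the background loops; each stops when ctx is cancelled.
func (cm *ClusterManager) Start(ctx context.Context) {
	go cm.runEvery(ctx, 2*time.Second, cm.electionTick)     // electionLoop
	go cm.runEvery(ctx, 3*time.Second, cm.renewLeaseTick)   // leaseRenewalLoop
	go cm.runEvery(ctx, 30*time.Second, cm.checkNodeHealth) // monitorNodes
	go cm.runEvery(ctx, 5*time.Minute, cm.rebalanceTick)    // rebalanceLoop
}

// runEvery calls fn on every tick until the context is done.
func (cm *ClusterManager) runEvery(ctx context.Context, every time.Duration, fn func()) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fn()
		}
	}
}
```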

---

## Event Sequences

### Event: Node Join, Rebalance, Shard Assignment

```
Timeline:

T=0s
├─ Node-4 joins cluster
├─ NodeDiscovery announces NodeJoined via NATS
└─ ClusterManager receives and processes

T=0.1s
├─ ClusterManager.JoinCluster() executes
├─ Updates nodes map, hashRing
├─ Publishes NodeJoined event
└─ If leader: triggers rebalancing

T=0.5s (if leader)
├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing)
├─ PlacementStrategy.RebalanceShards() computes new assignments
└─ ShardManager.AssignShards() applies new assignments

T=1.0s
├─ Publishes ShardMigrated events (one per shard moved)
├─ All nodes subscribe to these events
├─ Each node's routing table is updated
└─ Actors aware of new shard locations

T=1.5s onwards
├─ Actors on moved shards migrated (application layer)
├─ Actor Runtime subscribes to ShardMigrated
├─ Triggers actor migration via ActorMigration
└─ Eventually: rebalancing complete

T=5m
├─ Periodic rebalance check (5m timer)
├─ If no changes: no-op
└─ If imbalance detected: trigger rebalance again
```
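
A sketch of how a node could consume the ShardMigrated events published around T=1.0s, using the nats.go client. The concrete subject name and event payload are assumptions; the document only specifies the `aether.cluster.*` subject space.

```go
// Sketch: react to ShardMigrated events and refresh the local routing table.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// shardMigratedEvent is an assumed payload shape for illustration only.
type shardMigratedEvent struct {
	ShardID  uint32 `json:"shard_id"`
	FromNode string `json:"from_node"`
	ToNode   string `json:"to_node"`
	Version  uint64 `json:"version"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Update local state whenever a shard moves.
	_, err = nc.Subscribe("aether.cluster.shard.migrated", func(m *nats.Msg) {
		var ev shardMigratedEvent
		if err := json.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("bad ShardMigrated payload: %v", err)
			return
		}
		log.Printf("shard %d moved %s → %s (map version %d)",
			ev.ShardID, ev.FromNode, ev.ToNode, ev.Version)
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
```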

### Event: Node Failure Detection and Recovery

```
Timeline:

T=0s
├─ Node-2 healthy, last heartbeat received
└─ Node-2.LastSeen = now

T=30s
├─ Node-2 healthcheck runs (every 30s timer)
├─ Publishes heartbeat
└─ Node-2.LastSeen updated

T=60s
├─ (Node-2 still healthy)
└─ Heartbeat received, LastSeen updated

T=65s
├─ Node-2 CRASH (network failure or process crash)
├─ No more heartbeats sent
└─ Node-2.LastSeen = 60s

T=150s (timeout)
├─ ClusterManager.checkNodeHealth() detects timeout
├─ now - LastSeen (60s) > 90s → mark node Failed
├─ ClusterManager.MarkNodeFailed() executes
├─ Publishes NodeFailed event
├─ If leader: triggers rebalancing
└─ (If not leader: waits for leader to rebalance)

T=151s (if leader)
├─ RebalanceShards triggered
├─ PlacementStrategy computes new topology without Node-2
├─ ShardManager.AssignShards() reassigns shards
└─ Publishes ShardMigrated events

T=152s onwards
├─ Actors migrated from Node-2 to healthy nodes
└─ No actor loss (assuming replication or migration succeeded)

T=180s (Node-2 recovery)
├─ Node-2 process restarts
├─ NodeDiscovery announces NodeJoined again
├─ Status: Active
└─ (Back to Node Join sequence if leader decides)
```

---

## Configuration & Tuning

### Key Parameters

```go
// From cluster/types.go
const (
    // LeaderLeaseTimeout: how long before leader must renew
    LeaderLeaseTimeout = 10 * time.Second

    // HeartbeatInterval: how often leader renews
    HeartbeatInterval = 3 * time.Second

    // ElectionTimeout: how often nodes try election
    ElectionTimeout = 2 * time.Second

    // Node failure detection (in manager.go)
    nodeFailureTimeout = 90 * time.Second
)

// From cluster/types.go
const (
    // DefaultNumShards: total shards in cluster
    DefaultNumShards = 1024

    // DefaultVirtualNodes: per-node virtual replicas for distribution
    DefaultVirtualNodes = 150
)
```

### Tuning Guide

| Parameter | Current | Rationale | Trade-off |
|-----------|---------|-----------|-----------|
| LeaderLeaseTimeout | 10s | Fast failure detection | May cause thrashing in high-latency networks |
| HeartbeatInterval | 3s | Leader alive signal every 3s | ~3 renewals per 10s lease window |
| ElectionTimeout | 2s | Retry elections frequently | CPU cost, but quick recovery |
| NodeFailureTimeout | 90s | 3x the 30s node heartbeat interval | Tolerance for temporary network issues |
| DefaultNumShards | 1024 | Good granularity for large clusters | More shards = more metadata |
| DefaultVirtualNodes | 150 | Balance between distribution and overhead | Lower = worse distribution; higher = more ring operations |

---

## Failure Scenarios & Recovery

### Scenario A: Single Node Fails

```
Before: [A (leader), B, C, D] with 1024 shards
├─ A: 256 shards (+ leader)
├─ B: 256 shards
├─ C: 256 shards
└─ D: 256 shards

B crashes (no recovery) → waits 90s → marked Failed

After Rebalance:
[A (leader), C, D] with 1024 shards
├─ A: 341 shards (+ leader)
├─ C: 341 shards
└─ D: 342 shards
```

**Recovery:** Only B's 256 shards (~1/4 of the total) are reshuffled; consistent hashing with virtual nodes keeps the remaining assignments in place.

---

### Scenario B: Leader Fails

```
Before: [A (leader), B, C, D]

A crashes → waits 90s → marked Failed
         no lease renewal → lease expires after 10s

B, C, or D wins election → new leader
                         → triggers rebalance
                         → reshuffles A's shards

After: [B (leader), C, D]
```

**Recovery:** New leader elected within ~10s; rebalancing completes within ~100s; no loss if replicas are present

---

### Scenario C: Network Partition

```
Before: [A (leader), B, C, D]
Partition: {A} isolated | {B, C, D} connected

At T=10s (lease expires):
├─ A: can't reach NATS, can't renew → loses leadership
├─ B, C, D: A's lease expired, one wins election
└─ New leader coordinates rebalance

Risk: If A can reach NATS (just isolated from the app), it might try to renew,
      but the atomic update fails because of a term mismatch

Result: Single leader maintained; no split-brain
```

---

## Monitoring & Observability

### Key Metrics to Track

```
# Cluster Topology
gauge:   cluster.nodes.count                      [active|draining|failed]
gauge:   cluster.shards.assigned                  [0, 1024]
gauge:   cluster.shards.orphaned                  [0, 1024]

# Leadership
gauge:   cluster.leader.is_leader                 [0, 1]
gauge:   cluster.leader.term                      [0, ∞]
gauge:   cluster.leader.lease_expires_in_seconds  [0, 10]

# Rebalancing
counter: cluster.rebalancing.triggered            [reason]
gauge:   cluster.rebalancing.active               [0, 1]
counter: cluster.rebalancing.completed            [shards_moved]
counter: cluster.rebalancing.failed               [reason]

# Node Health
gauge:   cluster.node.heartbeat_latency_ms        [per node]
gauge:   cluster.node.load                        [per node]
gauge:   cluster.node.vm_count                    [per node]
counter: cluster.node.failures                    [reason]
```

### Alerts

```
- Leader heartbeat missing > 5s    → election may be stuck
- Rebalancing in progress > 5 min  → something is wrong
- Orphaned shards > 0              → invariant violation
- Node failures > 50% of cluster   → investigate
```

---

## References

- [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Full domain model
- [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation roadmap
- [manager.go](./manager.go) - ClusterManager implementation
- [leader.go](./leader.go) - LeaderElection implementation
- [shard.go](./shard.go) - ShardManager implementation
- [discovery.go](./discovery.go) - NodeDiscovery implementation
- [distributed.go](./distributed.go) - DistributedVM orchestrator