Cluster Coordination: Architecture Reference
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│ Aether Cluster Runtime │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ DistributedVM (Orchestrator - not an aggregate) │ │
│ │ ├─ Local Runtime (executes actors) │ │
│ │ ├─ NodeDiscovery (heartbeat → cluster awareness) │ │
│ │ ├─ ClusterManager (Cluster aggregate root) │ │
│ │ │ ├─ nodes: Map[ID → NodeInfo] │ │
│ │ │ ├─ shardMap: ShardMap (current assignments) │ │
│ │ │ ├─ hashRing: ConsistentHashRing (util) │ │
│ │ │ └─ election: LeaderElection │ │
│ │ └─ ShardManager (ShardAssignment aggregate) │ │
│ │ ├─ shardCount: int │ │
│ │ ├─ shardMap: ShardMap │ │
│ │ └─ placement: PlacementStrategy │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ NATS │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ NATS Cluster │ │
│ │ ├─ Subject: aether.discovery (heartbeats) │ │
│ │ ├─ Subject: aether.cluster.* (messages) │ │
│ │ └─ KeyValue: aether-leader-election (lease) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Aggregate Boundaries
Aggregate 1: Cluster (Root)
Owns node topology, shard assignments, and rebalancing decisions.
Cluster Aggregate
├─ Entities
│ └─ Cluster (root)
│ ├─ nodes: Map[NodeID → NodeInfo]
│ ├─ shardMap: ShardMap
│ ├─ hashRing: ConsistentHashRing
│ └─ currentLeaderID: string
│
├─ Commands
│ ├─ JoinCluster(nodeInfo)
│ ├─ MarkNodeFailed(nodeID)
│ ├─ AssignShards(shardMap)
│ └─ RebalanceShards(reason)
│
├─ Events
│ ├─ NodeJoined
│ ├─ NodeFailed
│ ├─ NodeLeft
│ ├─ ShardAssigned
│ ├─ ShardMigrated
│ └─ RebalancingTriggered
│
├─ Invariants Enforced
│ ├─ I2: All active shards have owners
│ ├─ I3: Shards only on healthy nodes
│ └─ I4: Assignments stable during lease
│
└─ Code Location: ClusterManager (cluster/manager.go)
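A minimal Go sketch of this aggregate's shape and its JoinCluster command path; the field layout mirrors the outline above, while EventPublisher, Capacity, AddNode, and the event struct are illustrative assumptions rather than the actual cluster/manager.go API:
// Illustrative only -- the real aggregate root lives in cluster/manager.go.
type ClusterManager struct {
	mutex           sync.RWMutex
	nodes           map[string]*NodeInfo // NodeID -> NodeInfo
	shardMap        *ShardMap            // current assignments
	hashRing        *ConsistentHashRing
	currentLeaderID string
	events          EventPublisher // assumed event-publishing abstraction
}
// JoinCluster: validate the command, mutate state, publish the domain event.
func (cm *ClusterManager) JoinCluster(node *NodeInfo) error {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()
	if _, exists := cm.nodes[node.ID]; exists {
		return fmt.Errorf("node %s already joined", node.ID)
	}
	if node.Capacity <= 0 {
		return fmt.Errorf("node %s reports no capacity", node.ID)
	}
	cm.nodes[node.ID] = node
	cm.hashRing.AddNode(node.ID)
	cm.events.Publish(NodeJoined{NodeID: node.ID, At: time.Now()})
	return nil
}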
Aggregate 2: LeadershipLease (Root)
Owns leadership claim and ensures single leader per term.
LeadershipLease Aggregate
├─ Entities
│ └─ LeadershipLease (root)
│ ├─ leaderID: string
│ ├─ term: uint64
│ ├─ expiresAt: time.Time
│ └─ startedAt: time.Time
│
├─ Commands
│ ├─ ElectLeader(nodeID)
│ └─ RenewLeadership(nodeID)
│
├─ Events
│ ├─ LeaderElected
│ ├─ LeadershipRenewed
│ └─ LeadershipLost
│
├─ Invariants Enforced
│ ├─ I1: Single leader per term
│ └─ I5: Leader is active node
│
└─ Code Location: LeaderElection (cluster/leader.go)
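A hedged sketch of the lease entity and its renewal guard, reusing LeaderLeaseTimeout from the configuration section; method names are illustrative, not the cluster/leader.go API:
// Illustrative sketch; the real logic lives in cluster/leader.go.
type LeadershipLease struct {
	LeaderID  string
	Term      uint64
	StartedAt time.Time
	ExpiresAt time.Time
}
// Renew extends the lease for the current leader only (invariant I1).
func (l *LeadershipLease) Renew(nodeID string, now time.Time) error {
	if nodeID != l.LeaderID {
		return fmt.Errorf("node %s is not the leader for term %d", nodeID, l.Term)
	}
	l.ExpiresAt = now.Add(LeaderLeaseTimeout)
	return nil
}
// Expired reports whether a new election may claim the lease.
func (l *LeadershipLease) Expired(now time.Time) bool {
	return now.After(l.ExpiresAt)
}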
Aggregate 3: ShardAssignment (Root)
Owns shard-to-node mappings and validates assignments.
ShardAssignment Aggregate
├─ Entities
│ └─ ShardAssignment (root)
│ ├─ version: uint64
│ ├─ assignments: Map[ShardID → []NodeID]
│ ├─ nodes: Map[NodeID → NodeInfo]
│ └─ updateTime: time.Time
│
├─ Commands
│ ├─ AssignShard(shardID, nodeList)
│ └─ RebalanceFromTopology(nodes)
│
├─ Events
│ ├─ ShardAssigned
│ └─ ShardMigrated
│
├─ Invariants Enforced
│ ├─ I2: All shards assigned
│ └─ I3: Only healthy nodes
│
└─ Code Location: ShardManager (cluster/shard.go)
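An illustrative sketch of the assignment root plus a helper that surfaces violations of invariant I2; field and type names are assumptions, not the cluster/shard.go definitions:
// Illustrative sketch; the real ShardManager lives in cluster/shard.go.
type ShardAssignment struct {
	Version     uint64
	Assignments map[uint32][]string  // ShardID -> replica node IDs
	Nodes       map[string]*NodeInfo // known topology
	UpdateTime  time.Time
}
// orphanedShards returns shards with no owner (would violate invariant I2).
func (sa *ShardAssignment) orphanedShards() []uint32 {
	var orphaned []uint32
	for shard, owners := range sa.Assignments {
		if len(owners) == 0 {
			orphaned = append(orphaned, shard)
		}
	}
	return orphaned
}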
Command Flow Diagrams
Scenario 1: Node Joins Cluster
┌─────────┐ NodeJoined ┌──────────────────┐
│New Node │─────────────▶│ClusterManager │
└─────────┘ │.JoinCluster() │
└────────┬─────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│Validate │ │Update │ │Publish │
│ID unique │ │topology │ │NodeJoined│
│Capacity>0│ │hashRing  │ │event     │
└────┬─────┘ └──────────┘ └──────────┘
│
┌────▼────────────────────────┐
│Is this node leader? │
│If yes: trigger rebalance │
└─────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│RebalanceShards │ │(nothing) │
│ command │ │ │
└──────────────────┘ └──────────────────┘
│
▼
┌──────────────────────────────────┐
│ConsistentHashPlacement │
│ .RebalanceShards() │
│ (compute new assignments) │
└────────────┬─────────────────────┘
│
▼
┌──────────────────────────────────┐
│ShardManager.AssignShards() │
│ (validate & apply) │
└────────────┬─────────────────────┘
│
┌────┴──────────────────────┐
▼ ▼
┌──────────────┐ ┌─────────────────┐
│For each │ │Publish │
│shard moved │ │ShardMigrated │
│ │ │event per shard │
└──────────────┘ └─────────────────┘
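The last step above publishes one ShardMigrated event per moved shard. A sketch of how that diff could be computed; ShardMap.Assignments and EventPublisher are assumed shapes, not the actual types:
// Diff old vs new assignments and publish one ShardMigrated event per shard
// whose primary owner changed.
func publishMigrations(oldMap, newMap *ShardMap, events EventPublisher) {
	for shardID, newOwners := range newMap.Assignments {
		oldOwners := oldMap.Assignments[shardID]
		if len(oldOwners) == 0 || len(newOwners) == 0 {
			continue // newly created or unassigned shard: not a migration
		}
		if oldOwners[0] != newOwners[0] {
			events.Publish(ShardMigrated{ShardID: shardID, From: oldOwners[0], To: newOwners[0]})
		}
	}
}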
Scenario 2: Node Failure Detected
┌──────────────────────┐
│Heartbeat timeout │
│(LastSeen > 90s) │
└──────────┬───────────┘
│
▼
┌──────────────────────────────┐
│ClusterManager │
│.MarkNodeFailed() │
│ ├─ Mark status=Failed │
│ ├─ Remove from hashRing │
│ └─ Publish NodeFailed event │
└────────────┬─────────────────┘
│
┌────────▼────────────┐
│Is this node leader? │
└────────┬────────────┘
│
┌────────┴─────────────────┐
│ YES │ NO
▼ ▼
┌──────────────┐ ┌──────────────────┐
│Trigger │ │(nothing) │
│Rebalance │ │ │
└──────────────┘ └──────────────────┘
│
└─▶ [Same as Scenario 1 from RebalanceShards]
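A sketch of the failure-handling path above; the status constant, RemoveNode, isLeader, localNodeID, and the publisher are assumed names, and the real logic lives in cluster/manager.go:
func (cm *ClusterManager) MarkNodeFailed(nodeID string) error {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()
	node, ok := cm.nodes[nodeID]
	if !ok {
		return fmt.Errorf("unknown node %s", nodeID)
	}
	node.Status = NodeStatusFailed // stop considering it for placement
	cm.hashRing.RemoveNode(nodeID)
	cm.events.Publish(NodeFailed{NodeID: nodeID, At: time.Now()})
	// Only the leader reacts by rebalancing; do it outside this lock (Decision 2).
	if cm.isLeader() {
		go cm.RebalanceShards(cm.localNodeID, "node-failed")
	}
	return nil
}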
Scenario 3: Leader Election (Implicit)
┌─────────────────────────────────────┐
│All nodes: electionLoop runs every 2s│
└────────────┬────────────────────────┘
│
┌────────▼────────────┐
│Am I leader? │
└────────┬────────────┘
│
┌────────┴──────────────────────────┐
│ YES │ NO
▼ ▼
┌──────────────┐ ┌─────────────────────────┐
│Do nothing │ │Should try election? │
│(already │ │ ├─ No leader exists │
│leading) │ │ ├─ Lease expired │
└──────────────┘ │ └─ (other conditions) │
└────────┬────────────────┘
│
┌─────────▼──────────┐
│try AtomicCreate │
│"leader" key in KV │
└────────┬───────────┘
│
┌─────────────┴──────────────┐
│ SUCCESS │ FAILED
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│Became Leader! │ │Try claim expired │
│Publish │ │lease; if success,│
│LeaderElected │ │become leader │
└──────────────────┘ │Else: stay on │
│ │bench │
▼ └──────────────────┘
┌──────────────────┐
│Start lease │
│renewal loop │
│(every 3s) │
└──────────────────┘
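A sketch of one election attempt against the aether-leader-election bucket using the nats.go JetStream KeyValue API (Create, Get, Update are real calls); the lease-expiry check via entry.Created() is a simplification, and the real implementation may encode leaderID, term, and expiry in the stored value instead:
func tryElection(kv nats.KeyValue, nodeID string) (becameLeader bool, err error) {
	// Atomic create: fails if "leader" already exists, so the first node wins.
	if _, err := kv.Create("leader", []byte(nodeID)); err == nil {
		return true, nil // won the election; start the renewal loop (every 3s)
	}
	// Key exists: only try to claim it if the current lease looks expired.
	entry, err := kv.Get("leader")
	if err != nil {
		return false, err
	}
	if time.Since(entry.Created()) < LeaderLeaseTimeout {
		return false, nil // healthy leader elsewhere; stay on the bench
	}
	// Compare-and-swap on the last revision so only one claimant can succeed.
	if _, err := kv.Update("leader", []byte(nodeID), entry.Revision()); err != nil {
		return false, nil // another node claimed the expired lease first
	}
	return true, nil
}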
Decision Trees
Decision 1: Is Node Healthy?
Query: Is Node X healthy?
┌─────────────────────────────────────┐
│Get node status from Cluster.nodes │
└────────────┬────────────────────────┘
│
┌────────▼────────────────┐
│Check node.Status field │
└────────┬────────────────┘
│
┌────────┴───────────────┬─────────────────────┬──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────┐
│Active │ │Draining │ │Failed │ │Unknown │
├────────┤ ├────────────┤ ├──────────────┤ ├─────────┤
│✓Healthy│ │⚠ Draining │ │✗ Unhealthy │ │✗ Error │
│Can host│ │Should not │ │Don't use for │ │ │
│shards │ │get new │ │sharding │ │ │
└────────┘ │shards, but │ │Delete shards │ └─────────┘
│existing OK │ │ASAP │
└────────────┘ └──────────────┘
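The same decision as a small Go helper; the NodeStatus constant names are assumptions:
func canHostNewShards(node *NodeInfo) bool {
	switch node.Status {
	case NodeStatusActive:
		return true // healthy: can receive new shards
	case NodeStatusDraining:
		return false // keeps existing shards, gets no new ones
	case NodeStatusFailed:
		return false // shards must be moved off ASAP
	default:
		return false // unknown status: treat as unhealthy
	}
}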
Decision 2: Should This Node Rebalance Shards?
Command: RebalanceShards(nodeID, reason)
┌──────────────────────────────────┐
│Is nodeID the current leader? │
└────────┬─────────────────────────┘
│
┌────┴──────────────────┐
│ YES │ NO
▼ ▼
┌────────────┐ ┌──────────────────────┐
│Continue │ │REJECT: NotLeader │
│rebalancing │ │ │
└─────┬──────┘ │Only leader can │
│ │initiate rebalancing │
│ └──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│Are there active nodes? │
└────────┬────────────────────────────┘
│
┌────┴──────────────────┐
│ YES │ NO
▼ ▼
┌────────────┐ ┌──────────────────────┐
│Continue │ │REJECT: NoActiveNodes │
│rebalancing │ │ │
└─────┬──────┘ │Can't assign shards │
│ │with no healthy nodes │
│ └──────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│Call PlacementStrategy.RebalanceShards() │
│ (compute new assignments) │
└────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│Call ShardManager.AssignShards() │
│ (validate & apply new assignments) │
└────────┬─────────────────────────────────┘
│
┌────┴──────────────────┐
│ SUCCESS │ FAILURE
▼ ▼
┌────────────┐ ┌──────────────────────┐
│Publish │ │Publish │
│Shard │ │RebalancingFailed │
│Migrated │ │event │
│events │ │ │
│ │ │Log error, backoff │
│Publish │ │try again in 5 min │
│Rebalancing │ │ │
│Completed │ │ │
└────────────┘ └──────────────────────┘
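A sketch of the guard checks and the placement → assignment hand-off from this tree; the error values, activeNodes, placement, and shardManager fields are illustrative, and publishMigrations is the helper sketched in Scenario 1:
func (cm *ClusterManager) RebalanceShards(nodeID, reason string) error {
	if nodeID != cm.currentLeaderID {
		return ErrNotLeader // only the leader may initiate rebalancing
	}
	active := cm.activeNodes()
	if len(active) == 0 {
		return ErrNoActiveNodes // nothing healthy to place shards on
	}
	// Compute new assignments, then let ShardManager validate and apply them.
	newMap, err := cm.placement.RebalanceShards(cm.shardMap, active)
	if err != nil {
		cm.events.Publish(RebalancingFailed{Reason: err.Error()})
		return err
	}
	if err := cm.shardManager.AssignShards(newMap); err != nil {
		cm.events.Publish(RebalancingFailed{Reason: err.Error()})
		return err
	}
	publishMigrations(cm.shardMap, newMap, cm.events) // one ShardMigrated per move
	cm.shardMap = newMap
	return nil
}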
Decision 3: Can We Assign This Shard to This Node?
Command: AssignShard(shardID, nodeID)
┌──────────────────────────────────┐
│Is nodeID in Cluster.nodes? │
└────────┬─────────────────────────┘
│
┌────┴──────────────────┐
│ YES │ NO
▼ ▼
┌────────────┐ ┌──────────────────────┐
│Continue │ │REJECT: NodeNotFound │
│assignment │ │ │
└─────┬──────┘ │Can't assign shard │
│ │to non-existent node │
│ └──────────────────────┘
│
▼
┌──────────────────────────────────────┐
│Check node.Status │
└────────┬─────────────────────────────┘
│
┌────┴──────────────────┐
│Active or Draining │ Failed
▼ ▼
┌────────────┐ ┌──────────────────────┐
│Continue │ │REJECT: UnhealthyNode │
│assignment │ │ │
└─────┬──────┘ │Can't assign to │
│ │failed node; it can't │
│ │execute shards │
│ └──────────────────────┘
│
▼
┌──────────────────────────────────────┐
│Check replication factor │
│ (existing nodes < replication limit?)│
└────────┬─────────────────────────────┘
│
┌────┴──────────────────┐
│ YES │ NO
▼ ▼
┌────────────┐ ┌──────────────────────┐
│ACCEPT │ │REJECT: TooManyReplicas│
│Add node to │ │ │
│shard's │ │Already have max │
│replica │ │replicas for shard │
│list │ │ │
└────────────┘ └──────────────────────┘
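A sketch of the three checks above, building on the ShardAssignment sketch earlier; ReplicationFactor and the error values are assumed names:
func (sa *ShardAssignment) AssignShard(shardID uint32, nodeID string) error {
	node, ok := sa.Nodes[nodeID]
	if !ok {
		return ErrNodeNotFound
	}
	if node.Status == NodeStatusFailed {
		return ErrUnhealthyNode // Active or Draining nodes are acceptable owners
	}
	if len(sa.Assignments[shardID]) >= ReplicationFactor {
		return ErrTooManyReplicas
	}
	sa.Assignments[shardID] = append(sa.Assignments[shardID], nodeID)
	sa.Version++
	sa.UpdateTime = time.Now()
	return nil
}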
State Transitions
Cluster State Machine
┌────────────────┐
│ Initializing │
│ (no nodes) │
└────────┬───────┘
│ NodeJoined
▼
┌────────────────┐
│ Single Node │
│ (one node only)│
└────────┬───────┘
│ NodeJoined
▼
┌────────────────────────────────────────────┐
│ Multi-Node Cluster │
│ ├─ Stable (healthy nodes, shards assigned) │
│ ├─ Rebalancing (shards moving) │
│ └─ Degraded (failed node waiting for heal) │
└────────────┬───────────────────────────────┘
│ (All nodes left or failed)
▼
┌────────────────┐
│ No Nodes │
│ (cluster dead) │
└────────────────┘
Node State Machine (per node)
┌────────────────┐
│ Discovered │
│ (new heartbeat)│
└────────┬───────┘
│ JoinCluster
▼
┌────────────────┐
│ Active │
│ (healthy, │
│ can host │
│ shards) │
└────────┬───────┘
│
┌────────┴──────────────┬─────────────────┐
│ │ │
│ (graceful) │ (heartbeat miss)│
▼ ▼ ▼
┌────────────┐ ┌────────────────┐ ┌────────────────┐
│ Draining │ │ Failed │ │ Failed │
│ (stop new │ │ (timeout:90s) │ │ (detected) │
│ shards, │ │ │ │ (admin/health) │
│ preserve │ │ Rebalance │ │ │
│ existing) │ │ shards ASAP │ │ Rebalance │
│ │ │ │ │ shards ASAP │
│ │ │Recover? │ │ │
│ │ │ ├─ Yes: │ │Recover? │
│ │ │ │ → Active │ │ ├─ Yes: │
│ │ │ └─ No: │ │ │ → Active │
│ │ │ → Deleted │ │ └─ No: │
│ │ │ │ │ → Deleted │
│ │ └────────────────┘ └────────────────┘
└────┬───────┘
│ Removed
▼
┌────────────────┐
│ Deleted │
│ (left cluster) │
└────────────────┘
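The per-node states and allowed transitions, sketched as a lookup table; the constant names are assumptions, only the transitions mirror the diagram:
type NodeStatus int
const (
	NodeStatusDiscovered NodeStatus = iota
	NodeStatusActive
	NodeStatusDraining
	NodeStatusFailed
	NodeStatusDeleted
)
var validTransitions = map[NodeStatus][]NodeStatus{
	NodeStatusDiscovered: {NodeStatusActive},                      // JoinCluster
	NodeStatusActive:     {NodeStatusDraining, NodeStatusFailed},  // graceful or detected failure
	NodeStatusDraining:   {NodeStatusDeleted},                     // Removed
	NodeStatusFailed:     {NodeStatusActive, NodeStatusDeleted},   // recover or remove
}
func canTransition(from, to NodeStatus) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}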
Leadership State Machine (per node)
┌──────────────────┐
│ Not a Leader │
│ (waiting) │
└────────┬─────────┘
│ Try Election (every 2s)
│ Atomic create "leader" succeeds
▼
┌──────────────────┐
│ Candidate │
│ (won election) │
└────────┬─────────┘
│ Start lease renewal loop
▼
┌──────────────────┐
│ Leader │
│ (holding lease) │
└────────┬─────────┘
│
┌──────┴───────────┬──────────────────────┐
│ │ │
│ Renew lease │ Lease expires │ Graceful
│ (every 3s) │ (90s timeout) │ shutdown
│ ✓ Success │ ✗ Failure │
▼ ▼ ▼
[stays] ┌──────────────────┐ ┌──────────────────┐
│ Lost Leadership │ │ Lost Leadership │
│ (lease expired) │ │ (graceful) │
└────────┬─────────┘ └────────┬─────────┘
│ │
└──────────┬───────────┘
│
▼
┌──────────────────┐
│ Not a Leader │
│ (back to bench) │
└──────────────────┘
│
└─▶ [Back to top]
Concurrency Model
Thread Safety
All aggregates use sync.RWMutex for thread safety:
type ClusterManager struct {
	mutex    sync.RWMutex         // Protects access to:
	nodes    map[string]*NodeInfo // - nodes
	shardMap *ShardMap            // - shardMap
	// ...
}
// Read operation (multiple goroutines)
func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo {
	cm.mutex.RLock() // Shared lock
	defer cm.mutex.RUnlock()
	// ...
}
// Write operation (exclusive)
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
	cm.mutex.Lock() // Exclusive lock
	defer cm.mutex.Unlock()
	// ... (only one writer at a time)
}
Background Goroutines
┌─────────────────────────────────────────────────┐
│ DistributedVM.Start() │
├─────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ ClusterManager.Start() │ │
│ │ ├─ go election.Start() │ │
│ │ │ ├─ go electionLoop() [ticker: 2s] │ │
│ │ │ ├─ go leaseRenewalLoop() [ticker: 3s]│ │
│ │ │ └─ go monitorLeadership() [watcher] │ │
│ │ │ │ │
│ │ ├─ go monitorNodes() [ticker: 30s] │ │
│ │ └─ go rebalanceLoop() [ticker: 5m] │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ NodeDiscovery.Start() │ │
│ │ ├─ go announceNode() [ticker: 30s] │ │
│ │ └─ Subscribe to "aether.discovery" │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ NATS subscriptions │ │
│ │ ├─ "aether.cluster.*" → messages │ │
│ │ ├─ "aether.discovery" → node updates │ │
│ │ └─ "aether-leader-election" → KV watch │ │
│ └──────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────┘
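A sketch of how these ticker-driven loops might be started; the intervals match the diagram above, while runEvery and the callback names are illustrative:
func (cm *ClusterManager) Start(ctx context.Context) {
	go cm.runEvery(ctx, 2*time.Second, cm.tryElection)       // electionLoop
	go cm.runEvery(ctx, 3*time.Second, cm.renewLease)        // leaseRenewalLoop
	go cm.runEvery(ctx, 30*time.Second, cm.checkNodeHealth)  // monitorNodes
	go cm.runEvery(ctx, 5*time.Minute, cm.periodicRebalance) // rebalanceLoop
}
func (cm *ClusterManager) runEvery(ctx context.Context, every time.Duration, fn func()) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fn()
		}
	}
}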
Event Sequences
Event: Node Join, Rebalance, Shard Assignment
Timeline:
T=0s
├─ Node-4 joins cluster
├─ NodeDiscovery announces NodeJoined via NATS
└─ ClusterManager receives and processes
T=0.1s
├─ ClusterManager.JoinCluster() executes
├─ Updates nodes map, hashRing
├─ Publishes NodeJoined event
└─ If leader: triggers rebalancing
T=0.5s (if leader)
├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing)
├─ PlacementStrategy.RebalanceShards() computes new assignments
└─ ShardManager.AssignShards() applies new assignments
T=1.0s
├─ Publishes ShardMigrated events (one per shard moved)
├─ All nodes subscribe to these events
├─ Each node routing table updated
└─ Actors aware of new shard locations
T=1.5s onwards
├─ Actors on moved shards migrated (application layer)
├─ Actor Runtime subscribes to ShardMigrated
├─ Triggers actor migration via ActorMigration
└─ Eventually: rebalancing complete
T=5m
├─ Periodic rebalance check (5m timer)
├─ If no changes: no-op
└─ If imbalance detected: trigger rebalance again
Event: Node Failure Detection and Recovery
Timeline:
T=0s
├─ Node-2 healthy, last heartbeat received
└─ Node-2.LastSeen = now
T=30s
├─ Node-2 healthcheck runs (every 30s timer)
├─ Publishes heartbeat
└─ Node-2.LastSeen updated
T=60s
├─ (Node-2 still healthy)
└─ Heartbeat received, LastSeen updated
T=65s
├─ Node-2 CRASH (network failure or process crash)
├─ No more heartbeats sent
└─ Node-2.LastSeen stays frozen at T=60s
T=150s (90s after the last heartbeat at T=60s)
├─ ClusterManager.checkNodeHealth() detects timeout
├─ now - LastSeen > 90s → mark node Failed
├─ ClusterManager.MarkNodeFailed() executes
├─ Publishes NodeFailed event
├─ If leader: triggers rebalancing
└─ (If not leader: waits for leader to rebalance)
T=151s (if leader)
├─ RebalanceShards triggered
├─ PlacementStrategy computes new topology without Node-2
├─ ShardManager.AssignShards() reassigns shards
└─ Publishes ShardMigrated events
T=152s onwards
├─ Actors migrated from Node-2 to healthy nodes
└─ No actor loss (assuming replication or migration succeeded)
T=180s (Node-2 recovery)
├─ Node-2 process restarts
├─ NodeDiscovery announces NodeJoined again
├─ Status: Active
└─ (Back to Node Join sequence if leader decides)
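A sketch of the heartbeat-timeout check described above, using nodeFailureTimeout from the configuration section and the LastSeen field referenced in this timeline; everything else is illustrative:
func (cm *ClusterManager) checkNodeHealth() {
	cm.mutex.RLock()
	var timedOut []string
	for id, node := range cm.nodes {
		if node.Status == NodeStatusActive && time.Since(node.LastSeen) > nodeFailureTimeout {
			timedOut = append(timedOut, id)
		}
	}
	cm.mutex.RUnlock()
	for _, id := range timedOut {
		// MarkNodeFailed takes the write lock itself (see Scenario 2).
		_ = cm.MarkNodeFailed(id)
	}
}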
Configuration & Tuning
Key Parameters
// From cluster/types.go
const (
	// LeaderLeaseTimeout: how long before leader must renew
	LeaderLeaseTimeout = 10 * time.Second
	// HeartbeatInterval: how often leader renews
	HeartbeatInterval = 3 * time.Second
	// ElectionTimeout: how often nodes try election
	ElectionTimeout = 2 * time.Second
	// Node failure detection (in manager.go)
	nodeFailureTimeout = 90 * time.Second
)
// From cluster/types.go
const (
	// DefaultNumShards: total shards in cluster
	DefaultNumShards = 1024
	// DefaultVirtualNodes: per-node virtual replicas for distribution
	DefaultVirtualNodes = 150
)
Tuning Guide
| Parameter | Current | Rationale | Trade-off |
|---|---|---|---|
| LeaderLeaseTimeout | 10s | Fast failure detection | May cause thrashing in high-latency networks |
| HeartbeatInterval | 3s | Leader renews its lease every 3s | ~3 renewals per 10s lease; shorter intervals add NATS traffic |
| ElectionTimeout | 2s | Retry elections frequently | CPU cost, but quick recovery |
| NodeFailureTimeout | 90s | 3x heartbeat interval | Tolerance for temp network issues |
| DefaultNumShards | 1024 | Good granularity for large clusters | More shards = more metadata |
| DefaultVirtualNodes | 150 | Balance between distribution and overhead | Lower = worse distribution, higher = more ring operations |
Failure Scenarios & Recovery
Scenario A: Single Node Fails
Before: [A (leader), B, C, D] with 1024 shards
├─ A: 256 shards (+ leader)
├─ B: 256 shards
├─ C: 256 shards
└─ D: 256 shards
B crashes (no recovery) → waits 90s → marked Failed
After Rebalance:
[A (leader), C, D] with 1024 shards
├─ A: 341 shards (+ leader)
├─ C: 341 shards
└─ D: 342 shards
Recovery: Only B's ~256 shards (≈1/4 of the total) are reshuffled; consistent hashing + virtual nodes keep the remaining assignments in place
Scenario B: Leader Fails
Before: [A (leader), B, C, D]
A crashes → waits 90s → marked Failed
no lease renewal → lease expires after 10s
B, C, or D wins election → new leader
→ triggers rebalance
→ reshuffles A's shards
After: [B (leader), C, D]
Recovery: New leader elected within 10s; rebalancing within 100s; no loss if replicas present
Scenario C: Network Partition
Before: [A (leader), B, C, D]
Partition: {A} isolated | {B, C, D} connected
At T=10s (lease expires):
├─ A: can't reach NATS, can't renew → loses leadership
├─ B, C, D: A's lease expired, one wins election
└─ New leader coordinates rebalance
Risk: If A can still reach NATS (isolated only from the application network), it may try to renew,
but the atomic KV update fails because of the term/revision mismatch
Result: Single leader maintained; no split-brain
Monitoring & Observability
Key Metrics to Track
# Cluster Topology
gauge: cluster.nodes.count [active|draining|failed]
gauge: cluster.shards.assigned [0, 1024]
gauge: cluster.shards.orphaned [0, 1024]
# Leadership
gauge: cluster.leader.is_leader [0, 1]
gauge: cluster.leader.term [0, ∞]
gauge: cluster.leader.lease_expires_in_seconds [0, 10]
# Rebalancing
counter: cluster.rebalancing.triggered [reason]
gauge: cluster.rebalancing.active [0, 1]
counter: cluster.rebalancing.completed [shards_moved]
counter: cluster.rebalancing.failed [reason]
# Node Health
gauge: cluster.node.heartbeat_latency_ms [per node]
gauge: cluster.node.load [per node]
gauge: cluster.node.vm_count [per node]
counter: cluster.node.failures [reason]
Alerts
- Leader heartbeat missing > 5s → election may be stuck
- Rebalancing in progress > 5min → something wrong
- Orphaned shards > 0 → invariant violation
- Node failure > 50% of cluster → investigate
References
- DOMAIN_MODEL.md - Full domain model
- REFACTORING_SUMMARY.md - Implementation roadmap
- manager.go - ClusterManager implementation
- leader.go - LeaderElection implementation
- shard.go - ShardManager implementation
- discovery.go - NodeDiscovery implementation
- distributed.go - DistributedVM orchestrator