Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added file: .product-strategy/cluster/ARCHITECTURE.md (833 lines)

# Cluster Coordination: Architecture Reference

## High-Level Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    Aether Cluster Runtime                     │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ DistributedVM (Orchestrator - not an aggregate)      │    │
│  │  ├─ Local Runtime (executes actors)                  │    │
│  │  ├─ NodeDiscovery (heartbeat → cluster awareness)    │    │
│  │  ├─ ClusterManager (Cluster aggregate root)          │    │
│  │  │   ├─ nodes: Map[ID → NodeInfo]                    │    │
│  │  │   ├─ shardMap: ShardMap (current assignments)     │    │
│  │  │   ├─ hashRing: ConsistentHashRing (util)          │    │
│  │  │   └─ election: LeaderElection                     │    │
│  │  └─ ShardManager (ShardAssignment aggregate)         │    │
│  │      ├─ shardCount: int                              │    │
│  │      ├─ shardMap: ShardMap                           │    │
│  │      └─ placement: PlacementStrategy                 │    │
│  └──────────────────────────────────────────────────────┘    │
│                          │ NATS                               │
│                          ▼                                    │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ NATS Cluster                                         │    │
│  │  ├─ Subject: aether.discovery (heartbeats)           │    │
│  │  ├─ Subject: aether.cluster.* (messages)             │    │
│  │  └─ KeyValue: aether-leader-election (lease)         │    │
│  └──────────────────────────────────────────────────────┘    │
│                                                               │
└──────────────────────────────────────────────────────────────┘
```

---

## Aggregate Boundaries

### Aggregate 1: Cluster (Root)
Owns node topology, shard assignments, and rebalancing decisions.

```
Cluster Aggregate
├─ Entities
│   └─ Cluster (root)
│       ├─ nodes: Map[NodeID → NodeInfo]
│       ├─ shardMap: ShardMap
│       ├─ hashRing: ConsistentHashRing
│       └─ currentLeaderID: string
│
├─ Commands
│   ├─ JoinCluster(nodeInfo)
│   ├─ MarkNodeFailed(nodeID)
│   ├─ AssignShards(shardMap)
│   └─ RebalanceShards(reason)
│
├─ Events
│   ├─ NodeJoined
│   ├─ NodeFailed
│   ├─ NodeLeft
│   ├─ ShardAssigned
│   ├─ ShardMigrated
│   └─ RebalancingTriggered
│
├─ Invariants Enforced
│   ├─ I2: All active shards have owners
│   ├─ I3: Shards only on healthy nodes
│   └─ I4: Assignments stable during lease
│
└─ Code Location: ClusterManager (cluster/manager.go)
```
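
As a rough illustration of this aggregate's shape, the Go sketch below models the root entity and its core state. It is a sketch only: the `NodeStatus` constants and the exact field types are assumptions drawn from this document, not the real `cluster` package API.

```go
// Illustrative sketch of the Cluster aggregate root; not the actual package API.
package cluster

import (
	"sync"
	"time"
)

// NodeStatus mirrors the node states used throughout this document
// (Active, Draining, Failed).
type NodeStatus int

const (
	NodeActive NodeStatus = iota
	NodeDraining
	NodeFailed
)

// NodeInfo describes one member of the cluster topology.
type NodeInfo struct {
	ID       string
	Status   NodeStatus
	Capacity int
	LastSeen time.Time
}

// Cluster is the aggregate root: it owns topology, shard assignments,
// and rebalancing decisions, guarded by a single mutex (see Concurrency Model).
type Cluster struct {
	mu              sync.RWMutex
	nodes           map[string]*NodeInfo
	currentLeaderID string
}
```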

### Aggregate 2: LeadershipLease (Root)
Owns leadership claim and ensures single leader per term.

```
LeadershipLease Aggregate
├─ Entities
│   └─ LeadershipLease (root)
│       ├─ leaderID: string
│       ├─ term: uint64
│       ├─ expiresAt: time.Time
│       └─ startedAt: time.Time
│
├─ Commands
│   ├─ ElectLeader(nodeID)
│   └─ RenewLeadership(nodeID)
│
├─ Events
│   ├─ LeaderElected
│   ├─ LeadershipRenewed
│   └─ LeadershipLost
│
├─ Invariants Enforced
│   ├─ I1: Single leader per term
│   └─ I5: Leader is active node
│
└─ Code Location: LeaderElection (cluster/leader.go)
```
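
A minimal sketch of how the lease entity can enforce I1 (single leader per term): only the current holder may renew, and a new term can only start once the previous lease has lapsed. The method names are illustrative assumptions.

```go
// Sketch of the LeadershipLease aggregate; field names follow the listing above.
package cluster

import "time"

type LeadershipLease struct {
	LeaderID  string
	Term      uint64
	StartedAt time.Time
	ExpiresAt time.Time
}

// Expired reports whether the lease can be claimed by another node.
func (l *LeadershipLease) Expired(now time.Time) bool {
	return now.After(l.ExpiresAt)
}

// Renew extends the lease, but only for the current holder of an unexpired lease.
func (l *LeadershipLease) Renew(nodeID string, now time.Time, ttl time.Duration) bool {
	if nodeID != l.LeaderID || l.Expired(now) {
		return false
	}
	l.ExpiresAt = now.Add(ttl)
	return true
}
```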

### Aggregate 3: ShardAssignment (Root)
Owns shard-to-node mappings and validates assignments.

```
ShardAssignment Aggregate
├─ Entities
│   └─ ShardAssignment (root)
│       ├─ version: uint64
│       ├─ assignments: Map[ShardID → []NodeID]
│       ├─ nodes: Map[NodeID → NodeInfo]
│       └─ updateTime: time.Time
│
├─ Commands
│   ├─ AssignShard(shardID, nodeList)
│   └─ RebalanceFromTopology(nodes)
│
├─ Events
│   ├─ ShardAssigned
│   └─ ShardMigrated
│
├─ Invariants Enforced
│   ├─ I2: All shards assigned
│   └─ I3: Only healthy nodes
│
└─ Code Location: ShardManager (cluster/shard.go)
```
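
A sketch of the versioned shard map at the heart of this aggregate. The `ShardID`/`NodeID` underlying types are assumptions; the point is that every applied change bumps `Version` so consumers can detect stale assignments.

```go
// Sketch of the ShardAssignment root entity; not the real shard.go types.
package cluster

import "time"

type (
	ShardID uint32
	NodeID  string
)

type ShardAssignment struct {
	Version     uint64
	Assignments map[ShardID][]NodeID
	UpdateTime  time.Time
}

// Apply replaces the shard map and bumps the version so stale maps can be rejected.
func (sa *ShardAssignment) Apply(next map[ShardID][]NodeID, now time.Time) {
	sa.Assignments = next
	sa.Version++
	sa.UpdateTime = now
}
```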

---

## Command Flow Diagrams

### Scenario 1: Node Joins Cluster

```
┌─────────┐  NodeJoined   ┌──────────────────┐
│New Node │──────────────▶│ClusterManager    │
└─────────┘               │.JoinCluster()    │
                          └────────┬─────────┘
                                   │
                     ┌─────────────┼─────────────┐
                     ▼             ▼             ▼
               ┌──────────┐  ┌──────────┐  ┌──────────┐
               │Validate  │  │Update    │  │Publish   │
               │ID unique │  │topology  │  │NodeJoined│
               │Capacity>0│  │hashRing  │  │event     │
               └────┬─────┘  └──────────┘  └──────────┘
                    │
               ┌────▼────────────────────────┐
               │Is this node leader?         │
               │If yes: trigger rebalance    │
               └─────────────────────────────┘
                           │
               ┌───────────┴───────────┐
               ▼                       ▼
       ┌──────────────────┐    ┌──────────────────┐
       │RebalanceShards   │    │(nothing)         │
       │ command          │    │                  │
       └──────────────────┘    └──────────────────┘
               │
               ▼
       ┌──────────────────────────────────┐
       │ConsistentHashPlacement           │
       │  .RebalanceShards()              │
       │  (compute new assignments)       │
       └────────────┬─────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────┐
       │ShardManager.AssignShards()       │
       │  (validate & apply)              │
       └────────────┬─────────────────────┘
                    │
             ┌──────┴────────────────────┐
             ▼                           ▼
       ┌──────────────┐        ┌─────────────────┐
       │For each      │        │Publish          │
       │shard moved   │        │ShardMigrated    │
       │              │        │event per shard  │
       └──────────────┘        └─────────────────┘
```
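
The sketch below walks the same flow in Go: validate, update topology, publish, and (leader only) trigger a rebalance. It builds on the `NodeInfo` sketch earlier in this document; the helper names (`hashRing.AddNode`, `publish`, `isLeader`) and the concrete NATS subject are assumptions, not the real `ClusterManager` methods.

```go
// Sketch of the join flow; assumes hypothetical helpers on ClusterManager.
func (cm *ClusterManager) JoinCluster(info *NodeInfo) error {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()

	// Validate: unique ID and positive capacity.
	if _, exists := cm.nodes[info.ID]; exists {
		return fmt.Errorf("node %s already joined", info.ID)
	}
	if info.Capacity <= 0 {
		return fmt.Errorf("node %s has no capacity", info.ID)
	}

	// Update topology and hash ring.
	cm.nodes[info.ID] = info
	cm.hashRing.AddNode(info.ID) // assumed ConsistentHashRing helper

	// Publish NodeJoined so other members learn about the new node.
	cm.publish("aether.cluster.node.joined", info) // subject name is illustrative

	// Only the leader reshuffles shards.
	if cm.isLeader() {
		go cm.triggerShardRebalancing("node joined")
	}
	return nil
}
```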

### Scenario 2: Node Failure Detected

```
┌──────────────────────┐
│Heartbeat timeout     │
│(LastSeen > 90s ago)  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────────────┐
│ClusterManager                │
│.MarkNodeFailed()             │
│ ├─ Mark status=Failed        │
│ ├─ Remove from hashRing      │
│ └─ Publish NodeFailed event  │
└────────────┬─────────────────┘
             │
    ┌────────▼────────────┐
    │Is this node leader? │
    └────────┬────────────┘
             │
    ┌────────┴─────────────────┐
    │ YES                      │ NO
    ▼                          ▼
┌──────────────┐      ┌──────────────────┐
│Trigger       │      │(nothing)         │
│Rebalance     │      │                  │
└──────────────┘      └──────────────────┘
    │
    └─▶ [Same as Scenario 1 from RebalanceShards]
```
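
A minimal sketch of the detection check that feeds this flow. `checkNodeHealth` and `nodeFailureTimeout` are named in this document; the loop body and the `markNodeFailedLocked` helper are assumptions.

```go
// Sketch of heartbeat-timeout detection; runs from the monitorNodes ticker.
func (cm *ClusterManager) checkNodeHealth() {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()

	now := time.Now()
	for id, node := range cm.nodes {
		if node.Status == NodeFailed {
			continue
		}
		// A node is considered failed once no heartbeat has been seen
		// for nodeFailureTimeout (90s).
		if now.Sub(node.LastSeen) > nodeFailureTimeout {
			// Hypothetical lock-held variant: marks Failed, updates the
			// hash ring, and publishes the NodeFailed event.
			cm.markNodeFailedLocked(id)
		}
	}
}
```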

### Scenario 3: Leader Election (Implicit)

```
┌─────────────────────────────────────┐
│All nodes: electionLoop runs every 2s│
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────┐
    │Am I leader?         │
    └────────┬────────────┘
             │
    ┌────────┴──────────────────────────┐
    │ YES                               │ NO
    ▼                                   ▼
┌──────────────┐        ┌─────────────────────────┐
│Do nothing    │        │Should try election?     │
│(already      │        │ ├─ No leader exists     │
│leading)      │        │ ├─ Lease expired        │
└──────────────┘        │ └─ (other conditions)   │
                        └────────┬────────────────┘
                                 │
                       ┌─────────▼──────────┐
                       │try AtomicCreate    │
                       │"leader" key in KV  │
                       └────────┬───────────┘
                                │
                  ┌─────────────┴──────────────┐
                  │ SUCCESS                    │ FAILED
                  ▼                            ▼
        ┌──────────────────┐        ┌──────────────────┐
        │Became Leader!    │        │Try claim expired │
        │Publish           │        │lease; if success,│
        │LeaderElected     │        │become leader     │
        └──────────────────┘        │Else: stay on     │
                  │                 │bench             │
                  ▼                 └──────────────────┘
        ┌──────────────────┐
        │Start lease       │
        │renewal loop      │
        │(every 3s)        │
        └──────────────────┘
```
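
A sketch of one election attempt. To avoid guessing the exact NATS KeyValue client API, the bucket is hidden behind a tiny interface whose `Create` fails if the key already exists (the "AtomicCreate" in the diagram); the `LeaderElection` fields used here are assumptions.

```go
// leaderKV abstracts the aether-leader-election KV bucket for this sketch.
type leaderKV interface {
	// Create succeeds only if the key does not exist yet (atomic claim).
	Create(key string, value []byte) error
}

// tryBecomeLeader is a hypothetical single election attempt.
func (le *LeaderElection) tryBecomeLeader(kv leaderKV) bool {
	// The stored value only needs to identify the holder; the real payload
	// (term, expiry) is glossed over here.
	if err := kv.Create("leader", []byte(le.nodeID)); err != nil {
		// Key already exists: someone else holds (or held) the lease.
		// The real loop would next try to claim an expired lease.
		return false
	}
	le.term++
	// On success: publish LeaderElected and start the 3s renewal loop.
	return true
}
```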

---

## Decision Trees

### Decision 1: Is Node Healthy?

```
Query: Is Node X healthy?

┌─────────────────────────────────────┐
│Get node status from Cluster.nodes   │
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────────┐
    │Check node.Status field  │
    └────────┬────────────────┘
             │
    ┌────────┴───────────────┬─────────────────────┬──────────────┐
    │                        │                     │              │
    ▼                        ▼                     ▼              ▼
┌────────┐          ┌────────────┐        ┌──────────────┐   ┌─────────┐
│Active  │          │Draining    │        │Failed        │   │Unknown  │
├────────┤          ├────────────┤        ├──────────────┤   ├─────────┤
│✓Healthy│          │⚠ Draining  │        │✗ Unhealthy   │   │✗ Error  │
│Can host│          │Should not  │        │Don't use for │   │         │
│shards  │          │get new     │        │sharding      │   │         │
└────────┘          │shards, but │        │Delete shards │   └─────────┘
                    │existing OK │        │ASAP          │
                    └────────────┘        └──────────────┘
```
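
The same decision expressed as predicates on the `NodeStatus` constants from the earlier sketch; the method names are illustrative, not the package's real API.

```go
// IsHealthy: only Active nodes count as healthy.
func (s NodeStatus) IsHealthy() bool {
	return s == NodeActive
}

// CanHostNewShards: Active may take new shards; Draining and Failed may not.
func (s NodeStatus) CanHostNewShards() bool {
	return s == NodeActive
}

// CanKeepExistingShards: Draining nodes keep what they have; Failed nodes keep nothing.
func (s NodeStatus) CanKeepExistingShards() bool {
	return s == NodeActive || s == NodeDraining
}
```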

### Decision 2: Should This Node Rebalance Shards?

```
Command: RebalanceShards(nodeID, reason)

┌──────────────────────────────────┐
│Is nodeID the current leader?     │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NotLeader     │
│rebalancing │     │                      │
└─────┬──────┘     │Only leader can       │
      │            │initiate rebalancing  │
      │            └──────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│Are there active nodes?              │
└────────┬────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NoActiveNodes │
│rebalancing │     │                      │
└─────┬──────┘     │Can't assign shards   │
      │            │with no healthy nodes │
      │            └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────────┐
│Call PlacementStrategy.RebalanceShards()  │
│  (compute new assignments)               │
└────────┬─────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│Call ShardManager.AssignShards()          │
│  (validate & apply new assignments)      │
└────────┬─────────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ SUCCESS               │ FAILURE
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Publish     │     │Publish               │
│Shard       │     │RebalancingFailed     │
│Migrated    │     │event                 │
│events      │     │                      │
│            │     │Log error, backoff    │
│Publish     │     │try again in 5 min    │
│Rebalancing │     │                      │
│Completed   │     │                      │
└────────────┘     └──────────────────────┘
```
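
A sketch of the two guard checks at the top of this tree. The sentinel errors, the `activeNodesLocked` helper, and the exact wiring between `ClusterManager`, `PlacementStrategy`, and `ShardManager` are assumptions for illustration.

```go
var (
	errNotLeader     = errors.New("rebalance rejected: not the leader")
	errNoActiveNodes = errors.New("rebalance rejected: no active nodes")
)

// RebalanceShards sketch: guard, compute, apply, publish.
func (cm *ClusterManager) RebalanceShards(nodeID, reason string) error {
	cm.mutex.RLock()
	leader := cm.currentLeaderID
	active := cm.activeNodesLocked() // hypothetical: nodes with Status == Active
	cm.mutex.RUnlock()

	if nodeID != leader {
		return errNotLeader
	}
	if len(active) == 0 {
		return errNoActiveNodes
	}

	// Compute new assignments, then validate & apply them.
	newMap := cm.shardManager.placement.RebalanceShards(active)
	if err := cm.shardManager.AssignShards(newMap); err != nil {
		cm.publish("aether.cluster.rebalancing.failed", reason) // subject illustrative
		return err
	}
	cm.publish("aether.cluster.rebalancing.completed", reason)
	return nil
}
```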

### Decision 3: Can We Assign This Shard to This Node?

```
Command: AssignShard(shardID, nodeID)

┌──────────────────────────────────┐
│Is nodeID in Cluster.nodes?       │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: NodeNotFound  │
│assignment  │     │                      │
└─────┬──────┘     │Can't assign shard    │
      │            │to non-existent node  │
      │            └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────┐
│Check node.Status                     │
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ Active or Draining    │ Failed
    ▼                       ▼
┌────────────┐     ┌──────────────────────┐
│Continue    │     │REJECT: UnhealthyNode │
│assignment  │     │                      │
└─────┬──────┘     │Can't assign to       │
      │            │failed node; it can't │
      │            │execute shards        │
      │            └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────┐
│Check replication factor              │
│ (existing nodes < replication limit?)│
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐     ┌───────────────────────┐
│ACCEPT      │     │REJECT: TooManyReplicas│
│Add node to │     │                       │
│shard's     │     │Already have max       │
│replica     │     │replicas for shard     │
│list        │     │                       │
└────────────┘     └───────────────────────┘
```
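
The same checks as a Go sketch on the `ShardAssignment` type from the earlier sketch, using the standard `errors` package. The `replicationFactor` parameter and error strings are assumptions.

```go
// AssignShard sketch: reject missing nodes, failed nodes, and over-replication.
func (sa *ShardAssignment) AssignShard(shard ShardID, node *NodeInfo, replicationFactor int) error {
	if node == nil {
		return errors.New("reject: NodeNotFound")
	}
	if node.Status == NodeFailed {
		return errors.New("reject: UnhealthyNode")
	}
	replicas := sa.Assignments[shard]
	if len(replicas) >= replicationFactor {
		return errors.New("reject: TooManyReplicas")
	}
	sa.Assignments[shard] = append(replicas, NodeID(node.ID))
	sa.Version++
	return nil
}
```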

---

## State Transitions

### Cluster State Machine

```
┌────────────────┐
│ Initializing   │
│ (no nodes)     │
└────────┬───────┘
         │ NodeJoined
         ▼
┌────────────────┐
│ Single Node    │
│ (one node only)│
└────────┬───────┘
         │ NodeJoined
         ▼
┌────────────────────────────────────────────┐
│ Multi-Node Cluster                         │
│ ├─ Stable (healthy nodes, shards assigned) │
│ ├─ Rebalancing (shards moving)             │
│ └─ Degraded (failed node waiting for heal) │
└────────────┬───────────────────────────────┘
             │ (All nodes left or failed)
             ▼
┌────────────────┐
│ No Nodes       │
│ (cluster dead) │
└────────────────┘
```

### Node State Machine (per node)

```
┌────────────────┐
│ Discovered     │
│ (new heartbeat)│
└────────┬───────┘
         │ JoinCluster
         ▼
┌────────────────┐
│ Active         │
│ (healthy,      │
│  can host      │
│  shards)       │
└────────┬───────┘
         │
    ┌────┴──────────────────┬─────────────────┐
    │                       │                 │
    │ (graceful)            │ (heartbeat miss)│
    ▼                       ▼                 ▼
┌────────────┐    ┌────────────────┐  ┌────────────────┐
│ Draining   │    │ Failed         │  │ Failed         │
│ (stop new  │    │ (timeout: 90s) │  │ (detected)     │
│  shards,   │    │                │  │ (admin/health) │
│  preserve  │    │ Rebalance      │  │                │
│  existing) │    │ shards ASAP    │  │ Rebalance      │
│            │    │                │  │ shards ASAP    │
│            │    │ Recover?       │  │                │
│            │    │  ├─ Yes:       │  │ Recover?       │
│            │    │  │   → Active  │  │  ├─ Yes:       │
│            │    │  └─ No:        │  │  │   → Active  │
│            │    │      → Deleted │  │  └─ No:        │
│            │    │                │  │      → Deleted │
│            │    └────────────────┘  └────────────────┘
└────┬───────┘
     │ Removed
     ▼
┌────────────────┐
│ Deleted        │
│ (left cluster) │
└────────────────┘
```
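
One compact way to encode this state machine is a transition table, sketched below with its own string-typed states; whether the real code models Discovered and Deleted as distinct statuses is an assumption.

```go
// Sketch: node lifecycle as a transition table.
type nodeState string

const (
	stateDiscovered nodeState = "discovered"
	stateActive     nodeState = "active"
	stateDraining   nodeState = "draining"
	stateFailed     nodeState = "failed"
	stateDeleted    nodeState = "deleted"
)

var validNodeTransitions = map[nodeState][]nodeState{
	stateDiscovered: {stateActive},
	stateActive:     {stateDraining, stateFailed},
	stateDraining:   {stateDeleted},
	stateFailed:     {stateActive, stateDeleted}, // recover or remove
}

// canTransition reports whether the state machine above allows from → to.
func canTransition(from, to nodeState) bool {
	for _, next := range validNodeTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```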

### Leadership State Machine (per node)

```
┌──────────────────┐
│ Not a Leader     │
│ (waiting)        │
└────────┬─────────┘
         │ Try Election (every 2s)
         │ Atomic create "leader" succeeds
         ▼
┌──────────────────┐
│ Candidate        │
│ (won election)   │
└────────┬─────────┘
         │ Start lease renewal loop
         ▼
┌──────────────────┐
│ Leader           │
│ (holding lease)  │
└────────┬─────────┘
         │
  ┌──────┴───────────┬──────────────────────┐
  │                  │                      │
  │ Renew lease      │ Lease expires        │ Graceful
  │ (every 3s)       │ (10s timeout)        │ shutdown
  │ ✓ Success        │ ✗ Failure            │
  ▼                  ▼                      ▼
[stays]    ┌──────────────────┐   ┌──────────────────┐
           │ Lost Leadership  │   │ Lost Leadership  │
           │ (lease expired)  │   │ (graceful)       │
           └────────┬─────────┘   └────────┬─────────┘
                    │                      │
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────┐
                    │ Not a Leader     │
                    │ (back to bench)  │
                    └──────────────────┘
                               │
                               └─▶ [Back to top]
```

---

## Concurrency Model

### Thread Safety

All aggregates use `sync.RWMutex` for thread safety:

```go
type ClusterManager struct {
    mutex    sync.RWMutex         // Protects access to:
    nodes    map[string]*NodeInfo // - nodes
    shardMap *ShardMap            // - shardMap
    // ...
}

// Read operation (multiple goroutines)
func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo {
    cm.mutex.RLock() // Shared lock
    defer cm.mutex.RUnlock()
    // ...
}

// Write operation (exclusive)
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
    cm.mutex.Lock() // Exclusive lock
    defer cm.mutex.Unlock()
    // ... (only one writer at a time)
}
```

### Background Goroutines

```
┌─────────────────────────────────────────────────┐
│ DistributedVM.Start()                           │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │ ClusterManager.Start()                   │   │
│  │  ├─ go election.Start()                  │   │
│  │  │   ├─ go electionLoop()      [2s]      │   │
│  │  │   ├─ go leaseRenewalLoop()  [3s]      │   │
│  │  │   └─ go monitorLeadership() [watcher] │   │
│  │  │                                       │   │
│  │  ├─ go monitorNodes()   [ticker: 30s]    │   │
│  │  └─ go rebalanceLoop()  [ticker: 5m]     │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │ NodeDiscovery.Start()                    │   │
│  │  ├─ go announceNode()   [ticker: 30s]    │   │
│  │  └─ Subscribe to "aether.discovery"      │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │ NATS subscriptions                       │   │
│  │  ├─ "aether.cluster.*" → messages        │   │
│  │  ├─ "aether.discovery" → node updates    │   │
│  │  └─ "aether-leader-election" → KV watch  │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
└─────────────────────────────────────────────────┘
```
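
A sketch of how these ticker-driven loops can be wired up with a context for clean shutdown. The interval constants match the diagram; the `runEvery` helper and the per-tick method names are assumptions.

```go
// Start launches the background loops; each stops when ctx is cancelled.
func (cm *ClusterManager) Start(ctx context.Context) {
	go cm.runEvery(ctx, 2*time.Second, cm.electionTick)     // electionLoop
	go cm.runEvery(ctx, 3*time.Second, cm.renewLeaseTick)   // leaseRenewalLoop
	go cm.runEvery(ctx, 30*time.Second, cm.checkNodeHealth) // monitorNodes
	go cm.runEvery(ctx, 5*time.Minute, cm.rebalanceTick)    // rebalanceLoop
}

// runEvery calls fn on every tick until the context is done.
func (cm *ClusterManager) runEvery(ctx context.Context, every time.Duration, fn func()) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fn()
		}
	}
}
```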

---

## Event Sequences

### Event: Node Join, Rebalance, Shard Assignment

```
Timeline:

T=0s
├─ Node-4 joins cluster
├─ NodeDiscovery announces NodeJoined via NATS
└─ ClusterManager receives and processes

T=0.1s
├─ ClusterManager.JoinCluster() executes
├─ Updates nodes map, hashRing
├─ Publishes NodeJoined event
└─ If leader: triggers rebalancing

T=0.5s (if leader)
├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing)
├─ PlacementStrategy.RebalanceShards() computes new assignments
└─ ShardManager.AssignShards() applies new assignments

T=1.0s
├─ Publishes ShardMigrated events (one per shard moved)
├─ All nodes subscribe to these events
├─ Each node's routing table is updated
└─ Actors aware of new shard locations

T=1.5s onwards
├─ Actors on moved shards migrated (application layer)
├─ Actor Runtime subscribes to ShardMigrated
├─ Triggers actor migration via ActorMigration
└─ Eventually: rebalancing complete

T=5m
├─ Periodic rebalance check (5m timer)
├─ If no changes: no-op
└─ If imbalance detected: trigger rebalance again
```
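
A sketch of how a node could consume the ShardMigrated events published around T=1.0s, using the nats.go client. The concrete subject name and event payload are assumptions; the document only specifies the `aether.cluster.*` subject space.

```go
// Sketch: react to ShardMigrated events and refresh the local routing table.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// shardMigratedEvent is an assumed payload shape for illustration only.
type shardMigratedEvent struct {
	ShardID  uint32 `json:"shard_id"`
	FromNode string `json:"from_node"`
	ToNode   string `json:"to_node"`
	Version  uint64 `json:"version"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Update local state whenever a shard moves.
	_, err = nc.Subscribe("aether.cluster.shard.migrated", func(m *nats.Msg) {
		var ev shardMigratedEvent
		if err := json.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("bad ShardMigrated payload: %v", err)
			return
		}
		log.Printf("shard %d moved %s → %s (map version %d)",
			ev.ShardID, ev.FromNode, ev.ToNode, ev.Version)
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
```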

### Event: Node Failure Detection and Recovery

```
Timeline:

T=0s
├─ Node-2 healthy, last heartbeat received
└─ Node-2.LastSeen = now

T=30s
├─ Node-2 healthcheck runs (every 30s timer)
├─ Publishes heartbeat
└─ Node-2.LastSeen updated

T=60s
├─ (Node-2 still healthy)
└─ Heartbeat received, LastSeen updated

T=65s
├─ Node-2 CRASH (network failure or process crash)
├─ No more heartbeats sent
└─ Node-2.LastSeen = 60s

T=150s (timeout)
├─ ClusterManager.checkNodeHealth() detects timeout
├─ now - LastSeen (60s) > 90s → mark node Failed
├─ ClusterManager.MarkNodeFailed() executes
├─ Publishes NodeFailed event
├─ If leader: triggers rebalancing
└─ (If not leader: waits for leader to rebalance)

T=151s (if leader)
├─ RebalanceShards triggered
├─ PlacementStrategy computes new topology without Node-2
├─ ShardManager.AssignShards() reassigns shards
└─ Publishes ShardMigrated events

T=152s onwards
├─ Actors migrated from Node-2 to healthy nodes
└─ No actor loss (assuming replication or migration succeeded)

T=180s (Node-2 recovery)
├─ Node-2 process restarts
├─ NodeDiscovery announces NodeJoined again
├─ Status: Active
└─ (Back to Node Join sequence if leader decides)
```

---

## Configuration & Tuning

### Key Parameters

```go
// From cluster/types.go
const (
    // LeaderLeaseTimeout: how long before leader must renew
    LeaderLeaseTimeout = 10 * time.Second

    // HeartbeatInterval: how often leader renews
    HeartbeatInterval = 3 * time.Second

    // ElectionTimeout: how often nodes try election
    ElectionTimeout = 2 * time.Second

    // Node failure detection (in manager.go)
    nodeFailureTimeout = 90 * time.Second
)

// From cluster/types.go
const (
    // DefaultNumShards: total shards in cluster
    DefaultNumShards = 1024

    // DefaultVirtualNodes: per-node virtual replicas for distribution
    DefaultVirtualNodes = 150
)
```

### Tuning Guide

| Parameter | Current | Rationale | Trade-off |
|-----------|---------|-----------|-----------|
| LeaderLeaseTimeout | 10s | Fast failure detection | May cause thrashing in high-latency networks |
| HeartbeatInterval | 3s | Leader alive signal every 3s | ~3 renewals per 10s lease window |
| ElectionTimeout | 2s | Retry elections frequently | CPU cost, but quick recovery |
| NodeFailureTimeout | 90s | 3x the 30s node heartbeat interval | Tolerance for temporary network issues |
| DefaultNumShards | 1024 | Good granularity for large clusters | More shards = more metadata |
| DefaultVirtualNodes | 150 | Balance between distribution and overhead | Lower = worse distribution; higher = more ring operations |

---

## Failure Scenarios & Recovery

### Scenario A: Single Node Fails

```
Before: [A (leader), B, C, D] with 1024 shards
├─ A: 256 shards (+ leader)
├─ B: 256 shards
├─ C: 256 shards
└─ D: 256 shards

B crashes (no recovery) → waits 90s → marked Failed

After Rebalance:
[A (leader), C, D] with 1024 shards
├─ A: 341 shards (+ leader)
├─ C: 341 shards
└─ D: 342 shards
```

**Recovery:** Only B's 256 shards (~1/4 of the total) are reshuffled; consistent hashing with virtual nodes keeps the remaining assignments in place.

---

### Scenario B: Leader Fails

```
Before: [A (leader), B, C, D]

A crashes → waits 90s → marked Failed
         no lease renewal → lease expires after 10s

B, C, or D wins election → new leader
                         → triggers rebalance
                         → reshuffles A's shards

After: [B (leader), C, D]
```

**Recovery:** New leader elected within ~10s; rebalancing completes within ~100s; no loss if replicas are present

---

### Scenario C: Network Partition

```
Before: [A (leader), B, C, D]
Partition: {A} isolated | {B, C, D} connected

At T=10s (lease expires):
├─ A: can't reach NATS, can't renew → loses leadership
├─ B, C, D: A's lease expired, one wins election
└─ New leader coordinates rebalance

Risk: If A can reach NATS (just isolated from the app), it might try to renew,
      but the atomic update fails because of a term mismatch

Result: Single leader maintained; no split-brain
```

---

## Monitoring & Observability

### Key Metrics to Track

```
# Cluster Topology
gauge:   cluster.nodes.count                      [active|draining|failed]
gauge:   cluster.shards.assigned                  [0, 1024]
gauge:   cluster.shards.orphaned                  [0, 1024]

# Leadership
gauge:   cluster.leader.is_leader                 [0, 1]
gauge:   cluster.leader.term                      [0, ∞]
gauge:   cluster.leader.lease_expires_in_seconds  [0, 10]

# Rebalancing
counter: cluster.rebalancing.triggered            [reason]
gauge:   cluster.rebalancing.active               [0, 1]
counter: cluster.rebalancing.completed            [shards_moved]
counter: cluster.rebalancing.failed               [reason]

# Node Health
gauge:   cluster.node.heartbeat_latency_ms        [per node]
gauge:   cluster.node.load                        [per node]
gauge:   cluster.node.vm_count                    [per node]
counter: cluster.node.failures                    [reason]
```

### Alerts

```
- Leader heartbeat missing > 5s    → election may be stuck
- Rebalancing in progress > 5 min  → something is wrong
- Orphaned shards > 0              → invariant violation
- Node failures > 50% of cluster   → investigate
```

---

## References

- [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Full domain model
- [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation roadmap
- [manager.go](./manager.go) - ClusterManager implementation
- [leader.go](./leader.go) - LeaderElection implementation
- [shard.go](./shard.go) - ShardManager implementation
- [discovery.go](./discovery.go) - NodeDiscovery implementation
- [distributed.go](./distributed.go) - DistributedVM orchestrator