Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
.product-strategy/cluster/ARCHITECTURE.md (new file, 833 lines)
@@ -0,0 +1,833 @@

# Cluster Coordination: Architecture Reference
|
||||
|
||||
## High-Level Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Aether Cluster Runtime │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ ┌──────────────────────────────────────────────────────┐ │
|
||||
│ │ DistributedVM (Orchestrator - not an aggregate) │ │
|
||||
│ │ ├─ Local Runtime (executes actors) │ │
|
||||
│ │ ├─ NodeDiscovery (heartbeat → cluster awareness) │ │
|
||||
│ │ ├─ ClusterManager (Cluster aggregate root) │ │
|
||||
│ │ │ ├─ nodes: Map[ID → NodeInfo] │ │
|
||||
│ │ │ ├─ shardMap: ShardMap (current assignments) │ │
|
||||
│ │ │ ├─ hashRing: ConsistentHashRing (util) │ │
|
||||
│ │ │ └─ election: LeaderElection │ │
|
||||
│ │ └─ ShardManager (ShardAssignment aggregate) │ │
|
||||
│ │ ├─ shardCount: int │ │
|
||||
│ │ ├─ shardMap: ShardMap │ │
|
||||
│ │ └─ placement: PlacementStrategy │ │
|
||||
│ └──────────────────────────────────────────────────────┘ │
|
||||
│ │ NATS │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────────────────────────────────────────┐ │
|
||||
│ │ NATS Cluster │ │
|
||||
│ │ ├─ Subject: aether.discovery (heartbeats) │ │
|
||||
│ │ ├─ Subject: aether.cluster.* (messages) │ │
|
||||
│ │ └─ KeyValue: aether-leader-election (lease) │ │
|
||||
│ └──────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Aggregate Boundaries
|
||||
|
||||
### Aggregate 1: Cluster (Root)
|
||||
Owns node topology, shard assignments, and rebalancing decisions.
|
||||
|
||||
```
|
||||
Cluster Aggregate
|
||||
├─ Entities
|
||||
│ └─ Cluster (root)
|
||||
│ ├─ nodes: Map[NodeID → NodeInfo]
|
||||
│ ├─ shardMap: ShardMap
|
||||
│ ├─ hashRing: ConsistentHashRing
|
||||
│ └─ currentLeaderID: string
|
||||
│
|
||||
├─ Commands
|
||||
│ ├─ JoinCluster(nodeInfo)
|
||||
│ ├─ MarkNodeFailed(nodeID)
|
||||
│ ├─ AssignShards(shardMap)
|
||||
│ └─ RebalanceShards(reason)
|
||||
│
|
||||
├─ Events
|
||||
│ ├─ NodeJoined
|
||||
│ ├─ NodeFailed
|
||||
│ ├─ NodeLeft
|
||||
│ ├─ ShardAssigned
|
||||
│ ├─ ShardMigrated
|
||||
│ └─ RebalancingTriggered
|
||||
│
|
||||
├─ Invariants Enforced
|
||||
│ ├─ I2: All active shards have owners
|
||||
│ ├─ I3: Shards only on healthy nodes
|
||||
│ └─ I4: Assignments stable during lease
|
||||
│
|
||||
└─ Code Location: ClusterManager (cluster/manager.go)
|
||||
```
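
To make the command/event shape above concrete, here is a minimal sketch of a `JoinCluster` handler that validates input, updates the topology, and publishes `NodeJoined`. All type and field names in the sketch are illustrative assumptions rather than the actual `ClusterManager` API.

```go
// Sketch only: a command-style JoinCluster handler matching the outline
// above. Types and field names are illustrative assumptions.
package cluster

import (
	"errors"
	"sync"
	"time"
)

type NodeInfo struct {
	ID       string
	Address  string
	Capacity float64
	LastSeen time.Time
}

type NodeJoined struct {
	NodeID    string
	Timestamp time.Time
}

type ClusterManager struct {
	mutex   sync.RWMutex
	nodes   map[string]*NodeInfo
	publish func(event any) // e.g. a thin wrapper over a NATS publish
}

// JoinCluster validates the joining node, updates the topology, and
// publishes NodeJoined; rebalancing is triggered separately by the leader.
func (cm *ClusterManager) JoinCluster(node *NodeInfo) error {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()

	if node.ID == "" {
		return errors.New("node ID must not be empty")
	}
	if node.Capacity <= 0 {
		return errors.New("node capacity must be > 0")
	}
	if _, exists := cm.nodes[node.ID]; exists {
		return errors.New("duplicate node: already a cluster member")
	}

	node.LastSeen = time.Now()
	cm.nodes[node.ID] = node
	cm.publish(NodeJoined{NodeID: node.ID, Timestamp: time.Now()})
	return nil
}
```
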
|
||||
|
||||
### Aggregate 2: LeadershipLease (Root)
|
||||
Owns leadership claim and ensures single leader per term.
|
||||
|
||||
```
|
||||
LeadershipLease Aggregate
|
||||
├─ Entities
|
||||
│ └─ LeadershipLease (root)
|
||||
│ ├─ leaderID: string
|
||||
│ ├─ term: uint64
|
||||
│ ├─ expiresAt: time.Time
|
||||
│ └─ startedAt: time.Time
|
||||
│
|
||||
├─ Commands
|
||||
│ ├─ ElectLeader(nodeID)
|
||||
│ └─ RenewLeadership(nodeID)
|
||||
│
|
||||
├─ Events
|
||||
│ ├─ LeaderElected
|
||||
│ ├─ LeadershipRenewed
|
||||
│ └─ LeadershipLost
|
||||
│
|
||||
├─ Invariants Enforced
|
||||
│ ├─ I1: Single leader per term
|
||||
│ └─ I5: Leader is active node
|
||||
│
|
||||
└─ Code Location: LeaderElection (cluster/leader.go)
|
||||
```
|
||||
|
||||
### Aggregate 3: ShardAssignment (Root)
|
||||
Owns shard-to-node mappings and validates assignments.
|
||||
|
||||
```
|
||||
ShardAssignment Aggregate
|
||||
├─ Entities
|
||||
│ └─ ShardAssignment (root)
|
||||
│ ├─ version: uint64
|
||||
│ ├─ assignments: Map[ShardID → []NodeID]
|
||||
│ ├─ nodes: Map[NodeID → NodeInfo]
|
||||
│ └─ updateTime: time.Time
|
||||
│
|
||||
├─ Commands
|
||||
│ ├─ AssignShard(shardID, nodeList)
|
||||
│ └─ RebalanceFromTopology(nodes)
|
||||
│
|
||||
├─ Events
|
||||
│ ├─ ShardAssigned
|
||||
│ └─ ShardMigrated
|
||||
│
|
||||
├─ Invariants Enforced
|
||||
│ ├─ I2: All shards assigned
|
||||
│ └─ I3: Only healthy nodes
|
||||
│
|
||||
└─ Code Location: ShardManager (cluster/shard.go)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Command Flow Diagrams
|
||||
|
||||
### Scenario 1: Node Joins Cluster
|
||||
|
||||
```
|
||||
┌─────────┐ NodeJoined ┌──────────────────┐
|
||||
│New Node │─────────────▶│ClusterManager │
|
||||
└─────────┘ │.JoinCluster() │
|
||||
└────────┬─────────┘
|
||||
│
|
||||
┌─────────────┼─────────────┐
|
||||
▼ ▼ ▼
|
||||
┌──────────┐ ┌──────────┐ ┌──────────┐
|
||||
│Validate │ │Update │ │Publish │
|
||||
│ID unique │ │topology │ │NodeJoined│
|
||||
    │Capacity>0│  │hashRing  │  │event     │
|
||||
└────┬─────┘ └──────────┘ └──────────┘
|
||||
│
|
||||
┌────▼────────────────────────┐
|
||||
│Is this node leader? │
|
||||
│If yes: trigger rebalance │
|
||||
└─────────────────────────────┘
|
||||
│
|
||||
┌───────────┴───────────┐
|
||||
▼ ▼
|
||||
┌──────────────────┐ ┌──────────────────┐
|
||||
│RebalanceShards │ │(nothing) │
|
||||
│ command │ │ │
|
||||
└──────────────────┘ └──────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────┐
|
||||
│ConsistentHashPlacement │
|
||||
│ .RebalanceShards() │
|
||||
│ (compute new assignments) │
|
||||
└────────────┬─────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────┐
|
||||
│ShardManager.AssignShards() │
|
||||
│ (validate & apply) │
|
||||
└────────────┬─────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────────┐
|
||||
▼ ▼
|
||||
┌──────────────┐ ┌─────────────────┐
|
||||
│For each │ │Publish │
|
||||
│shard moved │ │ShardMigrated │
|
||||
│ │ │event per shard │
|
||||
└──────────────┘ └─────────────────┘
|
||||
```
|
||||
|
||||
### Scenario 2: Node Failure Detected
|
||||
|
||||
```
|
||||
┌──────────────────────┐
|
||||
│Heartbeat timeout │
|
||||
│(LastSeen > 90s) │
|
||||
└──────────┬───────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ClusterManager │
|
||||
│.MarkNodeFailed() │
|
||||
│ ├─ Mark status=Failed │
|
||||
│ ├─ Remove from hashRing │
|
||||
│ └─ Publish NodeFailed event │
|
||||
└────────────┬─────────────────┘
|
||||
│
|
||||
┌────────▼────────────┐
|
||||
│Is this node leader? │
|
||||
└────────┬────────────┘
|
||||
│
|
||||
┌────────┴─────────────────┐
|
||||
│ YES │ NO
|
||||
▼ ▼
|
||||
┌──────────────┐ ┌──────────────────┐
|
||||
│Trigger │ │(nothing) │
|
||||
│Rebalance │ │ │
|
||||
└──────────────┘ └──────────────────┘
|
||||
│
|
||||
└─▶ [Same as Scenario 1 from RebalanceShards]
|
||||
```
|
||||
|
||||
### Scenario 3: Leader Election (Implicit)
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────┐
|
||||
│All nodes: electionLoop runs every 2s│
|
||||
└────────────┬────────────────────────┘
|
||||
│
|
||||
┌────────▼────────────┐
|
||||
│Am I leader? │
|
||||
└────────┬────────────┘
|
||||
│
|
||||
┌────────┴──────────────────────────┐
|
||||
│ YES │ NO
|
||||
▼ ▼
|
||||
┌──────────────┐ ┌─────────────────────────┐
|
||||
│Do nothing │ │Should try election? │
|
||||
│(already │ │ ├─ No leader exists │
|
||||
│leading) │ │ ├─ Lease expired │
|
||||
└──────────────┘ │ └─ (other conditions) │
|
||||
└────────┬────────────────┘
|
||||
│
|
||||
┌─────────▼──────────┐
|
||||
│try AtomicCreate │
|
||||
│"leader" key in KV │
|
||||
└────────┬───────────┘
|
||||
│
|
||||
┌─────────────┴──────────────┐
|
||||
│ SUCCESS │ FAILED
|
||||
▼ ▼
|
||||
┌──────────────────┐ ┌──────────────────┐
|
||||
│Became Leader! │ │Try claim expired │
|
||||
│Publish │ │lease; if success,│
|
||||
│LeaderElected │ │become leader │
|
||||
└──────────────────┘ │Else: stay on │
|
||||
│ │bench │
|
||||
▼ └──────────────────┘
|
||||
┌──────────────────┐
|
||||
│Start lease │
|
||||
│renewal loop │
|
||||
│(every 3s) │
|
||||
└──────────────────┘
|
||||
```
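
A minimal sketch of the atomic claim in this flow, assuming the NATS JetStream KeyValue API from `nats.go` (`Create` fails if the key already exists; `Update` requires the last revision). The key name follows the diagram; the lease payload encoding is an assumption.

```go
// Sketch of the atomic lease claim against the NATS KV "leader" key.
// The lease payload format and error handling are simplified assumptions.
package cluster

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

type leaseRecord struct {
	LeaderID  string    `json:"leader_id"`
	Term      uint64    `json:"term"`
	ExpiresAt time.Time `json:"expires_at"`
}

// tryBecomeLeader attempts the atomic create; at most one node can succeed,
// which is what enforces the single-leader invariant.
func tryBecomeLeader(kv nats.KeyValue, nodeID string, term uint64) (bool, uint64) {
	data, _ := json.Marshal(leaseRecord{
		LeaderID:  nodeID,
		Term:      term,
		ExpiresAt: time.Now().Add(10 * time.Second), // LeaderLeaseTimeout
	})
	rev, err := kv.Create("leader", data)
	if err != nil {
		// Key exists or we lost the race; the caller may inspect the
		// existing lease and try to claim it once it has expired.
		return false, 0
	}
	return true, rev
}

// renewLease extends the lease; the revision check makes a stale leader's
// renewal fail instead of overwriting the new leader.
func renewLease(kv nats.KeyValue, nodeID string, term uint64, lastRev uint64) (uint64, error) {
	data, _ := json.Marshal(leaseRecord{
		LeaderID:  nodeID,
		Term:      term,
		ExpiresAt: time.Now().Add(10 * time.Second),
	})
	return kv.Update("leader", data, lastRev)
}
```
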
|
||||
|
||||
---
|
||||
|
||||
## Decision Trees
|
||||
|
||||
### Decision 1: Is Node Healthy?
|
||||
|
||||
```
|
||||
Query: Is Node X healthy?
|
||||
|
||||
┌─────────────────────────────────────┐
|
||||
│Get node status from Cluster.nodes │
|
||||
└────────────┬────────────────────────┘
|
||||
│
|
||||
┌────────▼────────────────┐
|
||||
│Check node.Status field │
|
||||
└────────┬────────────────┘
|
||||
│
|
||||
┌────────┴───────────────┬─────────────────────┬──────────────┐
|
||||
│ │ │ │
|
||||
▼ ▼ ▼ ▼
|
||||
┌────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────┐
|
||||
│Active │ │Draining │ │Failed │ │Unknown │
|
||||
├────────┤ ├────────────┤ ├──────────────┤ ├─────────┤
|
||||
│✓Healthy│ │⚠ Draining │ │✗ Unhealthy │ │✗ Error │
|
||||
│Can host│ │Should not │ │Don't use for │ │ │
|
||||
│shards │ │get new │ │sharding │ │ │
|
||||
└────────┘ │shards, but │ │Delete shards │ └─────────┘
|
||||
│existing OK │ │ASAP │
|
||||
└────────────┘ └──────────────┘
|
||||
```
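
The same decision expressed as a small Go fragment; the enum representation is an assumption made for illustration.

```go
// Sketch: the health decision as a helper over a NodeStatus enum.
type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusDraining
	NodeStatusFailed
)

// canHostNewShards mirrors the decision tree above.
func canHostNewShards(status NodeStatus) bool {
	switch status {
	case NodeStatusActive:
		return true // healthy: may receive new shards
	case NodeStatusDraining:
		return false // keeps existing shards, but gets no new ones
	case NodeStatusFailed:
		return false // unhealthy: its shards must be moved away
	default:
		return false // unknown status: treat as unhealthy
	}
}
```
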
|
||||
|
||||
### Decision 2: Should This Node Rebalance Shards?
|
||||
|
||||
```
|
||||
Command: RebalanceShards(nodeID, reason)
|
||||
|
||||
┌──────────────────────────────────┐
|
||||
│Is nodeID the current leader? │
|
||||
└────────┬─────────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────┐
|
||||
│ YES │ NO
|
||||
▼ ▼
|
||||
┌────────────┐ ┌──────────────────────┐
|
||||
│Continue │ │REJECT: NotLeader │
|
||||
│rebalancing │ │ │
|
||||
└─────┬──────┘ │Only leader can │
|
||||
│ │initiate rebalancing │
|
||||
│ └──────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────┐
|
||||
│Are there active nodes? │
|
||||
└────────┬────────────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────┐
|
||||
│ YES │ NO
|
||||
▼ ▼
|
||||
┌────────────┐ ┌──────────────────────┐
|
||||
│Continue │ │REJECT: NoActiveNodes │
|
||||
│rebalancing │ │ │
|
||||
└─────┬──────┘ │Can't assign shards │
|
||||
│ │with no healthy nodes │
|
||||
│ └──────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────┐
|
||||
│Call PlacementStrategy.RebalanceShards() │
|
||||
│ (compute new assignments) │
|
||||
└────────┬─────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────┐
|
||||
│Call ShardManager.AssignShards() │
|
||||
│ (validate & apply new assignments) │
|
||||
└────────┬─────────────────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────┐
|
||||
│ SUCCESS │ FAILURE
|
||||
▼ ▼
|
||||
┌────────────┐ ┌──────────────────────┐
|
||||
│Publish │ │Publish │
|
||||
│Shard │ │RebalancingFailed │
|
||||
│Migrated │ │event │
|
||||
│events │ │ │
|
||||
│ │ │Log error, backoff │
|
||||
│Publish │ │try again in 5 min │
|
||||
│Rebalancing │ │ │
|
||||
│Completed │ │ │
|
||||
└────────────┘ └──────────────────────┘
|
||||
```
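
A condensed sketch of the guard-then-delegate flow above. The interface shapes and the `map[int][]string` assignment type are assumptions; event publishing (`ShardMigrated`, `RebalancingCompleted`) is omitted for brevity.

```go
// Sketch of the rebalance entry point: leader-only guard, active-node
// guard, then delegate to placement and assignment.
package cluster

import "errors"

type PlacementStrategy interface {
	RebalanceShards(activeNodes []string, current map[int][]string) (map[int][]string, error)
}

type ShardAssigner interface {
	AssignShards(assignments map[int][]string) error
}

func rebalance(isLeader bool, activeNodes []string, current map[int][]string,
	placement PlacementStrategy, shards ShardAssigner) error {

	if !isLeader {
		return errors.New("NotLeader: only the leader can initiate rebalancing")
	}
	if len(activeNodes) == 0 {
		return errors.New("NoActiveNodes: cannot assign shards with no healthy nodes")
	}

	// Compute the new assignments, then let the shard aggregate validate
	// (full coverage, healthy nodes only) and apply them.
	newAssignments, err := placement.RebalanceShards(activeNodes, current)
	if err != nil {
		return err
	}
	return shards.AssignShards(newAssignments)
}
```
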
|
||||
|
||||
### Decision 3: Can We Assign This Shard to This Node?
|
||||
|
||||
```
|
||||
Command: AssignShard(shardID, nodeID)
|
||||
|
||||
┌──────────────────────────────────┐
|
||||
│Is nodeID in Cluster.nodes? │
|
||||
└────────┬─────────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────┐
|
||||
│ YES │ NO
|
||||
▼ ▼
|
||||
┌────────────┐ ┌──────────────────────┐
|
||||
│Continue │ │REJECT: NodeNotFound │
|
||||
│assignment │ │ │
|
||||
└─────┬──────┘ │Can't assign shard │
|
||||
│ │to non-existent node │
|
||||
│ └──────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────┐
|
||||
│Check node.Status │
|
||||
└────────┬─────────────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────┐
|
||||
│Active or Draining │ Failed
|
||||
▼ ▼
|
||||
┌────────────┐ ┌──────────────────────┐
|
||||
│Continue │ │REJECT: UnhealthyNode │
|
||||
│assignment │ │ │
|
||||
└─────┬──────┘ │Can't assign to │
|
||||
│ │failed node; it can't │
|
||||
│ │execute shards │
|
||||
│ └──────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────┐
|
||||
│Check replication factor │
|
||||
│ (existing nodes < replication limit?)│
|
||||
└────────┬─────────────────────────────┘
|
||||
│
|
||||
┌────┴──────────────────┐
|
||||
│ YES │ NO
|
||||
▼ ▼
|
||||
┌────────────┐ ┌──────────────────────┐
|
||||
│ACCEPT │ │REJECT: TooManyReplicas
|
||||
│Add node to │ │ │
|
||||
│shard's │ │Already have max │
|
||||
│replica │ │replicas for shard │
|
||||
│list │ │ │
|
||||
└────────────┘ └──────────────────────┘
|
||||
```
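
The three checks above as a single validation helper. This sketch reuses the `NodeStatus` fragment from Decision 1 and assumes the standard `errors` package is imported; error names echo the rejection labels in the tree.

```go
// Sketch of the per-assignment checks; data shapes are assumptions.
func validateShardAssignment(
	nodes map[string]NodeStatus, // nodeID -> current status
	currentOwners []string, // nodes already holding this shard
	candidate string,
	replicationFactor int,
) error {
	status, ok := nodes[candidate]
	if !ok {
		return errors.New("NodeNotFound: cannot assign a shard to a non-existent node")
	}
	if status == NodeStatusFailed {
		return errors.New("UnhealthyNode: cannot assign to a failed node")
	}
	if len(currentOwners) >= replicationFactor {
		return errors.New("TooManyReplicas: shard already has the maximum number of replicas")
	}
	return nil // ACCEPT: add candidate to the shard's replica list
}
```
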
|
||||
|
||||
---
|
||||
|
||||
## State Transitions
|
||||
|
||||
### Cluster State Machine
|
||||
|
||||
```
|
||||
┌────────────────┐
|
||||
│ Initializing │
|
||||
│ (no nodes) │
|
||||
└────────┬───────┘
|
||||
│ NodeJoined
|
||||
▼
|
||||
┌────────────────┐
|
||||
│ Single Node │
|
||||
│ (one node only)│
|
||||
└────────┬───────┘
|
||||
│ NodeJoined
|
||||
▼
|
||||
┌────────────────────────────────────────────┐
|
||||
│ Multi-Node Cluster │
|
||||
│ ├─ Stable (healthy nodes, shards assigned) │
|
||||
│ ├─ Rebalancing (shards moving) │
|
||||
│ └─ Degraded (failed node waiting for heal) │
|
||||
└────────────┬───────────────────────────────┘
|
||||
│ (All nodes left or failed)
|
||||
▼
|
||||
┌────────────────┐
|
||||
│ No Nodes │
|
||||
│ (cluster dead) │
|
||||
└────────────────┘
|
||||
```
|
||||
|
||||
### Node State Machine (per node)
|
||||
|
||||
```
|
||||
┌────────────────┐
|
||||
│ Discovered │
|
||||
│ (new heartbeat)│
|
||||
└────────┬───────┘
|
||||
│ JoinCluster
|
||||
▼
|
||||
┌────────────────┐
|
||||
│ Active │
|
||||
│ (healthy, │
|
||||
│ can host │
|
||||
│ shards) │
|
||||
└────────┬───────┘
|
||||
│
|
||||
┌────────┴──────────────┬─────────────────┐
|
||||
│ │ │
|
||||
│ (graceful) │ (heartbeat miss)│
|
||||
▼ ▼ ▼
|
||||
┌────────────┐ ┌────────────────┐ ┌────────────────┐
|
||||
│ Draining │ │ Failed │ │ Failed │
|
||||
│ (stop new │ │ (timeout:90s) │ │ (detected) │
|
||||
│ shards, │ │ │ │ (admin/health) │
|
||||
│ preserve │ │ Rebalance │ │ │
|
||||
│ existing) │ │ shards ASAP │ │ Rebalance │
|
||||
│ │ │ │ │ shards ASAP │
|
||||
│ │ │Recover? │ │ │
|
||||
│ │ │ ├─ Yes: │ │Recover? │
|
||||
│ │ │ │ → Active │ │ ├─ Yes: │
|
||||
│ │ │ └─ No: │ │ │ → Active │
|
||||
│ │ │ → Deleted │ │ └─ No: │
|
||||
│ │ │ │ │ → Deleted │
|
||||
│ │ └────────────────┘ └────────────────┘
|
||||
└────┬───────┘
|
||||
│ Removed
|
||||
▼
|
||||
┌────────────────┐
|
||||
│ Deleted │
|
||||
│ (left cluster) │
|
||||
└────────────────┘
|
||||
```
|
||||
|
||||
### Leadership State Machine (per node)
|
||||
|
||||
```
|
||||
┌──────────────────┐
|
||||
│ Not a Leader │
|
||||
│ (waiting) │
|
||||
└────────┬─────────┘
|
||||
│ Try Election (every 2s)
|
||||
│ Atomic create "leader" succeeds
|
||||
▼
|
||||
┌──────────────────┐
|
||||
│ Candidate │
|
||||
│ (won election) │
|
||||
└────────┬─────────┘
|
||||
│ Start lease renewal loop
|
||||
▼
|
||||
┌──────────────────┐
|
||||
│ Leader │
|
||||
│ (holding lease) │
|
||||
└────────┬─────────┘
|
||||
│
|
||||
┌──────┴───────────┬──────────────────────┐
|
||||
│ │ │
|
||||
│ Renew lease │ Lease expires │ Graceful
|
||||
│ (every 3s) │ (90s timeout) │ shutdown
|
||||
│ ✓ Success │ ✗ Failure │
|
||||
▼ ▼ ▼
|
||||
[stays] ┌──────────────────┐ ┌──────────────────┐
|
||||
│ Lost Leadership │ │ Lost Leadership │
|
||||
│ (lease expired) │ │ (graceful) │
|
||||
└────────┬─────────┘ └────────┬─────────┘
|
||||
│ │
|
||||
└──────────┬───────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────┐
|
||||
│ Not a Leader │
|
||||
│ (back to bench) │
|
||||
└──────────────────┘
|
||||
│
|
||||
└─▶ [Back to top]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Concurrency Model
|
||||
|
||||
### Thread Safety
|
||||
|
||||
All aggregates use `sync.RWMutex` for thread safety:
|
||||
|
||||
```go
|
||||
type ClusterManager struct {
|
||||
mutex sync.RWMutex // Protects access to:
|
||||
nodes map[string]*NodeInfo // - nodes
|
||||
shardMap *ShardMap // - shardMap
|
||||
// ...
|
||||
}
|
||||
|
||||
// Read operation (multiple goroutines)
|
||||
func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo {
|
||||
cm.mutex.RLock() // Shared lock
|
||||
defer cm.mutex.RUnlock()
|
||||
// ...
|
||||
}
|
||||
|
||||
// Write operation (exclusive)
|
||||
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
|
||||
cm.mutex.Lock() // Exclusive lock
|
||||
defer cm.mutex.Unlock()
|
||||
// ... (only one writer at a time)
|
||||
}
|
||||
```
|
||||
|
||||
### Background Goroutines
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ DistributedVM.Start() │
|
||||
├─────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ ┌──────────────────────────────────────────┐ │
|
||||
│ │ ClusterManager.Start() │ │
|
||||
│ │ ├─ go election.Start() │ │
|
||||
│ │ │ ├─ go electionLoop() [ticker: 2s] │ │
|
||||
│ │ │ ├─ go leaseRenewalLoop() [ticker: 3s]│ │
|
||||
│ │ │ └─ go monitorLeadership() [watcher] │ │
|
||||
│ │ │ │ │
|
||||
│ │ ├─ go monitorNodes() [ticker: 30s] │ │
|
||||
│ │ └─ go rebalanceLoop() [ticker: 5m] │ │
|
||||
│ └──────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌──────────────────────────────────────────┐ │
|
||||
│ │ NodeDiscovery.Start() │ │
|
||||
│ │ ├─ go announceNode() [ticker: 30s] │ │
|
||||
│ │ └─ Subscribe to "aether.discovery" │ │
|
||||
│ └──────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌──────────────────────────────────────────┐ │
|
||||
│ │ NATS subscriptions │ │
|
||||
│ │ ├─ "aether.cluster.*" → messages │ │
|
||||
│ │ ├─ "aether.discovery" → node updates │ │
|
||||
│ │ └─ "aether-leader-election" → KV watch │ │
|
||||
│ └──────────────────────────────────────────┘ │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────┘
|
||||
```
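
These loops share one pattern: a ticker plus a shutdown signal. A minimal sketch of that pattern is below; the intervals come from the diagram, while the callback names in the wiring comment are illustrative.

```go
// Sketch of the ticker-driven loop pattern behind electionLoop,
// leaseRenewalLoop, monitorNodes, and rebalanceLoop.
package cluster

import (
	"context"
	"time"
)

func runLoop(ctx context.Context, interval time.Duration, tick func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // shutdown requested
		case <-ticker.C:
			tick()
		}
	}
}

// Example wiring, with the intervals from the diagram above:
//
//	go runLoop(ctx, 2*time.Second, le.tryElection)       // electionLoop
//	go runLoop(ctx, 3*time.Second, le.renewLease)        // leaseRenewalLoop
//	go runLoop(ctx, 30*time.Second, cm.checkNodeHealth)  // monitorNodes
//	go runLoop(ctx, 5*time.Minute, cm.rebalanceCheck)    // rebalanceLoop
```
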
|
||||
|
||||
---
|
||||
|
||||
## Event Sequences
|
||||
|
||||
### Event: Node Join, Rebalance, Shard Assignment
|
||||
|
||||
```
|
||||
Timeline:
|
||||
|
||||
T=0s
|
||||
├─ Node-4 joins cluster
|
||||
├─ NodeDiscovery announces NodeJoined via NATS
|
||||
└─ ClusterManager receives and processes
|
||||
|
||||
T=0.1s
|
||||
├─ ClusterManager.JoinCluster() executes
|
||||
├─ Updates nodes map, hashRing
|
||||
├─ Publishes NodeJoined event
|
||||
└─ If leader: triggers rebalancing
|
||||
|
||||
T=0.5s (if leader)
|
||||
├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing)
|
||||
├─ PlacementStrategy.RebalanceShards() computes new assignments
|
||||
└─ ShardManager.AssignShards() applies new assignments
|
||||
|
||||
T=1.0s
|
||||
├─ Publishes ShardMigrated events (one per shard moved)
|
||||
├─ All nodes subscribe to these events
|
||||
├─ Each node routing table updated
|
||||
└─ Actors aware of new shard locations
|
||||
|
||||
T=1.5s onwards
|
||||
├─ Actors on moved shards migrated (application layer)
|
||||
├─ Actor Runtime subscribes to ShardMigrated
|
||||
├─ Triggers actor migration via ActorMigration
|
||||
└─ Eventually: rebalancing complete
|
||||
|
||||
T=5m
|
||||
├─ Periodic rebalance check (5m timer)
|
||||
├─ If no changes: no-op
|
||||
└─ If imbalance detected: trigger rebalance again
|
||||
```
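
A sketch of how a node might consume the `ShardMigrated` events in this timeline and update its local routing table through a NATS subscription. The subject name and JSON payload shape are assumptions; the document only states that cluster messages travel on `aether.cluster.*`.

```go
// Sketch: subscribe to shard migration events and keep a local routing
// table current.
package cluster

import (
	"encoding/json"
	"log"
	"sync"

	"github.com/nats-io/nats.go"
)

type ShardMigratedEvent struct {
	ShardID int      `json:"shard_id"`
	ToNodes []string `json:"to_nodes"` // primary first, then replicas
}

type routingTable struct {
	mu     sync.RWMutex
	owners map[int][]string // shardID -> current owners
}

func (rt *routingTable) subscribe(nc *nats.Conn) (*nats.Subscription, error) {
	return nc.Subscribe("aether.cluster.shard.migrated", func(msg *nats.Msg) {
		var ev ShardMigratedEvent
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Printf("bad ShardMigrated payload: %v", err)
			return
		}
		rt.mu.Lock()
		rt.owners[ev.ShardID] = ev.ToNodes
		rt.mu.Unlock()
	})
}
```
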
|
||||
|
||||
### Event: Node Failure Detection and Recovery
|
||||
|
||||
```
|
||||
Timeline:
|
||||
|
||||
T=0s
|
||||
├─ Node-2 healthy, last heartbeat received
|
||||
└─ Node-2.LastSeen = now
|
||||
|
||||
T=30s
|
||||
├─ Node-2 healthcheck runs (every 30s timer)
|
||||
├─ Publishes heartbeat
|
||||
└─ Node-2.LastSeen updated
|
||||
|
||||
T=60s
|
||||
├─ (Node-2 still healthy)
|
||||
└─ Heartbeat received, LastSeen updated
|
||||
|
||||
T=65s
|
||||
├─ Node-2 CRASH (network failure or process crash)
|
||||
├─ No more heartbeats sent
|
||||
└─ Node-2.LastSeen = 60s
|
||||
|
||||
T=90s (timeout)
|
||||
├─ ClusterManager.checkNodeHealth() detects timeout
|
||||
├─ now - LastSeen > 90s → mark node Failed
|
||||
├─ ClusterManager.MarkNodeFailed() executes
|
||||
├─ Publishes NodeFailed event
|
||||
├─ If leader: triggers rebalancing
|
||||
└─ (If not leader: waits for leader to rebalance)
|
||||
|
||||
T=91s (if leader)
|
||||
├─ RebalanceShards triggered
|
||||
├─ PlacementStrategy computes new topology without Node-2
|
||||
├─ ShardManager.AssignShards() reassigns shards
|
||||
└─ Publishes ShardMigrated events
|
||||
|
||||
T=92s onwards
|
||||
├─ Actors migrated from Node-2 to healthy nodes
|
||||
└─ No actor loss (assuming replication or migration succeeded)
|
||||
|
||||
T=120s (Node-2 recovery)
|
||||
├─ Node-2 process restarts
|
||||
├─ NodeDiscovery announces NodeJoined again
|
||||
├─ Status: Active
|
||||
└─ (Back to Node Join sequence if leader decides)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Configuration & Tuning
|
||||
|
||||
### Key Parameters
|
||||
|
||||
```go
|
||||
// From cluster/types.go
|
||||
const (
|
||||
// LeaderLeaseTimeout: how long before leader must renew
|
||||
LeaderLeaseTimeout = 10 * time.Second
|
||||
|
||||
// HeartbeatInterval: how often leader renews
|
||||
HeartbeatInterval = 3 * time.Second
|
||||
|
||||
// ElectionTimeout: how often nodes try election
|
||||
ElectionTimeout = 2 * time.Second
|
||||
|
||||
// Node failure detection (in manager.go)
|
||||
nodeFailureTimeout = 90 * time.Second
|
||||
)
|
||||
|
||||
// From cluster/types.go
|
||||
const (
|
||||
// DefaultNumShards: total shards in cluster
|
||||
DefaultNumShards = 1024
|
||||
|
||||
// DefaultVirtualNodes: per-node virtual replicas for distribution
|
||||
DefaultVirtualNodes = 150
|
||||
)
|
||||
```
|
||||
|
||||
### Tuning Guide
|
||||
|
||||
| Parameter | Current | Rationale | Trade-off |
|-----------|---------|-----------|-----------|
| LeaderLeaseTimeout | 10s | Fast failure detection | May cause thrashing in high-latency networks |
| HeartbeatInterval | 3s | Leader proves liveness every 3s | Roughly three renewals per 10s lease window |
| ElectionTimeout | 2s | Retry elections frequently | CPU cost, but quick recovery |
| NodeFailureTimeout | 90s | 3x the 30s node heartbeat interval | Tolerates temporary network issues |
| DefaultNumShards | 1024 | Good granularity for large clusters | More shards = more metadata |
| DefaultVirtualNodes | 150 | Balance between distribution and overhead | Lower = worse distribution; higher = more ring operations |
|
||||
|
||||
---
|
||||
|
||||
## Failure Scenarios & Recovery
|
||||
|
||||
### Scenario A: Single Node Fails
|
||||
|
||||
```
|
||||
Before: [A (leader), B, C, D] with 1024 shards
|
||||
├─ A: 256 shards (+ leader)
|
||||
├─ B: 256 shards
|
||||
├─ C: 256 shards
|
||||
└─ D: 256 shards
|
||||
|
||||
B crashes (no recovery) → waits 90s → marked Failed
|
||||
|
||||
After Rebalance:
|
||||
[A (leader), C, D] with 1024 shards
|
||||
├─ A: 341 shards (+ leader)
|
||||
├─ C: 341 shards
|
||||
└─ D: 342 shards
|
||||
```
|
||||
|
||||
**Recovery:** Only the failed node's ~1/4 of the shards (256 of 1024) are reshuffled; consistent hashing with virtual nodes leaves the remaining assignments in place
|
||||
|
||||
---
|
||||
|
||||
### Scenario B: Leader Fails
|
||||
|
||||
```
|
||||
Before: [A (leader), B, C, D]
|
||||
|
||||
A crashes → waits 90s → marked Failed
|
||||
no lease renewal → lease expires after 10s
|
||||
|
||||
B, C, or D wins election → new leader
|
||||
→ triggers rebalance
|
||||
→ reshuffles A's shards
|
||||
|
||||
After: [B (leader), C, D]
|
||||
```
|
||||
|
||||
**Recovery:** New leader elected within 10s; rebalancing within 100s; no loss if replicas present
|
||||
|
||||
---
|
||||
|
||||
### Scenario C: Network Partition
|
||||
|
||||
```
|
||||
Before: [A (leader), B, C, D]
|
||||
Partition: {A} isolated | {B, C, D} connected
|
||||
|
||||
At T=10s (lease expires):
|
||||
├─ A: can't reach NATS, can't renew → loses leadership
|
||||
├─ B, C, D: A's lease expired, one wins election
|
||||
└─ New leader coordinates rebalance
|
||||
|
||||
Risk: If A can reach NATS (just isolated from app), might try to renew
|
||||
but atomic update fails because term mismatch
|
||||
|
||||
Result: Single leader maintained; no split-brain
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring & Observability
|
||||
|
||||
### Key Metrics to Track
|
||||
|
||||
```
|
||||
# Cluster Topology
|
||||
gauge: cluster.nodes.count [active|draining|failed]
|
||||
gauge: cluster.shards.assigned [0, 1024]
|
||||
gauge: cluster.shards.orphaned [0, 1024]
|
||||
|
||||
# Leadership
|
||||
gauge: cluster.leader.is_leader [0, 1]
|
||||
gauge: cluster.leader.term [0, ∞]
|
||||
gauge: cluster.leader.lease_expires_in_seconds [0, 10]
|
||||
|
||||
# Rebalancing
|
||||
counter: cluster.rebalancing.triggered [reason]
|
||||
gauge: cluster.rebalancing.active [0, 1]
|
||||
counter: cluster.rebalancing.completed [shards_moved]
|
||||
counter: cluster.rebalancing.failed [reason]
|
||||
|
||||
# Node Health
|
||||
gauge: cluster.node.heartbeat_latency_ms [per node]
|
||||
gauge: cluster.node.load [per node]
|
||||
gauge: cluster.node.vm_count [per node]
|
||||
counter: cluster.node.failures [reason]
|
||||
```
|
||||
|
||||
### Alerts
|
||||
|
||||
```
|
||||
- Leader heartbeat missing > 5s → election may be stuck
|
||||
- Rebalancing in progress > 5min → something wrong
|
||||
- Orphaned shards > 0 → invariant violation
|
||||
- Node failure > 50% of cluster → investigate
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Full domain model
|
||||
- [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation roadmap
|
||||
- [manager.go](./manager.go) - ClusterManager implementation
|
||||
- [leader.go](./leader.go) - LeaderElection implementation
|
||||
- [shard.go](./shard.go) - ShardManager implementation
|
||||
- [discovery.go](./discovery.go) - NodeDiscovery implementation
|
||||
- [distributed.go](./distributed.go) - DistributedVM orchestrator
|
||||

.product-strategy/cluster/DOMAIN_MODEL.md (new file, 997 lines)
@@ -0,0 +1,997 @@

# Domain Model: Cluster Coordination
|
||||
|
||||
## Summary
|
||||
|
||||
The Cluster Coordination context manages the distributed topology of actor nodes in an Aether cluster. Its core responsibility is to maintain consistency invariants: exactly one leader per term, all active shards assigned to at least one node, and no orphaned shards. It coordinates node discovery (via NATS heartbeats), leader election (lease-based), shard assignment (via consistent hashing), and rebalancing (when topology changes). The context enforces that only the leader can initiate rebalancing, and that node failures trigger shard reassignment to prevent actor orphaning.
|
||||
|
||||
**Key insight:** Cluster Coordination is *not* actor placement or routing (that's the application's responsibility via ShardManager). It owns the *topology* and *leadership*, enabling routing decisions by publishing shard assignments.
|
||||
|
||||
---
|
||||
|
||||
## Invariants
|
||||
|
||||
These are the business rules that must never be violated:
|
||||
|
||||
### Invariant 1: Single Leader Per Term
|
||||
- **Rule:** At any point in time, at most one node is the leader for the current leadership term.
|
||||
- **Scope:** LeadershipLease aggregate
|
||||
- **Why:** Multiple leaders (split-brain) lead to conflicting rebalancing decisions and inconsistent shard assignments.
|
||||
- **Enforcement:** LeaderElection enforces via NATS KV atomic operations (create/update with revision). Only one node can atomically claim the "leader" key.
|
||||
|
||||
### Invariant 2: All Active Shards Have Owner(s)
|
||||
- **Rule:** Every shard ID in [0, ShardCount) must be assigned to at least one active node if the cluster is healthy.
|
||||
- **Scope:** ShardAssignment aggregate
|
||||
- **Why:** Unassigned shards mean actors on those shards have no home; messages will orphan.
|
||||
- **Enforcement:** LeaderElection enforces (only leader can assign). ClusterManager validates before applying assignments.
|
||||
|
||||
### Invariant 3: Assigned Shards Exist on Healthy Nodes Only
|
||||
- **Rule:** A shard assignment to node N is only valid if N is in NodeStatusActive.
|
||||
- **Scope:** ShardAssignment + Cluster aggregates (coupled)
|
||||
- **Why:** Assigning shards to failed nodes means actors can't execute.
|
||||
- **Enforcement:** When node fails (NodeStatusFailed), leader rebalances shards off that node. handleNodeUpdate marks nodes failed after 90s heartbeat miss.
|
||||
|
||||
### Invariant 4: Shard Assignments Stable During Leadership Lease
|
||||
- **Rule:** Shard assignments only change in response to LeaderElected or NodeFailed; they don't arbitrarily shift during a stable leadership term.
|
||||
- **Scope:** ShardAssignment + LeadershipLease (coupled)
|
||||
- **Why:** Frequent rebalancing causes thrashing and actor migration overhead.
|
||||
- **Enforcement:** rebalanceLoop (every 5 min) only runs if leader; triggerShardRebalancing only called on node changes (NodeJoined/Left/Failed).
|
||||
|
||||
### Invariant 5: Leader Is an Active Node
|
||||
- **Rule:** If LeaderID is set, the node with that ID must exist in Cluster.nodes with status=Active.
|
||||
- **Scope:** Cluster + LeadershipLease (coupled)
|
||||
- **Why:** A failed leader cannot coordinate cluster decisions.
|
||||
- **Enforcement:** handleNodeUpdate marks nodes failed after timeout; leader renewal fails if node is marked failed. Split-brain risk: partition could allow multiple leaders, but lease expiration + atomic update mitigates.
|
||||
|
||||
---
|
||||
|
||||
## Aggregates
|
||||
|
||||
### Aggregate 1: Cluster (Root)
|
||||
|
||||
**Invariants enforced:**
|
||||
- Invariant 2: All active shards have owners
|
||||
- Invariant 3: Shards assigned only to healthy nodes
|
||||
- Invariant 4: Shard assignments stable during leadership lease
|
||||
- Invariant 5: Leader is an active node
|
||||
|
||||
**Entities:**
|
||||
- **Cluster** (root): Represents the distributed topology and orchestrates rebalancing
|
||||
- `nodes`: Map[NodeID → NodeInfo] - all known nodes, their status, load, capacity
|
||||
- `shardMap`: ShardMap - current shard-to-node assignments
|
||||
- `hashRing`: ConsistentHashRing - used to compute which node owns which shard
|
||||
- `currentLeaderID`: String - who is leading this term
|
||||
- `term`: uint64 - leadership term counter
|
||||
|
||||
**Value Objects:**
|
||||
- `NodeInfo`: ID, Address, Port, Status, Capacity, Load, LastSeen, Metadata, VMCount, ShardIDs
|
||||
  - Represents a physical node in the cluster; treated as immutable and replaced via NodeUpdate commands rather than mutated in place
|
||||
- `ShardMap`: Version, Shards (map[ShardID → []NodeID]), Nodes (map[NodeID → NodeInfo]), UpdateTime
|
||||
- Snapshot of current shard topology; immutable (replaced, not mutated)
|
||||
- `NodeStatus`: Enum (Active, Draining, Failed)
|
||||
- Indicates health state of a node
|
||||
|
||||
**Lifecycle:**
|
||||
- **Created when:** ClusterManager is instantiated (Cluster exists as singleton during runtime)
|
||||
- **Destroyed when:** Cluster shuts down or node is permanently removed
|
||||
- **Transitions:**
|
||||
- NodeJoined → add node to nodes, add to hashRing, trigger rebalance (if leader)
|
||||
- NodeLeft → remove node from nodes, remove from hashRing, trigger rebalance (if leader)
|
||||
- NodeFailed (detected) → mark node as failed, trigger rebalance (if leader)
|
||||
- LeaderElected → update currentLeaderID, may trigger rebalance
|
||||
- ShardAssigned → update shardMap, increment version
|
||||
|
||||
**Behavior Methods (not just getters/setters):**
|
||||
- `addNode(nodeInfo)` → NodeJoined event + may trigger rebalance
|
||||
- `removeNode(nodeID)` → NodeLeft event + trigger rebalance
|
||||
- `markNodeFailed(nodeID)` → NodeFailed event + trigger rebalance
|
||||
- `assignShards(shardMap)` → ShardAssigned event (leader only)
|
||||
- `rebalanceTopology()` → ShardMigrated events (leader only)
|
||||
|
||||
---
|
||||
|
||||
### Aggregate 2: LeadershipLease (Root)
|
||||
|
||||
**Invariants enforced:**
|
||||
- Invariant 1: Single leader per term
|
||||
- Invariant 5: Leader is an active node
|
||||
|
||||
**Entities:**
|
||||
- **LeadershipLease** (root): Represents the current leadership claim
|
||||
- `leaderID`: String - which node holds the lease
|
||||
- `term`: uint64 - monotonically increasing term number
|
||||
- `expiresAt`: Timestamp - when this lease expires (now + LeaderLeaseTimeout)
|
||||
- `startedAt`: Timestamp - when leader was elected
|
||||
|
||||
**Value Objects:**
|
||||
- None (all properties immutable; lease is replaced, not mutated)
|
||||
|
||||
**Lifecycle:**
|
||||
- **Created when:** A node wins election and creates the "leader" key in NATS KV
|
||||
- **Destroyed when:** Lease expires and is not renewed, or leader resigns
|
||||
- **Transitions:**
|
||||
- TryBecomeLeader → attempt atomic create; if fails, maybe claim expired lease
|
||||
- RenewLease (every 3s) → atomically update expiresAt to now + 10s
|
||||
- LeaseExpired (detected) → remove from KV, allow new election
|
||||
- NodeFailed (detected) → if failed node is leader, expiration will trigger new election
|
||||
|
||||
**Behavior Methods:**
|
||||
- `tryAcquire(nodeID)` → LeaderElected event (if succeeds)
|
||||
- `renewLease(nodeID)` → LeadershipRenewed event (internal, not exposed as command)
|
||||
- `isExpired()` → Boolean
|
||||
- `isLeader(nodeID)` → Boolean
|
||||
|
||||
**Invariant enforcement mechanism:**
|
||||
- **Atomic operations in NATS KV:** Only one node can successfully create "leader" key (or update with correct revision), ensuring single leader per term.
|
||||
- **Lease expiration:** If leader crashes without renewing, lease expires after 10s, allowing another node to claim it.
|
||||
- **Revision-based updates:** Update to lease must include correct revision (optimistic concurrency control), preventing stale leader from renewing.
|
||||
|
||||
---
|
||||
|
||||
### Aggregate 3: ShardAssignment (Root)
|
||||
|
||||
**Invariants enforced:**
|
||||
- Invariant 2: All active shards have owners
|
||||
- Invariant 3: Shards assigned only to healthy nodes
|
||||
|
||||
**Entities:**
|
||||
- **ShardAssignment** (root): Maps shards to their owning nodes
|
||||
- `version`: uint64 - incremented on each change, enables version comparison for replication
|
||||
- `assignments`: Map[ShardID → []NodeID] - shard to primary+replica nodes
|
||||
- `nodes`: Map[NodeID → NodeInfo] - snapshot of active nodes at assignment time
|
||||
- `updateTime`: Timestamp
|
||||
|
||||
**Value Objects:**
|
||||
- None (structure is just data; immutability via replacement)
|
||||
|
||||
**Lifecycle:**
|
||||
- **Created when:** Cluster initializes (empty assignments)
|
||||
- **Updated when:** Leader calls rebalanceTopology() → new ShardAssignment created (old one replaced)
|
||||
- **Destroyed when:** Cluster shuts down
|
||||
|
||||
**Behavior Methods:**
|
||||
- `assignShard(shardID, nodeList)` → validates all nodes in nodeList are active
|
||||
- `rebalanceFromTopology(topology, strategy)` → calls strategy to compute new assignments
|
||||
- `validateAssignments()` → checks all shards assigned, all owners healthy
|
||||
- `getAssignmentsForNode(nodeID)` → []ShardID
|
||||
|
||||
**Validation Rules:**
|
||||
- All nodes in assignment must be in nodes map with status=Active
|
||||
- All shard IDs in [0, ShardCount) must appear in assignments (no orphans)
|
||||
- Replication factor respected (each shard has 1..ReplicationFactor owners)
|
||||
|
||||
---
|
||||
|
||||
## Commands
|
||||
|
||||
Commands represent user or system intents to change the cluster state. Only aggregates handle commands.
|
||||
|
||||
### Command 1: JoinCluster
|
||||
- **Aggregate:** Cluster
|
||||
- **Actor:** Node joining (or discovery service announcing)
|
||||
- **Input:** nodeID, address, port, capacity, metadata
|
||||
- **Validates:**
|
||||
- nodeID is not empty
|
||||
- capacity > 0
|
||||
- address is reachable (optional)
|
||||
- **Invariants enforced:** Invariant 2 (rebalance if needed)
|
||||
- **Success:** NodeJoined event published
|
||||
- **Failure:** DuplicateNodeError (node already in cluster), ValidationError
|
||||
|
||||
### Command 2: ElectLeader
|
||||
- **Aggregate:** LeadershipLease
|
||||
- **Actor:** Node attempting election (triggered periodically)
|
||||
- **Input:** nodeID, currentTerm
|
||||
- **Validates:**
|
||||
- nodeID matches a current cluster member (in Cluster.nodes, status=Active)
|
||||
- Can attempt if no current leader OR lease is expired
|
||||
- **Invariants enforced:** Invariant 1, 5
|
||||
- **Success:** LeaderElected event published (if atomic create succeeds); LeadershipRenewed (if claim expired lease)
|
||||
- **Failure:** LeaderElectionFailed (atomic operation lost), NodeNotHealthy
|
||||
|
||||
### Command 3: RenewLeadership
|
||||
- **Aggregate:** LeadershipLease
|
||||
- **Actor:** Current leader (triggered every 3s)
|
||||
- **Input:** nodeID, currentTerm
|
||||
- **Validates:**
|
||||
- nodeID is current leader
|
||||
- term matches current term
|
||||
- node status is Active (else fail and lose leadership)
|
||||
- **Invariants enforced:** Invariant 1, 5
|
||||
- **Success:** LeadershipRenewed (internal event, triggers heartbeat log entry)
|
||||
- **Failure:** LeadershipLost (node is no longer healthy or lost atomic update race)
|
||||
|
||||
### Command 4: MarkNodeFailed
|
||||
- **Aggregate:** Cluster
|
||||
- **Actor:** System (monitoring service) or leader (if heartbeat misses)
|
||||
- **Input:** nodeID, reason
|
||||
- **Validates:**
|
||||
- nodeID exists in cluster
|
||||
- node is currently Active (don't re-fail already-failed nodes)
|
||||
- **Invariants enforced:** Invariant 2, 3, 5 (rebalance to move shards off failed node)
|
||||
- **Success:** NodeFailed event published; RebalanceTriggered (if leader)
|
||||
- **Failure:** NodeNotFound, NodeAlreadyFailed
|
||||
|
||||
### Command 5: AssignShards
|
||||
- **Aggregate:** ShardAssignment (+ reads Cluster topology)
|
||||
- **Actor:** Leader (only leader can assign)
|
||||
- **Input:** nodeID (must be leader), newAssignments (Map[ShardID → []NodeID])
|
||||
- **Validates:**
|
||||
- nodeID is current leader
|
||||
- all nodes in assignments are Active
|
||||
- all shards in [0, ShardCount) are covered
|
||||
- replication factor respected
|
||||
- **Invariants enforced:** Invariant 2, 3 (assignment only valid if all nodes healthy)
|
||||
- **Success:** ShardAssigned event published with new ShardMap
|
||||
- **Failure:** NotLeader, InvalidAssignment (node not found), UnhealthyNode, IncompleteAssignment (missing shards)
|
||||
|
||||
### Command 6: RebalanceShards
|
||||
- **Aggregate:** Cluster (orchestrates) + ShardAssignment (executes)
|
||||
- **Actor:** Leader (triggered by node changes or periodic check)
|
||||
- **Input:** nodeID (must be leader), strategy (optional placement strategy)
|
||||
- **Validates:**
|
||||
- nodeID is current leader
|
||||
- cluster has active nodes
|
||||
- **Invariants enforced:** Invariant 2 (all shards still assigned), Invariant 3 (only to healthy nodes)
|
||||
- **Success:** RebalancingCompleted event; zero or more ShardMigrated events (one per shard moved)
|
||||
- **Failure:** NotLeader, NoActiveNodes, RebalancingFailed (unexpected topology change mid-rebalance)
|
||||
|
||||
---
|
||||
|
||||
## Events
|
||||
|
||||
Events represent facts that happened. They are published after successful command execution.
|
||||
|
||||
### Event 1: NodeJoined
|
||||
- **Triggered by:** JoinCluster command
|
||||
- **Aggregate:** Cluster
|
||||
- **Data:** nodeID, address, port, capacity, metadata, timestamp
|
||||
- **Consumed by:**
|
||||
- Cluster (adds node to ring)
|
||||
- Policies (RebalancingTriggerPolicy)
|
||||
- **Semantics:** A new node entered the cluster and is ready to host actors
|
||||
- **Immutability:** Once published, never changes
|
||||
|
||||
### Event 2: NodeDiscovered
|
||||
- **Triggered by:** NodeDiscovery announces node via NATS pub/sub (implicit)
|
||||
- **Aggregate:** Cluster (discovery feeds into cluster topology)
|
||||
- **Data:** nodeID, nodeInfo, timestamp
|
||||
- **Consumed by:** Cluster topology sync
|
||||
- **Semantics:** Node became visible to the cluster; may be new or rediscovered after network partition
|
||||
- **Note:** Implicit event; not explicitly commanded, but captured in domain language
|
||||
|
||||
### Event 3: LeaderElected
|
||||
- **Triggered by:** ElectLeader command (atomic KV create succeeds) or ReclaimExpiredLease
|
||||
- **Aggregate:** LeadershipLease
|
||||
- **Data:** leaderID, term, expiresAt, startedAt, timestamp
|
||||
- **Consumed by:**
|
||||
- Cluster (updates currentLeaderID)
|
||||
- Policies (LeaderElectionCompletePolicy)
|
||||
- **Semantics:** A node has acquired leadership for the given term
|
||||
- **Guarantee:** At most one node can succeed in creating this event per term
|
||||
|
||||
### Event 4: LeadershipLost
|
||||
- **Triggered by:** Lease expires (detected by monitorLeadership watcher) or RenewLeadership fails
|
||||
- **Aggregate:** LeadershipLease
|
||||
- **Data:** leaderID, term, reason (LeaseExpired, FailedToRenew, NodeFailed), timestamp
|
||||
- **Consumed by:**
|
||||
- Cluster (clears currentLeaderID)
|
||||
- Policies (trigger new election)
|
||||
- **Semantics:** The leader is no longer valid and coordination authority is vacant
|
||||
- **Trigger:** No renewal received for 10s, or atomic update fails
|
||||
|
||||
### Event 5: LeadershipRenewed
|
||||
- **Triggered by:** RenewLeadership command (succeeds every 3s)
|
||||
- **Aggregate:** LeadershipLease
|
||||
- **Data:** leaderID, term, expiresAt, timestamp
|
||||
- **Consumed by:** Internal use (heartbeat signal); not published to other contexts
|
||||
- **Semantics:** Leader is alive and ready to coordinate
|
||||
- **Frequency:** Every 3s per leader
|
||||
|
||||
### Event 6: ShardAssigned
|
||||
- **Triggered by:** AssignShards command or RebalanceShards command
|
||||
- **Aggregate:** ShardAssignment
|
||||
- **Data:** shardID, nodeIDs (primary + replicas), version, timestamp
|
||||
- **Consumed by:**
|
||||
- ShardManager (updates routing)
|
||||
- Policies (ShardOwnershipPolicy)
|
||||
- Other contexts (if they subscribe to shard topology changes)
|
||||
- **Semantics:** Shard N is now owned by these nodes (primary first)
|
||||
- **Bulk event:** Often published multiple times in one rebalance operation
|
||||
|
||||
### Event 7: NodeFailed
|
||||
- **Triggered by:** MarkNodeFailed command
|
||||
- **Aggregate:** Cluster
|
||||
- **Data:** nodeID, reason (HeartbeatTimeout, AdminMarked, etc.), timestamp
|
||||
- **Consumed by:**
|
||||
- Cluster (removes from active pool)
|
||||
- Policies (RebalancingTriggerPolicy, actor migration)
|
||||
- Other contexts (may need to relocate actors)
|
||||
- **Semantics:** Node is unresponsive and should be treated as offline
|
||||
- **Detection:** heartbeat miss after 90s, or explicit admin action
|
||||
|
||||
### Event 8: NodeLeft
|
||||
- **Triggered by:** Node gracefully shuts down (announceNode(NodeLeft)) or MarkNodeFailed (for draining)
|
||||
- **Aggregate:** Cluster
|
||||
- **Data:** nodeID, reason (GracefulShutdown, AdminRemoved, etc.), timestamp
|
||||
- **Consumed by:** Policies (same as NodeFailed, triggers rebalance)
|
||||
- **Semantics:** Node is intentionally leaving and will not rejoin
|
||||
- **Difference from NodeFailed:** Intent signal; failed nodes might rejoin after network partition heals
|
||||
|
||||
### Event 9: ShardMigrated
|
||||
- **Triggered by:** RebalanceShards command (one event per shard reassigned)
|
||||
- **Aggregate:** Cluster
|
||||
- **Data:** shardID, fromNodes (old owners), toNodes (new owners), timestamp
|
||||
- **Consumed by:**
|
||||
- Local runtime (via ShardManager; triggers actor migration)
|
||||
- Other contexts (if they track actor locations)
|
||||
- **Semantics:** A shard's ownership changed; actors on that shard may need to migrate
|
||||
- **Migration strategy:** Application owns how to move actors (via ActorMigration); cluster just signals the change
|
||||
|
||||
### Event 10: RebalancingTriggered
|
||||
- **Triggered by:** RebalanceShards command (start)
|
||||
- **Aggregate:** Cluster
|
||||
- **Data:** leaderID, reason (NodeJoined, NodeFailed, Manual), timestamp
|
||||
- **Consumed by:** Monitoring/debugging
|
||||
- **Semantics:** Leader has initiated a rebalancing cycle
|
||||
- **Note:** Informational; subsequent ShardMigrated events describe the actual changes
|
||||
|
||||
### Event 11: RebalancingCompleted
|
||||
- **Triggered by:** RebalanceShards command (finish)
|
||||
- **Aggregate:** Cluster
|
||||
- **Data:** leaderID, completedAt, migrationsCount, timestamp
|
||||
- **Consumed by:** Monitoring/debugging, other contexts may wait for this before proceeding
|
||||
- **Semantics:** All shard migrations have been assigned; doesn't mean they're complete on actors
|
||||
- **Note:** ShardMigrated is the signal to move actors; this is the coordination signal
|
||||
|
||||
---
|
||||
|
||||
## Policies
|
||||
|
||||
Policies are automated reactions to events. They connect events to commands across aggregates and contexts.
|
||||
|
||||
### Policy 1: Single Leader Policy
|
||||
- **Trigger:** When LeadershipLost event
|
||||
- **Action:** Any node can attempt ElectLeader command
|
||||
- **Context:** Only one will succeed due to atomic NATS KV operation
|
||||
- **Rationale:** Ensure leadership is re-established quickly after vacancy
|
||||
- **Implementation:** electionLoop in LeaderElection runs every 2s, calls tryBecomeLeader if not leader
|
||||
|
||||
### Policy 2: Lease Renewal Policy
|
||||
- **Trigger:** Periodic timer (every 3s)
|
||||
- **Action:** If IsLeader, execute RenewLeadership command
|
||||
- **Context:** Heartbeat mechanism to prove leader is alive
|
||||
- **Rationale:** Detect leader failure via lease expiration after 10s inactivity
|
||||
- **Implementation:** leaseRenewalLoop in LeaderElection; failure triggers loseLeadership()
|
||||
|
||||
### Policy 3: Lease Expiration Policy
|
||||
- **Trigger:** When LeadershipLease.expiresAt < now (detected by monitorLeadership watcher)
|
||||
- **Action:** Clear currentLeader, publish LeadershipLost, trigger SingleLeaderPolicy
|
||||
- **Context:** Automatic failover when leader stops renewing
|
||||
- **Rationale:** Prevent stale leaders from coordinating during network partitions
|
||||
- **Implementation:** monitorLeadership watches "leader" KV key; if deleted or expired, calls handleLeadershipUpdate
|
||||
|
||||
### Policy 4: Node Heartbeat Policy
|
||||
- **Trigger:** Periodic timer (every 30s) - NodeDiscovery announces
|
||||
- **Action:** Publish node status via NATS "aether.discovery" subject
|
||||
- **Context:** Membership discovery; all nodes broadcast presence
|
||||
- **Rationale:** Other nodes learn topology via heartbeats; leader detects failures via absence
|
||||
- **Implementation:** NodeDiscovery.Start() runs heartbeat ticker
|
||||
|
||||
### Policy 5: Node Failure Detection Policy
|
||||
- **Trigger:** When NodeUpdate received with LastSeen > 90s ago
|
||||
- **Action:** Mark node as NodeStatusFailed; if leader, trigger RebalanceShards
|
||||
- **Context:** Eventual failure detection (passive, via heartbeat miss)
|
||||
- **Rationale:** Failed nodes may still hold shard assignments; rebalance moves shards to healthy nodes
|
||||
- **Implementation:** handleNodeUpdate checks LastSeen and marks nodes failed; checkNodeHealth runs the periodic check (a sketch follows below)
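
A minimal sketch of the heartbeat-timeout check this policy describes, assuming a `NodeInfo` with the `Status` and `LastSeen` fields listed under Value Objects; the 90s constant matches the configuration in manager.go.

```go
// Sketch of passive failure detection: collect nodes whose last heartbeat
// is older than the timeout and flip their status.
const nodeFailureTimeout = 90 * time.Second

func markFailedNodes(nodes map[string]*NodeInfo, now time.Time) []string {
	var failed []string
	for id, n := range nodes {
		if n.Status == NodeStatusActive && now.Sub(n.LastSeen) > nodeFailureTimeout {
			n.Status = NodeStatusFailed
			failed = append(failed, id)
		}
	}
	// The caller publishes NodeFailed for each returned ID and, if it is
	// the leader, triggers RebalanceShards.
	return failed
}
```
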
|
||||
|
||||
### Policy 6: Shard Rebalancing Trigger Policy
|
||||
- **Trigger:** When NodeJoined, NodeLeft, or NodeFailed event
|
||||
- **Action:** If leader, execute RebalanceShards command
|
||||
- **Context:** Topology change → redistribute actors
|
||||
- **Rationale:** New node should get load; failed node's shards must be reassigned
|
||||
- **Implementation:** handleNodeUpdate calls triggerShardRebalancing if leader
|
||||
|
||||
### Policy 7: Shard Ownership Enforcement Policy
|
||||
- **Trigger:** When ShardAssigned event
|
||||
- **Action:** Update local ShardMap; nodes use this for actor routing
|
||||
- **Context:** All nodes must agree on shard ownership for routing consistency
|
||||
- **Rationale:** Single source of truth (published by leader) prevents routing conflicts
|
||||
- **Implementation:** ClusterManager receives ShardAssigned via NATS; updates shardMap
|
||||
|
||||
### Policy 8: Shard Coverage Policy
|
||||
- **Trigger:** Periodic check (every 5 min) or after NodeFailed
|
||||
- **Action:** Validate all shards in [0, ShardCount) are assigned; if any missing, trigger RebalanceShards
|
||||
- **Context:** Safety check to prevent shard orphaning
|
||||
- **Rationale:** Ensure no actor can be born on an unassigned shard
|
||||
- **Implementation:** rebalanceLoop calls triggerShardRebalancing with reason "periodic rebalance check" (a coverage-check sketch follows below)
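
The coverage check this policy performs can be sketched in a few lines; `shardCount` corresponds to DefaultNumShards and the assignment map shape follows the domain model, while the function itself is illustrative.

```go
// Sketch of the coverage check: every shard in [0, shardCount) must have
// at least one owner; a non-empty result should trigger RebalanceShards.
func orphanedShards(assignments map[int][]string, shardCount int) []int {
	var orphans []int
	for id := 0; id < shardCount; id++ {
		if len(assignments[id]) == 0 {
			orphans = append(orphans, id)
		}
	}
	return orphans
}
```
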
|
||||
|
||||
### Policy 9: Leader-Only Rebalancing Policy
|
||||
- **Trigger:** RebalanceShards command
|
||||
- **Action:** Validate nodeID is currentLeader before executing
|
||||
- **Context:** Only leader can initiate topology changes
|
||||
- **Rationale:** Prevent cascading rebalancing from multiple nodes; single coordinator
|
||||
- **Implementation:** triggerShardRebalancing checks IsLeader() at start
|
||||
|
||||
### Policy 10: Graceful Shutdown Policy
|
||||
- **Trigger:** NodeDiscovery.Stop() called
|
||||
- **Action:** Publish NodeLeft event
|
||||
- **Context:** Signal that this node is intentionally leaving
|
||||
- **Rationale:** Other nodes should rebalance shards away from this node; different from failure
|
||||
- **Implementation:** Stop() calls announceNode(NodeLeft) before shutting down
|
||||
|
||||
---
|
||||
|
||||
## Read Models
|
||||
|
||||
Read models project state for queries. They have no invariants and can be eventually consistent.
|
||||
|
||||
### Read Model 1: GetClusterTopology
|
||||
- **Purpose:** What nodes are currently in the cluster?
|
||||
- **Data:**
|
||||
- `nodes`: []NodeInfo (filtered to status=Active only)
|
||||
- `timestamp`: When snapshot was taken
|
||||
- **Source:** Cluster.nodes, filtered by status != Failed
|
||||
- **Updated:** After NodeJoined, NodeLeft, NodeFailed events
|
||||
- **Queryable by:** nodeID, status, capacity, load
|
||||
- **Eventual consistency:** Replica nodes lag leader by a few heartbeats
|
||||
|
||||
### Read Model 2: GetLeader
|
||||
- **Purpose:** Who is the current leader?
|
||||
- **Data:**
|
||||
- `leaderID`: Current leader node ID, or null if no leader
|
||||
- `term`: Leadership term number
|
||||
- `expiresAt`: When current leadership lease expires
|
||||
- `confidence`: "high" (just renewed), "medium" (recent), "low" (about to expire)
|
||||
- **Source:** LeadershipLease
|
||||
- **Updated:** After LeaderElected, LeadershipRenewed, LeadershipLost events
|
||||
- **Queryable by:** leaderID, term, expiration time
|
||||
- **Eventual consistency:** Non-leader nodes lag by up to 10s (lease timeout)
|
||||
|
||||
### Read Model 3: GetShardAssignments
|
||||
- **Purpose:** Where does each shard live?
|
||||
- **Data:**
|
||||
- `shardID`: Shard number
|
||||
- `primaryNode`: Node ID (shardMap.Shards[shardID][0])
|
||||
- `replicaNodes`: []NodeID (shardMap.Shards[shardID][1:])
|
||||
- `version`: ShardMap version (for optimistic concurrency)
|
||||
- **Source:** Cluster.shardMap
|
||||
- **Updated:** After ShardAssigned, ShardMigrated events
|
||||
- **Queryable by:** shardID, nodeID (which shards does node own?)
|
||||
- **Eventual consistency:** Replicas lag leader by one NATS publish; consistent within a term
|
||||
|
||||
### Read Model 4: GetNodeHealth
|
||||
- **Purpose:** Is a given node healthy?
|
||||
- **Data:**
|
||||
- `nodeID`: Node identifier
|
||||
- `status`: Active | Draining | Failed
|
||||
- `lastSeen`: Last heartbeat timestamp
|
||||
- `downForSeconds`: (now - lastSeen)
|
||||
- **Source:** Cluster.nodes[nodeID]
|
||||
- **Updated:** After NodeJoined, NodeUpdated, NodeFailed events
|
||||
- **Queryable by:** nodeID, status threshold (e.g., "give me all failed nodes")
|
||||
- **Eventual consistency:** Non-leader nodes lag by 30s (heartbeat interval)
|
||||
|
||||
### Read Model 5: GetRebalancingStatus
|
||||
- **Purpose:** Is rebalancing in progress? How many shards moved?
|
||||
- **Data:**
|
||||
- `isRebalancing`: Boolean
|
||||
- `startedAt`: Timestamp
|
||||
- `reason`: "node_joined" | "node_failed" | "periodic" | "manual"
|
||||
- `completedCount`: Number of shards finished
|
||||
- `totalCount`: Total shards to move
|
||||
- **Source:** RebalancingTriggered, ShardMigrated, RebalancingCompleted events
|
||||
- **Updated:** On rebalancing events
|
||||
- **Queryable by:** current status, started within N seconds
|
||||
- **Eventual consistency:** Replicas lag by one NATS publish
|
||||
|
||||
---
|
||||
|
||||
## Value Objects
|
||||
|
||||
### Value Object 1: NodeInfo
|
||||
Represents a physical node in the cluster; a Go struct sketch follows the field and validation lists below.
|
||||
|
||||
**Fields:**
|
||||
- `ID`: string - unique identifier
|
||||
- `Address`: string - IP or hostname
|
||||
- `Port`: int - NATS port
|
||||
- `Status`: NodeStatus enum (Active, Draining, Failed)
|
||||
- `Capacity`: float64 - max load capacity
|
||||
- `Load`: float64 - current load
|
||||
- `LastSeen`: time.Time - last heartbeat
|
||||
- `Timestamp`: time.Time - when created/updated
|
||||
- `Metadata`: map[string]string - arbitrary tags (region, version, etc.)
|
||||
- `IsLeader`: bool - is this the leader?
|
||||
- `VMCount`: int - how many actors on this node
|
||||
- `ShardIDs`: []int - which shards are assigned
|
||||
|
||||
**Equality:** Two NodeInfos are equal if all fields match (identity-based for clustering purposes, but immutable)
|
||||
|
||||
**Validation:**
|
||||
- ID non-empty
|
||||
- Capacity > 0
|
||||
- Status in {Active, Draining, Failed}
|
||||
- Port in valid range [1, 65535]
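
The field list above written out as a Go struct with the stated validation rules. This is a sketch of the shape only and assumes the `NodeStatus` constants and the `errors`/`time` imports; the real `cluster` types may differ in detail.

```go
// Sketch of NodeInfo as described above, with the listed validation rules.
type NodeInfo struct {
	ID        string
	Address   string
	Port      int
	Status    NodeStatus
	Capacity  float64
	Load      float64
	LastSeen  time.Time
	Timestamp time.Time
	Metadata  map[string]string
	IsLeader  bool
	VMCount   int
	ShardIDs  []int
}

// Validate applies the rules listed above.
func (n NodeInfo) Validate() error {
	switch {
	case n.ID == "":
		return errors.New("ID must be non-empty")
	case n.Capacity <= 0:
		return errors.New("Capacity must be > 0")
	case n.Port < 1 || n.Port > 65535:
		return errors.New("Port must be in [1, 65535]")
	}
	switch n.Status {
	case NodeStatusActive, NodeStatusDraining, NodeStatusFailed:
		return nil
	default:
		return errors.New("Status must be Active, Draining, or Failed")
	}
}
```
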
|
||||
|
||||
---
|
||||
|
||||
### Value Object 2: ShardMap
|
||||
Represents the current shard-to-node assignment snapshot.
|
||||
|
||||
**Fields:**
|
||||
- `Version`: uint64 - incremented on each change; used for optimistic concurrency
|
||||
- `Shards`: Map[ShardID → []NodeID] - shard to [primary, replica1, replica2, ...]
|
||||
- `Nodes`: Map[NodeID → NodeInfo] - snapshot of nodes known at assignment time
|
||||
- `UpdateTime`: time.Time - when created
|
||||
|
||||
**Equality:** Two ShardMaps are equal if Version and Shards are equal (Nodes is metadata)
|
||||
|
||||
**Validation:**
|
||||
- All shard IDs in [0, ShardCount)
|
||||
- All node IDs in Shards exist in Nodes
|
||||
- All nodes in Nodes have status=Active
|
||||
- Replication factor respected (1 ≤ len(Shards[sid]) ≤ ReplicationFactor)
|
||||
|
||||
**Immutability:** ShardMap is never mutated; rebalancing creates a new ShardMap
|
||||
|
||||
---
|
||||
|
||||
### Value Object 3: LeadershipLease
|
||||
Represents a leader's claim on coordination authority.
|
||||
|
||||
**Fields:**
|
||||
- `LeaderID`: string - node ID holding the lease
|
||||
- `Term`: uint64 - monotonically increasing term number
|
||||
- `ExpiresAt`: time.Time - when lease is no longer valid
|
||||
- `StartedAt`: time.Time - when leader was elected
|
||||
|
||||
**Equality:** Two leases are equal if LeaderID, Term, and ExpiresAt match
|
||||
|
||||
**Validation:**
|
||||
- LeaderID non-empty
|
||||
- Term ≥ 0
|
||||
- ExpiresAt > StartedAt
|
||||
- ExpiresAt - StartedAt == LeaderLeaseTimeout
|
||||
|
||||
**Lifecycle:**
|
||||
- Created: node wins election
|
||||
- Renewed: every 3s, ExpiresAt extended
|
||||
- Expired: if ExpiresAt < now and not renewed
|
||||
- Replaced: next term when new leader elected
|
||||
|
||||
---
|
||||
|
||||
### Value Object 4: Term
|
||||
Represents a leadership term (could be extracted for clarity).
|
||||
|
||||
**Fields:**
|
||||
- `Number`: uint64 - term counter
|
||||
|
||||
**Semantics:** Monotonically increasing; each new leader gets a higher term. Used to detect stale messages.
|
||||
|
||||
---
|
||||
|
||||
## Code Analysis
|
||||
|
||||
### Intended vs Actual: ClusterManager
|
||||
|
||||
**Intended (from Domain Model):**
|
||||
- Root aggregate owning Cluster topology
|
||||
- Enforces invariants: shard coverage, healthy node assignments, rebalancing triggers
|
||||
- Commands: JoinCluster, MarkNodeFailed, RebalanceShards
|
||||
- Events: NodeJoined, NodeFailed, ShardAssigned, ShardMigrated
|
||||
|
||||
**Actual (from /cluster/manager.go):**
|
||||
- Partially aggregate-like: owns `nodes`, `shardMap`, `hashRing`
|
||||
- Lacks explicit command methods: has `handleClusterMessage()` but not named commands like `JoinCluster()`
|
||||
- Lacks explicit event publishing: updates state but doesn't publish domain events
|
||||
- Invariant enforcement scattered: node failure detection in `handleNodeUpdate()`, but no central validation
|
||||
- Missing behavior: shard assignment logic in `ShardManager`, not in Cluster aggregate
|
||||
|
||||
**Misalignment:**
|
||||
1. **Anemic aggregate:** ClusterManager reads/writes state but doesn't enforce invariants or publish events
|
||||
2. **Responsibility split:** Cluster topology (Manager) vs shard assignment (ShardManager) vs leadership (LeaderElection) are not unified under one aggregate root
|
||||
3. **No explicit commands:** Node updates handled via generic message dispatcher, not domain-language commands
|
||||
4. **No event sourcing:** State changes don't produce events
|
||||
|
||||
**Gaps:**
|
||||
- No JoinCluster command handler
|
||||
- No MarkNodeFailed command handler (only handleNodeUpdate which detects failures)
|
||||
- No explicit ShardAssigned/ShardMigrated events
|
||||
- Rebalancing triggers exist (triggerShardRebalancing) but not as domain commands
|
||||
|
||||
---
|
||||
|
||||
### Intended vs Actual: LeaderElection
|
||||
|
||||
**Intended (from Domain Model):**
|
||||
- Root aggregate owning LeadershipLease invariant (single leader per term)
|
||||
- Commands: ElectLeader, RenewLeadership
|
||||
- Events: LeaderElected, LeadershipLost, LeadershipRenewed
|
||||
|
||||
**Actual (from /cluster/leader.go):**
|
||||
- Correctly implements lease-based election with NATS KV
|
||||
- Enforces single leader via atomic operations (create, update with revision)
|
||||
- Has implicit command pattern (tryBecomeLeader, renewLease, resignLeadership)
|
||||
- Has callbacks for leadership change, but no explicit event publishing
|
||||
|
||||
**Alignment:**
|
||||
- Atomic operations correctly enforce Invariant 1 (single leader)
|
||||
- Lease renewal every 3s enforces lease validity
|
||||
- Lease expiration detected via watcher
|
||||
- Leadership transitions (elected, lost) well-modeled
|
||||
|
||||
**Gaps:**
|
||||
- Events not explicitly published; callbacks used instead
|
||||
- No event sourcing (events should be recorded in event store, not just callbacks)
|
||||
- No term-based validation (could reject stale messages with old term)
|
||||
- Could be more explicit about LeaderElected event vs just callback
|
||||
|
||||
---
|
||||
|
||||
### Intended vs Actual: ConsistentHashRing
|
||||
|
||||
**Intended (from Domain Model):**
|
||||
- Used by ShardAssignment to compute which node owns which shard
|
||||
- Policy: shards assigned via consistent hashing
|
||||
- Minimizes reshuffling on node join/leave
|
||||
|
||||
**Actual (from /cluster/hashring.go):**
|
||||
- Correctly implements consistent hash ring with virtual nodes
|
||||
- AddNode/RemoveNode operations are clean
|
||||
- GetNode(key) returns responsible node; used for actor placement
|
||||
|
||||
**Alignment:**
|
||||
- Good separation of concerns (ring is utility, not aggregate)
|
||||
- Virtual nodes (150 per node) reduce reshuffling on node change
|
||||
- Immutable ring structure (recreated on changes)
|
||||
|
||||
**Gaps:**
|
||||
- Not actively used by ShardAssignment (ShardManager has own hash logic)
|
||||
- Could be used by RebalanceShards policy to compute initial assignments
|
||||
- Currently more of a utility than a policy
|
||||
|
||||
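As an illustration of the last two gaps, a hypothetical sketch of how a placement policy could reuse the existing ring to compute shard ownership; `AddNode`/`GetNode` follow the description above, but the exact signatures in `cluster/hashring.go` may differ.

```go
// Sketch only: derive shard ownership from the consistent hash ring.
func computeAssignments(ring *ConsistentHashRing, shardCount int) map[int][]string {
	assignments := make(map[int][]string, shardCount)
	for shardID := 0; shardID < shardCount; shardID++ {
		// The shard ID is the hash key; the ring returns the node whose
		// position covers where the key falls.
		owner := ring.GetNode(fmt.Sprintf("shard-%d", shardID))
		assignments[shardID] = []string{owner}
	}
	return assignments
}
```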
---
|
||||
|
||||
### Intended vs Actual: ShardManager
|
||||
|
||||
**Intended (from Domain Model):**
|
||||
- ShardAssignment aggregate managing shard-to-node mappings
|
||||
- Commands: AssignShard, RebalanceShards (via PlacementStrategy)
|
||||
- Enforces invariants: all shards assigned, only to healthy nodes
|
||||
- Emits ShardAssigned events
|
||||
|
||||
**Actual (from /cluster/shard.go):**
|
||||
- Owns ShardMap, but like ClusterManager, is more of a data holder than aggregate
|
||||
- Has methods: AssignShard, RebalanceShards (delegates to PlacementStrategy)
|
||||
- Lacks invariant validation (doesn't check if nodes are healthy)
|
||||
- Lacks event publishing
|
||||
|
||||
**Alignment:**
|
||||
- PlacementStrategy pattern allows different algorithms (good design)
|
||||
- ConsistentHashPlacement exists but is stubbed
|
||||
|
||||
**Gaps:**
|
||||
- ShardManager.RebalanceShards not integrated with ClusterManager's decision to rebalance
|
||||
- No event publishing on shard changes
|
||||
- Invariant validation needed: validate nodes in assignments are healthy
|
||||
|
||||
---
|
||||
|
||||
### Intended vs Actual: NodeDiscovery
|
||||
|
||||
**Intended (from Domain Model):**
|
||||
- Detects nodes via NATS heartbeats
|
||||
- Publishes NodeJoined, NodeUpdated, NodeLeft events via announceNode
|
||||
- Triggers policies (node failure detection, rebalancing)
|
||||
|
||||
**Actual (from /cluster/discovery.go):**
|
||||
- Heartbeats every 30s via announceNode
|
||||
- Subscribes to "aether.discovery" channel
|
||||
- Publishes NodeUpdate messages, not domain events
|
||||
|
||||
**Alignment:**
|
||||
- Heartbeat mechanism good; detected failure via 90s timeout in ClusterManager
|
||||
- Message-based communication works for event bus
|
||||
|
||||
**Gaps:**
|
||||
- NodeUpdate is not a domain event; should publish NodeJoined, NodeUpdated, NodeLeft as explicit events
|
||||
- Could be clearer about lifecycle: Start announces NodeJoined, Stop announces NodeLeft
|
||||
|
||||
---
|
||||
|
||||
### Intended vs Actual: DistributedVM
|
||||
|
||||
**Intended (from Domain Model):**
|
||||
- Orchestrates all cluster components (discovery, election, coordination, sharding)
|
||||
- Not itself an aggregate; more of a façade/orchestrator
|
||||
|
||||
**Actual (from /cluster/distributed.go):**
|
||||
- Correctly orchestrates: discovery + cluster manager + sharding + local runtime
|
||||
- DistributedVMRegistry provides VMRegistry interface to ClusterManager
|
||||
- Good separation: doesn't force topology decisions on runtime
|
||||
|
||||
**Alignment:**
|
||||
- Architecture clean; each component has clear responsibility
|
||||
- Decoupling via interfaces (Runtime, VirtualMachine, VMProvider) is good
|
||||
|
||||
**Gaps:**
|
||||
- No explicit orchestration logic (Start method appears incomplete; only the first 100 lines were reviewed in this analysis)
|
||||
- Could coordinate startup order more explicitly
|
||||
|
||||
---
|
||||
|
||||
## Refactoring Backlog
|
||||
|
||||
### Refactoring 1: Extract Cluster Aggregate from ClusterManager
|
||||
|
||||
**Current:** ClusterManager is anemic; only stores state
|
||||
**Target:** ClusterManager becomes true aggregate root enforcing invariants
|
||||
|
||||
**Steps:**
|
||||
1. Add explicit command methods to ClusterManager:
|
||||
- `JoinCluster(nodeInfo NodeInfo) error`
|
||||
- `MarkNodeFailed(nodeID string) error`
|
||||
- `AssignShards(shardMap ShardMap) error`
|
||||
- `RebalanceTopology(reason string) error`
|
||||
2. Each command:
|
||||
- Validates preconditions
|
||||
- Calls aggregate behavior (private methods)
|
||||
- Publishes events
|
||||
- Returns result
|
||||
3. Add event publishing:
|
||||
- Create EventPublisher interface in ClusterManager
|
||||
- Publish NodeJoined, NodeFailed, ShardAssigned, ShardMigrated events
|
||||
- Events captured in event store (optional, or via NATS pub/sub)
|
||||
|
||||
**Impact:** Medium - changes ClusterManager interface but not external APIs yet
|
||||
**Priority:** High - unblocks event-driven integration with other contexts
|
||||
|
||||
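**Example (sketch):** a minimal version of the EventPublisher wiring from step 3. The `ClusterEvent` interface, the `eventPublisher` field, and the `publishEvent` helper are illustrative assumptions; only the logger already exists on ClusterManager.

```go
// Sketch only: event publishing wiring for the Cluster aggregate.
type ClusterEvent interface {
	EventType() string
	OccurredAt() time.Time
}

type EventPublisher interface {
	Publish(event ClusterEvent) error
}

// publishEvent is the helper each command method calls after a successful
// state change; a nil publisher keeps events optional during the transition.
func (cm *ClusterManager) publishEvent(event ClusterEvent) {
	if cm.eventPublisher == nil {
		return
	}
	if err := cm.eventPublisher.Publish(event); err != nil {
		cm.logger.Printf("failed to publish %s: %v", event.EventType(), err)
	}
}
```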
---
|
||||
|
||||
### Refactoring 2: Extract ShardAssignment Commands from RebalanceShards
|
||||
|
||||
**Current:** ShardManager.RebalanceShards delegates to PlacementStrategy; no validation of healthy nodes
|
||||
**Target:** ShardAssignment commands validate invariants
|
||||
|
||||
**Steps:**
|
||||
1. Add to ShardManager:
|
||||
- `AssignShards(assignments map[int][]string, nodes map[string]*NodeInfo) error`
|
||||
- Validates: all nodes exist and are Active
|
||||
- Validates: all shards in [0, ShardCount) assigned
|
||||
- Validates: replication factor respected
|
||||
- `ValidateAssignments() error`
|
||||
2. Move shard validation from coordinator to ShardManager
|
||||
3. Publish ShardAssigned events on successful assignment
|
||||
4. Update ClusterManager to call ShardManager.AssignShards instead of directly mutating ShardMap
|
||||
|
||||
**Impact:** Medium - clarifies shard aggregate, adds validation
|
||||
**Priority:** High - prevents invalid shard assignments
|
||||
|
||||
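**Example (sketch):** the AssignShards command from step 1, assuming ShardManager holds `shardCount`, `replicationFactor`, a mutex, and the `publishEvent` helper from Refactoring 1; the `ShardAssigned` event type is also an assumption.

```go
// Sketch only: validate invariants before applying a new assignment.
func (sm *ShardManager) AssignShards(assignments map[int][]string, nodes map[string]*NodeInfo) error {
	sm.mutex.Lock()
	defer sm.mutex.Unlock()

	// Invariant: every shard in [0, ShardCount) has at least one owner.
	for shardID := 0; shardID < sm.shardCount; shardID++ {
		owners, ok := assignments[shardID]
		if !ok || len(owners) == 0 {
			return fmt.Errorf("shard %d has no owner", shardID)
		}
		// Invariant: replication factor respected.
		if len(owners) > sm.replicationFactor {
			return fmt.Errorf("shard %d has %d owners, max is %d", shardID, len(owners), sm.replicationFactor)
		}
		// Invariant: shards only on healthy (Active) nodes.
		for _, nodeID := range owners {
			node, exists := nodes[nodeID]
			if !exists || node.Status != NodeStatusActive {
				return fmt.Errorf("shard %d assigned to unhealthy node %s", shardID, nodeID)
			}
		}
	}

	// Apply as a new immutable ShardMap rather than mutating in place.
	sm.shardMap = &ShardMap{
		Version:    sm.shardMap.Version + 1,
		Shards:     assignments,
		Nodes:      nodes,
		UpdateTime: time.Now(),
	}

	sm.publishEvent(&ShardAssigned{Version: sm.shardMap.Version, Timestamp: time.Now()})
	return nil
}
```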
---
|
||||
|
||||
### Refactoring 3: Publish Domain Events from LeaderElection
|
||||
|
||||
**Current:** LeaderElection uses callbacks; no event sourcing
|
||||
**Target:** Explicit event publishing for leader changes
|
||||
|
||||
**Steps:**
|
||||
1. Add EventPublisher interface to LeaderElection
|
||||
2. In becomeLeader: publish LeaderElected event
|
||||
3. In loseLeadership: publish LeadershipLost event
|
||||
4. Optional: publish LeadershipRenewed on each renewal (for audit trail)
|
||||
5. Events include: leaderID, term, expiresAt, timestamp
|
||||
6. Consumers subscribe via NATS and react (no longer callbacks)
|
||||
|
||||
**Impact:** Medium - changes LeaderElection interface
|
||||
**Priority:** Medium - improves observability and enables event sourcing
|
||||
|
||||
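**Example (sketch):** publishing from the existing transitions in steps 2-3. The `publisher` field and the `Publisher` interface are assumptions, the transition method signatures may differ from the real code, and the current callbacks can stay in place during the transition.

```go
// Sketch only: explicit leadership events alongside the existing callbacks.
type Publisher interface {
	Publish(event any) error
}

type LeaderElected struct {
	LeaderID  string
	Term      uint64
	ExpiresAt time.Time
	Timestamp time.Time
}

type LeadershipLost struct {
	LeaderID  string
	Term      uint64
	Timestamp time.Time
}

func (le *LeaderElection) becomeLeader(lease LeadershipLease) {
	// ... existing state change and OnBecameLeader callback ...
	if le.publisher != nil {
		le.publisher.Publish(&LeaderElected{
			LeaderID:  lease.LeaderID,
			Term:      lease.Term,
			ExpiresAt: lease.ExpiresAt,
			Timestamp: time.Now(),
		})
	}
}

func (le *LeaderElection) loseLeadership(lease LeadershipLease) {
	// ... existing state change and callback ...
	if le.publisher != nil {
		le.publisher.Publish(&LeadershipLost{
			LeaderID:  lease.LeaderID,
			Term:      lease.Term,
			Timestamp: time.Now(),
		})
	}
}
```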
---
|
||||
|
||||
### Refactoring 4: Unify Node Failure Detection and Rebalancing
|
||||
|
||||
**Current:** Node failure detected in handleNodeUpdate (90s timeout) + periodic checkNodeHealth; rebalancing trigger spread across multiple methods
|
||||
**Target:** Explicit MarkNodeFailed command, single rebalancing trigger
|
||||
|
||||
**Steps:**
|
||||
1. Create explicit MarkNodeFailed command handler
|
||||
2. Move node failure detection logic to ClusterManager.markNodeFailed()
|
||||
3. Consolidate node failure checks (remove duplicate in checkNodeHealth)
|
||||
4. Trigger rebalancing only from MarkNodeFailed, not scattered
|
||||
5. Add RebalancingTriggered event before starting rebalance
|
||||
|
||||
**Impact:** Low - refactoring existing logic, not new behavior
|
||||
**Priority:** Medium - improves clarity
|
||||
|
||||
---
|
||||
|
||||
### Refactoring 5: Implement PlacementStrategy for Rebalancing
|
||||
|
||||
**Current:** ConsistentHashPlacement.RebalanceShards is stubbed
|
||||
**Target:** Real rebalancing logic using consistent hashing
|
||||
|
||||
**Steps:**
|
||||
1. Implement ConsistentHashPlacement.RebalanceShards:
|
||||
- Input: current ShardMap, updated nodes (may have added/removed)
|
||||
- Output: new ShardMap with shards redistributed via consistent hash
|
||||
- Minimize movement: use virtual nodes to keep most shards in place
|
||||
2. Add RebalancingStrategy interface if other strategies needed (e.g., load-aware)
|
||||
3. Test: verify adding/removing node only reshuffles ~1/N shards
|
||||
|
||||
**Impact:** Medium - core rebalancing logic, affects all topology changes
|
||||
**Priority:** High - currently rebalancing doesn't actually redistribute
|
||||
|
||||
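**Example (sketch):** a possible implementation of step 1 that reuses the consistent hash ring. `NewConsistentHashRing(150)` and the method signatures are assumptions based on the descriptions earlier in this document; replicas beyond the primary are omitted for brevity.

```go
// Sketch only: recompute ownership from a ring of healthy nodes.
func (p *ConsistentHashPlacement) RebalanceShards(current *ShardMap, nodes map[string]*NodeInfo) (*ShardMap, error) {
	// Build a ring containing only healthy nodes (150 virtual nodes each).
	ring := NewConsistentHashRing(150)
	for id, node := range nodes {
		if node.Status == NodeStatusActive {
			ring.AddNode(id)
		}
	}

	// Because the ring is consistent, most shards hash to the same node as
	// before, so only roughly 1/N of them actually move.
	next := &ShardMap{
		Version:    current.Version + 1,
		Shards:     make(map[int][]string, len(current.Shards)),
		Nodes:      nodes,
		UpdateTime: time.Now(),
	}
	for shardID := range current.Shards { // assumes current covers all shard IDs
		owner := ring.GetNode(fmt.Sprintf("shard-%d", shardID))
		next.Shards[shardID] = []string{owner}
	}
	return next, nil
}
```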
---
|
||||
|
||||
### Refactoring 6: Add Node Health Check Endpoint
|
||||
|
||||
**Current:** No way to query node health directly
|
||||
**Target:** Read model for GetNodeHealth
|
||||
|
||||
**Steps:**
|
||||
1. Add method to ClusterManager: `GetNodeHealth(nodeID string) NodeHealthStatus`
|
||||
2. Return: status, lastSeen, downForSeconds
|
||||
3. Expose via NATS request/reply (if distributed query needed)
|
||||
4. Test: verify timeout logic
|
||||
|
||||
**Impact:** Low - new query method, no state changes
|
||||
**Priority:** Low - nice to have for monitoring
|
||||
|
||||
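**Example (sketch):** the read model from steps 1-2; it only reads state. The struct shape and the use of the cluster mutex are assumptions.

```go
// Sketch only: a query-side view of a node's health.
type NodeHealthStatus struct {
	NodeID         string
	Status         NodeStatus
	LastSeen       time.Time
	DownForSeconds float64
}

func (cm *ClusterManager) GetNodeHealth(nodeID string) (NodeHealthStatus, error) {
	cm.mutex.Lock()
	defer cm.mutex.Unlock()

	node, ok := cm.nodes[nodeID]
	if !ok {
		return NodeHealthStatus{}, fmt.Errorf("unknown node %s", nodeID)
	}
	down := time.Since(node.LastSeen)
	if down < 0 {
		down = 0
	}
	return NodeHealthStatus{
		NodeID:         nodeID,
		Status:         node.Status,
		LastSeen:       node.LastSeen,
		DownForSeconds: down.Seconds(),
	}, nil
}
```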
---
|
||||
|
||||
### Refactoring 7: Add Shard Migration Tracking
|
||||
|
||||
**Current:** ShardMigrated event published, but no tracking of migration progress
|
||||
**Target:** ActorMigration status tracking and completion callback
|
||||
|
||||
**Steps:**
|
||||
1. Add MigrationTracker in cluster package
|
||||
2. On ShardMigrated event: create migration record (pending)
|
||||
3. Application reports migration progress (in_progress, completed, failed)
|
||||
4. On completion: remove from tracker
|
||||
5. Rebalancing can wait for migrations to complete before declaring rebalance done
|
||||
|
||||
**Impact:** High - affects how rebalancing coordinates with application
|
||||
**Priority:** Medium - improves robustness (don't rebalance while migrations in flight)
|
||||
|
||||
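**Example (sketch):** a minimal MigrationTracker for steps 1-5; the state names and the way ShardMigrated is observed are assumptions based on the description above.

```go
import "sync"

// Sketch only: track in-flight shard migrations so rebalancing can wait.
type MigrationState string

const (
	MigrationPending    MigrationState = "pending"
	MigrationInProgress MigrationState = "in_progress"
	MigrationCompleted  MigrationState = "completed"
	MigrationFailed     MigrationState = "failed"
)

type MigrationTracker struct {
	mu         sync.Mutex
	migrations map[int]MigrationState // shard ID → state
}

func NewMigrationTracker() *MigrationTracker {
	return &MigrationTracker{migrations: make(map[int]MigrationState)}
}

// OnShardMigrated records a pending migration when the event is observed.
func (t *MigrationTracker) OnShardMigrated(shardID int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.migrations[shardID] = MigrationPending
}

// Report lets the application update progress; completed migrations are
// removed so InFlight can gate the next rebalance.
func (t *MigrationTracker) Report(shardID int, state MigrationState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if state == MigrationCompleted {
		delete(t.migrations, shardID)
		return
	}
	t.migrations[shardID] = state
}

// InFlight reports whether any migrations are still pending or in progress.
func (t *MigrationTracker) InFlight() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.migrations) > 0
}
```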
---
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
**LeaderElection invariant tests:**
|
||||
- Only one node can successfully create "leader" key → test atomic create succeeds once, fails second time
|
||||
- Lease expiration triggers new election → create expired lease, verify election succeeds
|
||||
- Lease renewal extends expiry → create lease, renew, verify new expiry is ~10s from now
|
||||
- Stale leader can't renew → mark node failed, verify renewal fails
|
||||
|
||||
**Cluster topology invariant tests:**
|
||||
- NodeJoined adds to hashRing → call addNode, verify GetNode routes consistently
|
||||
- NodeFailed triggers rebalance → call markNodeFailed, verify rebalance triggered
|
||||
- Shard coverage validated → rebalance with 100 nodes, 1024 shards, verify all shards assigned
|
||||
- Only healthy nodes get shards → assign to failed node, verify rejected
|
||||
|
||||
**ShardManager invariant tests:**
|
||||
- AssignShards validates node health → assign to failed node, verify error
|
||||
- RebalanceShards covers all shards → simulate topology change, verify no orphans
|
||||
- Virtual nodes minimize reshuffling → add node, verify < 1/N shards move
|
||||
|
||||
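**Example (sketch):** a unit test for the lease lifecycle rules above, using the LeadershipLease value object sketched earlier; the NATS KV atomic-create behaviour itself is better exercised in the integration tests below.

```go
import (
	"testing"
	"time"
)

func TestLeaseRenewalExtendsExpiry(t *testing.T) {
	now := time.Now()
	timeout := 10 * time.Second

	lease := NewLeadershipLease("node-1", 1, timeout, now)
	if lease.Expired(now) {
		t.Fatal("fresh lease must not be expired")
	}

	// Renewal at t+3s pushes the expiry forward by the full timeout.
	renewed := lease.Renewed(timeout, now.Add(3*time.Second))
	if !renewed.ExpiresAt.Equal(now.Add(13 * time.Second)) {
		t.Fatalf("unexpected expiry: %v", renewed.ExpiresAt)
	}

	// Without renewal the lease expires after the timeout, which is what
	// allows a new election to proceed.
	if !lease.Expired(now.Add(11 * time.Second)) {
		t.Fatal("stale lease must be expired")
	}
}
```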
### Integration Tests
|
||||
|
||||
**Single leader election:**
|
||||
- Create 3 cluster nodes
|
||||
- Verify exactly one becomes leader
|
||||
- Stop leader
|
||||
- Verify new leader elected within 10s
|
||||
- Test: leadership term increments
|
||||
|
||||
**Node failure and recovery:**
|
||||
- Create 5-node cluster with 100 shards
|
||||
- Mark node-2 failed
|
||||
- Verify shards reassigned from node-2 to others
|
||||
- Verify node-3 doesn't become unreasonably overloaded
|
||||
- Restart node-2
|
||||
- Verify shards rebalanced back
|
||||
|
||||
**Graceful shutdown:**
|
||||
- Create 3-node cluster
|
||||
- Gracefully stop node-1 (announces NodeLeft)
|
||||
- Verify no 90s timeout; rebalancing happens immediately
|
||||
- Compare to failure case (90s delay)
|
||||
|
||||
**Split-brain recovery:**
|
||||
- Create 3-node cluster: [A(leader), B, C]
|
||||
- Partition network: A isolated, B+C connected
|
||||
- Verify A loses leadership after 10s
|
||||
- Verify B or C becomes leader
|
||||
- Heal partition
|
||||
- Verify single leader, no conflicts (A didn't try to be leader again)
|
||||
|
||||
**Rebalancing under load:**
|
||||
- Create 5-node cluster, 100 shards, with actors running
|
||||
- Add node-6
|
||||
- Verify actors migrated off other nodes to node-6
|
||||
- No actors are orphaned (all still reachable)
|
||||
- Measure: reshuffled < 1/5 of shards
|
||||
|
||||
### Chaos Testing
|
||||
|
||||
- Leader failure mid-rebalance → verify rebalancing resumed by new leader
|
||||
- Network partition (leader isolated) → verify quorum (or lease) ensures no split-brain
|
||||
- Cascading failures → 5 nodes, fail 3 at once, verify cluster stabilizes
|
||||
- High churn → nodes join/leave rapidly, verify topology converges
|
||||
|
||||
---
|
||||
|
||||
## Boundary Conditions and Limitations
|
||||
|
||||
### Design Decisions
|
||||
|
||||
**Why lease-based election instead of Raft?**
|
||||
- Simpler to implement and reason about
|
||||
- Detect failure in 10s (acceptable for coordination)
|
||||
- Risk: split-brain if a network partition persists > 10s and both partitions contain nodes (mitigation: the leader must keep renewing via NATS KV, and only the partition that retains the NATS connection can renew, so an isolated leader loses its lease)
|
||||
|
||||
**Why leader-only rebalancing?**
|
||||
- Prevent cascading rebalancing decisions
|
||||
- Single source of truth (leader decides topology)
|
||||
- Risk: leader becomes a bottleneck if rebalancing is expensive (mitigation: the leader delegates the computation to a PlacementStrategy rather than computing assignments inline)
|
||||
|
||||
**Why consistent hashing instead of load-balancing?**
|
||||
- Minimize shard movement on topology change (good for actor locality)
|
||||
- Deterministic without central state (nodes can independently compute assignments)
|
||||
- Risk: load imbalance if actors heavily skewed (mitigation: application can use custom PlacementStrategy)
|
||||
|
||||
**Why 90s failure detection timeout?**
|
||||
- 3 heartbeats missed (30s * 3) before declaring failure
|
||||
- Allows for some network jitter without false positives
|
||||
- Risk: slow failure detection (mitigation: application can force MarkNodeFailed if it detects failure faster)
|
||||
|
||||
### Assumptions
|
||||
|
||||
- **NATS cluster is available:** If NATS is down, cluster can't communicate (no failover without NATS)
|
||||
- **Clocks are reasonably synchronized:** Lease expiration depends on wall clock; major clock skew can break election
|
||||
- **Network partitions are rare:** Split-brain only possible if partition > 10s and leader isolated
|
||||
- **Rebalancing is not time-critical:** 5-min periodic check is default; no SLA on shard assignment latency
|
||||
|
||||
### Known Gaps
|
||||
|
||||
1. **No quorum-based election:** Single leader with lease; could add quorum for stronger consistency (Raft-like)
|
||||
2. **No actor migration semantics:** Who actually moves actors? Cluster signals ShardMigrated, but application must handle
|
||||
3. **No topology versioning:** ShardMap has version, but no way to detect if a node has an outdated topology
|
||||
4. **No leader handoff during rebalancing:** If leader fails mid-rebalance, new leader might redo already-started migrations
|
||||
5. **No split-brain detection:** Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but cluster doesn't enforce it)
|
||||
|
||||
---
|
||||
|
||||
## Alignment with Product Vision
|
||||
|
||||
**Primitives Over Frameworks:**
|
||||
- Cluster Coordination provides primitives (leader election, shard assignment), not a complete framework
|
||||
- Application owns actor migration strategy (via ShardManager PlacementStrategy)
|
||||
- Application owns failure response (can custom-implement node monitoring)
|
||||
|
||||
**NATS-Native:**
|
||||
- Leader election uses NATS KV for atomic operations
|
||||
- Node discovery uses NATS pub/sub for heartbeats
|
||||
- Shard topology can be published via NATS events
|
||||
|
||||
**Event-Sourced:**
|
||||
- All topology changes produce events (NodeJoined, NodeFailed, ShardAssigned, ShardMigrated)
|
||||
- Events enable audit trail and replay (who owns which shard when?)
|
||||
|
||||
**Resource Conscious:**
|
||||
- Minimal overhead: consistent hashing avoids per-node state explosion
|
||||
- Lease-based election lighter than Raft (no log replication)
|
||||
- Virtual nodes (150 per node) keep ring overhead small enough for modest hardware
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- **Lease-based election:** Inspired by Chubby, Google's lock service
|
||||
- **Consistent hashing:** Karger et al., "Consistent Hashing and Random Trees"
|
||||
- **Virtual nodes:** Reduces reshuffling on topology change (Dynamo, Cassandra pattern)
|
||||
- **NATS KV:** Used for atomicity; alternatives: etcd, Consul (but less NATS-native)
|
||||
|
||||
376
.product-strategy/cluster/EXECUTIVE_SUMMARY.md
Normal file
376
.product-strategy/cluster/EXECUTIVE_SUMMARY.md
Normal file
@@ -0,0 +1,376 @@
|
||||
# Cluster Coordination: Domain Model Executive Summary
|
||||
|
||||
## Overview
|
||||
|
||||
I have completed a comprehensive Domain-Driven Design (DDD) analysis of the **Cluster Coordination** bounded context in Aether. This analysis identifies the core business invariants, models the domain as aggregates/commands/events, compares the intended model against the current implementation, and provides a prioritized refactoring roadmap.
|
||||
|
||||
**Key Finding:** The Cluster Coordination context has good architectural foundations (LeaderElection, ConsistentHashRing, NodeDiscovery) but lacks proper DDD patterns (explicit commands, domain events, invariant validation). The refactoring is medium effort with high impact on event-driven integration and observability.
|
||||
|
||||
---
|
||||
|
||||
## Five Core Invariants
|
||||
|
||||
These are the non-negotiable business rules that must never break:
|
||||
|
||||
1. **Single Leader Per Term** - At most one node is leader; enforced via NATS KV atomic operations
|
||||
2. **All Active Shards Have Owners** - Every shard ID [0, 1024) must be assigned to ≥1 healthy node
|
||||
3. **Shards Only on Healthy Nodes** - A shard can only be assigned to nodes in Active status
|
||||
4. **Assignments Stable During Lease** - Shard topology doesn't arbitrarily change; only rebalances on topology changes
|
||||
5. **Leader Is Active Node** - If LeaderID is set, that node must be in Cluster.nodes with status=Active
|
||||
|
||||
---
|
||||
|
||||
## Three Root Aggregates
|
||||
|
||||
### Cluster (Root Aggregate)
|
||||
Owns node topology, shard assignments, and rebalancing orchestration.
|
||||
|
||||
**Key Responsibility:** Maintain consistency of cluster topology; only leader can assign shards
|
||||
|
||||
**Commands:** JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards
|
||||
|
||||
**Events:** NodeJoined, NodeFailed, NodeLeft, ShardAssigned, ShardMigrated, RebalancingTriggered
|
||||
|
||||
### LeadershipLease (Root Aggregate)
|
||||
Owns the leadership claim and ensures single leader per term via lease-based election.
|
||||
|
||||
**Key Responsibility:** Maintain exactly one leader; detect failure via lease expiration
|
||||
|
||||
**Commands:** ElectLeader, RenewLeadership
|
||||
|
||||
**Events:** LeaderElected, LeadershipRenewed, LeadershipLost
|
||||
|
||||
### ShardAssignment (Root Aggregate)
|
||||
Owns shard-to-node mappings and validates assignments respect invariants.
|
||||
|
||||
**Key Responsibility:** Track which shards live on which nodes; validate healthy nodes only
|
||||
|
||||
**Commands:** AssignShard, RebalanceFromTopology
|
||||
|
||||
**Events:** ShardAssigned, ShardMigrated
|
||||
|
||||
---
|
||||
|
||||
## Code Analysis: What's Working & What Isn't
|
||||
|
||||
### What Works Well (✓)
|
||||
- **LeaderElection** - Correctly implements lease-based election with NATS KV; enforces Invariant 1
|
||||
- **ConsistentHashRing** - Proper consistent hashing with virtual nodes; minimizes shard reshuffling
|
||||
- **NodeDiscovery** - Good heartbeat mechanism (30s interval) for membership discovery
|
||||
- **Architecture** - Interfaces (VMRegistry, Runtime) properly decouple cluster from runtime
|
||||
|
||||
### What Needs Work (✗)
|
||||
1. **Anemic aggregates** - ClusterManager, ShardManager are data holders, not behavior-enforcing aggregates
|
||||
2. **No domain events** - Topology changes don't publish events; impossible to audit or integrate with other contexts
|
||||
3. **Responsibility scattered** - Invariant validation in multiple places (handleNodeUpdate, checkNodeHealth)
|
||||
4. **Rebalancing stubbed** - ConsistentHashPlacement.RebalanceShards returns unchanged map; doesn't actually redistribute shards
|
||||
5. **Implicit commands** - Node updates via generic message handlers instead of explicit domain commands
|
||||
6. **Leadership uses callbacks** - LeaderElection publishes via callbacks instead of domain events
|
||||
|
||||
**Example Gap:** When a node joins, the current code:
|
||||
```go
|
||||
cm.nodes[update.Node.ID] = update.Node // Silent update
|
||||
cm.hashRing.AddNode(update.Node.ID) // No event
|
||||
// No way for other contexts to learn "node-5 joined"
|
||||
```
|
||||
|
||||
Should be:
|
||||
```go
|
||||
cm.JoinCluster(nodeInfo) // Explicit command
|
||||
// Publishes: NodeJoined event
|
||||
// Consumed by: Monitoring, Audit, Actor Runtime contexts
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Refactoring Impact & Effort
|
||||
|
||||
### Priority Ranking
|
||||
|
||||
**High Priority (Blocks Event-Driven Integration)**
|
||||
1. Extract Cluster commands with invariant validation (Medium effort)
|
||||
2. Implement real rebalancing strategy (Medium effort)
|
||||
3. Publish domain events (Medium effort)
|
||||
|
||||
**Medium Priority (Improves Clarity)**
|
||||
4. Extract MarkNodeFailed command (Low effort)
|
||||
5. Centralize shard invariant validation (Low effort)
|
||||
6. Add shard migration tracking (High effort, improves robustness)
|
||||
7. Publish LeaderElection events (Low effort, improves observability)
|
||||
|
||||
**Total Effort:** ~4-6 weeks (2-3 dev sprints)
|
||||
|
||||
### Timeline
|
||||
- **Phase 1 (Week 1):** Extract explicit commands (JoinCluster, MarkNodeFailed)
|
||||
- **Phase 2 (Week 2):** Publish domain events (NodeJoined, ShardAssigned, ShardMigrated)
|
||||
- **Phase 3 (Week 3):** Implement real rebalancing (ConsistentHashPlacement)
|
||||
- **Phase 4 (Week 4):** Centralize invariant validation (ShardAssignment)
|
||||
|
||||
### Success Metrics
|
||||
After Phase 1:
|
||||
- ✓ ClusterManager has explicit command methods
|
||||
- ✓ Commands validate preconditions
|
||||
- ✓ Commands trigger events
|
||||
|
||||
After Phase 2:
|
||||
- ✓ All topology changes publish events to NATS
|
||||
- ✓ Other contexts can subscribe and react
|
||||
- ✓ Full audit trail of topology decisions
|
||||
|
||||
After Phase 3:
|
||||
- ✓ Adding node → shards actually redistribute to it
|
||||
- ✓ Removing node → shards reassigned elsewhere
|
||||
- ✓ No orphaned shards
|
||||
|
||||
After Phase 4:
|
||||
- ✓ Invalid assignments rejected (unhealthy node, orphaned shard)
|
||||
- ✓ Invariants validated before applying changes
|
||||
- ✓ Cluster state always consistent
|
||||
|
||||
---
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Why Lease-Based Election Instead of Raft?
|
||||
**Chosen:** Lease-based (NATS KV with atomic operations)
|
||||
|
||||
**Rationale:**
|
||||
- Simpler to reason about and implement
|
||||
- Detect failure in 10s (acceptable for coordination)
|
||||
- Lower overhead
|
||||
- Good enough for a library (not a mission-critical system)
|
||||
|
||||
**Trade-off:** Risk of split-brain if partition persists >10s and both sides have NATS access (mitigated by atomic operations and term incrementing)
|
||||
|
||||
### Why Consistent Hashing for Shard Assignment?
|
||||
**Chosen:** Consistent hashing with virtual nodes (150 per node)
|
||||
|
||||
**Rationale:**
|
||||
- Minimize shard movement on topology change (crucial for actor locality)
|
||||
- Deterministic without central state (nodes can independently compute assignments)
|
||||
- Well-proven in distributed systems (Dynamo, Cassandra)
|
||||
|
||||
**Trade-off:** May not achieve perfect load balance (mitigated by allowing custom PlacementStrategy)
|
||||
|
||||
### Why Leader-Only Rebalancing?
|
||||
**Chosen:** Only leader can initiate shard rebalancing
|
||||
|
||||
**Rationale:**
|
||||
- Prevent cascading rebalancing decisions from multiple nodes
|
||||
- Single source of truth for topology
|
||||
- Simplifies invariant enforcement
|
||||
|
||||
**Trade-off:** Leader is a bottleneck if rebalancing is expensive (mitigated by the leader delegating the computation to a PlacementStrategy)
|
||||
|
||||
---
|
||||
|
||||
## Key Policies (Automated Reactions)
|
||||
|
||||
The cluster enforces these policies to maintain invariants:
|
||||
|
||||
| Policy | Trigger | Action | Rationale |
|
||||
|--------|---------|--------|-----------|
|
||||
| Single Leader | LeadershipLost | ElectLeader | Ensure leadership is re-established |
|
||||
| Lease Renewal | Every 3s | RenewLeadership | Detect leader failure after 10s |
|
||||
| Node Failure Detection | Every 30s | Check LastSeen; if >90s, MarkNodeFailed | Detect crash/network partition |
|
||||
| Rebalancing Trigger | NodeJoined/NodeFailed | RebalanceShards (if leader) | Redistribute load on topology change |
|
||||
| Shard Coverage | Periodic + after failures | Validate all shards assigned | Prevent shard orphaning |
|
||||
| Graceful Shutdown | NodeDiscovery.Stop() | Announce NodeLeft | Signal intentional leave (no 90s timeout) |
|
||||
|
||||
---
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- Commands validate invariants ✓
|
||||
- Events publish correctly ✓
|
||||
- Value objects enforce constraints ✓
|
||||
- Strategies compute assignments ✓
|
||||
|
||||
### Integration Tests
|
||||
- Single leader election (3 nodes) ✓
|
||||
- Leader failure → new leader within 10s ✓
|
||||
- Node join → shards redistributed ✓
|
||||
- Node failure → shards reassigned ✓
|
||||
- Graceful shutdown → no false failures ✓
|
||||
|
||||
### Chaos Tests
|
||||
- Leader fails mid-rebalance → recovers ✓
|
||||
- Network partition → no split-brain ✓
|
||||
- Cascading failures → stabilizes ✓
|
||||
- High churn → topology converges ✓
|
||||
|
||||
---
|
||||
|
||||
## Observability & Monitoring
|
||||
|
||||
### Key Metrics
|
||||
```
|
||||
# Topology
|
||||
cluster.nodes.count [active|draining|failed]
|
||||
cluster.shards.assigned [0, 1024]
|
||||
cluster.shards.orphaned [0, 1024] # RED if > 0
|
||||
|
||||
# Leadership
|
||||
cluster.leader.is_leader [0|1]
|
||||
cluster.leader.term
|
||||
cluster.leader.lease_expires_in_seconds
|
||||
|
||||
# Rebalancing
|
||||
cluster.rebalancing.triggered [reason]
|
||||
cluster.rebalancing.active [0|1]
|
||||
cluster.rebalancing.completed [shards_moved]
|
||||
|
||||
# Node Health
|
||||
cluster.node.heartbeat_latency_ms [per node]
|
||||
cluster.node.load [per node]
|
||||
cluster.node.vm_count [per node]
|
||||
```
|
||||
|
||||
### Alerts
|
||||
- Leader heartbeat missing > 5s → election stuck
|
||||
- Rebalancing > 5min → something wrong
|
||||
- Orphaned shards > 0 → CRITICAL (invariant violation)
|
||||
- Node failure > 50% → investigate
|
||||
|
||||
---
|
||||
|
||||
## Integration with Other Contexts
|
||||
|
||||
Once Cluster Coordination publishes domain events, other contexts can react:
|
||||
|
||||
### Actor Runtime Context
|
||||
**Subscribes to:** ShardMigrated event
|
||||
**Action:** Migrate actors from old node to new node
|
||||
**Why:** When shards move, actors must follow
|
||||
|
||||
### Monitoring Context
|
||||
**Subscribes to:** NodeJoined, NodeFailed, LeaderElected
|
||||
**Action:** Update cluster health dashboard
|
||||
**Why:** Operators need visibility into topology
|
||||
|
||||
### Audit Context
|
||||
**Subscribes to:** NodeJoined, NodeFailed, ShardAssigned, LeaderElected
|
||||
**Action:** Record topology change log
|
||||
**Why:** Compliance, debugging, replaying state
|
||||
|
||||
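**Example (sketch):** how the Actor Runtime context could subscribe once events flow over NATS. The subject name `aether.cluster.events.shard.migrated` and the event payload are assumptions; the actual wire format will be defined by the refactoring work.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// ShardMigrated is an assumed payload shape for the event.
type ShardMigrated struct {
	ShardID  int    `json:"shard_id"`
	FromNode string `json:"from_node"`
	ToNode   string `json:"to_node"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}

	_, err = nc.Subscribe("aether.cluster.events.shard.migrated", func(msg *nats.Msg) {
		var ev ShardMigrated
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Printf("bad event payload: %v", err)
			return
		}
		// The cluster only announces that ownership changed; moving the
		// actors is the application's responsibility.
		log.Printf("shard %d moved from %s to %s; migrating actors", ev.ShardID, ev.FromNode, ev.ToNode)
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // block; a real service ties this to its lifecycle
}
```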
---
|
||||
|
||||
## Known Limitations & Gaps
|
||||
|
||||
### Current Limitations
|
||||
1. **No quorum-based election** - Single leader with lease; could add quorum for stronger consistency
|
||||
2. **No actor migration semantics** - Cluster signals ShardMigrated, but application must implement migration
|
||||
3. **No topology versioning** - ShardMap.Version exists but not enforced for consistency
|
||||
4. **No leader handoff** - If leader fails mid-rebalance, new leader may redo migrations
|
||||
5. **No split-brain detection** - Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but system doesn't validate)
|
||||
|
||||
### Acceptable for Now
|
||||
- **Eventual consistency on topology** - Non-leaders lag by ~100ms (acceptable for routing)
|
||||
- **90s failure detection** - Allows for network jitter; can be accelerated by application
|
||||
- **No strong consistency** - Leadership is strongly consistent (atomic KV); topology is eventually consistent (NATS pub/sub)
|
||||
|
||||
---
|
||||
|
||||
## Deliverables
|
||||
|
||||
Five comprehensive documents have been created in `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/`:
|
||||
|
||||
1. **INDEX.md** (11 KB) - Navigation guide for all documents
|
||||
2. **DOMAIN_MODEL.md** (43 KB) - Complete tactical DDD model with invariants, aggregates, commands, events, policies
|
||||
3. **REFACTORING_SUMMARY.md** (16 KB) - Gap analysis and prioritized 4-phase implementation plan
|
||||
4. **ARCHITECTURE.md** (37 KB) - Visual reference with diagrams, decision trees, state machines, failure scenarios
|
||||
5. **PATTERNS.md** (30 KB) - Side-by-side code examples showing current vs intended implementations
|
||||
|
||||
**Total:** ~140 KB of documentation with detailed guidance for implementation
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
### Immediate (This Sprint)
|
||||
1. Review DOMAIN_MODEL.md with team (1 hour meeting)
|
||||
2. Confirm invariants are correct (discussion)
|
||||
3. Agree on Phase 1 priorities (which commands first?)
|
||||
|
||||
### Short-Term (Next Sprint)
|
||||
1. Implement Phase 1: Extract explicit commands (JoinCluster, MarkNodeFailed)
|
||||
2. Add unit tests for commands
|
||||
3. Code review against PATTERNS.md examples
|
||||
|
||||
### Medium-Term (Following Sprints)
|
||||
1. Phase 2: Publish domain events
|
||||
2. Phase 3: Implement real rebalancing
|
||||
3. Phase 4: Centralize invariant validation
|
||||
|
||||
### Integration
|
||||
1. Once events are published, other contexts (Actor Runtime, Monitoring) can subscribe
|
||||
2. Enables proper event-driven architecture
|
||||
3. Full audit trail becomes available
|
||||
|
||||
---
|
||||
|
||||
## Questions & Discussion Points
|
||||
|
||||
1. **Are the 5 invariants correct?** Do we have all the non-negotiable rules captured?
|
||||
2. **Are the aggregate boundaries clear?** Should Cluster own ShardAssignment, or is it independent?
|
||||
3. **Is the 4-phase plan realistic?** Do we have capacity? Should we combine phases?
|
||||
4. **Which contexts will consume events?** Who needs NodeJoined? ShardMigrated? LeaderElected?
|
||||
5. **Do we need stronger consistency?** Should we add quorum-based election? Or is lease-based sufficient?
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The Cluster Coordination context has solid foundations but needs DDD patterns to reach its full potential:
|
||||
|
||||
- **Current state:** Functional but opaque (hard to audit, hard to integrate, hard to test)
|
||||
- **Intended state:** Event-driven, auditable, testable, properly aggregated (medium effort)
|
||||
- **Impact:** Enables event-sourced architecture, cross-context communication, observability
|
||||
|
||||
The refactoring is realistic and phased, allowing incremental value delivery. Phase 1 alone (explicit commands) provides immediate clarity. Phase 2 (events) unblocks other contexts.
|
||||
|
||||
**Recommendation:** Start with Phase 1 (Week 1) to validate the DDD approach. If the team finds value, continue to Phase 2-4. If not, we have clearer domain models for reference.
|
||||
|
||||
---
|
||||
|
||||
## Document References
|
||||
|
||||
| Document | Purpose | Best For | Size |
|
||||
|----------|---------|----------|------|
|
||||
| [INDEX.md](./INDEX.md) | Navigation guide | Quick start, finding what you need | 11 KB |
|
||||
| [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) | Complete DDD model | Understanding the domain, design review | 43 KB |
|
||||
| [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) | Implementation plan | Planning work, estimating effort | 16 KB |
|
||||
| [ARCHITECTURE.md](./ARCHITECTURE.md) | System design & diagrams | Understanding behavior, debugging, tuning | 37 KB |
|
||||
| [PATTERNS.md](./PATTERNS.md) | Code examples | Writing the refactoring code | 30 KB |
|
||||
|
||||
**Start:** [INDEX.md](./INDEX.md)
|
||||
|
||||
**For implementation:** [PATTERNS.md](./PATTERNS.md)
|
||||
|
||||
**For design review:** [DOMAIN_MODEL.md](./DOMAIN_MODEL.md)
|
||||
|
||||
**For planning:** [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md)
|
||||
|
||||
---
|
||||
|
||||
## About This Analysis
|
||||
|
||||
This domain model was created using systematic Domain-Driven Design analysis:
|
||||
|
||||
1. **Identified invariants first** - What business rules must never break?
|
||||
2. **Modeled aggregates around invariants** - Which entities enforce which rules?
|
||||
3. **Designed commands & events** - What intents and facts describe state changes?
|
||||
4. **Compared with existing code** - What's intended vs actual?
|
||||
5. **Prioritized refactoring** - What to fix first, second, third?
|
||||
|
||||
The approach follows Eric Evans' Domain-Driven Design (2003) and tactical patterns like aggregates, value objects, and event sourcing.
|
||||
|
||||
---
|
||||
|
||||
**Created:** January 12, 2026
|
||||
|
||||
**By:** Domain Modeling Analysis (Claude)
|
||||
|
||||
**For:** Aether Project - Cluster Coordination Bounded Context
|
||||
|
||||
352
.product-strategy/cluster/INDEX.md
Normal file
352
.product-strategy/cluster/INDEX.md
Normal file
@@ -0,0 +1,352 @@
|
||||
# Cluster Coordination: Domain Model Index
|
||||
|
||||
This directory contains a complete Domain-Driven Design model for the Cluster Coordination bounded context in Aether. Use this index to navigate the documentation.
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
**Start here if you're new to this analysis:**
|
||||
|
||||
1. Read [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) **Summary** section (1-2 min)
|
||||
2. Skim the **Invariants** section to understand the constraints (2 min)
|
||||
3. Read [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) **Overview: Code vs Domain Model** (5 min)
|
||||
4. Choose your next step based on your role (see below)
|
||||
|
||||
---
|
||||
|
||||
## Documents Overview
|
||||
|
||||
### [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Comprehensive DDD Model
|
||||
**What:** Complete tactical DDD model with aggregates, commands, events, policies, read models
|
||||
|
||||
**Contains:**
|
||||
- Cluster Coordination context summary
|
||||
- 5 core invariants (single leader, shard coverage, etc.)
|
||||
- 3 root aggregates: Cluster, LeadershipLease, ShardAssignment
|
||||
- 6 commands: JoinCluster, ElectLeader, MarkNodeFailed, etc.
|
||||
- 11 events: NodeJoined, LeaderElected, ShardMigrated, etc.
|
||||
- 10 policies: Single Leader Policy, Lease Renewal Policy, etc.
|
||||
- 5 read models: GetClusterTopology, GetLeader, GetShardAssignments, etc.
|
||||
- 4 value objects: NodeInfo, ShardMap, LeadershipLease, Term
|
||||
- Code analysis comparing intended vs actual implementation
|
||||
- 7 refactoring issues with impact assessment
|
||||
- Testing strategy (unit, integration, chaos tests)
|
||||
- Boundary conditions and limitations
|
||||
- Alignment with product vision
|
||||
|
||||
**Best for:** Understanding the complete domain model, identifying what needs to change
|
||||
|
||||
**Time:** 30-40 minutes for thorough read
|
||||
|
||||
---
|
||||
|
||||
### [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation Roadmap
|
||||
**What:** Prioritized refactoring plan with 4-phase implementation strategy
|
||||
|
||||
**Contains:**
|
||||
- Current state vs intended state (what's working, what's broken)
|
||||
- Gap analysis (6 major gaps identified)
|
||||
- Priority matrix (High/Medium/Low priority issues)
|
||||
- 4-phase refactoring plan:
|
||||
- Phase 1: Extract cluster commands (Week 1)
|
||||
- Phase 2: Publish domain events (Week 2)
|
||||
- Phase 3: Implement real rebalancing (Week 3)
|
||||
- Phase 4: Unify shard invariants (Week 4)
|
||||
- Code examples for each phase
|
||||
- Testing checklist
|
||||
- Success metrics
|
||||
- Integration with other contexts
|
||||
|
||||
**Best for:** Planning implementation, deciding what to do first, estimating effort
|
||||
|
||||
**Time:** 20-30 minutes for full review
|
||||
|
||||
---
|
||||
|
||||
### [ARCHITECTURE.md](./ARCHITECTURE.md) - Visual Reference & Decision Trees
|
||||
**What:** Diagrams, flowcharts, and decision trees for understanding cluster behavior
|
||||
|
||||
**Contains:**
|
||||
- High-level architecture diagram
|
||||
- Aggregate boundaries diagram
|
||||
- 3 command flow diagrams with decision points
|
||||
- 3 decision trees (Is node healthy? Should rebalance? Can assign shard?)
|
||||
- State transition diagrams (cluster, node, leadership)
|
||||
- Concurrency model and thread safety explanation
|
||||
- Event sequences with timelines
|
||||
- Configuration parameters and tuning guide
|
||||
- Failure scenarios & recovery procedures
|
||||
- Monitoring & observability metrics
|
||||
- Alerts and SLOs
|
||||
|
||||
**Best for:** Understanding how the system works, debugging issues, planning changes
|
||||
|
||||
**Time:** 20-30 minutes; skim decision trees as needed
|
||||
|
||||
---
|
||||
|
||||
### [PATTERNS.md](./PATTERNS.md) - Code Patterns & Examples
|
||||
**What:** Side-by-side code comparisons showing how to evolve the implementation
|
||||
|
||||
**Contains:**
|
||||
- 6 refactoring patterns with current vs intended code:
|
||||
1. Commands vs Message Handlers
|
||||
2. Value Objects vs Primitives
|
||||
3. Event Publishing (no events → explicit events)
|
||||
4. Invariant Validation (scattered → centralized)
|
||||
5. Rebalancing Strategy (stubbed → real implementation)
|
||||
6. Testing Aggregates (hard to test → testable with mocks)
|
||||
- Full code examples for each pattern
|
||||
- Benefits of each approach
|
||||
- Mock implementations for testing
|
||||
|
||||
**Best for:** Developers writing the refactoring code, understanding specific patterns
|
||||
|
||||
**Time:** 30-40 minutes to read all examples
|
||||
|
||||
---
|
||||
|
||||
## Navigation by Role
|
||||
|
||||
### Product Manager / Tech Lead
|
||||
**Goal:** Understand what needs to change and why
|
||||
|
||||
1. Read REFACTORING_SUMMARY.md **Overview** (5 min)
|
||||
2. Read REFACTORING_SUMMARY.md **Refactoring Priority Matrix** (3 min)
|
||||
3. Read REFACTORING_SUMMARY.md **Refactoring Plan** - Phase 1 only (5 min)
|
||||
4. Decide: Which phases to commit to? Which timeline?
|
||||
|
||||
**Time:** 15 minutes
|
||||
|
||||
---
|
||||
|
||||
### Developer (Implementing Refactoring)
|
||||
**Goal:** Understand how to write the code
|
||||
|
||||
1. Skim DOMAIN_MODEL.md **Summary** (2 min)
|
||||
2. Read DOMAIN_MODEL.md **Invariants** (5 min) - what must never break?
|
||||
3. Read DOMAIN_MODEL.md **Aggregates** (5 min) - who owns what?
|
||||
4. Read DOMAIN_MODEL.md **Commands** (5 min) - what actions are there?
|
||||
5. Read PATTERNS.md sections relevant to your phase (10-20 min)
|
||||
6. Refer to ARCHITECTURE.md **Decision Trees** as you code (on-demand)
|
||||
|
||||
**Time:** 30-50 minutes of reading; then 2-8 hours of coding per phase
|
||||
|
||||
---
|
||||
|
||||
### Architect / Design Reviewer
|
||||
**Goal:** Validate the domain model and refactoring plan
|
||||
|
||||
1. Read DOMAIN_MODEL.md completely (40 min)
|
||||
2. Review REFACTORING_SUMMARY.md **Current State** (10 min)
|
||||
3. Scan ARCHITECTURE.md diagrams (10 min)
|
||||
4. Review PATTERNS.md for code quality (15 min)
|
||||
5. Provide feedback on:
|
||||
- Are the invariants correct and complete?
|
||||
- Are the aggregate boundaries clear?
|
||||
- Is the refactoring plan realistic?
|
||||
- Are we missing any patterns?
|
||||
|
||||
**Time:** 60-90 minutes
|
||||
|
||||
---
|
||||
|
||||
### QA / Tester
|
||||
**Goal:** Understand what to test
|
||||
|
||||
1. Read DOMAIN_MODEL.md **Testing Strategy** (5 min)
|
||||
2. Read REFACTORING_SUMMARY.md **Testing Checklist** (5 min)
|
||||
3. Read ARCHITECTURE.md **Failure Scenarios** (10 min)
|
||||
4. Read PATTERNS.md **Pattern 6: Testing Aggregates** (15 min)
|
||||
5. Create test plan covering:
|
||||
- Unit tests for commands
|
||||
- Integration tests for full scenarios
|
||||
- Chaos tests for resilience
|
||||
|
||||
**Time:** 40 minutes of planning; then test writing
|
||||
|
||||
---
|
||||
|
||||
### Operator / DevOps
|
||||
**Goal:** Understand how to monitor and operate
|
||||
|
||||
1. Read ARCHITECTURE.md **Monitoring & Observability** (10 min)
|
||||
2. Read ARCHITECTURE.md **Configuration & Tuning** (10 min)
|
||||
3. Read ARCHITECTURE.md **Failure Scenarios** (15 min)
|
||||
4. Plan:
|
||||
- Which metrics to export?
|
||||
- Which alerts to set?
|
||||
- How to detect issues?
|
||||
- How to recover?
|
||||
|
||||
**Time:** 35 minutes
|
||||
|
||||
---
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Invariants
|
||||
Business rules that must NEVER be violated. The core of the domain model.
|
||||
|
||||
- **I1:** At most one leader per term
|
||||
- **I2:** All active shards have owners
|
||||
- **I3:** Shards only assigned to healthy nodes
|
||||
- **I4:** Shard assignments stable during lease
|
||||
- **I5:** Leader is an active node
|
||||
|
||||
### Aggregates
|
||||
Clusters of entities enforcing invariants. Root aggregates own state changes.
|
||||
|
||||
- **Cluster** (root) - owns topology, shard assignments
|
||||
- **LeadershipLease** (root) - owns leadership
|
||||
- **ShardAssignment** (root) - owns shard-to-node mappings
|
||||
|
||||
### Commands
|
||||
Explicit intent to change state. Named with domain language.
|
||||
|
||||
- JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards
|
||||
|
||||
### Events
|
||||
Facts that happened. Published after successful commands.
|
||||
|
||||
- NodeJoined, NodeFailed, LeaderElected, ShardAssigned, ShardMigrated
|
||||
|
||||
### Policies
|
||||
Automated reactions. Connect events to commands.
|
||||
|
||||
- "When NodeJoined then RebalanceShards"
|
||||
- "When LeadershipLost then ElectLeader"
|
||||
|
||||
---
|
||||
|
||||
## Glossary
|
||||
|
||||
| Term | Definition |
|
||||
|------|-----------|
|
||||
| Bounded Context | A boundary within which a domain model is consistent (Cluster Coordination) |
|
||||
| Aggregate | A cluster of entities enforcing business invariants; transactional boundary |
|
||||
| Aggregate Root | The only entity in an aggregate that external code references |
|
||||
| Invariant | A business rule that must always be true |
|
||||
| Command | A request to change state (intent-driven) |
|
||||
| Event | A fact that happened in the past (immutable) |
|
||||
| Policy | An automated reaction to events; connects contexts |
|
||||
| Read Model | A projection of state optimized for queries (no invariants) |
|
||||
| Value Object | Immutable object defined by attributes, not identity |
|
||||
| CQRS | Command Query Responsibility Segregation (commands change state; queries read state) |
|
||||
| Event Sourcing | Storing events as source of truth; state is derived by replay |
|
||||
|
||||
---
|
||||
|
||||
## Related Context Maps
|
||||
|
||||
**Upstream (External Dependencies):**
|
||||
- **NATS** - Provides pub/sub, KV store, JetStream
|
||||
- **Local Runtime** - Executes actors on this node
|
||||
- **Event Store** - Persists cluster events (optional)
|
||||
|
||||
**Downstream (Consumers):**
|
||||
- **Actor Runtime Context** - Migrates actors when shards move (reacts to ShardMigrated)
|
||||
- **Monitoring Context** - Tracks health and events (subscribes to topology events)
|
||||
- **Audit Context** - Records all topology changes (subscribes to all events)
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference: Decision Trees
|
||||
|
||||
### Is a node healthy?
|
||||
```
|
||||
Node found? → Check status (Active|Draining|Failed)
|
||||
→ Active/Draining? → YES
|
||||
→ Failed? → NO
|
||||
```
|
||||
|
||||
### Should we rebalance?
|
||||
```
|
||||
Leader? → YES
|
||||
Active nodes? → YES
|
||||
Strategy.Rebalance() → returns new ShardMap
|
||||
Validate invariants? → YES
|
||||
Publish ShardMigrated events
|
||||
```
|
||||
|
||||
### Can we assign shard to node?
|
||||
```
|
||||
Node exists? → YES
|
||||
Status active? → YES
|
||||
Replication < max? → YES
|
||||
Add node to shard's replica list
|
||||
```
|
||||
|
||||
See [ARCHITECTURE.md](./ARCHITECTURE.md) for full decision trees.
|
||||
|
||||
---
|
||||
|
||||
## Testing Resources
|
||||
|
||||
**Test Coverage Map:**
|
||||
- Unit tests: Commands, invariants, value objects
|
||||
- Integration tests: Full scenarios (node join, node fail, rebalance)
|
||||
- Chaos tests: Partitions, cascading failures, high churn
|
||||
|
||||
See [PATTERNS.md](./PATTERNS.md) **Pattern 6** for testing patterns and mocks.
|
||||
|
||||
---
|
||||
|
||||
## Common Questions
|
||||
|
||||
**Q: Why not use Raft for leader election?**
|
||||
A: Lease-based election is simpler and sufficient for our use case. Raft would be safer but more complex. See DOMAIN_MODEL.md **Design Decisions**.
|
||||
|
||||
**Q: What if a leader fails mid-rebalance?**
|
||||
A: New leader will detect incomplete rebalancing and may redo it. This is acceptable (idempotent). See ARCHITECTURE.md **Failure Scenarios**.
|
||||
|
||||
**Q: How many shards should we use?**
|
||||
A: Default 1024 provides good granularity. Tune based on your cluster size. See ARCHITECTURE.md **Configuration & Tuning**.
|
||||
|
||||
**Q: Can actors be lost during rebalancing?**
|
||||
A: No, if the application correctly implements actor migration. See DOMAIN_MODEL.md **Gaps**.
|
||||
|
||||
**Q: Is eventual consistency acceptable?**
|
||||
A: Yes for topology (replicas lag leader by ~100ms). Leadership is strongly consistent (atomic operations). See DOMAIN_MODEL.md **Policies**.
|
||||
|
||||
---
|
||||
|
||||
## Implementation Checklist
|
||||
|
||||
- [ ] Read DOMAIN_MODEL.md Summary + Invariants
|
||||
- [ ] Read REFACTORING_SUMMARY.md Overview
|
||||
- [ ] Review PATTERNS.md for Phase 1
|
||||
- [ ] Implement Phase 1 commands (JoinCluster, MarkNodeFailed)
|
||||
- [ ] Add tests for Phase 1
|
||||
- [ ] Code review
|
||||
- [ ] Merge Phase 1
|
||||
- [ ] Repeat for Phases 2-4
|
||||
|
||||
---
|
||||
|
||||
## Document Version History
|
||||
|
||||
| Version | Date | Changes |
|
||||
|---------|------|---------|
|
||||
| 1.0 | 2026-01-12 | Initial domain model created |
|
||||
|
||||
---
|
||||
|
||||
## Contact & Questions
|
||||
|
||||
For questions about this domain model:
|
||||
- **Domain modeling:** Refer to DOMAIN_MODEL.md Invariants & Aggregates sections
|
||||
- **Implementation:** Refer to PATTERNS.md for code examples
|
||||
- **Architecture:** Refer to ARCHITECTURE.md for system design
|
||||
- **Refactoring plan:** Refer to REFACTORING_SUMMARY.md for priorities
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- [Vision](../vision.md) - Product vision for Aether
|
||||
- [Project Structure](../README.md) - How this repository is organized
|
||||
- [Event Sourcing Guide](../event.go) - Event and EventStore interface
|
||||
- [NATS Documentation](https://docs.nats.io) - NATS pub/sub and JetStream
|
||||
|
||||
1119
.product-strategy/cluster/PATTERNS.md
Normal file
1119
.product-strategy/cluster/PATTERNS.md
Normal file
File diff suppressed because it is too large
509
.product-strategy/cluster/REFACTORING_SUMMARY.md
Normal file
509
.product-strategy/cluster/REFACTORING_SUMMARY.md
Normal file
@@ -0,0 +1,509 @@
|
||||
# Cluster Coordination: DDD Refactoring Summary
|
||||
|
||||
## Overview
|
||||
|
||||
The Cluster Coordination bounded context manages distributed topology (nodes, shards, leadership) for Aether's actor system. This document highlights gaps between the intended DDD model and current implementation, with prioritized refactoring recommendations.
|
||||
|
||||
---
|
||||
|
||||
## Current State: Code vs Domain Model
|
||||
|
||||
### What's Working Well
|
||||
|
||||
1. **LeaderElection aggregate** (✓)
|
||||
- Correctly uses NATS KV atomic operations to enforce "single leader per term"
|
||||
- Lease renewal every 3s + expiration after 10s prevents split-brain
|
||||
- Lease-based approach simpler than Raft; good for this context
|
||||
|
||||
2. **ConsistentHashRing utility** (✓)
|
||||
- Properly implements consistent hashing with virtual nodes (150 per node)
|
||||
- Minimizes shard reshuffling on topology changes
|
||||
- Thread-safe via RWMutex
|
||||
|
||||
3. **NodeDiscovery** (✓)
|
||||
- Heartbeat mechanism (every 30s) for membership discovery
|
||||
- Failure detection via absence (90s timeout in ClusterManager)
|
||||
- Graceful shutdown signal (NodeLeft)
|
||||
|
||||
4. **Architecture (interfaces)** (✓)
|
||||
- VMRegistry interface decouples cluster package from runtime
|
||||
- Runtime interface avoids import cycles
|
||||
- PlacementStrategy pattern allows pluggable rebalancing algorithms
|
||||
|
||||
---
|
||||
|
||||
### What Needs Work
|
||||
|
||||
#### Gap 1: Anemic Domain Model
|
||||
|
||||
**Problem:** ClusterManager, ShardManager lack explicit commands and domain events; mostly data holders.
|
||||
|
||||
**Evidence:**
|
||||
- ClusterManager: stores state (nodes, shardMap, hashRing) but no command handlers
|
||||
- Node updates handled via generic message dispatcher (handleClusterMessage), not domain commands
|
||||
- No event publishing; state changes are silent
|
||||
|
||||
**Example:**
|
||||
```go
|
||||
// Current (anemic):
|
||||
cm.nodes[update.Node.ID] = update.Node
|
||||
cm.hashRing.AddNode(update.Node.ID)
|
||||
|
||||
// Intended (DDD):
|
||||
event := cm.JoinCluster(nodeInfo) // Command
|
||||
eventBus.Publish(event) // Event: NodeJoined
|
||||
```
|
||||
|
||||
**Refactoring:** Extract command methods with explicit intent language
|
||||
- [ ] Add JoinCluster(nodeInfo) command handler
|
||||
- [ ] Add MarkNodeFailed(nodeID, reason) command handler
|
||||
- [ ] Add AssignShards(shardMap) command handler
|
||||
- [ ] Publish NodeJoined, NodeFailed, ShardAssigned events
|
||||
|
||||
---
|
||||
|
||||
#### Gap 2: No Event Sourcing
|
||||
|
||||
**Problem:** Topology changes don't produce events; impossible to audit "who owned shard 42 at 3pm?"
|
||||
|
||||
**Evidence:**
|
||||
- No event store integration (events captured in code comments, not persisted)
|
||||
- LeaderElection uses callbacks instead of publishing events
|
||||
- No audit trail of topology decisions
|
||||
|
||||
**Impact:** Can't rebuild topology state, can't debug rebalancing decisions, can't integrate with other contexts via events.
|
||||
|
||||
**Refactoring:** Introduce event publishing
|
||||
- [ ] Add EventPublisher interface to aggregates
|
||||
- [ ] Publish LeaderElected, LeadershipLost, LeadershipRenewed events
|
||||
- [ ] Publish NodeJoined, NodeLeft, NodeFailed events
|
||||
- [ ] Publish ShardAssigned, ShardMigrated events
|
||||
- [ ] Store events in event store (optional: in-memory for now)
|
||||
|
||||
---
|
||||
|
||||
#### Gap 3: Responsibility Split (Cluster vs ShardAssignment)
|
||||
|
||||
**Problem:** Cluster topology (ClusterManager) and shard assignment (ShardManager) are separate aggregates without clear ownership of invariants.
|
||||
|
||||
**Evidence:**
|
||||
- ClusterManager decides "node failed, trigger rebalance"
|
||||
- ShardManager does "compute new assignments"
|
||||
- No one validates "new assignment only uses healthy nodes"
|
||||
|
||||
**Risk:** Concurrent rebalancing from multiple nodes; stale assignments to failed nodes; orphaned shards.
|
||||
|
||||
**Refactoring:** Unify under Cluster aggregate root (or establish clear interface)
|
||||
- [ ] ClusterManager owns Cluster aggregate (nodes, shards, leadership)
|
||||
- [ ] ShardManager becomes ShardAssignment aggregate (or ShardingPolicy utility)
|
||||
- [ ] Only Cluster can issue ShardAssigned commands
|
||||
- [ ] ShardManager validates invariants (all nodes healthy, all shards assigned)
|
||||
|
||||
---
|
||||
|
||||
#### Gap 4: Rebalancing Logic Incomplete
|
||||
|
||||
**Problem:** PlacementStrategy.RebalanceShards is stubbed; actual rebalancing doesn't happen.
|
||||
|
||||
**Evidence:** ConsistentHashPlacement.RebalanceShards returns currentMap unchanged (line 214, shard.go)
|
||||
|
||||
**Impact:** Adding a node or removing a failed node doesn't actually redistribute shards to new nodes.
|
||||
|
||||
**Refactoring:** Implement real rebalancing
|
||||
- [ ] Use ConsistentHashRing to compute new assignments
|
||||
- [ ] Minimize shard movement (virtual nodes help, but still need to compute delta)
|
||||
- [ ] Verify no shard orphaning after new topology
|
||||
- [ ] Test: adding node should redistribute ~1/N shards to it
|
||||
|
||||
---
|
||||
|
||||
#### Gap 5: Invariant Validation Scattered
|
||||
|
||||
**Problem:** Invariants checked in multiple places; easy to miss a case.
|
||||
|
||||
**Evidence:**
|
||||
- Node failure detection in handleNodeUpdate (line 191)
|
||||
- Duplicate check in checkNodeHealth (line 283)
|
||||
- No central validation that "all shards in [0, ShardCount) are assigned"
|
||||
|
||||
**Refactoring:** Centralize invariant validation
|
||||
- [ ] Add Cluster.ValidateTopology() method
|
||||
- [ ] Add ShardAssignment.ValidateAssignments() method
|
||||
- [ ] Call validation after every topology change
|
||||
- [ ] Test: add node, verify all shards assigned and no orphans

---

#### Gap 6: LeaderElection Uses Callbacks, Not Events

**Problem:** Leadership changes trigger callbacks (OnBecameLeader, OnNewLeader); no events for other contexts.

**Evidence:**

```go
// Current (callbacks in manager.go line 54-63)
callbacks := LeaderElectionCallbacks{
    OnBecameLeader: func() { cm.logger.Printf("...") },
    ...
}

// Intended (events published to event bus)
eventBus.Publish(LeaderElected{LeaderID, Term, ExpiresAt})
```

**Refactoring:** Publish events instead of (or in addition to) callbacks
- [ ] Publish LeaderElected event
- [ ] Publish LeadershipLost event
- [ ] Events captured in event store, enabling other contexts to react

---

## Refactoring Priority Matrix

### High Priority (Blocks Event-Driven Integration)

| ID | Issue | Effort | Impact | Reason |
|----|-------|--------|--------|--------|
| 1 | Extract Cluster aggregate with explicit commands | Med | High | Unblocks event publishing; enables other contexts to react |
| 2 | Implement PlacementStrategy.RebalanceShards | Med | High | Rebalancing currently doesn't work; critical for node scaling |
| 3 | Publish domain events (NodeJoined, ShardAssigned, etc.) | Med | High | Enables event sourcing, audit trail, inter-context communication |

### Medium Priority (Improves Clarity & Robustness)

| ID | Issue | Effort | Impact | Reason |
|----|-------|--------|--------|--------|
| 4 | Extract MarkNodeFailed command handler | Low | Med | Consolidates node failure logic; improves intent clarity |
| 5 | Unify ShardAssignment invariant validation | Low | Med | Prevents orphaned shards; catches bugs early |
| 6 | Add shard migration tracking | High | Med | Prevents rebalancing while migrations are in flight |
| 7 | Publish LeaderElection events | Low | Med | Improves observability; auditable leadership changes |

### Low Priority (Nice to Have)

| ID | Issue | Effort | Impact | Reason |
|----|-------|--------|--------|--------|
| 8 | Add GetNodeHealth read model | Low | Low | Monitoring/debugging; not core to coordination |
| 9 | Add rebalancing status tracking | Low | Low | Observability; doesn't affect correctness |

---

## Refactoring Plan (First Sprint)

### Phase 1: Extract Cluster Commands (Week 1)

**Goal:** Make cluster topology changes explicit and intent-driven.

```go
// Add to ClusterManager

// JoinCluster adds a node to the cluster.
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
    cm.mutex.Lock()
    defer cm.mutex.Unlock()

    // Validate
    if nodeInfo.ID == "" {
        return errors.New("node ID is empty")
    }
    if nodeInfo.Capacity <= 0 {
        return errors.New("node capacity must be > 0")
    }
    if _, exists := cm.nodes[nodeInfo.ID]; exists {
        return fmt.Errorf("node already joined: %s", nodeInfo.ID)
    }

    // Command execution
    cm.nodes[nodeInfo.ID] = nodeInfo
    cm.hashRing.AddNode(nodeInfo.ID)

    // Event: publish NodeJoined
    cm.publishEvent(&NodeJoined{
        NodeID:    nodeInfo.ID,
        Address:   nodeInfo.Address,
        Capacity:  nodeInfo.Capacity,
        Timestamp: time.Now(),
    })

    // Trigger rebalancing if leader
    if cm.IsLeader() {
        go cm.triggerShardRebalancing("node joined")
    }

    return nil
}

// MarkNodeFailed marks a node as failed.
func (cm *ClusterManager) MarkNodeFailed(nodeID string, reason string) error {
    cm.mutex.Lock()
    defer cm.mutex.Unlock()

    node, exists := cm.nodes[nodeID]
    if !exists {
        return fmt.Errorf("node not found: %s", nodeID)
    }
    if node.Status == NodeStatusFailed {
        return fmt.Errorf("node already failed: %s", nodeID)
    }

    // Command execution
    node.Status = NodeStatusFailed
    cm.hashRing.RemoveNode(nodeID)

    // Event: publish NodeFailed
    cm.publishEvent(&NodeFailed{
        NodeID:    nodeID,
        Reason:    reason,
        Timestamp: time.Now(),
    })

    // Trigger rebalancing if leader
    if cm.IsLeader() {
        go cm.triggerShardRebalancing("node failed")
    }

    return nil
}
```

**Deliverables:**
- [ ] ClusterManager.JoinCluster(nodeInfo) command
- [ ] ClusterManager.MarkNodeFailed(nodeID, reason) command
- [ ] ClusterManager.publishEvent() helper
- [ ] Events: NodeJoined, NodeFailed (defined but not yet stored)
- [ ] Tests: verify commands validate invariants and trigger events (see the sketch below)
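
A sketch of how those tests could look with an in-memory fake publisher; `fakePublisher` and `newTestClusterManager` are hypothetical test helpers (the real constructor and wiring may differ).

```go
// fakePublisher records published events so tests can assert on them.
type fakePublisher struct{ events []interface{} }

func (f *fakePublisher) Publish(event interface{}) error {
    f.events = append(f.events, event)
    return nil
}

func TestJoinClusterPublishesNodeJoined(t *testing.T) {
    pub := &fakePublisher{}
    // Hypothetical helper that builds a ClusterManager wired to the fake
    // publisher and no real NATS connection.
    cm := newTestClusterManager(pub)

    if err := cm.JoinCluster(&NodeInfo{ID: "node-1", Address: "127.0.0.1:4222", Capacity: 1.0}); err != nil {
        t.Fatalf("JoinCluster: %v", err)
    }
    if len(pub.events) != 1 {
        t.Fatalf("expected 1 event, got %d", len(pub.events))
    }
    if _, ok := pub.events[0].(*NodeJoined); !ok {
        t.Fatalf("expected *NodeJoined, got %T", pub.events[0])
    }

    // Invalid input must be rejected and publish nothing.
    if err := cm.JoinCluster(&NodeInfo{ID: "", Capacity: 1.0}); err == nil {
        t.Fatal("expected error for empty node ID")
    }
    if len(pub.events) != 1 {
        t.Fatalf("rejected command should not publish events, got %d", len(pub.events))
    }
}
```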

**Blocking Dependency:** EventPublisher interface (Phase 2)

---

### Phase 2: Publish Domain Events (Week 2)

**Goal:** Make topology changes observable and auditable.

```go
// Add EventPublisher interface
type EventPublisher interface {
    Publish(event interface{}) error
}

// ClusterManager uses it
type ClusterManager struct {
    // ...
    publisher EventPublisher
}

// Define domain events
type NodeJoined struct {
    NodeID    string
    Address   string
    Capacity  float64
    Timestamp time.Time
}

type NodeFailed struct {
    NodeID    string
    Reason    string
    Timestamp time.Time
}

type ShardAssigned struct {
    ShardID   int
    NodeIDs   []string
    Version   uint64
    Timestamp time.Time
}

type ShardMigrated struct {
    ShardID   int
    FromNodes []string
    ToNodes   []string
    Timestamp time.Time
}
```

**Deliverables:**
- [ ] EventPublisher interface
- [ ] Domain events: NodeJoined, NodeFailed, ShardAssigned, ShardMigrated, RebalancingTriggered, RebalancingCompleted
- [ ] LeaderElection publishes LeaderElected, LeadershipLost
- [ ] Events published to NATS (via NATSEventBus) for cross-context communication (see the sketch below)
- [ ] Tests: verify events are published correctly
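
One possible shape for the NATS-backed publisher, as a hedged sketch: the envelope format and the `aether.cluster.events` subject are assumptions chosen to fit the `aether.cluster.*` namespace, not the current NATSEventBus API.

```go
import (
    "encoding/json"
    "fmt"

    "github.com/nats-io/nats.go"
)

// NATSEventBus publishes domain events as JSON envelopes on a single subject,
// which keeps per-publisher ordering; per-type subjects are another option.
type NATSEventBus struct {
    nc *nats.Conn
}

func NewNATSEventBus(nc *nats.Conn) *NATSEventBus { return &NATSEventBus{nc: nc} }

type eventEnvelope struct {
    Type string      `json:"type"`
    Data interface{} `json:"data"`
}

func (b *NATSEventBus) Publish(event interface{}) error {
    env := eventEnvelope{
        Type: fmt.Sprintf("%T", event), // e.g. "*cluster.NodeJoined"
        Data: event,
    }
    data, err := json.Marshal(env)
    if err != nil {
        return fmt.Errorf("encode event: %w", err)
    }
    return b.nc.Publish("aether.cluster.events", data)
}
```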

---

### Phase 3: Implement Real Rebalancing (Week 3)

**Goal:** Make rebalancing actually redistribute shards across the current set of nodes.

```go
// In ShardManager (or a separate RebalancingStrategy)

func (cp *ConsistentHashPlacement) RebalanceShards(
    currentMap *ShardMap,
    activeNodes map[string]*NodeInfo,
) (*ShardMap, error) {
    if len(activeNodes) == 0 {
        return nil, errors.New("no active nodes")
    }

    // Build new hash ring from current nodes
    ring := NewConsistentHashRingWithConfig(DefaultHashRingConfig())
    for nodeID := range activeNodes {
        ring.AddNode(nodeID)
    }

    // Reassign each shard via consistent hash
    newAssignments := make(map[int][]string)
    for shardID := 0; shardID < len(currentMap.Shards); shardID++ {
        primaryNode := ring.GetNode(fmt.Sprintf("shard-%d", shardID))
        newAssignments[shardID] = []string{primaryNode}

        // TODO: add replicas based on replication factor
    }

    return &ShardMap{
        Version:    currentMap.Version + 1,
        Shards:     newAssignments,
        Nodes:      activeNodes,
        UpdateTime: time.Now(),
    }, nil
}
```

**Deliverables:**
- [ ] ConsistentHashPlacement.RebalanceShards implemented (not stubbed)
- [ ] Handles node addition (redistributes shards to the new node)
- [ ] Handles node removal (redistributes shards away from the failed node)
- [ ] Tests: adding a node redistributes ~1/N of the shards; removing a node doesn't orphan shards (see the sketch below)
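
A sketch of the redistribution test; construction of ConsistentHashPlacement and the `emptyShardMap` helper are assumptions, and the movement bounds are loose because consistent hashing is probabilistic.

```go
func TestAddingNodeRedistributesAboutOneNth(t *testing.T) {
    placement := &ConsistentHashPlacement{} // construction details assumed

    threeNodes := map[string]*NodeInfo{
        "node-1": {ID: "node-1", Status: NodeStatusActive},
        "node-2": {ID: "node-2", Status: NodeStatusActive},
        "node-3": {ID: "node-3", Status: NodeStatusActive},
    }
    // Hypothetical helper: a ShardMap with 1024 shards and no owners yet.
    before, err := placement.RebalanceShards(emptyShardMap(1024), threeNodes)
    if err != nil {
        t.Fatal(err)
    }

    fourNodes := map[string]*NodeInfo{"node-4": {ID: "node-4", Status: NodeStatusActive}}
    for id, n := range threeNodes {
        fourNodes[id] = n
    }

    after, err := placement.RebalanceShards(before, fourNodes)
    if err != nil {
        t.Fatal(err)
    }

    moved := 0
    for shardID, owners := range after.Shards {
        if before.Shards[shardID][0] != owners[0] {
            moved++
        }
    }
    // Ideal movement is 1024/4 = 256 shards; allow generous slack for hash
    // imbalance, but fail if far more or far fewer shards move.
    if moved < 128 || moved > 448 {
        t.Fatalf("expected roughly 256 shards to move, got %d", moved)
    }
}
```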

---

### Phase 4: Unify ShardAssignment Invariants (Week 4)

**Goal:** Validate that shard assignments are safe before applying them.

```go
// In ClusterManager

func (cm *ClusterManager) AssignShards(newShardMap *ShardMap) error {
    cm.mutex.Lock()
    defer cm.mutex.Unlock()

    // Validate: all shards assigned (a shard count of 1024 is assumed here;
    // in practice this should come from the configured ShardCount)
    allShards := make(map[int]bool)
    for shardID := range newShardMap.Shards {
        allShards[shardID] = true
    }
    for i := 0; i < 1024; i++ {
        if !allShards[i] {
            return fmt.Errorf("shard %d not assigned", i)
        }
    }

    // Validate: all nodes are healthy
    for _, nodeList := range newShardMap.Shards {
        for _, nodeID := range nodeList {
            node, exists := cm.nodes[nodeID]
            if !exists || node.Status != NodeStatusActive {
                return fmt.Errorf("shard assigned to unhealthy node: %s", nodeID)
            }
        }
    }

    // Apply new assignments, keeping the old map for change detection
    oldShardMap := cm.shardMap
    cm.shardMap = newShardMap

    // Publish events for each shard that changed owners
    for shardID, nodeList := range newShardMap.Shards {
        oldNodes := oldShardMap.Shards[shardID]
        if !stringSliceEqual(oldNodes, nodeList) {
            cm.publishEvent(&ShardMigrated{
                ShardID:   shardID,
                FromNodes: oldNodes,
                ToNodes:   nodeList,
                Timestamp: time.Now(),
            })
        }
    }

    return nil
}
```
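
The sketch above calls `stringSliceEqual`, which isn't shown elsewhere; a minimal order-sensitive version follows (if replica order is not significant, compare the slices as sets instead).

```go
// stringSliceEqual reports whether a and b contain the same node IDs in the
// same order.
func stringSliceEqual(a, b []string) bool {
    if len(a) != len(b) {
        return false
    }
    for i := range a {
        if a[i] != b[i] {
            return false
        }
    }
    return true
}
```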

**Deliverables:**
- [ ] ShardAssignment invariant validation (all shards assigned, only healthy nodes)
- [ ] AssignShards command handler in ClusterManager
- [ ] Publish ShardMigrated events
- [ ] Tests: reject assignment with orphaned shards; reject assignment to failed node

---

## Testing Checklist

### Unit Tests (Phase 1-2)
- [ ] JoinCluster command validates node ID is unique
- [ ] MarkNodeFailed command validates node exists
- [ ] Commands trigger events
- [ ] Commands fail on invalid input (empty ID, negative capacity)
- [ ] Commands fail if not leader (AssignShards, RebalanceShards)

### Integration Tests (Phase 3-4)
- [ ] Single leader election (3 nodes)
- [ ] Leader failure → new leader elected within 10s
- [ ] Node join → shards redistributed to new node
- [ ] Node failure → shards reassigned from failed node
- [ ] Graceful shutdown → no 90s timeout
- [ ] No orphaned shards after rebalancing

### Chaos Tests (Phase 4)
- [ ] Leader fails mid-rebalance → new leader resumes
- [ ] Network partition → split-brain prevented by lease
- [ ] Cascading failures → cluster stabilizes
- [ ] High churn (nodes join/leave rapidly) → topology converges

---

## Success Metrics

### After Phase 1 (Explicit Commands)
- ✓ ClusterManager has JoinCluster, MarkNodeFailed command methods
- ✓ Commands validate preconditions
- ✓ Commands trigger rebalancing if leader

### After Phase 2 (Domain Events)
- ✓ NodeJoined, NodeFailed, ShardAssigned events published
- ✓ LeaderElection publishes LeaderElected, LeadershipLost events
- ✓ Events visible in NATS pub/sub for other contexts

### After Phase 3 (Real Rebalancing)
- ✓ PlacementStrategy actually redistributes shards
- ✓ Adding node → shards assigned to it
- ✓ Removing node → shards reassigned elsewhere
- ✓ No orphaned shards

### After Phase 4 (Unified Invariants)
- ✓ Invalid assignments rejected (unhealthy node, orphaned shard)
- ✓ All shard changes trigger events
- ✓ Cluster invariants validated before applying topology

---

## Integration with Other Contexts

Once Cluster Coordination publishes domain events, other contexts can consume them (a subscriber sketch follows below):

### Actor Runtime Context
- Subscribes to: ShardMigrated
- Actions: Migrate actors from the old node to the new node

### Monitoring Context
- Subscribes to: NodeJoined, NodeFailed, LeaderElected
- Actions: Update cluster health dashboard

### Audit Context
- Subscribes to: NodeJoined, NodeFailed, ShardAssigned, LeaderElected
- Actions: Record topology change log
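
As a sketch of the consuming side (reusing the envelope format and subject assumed in the Phase 2 publisher sketch), a monitoring-style subscriber might look like this:

```go
// watchTopology subscribes to cluster topology events and logs the ones the
// monitoring context cares about. Envelope format and subject name follow the
// Phase 2 sketch and are assumptions, not the current API.
func watchTopology(nc *nats.Conn, logger *log.Logger) (*nats.Subscription, error) {
    return nc.Subscribe("aether.cluster.events", func(m *nats.Msg) {
        var env struct {
            Type string          `json:"type"`
            Data json.RawMessage `json:"data"`
        }
        if err := json.Unmarshal(m.Data, &env); err != nil {
            logger.Printf("bad event payload: %v", err)
            return
        }
        switch env.Type {
        case "*cluster.ShardMigrated":
            var ev ShardMigrated
            if err := json.Unmarshal(env.Data, &ev); err == nil {
                logger.Printf("shard %d moved %v -> %v", ev.ShardID, ev.FromNodes, ev.ToNodes)
            }
        case "*cluster.NodeFailed":
            var ev NodeFailed
            if err := json.Unmarshal(env.Data, &ev); err == nil {
                logger.Printf("node %s failed: %s", ev.NodeID, ev.Reason)
            }
        }
    })
}
```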

---

## References

- Domain Model: [DOMAIN_MODEL.md](./DOMAIN_MODEL.md)
- Current Implementation: [manager.go](./manager.go), [leader.go](./leader.go), [shard.go](./shard.go)
- Product Vision: [../vision.md](../vision.md)