# Cluster Coordination: Domain Model Executive Summary

## Overview

I have completed a comprehensive Domain-Driven Design (DDD) analysis of the **Cluster Coordination** bounded context in Aether. This analysis identifies the core business invariants, models the domain as aggregates/commands/events, compares the intended model against the current implementation, and provides a prioritized refactoring roadmap.

**Key Finding:** The Cluster Coordination context has good architectural foundations (LeaderElection, ConsistentHashRing, NodeDiscovery) but lacks proper DDD patterns (explicit commands, domain events, invariant validation). The refactoring is medium effort with high impact on event-driven integration and observability.

---

## Five Core Invariants

These are the non-negotiable business rules that must never break:

1. **Single Leader Per Term** - At most one node is leader at a time; enforced via NATS KV atomic operations
2. **All Active Shards Have Owners** - Every shard ID in [0, 1024) must be assigned to ≥1 healthy node
3. **Shards Only on Healthy Nodes** - A shard can only be assigned to nodes in Active status
4. **Assignments Stable During Lease** - Shard topology does not change arbitrarily; it is rebalanced only when the topology changes
5. **Leader Is Active Node** - If LeaderID is set, that node must be in Cluster.nodes with status=Active

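A minimal sketch of how Invariants 2 and 3 might be checked before a shard map is accepted; the constant, types, and function below are illustrative assumptions, not Aether's existing API.

```go
import "fmt"

// Assumed constant and types for the sketch.
const ShardCount = 1024

type NodeStatus int

const (
	NodeActive NodeStatus = iota
	NodeDraining
	NodeFailed
)

// validateShardMap enforces Invariant 2 (every shard has an owner) and
// Invariant 3 (shards live only on Active nodes) before a map is applied.
func validateShardMap(assignments map[int]string, nodes map[string]NodeStatus) error {
	for shard := 0; shard < ShardCount; shard++ {
		owner, ok := assignments[shard]
		if !ok {
			return fmt.Errorf("invariant 2 violated: shard %d has no owner", shard)
		}
		if status, known := nodes[owner]; !known || status != NodeActive {
			return fmt.Errorf("invariant 3 violated: shard %d assigned to non-active node %q", shard, owner)
		}
	}
	return nil
}
```
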
---

## Three Root Aggregates

### Cluster (Root Aggregate)

Owns node topology, shard assignments, and rebalancing orchestration.

**Key Responsibility:** Maintain consistency of the cluster topology; only the leader can assign shards

**Commands:** JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards

**Events:** NodeJoined, NodeFailed, NodeLeft, ShardAssigned, ShardMigrated, RebalancingTriggered

### LeadershipLease (Root Aggregate)

Owns the leadership claim and ensures a single leader per term via lease-based election.

**Key Responsibility:** Maintain exactly one leader; detect failure via lease expiration

**Commands:** ElectLeader, RenewLeadership

**Events:** LeaderElected, LeadershipRenewed, LeadershipLost

### ShardAssignment (Root Aggregate)

Owns shard-to-node mappings and validates that assignments respect the invariants.

**Key Responsibility:** Track which shards live on which nodes; accept assignments to healthy nodes only

**Commands:** AssignShard, RebalanceFromTopology

**Events:** ShardAssigned, ShardMigrated

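To make this vocabulary concrete, here is one possible shape for the events as plain Go structs; the names and fields are illustrative assumptions, not the current Aether definitions.

```go
import "time"

// Illustrative domain event definitions (assumed field names).
type NodeJoined struct {
	NodeID   string    `json:"node_id"`
	Address  string    `json:"address"`
	JoinedAt time.Time `json:"joined_at"`
}

type ShardAssigned struct {
	ShardID    int       `json:"shard_id"`
	NodeID     string    `json:"node_id"`
	AssignedAt time.Time `json:"assigned_at"`
}

type ShardMigrated struct {
	ShardID  int       `json:"shard_id"`
	FromNode string    `json:"from_node"`
	ToNode   string    `json:"to_node"`
	MovedAt  time.Time `json:"moved_at"`
}

type LeaderElected struct {
	LeaderID  string    `json:"leader_id"`
	Term      uint64    `json:"term"`
	ElectedAt time.Time `json:"elected_at"`
}
```
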
---

## Code Analysis: What's Working & What Isn't

### What Works Well (✓)

- **LeaderElection** - Correctly implements lease-based election with NATS KV; enforces Invariant 1
- **ConsistentHashRing** - Proper consistent hashing with virtual nodes; minimizes shard reshuffling
- **NodeDiscovery** - Good heartbeat mechanism (30s interval) for membership discovery
- **Architecture** - Interfaces (VMRegistry, Runtime) properly decouple cluster from runtime

### What Needs Work (✗)

1. **Anemic aggregates** - ClusterManager and ShardManager are data holders, not behavior-enforcing aggregates
2. **No domain events** - Topology changes don't publish events; impossible to audit or integrate with other contexts
3. **Responsibility scattered** - Invariant validation is spread across multiple places (handleNodeUpdate, checkNodeHealth)
4. **Rebalancing stubbed** - ConsistentHashPlacement.RebalanceShards returns the map unchanged; it doesn't actually redistribute shards
5. **Implicit commands** - Node updates arrive via generic message handlers instead of explicit domain commands
6. **Leadership uses callbacks** - LeaderElection notifies via callbacks instead of publishing domain events

**Example Gap:** When a node joins, the current code does this:

```go
cm.nodes[update.Node.ID] = update.Node // Silent update
cm.hashRing.AddNode(update.Node.ID)    // No event
// No way for other contexts to learn "node-5 joined"
```

It should instead be:

```go
cm.JoinCluster(nodeInfo) // Explicit command
// Publishes: NodeJoined event
// Consumed by: Monitoring, Audit, Actor Runtime contexts
```

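A fuller sketch of what such a command could look like, reusing the NodeJoined event type sketched earlier; the ClusterManager fields, NodeInfo shape, and publisher are assumptions for illustration, not the planned implementation.

```go
import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Sketch only: just enough structure to show the command shape.
// The nodes map, hashRing, and events fields are assumed names.
type ClusterManager struct {
	mu       sync.Mutex
	nodes    map[string]NodeInfo
	hashRing interface{ AddNode(id string) }
	events   interface{ Publish(event any) error }
}

type NodeInfo struct {
	ID      string
	Address string
}

// JoinCluster validates preconditions, applies the state change, and then
// publishes a domain event so other contexts can react.
func (cm *ClusterManager) JoinCluster(node NodeInfo) error {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	if node.ID == "" {
		return errors.New("join rejected: node ID is empty")
	}
	if _, exists := cm.nodes[node.ID]; exists {
		return fmt.Errorf("join rejected: node %s already in cluster", node.ID)
	}

	cm.nodes[node.ID] = node
	cm.hashRing.AddNode(node.ID)

	return cm.events.Publish(NodeJoined{
		NodeID:   node.ID,
		Address:  node.Address,
		JoinedAt: time.Now(),
	})
}
```
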
---

## Refactoring Impact & Effort

### Priority Ranking

**High Priority (Blocks Event-Driven Integration)**

1. Extract Cluster commands with invariant validation (Medium effort)
2. Implement real rebalancing strategy (Medium effort)
3. Publish domain events (Medium effort)

**Medium Priority (Improves Clarity)**

4. Extract MarkNodeFailed command (Low effort)
5. Centralize shard invariant validation (Low effort)
6. Add shard migration tracking (High effort, improves robustness)
7. Publish LeaderElection events (Low effort, improves observability)

**Total Effort:** ~4-6 weeks (2-3 dev sprints)

### Timeline

- **Phase 1 (Week 1):** Extract explicit commands (JoinCluster, MarkNodeFailed)
- **Phase 2 (Week 2):** Publish domain events (NodeJoined, ShardAssigned, ShardMigrated)
- **Phase 3 (Week 3):** Implement real rebalancing (ConsistentHashPlacement)
- **Phase 4 (Week 4):** Centralize invariant validation (ShardAssignment)

### Success Metrics

After Phase 1:

- ✓ ClusterManager has explicit command methods
- ✓ Commands validate preconditions
- ✓ Commands trigger events

After Phase 2:

- ✓ All topology changes publish events to NATS
- ✓ Other contexts can subscribe and react
- ✓ Full audit trail of topology decisions

After Phase 3:

- ✓ Adding a node → shards actually redistribute to it
- ✓ Removing a node → shards are reassigned elsewhere
- ✓ No orphaned shards

After Phase 4:

- ✓ Invalid assignments rejected (unhealthy node, orphaned shard)
- ✓ Invariants validated before changes are applied
- ✓ Cluster state always consistent

---

## Design Decisions

### Why Lease-Based Election Instead of Raft?

**Chosen:** Lease-based election (NATS KV with atomic operations)

**Rationale:**

- Simpler to reason about and implement
- Detects failure within 10s (acceptable for coordination)
- Lower overhead
- Good enough for a library (not a mission-critical system)

**Trade-off:** Risk of split-brain if a partition persists for more than 10s and both sides retain NATS access (mitigated by atomic operations and term incrementing)

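For reference, a compact sketch of the lease logic written against a minimal key-value interface; NATS KV's atomic create-if-absent and update-if-revision-matches operations provide exactly these semantics, but the interface, key name, and function names here are assumptions, not Aether's LeaderElection code.

```go
import "time"

// kvLease abstracts the two atomic operations lease-based election needs.
type kvLease interface {
	Create(key string, value []byte) (revision uint64, err error)
	Update(key string, value []byte, lastRevision uint64) (revision uint64, err error)
}

// tryAcquire attempts to claim leadership. At most one node can succeed:
// Create fails if the key already exists, and Update fails if another node
// has changed the key since lastRevision.
func tryAcquire(kv kvLease, nodeID string, lastRevision uint64) (uint64, bool) {
	if lastRevision == 0 {
		rev, err := kv.Create("leader", []byte(nodeID))
		return rev, err == nil
	}
	rev, err := kv.Update("leader", []byte(nodeID), lastRevision)
	return rev, err == nil
}

// renewLoop renews the lease every 3s; the 10s lease TTL is assumed to be
// configured on the bucket, so a crashed leader's claim expires on its own.
func renewLoop(kv kvLease, nodeID string, rev uint64, lost chan<- struct{}) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		newRev, err := kv.Update("leader", []byte(nodeID), rev)
		if err != nil {
			lost <- struct{}{} // renewal failed: leadership lost
			return
		}
		rev = newRev
	}
}
```
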
### Why Consistent Hashing for Shard Assignment?

**Chosen:** Consistent hashing with virtual nodes (150 per node)

**Rationale:**

- Minimizes shard movement on topology changes (crucial for actor locality)
- Deterministic without central state (nodes can independently compute assignments)
- Well proven in distributed systems (Dynamo, Cassandra)

**Trade-off:** May not achieve perfect load balance (mitigated by allowing a custom PlacementStrategy)

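To illustrate the mechanics, here is a from-scratch sketch of a consistent-hash ring with virtual nodes; it is not Aether's ConsistentHashRing, and only the figure of 150 virtual nodes per physical node is taken from the decision above.

```go
import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 150 // replicas per physical node, as noted above

type hashRing struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // ring position -> node ID
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddNode places 150 virtual points for the node on the ring, so adding or
// removing a node only moves the shards adjacent to its points.
func (r *hashRing) AddNode(nodeID string) {
	if r.owner == nil {
		r.owner = make(map[uint32]string)
	}
	for i := 0; i < virtualNodes; i++ {
		p := hash32(fmt.Sprintf("%s#%d", nodeID, i))
		r.owner[p] = nodeID
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// NodeForShard returns the owner of the first virtual point clockwise from
// the shard's hash, wrapping around the ring if necessary.
func (r *hashRing) NodeForShard(shardID int) string {
	h := hash32(fmt.Sprintf("shard-%d", shardID))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}
```
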
### Why Leader-Only Rebalancing?

**Chosen:** Only the leader can initiate shard rebalancing

**Rationale:**

- Prevents cascading rebalancing decisions from multiple nodes
- Single source of truth for topology
- Simplifies invariant enforcement

**Trade-off:** The leader is a bottleneck if rebalancing is expensive (mitigated by the leader delegating to placement algorithms)

---

## Key Policies (Automated Reactions)

The cluster enforces these policies to maintain invariants:

| Policy | Trigger | Action | Rationale |
|--------|---------|--------|-----------|
| Single Leader | LeadershipLost | ElectLeader | Ensure leadership is re-established |
| Lease Renewal | Every 3s | RenewLeadership | Detect leader failure after 10s |
| Node Failure Detection | Every 30s | Check LastSeen; if >90s, MarkNodeFailed | Detect crash/network partition |
| Rebalancing Trigger | NodeJoined/NodeFailed | RebalanceShards (if leader) | Redistribute load on topology change |
| Shard Coverage | Periodic + after failures | Validate all shards assigned | Prevent shard orphaning |
| Graceful Shutdown | NodeDiscovery.Stop() | Announce NodeLeft | Signal intentional leave (no 90s timeout) |

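As an illustration of the failure-detection row, a sketch of the periodic health sweep; the 30s interval and 90s threshold come from the table, while the inputs and the markFailed callback are assumptions rather than Aether's actual API.

```go
import "time"

const (
	healthCheckInterval = 30 * time.Second // sweep interval from the table
	failureThreshold    = 90 * time.Second // LastSeen cutoff from the table
)

// sweepNodeHealth runs on the leader: any node whose heartbeat is older than
// the threshold gets the MarkNodeFailed command, which in turn triggers
// rebalancing via the policy above.
func sweepNodeHealth(lastSeen map[string]time.Time, markFailed func(nodeID string)) {
	now := time.Now()
	for nodeID, seen := range lastSeen {
		if now.Sub(seen) > failureThreshold {
			markFailed(nodeID) // e.g. issues MarkNodeFailed
		}
	}
}
```
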
---

## Testing Strategy

### Unit Tests

- Commands validate invariants ✓
- Events publish correctly ✓
- Value objects enforce constraints ✓
- Strategies compute assignments ✓

### Integration Tests

- Single leader election (3 nodes) ✓
- Leader failure → new leader within 10s ✓
- Node join → shards redistributed ✓
- Node failure → shards reassigned ✓
- Graceful shutdown → no false failures ✓

### Chaos Tests

- Leader fails mid-rebalance → recovers ✓
- Network partition → no split-brain ✓
- Cascading failures → stabilizes ✓
- High churn → topology converges ✓

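As a sketch of the command-level unit tests, here is what a test for the hypothetical JoinCluster command shown earlier could look like; the recordingPublisher test double and newTestClusterManager helper are assumptions for illustration.

```go
import "testing"

func TestJoinClusterPublishesNodeJoined(t *testing.T) {
	events := &recordingPublisher{}     // assumed test double that records published events
	cm := newTestClusterManager(events) // assumed helper wiring up a ClusterManager

	if err := cm.JoinCluster(NodeInfo{ID: "node-5", Address: "10.0.0.5:4222"}); err != nil {
		t.Fatalf("JoinCluster failed: %v", err)
	}
	if len(events.published) != 1 {
		t.Fatalf("expected 1 published event, got %d", len(events.published))
	}
	if _, ok := events.published[0].(NodeJoined); !ok {
		t.Fatalf("expected NodeJoined, got %T", events.published[0])
	}

	// Precondition check: joining the same node twice must be rejected.
	if err := cm.JoinCluster(NodeInfo{ID: "node-5"}); err == nil {
		t.Fatal("expected duplicate join to be rejected")
	}
}
```
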
---

## Observability & Monitoring

### Key Metrics

```
# Topology
cluster.nodes.count                    [active|draining|failed]
cluster.shards.assigned                [0, 1024]
cluster.shards.orphaned                [0, 1024]   # RED if > 0

# Leadership
cluster.leader.is_leader               [0|1]
cluster.leader.term
cluster.leader.lease_expires_in_seconds

# Rebalancing
cluster.rebalancing.triggered          [reason]
cluster.rebalancing.active             [0|1]
cluster.rebalancing.completed          [shards_moved]

# Node Health
cluster.node.heartbeat_latency_ms      [per node]
cluster.node.load                      [per node]
cluster.node.vm_count                  [per node]
```

### Alerts

- Leader heartbeat missing > 5s → election stuck
- Rebalancing running > 5min → something is wrong
- Orphaned shards > 0 → CRITICAL (invariant violation)
- Node failures > 50% of the cluster → investigate

---

## Integration with Other Contexts

Once Cluster Coordination publishes domain events, other contexts can react:

### Actor Runtime Context

**Subscribes to:** ShardMigrated event

**Action:** Migrate actors from old node to new node

**Why:** When shards move, actors must follow

### Monitoring Context

**Subscribes to:** NodeJoined, NodeFailed, LeaderElected

**Action:** Update cluster health dashboard

**Why:** Operators need visibility into topology

### Audit Context

**Subscribes to:** NodeJoined, NodeFailed, ShardAssigned, LeaderElected

**Action:** Record topology change log

**Why:** Compliance, debugging, replaying state

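For example, the Actor Runtime context could consume ShardMigrated roughly like this once events are on NATS; the subject name `aether.cluster.events.shard_migrated` and the migrateActors callback are assumptions for illustration, not Aether's actual API.

```go
import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// subscribeShardMigrations is a sketch of an event consumer in the
// Actor Runtime context.
func subscribeShardMigrations(nc *nats.Conn, migrateActors func(shardID int, from, to string) error) (*nats.Subscription, error) {
	return nc.Subscribe("aether.cluster.events.shard_migrated", func(msg *nats.Msg) {
		var evt ShardMigrated // event type as sketched earlier
		if err := json.Unmarshal(msg.Data, &evt); err != nil {
			log.Printf("bad ShardMigrated payload: %v", err)
			return
		}
		// Actors living on this shard must follow it to the new owner.
		if err := migrateActors(evt.ShardID, evt.FromNode, evt.ToNode); err != nil {
			log.Printf("actor migration for shard %d failed: %v", evt.ShardID, err)
		}
	})
}
```
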
---

## Known Limitations & Gaps

### Current Limitations

1. **No quorum-based election** - Single leader with a lease; quorum could be added for stronger consistency
2. **No actor migration semantics** - The cluster signals ShardMigrated, but the application must implement the migration
3. **No topology versioning** - ShardMap.Version exists but is not enforced for consistency
4. **No leader handoff** - If the leader fails mid-rebalance, the new leader may redo migrations
5. **No split-brain detection** - The cluster can't detect if two leaders somehow exist (NATS KV prevents it, but the system doesn't validate)

### Acceptable for Now

- **Eventual consistency on topology** - Non-leaders lag by ~100ms (acceptable for routing)
- **90s failure detection** - Allows for network jitter; can be accelerated by the application
- **No strong consistency for topology** - Leadership is strongly consistent (atomic KV); topology is eventually consistent (NATS pub/sub)

---

## Deliverables

Five comprehensive documents have been created in `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/`:

1. **INDEX.md** (11 KB) - Navigation guide for all documents
2. **DOMAIN_MODEL.md** (43 KB) - Complete tactical DDD model with invariants, aggregates, commands, events, policies
3. **REFACTORING_SUMMARY.md** (16 KB) - Gap analysis and prioritized 4-phase implementation plan
4. **ARCHITECTURE.md** (37 KB) - Visual reference with diagrams, decision trees, state machines, failure scenarios
5. **PATTERNS.md** (30 KB) - Side-by-side code examples showing current vs intended implementations

**Total:** ~140 KB of documentation with detailed guidance for implementation

---

## Next Steps

### Immediate (This Sprint)

1. Review DOMAIN_MODEL.md with the team (1-hour meeting)
2. Confirm the invariants are correct (discussion)
3. Agree on Phase 1 priorities (which commands first?)

### Short-Term (Next Sprint)

1. Implement Phase 1: extract explicit commands (JoinCluster, MarkNodeFailed)
2. Add unit tests for the commands
3. Code review against the PATTERNS.md examples

### Medium-Term (Following Sprints)

1. Phase 2: Publish domain events
2. Phase 3: Implement real rebalancing
3. Phase 4: Centralize invariant validation

### Integration

1. Once events are published, other contexts (Actor Runtime, Monitoring) can subscribe
2. Enables a proper event-driven architecture
3. Full audit trail becomes available

---

## Questions & Discussion Points

1. **Are the 5 invariants correct?** Do we have all the non-negotiable rules captured?
2. **Are the aggregate boundaries clear?** Should Cluster own ShardAssignment, or is it independent?
3. **Is the 4-phase plan realistic?** Do we have capacity? Should we combine phases?
4. **Which contexts will consume events?** Who needs NodeJoined? ShardMigrated? LeaderElected?
5. **Do we need stronger consistency?** Should we add quorum-based election, or is lease-based sufficient?

---

## Conclusion

The Cluster Coordination context has solid foundations but needs DDD patterns to reach its full potential:

- **Current state:** Functional but opaque (hard to audit, hard to integrate, hard to test)
- **Intended state:** Event-driven, auditable, testable, properly aggregated (medium effort)
- **Impact:** Enables event-sourced architecture, cross-context communication, and observability

The refactoring is realistic and phased, allowing incremental value delivery. Phase 1 alone (explicit commands) provides immediate clarity. Phase 2 (events) unblocks other contexts.

**Recommendation:** Start with Phase 1 (Week 1) to validate the DDD approach. If the team finds value, continue to Phases 2-4. If not, we still have clearer domain models for reference.

---

## Document References

| Document | Purpose | Best For | Size |
|----------|---------|----------|------|
| [INDEX.md](./INDEX.md) | Navigation guide | Quick start, finding what you need | 11 KB |
| [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) | Complete DDD model | Understanding the domain, design review | 43 KB |
| [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) | Implementation plan | Planning work, estimating effort | 16 KB |
| [ARCHITECTURE.md](./ARCHITECTURE.md) | System design & diagrams | Understanding behavior, debugging, tuning | 37 KB |
| [PATTERNS.md](./PATTERNS.md) | Code examples | Writing the refactoring code | 30 KB |

**Start:** [INDEX.md](./INDEX.md)

**For implementation:** [PATTERNS.md](./PATTERNS.md)

**For design review:** [DOMAIN_MODEL.md](./DOMAIN_MODEL.md)

**For planning:** [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md)

---

## About This Analysis

This domain model was created using systematic Domain-Driven Design analysis:

1. **Identified invariants first** - What business rules must never break?
2. **Modeled aggregates around invariants** - Which entities enforce which rules?
3. **Designed commands & events** - What intents and facts describe state changes?
4. **Compared with existing code** - What's intended vs actual?
5. **Prioritized refactoring** - What to fix first, second, third?

The approach follows Eric Evans' Domain-Driven Design (2003) and tactical patterns like aggregates, value objects, and event sourcing.

---

**Created:** January 12, 2026

**By:** Domain Modeling Analysis (Claude)

**For:** Aether Project - Cluster Coordination Bounded Context