Cluster Coordination: Domain Model Executive Summary

Overview

I have completed a comprehensive Domain-Driven Design (DDD) analysis of the Cluster Coordination bounded context in Aether. This analysis identifies the core business invariants, models the domain as aggregates/commands/events, compares the intended model against the current implementation, and provides a prioritized refactoring roadmap.

Key Finding: The Cluster Coordination context has good architectural foundations (LeaderElection, ConsistentHashRing, NodeDiscovery) but lacks proper DDD patterns (explicit commands, domain events, invariant validation). The refactoring is medium effort with high impact on event-driven integration and observability.


Five Core Invariants

These are the non-negotiable business rules that must never break:

  1. Single Leader Per Term - At most one node is leader; enforced via NATS KV atomic operations
  2. All Active Shards Have Owners - Every shard ID [0, 1024) must be assigned to ≥1 healthy node
  3. Shards Only on Healthy Nodes - A shard can only be assigned to nodes in Active status
  4. Assignments Stable During Lease - Shard topology doesn't arbitrarily change; only rebalances on topology changes
  5. Leader Is Active Node - If LeaderID is set, that node must be in Cluster.nodes with status=Active
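
The sketch below shows how the structural invariants (2, 3, and 5) could be checked against a cluster snapshot. All type and field names are hypothetical, chosen only to illustrate the checks, and a single owner per shard is assumed for brevity; invariants 1 and 4 are temporal and belong in command handling instead.

package cluster

import "fmt"

// Hypothetical snapshot types, used only to illustrate the invariant checks.
type NodeStatus int

const (
    Active NodeStatus = iota
    Draining
    Failed
)

type NodeInfo struct {
    ID     string
    Status NodeStatus
}

type ClusterState struct {
    Nodes    map[string]NodeInfo // node ID -> node
    Shards   map[int]string      // shard ID in [0, 1024) -> owning node ID (single owner assumed)
    LeaderID string              // empty when no leader is known
}

// CheckInvariants validates invariants 2, 3, and 5 on a snapshot.
func CheckInvariants(s ClusterState) error {
    for shard := 0; shard < 1024; shard++ {
        owner, ok := s.Shards[shard]
        if !ok {
            return fmt.Errorf("invariant 2 violated: shard %d has no owner", shard)
        }
        if node, ok := s.Nodes[owner]; !ok || node.Status != Active {
            return fmt.Errorf("invariant 3 violated: shard %d assigned to non-active node %q", shard, owner)
        }
    }
    if s.LeaderID != "" {
        if leader, ok := s.Nodes[s.LeaderID]; !ok || leader.Status != Active {
            return fmt.Errorf("invariant 5 violated: leader %q is not an active cluster node", s.LeaderID)
        }
    }
    return nil
}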

Three Root Aggregates

Cluster (Root Aggregate)

Owns node topology, shard assignments, and rebalancing orchestration.

Key Responsibility: Maintain consistency of cluster topology; only leader can assign shards

Commands: JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards

Events: NodeJoined, NodeFailed, NodeLeft, ShardAssigned, ShardMigrated, RebalancingTriggered

LeadershipLease (Root Aggregate)

Owns the leadership claim and ensures single leader per term via lease-based election.

Key Responsibility: Maintain exactly one leader; detect failure via lease expiration

Commands: ElectLeader, RenewLeadership

Events: LeaderElected, LeadershipRenewed, LeadershipLost

ShardAssignment (Root Aggregate)

Owns shard-to-node mappings and validates assignments respect invariants.

Key Responsibility: Track which shards live on which nodes; validate healthy nodes only

Commands: AssignShard, RebalanceFromTopology

Events: ShardAssigned, ShardMigrated
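
As a sketch of how these intents and facts could become explicit types (reusing the hypothetical NodeInfo above; the struct shapes and fields are assumptions, not the current Aether API):

package cluster

import "time"

// Commands express intent and may be rejected if an invariant would break.
type JoinCluster struct {
    Node NodeInfo
}

type MarkNodeFailed struct {
    NodeID string
    Reason string
}

type AssignShards struct {
    Assignments map[int]string // shard ID -> node ID
}

// Events are immutable facts, published only after a command succeeds.
type NodeJoined struct {
    Node NodeInfo
    At   time.Time
}

type NodeFailed struct {
    NodeID string
    Reason string
    At     time.Time
}

type ShardAssigned struct {
    ShardID int
    NodeID  string
    At      time.Time
}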


Code Analysis: What's Working & What Isn't

What Works Well (✓)

  • LeaderElection - Correctly implements lease-based election with NATS KV; enforces Invariant 1
  • ConsistentHashRing - Proper consistent hashing with virtual nodes; minimizes shard reshuffling
  • NodeDiscovery - Good heartbeat mechanism (30s interval) for membership discovery
  • Architecture - Interfaces (VMRegistry, Runtime) properly decouple cluster from runtime

What Needs Work (✗)

  1. Anemic aggregates - ClusterManager, ShardManager are data holders, not behavior-enforcing aggregates
  2. No domain events - Topology changes don't publish events; impossible to audit or integrate with other contexts
  3. Responsibility scattered - Invariant validation is spread across multiple places (handleNodeUpdate, checkNodeHealth)
  4. Rebalancing stubbed - ConsistentHashPlacement.RebalanceShards returns the shard map unchanged; it doesn't actually redistribute shards
  5. Implicit commands - Node updates via generic message handlers instead of explicit domain commands
  6. Leadership uses callbacks - LeaderElection publishes via callbacks instead of domain events

Example Gap: When a node joins, the current code:

cm.nodes[update.Node.ID] = update.Node    // Silent update
cm.hashRing.AddNode(update.Node.ID)       // No event
// No way for other contexts to learn "node-5 joined"

Should be:

cm.JoinCluster(nodeInfo)                  // Explicit command
// Publishes: NodeJoined event
// Consumed by: Monitoring, Audit, Actor Runtime contexts
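
A minimal sketch of what that explicit command could look like, building on the existing cm.nodes and cm.hashRing fields; the eventBus field is an assumption, and locking is omitted for brevity:

func (cm *ClusterManager) JoinCluster(node NodeInfo) error {
    // Precondition: only Active nodes may join, keeping Invariant 3 intact.
    if node.Status != Active {
        return fmt.Errorf("node %s must be Active to join", node.ID)
    }
    if _, exists := cm.nodes[node.ID]; exists {
        return fmt.Errorf("node %s already joined", node.ID)
    }

    // Apply the state change.
    cm.nodes[node.ID] = node
    cm.hashRing.AddNode(node.ID)

    // Publish the fact so other contexts can learn "node-5 joined".
    cm.eventBus.Publish(NodeJoined{Node: node, At: time.Now()})
    return nil
}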

Refactoring Impact & Effort

Priority Ranking

High Priority (Blocks Event-Driven Integration)

  1. Extract Cluster commands with invariant validation (Medium effort)
  2. Implement real rebalancing strategy (Medium effort)
  3. Publish domain events (Medium effort)

Medium Priority (Improves Clarity)

  4. Extract MarkNodeFailed command (Low effort)
  5. Centralize shard invariant validation (Low effort)
  6. Add shard migration tracking (High effort, improves robustness)
  7. Publish LeaderElection events (Low effort, improves observability)

Total Effort: ~4-6 weeks (2-3 dev sprints)

Timeline

  • Phase 1 (Week 1): Extract explicit commands (JoinCluster, MarkNodeFailed)
  • Phase 2 (Week 2): Publish domain events (NodeJoined, ShardAssigned, ShardMigrated)
  • Phase 3 (Week 3): Implement real rebalancing (ConsistentHashPlacement)
  • Phase 4 (Week 4): Centralize invariant validation (ShardAssignment)

Success Metrics

After Phase 1:

  • ✓ ClusterManager has explicit command methods
  • ✓ Commands validate preconditions
  • ✓ Commands trigger events

After Phase 2:

  • ✓ All topology changes publish events to NATS
  • ✓ Other contexts can subscribe and react
  • ✓ Full audit trail of topology decisions

After Phase 3:

  • ✓ Adding node → shards actually redistribute to it
  • ✓ Removing node → shards reassigned elsewhere
  • ✓ No orphaned shards

After Phase 4:

  • ✓ Invalid assignments rejected (unhealthy node, orphaned shard)
  • ✓ Invariants validated before applying changes
  • ✓ Cluster state always consistent

Design Decisions

Why Lease-Based Election Instead of Raft?

Chosen: Lease-based (NATS KV with atomic operations)

Rationale:

  • Simpler to reason about and implement
  • Detect failure in 10s (acceptable for coordination)
  • Lower overhead
  • Good enough for a library (not a mission-critical system)

Trade-off: Risk of split-brain if partition persists >10s and both sides have NATS access (mitigated by atomic operations and term incrementing)
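
For illustration, the claim/renew cycle maps onto the nats.go KeyValue API roughly as follows; the bucket is assumed to be created with a ~10s TTL so an unrenewed lease expires, and the key name is made up for this sketch:

package election

import "github.com/nats-io/nats.go"

// tryAcquire claims leadership by atomically creating the leader key.
// Create fails if the key already exists, which enforces a single leader.
func tryAcquire(kv nats.KeyValue, nodeID string) (rev uint64, ok bool) {
    rev, err := kv.Create("leader", []byte(nodeID))
    if err != nil {
        return 0, false // another node holds the lease
    }
    return rev, true
}

// renew extends the lease (called roughly every 3s) by updating the key only
// if our revision is still the latest; if another node took over, Update fails.
func renew(kv nats.KeyValue, nodeID string, lastRev uint64) (rev uint64, ok bool) {
    rev, err := kv.Update("leader", []byte(nodeID), lastRev)
    if err != nil {
        return 0, false // leadership lost
    }
    return rev, true
}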

Why Consistent Hashing for Shard Assignment?

Chosen: Consistent hashing with virtual nodes (150 per node)

Rationale:

  • Minimize shard movement on topology change (crucial for actor locality)
  • Deterministic without central state (nodes can independently compute assignments)
  • Well-proven in distributed systems (Dynamo, Cassandra)

Trade-off: May not achieve perfect load balance (mitigated by allowing custom PlacementStrategy)
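
A minimal sketch of the technique (the virtual node count matches the 150 above; the hash function and type names are illustrative, not Aether's ConsistentHashRing API):

package hashring

import (
    "fmt"
    "hash/fnv"
    "sort"
)

const virtualNodes = 150 // virtual nodes per physical node

type Ring struct {
    hashes []uint32          // sorted virtual-node hashes
    owners map[uint32]string // virtual-node hash -> physical node ID
}

func New() *Ring { return &Ring{owners: make(map[uint32]string)} }

func hashKey(key string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(key))
    return h.Sum32()
}

// AddNode places 150 virtual nodes for the physical node on the ring,
// so only ~1/N of the shards move when a node joins or leaves.
func (r *Ring) AddNode(nodeID string) {
    for i := 0; i < virtualNodes; i++ {
        h := hashKey(fmt.Sprintf("%s#%d", nodeID, i))
        r.owners[h] = nodeID
        r.hashes = append(r.hashes, h)
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Owner returns the node responsible for a shard: the first virtual node
// clockwise from the shard's hash, wrapping around the ring.
func (r *Ring) Owner(shardID int) string {
    if len(r.hashes) == 0 {
        return ""
    }
    h := hashKey(fmt.Sprintf("shard-%d", shardID))
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0
    }
    return r.owners[r.hashes[i]]
}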

Why Leader-Only Rebalancing?

Chosen: Only leader can initiate shard rebalancing

Rationale:

  • Prevent cascading rebalancing decisions from multiple nodes
  • Single source of truth for topology
  • Simplifies invariant enforcement

Trade-off: Leader is bottleneck if rebalancing is expensive (mitigated by leader delegating to algorithms)
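
In code the guard is small; a sketch assuming an IsLeader accessor on the election component:

// onTopologyChange sketches the leader-only guard.
func (cm *ClusterManager) onTopologyChange() {
    if !cm.election.IsLeader() {
        return // followers wait for the leader's ShardAssigned / ShardMigrated events
    }
    cm.RebalanceShards()
}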


Key Policies (Automated Reactions)

The cluster enforces these policies to maintain invariants:

| Policy | Trigger | Action | Rationale |
| --- | --- | --- | --- |
| Single Leader | LeadershipLost | ElectLeader | Ensure leadership is re-established |
| Lease Renewal | Every 3s | RenewLeadership | Detect leader failure after 10s |
| Node Failure Detection | Every 30s | Check LastSeen; if >90s, MarkNodeFailed | Detect crash/network partition |
| Rebalancing Trigger | NodeJoined/NodeFailed | RebalanceShards (if leader) | Redistribute load on topology change |
| Shard Coverage | Periodic + after failures | Validate all shards assigned | Prevent shard orphaning |
| Graceful Shutdown | NodeDiscovery.Stop() | Announce NodeLeft | Signal intentional leave (no 90s timeout) |
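
As an example, the Node Failure Detection policy could be sketched as a periodic check (30s tick, 90s threshold, per the table above); the LastSeen field and the MarkNodeFailed signature are assumptions:

// checkNodeHealth runs on a 30s tick and marks nodes failed when their
// heartbeat is older than 90s, which in turn triggers the rebalancing policy.
func (cm *ClusterManager) checkNodeHealth(now time.Time) {
    for id, node := range cm.nodes {
        if node.Status != Active {
            continue
        }
        if now.Sub(node.LastSeen) > 90*time.Second {
            cm.MarkNodeFailed(id, "no heartbeat for more than 90s")
        }
    }
}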

Testing Strategy

Unit Tests

  • Commands validate invariants ✓
  • Events publish correctly ✓
  • Value objects enforce constraints ✓
  • Strategies compute assignments ✓
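
The first bullet, for instance, could look like the following; the constructor and command signatures are hypothetical and build on the sketched types above:

package cluster

import "testing"

func TestAssignShardRejectsUnhealthyNode(t *testing.T) {
    cm := NewClusterManager() // hypothetical constructor for the sketch
    cm.JoinCluster(NodeInfo{ID: "node-1", Status: Active})
    cm.MarkNodeFailed("node-1", "simulated failure")

    // Invariant 3: shards may only be assigned to Active nodes.
    if err := cm.AssignShard(42, "node-1"); err == nil {
        t.Fatal("expected assignment to a failed node to be rejected")
    }
}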

Integration Tests

  • Single leader election (3 nodes) ✓
  • Leader failure → new leader within 10s ✓
  • Node join → shards redistributed ✓
  • Node failure → shards reassigned ✓
  • Graceful shutdown → no false failures ✓

Chaos Tests

  • Leader fails mid-rebalance → recovers ✓
  • Network partition → no split-brain ✓
  • Cascading failures → stabilizes ✓
  • High churn → topology converges ✓

Observability & Monitoring

Key Metrics

# Topology
cluster.nodes.count [active|draining|failed]
cluster.shards.assigned [0, 1024]
cluster.shards.orphaned [0, 1024]  # RED if > 0

# Leadership
cluster.leader.is_leader [0|1]
cluster.leader.term
cluster.leader.lease_expires_in_seconds

# Rebalancing
cluster.rebalancing.triggered [reason]
cluster.rebalancing.active [0|1]
cluster.rebalancing.completed [shards_moved]

# Node Health
cluster.node.heartbeat_latency_ms [per node]
cluster.node.load [per node]
cluster.node.vm_count [per node]
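
A sketch of how a few of these could be exported with Prometheus's client_golang; the library choice and the underscore metric names are assumptions, mapping 1:1 onto the dotted names above:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // cluster.nodes.count [active|draining|failed]
    NodesCount = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "cluster_nodes_count", Help: "Nodes by status"},
        []string{"status"},
    )
    // cluster.shards.orphaned — should always be 0 (invariant 2)
    ShardsOrphaned = prometheus.NewGauge(
        prometheus.GaugeOpts{Name: "cluster_shards_orphaned", Help: "Shards without an owner"},
    )
    // cluster.leader.is_leader [0|1]
    IsLeader = prometheus.NewGauge(
        prometheus.GaugeOpts{Name: "cluster_leader_is_leader", Help: "1 if this node is the leader"},
    )
)

func init() {
    prometheus.MustRegister(NodesCount, ShardsOrphaned, IsLeader)
}

// Example update after a topology change:
//   NodesCount.WithLabelValues("active").Set(float64(activeNodes))
//   ShardsOrphaned.Set(float64(orphaned))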

Alerts

  • Leader heartbeat missing > 5s → election stuck
  • Rebalancing running > 5 min → likely stuck; investigate
  • Orphaned shards > 0 → CRITICAL (invariant violation)
  • More than 50% of nodes failed → investigate

Integration with Other Contexts

Once Cluster Coordination publishes domain events, other contexts can react:

Actor Runtime Context

Subscribes to: ShardMigrated event
Action: Migrate actors from old node to new node
Why: When shards move, actors must follow

Monitoring Context

Subscribes to: NodeJoined, NodeFailed, LeaderElected
Action: Update cluster health dashboard
Why: Operators need visibility into topology

Audit Context

Subscribes to: NodeJoined, NodeFailed, ShardAssigned, LeaderElected
Action: Record topology change log
Why: Compliance, debugging, replaying state
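
For illustration, a Monitoring-context subscriber over core NATS could look like the sketch below; the subject name and JSON payload shape are assumptions, not an existing Aether contract:

package monitoring

import (
    "encoding/json"
    "log"

    "github.com/nats-io/nats.go"
)

// ShardMigrated mirrors the event shape assumed in this document.
type ShardMigrated struct {
    ShardID  int    `json:"shard_id"`
    FromNode string `json:"from_node"`
    ToNode   string `json:"to_node"`
}

// SubscribeShardMigrated reacts to shard migrations published by the cluster.
// The subject "cluster.events.shard_migrated" is made up for this sketch.
func SubscribeShardMigrated(nc *nats.Conn) (*nats.Subscription, error) {
    return nc.Subscribe("cluster.events.shard_migrated", func(msg *nats.Msg) {
        var ev ShardMigrated
        if err := json.Unmarshal(msg.Data, &ev); err != nil {
            log.Printf("bad ShardMigrated payload: %v", err)
            return
        }
        log.Printf("shard %d moved from %s to %s", ev.ShardID, ev.FromNode, ev.ToNode)
    })
}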


Known Limitations & Gaps

Current Limitations

  1. No quorum-based election - Single leader with lease; could add quorum for stronger consistency
  2. No actor migration semantics - Cluster signals ShardMigrated, but the application must implement the migration itself
  3. No topology versioning - ShardMap.Version exists but not enforced for consistency
  4. No leader handoff - If leader fails mid-rebalance, new leader may redo migrations
  5. No split-brain detection - Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but system doesn't validate)

Acceptable for Now

  • Eventual consistency on topology - Non-leaders lag by ~100ms (acceptable for routing)
  • 90s failure detection - Allows for network jitter; can be accelerated by application
  • No strong consistency - Leadership is strongly consistent (atomic KV); topology is eventually consistent (NATS pub/sub)

Deliverables

Five comprehensive documents have been created in /Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/:

  1. INDEX.md (11 KB) - Navigation guide for all documents
  2. DOMAIN_MODEL.md (43 KB) - Complete tactical DDD model with invariants, aggregates, commands, events, policies
  3. REFACTORING_SUMMARY.md (16 KB) - Gap analysis and prioritized 4-phase implementation plan
  4. ARCHITECTURE.md (37 KB) - Visual reference with diagrams, decision trees, state machines, failure scenarios
  5. PATTERNS.md (30 KB) - Side-by-side code examples showing current vs intended implementations

Total: ~140 KB of documentation with detailed guidance for implementation


Next Steps

Immediate (This Sprint)

  1. Review DOMAIN_MODEL.md with team (1 hour meeting)
  2. Confirm invariants are correct (discussion)
  3. Agree on Phase 1 priorities (which commands first?)

Short-Term (Next Sprint)

  1. Implement Phase 1: Extract explicit commands (JoinCluster, MarkNodeFailed)
  2. Add unit tests for commands
  3. Code review against PATTERNS.md examples

Medium-Term (Following Sprints)

  1. Phase 2: Publish domain events
  2. Phase 3: Implement real rebalancing
  3. Phase 4: Centralize invariant validation

Integration

  1. Once events are published, other contexts (Actor Runtime, Monitoring) can subscribe
  2. Enables proper event-driven architecture
  3. Full audit trail becomes available

Questions & Discussion Points

  1. Are the 5 invariants correct? Do we have all the non-negotiable rules captured?
  2. Are the aggregate boundaries clear? Should Cluster own ShardAssignment, or is it independent?
  3. Is the 4-phase plan realistic? Do we have capacity? Should we combine phases?
  4. Which contexts will consume events? Who needs NodeJoined? ShardMigrated? LeaderElected?
  5. Do we need stronger consistency? Should we add quorum-based election? Or is lease-based sufficient?

Conclusion

The Cluster Coordination context has solid foundations but needs DDD patterns to reach its full potential:

  • Current state: Functional but opaque (hard to audit, hard to integrate, hard to test)
  • Intended state: Event-driven, auditable, testable, properly aggregated (medium effort)
  • Impact: Enables event-sourced architecture, cross-context communication, observability

The refactoring is realistic and phased, allowing incremental value delivery. Phase 1 alone (explicit commands) provides immediate clarity. Phase 2 (events) unblocks other contexts.

Recommendation: Start with Phase 1 (Week 1) to validate the DDD approach. If the team finds value, continue to Phases 2-4. If not, we still have clearer domain models for reference.


Document References

| Document | Purpose | Best For | Size |
| --- | --- | --- | --- |
| INDEX.md | Navigation guide | Quick start, finding what you need | 11 KB |
| DOMAIN_MODEL.md | Complete DDD model | Understanding the domain, design review | 43 KB |
| REFACTORING_SUMMARY.md | Implementation plan | Planning work, estimating effort | 16 KB |
| ARCHITECTURE.md | System design & diagrams | Understanding behavior, debugging, tuning | 37 KB |
| PATTERNS.md | Code examples | Writing the refactoring code | 30 KB |

Start: INDEX.md

For implementation: PATTERNS.md

For design review: DOMAIN_MODEL.md

For planning: REFACTORING_SUMMARY.md


About This Analysis

This domain model was created using systematic Domain-Driven Design analysis:

  1. Identified invariants first - What business rules must never break?
  2. Modeled aggregates around invariants - Which entities enforce which rules?
  3. Designed commands & events - What intents and facts describe state changes?
  4. Compared with existing code - What's intended vs actual?
  5. Prioritized refactoring - What to fix first, second, third?

The approach follows Eric Evans' Domain-Driven Design (2003) and tactical patterns like aggregates, value objects, and event sourcing.


Created: January 12, 2026

By: Domain Modeling Analysis (Claude)

For: Aether Project - Cluster Coordination Bounded Context