Cluster Coordination: Domain Model Executive Summary

Overview

I have completed a comprehensive Domain-Driven Design (DDD) analysis of the Cluster Coordination bounded context in Aether. This analysis identifies the core business invariants, models the domain as aggregates/commands/events, compares the intended model against the current implementation, and provides a prioritized refactoring roadmap.

Key Finding: The Cluster Coordination context has good architectural foundations (LeaderElection, ConsistentHashRing, NodeDiscovery) but lacks proper DDD patterns (explicit commands, domain events, invariant validation). The refactoring is medium effort with high impact on event-driven integration and observability.


Five Core Invariants

These are the non-negotiable business rules that must never break:

  1. Single Leader Per Term - At most one node is leader; enforced via NATS KV atomic operations
  2. All Active Shards Have Owners - Every shard ID [0, 1024) must be assigned to ≥1 healthy node
  3. Shards Only on Healthy Nodes - A shard can only be assigned to nodes in Active status
  4. Assignments Stable During Lease - Shard topology doesn't arbitrarily change; only rebalances on topology changes
  5. Leader Is Active Node - If LeaderID is set, that node must be in Cluster.nodes with status=Active
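
The sketch below shows how the structural invariants (2, 3, and 5) could be checked against a cluster snapshot. All type and field names are hypothetical, chosen only to illustrate the checks, and a single owner per shard is assumed for brevity; invariants 1 and 4 are temporal and belong in command handling instead.

package cluster

import "fmt"

// Hypothetical snapshot types, used only to illustrate the invariant checks.
type NodeStatus int

const (
    Active NodeStatus = iota
    Draining
    Failed
)

type NodeInfo struct {
    ID     string
    Status NodeStatus
}

type ClusterState struct {
    Nodes    map[string]NodeInfo // node ID -> node
    Shards   map[int]string      // shard ID in [0, 1024) -> owning node ID (single owner assumed)
    LeaderID string              // empty when no leader is known
}

// CheckInvariants validates invariants 2, 3, and 5 on a snapshot.
func CheckInvariants(s ClusterState) error {
    for shard := 0; shard < 1024; shard++ {
        owner, ok := s.Shards[shard]
        if !ok {
            return fmt.Errorf("invariant 2 violated: shard %d has no owner", shard)
        }
        if node, ok := s.Nodes[owner]; !ok || node.Status != Active {
            return fmt.Errorf("invariant 3 violated: shard %d assigned to non-active node %q", shard, owner)
        }
    }
    if s.LeaderID != "" {
        if leader, ok := s.Nodes[s.LeaderID]; !ok || leader.Status != Active {
            return fmt.Errorf("invariant 5 violated: leader %q is not an active cluster node", s.LeaderID)
        }
    }
    return nil
}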

Three Root Aggregates

Cluster (Root Aggregate)

Owns node topology, shard assignments, and rebalancing orchestration.

Key Responsibility: Maintain consistency of cluster topology; only leader can assign shards

Commands: JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards

Events: NodeJoined, NodeFailed, NodeLeft, ShardAssigned, ShardMigrated, RebalancingTriggered

LeadershipLease (Root Aggregate)

Owns the leadership claim and ensures single leader per term via lease-based election.

Key Responsibility: Maintain exactly one leader; detect failure via lease expiration

Commands: ElectLeader, RenewLeadership

Events: LeaderElected, LeadershipRenewed, LeadershipLost

ShardAssignment (Root Aggregate)

Owns shard-to-node mappings and validates assignments respect invariants.

Key Responsibility: Track which shards live on which nodes; validate healthy nodes only

Commands: AssignShard, RebalanceFromTopology

Events: ShardAssigned, ShardMigrated
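
As a sketch of how these intents and facts could become explicit types (reusing the hypothetical NodeInfo above; the struct shapes and fields are assumptions, not the current Aether API):

package cluster

import "time"

// Commands express intent and may be rejected if an invariant would break.
type JoinCluster struct {
    Node NodeInfo
}

type MarkNodeFailed struct {
    NodeID string
    Reason string
}

type AssignShards struct {
    Assignments map[int]string // shard ID -> node ID
}

// Events are immutable facts, published only after a command succeeds.
type NodeJoined struct {
    Node NodeInfo
    At   time.Time
}

type NodeFailed struct {
    NodeID string
    Reason string
    At     time.Time
}

type ShardAssigned struct {
    ShardID int
    NodeID  string
    At      time.Time
}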


Code Analysis: What's Working & What Isn't

What Works Well (✓)

  • LeaderElection - Correctly implements lease-based election with NATS KV; enforces Invariant 1
  • ConsistentHashRing - Proper consistent hashing with virtual nodes; minimizes shard reshuffling
  • NodeDiscovery - Good heartbeat mechanism (30s interval) for membership discovery
  • Architecture - Interfaces (VMRegistry, Runtime) properly decouple cluster from runtime

What Needs Work (✗)

  1. Anemic aggregates - ClusterManager, ShardManager are data holders, not behavior-enforcing aggregates
  2. No domain events - Topology changes don't publish events; impossible to audit or integrate with other contexts
  3. Responsibility scattered - Invariant validation is spread across multiple places (handleNodeUpdate, checkNodeHealth)
  4. Rebalancing stubbed - ConsistentHashPlacement.RebalanceShards returns the shard map unchanged; it doesn't actually redistribute shards
  5. Implicit commands - Node updates via generic message handlers instead of explicit domain commands
  6. Leadership uses callbacks - LeaderElection publishes via callbacks instead of domain events

Example Gap: When a node joins, the current code:

cm.nodes[update.Node.ID] = update.Node    // Silent update
cm.hashRing.AddNode(update.Node.ID)       // No event
// No way for other contexts to learn "node-5 joined"

Should be:

cm.JoinCluster(nodeInfo)                  // Explicit command
// Publishes: NodeJoined event
// Consumed by: Monitoring, Audit, Actor Runtime contexts
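
A minimal sketch of what that explicit command could look like, building on the existing cm.nodes and cm.hashRing fields; the eventBus field is an assumption, and locking is omitted for brevity:

func (cm *ClusterManager) JoinCluster(node NodeInfo) error {
    // Precondition: only Active nodes may join, keeping Invariant 3 intact.
    if node.Status != Active {
        return fmt.Errorf("node %s must be Active to join", node.ID)
    }
    if _, exists := cm.nodes[node.ID]; exists {
        return fmt.Errorf("node %s already joined", node.ID)
    }

    // Apply the state change.
    cm.nodes[node.ID] = node
    cm.hashRing.AddNode(node.ID)

    // Publish the fact so other contexts can learn "node-5 joined".
    cm.eventBus.Publish(NodeJoined{Node: node, At: time.Now()})
    return nil
}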

Refactoring Impact & Effort

Priority Ranking

High Priority (Blocks Event-Driven Integration)

  1. Extract Cluster commands with invariant validation (Medium effort)
  2. Implement real rebalancing strategy (Medium effort)
  3. Publish domain events (Medium effort)

Medium Priority (Improves Clarity)

  4. Extract MarkNodeFailed command (Low effort)
  5. Centralize shard invariant validation (Low effort)
  6. Add shard migration tracking (High effort, improves robustness)
  7. Publish LeaderElection events (Low effort, improves observability)

Total Effort: ~4-6 weeks (2-3 dev sprints)

Timeline

  • Phase 1 (Week 1): Extract explicit commands (JoinCluster, MarkNodeFailed)
  • Phase 2 (Week 2): Publish domain events (NodeJoined, ShardAssigned, ShardMigrated)
  • Phase 3 (Week 3): Implement real rebalancing (ConsistentHashPlacement)
  • Phase 4 (Week 4): Centralize invariant validation (ShardAssignment)

Success Metrics

After Phase 1:

  • ✓ ClusterManager has explicit command methods
  • ✓ Commands validate preconditions
  • ✓ Commands trigger events

After Phase 2:

  • ✓ All topology changes publish events to NATS
  • ✓ Other contexts can subscribe and react
  • ✓ Full audit trail of topology decisions

After Phase 3:

  • ✓ Adding node → shards actually redistribute to it
  • ✓ Removing node → shards reassigned elsewhere
  • ✓ No orphaned shards

After Phase 4:

  • ✓ Invalid assignments rejected (unhealthy node, orphaned shard)
  • ✓ Invariants validated before applying changes
  • ✓ Cluster state always consistent

Design Decisions

Why Lease-Based Election Instead of Raft?

Chosen: Lease-based (NATS KV with atomic operations)

Rationale:

  • Simpler to reason about and implement
  • Detect failure in 10s (acceptable for coordination)
  • Lower overhead
  • Good enough for a library (not a mission-critical system)

Trade-off: Risk of split-brain if partition persists >10s and both sides have NATS access (mitigated by atomic operations and term incrementing)
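
For illustration, the claim/renew cycle maps onto the nats.go KeyValue API roughly as follows; the bucket is assumed to be created with a ~10s TTL so an unrenewed lease expires, and the key name is made up for this sketch:

package election

import "github.com/nats-io/nats.go"

// tryAcquire claims leadership by atomically creating the leader key.
// Create fails if the key already exists, which enforces a single leader.
func tryAcquire(kv nats.KeyValue, nodeID string) (rev uint64, ok bool) {
    rev, err := kv.Create("leader", []byte(nodeID))
    if err != nil {
        return 0, false // another node holds the lease
    }
    return rev, true
}

// renew extends the lease (called roughly every 3s) by updating the key only
// if our revision is still the latest; if another node took over, Update fails.
func renew(kv nats.KeyValue, nodeID string, lastRev uint64) (rev uint64, ok bool) {
    rev, err := kv.Update("leader", []byte(nodeID), lastRev)
    if err != nil {
        return 0, false // leadership lost
    }
    return rev, true
}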

Why Consistent Hashing for Shard Assignment?

Chosen: Consistent hashing with virtual nodes (150 per node)

Rationale:

  • Minimize shard movement on topology change (crucial for actor locality)
  • Deterministic without central state (nodes can independently compute assignments)
  • Well-proven in distributed systems (Dynamo, Cassandra)

Trade-off: May not achieve perfect load balance (mitigated by allowing custom PlacementStrategy)
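
A minimal sketch of the technique (the virtual node count matches the 150 above; the hash function and type names are illustrative, not Aether's ConsistentHashRing API):

package hashring

import (
    "fmt"
    "hash/fnv"
    "sort"
)

const virtualNodes = 150 // virtual nodes per physical node

type Ring struct {
    hashes []uint32          // sorted virtual-node hashes
    owners map[uint32]string // virtual-node hash -> physical node ID
}

func New() *Ring { return &Ring{owners: make(map[uint32]string)} }

func hashKey(key string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(key))
    return h.Sum32()
}

// AddNode places 150 virtual nodes for the physical node on the ring,
// so only ~1/N of the shards move when a node joins or leaves.
func (r *Ring) AddNode(nodeID string) {
    for i := 0; i < virtualNodes; i++ {
        h := hashKey(fmt.Sprintf("%s#%d", nodeID, i))
        r.owners[h] = nodeID
        r.hashes = append(r.hashes, h)
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Owner returns the node responsible for a shard: the first virtual node
// clockwise from the shard's hash, wrapping around the ring.
func (r *Ring) Owner(shardID int) string {
    if len(r.hashes) == 0 {
        return ""
    }
    h := hashKey(fmt.Sprintf("shard-%d", shardID))
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0
    }
    return r.owners[r.hashes[i]]
}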

Why Leader-Only Rebalancing?

Chosen: Only leader can initiate shard rebalancing

Rationale:

  • Prevent cascading rebalancing decisions from multiple nodes
  • Single source of truth for topology
  • Simplifies invariant enforcement

Trade-off: Leader is bottleneck if rebalancing is expensive (mitigated by leader delegating to algorithms)
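
In code the guard is small; a sketch assuming an IsLeader accessor on the election component:

// onTopologyChange sketches the leader-only guard.
func (cm *ClusterManager) onTopologyChange() {
    if !cm.election.IsLeader() {
        return // followers wait for the leader's ShardAssigned / ShardMigrated events
    }
    cm.RebalanceShards()
}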


Key Policies (Automated Reactions)

The cluster enforces these policies to maintain invariants:

| Policy | Trigger | Action | Rationale |
| --- | --- | --- | --- |
| Single Leader | LeadershipLost | ElectLeader | Ensure leadership is re-established |
| Lease Renewal | Every 3s | RenewLeadership | Detect leader failure after 10s |
| Node Failure Detection | Every 30s | Check LastSeen; if >90s, MarkNodeFailed | Detect crash/network partition |
| Rebalancing Trigger | NodeJoined/NodeFailed | RebalanceShards (if leader) | Redistribute load on topology change |
| Shard Coverage | Periodic + after failures | Validate all shards assigned | Prevent shard orphaning |
| Graceful Shutdown | NodeDiscovery.Stop() | Announce NodeLeft | Signal intentional leave (no 90s timeout) |
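
As an example, the Node Failure Detection policy could be sketched as a periodic check (30s tick, 90s threshold, per the table above); the LastSeen field and the MarkNodeFailed signature are assumptions:

// checkNodeHealth runs on a 30s tick and marks nodes failed when their
// heartbeat is older than 90s, which in turn triggers the rebalancing policy.
func (cm *ClusterManager) checkNodeHealth(now time.Time) {
    for id, node := range cm.nodes {
        if node.Status != Active {
            continue
        }
        if now.Sub(node.LastSeen) > 90*time.Second {
            cm.MarkNodeFailed(id, "no heartbeat for more than 90s")
        }
    }
}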

Testing Strategy

Unit Tests

  • Commands validate invariants ✓
  • Events publish correctly ✓
  • Value objects enforce constraints ✓
  • Strategies compute assignments ✓
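
The first bullet, for instance, could look like the following; the constructor and command signatures are hypothetical and build on the sketched types above:

package cluster

import "testing"

func TestAssignShardRejectsUnhealthyNode(t *testing.T) {
    cm := NewClusterManager() // hypothetical constructor for the sketch
    cm.JoinCluster(NodeInfo{ID: "node-1", Status: Active})
    cm.MarkNodeFailed("node-1", "simulated failure")

    // Invariant 3: shards may only be assigned to Active nodes.
    if err := cm.AssignShard(42, "node-1"); err == nil {
        t.Fatal("expected assignment to a failed node to be rejected")
    }
}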

Integration Tests

  • Single leader election (3 nodes) ✓
  • Leader failure → new leader within 10s ✓
  • Node join → shards redistributed ✓
  • Node failure → shards reassigned ✓
  • Graceful shutdown → no false failures ✓

Chaos Tests

  • Leader fails mid-rebalance → recovers ✓
  • Network partition → no split-brain ✓
  • Cascading failures → stabilizes ✓
  • High churn → topology converges ✓

Observability & Monitoring

Key Metrics

# Topology
cluster.nodes.count [active|draining|failed]
cluster.shards.assigned [0, 1024]
cluster.shards.orphaned [0, 1024]  # RED if > 0

# Leadership
cluster.leader.is_leader [0|1]
cluster.leader.term
cluster.leader.lease_expires_in_seconds

# Rebalancing
cluster.rebalancing.triggered [reason]
cluster.rebalancing.active [0|1]
cluster.rebalancing.completed [shards_moved]

# Node Health
cluster.node.heartbeat_latency_ms [per node]
cluster.node.load [per node]
cluster.node.vm_count [per node]
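
A sketch of how a few of these could be exported with Prometheus's client_golang; the library choice and the underscore metric names are assumptions, mapping 1:1 onto the dotted names above:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // cluster.nodes.count [active|draining|failed]
    NodesCount = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "cluster_nodes_count", Help: "Nodes by status"},
        []string{"status"},
    )
    // cluster.shards.orphaned — should always be 0 (invariant 2)
    ShardsOrphaned = prometheus.NewGauge(
        prometheus.GaugeOpts{Name: "cluster_shards_orphaned", Help: "Shards without an owner"},
    )
    // cluster.leader.is_leader [0|1]
    IsLeader = prometheus.NewGauge(
        prometheus.GaugeOpts{Name: "cluster_leader_is_leader", Help: "1 if this node is the leader"},
    )
)

func init() {
    prometheus.MustRegister(NodesCount, ShardsOrphaned, IsLeader)
}

// Example update after a topology change:
//   NodesCount.WithLabelValues("active").Set(float64(activeNodes))
//   ShardsOrphaned.Set(float64(orphaned))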

Alerts

  • Leader heartbeat missing > 5s → election stuck
  • Rebalancing running > 5 min → likely stuck; investigate
  • Orphaned shards > 0 → CRITICAL (invariant violation)
  • More than 50% of nodes failed → investigate

Integration with Other Contexts

Once Cluster Coordination publishes domain events, other contexts can react:

Actor Runtime Context

Subscribes to: ShardMigrated event
Action: Migrate actors from old node to new node
Why: When shards move, actors must follow

Monitoring Context

Subscribes to: NodeJoined, NodeFailed, LeaderElected
Action: Update cluster health dashboard
Why: Operators need visibility into topology

Audit Context

Subscribes to: NodeJoined, NodeFailed, ShardAssigned, LeaderElected
Action: Record topology change log
Why: Compliance, debugging, replaying state
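
For illustration, a Monitoring-context subscriber over core NATS could look like the sketch below; the subject name and JSON payload shape are assumptions, not an existing Aether contract:

package monitoring

import (
    "encoding/json"
    "log"

    "github.com/nats-io/nats.go"
)

// ShardMigrated mirrors the event shape assumed in this document.
type ShardMigrated struct {
    ShardID  int    `json:"shard_id"`
    FromNode string `json:"from_node"`
    ToNode   string `json:"to_node"`
}

// SubscribeShardMigrated reacts to shard migrations published by the cluster.
// The subject "cluster.events.shard_migrated" is made up for this sketch.
func SubscribeShardMigrated(nc *nats.Conn) (*nats.Subscription, error) {
    return nc.Subscribe("cluster.events.shard_migrated", func(msg *nats.Msg) {
        var ev ShardMigrated
        if err := json.Unmarshal(msg.Data, &ev); err != nil {
            log.Printf("bad ShardMigrated payload: %v", err)
            return
        }
        log.Printf("shard %d moved from %s to %s", ev.ShardID, ev.FromNode, ev.ToNode)
    })
}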


Known Limitations & Gaps

Current Limitations

  1. No quorum-based election - Single leader with lease; could add quorum for stronger consistency
  2. No actor migration semantics - Cluster signals ShardMigrated, but the application must implement the migration itself
  3. No topology versioning - ShardMap.Version exists but not enforced for consistency
  4. No leader handoff - If leader fails mid-rebalance, new leader may redo migrations
  5. No split-brain detection - Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but system doesn't validate)

Acceptable for Now

  • Eventual consistency on topology - Non-leaders lag by ~100ms (acceptable for routing)
  • 90s failure detection - Allows for network jitter; can be accelerated by application
  • No strong consistency - Leadership is strongly consistent (atomic KV); topology is eventually consistent (NATS pub/sub)

Deliverables

Five comprehensive documents have been created in /Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/:

  1. INDEX.md (11 KB) - Navigation guide for all documents
  2. DOMAIN_MODEL.md (43 KB) - Complete tactical DDD model with invariants, aggregates, commands, events, policies
  3. REFACTORING_SUMMARY.md (16 KB) - Gap analysis and prioritized 4-phase implementation plan
  4. ARCHITECTURE.md (37 KB) - Visual reference with diagrams, decision trees, state machines, failure scenarios
  5. PATTERNS.md (30 KB) - Side-by-side code examples showing current vs intended implementations

Total: ~140 KB of documentation with detailed guidance for implementation


Next Steps

Immediate (This Sprint)

  1. Review DOMAIN_MODEL.md with team (1 hour meeting)
  2. Confirm invariants are correct (discussion)
  3. Agree on Phase 1 priorities (which commands first?)

Short-Term (Next Sprint)

  1. Implement Phase 1: Extract explicit commands (JoinCluster, MarkNodeFailed)
  2. Add unit tests for commands
  3. Code review against PATTERNS.md examples

Medium-Term (Following Sprints)

  1. Phase 2: Publish domain events
  2. Phase 3: Implement real rebalancing
  3. Phase 4: Centralize invariant validation

Integration

  1. Once events are published, other contexts (Actor Runtime, Monitoring) can subscribe
  2. Enables proper event-driven architecture
  3. Full audit trail becomes available

Questions & Discussion Points

  1. Are the 5 invariants correct? Do we have all the non-negotiable rules captured?
  2. Are the aggregate boundaries clear? Should Cluster own ShardAssignment, or is it independent?
  3. Is the 4-phase plan realistic? Do we have capacity? Should we combine phases?
  4. Which contexts will consume events? Who needs NodeJoined? ShardMigrated? LeaderElected?
  5. Do we need stronger consistency? Should we add quorum-based election? Or is lease-based sufficient?

Conclusion

The Cluster Coordination context has solid foundations but needs DDD patterns to reach its full potential:

  • Current state: Functional but opaque (hard to audit, hard to integrate, hard to test)
  • Intended state: Event-driven, auditable, testable, properly aggregated (medium effort)
  • Impact: Enables event-sourced architecture, cross-context communication, observability

The refactoring is realistic and phased, allowing incremental value delivery. Phase 1 alone (explicit commands) provides immediate clarity. Phase 2 (events) unblocks other contexts.

Recommendation: Start with Phase 1 (Week 1) to validate the DDD approach. If the team finds value, continue to Phases 2-4. If not, we still have clearer domain models for reference.


Document References

| Document | Purpose | Best For | Size |
| --- | --- | --- | --- |
| INDEX.md | Navigation guide | Quick start, finding what you need | 11 KB |
| DOMAIN_MODEL.md | Complete DDD model | Understanding the domain, design review | 43 KB |
| REFACTORING_SUMMARY.md | Implementation plan | Planning work, estimating effort | 16 KB |
| ARCHITECTURE.md | System design & diagrams | Understanding behavior, debugging, tuning | 37 KB |
| PATTERNS.md | Code examples | Writing the refactoring code | 30 KB |

Start: INDEX.md

For implementation: PATTERNS.md

For design review: DOMAIN_MODEL.md

For planning: REFACTORING_SUMMARY.md


About This Analysis

This domain model was created using systematic Domain-Driven Design analysis:

  1. Identified invariants first - What business rules must never break?
  2. Modeled aggregates around invariants - Which entities enforce which rules?
  3. Designed commands & events - What intents and facts describe state changes?
  4. Compared with existing code - What's intended vs actual?
  5. Prioritized refactoring - What to fix first, second, third?

The approach follows Eric Evans' Domain-Driven Design (2003) and tactical patterns like aggregates, value objects, and event sourcing.


Created: January 12, 2026

By: Domain Modeling Analysis (Claude)

For: Aether Project - Cluster Coordination Bounded Context