Hugo Nijhuis 271f5db444
Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a
dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-12 23:57:20 +01:00


Cluster Coordination: Domain Model Index

This directory contains a complete Domain-Driven Design model for the Cluster Coordination bounded context in Aether. Use this index to navigate the documentation.


Quick Start

Start here if you're new to this analysis:

  1. Read DOMAIN_MODEL.md Summary section (1-2 min)
  2. Skim the Invariants section to understand the constraints (2 min)
  3. Read REFACTORING_SUMMARY.md Overview: Code vs Domain Model (5 min)
  4. Choose your next step based on your role (see below)

Documents Overview

DOMAIN_MODEL.md - Comprehensive DDD Model

What: Complete tactical DDD model with aggregates, commands, events, policies, and read models

Contains:

  • Cluster Coordination context summary
  • 5 core invariants (single leader, shard coverage, etc.)
  • 3 root aggregates: Cluster, LeadershipLease, ShardAssignment
  • 6 commands: JoinCluster, ElectLeader, MarkNodeFailed, etc.
  • 11 events: NodeJoined, LeaderElected, ShardMigrated, etc.
  • 10 policies: Single Leader Policy, Lease Renewal Policy, etc.
  • 5 read models: GetClusterTopology, GetLeader, GetShardAssignments, etc.
  • 4 value objects: NodeInfo, ShardMap, LeadershipLease, Term
  • Code analysis comparing intended vs actual implementation
  • 7 refactoring issues with impact assessment
  • Testing strategy (unit, integration, chaos tests)
  • Boundary conditions and limitations
  • Alignment with product vision

Best for: Understanding the complete domain model, identifying what needs to change

Time: 30-40 minutes for thorough read


REFACTORING_SUMMARY.md - Implementation Roadmap

What: Prioritized refactoring plan with 4-phase implementation strategy

Contains:

  • Current state vs intended state (what's working, what's broken)
  • Gap analysis (6 major gaps identified)
  • Priority matrix (High/Medium/Low priority issues)
  • 4-phase refactoring plan:
    • Phase 1: Extract cluster commands (Week 1)
    • Phase 2: Publish domain events (Week 2)
    • Phase 3: Implement real rebalancing (Week 3)
    • Phase 4: Unify shard invariants (Week 4)
  • Code examples for each phase
  • Testing checklist
  • Success metrics
  • Integration with other contexts

Best for: Planning implementation, deciding what to do first, estimating effort

Time: 20-30 minutes for full review


ARCHITECTURE.md - Visual Reference & Decision Trees

What: Diagrams, flowcharts, and decision trees for understanding cluster behavior

Contains:

  • High-level architecture diagram
  • Aggregate boundaries diagram
  • 3 command flow diagrams with decision points
  • 3 decision trees (Is node healthy? Should rebalance? Can assign shard?)
  • State transition diagrams (cluster, node, leadership)
  • Concurrency model and thread safety explanation
  • Event sequences with timelines
  • Configuration parameters and tuning guide
  • Failure scenarios & recovery procedures
  • Monitoring & observability metrics
  • Alerts and SLOs

Best for: Understanding how the system works, debugging issues, planning changes

Time: 20-30 minutes; skim decision trees as needed


PATTERNS.md - Code Patterns & Examples

What: Side-by-side code comparisons showing how to evolve the implementation

Contains:

  • 6 refactoring patterns with current vs intended code:
    1. Commands vs Message Handlers
    2. Value Objects vs Primitives
    3. Event Publishing (no events → explicit events)
    4. Invariant Validation (scattered → centralized)
    5. Rebalancing Strategy (stubbed → real implementation)
    6. Testing Aggregates (hard to test → testable with mocks)
  • Full code examples for each pattern
  • Benefits of each approach
  • Mock implementations for testing

Best for: Developers writing the refactoring code, understanding specific patterns

Time: 30-40 minutes to read all examples


Navigation by Role

Product Manager / Tech Lead

Goal: Understand what needs to change and why

  1. Read REFACTORING_SUMMARY.md Overview (5 min)
  2. Read REFACTORING_SUMMARY.md Refactoring Priority Matrix (3 min)
  3. Read REFACTORING_SUMMARY.md Refactoring Plan - Phase 1 only (5 min)
  4. Decide: Which phases to commit to? Which timeline?

Time: 15 minutes


Developer (Implementing Refactoring)

Goal: Understand how to write the code

  1. Skim DOMAIN_MODEL.md Summary (2 min)
  2. Read DOMAIN_MODEL.md Invariants (5 min) - what must never break?
  3. Read DOMAIN_MODEL.md Aggregates (5 min) - who owns what?
  4. Read DOMAIN_MODEL.md Commands (5 min) - what actions are there?
  5. Read PATTERNS.md sections relevant to your phase (10-20 min)
  6. Refer to ARCHITECTURE.md Decision Trees as you code (on-demand)

Time: 30-50 minutes of reading; then 2-8 hours of coding per phase


Architect / Design Reviewer

Goal: Validate the domain model and refactoring plan

  1. Read DOMAIN_MODEL.md completely (40 min)
  2. Review REFACTORING_SUMMARY.md Current State (10 min)
  3. Scan ARCHITECTURE.md diagrams (10 min)
  4. Review PATTERNS.md for code quality (15 min)
  5. Provide feedback on:
    • Are the invariants correct and complete?
    • Are the aggregate boundaries clear?
    • Is the refactoring plan realistic?
    • Are we missing any patterns?

Time: 60-90 minutes


QA / Tester

Goal: Understand what to test

  1. Read DOMAIN_MODEL.md Testing Strategy (5 min)
  2. Read REFACTORING_SUMMARY.md Testing Checklist (5 min)
  3. Read ARCHITECTURE.md Failure Scenarios (10 min)
  4. Read PATTERNS.md Pattern 6: Testing Aggregates (15 min)
  5. Create test plan covering:
    • Unit tests for commands
    • Integration tests for full scenarios
    • Chaos tests for resilience

Time: 40 minutes of planning; then test writing


Operator / DevOps

Goal: Understand how to monitor and operate

  1. Read ARCHITECTURE.md Monitoring & Observability (10 min)
  2. Read ARCHITECTURE.md Configuration & Tuning (10 min)
  3. Read ARCHITECTURE.md Failure Scenarios (15 min)
  4. Plan:
    • Which metrics to export?
    • Which alerts to set?
    • How to detect issues?
    • How to recover?

Time: 35 minutes


Key Concepts

Invariants

Business rules that must NEVER be violated. The core of the domain model.

  • I1: At most one leader per term
  • I2: All active shards have owners
  • I3: Shards only assigned to healthy nodes
  • I4: Shard assignments stable during lease
  • I5: Leader is an active node
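
As a rough illustration, three of these invariants (I2, I3, I5) can be checked against a state snapshot. This is a minimal sketch, not the project's code: `Snapshot`, `CheckInvariants`, and the field names are hypothetical, and "healthy" is assumed to mean any known node that is not Failed. I1 and I4 depend on terms and lease timing and are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus mirrors the Active|Draining|Failed states used in this index.
type NodeStatus string

const (
	Active   NodeStatus = "Active"
	Draining NodeStatus = "Draining"
	Failed   NodeStatus = "Failed"
)

// Snapshot is a hypothetical, simplified view of cluster state.
type Snapshot struct {
	Nodes  map[string]NodeStatus // node ID -> status
	Shards map[int][]string      // shard ID -> owning node IDs
	Leader string                // current leader node ID
}

// CheckInvariants returns one error per violation of I2, I3, or I5.
func CheckInvariants(s Snapshot) []error {
	var errs []error
	for shard, owners := range s.Shards {
		if len(owners) == 0 {
			errs = append(errs, fmt.Errorf("I2: shard %d has no owner", shard))
		}
		for _, id := range owners {
			// Assumption: unknown nodes count as unhealthy.
			if st, ok := s.Nodes[id]; !ok || st == Failed {
				errs = append(errs, fmt.Errorf("I3: shard %d assigned to unhealthy node %q", shard, id))
			}
		}
	}
	if s.Nodes[s.Leader] != Active {
		errs = append(errs, errors.New("I5: leader is not an active node"))
	}
	return errs
}

func main() {
	s := Snapshot{
		Nodes:  map[string]NodeStatus{"a": Active, "b": Failed},
		Shards: map[int][]string{0: {"a"}, 1: {"b"}},
		Leader: "a",
	}
	for _, err := range CheckInvariants(s) {
		fmt.Println(err)
	}
}
```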

Aggregates

Clusters of entities enforcing invariants. Aggregate roots own state changes.

  • Cluster (root) - owns topology, shard assignments
  • LeadershipLease (root) - owns leadership
  • ShardAssignment (root) - owns shard-to-node mappings

Commands

Explicit intent to change state. Named with domain language.

  • JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards

Events

Facts that happened. Published after successful commands.

  • NodeJoined, NodeFailed, LeaderElected, ShardAssigned, ShardMigrated
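
The relationship between a command and the event it produces might look like the following. This is a sketch under stated assumptions: the `Cluster` type, its `Handle` method, and the duplicate-join rule are hypothetical and only illustrate "validate, apply, then emit a fact".

```go
package main

import (
	"errors"
	"fmt"
)

// JoinCluster is a command: an explicit request to change state.
type JoinCluster struct{ NodeID string }

// NodeJoined is an event: a fact recorded after the command succeeded.
type NodeJoined struct{ NodeID string }

// Cluster is a hypothetical, minimal aggregate root.
type Cluster struct{ nodes map[string]bool }

func NewCluster() *Cluster { return &Cluster{nodes: make(map[string]bool)} }

// Handle validates the command, applies it, and returns the resulting event.
// On failure no state changes and no event is produced.
func (c *Cluster) Handle(cmd JoinCluster) (NodeJoined, error) {
	if cmd.NodeID == "" {
		return NodeJoined{}, errors.New("node ID must not be empty")
	}
	if c.nodes[cmd.NodeID] {
		return NodeJoined{}, fmt.Errorf("node %q already joined", cmd.NodeID)
	}
	c.nodes[cmd.NodeID] = true
	return NodeJoined{NodeID: cmd.NodeID}, nil
}

func main() {
	c := NewCluster()
	ev, err := c.Handle(JoinCluster{NodeID: "node-1"})
	fmt.Println(ev, err)
	_, err = c.Handle(JoinCluster{NodeID: "node-1"})
	fmt.Println(err)
}
```

Returning the event from the handler keeps publishing separate from state changes, so the aggregate can be tested without a message bus.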

Policies

Automated reactions. Connect events to commands.

  • "When NodeJoined then RebalanceShards"
  • "When LeadershipLost then ElectLeader"
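
The two policies above can be read as a function from event to command. The sketch below assumes hypothetical `Event` and `Command` interfaces and a `React` function; the field on `LeadershipLost` is illustrative only.

```go
package main

import "fmt"

// Event and Command are hypothetical marker interfaces for this sketch.
type Event interface{ EventName() string }
type Command interface{ CommandName() string }

type NodeJoined struct{ NodeID string }
type LeadershipLost struct{ Term uint64 }
type RebalanceShards struct{}
type ElectLeader struct{}

func (NodeJoined) EventName() string        { return "NodeJoined" }
func (LeadershipLost) EventName() string    { return "LeadershipLost" }
func (RebalanceShards) CommandName() string { return "RebalanceShards" }
func (ElectLeader) CommandName() string     { return "ElectLeader" }

// React maps an event to the command it triggers, or nil if no policy applies.
func React(e Event) Command {
	switch e.(type) {
	case NodeJoined:
		return RebalanceShards{} // "When NodeJoined then RebalanceShards"
	case LeadershipLost:
		return ElectLeader{} // "When LeadershipLost then ElectLeader"
	default:
		return nil
	}
}

func main() {
	fmt.Println(React(NodeJoined{NodeID: "node-1"}).CommandName())
	fmt.Println(React(LeadershipLost{Term: 3}).CommandName())
}
```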

Glossary

Term | Definition
--- | ---
Bounded Context | A boundary within which a domain model is consistent (Cluster Coordination)
Aggregate | A cluster of entities enforcing business invariants; transactional boundary
Aggregate Root | The only entity in an aggregate that external code references
Invariant | A business rule that must always be true
Command | A request to change state (intent-driven)
Event | A fact that happened in the past (immutable)
Policy | An automated reaction to events; connects contexts
Read Model | A projection of state optimized for queries (no invariants)
Value Object | Immutable object defined by attributes, not identity
CQRS | Command Query Responsibility Segregation (commands change state; queries read state)
Event Sourcing | Storing events as source of truth; state is derived by replay

Upstream (External Dependencies):

  • NATS - Provides pub/sub, KV store, JetStream
  • Local Runtime - Executes actors on this node
  • Event Store - Persists cluster events (optional)

Downstream (Consumers):

  • Actor Runtime Context - Migrates actors when shards move (reacts to ShardMigrated)
  • Monitoring Context - Tracks health and events (subscribes to topology events)
  • Audit Context - Records all topology changes (subscribes to all events)

Quick Reference: Decision Trees

Is a node healthy?

Node found? → Check status (Active|Draining|Failed)
             → Active/Draining? → YES
             → Failed? → NO
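
In code, this tree might reduce to a single lookup. The sketch assumes a hypothetical `IsHealthy` function and treats an unknown node as unhealthy, which the tree above does not state explicitly.

```go
package main

import "fmt"

type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// IsHealthy follows the tree: Active or Draining -> true, Failed -> false.
// Assumption: a node that is not found is reported as unhealthy.
func IsHealthy(nodes map[string]NodeStatus, id string) bool {
	status, found := nodes[id]
	if !found {
		return false
	}
	return status == Active || status == Draining
}

func main() {
	nodes := map[string]NodeStatus{"a": Active, "b": Draining, "c": Failed}
	for _, id := range []string{"a", "b", "c", "missing"} {
		fmt.Println(id, IsHealthy(nodes, id))
	}
}
```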

Should we rebalance?

Is this node the leader? → YES
Any active nodes? → YES
Strategy.Rebalance() → returns new ShardMap
New ShardMap passes invariant validation? → YES
Publish ShardMigrated events
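
The rebalance steps could be sketched as below. Everything except the names `Strategy.Rebalance`, `ShardMap`, and `ShardMigrated` is an assumption: the interface signature, the single-owner `ShardMap`, the toy `RoundRobin` strategy, and the validation rule (every shard keeps an active owner) are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ShardMap maps shard ID to its owning node ID (single owner, for brevity).
type ShardMap map[int]string

// Strategy is a hypothetical interface for the Strategy.Rebalance() step.
type Strategy interface {
	Rebalance(current ShardMap, activeNodes []string) ShardMap
}

// RoundRobin is a toy strategy used only for illustration.
type RoundRobin struct{}

func (RoundRobin) Rebalance(current ShardMap, nodes []string) ShardMap {
	shards := make([]int, 0, len(current))
	for s := range current {
		shards = append(shards, s)
	}
	sort.Ints(shards) // deterministic assignment order
	next := make(ShardMap, len(current))
	for i, s := range shards {
		next[s] = nodes[i%len(nodes)]
	}
	return next
}

// ShardMigrated is emitted for each shard whose owner changed.
type ShardMigrated struct {
	Shard    int
	From, To string
}

// Rebalance walks the decision tree and returns the new map plus events.
func Rebalance(isLeader bool, active []string, current ShardMap, st Strategy) (ShardMap, []ShardMigrated, error) {
	if !isLeader {
		return nil, nil, errors.New("only the leader may rebalance")
	}
	if len(active) == 0 {
		return nil, nil, errors.New("no active nodes")
	}
	next := st.Rebalance(current, active)

	// Validate: every shard must keep an owner from the active set.
	isActive := make(map[string]bool, len(active))
	for _, n := range active {
		isActive[n] = true
	}
	for s := range current {
		if !isActive[next[s]] {
			return nil, nil, fmt.Errorf("shard %d has no active owner", s)
		}
	}

	var events []ShardMigrated
	for s, from := range current {
		if to := next[s]; to != from {
			events = append(events, ShardMigrated{Shard: s, From: from, To: to})
		}
	}
	return next, events, nil
}

func main() {
	current := ShardMap{0: "a", 1: "a", 2: "a", 3: "a"}
	next, events, err := Rebalance(true, []string{"a", "b"}, current, RoundRobin{})
	fmt.Println(next, len(events), err)
}
```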

Can we assign shard to node?

Node exists? → YES
Status active? → YES
Replication < max? → YES
Add node to shard's replica list

See ARCHITECTURE.md for full decision trees.


Testing Resources

Test Coverage Map:

  • Unit tests: Commands, invariants, value objects
  • Integration tests: Full scenarios (node join, node fail, rebalance)
  • Chaos tests: Partitions, cascading failures, high churn

See PATTERNS.md Pattern 6 for testing patterns and mocks.


Common Questions

Q: Why not use Raft for leader election?
A: Lease-based election is simpler and sufficient for our use case. Raft would be safer but more complex. See DOMAIN_MODEL.md Design Decisions.

Q: What if a leader fails mid-rebalance?
A: The new leader detects the incomplete rebalance and may redo it. This is acceptable because rebalancing is idempotent. See ARCHITECTURE.md Failure Scenarios.

Q: How many shards should we use?
A: The default of 1024 provides good granularity. Tune it based on your cluster size. See ARCHITECTURE.md Configuration & Tuning.

Q: Can actors be lost during rebalancing?
A: No, provided the application implements actor migration correctly. See DOMAIN_MODEL.md Gaps.

Q: Is eventual consistency acceptable?
A: Yes, for topology: replicas lag the leader by about 100 ms. Leadership is strongly consistent because it relies on atomic operations. See DOMAIN_MODEL.md Policies.


Implementation Checklist

  • Read DOMAIN_MODEL.md Summary + Invariants
  • Read REFACTORING_SUMMARY.md Overview
  • Review PATTERNS.md for Phase 1
  • Implement Phase 1 commands (JoinCluster, MarkNodeFailed)
  • Add tests for Phase 1
  • Code review
  • Merge Phase 1
  • Repeat for Phases 2-4

Document Version History

Version | Date | Changes
--- | --- | ---
1.0 | 2026-01-12 | Initial domain model created

Contact & Questions

For questions about this domain model:

  • Domain modeling: Refer to DOMAIN_MODEL.md Invariants & Aggregates sections
  • Implementation: Refer to PATTERNS.md for code examples
  • Architecture: Refer to ARCHITECTURE.md for system design
  • Refactoring plan: Refer to REFACTORING_SUMMARY.md for priorities

Additional Resources