Cluster Coordination: Domain Model Index
This directory contains a complete Domain-Driven Design model for the Cluster Coordination bounded context in Aether. Use this index to navigate the documentation.
Quick Start
Start here if you're new to this analysis:
- Read DOMAIN_MODEL.md Summary section (1-2 min)
- Skim the Invariants section to understand the constraints (2 min)
- Read REFACTORING_SUMMARY.md Overview: Code vs Domain Model (5 min)
- Choose your next step based on your role (see below)
Documents Overview
DOMAIN_MODEL.md - Comprehensive DDD Model
What: Complete tactical DDD model with aggregates, commands, events, policies, read models
Contains:
- Cluster Coordination context summary
- 5 core invariants (single leader, shard coverage, etc.)
- 3 aggregate roots: Cluster, LeadershipLease, ShardAssignment
- 6 commands: JoinCluster, ElectLeader, MarkNodeFailed, etc.
- 11 events: NodeJoined, LeaderElected, ShardMigrated, etc.
- 10 policies: Single Leader Policy, Lease Renewal Policy, etc.
- 5 read models: GetClusterTopology, GetLeader, GetShardAssignments, etc.
- 4 value objects: NodeInfo, ShardMap, LeadershipLease, Term
- Code analysis comparing intended vs actual implementation
- 7 refactoring issues with impact assessment
- Testing strategy (unit, integration, chaos tests)
- Boundary conditions and limitations
- Alignment with product vision
Best for: Understanding the complete domain model, identifying what needs to change
Time: 30-40 minutes for thorough read
REFACTORING_SUMMARY.md - Implementation Roadmap
What: Prioritized refactoring plan with 4-phase implementation strategy
Contains:
- Current state vs intended state (what's working, what's broken)
- Gap analysis (6 major gaps identified)
- Priority matrix (High/Medium/Low priority issues)
- 4-phase refactoring plan:
  - Phase 1: Extract cluster commands (Week 1)
  - Phase 2: Publish domain events (Week 2)
  - Phase 3: Implement real rebalancing (Week 3)
  - Phase 4: Unify shard invariants (Week 4)
- Code examples for each phase
- Testing checklist
- Success metrics
- Integration with other contexts
Best for: Planning implementation, deciding what to do first, estimating effort
Time: 20-30 minutes for full review
ARCHITECTURE.md - Visual Reference & Decision Trees
What: Diagrams, flowcharts, and decision trees for understanding cluster behavior
Contains:
- High-level architecture diagram
- Aggregate boundaries diagram
- 3 command flow diagrams with decision points
- 3 decision trees (Is node healthy? Should rebalance? Can assign shard?)
- State transition diagrams (cluster, node, leadership)
- Concurrency model and thread safety explanation
- Event sequences with timelines
- Configuration parameters and tuning guide
- Failure scenarios & recovery procedures
- Monitoring & observability metrics
- Alerts and SLOs
Best for: Understanding how the system works, debugging issues, planning changes
Time: 20-30 minutes; skim decision trees as needed
PATTERNS.md - Code Patterns & Examples
What: Side-by-side code comparisons showing how to evolve the implementation
Contains:
- 6 refactoring patterns with current vs intended code:
  - Commands vs Message Handlers
  - Value Objects vs Primitives
  - Event Publishing (no events → explicit events)
  - Invariant Validation (scattered → centralized)
  - Rebalancing Strategy (stubbed → real implementation)
  - Testing Aggregates (hard to test → testable with mocks)
- Full code examples for each pattern
- Benefits of each approach
- Mock implementations for testing
Best for: Developers writing the refactoring code, understanding specific patterns
Time: 30-40 minutes to read all examples
Navigation by Role
Product Manager / Tech Lead
Goal: Understand what needs to change and why
- Read REFACTORING_SUMMARY.md Overview (5 min)
- Read REFACTORING_SUMMARY.md Refactoring Priority Matrix (3 min)
- Read REFACTORING_SUMMARY.md Refactoring Plan - Phase 1 only (5 min)
- Decide: Which phases to commit to? Which timeline?
Time: 15 minutes
Developer (Implementing Refactoring)
Goal: Understand how to write the code
- Skim DOMAIN_MODEL.md Summary (2 min)
- Read DOMAIN_MODEL.md Invariants (5 min) - what must never break?
- Read DOMAIN_MODEL.md Aggregates (5 min) - who owns what?
- Read DOMAIN_MODEL.md Commands (5 min) - what actions are there?
- Read PATTERNS.md sections relevant to your phase (10-20 min)
- Refer to ARCHITECTURE.md Decision Trees as you code (on-demand)
Time: 30-50 minutes of reading; then 2-8 hours of coding per phase
Architect / Design Reviewer
Goal: Validate the domain model and refactoring plan
- Read DOMAIN_MODEL.md completely (40 min)
- Review REFACTORING_SUMMARY.md Current State (10 min)
- Scan ARCHITECTURE.md diagrams (10 min)
- Review PATTERNS.md for code quality (15 min)
- Provide feedback on:
  - Are the invariants correct and complete?
  - Are the aggregate boundaries clear?
  - Is the refactoring plan realistic?
  - Are we missing any patterns?
Time: 60-90 minutes
QA / Tester
Goal: Understand what to test
- Read DOMAIN_MODEL.md Testing Strategy (5 min)
- Read REFACTORING_SUMMARY.md Testing Checklist (5 min)
- Read ARCHITECTURE.md Failure Scenarios (10 min)
- Read PATTERNS.md Pattern 6: Testing Aggregates (15 min)
- Create test plan covering:
  - Unit tests for commands
  - Integration tests for full scenarios
  - Chaos tests for resilience
Time: 40 minutes of planning; then test writing
Operator / DevOps
Goal: Understand how to monitor and operate
- Read ARCHITECTURE.md Monitoring & Observability (10 min)
- Read ARCHITECTURE.md Configuration & Tuning (10 min)
- Read ARCHITECTURE.md Failure Scenarios (15 min)
- Plan:
  - Which metrics to export?
  - Which alerts to set?
  - How to detect issues?
  - How to recover?
Time: 35 minutes
Key Concepts
Invariants
Business rules that must NEVER be violated. The core of the domain model.
- I1: At most one leader per term
- I2: All active shards have owners
- I3: Shards only assigned to healthy nodes
- I4: Shard assignments stable during lease
- I5: Leader is an active node
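To make the invariants concrete, here is a minimal Go sketch that checks I2, I3, and I5 against a snapshot of cluster state. The `ClusterState` type, its field names, and `CheckInvariants` are assumptions made for illustration and are not taken from the Aether code. I1 and I4 depend on term and lease history, so they are left out. "Healthy" is read here as Active or Draining, matching the decision tree later in this index.

```go
package main

import "fmt"

// NodeStatus mirrors the statuses named in this index (Active, Draining, Failed).
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// ClusterState is a hypothetical snapshot; the field names are assumptions.
type ClusterState struct {
	Nodes    map[string]NodeStatus // node ID -> status
	Shards   map[int][]string      // shard ID -> owning node IDs
	LeaderID string                // empty when no leader is known
}

// CheckInvariants returns one error per violation of I2, I3, or I5.
func CheckInvariants(s ClusterState) []error {
	var errs []error
	for shard, owners := range s.Shards {
		// I2: every shard must have at least one owner.
		if len(owners) == 0 {
			errs = append(errs, fmt.Errorf("I2: shard %d has no owner", shard))
		}
		// I3: owners must be known nodes that have not failed.
		for _, id := range owners {
			if st, ok := s.Nodes[id]; !ok || st == Failed {
				errs = append(errs, fmt.Errorf("I3: shard %d assigned to unhealthy node %q", shard, id))
			}
		}
	}
	// I5: if a leader is set, it must be an Active node.
	if st, ok := s.Nodes[s.LeaderID]; s.LeaderID != "" && (!ok || st != Active) {
		errs = append(errs, fmt.Errorf("I5: leader %q is not an active node", s.LeaderID))
	}
	return errs
}
```

Returning every violation rather than stopping at the first keeps the check useful in tests and diagnostics, where a full list is easier to act on.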
Aggregates
Clusters of entities that enforce invariants. Aggregate roots own state changes.
- Cluster (root) - owns topology, shard assignments
- LeadershipLease (root) - owns leadership
- ShardAssignment (root) - owns shard-to-node mappings
Commands
Explicit intent to change state. Named with domain language.
- JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards
Events
Facts that happened. Published after successful commands.
- NodeJoined, NodeFailed, LeaderElected, ShardAssigned, ShardMigrated
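The relationship between commands and events can be sketched as follows: the aggregate validates a command, applies the change, and returns the event to publish. This is a hypothetical illustration, not the code in PATTERNS.md; the struct fields, `NewCluster`, and `HandleJoinCluster` are assumed names.

```go
package main

import "errors"

// JoinCluster is a command: an explicit request to change state.
type JoinCluster struct {
	NodeID  string
	Address string
}

// NodeJoined is an event: a fact recorded after the command succeeds.
type NodeJoined struct {
	NodeID  string
	Address string
}

// Cluster is a simplified stand-in for the Cluster aggregate.
type Cluster struct {
	nodes map[string]string // node ID -> address
}

// NewCluster returns an empty cluster.
func NewCluster() *Cluster {
	return &Cluster{nodes: make(map[string]string)}
}

// HandleJoinCluster validates the command, applies it, and returns the
// event that should be published. Rejected commands produce no event.
func (c *Cluster) HandleJoinCluster(cmd JoinCluster) (NodeJoined, error) {
	if cmd.NodeID == "" {
		return NodeJoined{}, errors.New("join rejected: empty node ID")
	}
	if _, exists := c.nodes[cmd.NodeID]; exists {
		return NodeJoined{}, errors.New("join rejected: node already in cluster")
	}
	c.nodes[cmd.NodeID] = cmd.Address
	return NodeJoined{NodeID: cmd.NodeID, Address: cmd.Address}, nil
}
```

Returning the event from the handler, instead of publishing inside it, keeps the aggregate free of messaging dependencies and easy to unit test.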
Policies
Automated reactions. Connect events to commands.
- "When NodeJoined then RebalanceShards"
- "When LeadershipLost then ElectLeader"
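The two policies above amount to a lookup from event to command. A minimal sketch, assuming events and commands are identified by name only (real types would carry payloads); `Event`, `Command`, `policies`, and `React` are illustrative names:

```go
package main

// Event and Command are identified by name here for brevity.
type Event string
type Command string

// policies maps each triggering event to the command it should issue.
var policies = map[Event]Command{
	"NodeJoined":     "RebalanceShards",
	"LeadershipLost": "ElectLeader",
}

// React returns the command a policy issues for the event.
// The boolean is false when no policy reacts to that event.
func React(e Event) (Command, bool) {
	cmd, ok := policies[e]
	return cmd, ok
}
```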
Glossary
| Term | Definition |
|---|---|
| Bounded Context | A boundary within which a domain model is consistent (Cluster Coordination) |
| Aggregate | A cluster of entities enforcing business invariants; transactional boundary |
| Aggregate Root | The only entity in an aggregate that external code references |
| Invariant | A business rule that must always be true |
| Command | A request to change state (intent-driven) |
| Event | A fact that happened in the past (immutable) |
| Policy | An automated reaction to events; connects contexts |
| Read Model | A projection of state optimized for queries (no invariants) |
| Value Object | Immutable object defined by attributes, not identity |
| CQRS | Command Query Responsibility Segregation (commands change state; queries read state) |
| Event Sourcing | Storing events as source of truth; state is derived by replay |
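The Event Sourcing entry says state is derived by replay. A minimal sketch of that idea, using two event names from this index; the `Event` struct and `Replay` function are assumptions for illustration and do not reflect the EventStore interface:

```go
package main

// Event is a minimal stand-in; Kind uses event names from this index.
type Event struct {
	Kind   string // "NodeJoined" or "NodeFailed"
	NodeID string
}

// Replay derives the set of live nodes by folding over the log in order.
// Unknown event kinds are ignored.
func Replay(log []Event) map[string]bool {
	live := make(map[string]bool)
	for _, e := range log {
		switch e.Kind {
		case "NodeJoined":
			live[e.NodeID] = true
		case "NodeFailed":
			delete(live, e.NodeID)
		}
	}
	return live
}
```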
Related Context Maps
Upstream (External Dependencies):
- NATS - Provides pub/sub, KV store, JetStream
- Local Runtime - Executes actors on this node
- Event Store - Persists cluster events (optional)
Downstream (Consumers):
- Actor Runtime Context - Migrates actors when shards move (reacts to ShardMigrated)
- Monitoring Context - Tracks health and events (subscribes to topology events)
- Audit Context - Records all topology changes (subscribes to all events)
Quick Reference: Decision Trees
Is a node healthy?
Node found? → Check status (Active|Draining|Failed)
  → Active/Draining? → YES
  → Failed? → NO
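The health check above translates directly into a small function. This sketch assumes a node that is not found counts as unhealthy, which the tree does not state explicitly; the `NodeStatus` type and `IsHealthy` name are illustrative.

```go
package main

// NodeStatus mirrors the three statuses in the decision tree.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// IsHealthy follows the tree: Active and Draining nodes are healthy,
// Failed nodes are not. Unknown nodes are treated as unhealthy (assumption).
func IsHealthy(nodes map[string]NodeStatus, id string) bool {
	status, found := nodes[id]
	if !found {
		return false
	}
	return status == Active || status == Draining
}
```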
Should we rebalance?
Leader? → YES
  Active nodes? → YES
    Strategy.Rebalance() → returns new ShardMap
      Validate invariants? → YES
        Publish ShardMigrated events
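The rebalance flow above can be sketched as a single function that runs each check in order and returns the events to publish. The `Strategy` interface signature, the single-owner `ShardMap`, and the `roundRobin` toy strategy are assumptions for illustration; the real types are described in DOMAIN_MODEL.md and PATTERNS.md. Validation here covers only "every shard lands on an active node".

```go
package main

import (
	"errors"
	"fmt"
)

// ShardMap maps shard ID -> owning node ID (single owner, for brevity).
type ShardMap map[int]string

// Strategy is a hypothetical interface for the Strategy.Rebalance() step.
type Strategy interface {
	Rebalance(current ShardMap, activeNodes []string) ShardMap
}

// ShardMigrated records that a shard moved between nodes.
type ShardMigrated struct {
	Shard    int
	From, To string
}

// Rebalance walks the tree: leader check, active-node check, strategy,
// validation, then one ShardMigrated event per shard that changed owner.
func Rebalance(isLeader bool, active []string, current ShardMap, s Strategy) ([]ShardMigrated, error) {
	if !isLeader {
		return nil, errors.New("only the leader may rebalance")
	}
	if len(active) == 0 {
		return nil, errors.New("no active nodes")
	}
	next := s.Rebalance(current, active)

	// Validate: every shard in the proposed map must be owned by an active node.
	activeSet := make(map[string]bool, len(active))
	for _, id := range active {
		activeSet[id] = true
	}
	for shard, owner := range next {
		if !activeSet[owner] {
			return nil, fmt.Errorf("shard %d assigned to inactive node %q", shard, owner)
		}
	}

	var events []ShardMigrated
	for shard, owner := range next {
		if prev := current[shard]; prev != owner {
			events = append(events, ShardMigrated{Shard: shard, From: prev, To: owner})
		}
	}
	return events, nil
}

// roundRobin is a toy strategy used only to exercise the flow.
type roundRobin struct{}

func (roundRobin) Rebalance(current ShardMap, active []string) ShardMap {
	next := make(ShardMap, len(current))
	for shard := range current {
		next[shard] = active[shard%len(active)]
	}
	return next
}
```

Because Go map iteration order is unspecified, the order of the returned events is not deterministic in this sketch.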
Can we assign shard to node?
Node exists? → YES
  Status active? → YES
    Replication < max? → YES
      Add node to shard's replica list
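The assignment check above can be sketched as a guard function that returns an error at the first failed step. The `AssignShard` name, its parameters, and the replica map layout are assumptions for illustration. The tree does not mention duplicate replicas, so this sketch does not check for them.

```go
package main

import "errors"

// NodeStatus mirrors the statuses used elsewhere in this index.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// AssignShard adds nodeID to the shard's replica list only if every check
// in the tree passes: the node exists, it is Active, and the shard is
// below its maximum replication.
func AssignShard(nodes map[string]NodeStatus, replicas map[int][]string, shard int, nodeID string, maxReplicas int) error {
	status, exists := nodes[nodeID]
	if !exists {
		return errors.New("node does not exist")
	}
	if status != Active {
		return errors.New("node is not active")
	}
	if len(replicas[shard]) >= maxReplicas {
		return errors.New("shard already at max replication")
	}
	replicas[shard] = append(replicas[shard], nodeID)
	return nil
}
```

Note that this check is stricter than the health check: a Draining node is healthy but is not eligible for new shards.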
See ARCHITECTURE.md for full decision trees.
Testing Resources
Test Coverage Map:
- Unit tests: Commands, invariants, value objects
- Integration tests: Full scenarios (node join, node fail, rebalance)
- Chaos tests: Partitions, cascading failures, high churn
See PATTERNS.md Pattern 6 for testing patterns and mocks.
Common Questions
Q: Why not use Raft for leader election?
A: Lease-based election is simpler and sufficient for our use case. Raft would be safer but more complex. See DOMAIN_MODEL.md Design Decisions.
Q: What if a leader fails mid-rebalance?
A: The new leader detects the incomplete rebalance and may redo it. This is acceptable because rebalancing is idempotent. See ARCHITECTURE.md Failure Scenarios.
Q: How many shards should we use?
A: The default of 1024 provides good granularity. Tune it based on your cluster size. See ARCHITECTURE.md Configuration & Tuning.
Q: Can actors be lost during rebalancing?
A: Not if the application correctly implements actor migration. See DOMAIN_MODEL.md Gaps.
Q: Is eventual consistency acceptable?
A: Yes for topology (replicas lag the leader by ~100ms). Leadership is strongly consistent (atomic operations). See DOMAIN_MODEL.md Policies.
Implementation Checklist
- Read DOMAIN_MODEL.md Summary + Invariants
- Read REFACTORING_SUMMARY.md Overview
- Review PATTERNS.md for Phase 1
- Implement Phase 1 commands (JoinCluster, MarkNodeFailed)
- Add tests for Phase 1
- Code review
- Merge Phase 1
- Repeat for Phases 2-4
Document Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-01-12 | Initial domain model created |
Contact & Questions
For questions about this domain model:
- Domain modeling: Refer to DOMAIN_MODEL.md Invariants & Aggregates sections
- Implementation: Refer to PATTERNS.md for code examples
- Architecture: Refer to ARCHITECTURE.md for system design
- Refactoring plan: Refer to REFACTORING_SUMMARY.md for priorities
Additional Resources
- Vision - Product vision for Aether
- Project Structure - How this repository is organized
- Event Sourcing Guide - Event and EventStore interface
- NATS Documentation - NATS pub/sub and JetStream