# Cluster Coordination: Domain Model Index
This directory contains a complete Domain-Driven Design model for the Cluster Coordination bounded context in Aether. Use this index to navigate the documentation.

---
## Quick Start
**Start here if you're new to this analysis:**
1. Read [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) **Summary** section (1-2 min)
2. Skim the **Invariants** section to understand the constraints (2 min)
3. Read [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) **Overview: Code vs Domain Model** (5 min)
4. Choose your next step based on your role (see below)
---
## Documents Overview
### [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Comprehensive DDD Model
**What:** Complete tactical DDD model with aggregates, commands, events, policies, read models
**Contains:**
- Cluster Coordination context summary
- 5 core invariants (single leader, shard coverage, etc.)
- 3 root aggregates: Cluster, LeadershipLease, ShardAssignment
- 6 commands: JoinCluster, ElectLeader, MarkNodeFailed, etc.
- 11 events: NodeJoined, LeaderElected, ShardMigrated, etc.
- 10 policies: Single Leader Policy, Lease Renewal Policy, etc.
- 5 read models: GetClusterTopology, GetLeader, GetShardAssignments, etc.
- 4 value objects: NodeInfo, ShardMap, LeadershipLease, Term
- Code analysis comparing intended vs actual implementation
- 7 refactoring issues with impact assessment
- Testing strategy (unit, integration, chaos tests)
- Boundary conditions and limitations
- Alignment with product vision
**Best for:** Understanding the complete domain model, identifying what needs to change
**Time:** 30-40 minutes for thorough read
---
### [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation Roadmap
**What:** Prioritized refactoring plan with 4-phase implementation strategy
**Contains:**
- Current state vs intended state (what's working, what's broken)
- Gap analysis (6 major gaps identified)
- Priority matrix (High/Medium/Low priority issues)
- 4-phase refactoring plan:
  - Phase 1: Extract cluster commands (Week 1)
  - Phase 2: Publish domain events (Week 2)
  - Phase 3: Implement real rebalancing (Week 3)
  - Phase 4: Unify shard invariants (Week 4)
- Code examples for each phase
- Testing checklist
- Success metrics
- Integration with other contexts
**Best for:** Planning implementation, deciding what to do first, estimating effort
**Time:** 20-30 minutes for full review
---
### [ARCHITECTURE.md](./ARCHITECTURE.md) - Visual Reference & Decision Trees
**What:** Diagrams, flowcharts, and decision trees for understanding cluster behavior
**Contains:**
- High-level architecture diagram
- Aggregate boundaries diagram
- 3 command flow diagrams with decision points
- 3 decision trees (Is node healthy? Should rebalance? Can assign shard?)
- State transition diagrams (cluster, node, leadership)
- Concurrency model and thread safety explanation
- Event sequences with timelines
- Configuration parameters and tuning guide
- Failure scenarios & recovery procedures
- Monitoring & observability metrics
- Alerts and SLOs
**Best for:** Understanding how the system works, debugging issues, planning changes
**Time:** 20-30 minutes; skim decision trees as needed
---
### [PATTERNS.md](./PATTERNS.md) - Code Patterns & Examples
**What:** Side-by-side code comparisons showing how to evolve the implementation
**Contains:**
- 6 refactoring patterns with current vs intended code:
  1. Commands vs Message Handlers
  2. Value Objects vs Primitives
  3. Event Publishing (no events → explicit events)
  4. Invariant Validation (scattered → centralized)
  5. Rebalancing Strategy (stubbed → real implementation)
  6. Testing Aggregates (hard to test → testable with mocks)
- Full code examples for each pattern
- Benefits of each approach
- Mock implementations for testing
**Best for:** Developers writing the refactoring code, understanding specific patterns
**Time:** 30-40 minutes to read all examples
---
## Navigation by Role
### Product Manager / Tech Lead
**Goal:** Understand what needs to change and why
1. Read REFACTORING_SUMMARY.md **Overview** (5 min)
2. Read REFACTORING_SUMMARY.md **Refactoring Priority Matrix** (3 min)
3. Read REFACTORING_SUMMARY.md **Refactoring Plan** - Phase 1 only (5 min)
4. Decide: Which phases to commit to? Which timeline?
**Time:** 15 minutes
---
### Developer (Implementing Refactoring)
**Goal:** Understand how to write the code
1. Skim DOMAIN_MODEL.md **Summary** (2 min)
2. Read DOMAIN_MODEL.md **Invariants** (5 min) - what must never break?
3. Read DOMAIN_MODEL.md **Aggregates** (5 min) - who owns what?
4. Read DOMAIN_MODEL.md **Commands** (5 min) - what actions are there?
5. Read PATTERNS.md sections relevant to your phase (10-20 min)
6. Refer to ARCHITECTURE.md **Decision Trees** as you code (on-demand)
**Time:** 30-50 minutes of reading; then 2-8 hours of coding per phase
---
### Architect / Design Reviewer
**Goal:** Validate the domain model and refactoring plan
1. Read DOMAIN_MODEL.md completely (40 min)
2. Review REFACTORING_SUMMARY.md **Current State** (10 min)
3. Scan ARCHITECTURE.md diagrams (10 min)
4. Review PATTERNS.md for code quality (15 min)
5. Provide feedback on:
   - Are the invariants correct and complete?
   - Are the aggregate boundaries clear?
   - Is the refactoring plan realistic?
   - Are we missing any patterns?
**Time:** 60-90 minutes
---
### QA / Tester
**Goal:** Understand what to test
1. Read DOMAIN_MODEL.md **Testing Strategy** (5 min)
2. Read REFACTORING_SUMMARY.md **Testing Checklist** (5 min)
3. Read ARCHITECTURE.md **Failure Scenarios** (10 min)
4. Read PATTERNS.md **Pattern 6: Testing Aggregates** (15 min)
5. Create test plan covering:
   - Unit tests for commands
   - Integration tests for full scenarios
   - Chaos tests for resilience
**Time:** 40 minutes of planning; then test writing
---
### Operator / DevOps
**Goal:** Understand how to monitor and operate
1. Read ARCHITECTURE.md **Monitoring & Observability** (10 min)
2. Read ARCHITECTURE.md **Configuration & Tuning** (10 min)
3. Read ARCHITECTURE.md **Failure Scenarios** (15 min)
4. Plan:
   - Which metrics to export?
   - Which alerts to set?
   - How to detect issues?
   - How to recover?
**Time:** 35 minutes
---
## Key Concepts
### Invariants
Business rules that must NEVER be violated. The core of the domain model.
- **I1:** At most one leader per term
- **I2:** All active shards have owners
- **I3:** Shards only assigned to healthy nodes
- **I4:** Shard assignments stable during lease
- **I5:** Leader is an active node
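
To make the invariants concrete, here is a minimal sketch of a snapshot check for I2, I3, and I5. It is not the code in DOMAIN_MODEL.md: `ClusterState`, `CheckInvariants`, and the field names are hypothetical, every shard below `shardCount` is assumed to be active, and "healthy" is taken to mean Active or Draining, as in the decision tree later in this index. I1 and I4 depend on terms and leases over time, so a single snapshot cannot check them.

```go
package main

import "fmt"

// NodeStatus mirrors the statuses this index names: Active, Draining, Failed.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// ClusterState is a hypothetical snapshot type used only for this sketch.
type ClusterState struct {
	Nodes    map[string]NodeStatus // node ID -> status
	Shards   map[int][]string      // shard ID -> owning node IDs
	LeaderID string                // empty when nobody holds the lease
}

// CheckInvariants reports the first violation of I2, I3, or I5 in a snapshot.
func CheckInvariants(s ClusterState, shardCount int) error {
	for shard := 0; shard < shardCount; shard++ {
		owners := s.Shards[shard]
		if len(owners) == 0 {
			return fmt.Errorf("I2 violated: shard %d has no owner", shard)
		}
		for _, id := range owners {
			// Assumption: an unknown node counts as unhealthy, like a Failed one.
			if status, ok := s.Nodes[id]; !ok || status == Failed {
				return fmt.Errorf("I3 violated: shard %d is assigned to unhealthy node %q", shard, id)
			}
		}
	}
	if s.LeaderID != "" {
		if status, ok := s.Nodes[s.LeaderID]; !ok || status != Active {
			return fmt.Errorf("I5 violated: leader %q is not an active node", s.LeaderID)
		}
	}
	return nil
}

func main() {
	state := ClusterState{
		Nodes:    map[string]NodeStatus{"node-a": Active, "node-b": Failed},
		Shards:   map[int][]string{0: {"node-a"}, 1: {"node-b"}},
		LeaderID: "node-a",
	}
	// Shard 1 is owned by a failed node, so the check reports an I3 violation.
	fmt.Println(CheckInvariants(state, 2))
}
```
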
### Aggregates
Clusters of entities enforcing invariants. Root aggregates own state changes.
- **Cluster** (root) - owns topology, shard assignments
- **LeadershipLease** (root) - owns leadership
- **ShardAssignment** (root) - owns shard-to-node mappings
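
As an illustration of how a LeadershipLease aggregate could enforce I1, the sketch below grants the lease to one node at a time and increments the term on every handover. It is an in-memory stand-in, not Aether's implementation: `LeaseStore`, `TryAcquire`, and the field names are hypothetical, and the mutex only imitates the atomic operation a real coordination store would have to provide.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// LeadershipLease records who leads, in which term, and until when.
type LeadershipLease struct {
	LeaderID  string
	Term      uint64
	ExpiresAt time.Time
}

// LeaseStore is a hypothetical in-memory store; the mutex stands in for an
// atomic compare-and-set in a shared store.
type LeaseStore struct {
	mu    sync.Mutex
	lease LeadershipLease
}

// TryAcquire grants the lease when it is free or expired, renews it for the
// current holder, and refuses everyone else.
func (s *LeaseStore) TryAcquire(nodeID string, now time.Time, ttl time.Duration) (LeadershipLease, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	held := s.lease.LeaderID != "" && now.Before(s.lease.ExpiresAt)
	switch {
	case held && s.lease.LeaderID == nodeID:
		s.lease.ExpiresAt = now.Add(ttl) // renewal keeps the same term
	case held:
		return s.lease, false // another node leads this term
	default:
		// Free or expired: start a new term so each term has at most one leader.
		s.lease = LeadershipLease{LeaderID: nodeID, Term: s.lease.Term + 1, ExpiresAt: now.Add(ttl)}
	}
	return s.lease, true
}

func main() {
	var store LeaseStore
	now := time.Now()
	ttl := 10 * time.Second

	first, ok := store.TryAcquire("node-a", now, ttl)
	fmt.Println(first.LeaderID, first.Term, ok)

	_, ok = store.TryAcquire("node-b", now.Add(time.Second), ttl)
	fmt.Println("node-b while lease is held:", ok)

	next, ok := store.TryAcquire("node-b", now.Add(time.Minute), ttl)
	fmt.Println(next.LeaderID, next.Term, ok)
}
```
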
### Commands
Explicit intent to change state. Named with domain language.
- JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards
### Events
Facts that happened. Published after successful commands.
- NodeJoined, NodeFailed, LeaderElected, ShardAssigned, ShardMigrated
### Policies
Automated reactions. Connect events to commands.
- "When NodeJoined then RebalanceShards"
- "When LeadershipLost then ElectLeader"
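
The two policies above can be read as a lookup from event to command. The following sketch shows that shape only; the `Event` and `Command` string types, the `policies` table, and `React` are illustrative names, not Aether's API.

```go
package main

import "fmt"

// Event and Command are plain names here; real ones would carry payloads.
type Event string
type Command string

const (
	NodeJoined     Event = "NodeJoined"
	LeadershipLost Event = "LeadershipLost"
	ShardMigrated  Event = "ShardMigrated"
)

const (
	RebalanceShards Command = "RebalanceShards"
	ElectLeader     Command = "ElectLeader"
)

// policies encodes "When <event> then <command>" for the two examples above.
var policies = map[Event]Command{
	NodeJoined:     RebalanceShards,
	LeadershipLost: ElectLeader,
}

// React returns the command a policy issues for the event, if any.
func React(e Event) (Command, bool) {
	cmd, ok := policies[e]
	return cmd, ok
}

func main() {
	for _, e := range []Event{NodeJoined, LeadershipLost, ShardMigrated} {
		if cmd, ok := React(e); ok {
			fmt.Printf("%s -> %s\n", e, cmd)
		} else {
			fmt.Printf("%s -> no policy\n", e)
		}
	}
}
```
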
---
## Glossary
| Term | Definition |
|------|-----------|
| Bounded Context | A boundary within which a domain model is consistent (Cluster Coordination) |
| Aggregate | A cluster of entities enforcing business invariants; transactional boundary |
| Aggregate Root | The only entity in an aggregate that external code references |
| Invariant | A business rule that must always be true |
| Command | A request to change state (intent-driven) |
| Event | A fact that happened in the past (immutable) |
| Policy | An automated reaction to events; connects contexts |
| Read Model | A projection of state optimized for queries (no invariants) |
| Value Object | Immutable object defined by attributes, not identity |
| CQRS | Command Query Responsibility Segregation (commands change state; queries read state) |
| Event Sourcing | Storing events as source of truth; state is derived by replay |
---
## Related Context Maps
**Upstream (External Dependencies):**
- **NATS** - Provides pub/sub, KV store, JetStream
- **Local Runtime** - Executes actors on this node
- **Event Store** - Persists cluster events (optional)
**Downstream (Consumers):**
- **Actor Runtime Context** - Migrates actors when shards move (reacts to ShardMigrated)
- **Monitoring Context** - Tracks health and events (subscribes to topology events)
- **Audit Context** - Records all topology changes (subscribes to all events)
---
## Quick Reference: Decision Trees
### Is a node healthy?
```
Node found? → Check status (Active|Draining|Failed)
→ Active/Draining? → YES
→ Failed? → NO
```
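
In Go, the health check above might look like the sketch below. `IsHealthy` and the map-based node registry are hypothetical, and treating an unknown node as unhealthy is an assumption, since the tree does not say what happens when the node is not found.

```go
package main

import "fmt"

// NodeStatus mirrors the statuses in the tree: Active, Draining, Failed.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// IsHealthy follows the tree: look the node up, then check its status.
func IsHealthy(nodes map[string]NodeStatus, nodeID string) bool {
	status, found := nodes[nodeID]
	if !found {
		return false // assumption: an unknown node is not healthy
	}
	return status == Active || status == Draining
}

func main() {
	nodes := map[string]NodeStatus{"node-a": Active, "node-b": Draining, "node-c": Failed}
	for _, id := range []string{"node-a", "node-b", "node-c", "node-x"} {
		fmt.Printf("%s healthy: %v\n", id, IsHealthy(nodes, id))
	}
}
```
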
### Should we rebalance?
```
Leader? → YES
Active nodes? → YES
Strategy.Rebalance() → returns new ShardMap
Validate invariants? → YES
Publish ShardMigrated events
```
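
The rebalance flow above could be sketched as follows. Only the name `Strategy.Rebalance()` and the idea that it returns a new ShardMap come from this index; the method signature, the single-owner `ShardMap`, the `ShardMigrated` fields, and the toy `roundRobin` strategy are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ShardMap maps a shard ID to one owning node (single owner for brevity).
type ShardMap map[int]string

// Strategy computes a new assignment; this signature is an assumption.
type Strategy interface {
	Rebalance(current ShardMap, activeNodes []string) ShardMap
}

// ShardMigrated is the event emitted for every shard that changes owner.
type ShardMigrated struct {
	Shard    int
	From, To string
}

// roundRobin is a toy strategy; it expects non-negative shard IDs.
type roundRobin struct{}

func (roundRobin) Rebalance(current ShardMap, activeNodes []string) ShardMap {
	next := make(ShardMap, len(current))
	for shard := range current {
		next[shard] = activeNodes[shard%len(activeNodes)]
	}
	return next
}

// RebalanceShards walks the tree: leader check, active-node check, strategy,
// invariant validation, then one event per moved shard.
func RebalanceShards(isLeader bool, activeNodes []string, current ShardMap, s Strategy) ([]ShardMigrated, error) {
	if !isLeader {
		return nil, errors.New("only the leader may rebalance")
	}
	if len(activeNodes) == 0 {
		return nil, errors.New("no active nodes to assign shards to")
	}
	next := s.Rebalance(current, activeNodes)

	active := make(map[string]bool, len(activeNodes))
	for _, id := range activeNodes {
		active[id] = true
	}
	var events []ShardMigrated
	for shard, from := range current {
		to, ok := next[shard]
		if !ok || !active[to] {
			return nil, fmt.Errorf("invariant violated: shard %d has no active owner", shard)
		}
		if to != from {
			events = append(events, ShardMigrated{Shard: shard, From: from, To: to})
		}
	}
	// Map iteration order is random, so sort for a stable event order.
	sort.Slice(events, func(i, j int) bool { return events[i].Shard < events[j].Shard })
	return events, nil
}

func main() {
	current := ShardMap{0: "node-a", 1: "node-a", 2: "node-a", 3: "node-a"}
	events, err := RebalanceShards(true, []string{"node-a", "node-b"}, current, roundRobin{})
	fmt.Println(events, err)
}
```
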
### Can we assign shard to node?
```
Node exists? → YES
Status active? → YES
Replication < max? → YES
Add node to shard's replica list
```
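
A minimal sketch of the assignment check above, returning a distinct error for each failed decision. `AssignShard`, the error values, and the map-based replica list are hypothetical; the tree does not mention duplicate assignments, so this sketch does not guard against them.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus mirrors the statuses this index names: Active, Draining, Failed.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

var (
	ErrUnknownNode     = errors.New("node does not exist")
	ErrNotActive       = errors.New("node is not active")
	ErrReplicationFull = errors.New("shard already has the maximum number of replicas")
)

// AssignShard applies the three checks in order and, if all pass, appends
// the node to the shard's replica list.
func AssignShard(nodes map[string]NodeStatus, replicas map[int][]string, shard int, nodeID string, maxReplicas int) error {
	status, exists := nodes[nodeID]
	if !exists {
		return ErrUnknownNode
	}
	if status != Active {
		return ErrNotActive
	}
	if len(replicas[shard]) >= maxReplicas {
		return ErrReplicationFull
	}
	replicas[shard] = append(replicas[shard], nodeID)
	return nil
}

func main() {
	nodes := map[string]NodeStatus{"node-a": Active, "node-b": Draining}
	replicas := map[int][]string{}

	fmt.Println(AssignShard(nodes, replicas, 7, "node-a", 1)) // accepted
	fmt.Println(AssignShard(nodes, replicas, 7, "node-b", 1)) // rejected: draining
	fmt.Println(replicas)
}
```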
See [ARCHITECTURE.md](./ARCHITECTURE.md) for full decision trees.

---
## Testing Resources
**Test Coverage Map:**
- Unit tests: Commands, invariants, value objects
- Integration tests: Full scenarios (node join, node fail, rebalance)
- Chaos tests: Partitions, cascading failures, high churn
See [PATTERNS.md](./PATTERNS.md) **Pattern 6** for testing patterns and mocks.
---
## Common Questions
**Q: Why not use Raft for leader election?**
A: Lease-based election is simpler and sufficient for our use case. Raft would be safer but more complex. See DOMAIN_MODEL.md **Design Decisions**.

**Q: What if a leader fails mid-rebalance?**
A: The new leader will detect the incomplete rebalance and may redo it. This is acceptable because rebalancing is idempotent. See ARCHITECTURE.md **Failure Scenarios**.

**Q: How many shards should we use?**
A: The default of 1024 provides good granularity. Tune it based on your cluster size. See ARCHITECTURE.md **Configuration & Tuning**.

**Q: Can actors be lost during rebalancing?**
A: Not if the application correctly implements actor migration. See DOMAIN_MODEL.md **Gaps**.

**Q: Is eventual consistency acceptable?**
A: Yes, for topology (replicas lag the leader by ~100ms). Leadership is strongly consistent (atomic operations). See DOMAIN_MODEL.md **Policies**.

---
## Implementation Checklist
- [ ] Read DOMAIN_MODEL.md Summary + Invariants
- [ ] Read REFACTORING_SUMMARY.md Overview
- [ ] Review PATTERNS.md for Phase 1
- [ ] Implement Phase 1 commands (JoinCluster, MarkNodeFailed)
- [ ] Add tests for Phase 1
- [ ] Code review
- [ ] Merge Phase 1
- [ ] Repeat for Phases 2-4
---
## Document Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-01-12 | Initial domain model created |
---
## Contact & Questions
For questions about this domain model:
- **Domain modeling:** Refer to DOMAIN_MODEL.md Invariants & Aggregates sections
- **Implementation:** Refer to PATTERNS.md for code examples
- **Architecture:** Refer to ARCHITECTURE.md for system design
- **Refactoring plan:** Refer to REFACTORING_SUMMARY.md for priorities
---
## Additional Resources
- [Vision](../vision.md) - Product vision for Aether
- [Project Structure](../README.md) - How this repository is organized
- [Event Sourcing Guide](../event.go) - Event and EventStore interface
- [NATS Documentation](https://docs.nats.io) - NATS pub/sub and JetStream